What's new on arXiv

To Cluster, or Not to Cluster: An Analysis of Clusterability Methods
Clustering is an essential data mining tool that aims to discover inherent cluster structure in data. For most applications, applying clustering is only appropriate when cluster structure is present. As such, the study of clusterability, which evaluates whether data possesses such structure, is an integral part of cluster analysis. However, methods for evaluating clusterability vary radically, making it challenging to select a suitable measure. In this paper, we perform an extensive comparison of measures of clusterability and provide guidelines that clustering users can reference to select suitable measures for their applications.
Towards ontology based BPMN Implementation
Intro to Data Science for Managers
By ActiveWizards
Creating List with Iterator
In the post (https://statcompute.wordpress.com/2018/11/17/growing-list-vs-growing-queue), it is shown how to grow a list or a list-like queue based upon a dataframe. In that example, the code relied heavily on a FOR loop to do the assignment item by item, which left me wondering about potential alternatives. For instance, is there an implementation that would let us traverse a dataframe without knowing its dimensions in advance, or even without using a loop at all? One base-R possibility is sketched below.
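A minimal base-R sketch of one such alternative (my own illustration, not code from the linked post): Map() walks the columns of a dataframe element-wise, so neither nrow() nor an explicit FOR loop is needed to turn it into a list of row records.

```r
# Toy dataframe standing in for the post's data (hypothetical example).
df <- data.frame(id = 1:3, amount = c(10, 20, 30))

# do.call(Map, ...) hands every column of df to Map(), which then builds
# one named list per row -- no loop counter and no prior knowledge of nrow().
records <- do.call(Map, c(f = list, df))

records[[2]]$amount  # 20
```

Because the columns themselves drive the iteration, the same one-liner works unchanged for any number of rows or columns.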
Interactive Graphics with R Shiny
Well, R is definitely here to stay and has made its way into the data science tool zoo. As a statistician, I often feel alienated surrounded by these animals, but R is still the statistician's tool of choice (yes, it has come of age, but where are the predators …?)
RFishBC CRAN Release
R Packages worth a look
Analysing Accelerometer Data Using Hidden Markov Models (HMMpa)
Analysing time-series accelerometer data to quantify length and intensity of physical activity using hidden Markov models. It also contains the traditi …
High-performance mathematical paradigms in Python
For any data-intensive discipline, one of the major challenges is making scientific and numerical computations faster. Performance-critical applications and data processing pipelines call for well-chosen paradigms and the right set of libraries. After working on science (metabolomics) and finance projects, I decided to compile the tips and tricks I developed and learned along the way.
Beautiful Chaos: The Double Pendulum
This post is dedicated to the beautiful chaos created by double pendulums. I have seen a great variety of animated versions, implemented with different tools, but never in R. Thanks to the amazing package gganimate, it is actually not that hard to produce them in R.
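A rough, self-contained sketch of the idea (my own simplified illustration, not the post's gganimate code): a naive Euler integration of the standard double-pendulum equations of motion, with gganimate's transition_reveal() tracing the path of the second bob frame by frame.

```r
library(ggplot2)
library(gganimate)

# Physical constants and initial state (arbitrary illustrative values).
g <- 9.81; m1 <- 1; m2 <- 1; L1 <- 1; L2 <- 1
th1 <- pi / 2; th2 <- pi / 2; w1 <- 0; w2 <- 0
dt <- 0.01; n <- 2000
out <- data.frame(time = seq_len(n) * dt, x2 = NA_real_, y2 = NA_real_)

# Naive Euler integration of the standard double-pendulum equations of motion.
for (i in seq_len(n)) {
  d   <- th1 - th2
  den <- 2 * m1 + m2 - m2 * cos(2 * th1 - 2 * th2)
  a1  <- (-g * (2 * m1 + m2) * sin(th1) - m2 * g * sin(th1 - 2 * th2) -
            2 * sin(d) * m2 * (w2^2 * L2 + w1^2 * L1 * cos(d))) / (L1 * den)
  a2  <- (2 * sin(d) * (w1^2 * L1 * (m1 + m2) + g * (m1 + m2) * cos(th1) +
            w2^2 * L2 * m2 * cos(d))) / (L2 * den)
  w1  <- w1 + a1 * dt; w2 <- w2 + a2 * dt
  th1 <- th1 + w1 * dt; th2 <- th2 + w2 * dt
  x1  <- L1 * sin(th1); y1 <- -L1 * cos(th1)
  out$x2[i] <- x1 + L2 * sin(th2)   # position of the second bob
  out$y2[i] <- y1 - L2 * cos(th2)
}

# Reveal the trajectory of the second bob frame by frame.
p <- ggplot(out, aes(x2, y2)) +
  geom_path(colour = "steelblue") +
  coord_fixed() +
  theme_minimal() +
  transition_reveal(time)

animate(p, nframes = 200, fps = 20)
```

A finer time step or a proper ODE solver (e.g. deSolve) would give a more faithful trajectory; the point here is only how little gganimate code the animation itself needs.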
Cartoon: Thanksgiving, Big Data, and Turkey Data Science.
A Turkey Data Scientist: “I don’t like the look of this.”