I just watched this video on the value of theory in applied fields (like statistics). It really resonated with my previous research experiences in statistical physics and on the interplay between randomised perfect sampling algorithms and Markov chain mixing, as well as my current perspective on the status quo of deep learning…
Data Science “Paint by the Numbers” with the Hypothesis Development Canvas
When I was a kid, I used to love “Paint by the Numbers” sets. They make anyone who can paint or color between the lines into a Rembrandt or Leonardo da Vinci (we can talk later about the long-term impact of forcing kids to “stay between the lines”).
Data Representation for Natural Language Processing Tasks
We have previously had a long look at a number of introductory natural language processing (NLP) topics, from approaching such tasks, to preprocessing text data, to getting started with a pair of popular Python libraries, and beyond. I was hoping to move on to exploring some different types of NLP tasks, but it was pointed out to me that I had neglected to touch on a hugely important aspect: data representation for natural language processing.
Quick overview on the new Bioconductor 3.8 release
Every six months the Bioconductor project releases its new version of packages. This gives developers a time window to try out new methods and test them rigorously before releasing them to the community at large. It also means that this is an exciting time. With every release there are dozens of new software packages. Bioconductor version 3.8 was just released on Halloween: October 31st, 2018. Thus, this is the perfect time to browse through the package descriptions and find out what’s new that could be of use to your research.
Document worth reading: “Transfer Metric Learning: Algorithms, Applications and Outlooks”
Distance metric learning (DML) aims to find an appropriate way to reveal the underlying data relationship. It is critical in many machine learning, pattern recognition and data mining algorithms, and usually requires a large amount of label information (class labels or pair/triplet constraints) to achieve satisfactory performance. However, the label information may be insufficient in real-world applications due to the high labeling cost, and DML may fail in this case. Transfer metric learning (TML) is able to mitigate this issue for DML in the domain of interest (target domain) by leveraging knowledge/information from other related domains (source domains). Although it has achieved a certain level of development, TML has had limited success in various aspects such as selective transfer, theoretical understanding, and the handling of complex data, big data and extreme cases. In this survey, we present a systematic review of the TML literature. In particular, we group TML into different categories according to different settings and metric transfer strategies, such as direct metric approximation, subspace approximation, distance approximation, and distribution approximation. A summarization and insightful discussion of the various TML approaches and their applications is presented. Finally, we provide some challenges and possible future directions. Transfer Metric Learning: Algorithms, Applications and Outlooks
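To make the central object of the abstract concrete: most DML methods parameterize a Mahalanobis-style distance with a positive semidefinite matrix M that is learned from label or pair/triplet constraints. A minimal sketch, assuming fixed illustrative matrices and points (none of these values come from the paper; a real DML or TML algorithm would fit M from data):

```python
import numpy as np

def mahalanobis_distance(x, y, M):
    """Distance between x and y under the (PSD) metric matrix M:
    d_M(x, y) = sqrt((x - y)^T M (x - y))."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

# With M = identity, this reduces to ordinary Euclidean distance.
M_euclid = np.eye(2)
print(mahalanobis_distance(x, y, M_euclid))      # sqrt(2) ~ 1.414

# A diagonal M that down-weights the second feature shrinks the
# distance along that axis -- this reweighting is what DML learns.
M_weighted = np.diag([1.0, 0.1])
print(mahalanobis_distance(x, y, M_weighted))    # sqrt(1.1) ~ 1.049
```

In the transfer setting the survey describes, the strategies it lists (direct metric, subspace, distance, and distribution approximation) differ in what is carried over from the source domains to help estimate M in the target domain.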
The blocks and rows theory of data shaping
We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the cdata
data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.
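The block/row distinction is essentially the long/wide distinction: a “row record” keeps one record per row, while a “block record” spreads one record over a block of rows. A sketch of the round trip in Python/pandas (not the R cdata API itself; the `id`/`score` columns are hypothetical example data):

```python
import pandas as pd

# "Row records": one record per row (wide form).
row_records = pd.DataFrame({
    "id": [1, 2],
    "test_score": [90, 80],
    "retest_score": [95, 85],
})

# Rows -> blocks: each record becomes a block of rows (long form),
# one row per measurement. This mirrors cdata's rowrecs_to_blocks step.
block_records = (
    row_records
    .melt(id_vars="id", var_name="measurement", value_name="score")
    .sort_values(["id", "measurement"])
    .reset_index(drop=True)
)

# Blocks -> rows: pivot back to one row per record,
# mirroring cdata's blocks_to_rowrecs step.
back = (
    block_records
    .pivot(index="id", columns="measurement", values="score")
    .reset_index()
)
```

The design point of the cdata theory is that both directions are instances of one transform specification, so a round trip like the one above recovers the original row records.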
What’s new on arXiv
Change Surfaces for Expressive Multidimensional Changepoints and Counterfactual Prediction
My two talks in Austria next week, on two of your favorite topics!
Innsbruck, 7 Nov 2018:
Data Notes: Chinese Tourism's Impact on Taiwan
Chinese tourism, US elections, and PyTorch: Enjoy these new, intriguing, and overlooked datasets and kernels
How Data Science Is Improving Higher Education
By Kayla Matthews, Productivity Bytes