Some time ago, in June 2013, I gave a lab tutorial on Monte Carlo methods at Microsoft Research. These tutorials are seminar-talk length (45 minutes) but are supposed to be light, accessible to a general computer science audience, and fun.
Clustering debates from UK politicians
What kind of language do British parliamentarians use? We scraped, parsed and vectorised a sample of recent debates from the House of Commons. We then applied a k-means clustering algorithm to these vectors and created a word cloud for each cluster.
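To make the pipeline concrete, here is a minimal R sketch of the clustering step, under the assumption that the scraped debates are already available as a character vector called debate_texts (a hypothetical name). It builds a tf-idf document-term matrix with the tm package, runs k-means, and draws one word cloud per cluster with the wordcloud package; the number of clusters is picked arbitrarily.

library(tm)
library(wordcloud)

# debate_texts: one string per debate (assumed to exist already)
corpus <- Corpus(VectorSource(debate_texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))

# tf-idf weighted document-term matrix
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
m   <- as.matrix(dtm)

# cluster the debate vectors (5 clusters chosen arbitrarily here)
set.seed(1)
km <- kmeans(m, centers = 5)

# one word cloud per cluster, built from the summed term weights
for (k in 1:5) {
  weights <- colSums(m[km$cluster == k, , drop = FALSE])
  wordcloud(names(weights), weights, max.words = 50)
}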
Generating Fibonacci Numbers
This article is about algorithms that can be used to generate Fibonacci numbers.
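As a taste of the kind of methods such an article typically covers, here is a small R sketch of two standard approaches, simple iteration and Binet's closed-form formula. The function names are mine, not the article's.

# iterative approach: O(n) additions
fib_iter <- function(n) {
  if (n <= 1) return(n)
  a <- 0; b <- 1
  for (i in 2:n) {
    s <- a + b
    a <- b
    b <- s
  }
  b
}

# Binet's closed form: exact for small n, but subject to
# floating-point error as n grows
fib_binet <- function(n) {
  phi <- (1 + sqrt(5)) / 2
  round((phi^n - (-phi)^(-n)) / sqrt(5))
}

sapply(0:10, fib_iter)   # 0 1 1 2 3 5 8 13 21 34 55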
Emoticons decoder for social media sentiment analysis in R
Jessica Peterka-Bonetta (jessica@today-is-a-good-day.de)
If you have ever retrieved data from Twitter, Facebook or Instagram with R, you might have noticed a strange phenomenon: while R displays some emoticons properly, it mangles many others, making any further analysis impossible unless you get rid of them. With a little hack, I decoded these emoticons and put them all in a dictionary for further use. I’ll explain how I did it and share the decoder with you.
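As an illustration of what such a decoder ends up doing (a minimal sketch of my own, not the dictionary shared in the post), garbled emoticons can be matched byte-for-byte against a lookup table of known UTF-8 sequences. The names emoji_dict and decode_emoticons below are hypothetical.

# Hypothetical stand-in for the decoder dictionary: raw UTF-8 byte
# sequences for two emoji and their plain-text descriptions
emoji_dict <- data.frame(
  bytes       = c("\xf0\x9f\x98\x80", "\xf0\x9f\x98\x82"),
  description = c("grinning face", "face with tears of joy"),
  stringsAsFactors = FALSE
)

# Return the descriptions of all dictionary emoticons found in a text
decode_emoticons <- function(text, dict) {
  hits <- vapply(dict$bytes,
                 function(b) grepl(b, text, fixed = TRUE, useBytes = TRUE),
                 logical(1))
  dict$description[hits]
}

tweet <- "Great results today \xf0\x9f\x98\x80"
decode_emoticons(tweet, emoji_dict)   # "grinning face"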
7 tools in every data scientist’s toolbox
There is a huge number of machine learning methods, statistical tools and data mining techniques available for any given data-related task, from self-organizing maps to Q-learning, from streaming graph algorithms to gradient boosted trees. Many of these methods, while powerful in specific domains and problem setups, are arcane and understood, let alone used, by only a few.
Denoising Dirty Documents: Part 9
Now that Kaggle’s Denoising Dirty Documents Competition has closed, it’s time to start posting the secrets to getting a very good score in this competition. In this blog, I describe how to take advantage of the first of two information leakages that I used.
Deep Learning Startups, Applications and Acquisitions – A Summary
Most major tech companies use Deep Learning techniques in one way or another, and many have new initiatives on the way. Self-driving cars use Deep Learning to model their environment. Siri, Cortana and Google Now use it for speech recognition, Facebook for facial recognition, and Skype for real-time translation.
LOCF and Linear Imputation with PostgreSQL
This tutorial will introduce various tools offered by PostgreSQL, and by SQL in general (custom functions, window functions, aggregate functions, and the WITH clause, also known as a CTE or Common Table Expression), for implementing a program that imputes numeric observations within a column, applying linear interpolation where possible and forward and backward padding where necessary. I’m going to add and explain those constructs progressively, step by step, so it's no problem if you are new to the scene. I am very much interested in input regarding potential downsides of the implementation and possible improvements.
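The post itself builds this in PostgreSQL; purely to pin down the imputation rule being described (linear interpolation for interior gaps, forward and backward padding at the edges), here is the same behaviour sketched in R with the zoo package, on a made-up vector.

library(zoo)

x <- c(NA, NA, 2, NA, NA, 8, NA)

# interior gaps: linear interpolation between the surrounding observations
filled <- na.approx(x, na.rm = FALSE)        # NA NA 2 4 6 8 NA

# trailing gap: forward padding (last observation carried forward)
filled <- na.locf(filled, na.rm = FALSE)     # NA NA 2 4 6 8 8

# leading gap: backward padding (next observation carried backward)
filled <- na.locf(filled, fromLast = TRUE)   # 2 2 2 4 6 8 8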
How-to: Build a Machine-Learning App Using Sparkling Water and Apache Spark
Thanks to Michal Malohlava, Amy Wang, and Avni Wadhwa of H2O.ai for providing the following guest post about building machine-learning apps using Sparkling Water and Apache Spark on CDH.
On the consistency of ordinal regression methods
Fabian Pedregosa
My latest work (with Francis Bach and Alexandre Gramfort) is on the consistency of ordinal regression methods. It has the wildly imaginative title of “On the Consistency of Ordinal Regression Methods” and is currently under review, but you can read the draft on arXiv. If you have any thoughts about it, please leave me a comment!