Roughly speaking, my machine learning journey began on Kaggle. “There’s data, a model (i.e. estimator) and a loss function to optimize,” I learned. “Regression models predict continuous-valued real numbers; classification models predict ‘red,’ ‘green,’ ‘blue.’ Typically, the former employs the mean squared error or mean absolute error; the latter, the cross-entropy loss. Stochastic gradient descent updates the model’s parameters to drive these losses down.” Furthermore, to fit these models, just import sklearn.
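That recipe, one estimator for continuous targets and one for class labels, can be sketched in a few lines of scikit-learn. The data below is made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])

y_reg = np.array([0.1, 0.9, 2.1, 2.9])  # continuous targets -> regression (squared error)
y_clf = np.array([0, 0, 1, 1])          # class labels -> classification (cross-entropy)

# "Just import sklearn": both models share the same fit/predict interface.
reg = LinearRegression().fit(X, y_reg)
clf = LogisticRegression().fit(X, y_clf)

print(reg.predict([[4.0]]))  # extrapolated continuous value
print(clf.predict([[3.0]]))  # predicted class label
```

The uniform `fit`/`predict` interface is exactly what makes the “just import sklearn” advice work in practice.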
Parallel computation with two lines of code
This is naive advice aimed at absolute beginners, but I’m sure I’ll copy-paste snippets from here over and over again.
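The post’s own snippets aren’t reproduced here, but the title’s promise is roughly what the standard library’s `concurrent.futures` delivers: a loop parallelized in about two lines (a sketch, not the post’s actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    # stand-in for any expensive, independent computation
    return x * x

# the "two lines": create a pool, then map the work across it
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(slow_square, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

For CPU-bound Python work, `ProcessPoolExecutor` has the same interface and sidesteps the GIL.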
Topic Modeling Workshop
I recently had the pleasure of leading a workshop on topic modeling as part of the Master’s program in computational methods and content analysis at Université Paris-Est Marne-la-Vallée.
Python Deep Learning tutorial: Elman RNN implementation in Tensorflow
In this Python deep learning tutorial, an implementation and explanation are given for an Elman RNN. The implementation is done in TensorFlow, one of the many Python deep learning libraries.
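The tutorial’s TensorFlow code isn’t reproduced here, but the Elman recurrence itself, h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b_h), is small enough to sketch in NumPy (the sizes and random weights below are illustrative, not from the tutorial):

```python
import numpy as np

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One Elman recurrence step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
n_in, n_hid, T = 3, 5, 4           # illustrative input size, hidden size, sequence length
W_xh = 0.1 * rng.normal(size=(n_hid, n_in))
W_hh = 0.1 * rng.normal(size=(n_hid, n_hid))
b_h = np.zeros(n_hid)

h = np.zeros(n_hid)                # initial hidden state
states = []
for t in range(T):
    x_t = rng.normal(size=n_in)    # one input vector per time step
    h = elman_step(x_t, h, W_xh, W_hh, b_h)
    states.append(h)
```

The hidden state carries information forward across time steps, which is the whole point of the recurrence.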
Create conda recipe to use C extended Python library on PySpark cluster with Cloudera Data Science Workbench
Cloudera Data Science Workbench provides data scientists with secure access to enterprise data with Python, R, and Scala. In the previous article, we introduced how to use your favorite Python libraries on an Apache Spark cluster with PySpark. Data scientists often want to use Python libraries, such as XGBoost, that include C/C++ extensions. This post shows how to solve this problem by creating a conda recipe with a C extension. The sample repository is here.
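A conda recipe centers on a meta.yaml file. A hypothetical skeleton for a package with a C/C++ extension might look like the following; the package name, version, and dependencies are placeholders, not the post’s actual recipe:

```yaml
package:
  name: mypkg          # placeholder: your library's name
  version: "1.0.0"     # placeholder version

source:
  path: ..             # build from the local source tree

requirements:
  build:
    - "{{ compiler('cxx') }}"  # pulls in the C/C++ toolchain for the extension
    - python
    - setuptools
  run:
    - python
    - numpy

test:
  imports:
    - mypkg            # smoke test: the compiled extension imports cleanly
```

Building with `conda build` then yields a package whose compiled extension can be shipped to every node of the PySpark cluster.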
Normal Distributions
I review — and provide derivations for — some basic properties of Normal distributions. Topics currently covered: (i) Their normalization, (ii) Samples from a univariate Normal, (iii) Multivariate Normal distributions, (iv) Central limit theorem.
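The first item, normalization, reduces to the classic Gaussian integral; in LaTeX form:

```latex
% Gaussian integral underlying the normalization:
\int_{-\infty}^{\infty} e^{-x^{2}/2}\, dx = \sqrt{2\pi}
% hence the univariate Normal density integrates to one:
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^{2}}}
  \exp\!\left( -\frac{(x-\mu)^{2}}{2\sigma^{2}} \right) dx = 1
```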
Voronoi Diagrams
Getting Started with Cloudera Data Science Workbench
Last week, Cloudera announced the General Availability release of Cloudera Data Science Workbench. In this post, I’ll give a brief overview of its capabilities and architecture, along with a quick-start guide to connecting Cloudera Data Science Workbench to your existing CDH cluster in three simple steps.
Transfer Learning for Flight Delay Prediction via Variational Autoencoders
In this work, we explore improving a vanilla regression model with knowledge learned elsewhere. As a motivating example, consider the task of predicting the number of checkins a given user will make at a given location. Our training data consist of checkins from 4 users across 4 locations in the week of May 1st, 2017.
The Benefits of Migrating HPC Workloads To Apache Spark
Recently we worked with a customer that needed to run a very large number of models each day to satisfy internal and government-regulated risk requirements: several thousand model executions per hour, with total execution time a critical concern. In the past the customer used thousands of servers to meet this demand. They need to run many derivations of the model with different economic factors; for example, a financial model may calculate risk to the bank across many runs with varying economic factors. This particular model was planned to consume up to 40K CPU cores once in production. The reason for so many cores is simple: the jobs must complete as quickly as possible so that the business, and sometimes the government, can test the varying economic factors that affect a financial institution. The cycle in which these jobs run is very compressed and allows very little room for error.