Distilled News

Contingency Tables in R

In this tutorial, you’ll learn how to create contingency tables and how to test and quantify relationships visible in them.
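
The tutorial itself is in R; as a rough, hypothetical analogue in Python (assuming pandas and scipy are available), building a contingency table and testing the relationship it shows might look like this:

```python
# Hypothetical example: cross-tabulate two categorical variables and test independence.
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up survey data; any two categorical columns would do.
df = pd.DataFrame({
    "smoker":   ["yes", "no", "no", "yes", "no", "yes", "no", "no"],
    "exercise": ["low", "high", "high", "low", "low", "low", "high", "high"],
})

table = pd.crosstab(df["smoker"], df["exercise"])  # the contingency table
chi2, p, dof, expected = chi2_contingency(table)   # chi-square test of independence
print(table)
print(f"chi2={chi2:.3f}, p={p:.3f}")
```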

Implementing ResNet with MXNET Gluon and Comet.ml for image classification

In this tutorial, we will illustrate how to build an image recognition model using a convolutional neural network (CNN) implemented in MXNet Gluon, and integrate Comet.ml for experiment tracking and monitoring. We will be using the MXNet ResNet model architecture and training that model on the CIFAR-10 dataset for our image classification use case.
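
A minimal sketch of what such a setup can look like (not the article's actual code; the API key, hyperparameters and single training epoch are placeholders):

```python
# Sketch: a ResNet from the Gluon model zoo on CIFAR-10, with a Comet.ml experiment
# logging metrics. Import comet_ml before the deep learning framework.
from comet_ml import Experiment
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.model_zoo import vision

experiment = Experiment(api_key="YOUR_API_KEY", project_name="cifar10-resnet")

ctx = mx.cpu()
net = vision.resnet18_v1(classes=10)          # 10 CIFAR-10 classes
net.initialize(mx.init.Xavier(), ctx=ctx)

transform = gluon.data.vision.transforms.ToTensor()
train_data = gluon.data.DataLoader(
    gluon.data.vision.CIFAR10(train=True).transform_first(transform),
    batch_size=128, shuffle=True)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})

for epoch in range(1):                        # one epoch just to show the loop shape
    for data, label in train_data:
        data, label = data.as_in_context(ctx), label.as_in_context(ctx)
        with autograd.record():
            loss = loss_fn(net(data), label)
        loss.backward()
        trainer.step(data.shape[0])
    experiment.log_metric("train_loss", loss.mean().asscalar(), step=epoch)
```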

Learning R: A gentle introduction to higher-order functions

Have you ever thought about why the definition of a function in R is different from that in many other programming languages? The part that causes the biggest difficulties (especially for R beginners) is that you state the name of the function at the beginning and use the assignment operator – as if functions were like any other data type, like vectors, matrices or data frames…

Pdftools 2.0: powerful pdf text extraction tools

A pdf document may seem to contain paragraphs or tables in a viewer, but this is not actually true. PDF is a printing format: a page consists of a series of unrelated lines, bitmaps, and textboxes with a given size, position and content. Hence a table in a pdf file is really just a large unordered set of lines and words that are nicely visually positioned. This makes sense for printing, but makes extracting text or data from a pdf file extremely difficult. Because the pdf format has little semantic structure, the pdf_text() function in pdftools has to render the PDF to a text canvas, in order to create the sentences or paragraphs. It does so pretty well, but some users have asked for something more low level.

Reusable Pipelines in R

Pipelines in R are popular, the most popular one being magrittr as used by dplyr. This note will discuss the advanced re-usable piping systems: rquery/rqdatatable operator trees and wrapr function object pipelines. In each case we have a set of objects designed to extract extra power from the wrapr dot-arrow pipe %.>%.

Processing Time Series Data in Real-Time with InfluxDB and Structured Streaming

In the data world, one of the things people most often want to see is how a metric progresses over time. This makes managing and handling time series data (data whose values depend on time) a very important aspect of a Data Scientist’s life.

Introduction to Interactive Time Series Visualizations with Plotly in Python

In this article, we’ll get an introduction to the plotly library by walking through making basic time series visualizations. These graphs, though easy to make, will be fully interactive figures ready for presentation. Along the way, we’ll learn the basic ideas of the library which will later allow us to rapidly build stunning visualizations. If you have been looking for an alternative to matplotlib, then as we’ll see, plotly is an effective choice.
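
As a taste of how little code a basic interactive time series takes, here is a minimal sketch with made-up data (plotly and pandas assumed installed; the article's own figures may differ):

```python
# A minimal interactive plotly time series with synthetic data.
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import plot

dates = pd.date_range("2019-01-01", periods=100, freq="D")
values = pd.Series(range(100), index=dates).rolling(7, min_periods=1).mean()

trace = go.Scatter(x=dates, y=values, mode="lines", name="7-day rolling mean")
layout = go.Layout(title="Example time series",
                   xaxis=dict(title="Date"), yaxis=dict(title="Value"))
fig = go.Figure(data=[trace], layout=layout)
plot(fig, filename="timeseries.html")   # writes and opens an interactive HTML figure
```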

A Comprehensive Guide to Convolutional Neural Networks – the ELI5 way

Artificial Intelligence has been witnessing monumental growth in bridging the gap between the capabilities of humans and machines. Researchers and enthusiasts alike work on numerous aspects of the field to make amazing things happen. One such area is the domain of Computer Vision. The agenda for this field is to enable machines to view the world as humans do, perceive it in a similar manner and even use that knowledge for a multitude of tasks such as recognition, classification, recreation, etc. The advancements in Computer Vision with Deep Learning have been constructed and perfected over time, primarily around one particular algorithm: the Convolutional Neural Network.

The complete guide for topics extraction with LDA (Latent Dirichlet Allocation) in Python

A recurring subject in NLP is understanding a large corpus of text through topic extraction. Whether you analyze users’ online reviews, product descriptions, or text entered in search bars, understanding key topics will always come in handy.
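
For orientation, a compact sketch of topic extraction with LDA in scikit-learn, using a few placeholder documents (the article may use a different library or preprocessing):

```python
# Sketch: fit LDA on a tiny toy corpus and print the top words per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the battery life of this phone is great",
    "terrible battery, the phone dies quickly",
    "the camera takes beautiful photos",
    "photos look sharp, great camera and screen",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                     # document-term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names()                 # get_feature_names_out() on newer sklearn
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]  # top 5 words per topic
    print(f"Topic {i}: {', '.join(top)}")
```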

Exploratory Data Analysis, Feature Engineering and Modelling using Supermarket Sales Data. Part 1.

In this series of posts, we’re going to dive deep and fully explore the amazing world of data exploration, feature engineering and modelling. If you are a beginner in machine learning and data science and need a practical and intuitive explanation of these concepts, then this series is for you.

Hands-on Machine Learning Model Interpretation

Interpreting machine learning models is no longer a luxury but a necessity, given the rapid adoption of AI in industry. This article is a continuation of my series of articles on ‘Explainable Artificial Intelligence (XAI)’. The idea here is to cut through the hype and equip you with the tools and techniques needed to start interpreting any black-box machine learning model. The previous articles in the series follow, in case you want to give them a quick skim (they are not mandatory for this article).
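
As one illustration of a model-agnostic interpretation technique (not necessarily the ones the article covers), permutation importance asks how much a metric degrades when a feature's values are shuffled:

```python
# Illustrative only: permutation importance as one way to probe a "black box" model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {idx}: mean importance {result.importances_mean[idx]:.4f}")
```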

Preprocessing with sklearn: a complete and comprehensive guide

For aspiring data scientists it can be difficult to find their way through the forest of preprocessing techniques. Sklearn’s preprocessing library forms a solid foundation to guide you through this important task in the data science pipeline. Although sklearn has pretty solid documentation, it often lacks a streamlined flow and the intuition connecting different concepts.
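
A compact, hypothetical sketch of how several sklearn preprocessing steps fit together in one pipeline (column names and data are made up):

```python
# Sketch: impute and scale numeric columns, one-hot encode categoricals, in one transformer.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age":    [25, 32, None, 51],
    "income": [40000, 52000, 61000, None],
    "city":   ["NY", "SF", "NY", "LA"],
})

numeric = ["age", "income"]
categorical = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X = preprocess.fit_transform(df)
print(X.shape)
```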

Getting Data ready for modelling: Feature engineering, Feature Selection, Dimension Reduction (Part two)

This is part two of the series on getting data ready for modelling. If you have not read Part 1, I suggest you go through it first, as feature engineering is generally the first step.
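
As a rough illustration of the kind of steps this series discusses (not the post's own code), feature selection and dimension reduction chain naturally in scikit-learn:

```python
# Sketch: univariate feature selection followed by PCA, on a stand-in dataset.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=8)),  # keep 8 most informative features
    ("reduce", PCA(n_components=3)),                     # project down to 3 components
])

X_reduced = pipe.fit_transform(X, y)
print(X_reduced.shape)
```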

Artificial Intelligence Framework: A Visual Introduction to Machine Learning and AI

The transformative nature of Artificial Intelligence in business and our society is evident. Like the internet and the smartphone, AI is an enabler technology that will have a far-reaching impact on all areas of our lives.

The Data Science Workflow

Suppose you are starting a new data science project (which could either be a short analysis of one dataset, or a complex multi-year collaboration). How should you organize your workflow? Where do you put your data and code? What tools do you use and why? In general, what should you think about before diving head first into your data? In the software engineering industry such questions have some commonly known answers. Although every software company might have its unique traits and quirks, the core processes in most of them are based on the same established principles, practices and tools. These principles are described in textbooks and taught in universities. Data science is a less mature industry, and things are different. Although you can find a variety of template projects, articles, blog posts, discussions, or specialized platforms (open-source [1,2,3,4,5,6,7,8,9,10], commercial [11,12,13,14,15,16,17] and in-house [18,19,20]) to help you organize various parts of your workflow, there is no textbook yet to provide universally accepted answers. Every data scientist eventually develops their personal preferences, mostly learned from experience and mistakes. I am no exception. Over time I have developed my understanding of what a typical ‘data science project’ is, how it should be structured, what tools to use, and what should be taken into account. I would like to share my vision in this post.

Scientific Data Analysis Pipelines and Reproducibility

Pipelines are computational tools of convenience. Data analysis usually requires data acquisition, quality checks, clean-up, exploratory analysis and hypothesis-driven analysis. Pipelines can automate these steps. They process raw data into a suitable format and analyze it with statistical tools or machine learning models in a streamlined way. In practical terms, a data analysis pipeline executes a chain of command-line tools and custom scripts. This usually produces processed data sets and a human-readable report covering topics such as data quality, exploratory analysis, etc.
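
A toy sketch of the idea, with hypothetical file names (real pipelines typically use a dedicated workflow manager, but the shape is the same): each stage consumes the previous stage's output, and command-line tools can sit alongside custom scripts:

```python
# Toy pipeline: acquire -> quality check / clean up -> analyze -> human-readable report.
import subprocess
import pandas as pd

def acquire(path="raw_data.csv"):          # hypothetical input file
    return pd.read_csv(path)

def quality_check(df):
    assert not df.empty, "no rows acquired"
    return df.dropna()                     # crude clean-up step

def analyze(df, report_path="report.txt"):
    with open(report_path, "w") as f:
        f.write(df.describe().to_string()) # human-readable summary report
    return report_path

if __name__ == "__main__":
    report = analyze(quality_check(acquire()))
    # a command-line tool can be one more link in the chain
    subprocess.run(["wc", "-l", report], check=True)
```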

An introduction to Kubeflow

Kubeflow is an open source Kubernetes-native platform for developing, orchestrating, deploying, and running scalable and portable machine learning workloads.

AI Transformation Playbook: How to lead your company into the AI era

AI (Artificial Intelligence) technology is now poised to transform every industry, just as electricity did 100 years ago. Between now and 2030, it will create an estimated $13 trillion of GDP growth. While it has already created tremendous value in leading technology companies such as Google, Baidu, Microsoft and Facebook, much of the additional waves of value creation will go beyond the software sector. This AI Transformation Playbook draws on insights gleaned from leading the Google Brain team and the Baidu AI Group, which played leading roles in transforming both Google and Baidu into great AI companies. It is possible for any enterprise to follow this Playbook and become a strong AI company, though these recommendations are tailored primarily for larger enterprises with a market cap/valuation from $500M to $500B.

The Story of a Bad Train-Test Split

From our experience it’s hard to incorporate multiple types of features into a unified model. So we decided to take baby steps, and add the thumbnail to a model that uses only one feature – the title. There’s one thing you need to take into account when working with these two features, and that’s data leakage. When working with the title only, you can naively split your dataset into train and test randomly – after removing items with the same title. However, you can’t apply a random split when you work with both the title and the thumbnail. That’s because many items share the same thumbnail or title. Stock photos are a good example of thumbnails shared across different items. Thus, a model that memorizes titles/thumbnails it encountered in the training set might perform well on the test set while not doing a good job at generalization. The solution? We should split the dataset so that each thumbnail appears either in train or in test, but not both. Same goes for the title.
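
One way to implement that split in scikit-learn (a sketch with made-up data, not necessarily the authors' code) is to treat the thumbnail as a group and split by group:

```python
# Sketch: group-aware split so no thumbnail appears in both train and test.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "title":     ["a", "b", "c", "d", "e", "f"],
    "thumbnail": ["t1", "t1", "t2", "t2", "t3", "t3"],
    "label":     [0, 1, 0, 1, 0, 1],
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["thumbnail"]))

train, test = df.iloc[train_idx], df.iloc[test_idx]
assert set(train["thumbnail"]).isdisjoint(test["thumbnail"])  # no leakage across the split
```

The same construction, with the title as the group key, covers the title case.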

What’s the fuss about Regularization?

As newbies to machine learning, most people get excited when their training error starts to fall. They push harder, it falls even further, and their excitement knows no bounds. They show their results to master Oogway (the elderly wise tortoise in Kung Fu Panda), and he calmly says that it is not a good model: you need to regularize it and check its performance on a validation set. If you would like to understand what ‘Regularization’ is and how it helps, read on.
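
A small sketch of what master Oogway is pointing at: on a synthetic dataset, an L2 (ridge) penalty typically trades a little training accuracy for noticeably better validation performance (illustrative only):

```python
# Compare an unregularized high-degree polynomial fit with a ridge-regularized one.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.3, size=40)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for name, reg in [("no regularization", LinearRegression()),
                  ("ridge, alpha=0.1", Ridge(alpha=0.1))]:
    model = make_pipeline(PolynomialFeatures(degree=15), reg).fit(X_train, y_train)
    print(name,
          "| train R^2:", round(model.score(X_train, y_train), 3),
          "| val R^2:", round(model.score(X_val, y_val), 3))
```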

Using Markov Chain Monte Carlo method for project estimation

One type of criticism I received for the previous work on project estimation is that the log-Normal distribution has short tails. And this is true, despite all the benefits of the log-Normal distribution. The reason is very simple: when fitting the data to the distribution shape we pick the most likely parameters μ and σ. This approach, however easy it is, always results in short tails, especially for the small amount of data we possess. Indeed, the parameters of the log-normal distribution can differ from the most likely parameters we obtained from five data points. The appropriate way would be to get the joint distribution of the predictions and parameters and then marginalize over the parameters. In the case of the Normal distribution we would get Student’s t-distribution with nice long tails. For the log-normal distribution we would get a more complex distribution, but also with long tails.
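
A hedged sketch of that idea using PyMC3 (the post may use a different tool): priors on μ and σ, MCMC over the joint posterior, and posterior-predictive draws that marginalize over the parameters. The five durations are placeholders.

```python
# Sketch: Bayesian log-normal model; predictive draws marginalize over mu and sigma,
# giving heavier tails than a single log-normal fitted with point estimates.
import numpy as np
import pymc3 as pm

durations = np.array([3.0, 5.0, 8.0, 4.0, 12.0])   # e.g. days per past task (placeholders)

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=2.0)
    sigma = pm.HalfNormal("sigma", sigma=2.0)
    pm.Lognormal("obs", mu=mu, sigma=sigma, observed=durations)

    trace = pm.sample(2000, tune=1000, random_seed=0)
    post_pred = pm.sample_posterior_predictive(trace, random_seed=0)

print(np.percentile(post_pred["obs"], [50, 90, 99]))  # median and tail quantiles
```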

Time-series Forecasting Flow

A brief introduction on critical steps in demand forecasting.
