Distilled News

Real-time Anomaly Detection on Streaming Data

In this paper we present the Random Cut Forest algorithm, which detects anomalies in real-time streaming data. We have implemented this algorithm as a built-in SQL function in Amazon Kinesis Data Analytics, a fully managed AWS service that makes it easy to analyze streaming data with SQL in real time.
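
The RANDOM_CUT_FOREST function is built into Kinesis Data Analytics SQL; outside that service, a quick way to experiment with the same idea is a tree-based detector scored over sliding windows. Below is a minimal Python sketch using scikit-learn's IsolationForest as a stand-in for Random Cut Forest (a related, but not identical, algorithm); the synthetic stream and window size are illustrative.

```python
# A minimal sketch of window-based anomaly scoring on a stream.
# IsolationForest is a related tree-ensemble detector, not the
# RANDOM_CUT_FOREST SQL function the article describes.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
stream = rng.normal(0, 1, size=(1000, 1))
stream[500] = 8.0  # injected anomaly

window = 200
model = IsolationForest(n_estimators=100, random_state=0)
for start in range(0, len(stream) - window, window):
    batch = stream[start:start + window]
    model.fit(batch)
    # score_samples: lower values indicate more anomalous points
    scores = model.score_samples(batch)
    worst = np.argmin(scores)
    print(f"window {start}-{start + window}: most anomalous index "
          f"{start + worst}, score {scores[worst]:.3f}")
```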

Deploying Machine Learning Models is Hard, But It Doesn’t Have to Be

With free, open source tools like Anaconda Distribution, it has never been easier for individual data scientists to analyze data and build machine learning models on their laptops. So why does deriving actual business value from machine learning remain elusive for many organizations? Because while it's easy for data scientists to build powerful models on their laptops with tools like conda and TensorFlow, business value comes from deploying machine learning models into production. Only in production can a deployed model actually serve the business. And unfortunately, the path to production remains difficult for many companies.

Soft Skills Every Data Scientist Must Possess

5 Soft Skills for a Data Scientist:
1. Understanding the Business
2. Translating the Tech Language
3. Syncing the Business with Technology
4. Giving Data Analytics a Perspective
5. Curiosity for Discovery

Solving Some Image Processing Problems with Python libraries – Part 1

This article discusses a few popular image processing problems along with their solutions, using Python image processing libraries.
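
The article's exact problems aren't listed in this summary, so purely as a taste of the kind of thing it covers, here is a hedged sketch of two common operations (edge detection and Otsu thresholding) using Pillow and scikit-image; the file name input.jpg is a placeholder.

```python
# Generic image processing steps with Pillow and scikit-image; these are
# common operations, not necessarily the exact problems the article covers.
import numpy as np
from PIL import Image
from skimage import filters

img = Image.open("input.jpg")                  # load an image from disk
gray = np.asarray(img.convert("L")) / 255.0    # grayscale, scaled to [0, 1]

edges = filters.sobel(gray)                    # Sobel edge magnitude
threshold = filters.threshold_otsu(gray)       # Otsu's global threshold
binary = gray > threshold                      # binarized image

Image.fromarray((edges / edges.max() * 255).astype(np.uint8)).save("edges.png")
Image.fromarray(binary.astype(np.uint8) * 255).save("binary.png")
```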

How to set up a Data Science function for your organisation

This article will provide a high-level understanding of effective ways to set up a data science function in three types of organisations: (1) Startups, (2) Medium-Sized Organisations, and (3) Large Organisations.

Getting Started with NLP: Simple Topic Modeling in R (Part 1)

In this article, I will walk through simple code to build an accurate topic classification model. Beginner data analysts, data analysts with no experience in NLP, and other data scientists who are curious to see other ways of approaching topic modeling will find this interesting.
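
The walkthrough itself is in R; for readers who prefer Python, a rough analogue of simple topic modeling with scikit-learn's LatentDirichletAllocation might look like the sketch below (the toy corpus and number of topics are made up for illustration).

```python
# A rough Python analogue of simple topic modeling, not the article's R code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the football match last night",
    "the election results surprised the ruling party",
    "the striker scored two goals in the final",
    "parliament will vote on the new tax bill",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)            # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-5:][::-1]            # five highest-weight terms
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```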

Structured Streaming Programming Guide

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.
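
As a concrete illustration of expressing a streaming computation the same way as a batch one, here is the familiar word-count example in PySpark; the socket source on localhost:9999 is just a placeholder input.

```python
# A minimal PySpark Structured Streaming sketch: word counts over a
# socket source, written to the console.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# Treat lines arriving on a socket as an unbounded table
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()    # incrementally updated aggregation

query = (counts.writeStream
         .outputMode("complete")          # emit the full updated result table
         .format("console")
         .start())
query.awaitTermination()
```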

Introduction to Machine Learning — evaluating chemical composition of wine

We will walk through an example that involves training a model to tell whether a wine is ‘good’ or ‘bad’ based on a training set of wine chemical characteristics. First, we’re going to import the packages that we’ll be using throughout this notebook. Then we’ll bring in the CSV from my desktop. You can get the raw data from UCI’s ML Database.
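
The notebook's exact code isn't reproduced here, but a sketch of that kind of workflow might look as follows; the local file name, the quality >= 7 cutoff for "good" wine, and the choice of a random forest are assumptions rather than the article's exact setup.

```python
# A hedged sketch of training a binary "good/bad" wine classifier on the
# UCI wine quality data; the file path, cutoff, and model are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("winequality-red.csv", sep=";")   # UCI wine quality CSV
df["good"] = (df["quality"] >= 7).astype(int)      # binarize the target

X = df.drop(columns=["quality", "good"])
y = df["good"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```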

The next generation of AI assistants in enterprise

Chatbots are the first step toward autonomous organizations: companies whose operations are largely run by many different AI assistants. Analogous to autonomous cars, there are five levels of sophistication for AI assistants. Currently, basic level two AI assistants are mainstream, and Google just showed the world what a level three assistant looks like. Achieving true level five capabilities for AI assistants will result in a significant shift for society, with many implications for businesses and their customers.
Level 1: Notification Assistants
Level 2: FAQ Assistants
Level 3: Contextual Assistants
Level 4: Personalized Assistants
Level 5: Autonomous Organization of Assistants

Introduction to Fraud Detection Systems

Fraud detection is one of the top priorities for banks and financial institutions, and it can be addressed using machine learning. According to a report published by Nilson, worldwide losses from card fraud reached 22.8 billion dollars in 2017. The problem is forecast to get worse in the following years; by 2021, the card fraud bill is expected to reach 32.96 billion dollars. In this tutorial, we will use the credit card fraud detection dataset from Kaggle to identify fraud cases. We will use a gradient boosted tree as the machine learning algorithm, and finally we will create a simple API to operationalize (o16n) the model. We will use the gradient boosting library LightGBM, which has recently become one of the most popular libraries among top participants in Kaggle competitions. Fraud detection problems are known for being extremely imbalanced, and boosting is one technique that usually works well with this kind of dataset. It iteratively creates weak classifiers (decision trees), weighting the instances to increase performance: a weak classifier is trained and tested on the training data, and the instances it handles poorly are weighted to appear more often in the next data subset. Finally, all the classifiers are ensembled with a weighted average of their estimates.
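
As a rough sketch of the modeling step (not the tutorial's exact code), a LightGBM classifier with class reweighting for the imbalance might look like this; the file name follows the Kaggle credit card dataset, and the hyperparameters are illustrative.

```python
# A minimal LightGBM sketch for an imbalanced fraud dataset; the file name
# and "Class" column follow the Kaggle credit card data, and the
# hyperparameters are illustrative rather than the article's exact setup.
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("creditcard.csv")
X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    is_unbalance=True,      # reweight classes for the heavy imbalance
    random_state=0,
)
clf.fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```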

Auto-Keras, or How You can Create a Deep Learning Model in 4 Lines of Code

Automated machine learning is the new kid in town, and it's here to stay. It is helping us create better and better models with easy-to-use, great APIs. Here I'll talk to you about Auto-Keras, the new package for AutoML with Keras. There's a surprise at the end ;).
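
The Auto-Keras API has changed considerably between the early 0.x release the article likely used and the later 1.x versions; a minimal sketch in the newer style, on MNIST, might look like this.

```python
# A minimal Auto-Keras sketch in the 1.x-style API; max_trials and epochs
# are kept tiny here just to make the example cheap to run.
import autokeras as ak
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

clf = ak.ImageClassifier(max_trials=1)   # search over a single candidate model
clf.fit(x_train, y_train, epochs=2)      # train the best architecture found
print(clf.evaluate(x_test, y_test))      # [loss, accuracy] on the test set
```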

Reproducible research and a repository of artifacts, an RFC

This work is still in progress. I think, however, that it can already resonate with some people in the community. The conversation I am hoping for should lead to a better design and maybe to getting valuable tools faster. The main goal is to extend base R's history mechanism (see ?history), which currently gives access to past commands run in R. What if, however, we could browse not only the commands but also the objects (artifacts)? Hence, the repository of artifacts. It is implemented by a number of packages. The two most important are: repository, which provides the basic logic of storing, processing and retrieving artifacts; and ui, which implements a basic, text-only user interface and hooks callbacks into R. The other packages are: storage, defer and utilities. Here are the basic rules of how the repository of artifacts works: the state of the R session after each command is examined, and all R objects and plots are recorded, together with information about their origin (parent objects). Thus, the complete graph of origin of each artifact can be retrieved from the repository: the complete sequence of R commands and their byproduct artifacts. Further explanation can be found in the current motivation and plan for future work, and examples of working with the repository are presented in this tutorial.

How to Create Sankey Diagrams From Tables (Data Frames) Using R

In this post I show how you can use R to create a Sankey Diagram when your data is set up as a table (data frame).
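
The post itself uses R; as a Python analogue, plotly can draw a Sankey diagram from the same kind of source/target/value table (the labels and flows below are made up for illustration).

```python
# A Python analogue of building a Sankey diagram from a table of flows;
# the labels and values are invented, not from the post.
import plotly.graph_objects as go

labels = ["Budget", "Marketing", "Engineering", "Ads", "Events"]
fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[0, 0, 1, 1],   # indices into `labels`
        target=[1, 2, 3, 4],
        value=[40, 60, 25, 15],
    ),
))
fig.write_html("sankey.html")   # or fig.show() in a notebook
```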

Bayesian models for weighted data with missing values: a bootstrap approach

Many data sets, especially from surveys, are made available to users with weights. Where the derivation of such weights is known, this information can often be incorporated in the user’s substantive model (model of interest). When the derivation is unknown, the established procedure is to carry out a weighted analysis. However, with non-trivial proportions of missing data this is inefficient and may be biased when data are not missing at random. Bayesian methods provide a natural framework for the imputation of missing data, but it is unclear how to handle the weights. We propose a weighted bootstrap Markov chain Monte Carlo algorithm for estimation and inference. A simulation study shows that it has good inferential properties. We illustrate its utility with an analysis of data from the Millennium Cohort Study.
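
The paper's algorithm is not spelled out in this abstract; purely as an illustration of the weighted-resampling building block (not the proposed weighted bootstrap MCMC itself), a sketch in Python might look like this, with made-up data and weights.

```python
# Weighted bootstrap resampling: draw rows with probability proportional
# to survey weights. This is only the resampling building block, not the
# paper's full weighted bootstrap MCMC algorithm.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"y": rng.normal(size=500),
                   "w": rng.uniform(0.5, 3.0, size=500)})   # survey weights

p = (df["w"] / df["w"].sum()).to_numpy()   # selection probabilities
y = df["y"].to_numpy()

boot_means = []
for _ in range(1000):
    idx = rng.choice(len(df), size=len(df), replace=True, p=p)
    boot_means.append(y[idx].mean())

print("weighted point estimate:", np.average(y, weights=df["w"]))
print("bootstrap 95% interval:", np.percentile(boot_means, [2.5, 97.5]))
```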
