Distilled News

Imagining an Engineer: On GAN-Based Data Augmentation Perpetuating Biases

The use of synthetic data generated by Generative Adversarial Networks (GANs) has become a popular data augmentation method for many applications. While practitioners celebrate this as an economical way to obtain more training data for downstream classifiers, it is not clear that they recognize the inherent pitfalls of the technique. In this paper, we caution practitioners against deriving any false sense of security against data biases from such augmentation. To drive this point home, we show that, starting with a dataset consisting of head-shots of engineering researchers, GAN-based augmentation ‘imagines’ synthetic engineers, most of whom have masculine features and white skin color (inferred from a human subject study conducted on Amazon Mechanical Turk). This demonstrates how biases inherent in the training data are reinforced, and sometimes even amplified, by GAN-based data augmentation; it should serve as a cautionary tale for lay practitioners.
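
As a minimal sketch of the augmentation step in question (not the paper's code), the snippet below appends samples from a pre-trained GAN generator to a real training set; `generator`, its latent size, and the labeling scheme are placeholders:

```python
import numpy as np

def augment_with_gan(X_real, y_real, generator, n_synthetic, label):
    """Append GAN samples to a training set -- the augmentation step the
    paper warns can reinforce whatever biases are present in X_real.
    `generator` stands for any pre-trained Keras-style generator model."""
    latent_dim = generator.input_shape[-1]
    z = np.random.normal(size=(n_synthetic, latent_dim))
    X_fake = generator.predict(z)
    y_fake = np.full(n_synthetic, label)
    return (np.concatenate([X_real, X_fake]),
            np.concatenate([y_real, y_fake]))
```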

Relation extraction with weakly supervised learning based on process-structure-property-performance reciprocity

In this study, we develop a computer-aided material design system to represent and extract knowledge related to material design from natural language texts. A machine learning model is trained on a text corpus weakly labeled by minimal annotated relationship data (~100 labeled relationships) to extract knowledge from scientific articles. The knowledge is represented by relationships between scientific concepts, such as {annealing, grain size, strength}. The extracted relationships are represented as a knowledge graph formatted according to design charts, inspired by the process-structure-property-performance (PSPP) reciprocity. The design chart provides an intuitive view of the effect of processes on properties and suggests prospective processes for achieving certain desired properties. Our system semantically searches the scientific literature and provides knowledge in the form of a design chart, and we hope it contributes to more efficient development of new materials.
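
As a rough illustration (not the authors' implementation), extracted relationships can be stored as edges of a directed graph and queried for process-property chains; the triples below are hypothetical examples in the paper's {process, structure, property} spirit:

```python
import networkx as nx

# Hypothetical extracted relationships: (source concept, relation, target concept).
triples = [
    ("annealing", "increases", "grain size"),
    ("grain size", "decreases", "strength"),
]

G = nx.DiGraph()
for src, rel, dst in triples:
    G.add_edge(src, dst, relation=rel)

# Query: which upstream concepts ultimately affect "strength"?
print(list(nx.ancestors(G, "strength")))  # ['annealing', 'grain size']
```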

Introducing vizscorer: a bot advisor to score and improve your ggplot plots

One of the most frustrating issues I face in my professional life is the plenitude of ineffective reports generated within my company. Wherever I look, I see plenty of junk charts, like bar plots with useless 3D effects or ambiguous, crowded pie charts.

Extended Isolation Forest

This is a simple package implementation of the Extended Isolation Forest method. It is an improvement on the original Isolation Forest algorithm, described (among other places) in this paper, for detecting anomalies and outliers in a data point distribution. The original algorithm suffers from an inconsistency in producing anomaly scores due to its slicing operations. Even though the slicing hyperplanes are selected at random, they are always parallel to the coordinate reference frame. The shortcoming can be seen in score maps, as presented in the example notebooks in this repository. In order to improve the situation, we propose an extension which allows the hyperplanes to be taken at random angles. The way in which this is done gives rise to multiple levels of extension depending on the dimensionality of the problem. For an N-dimensional dataset, Extended Isolation Forest has N levels of extension, with 0 being identical to the standard Isolation Forest, and N-1 being the fully extended version. Here we provide the source code for the algorithm as well as documented example notebooks to help you get started. Various visualizations are provided, such as score distributions, score maps, aggregate slicing of the domain, and tree and whole-forest visualizations. Most examples are in 2D; we present one 3D example. However, the algorithm works readily with higher-dimensional data.
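
A minimal usage sketch, assuming the `eif` package from this repository is installed (the argument names follow its README-era API and may differ across versions):

```python
import numpy as np
import eif as iso

# Two-dimensional toy data: a dense blob plus a few scattered outliers.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),
               rng.uniform(-6, 6, size=(10, 2))])

# ExtensionLevel=0 reproduces standard Isolation Forest;
# for 2-D data, ExtensionLevel=1 is the fully extended version.
forest = iso.iForest(X, ntrees=200, sample_size=256, ExtensionLevel=1)
scores = forest.compute_paths(X_in=X)  # higher score = more anomalous
print(scores[:5])
```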

What is Hidden in the Hidden Markov Model?

Hidden Markov Models, or HMMs, are the most common models used for dealing with temporal data. They also frequently come up in data science interviews, usually without the word HMM written over them. In such a scenario it is necessary to recognize the problem as an HMM problem by knowing the characteristics of HMMs.
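
For a sense of the mechanics, here is the forward algorithm, the core HMM computation, on a toy model (the numbers are made up for illustration):

```python
import numpy as np

# Toy HMM: 2 hidden states, 3 possible observations.
pi = np.array([0.6, 0.4])        # initial state distribution
A = np.array([[0.7, 0.3],        # transition probabilities A[i, j] = P(j | i)
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],   # emission probabilities B[i, k] = P(obs k | i)
              [0.6, 0.3, 0.1]])
obs = [0, 1, 2]                  # an observed sequence (as indices)

# Forward algorithm: likelihood of the observations,
# marginalized over all hidden state paths.
alpha = pi * B[:, obs[0]]
for t in obs[1:]:
    alpha = (alpha @ A) * B[:, t]
print("P(obs) =", alpha.sum())
```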

How to Ensure Safety and Security During Condition Monitoring and Predictive Maintenance: A Case Study

Effective communication among machines, embedded sensors, and actuators in industry is crucial to achieving industrial digitalization. Efficient remote monitoring and maintenance methodologies help transform existing industries into Smart Factories. Monitoring and maintenance lead to the aggregation of real-time data from sensors via different existing and new industrial communication protocols. Development of a user-friendly interface allows remote Condition Monitoring (CM). Context-aware analysis of real-time and historical data provides the capability to accomplish active Predictive Maintenance (PdM). Both CM and PdM need access to the machine process data and the industrial network and communication layer. Furthermore, the data flow between the individual Cyber-Physical System (CPS) components, from the actual machine to the database or analysis engine to the visualization, is important. Security and safety aspects at the application, communication, network, and data-flow levels should be considered. This thesis presents a case study on the benefits of PdM and CM, the security and safety aspects of the system, and the current challenges and improvements. Components of the CPS ecosystem are examined to further investigate the individual components that enable predictive maintenance and condition monitoring, and the safety and security aspects of each component are analyzed. Moreover, the current challenges and possible improvements of PdM and CM systems, including those concerning individual components, are analyzed. Finally, based on the research, possible improvements have been proposed and validated by the researcher. For the new digital era of secure and robust PdM 4.0, these improvements are vital references.

The ultimate guide to starting AI

A step-by-step overview of how to begin your project, including advice on crafting a wise performance metric, setting up testing criteria to overcome human bias, and more.

Self-Service Analytics and Operationalization – Why You Need Both

Get the guidebook / whitepaper for a look at how today’s top data-driven companies scale their advanced analytics & machine learning efforts.

Managing risk in machine learning

In this post, I share slides and notes from a keynote I gave at the Strata Data Conference in New York last September. As the data community begins to deploy more machine learning (ML) models, I wanted to review some important considerations.

Preview my new book: Introduction to Reproducible Science in R

I’m pleased to share Part I of my new book ‘Introduction to Reproducible Science in R’. The purpose of this book is to approach model development and software development holistically to help make science and research more reproducible. The need for such a book arose from challenges I’ve observed while teaching graduate courses in natural language processing and machine learning, as well as training my own staff to become effective data scientists. While quantitative reasoning and mathematics are important, I often found that the primary obstacle to good data science was reproducibility and repeatability: it’s difficult to quickly reproduce someone else’s results.

A Data Lake’s Worth of Audio Datasets

At Wonder Technologies, we have spent a lot of time building deep learning systems that understand the world through audio. From deep-learning-based voice extraction to teaching computers how to read our emotions, we needed a wide set of data to deliver APIs that worked even in the craziest sound environments. Here is a list of datasets that I found useful in our research and that I’ve personally used to make my audio-related models perform much better in real-world environments.
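
For example, a typical first step with any of these datasets is loading a clip and computing a log-mel spectrogram, a common input representation for audio models (the file name below is a placeholder):

```python
import librosa

# Hypothetical clip from one of the listed datasets.
y, sr = librosa.load("clip.wav", sr=16000)

# Mel spectrogram in decibel scale.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (n_mels, frames)
```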

Inferential Statistics basics

Statistics is one of the most important skills required of a data scientist. There is a lot of mathematics involved in statistics, and it can be difficult to grasp. So in this tutorial we are going to go through some of the concepts of statistics to learn, understand, and master inferential statistics.
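
As a taste of the kind of inference covered, here is a 95% confidence interval for a population mean, computed with SciPy from a simulated sample:

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 40 measurements from an unknown population.
rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=40)

# Point estimate and 95% CI for the population mean, using the
# t-distribution since the population variance is unknown.
mean = sample.mean()
sem = stats.sem(sample)
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```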

Hybrid Fuzzy Name Matching

My workplace works with large-scale databases that, amongst many things, contain data about people. For each person in the DB we have a unique identifier, which is composed of the person’s first name, last name, and zip code. We hold ~500MM people in our DB, which can essentially contain duplicates if there is a small change in a person’s name. For example, Rob Rosen and Robert Rosen (with the same zip code) will be treated as two different people. I want to note that if we encounter the same person an additional time, we just update the record’s timestamp, so there is no need for that sort of deduping. In addition, I would like to give credit to my co-worker Jonathan Harel, who assisted me in the research for this project.
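
The post describes a hybrid pipeline; as a much simpler sketch of the core idea, names sharing a blocking key (here the zip code) can be compared with a character-level similarity from the standard library (the 0.8 threshold is illustrative, not the post's tuned value):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Candidate duplicates share a zip code; compare names only within a block.
record_a = ("Rob Rosen", "90210")
record_b = ("Robert Rosen", "90210")

if record_a[1] == record_b[1] and name_similarity(record_a[0], record_b[0]) > 0.8:
    print("Likely the same person")
```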

Predicting Probability Distributions Using Neural Networks

If you’ve been following our tech blog lately, you might have noticed we’re using a special type of neural network called a Mixture Density Network (MDN). MDNs predict not only the expected value of a target but also the underlying probability distribution. This blog post will focus on how to implement such a model using TensorFlow from the ground up, including explanations, diagrams, and a Jupyter notebook with the entire source code.
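
A condensed sketch of the idea in TensorFlow 2 (not the post's full implementation): the network outputs mixture weights, means, and scales for K Gaussians and is trained on the mixture's negative log-likelihood. K and the toy data are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

K = 5  # number of mixture components (an illustrative choice)

def mdn_loss(y_true, params):
    """Per-sample negative log-likelihood under a K-Gaussian mixture."""
    logits_pi, mu, log_sigma = tf.split(params, 3, axis=-1)
    log_pi = tf.nn.log_softmax(logits_pi, axis=-1)
    sigma = tf.exp(log_sigma) + 1e-6
    # Gaussian log-density of the scalar target under each component.
    log_prob = (-0.5 * np.log(2.0 * np.pi)
                - tf.math.log(sigma)
                - 0.5 * tf.square((y_true - mu) / sigma))
    return -tf.reduce_logsumexp(log_pi + log_prob, axis=-1)

inputs = tf.keras.Input(shape=(1,))
h = tf.keras.layers.Dense(64, activation="relu")(inputs)
params = tf.keras.layers.Dense(3 * K)(h)  # [pi logits | mu | log sigma]
model = tf.keras.Model(inputs, params)
model.compile(optimizer="adam", loss=mdn_loss)

# Toy heteroscedastic data: the noise level depends on x.
x = np.random.uniform(-1, 1, size=(2048, 1)).astype("float32")
y = (np.sin(4 * x) + np.random.normal(0, 0.1 + 0.3 * np.abs(x))).astype("float32")
model.fit(x, y, epochs=10, batch_size=64, verbose=0)
```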

Why and how to Cross Validate a Model?

Once we are done with training our model, we can’t just assume that it is going to work well on data that it has not seen before. In other words, we can’t be sure that the model will have the desired accuracy and variance in a production environment. We need some kind of assurance of the accuracy of the predictions that our model is putting out. For this, we need to validate our model. This process of deciding whether the numerical results quantifying hypothesised relationships between variables are acceptable as descriptions of the data is known as validation.
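
For instance, k-fold cross-validation, the most common validation scheme, takes a few lines with scikit-learn (shown here on the built-in iris data rather than the post's example):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: each fold is held out once for evaluation while the
# model is trained on the remaining four folds.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```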

From Eigendecomposition to Determinant: Fundamental Mathematics for Machine Learning with Intuitive Examples Part 3/3

To understand the mathematics behind machine learning algorithms, especially deep learning algorithms, it is essential to build up the mathematical concepts from the foundational to the more advanced. Unfortunately, mathematical theories are often too hard/abstract/dry to digest. Imagine you are eating a pizza: it is always easier and more fun to go with a Coke. The purpose of this article is to provide intuitive examples of fundamental mathematical theories to make the learning experience more enjoyable and memorable: serving chicken wings with beer, fries with ketchup, and rib-eye with wine.
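
In that spirit, here is a small NumPy example connecting the two endpoints of this part's title: reconstructing a matrix from its eigendecomposition and checking that the determinant equals the product of the eigenvalues:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

# Eigendecomposition: A v_i = lambda_i v_i
eigvals, eigvecs = np.linalg.eig(A)

# Reconstruct A = V diag(lambda) V^{-1}
A_rebuilt = eigvecs @ np.diag(eigvals) @ np.linalg.inv(eigvecs)
assert np.allclose(A, A_rebuilt)

# det(A) = product of eigenvalues (here 5 * 2 = 10).
assert np.isclose(np.linalg.det(A), np.prod(eigvals))
```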

Automated Feature Engineering for Predictive Modeling

One of the main time investments I’ve seen data scientists make when building data products is manually performing feature engineering. While tools such as auto-sklearn and Auto-Keras have been able to automate much of the model fitting process when building a predictive model, determining which features to use as input to the fitting process is usually a manual process. I recently started using the FeatureTools library, which enables data scientists to also automate feature engineering.
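
A minimal FeatureTools sketch with made-up retail tables (the calls shown match the library's pre-1.0 API from around when the post was written; newer releases renamed them):

```python
import pandas as pd
import featuretools as ft

# Hypothetical parent/child tables: customers and their transactions.
customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [25.0, 40.0, 10.0],
})

es = ft.EntitySet(id="retail")
es = es.entity_from_dataframe(entity_id="customers", dataframe=customers,
                              index="customer_id")
es = es.entity_from_dataframe(entity_id="transactions", dataframe=transactions,
                              index="transaction_id")
es = es.add_relationship(ft.Relationship(
    es["customers"]["customer_id"], es["transactions"]["customer_id"]))

# Deep Feature Synthesis: automatically aggregates transactions up to each
# customer, e.g. SUM(transactions.amount), MEAN(transactions.amount), ...
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="customers")
print(feature_defs)
```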

Finding and managing research papers: a survey of tools and products

As researchers, especially in (overly) prolific fields like Deep Learning, we often find ourselves overwhelmed by the huge amount of papers to read and keep track of in our work. I think one big reason for this is insufficient use of existing tools and services that aim to make our life easier. Another reason is the lack of a really good product which meets all our needs under one interface, but that is a topic for another post. Lately I’ve been getting into a new subfield of ML and got extremely frustrated with the process of prioritizing, reading and managing the relevant papers… I ended up looking for tools to help me deal with this overload and want to share with you the products and services that I’ve found. The goal is to improve the workflow and quality of life of anyone who works with scientific papers.

A New Hyperbolic Tangent Based Activation Function for Neural Networks

In this article, I introduce a new hyperbolic-tangent-based activation function, the tangent linear unit (TaLU), for neural networks. The function was evaluated for performance using the CIFAR-10 and CIFAR-100 databases. The performance of the proposed activation function was on par with or better than other activation functions such as the standard rectified linear unit (ReLU), leaky rectified linear unit (Leaky ReLU), and exponential linear unit (ELU).
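
The summary above doesn't give TaLU's exact formula, so the sketch below is only a guess at the general shape of a tanh-based rectifier (identity for positive inputs, tanh-saturating for negative ones), not the paper's definition:

```python
import numpy as np

def talu(x, alpha=-0.5):
    """Hypothetical tanh-based rectifier: identity for x >= 0, tanh(x)
    on [alpha, 0), and flat at tanh(alpha) below alpha. An illustrative
    guess at the shape only -- see the article for the real definition."""
    return np.where(x >= 0, x,
                    np.where(x >= alpha, np.tanh(x), np.tanh(alpha)))

print(talu(np.array([-2.0, -0.3, 0.0, 1.5])))
```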

Beginner tutorial: Build your own custom real-time object classifier

In this tutorial, we will learn how to build a custom real-time object classifier to detect any object of your choice! We will be using BeautifulSoup and Selenium to scrape training images from Shutterstock, Amazon’s Mechanical Turk (or BBox Label Tool) to label images with bounding boxes, and YOLOv3 to train our custom detection model.
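
As a taste of the scraping step, here is a BeautifulSoup-only sketch against a hypothetical static results page (the tutorial itself drives Shutterstock with Selenium, which is omitted here):

```python
import os
import requests
from bs4 import BeautifulSoup

# Placeholder search-results URL, not Shutterstock's real page structure.
URL = "https://example.com/search?q=raccoon"

html = requests.get(URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

os.makedirs("training_images", exist_ok=True)
for i, img in enumerate(soup.find_all("img")):
    src = img.get("src")
    if not src or not src.startswith("http"):
        continue  # skip inline/relative images in this simple sketch
    with open(f"training_images/{i:04d}.jpg", "wb") as f:
        f.write(requests.get(src, timeout=10).content)
```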
