Data Science Glossary

A

A/B Testing:

A mechanism of testing two techniques or two versions to determine which one is better.

Activation Function

A function that maps the input to a neuron, to the output. If the result of this function is above a certain threshold, the neuron is activated. Sigmoid Function and ReLU are some examples of the activation function.

Artificial Intelligence

The mechanism by which a machine can take input from the environment via sensors, process this input using the experiences it has gained and take rational and intelligent decision on the environment using actuators, much like how humans do.

AutoEncoders

A neural network where the output is set to be the same as the input and the goal is to train the hidden layers such that the input can be represented in fewer dimensions by encoding.

B

Backpropagation

A technique used in training deep learning networks to update the weights and biases by calculating the gradient so that the accuracy of the network can be improved iteratively.

Bayes’ Theorem

This theorem gives the probability of an event when we already have information about some other condition related to the event. The relation is given by P(A

B) = P(B

A) / P(A)*P(B)

Bias

A parameter that can be used to tweak the output values towards or away from the decision boundary.

Big Data

Large volume of data that cannot be processed by traditional data processing applications and may be analysed to generate valuable insights from the data.

C

Classification

The process of determining the score for an individual entry in order to find out which class the individual belongs to.

Clustering

The method of grouping a set of data such that all elements of a particular group have some common parameters or characteristics.

Confusion Matrix

A matrix that defines the performance of a classification model on a test data for which the true values are known. This matrix uses the True Positive(TP), True Negative (TN), False Positive(FP) and False Negative(FN) to evaluate the performance.

Cost Function

A function for a model that determines the error in prediction of a dependent variable given an independent variable.

Cross-validation

The process of splitting the labelled data into training set and testing set, and after the model has been trained with the training set, testing the model using the testing set for which we already know what the output should be. This results in determining how well the model is working on previously unseen data.

D

Data Mining

The process of gathering relevant data for the area of interest, generate a model to find patterns and relationships amongst the data, and present the derived information in appropriate and useful form.

Data Science

The method of using appropriate tools and techniques to represent the most efficient algorithm for generating insights from a problem set that we have substantial expertise and data on.

Data Wrangling

The process of acquiring data from multiple sources, cleaning the data (removing/replacing missing/redundant data), combining the data to acquire only required fields and entries, and preparing the data for easy access and analysis.

Decision Boundary

A boundary that separates the elements of one class from the elements of another class. See this link for a good explanation on decision boundary.

Decision Tree

A supervised learning algorithm that models a tree where every branch represents a set of alternatives and leaves represent the decisions. By taking a series of decisions along the branches, we ultimately reach the desired result at one of the leaves. See this link for an in-depth explanation on decision trees.

Deep Learning

A subfield of machine learning that uses the power of neural network to accelerate the processing of algorithms to make computations on huge amount of data.

Dependent Variable

A variable that is under test and changes with respect to the change in an independent variable. In a housing price example, change in area results in change in price. Here, price is a dependent variable which depends on the independent variable area.

Dimensionality Reduction

The process of reducing the number of features or dimensions of a training set without losing much information from the data and increase the model’s performance.

E

Exploratory Data Analysis

A data analysis approach used to discover insights in the data, often using graphical techniques.

F

Feature

The input variables in a problem set that can be measured to provide information.

Feature Selection

Selection of a subset of the most relevant attributes in a problem set. This is done for effective model construction.

G

Gradient Descent

An optimization process to minimize the cost function by finding optimum values for the parameters involved.

Graphics Processing Unit (GPU)

A chip that processes complex mathematical functions rapidly and is generally used for image rendering. Deep learning models generally require lots of processing that are not feasible with time constraints using ordinary CPU and hence GPUs are required.

H

Hyperparameter

Parameters in a model that cannot be directly learnt from the data and is decided by setting different values to determine which works best for the model. Tuning the learning rate for a model is an example of hyperparameter.

I

Independent Variable

A variable that is used to change a dependent variable and experiment its values. See dependent variable for example.

L

Label

The output variables in a problem set, whose values are either provided or needs to be predicted.

Latent Variable

A hidden variable that cannot be computed directly, but are measured by computation of other variables that can be measured directly. The other measurable variables ultimately impact the hidden variable. See this link for good examples of latent variable.

Learning Rate

A hyperparameter that is used to adjust the weights of a network with respect to the gradient loss. Small learning rate could result in taking a long time for the model to converge whereas large learning rate could result in the model never being converged. So, an optimal learning rate must be found for best results.

Linear Regression

A regression technique where the linear relationship between a dependent variable and one or more independent variables is determined. The dependent and independent variables are continuous in nature.

Logistic Regression

A classification technique where the relationship between a binary dependent variable and one or more independent variables is determined.

M

Machine Learning

The process that uses data and statistics to learn and generate algorithms and models to perform intelligent action on previously unseen data.

Model Selection

The process of selecting a statistical model from a set of alternative models for a problem set that results in the right balance between approximation and estimation errors.

Model Tuning

The process of tweaking the hyperparameters in order to improve the efficiency of the model.

N

Natural Language Processing

A field of computer science associated with the study of how computers can understand and interact with humans using the natural language as spoken by humans.

Neural Network

A layered interconnection of neurons that receive input, process the input and use the activation function to generate an output. This output will be the input that is passed onto the next layer and so on until the final output layer. Also known as Artificial Neural Network, it is inspired from how the human brain works.

O

Outlier

Unusual observations in data that lie at an abnormal distance from where majority of the data points are located.

Overfitting

A model is overfitted when it is trained with lots of data such that it learns even the noise parameters in the given data and does not work well with data it has not seen before.

P

Perceptron

A single layer neural network.

Predictive Analysis

A process of predicting unseen events using historical data and statistical techniques derived from data mining and machine learning.

R

Random Forest

A combination of many decision trees in a single model. The predictions are closer to the average in case of random forest as predictions from multiple decision trees are taken into account.

Rectified Linear Unit (ReLU)

An activation function for which the output is 0 if the input is less than 0, otherwise the output is equal to the input. So, it takes the max of zero, i.e. f(x) = max(x,0)

Regression

A technique of measuring the relationship between the mean value of a dependent variable and one or more independent variables.

Reinforcement Learning

A machine learning technique where an agents learn by taking action in an environment and observing the results. The agent can start with no knowledge of the environment and it learns from experience.

S

Sigmoid Function

An S-shaped activation function for which the output approaches 0 if the input approaches -infinity and the output approaches 1 if the input approaches +infinity. The sigmoid function is given by sigmoid(x)=1/(1+e^(-x))

Softmax

A function used in multiple classification logistic regression that calculates the probability of each target class over all possible target classes. The probabilities of each class will be between 0 and 1, and the sum of the probabilities will be equal to 1.

Supervised Learning

A machine learning technique where we provide the model with both the input and the actual output data, and train the model such that the predicted output closely resembles the actual output so that it can later make predictions on unseen data.

Support Vector Machine (SVM)

A supervised machine learning technique which outputs an optimal hyperplane that can classify new examples into multiple classes.

T

Testing Set

A portion of already acquired data we have separated to test the accuracy of the model after the model has been trained with training set. To split 20% of the total available data as test set is considered good but may be changed as per model requirements.

Training Set

A portion of already acquired data we have separated to acquire parameters and train the model which may later be tested with the test set. To split 80% of the total available data as training set is considered fair.

U

Underfitting

A model is underfitted when it is not able to capture the parameters from the given data and hence does not work well even on previously seen data.

Unsupervised Learning

A machine learning technique where we provide the model with only the input data and train the model such that the model learns to determine the pattern in the data and later make predictions on unseen data.

W

Weight

The strength of connection between two neurons of two successive layers in a neural network.

You can also find more terms related to data science and machine learning from the following links: