Distilled News

Cornell Statistical Consulting Unit News Archive

The Cornell Statistical Consulting Unit produces materials about statistical methods and tools, including handouts and recommended articles.

A Framework for Intelligence and Cortical Function Based on Grid Cells in the Neocortex

Despite the massive amount of detail neuroscientists have amassed about the neocortex, how it works is still a mystery. In this paper, we propose a novel theoretical framework for understanding what the neocortex does and how it does it. Our proposal is based upon grid cells. The study of grid cells has been one of the most exciting areas of neuroscience over the past decade. Found in the entorhinal cortex, grid cells are used in navigation and are a powerful neural mechanism for representing the location of a body in the environment. We propose that the same mechanisms in the entorhinal cortex and hippocampus that originally evolved for learning the structure of environments are now used by the neocortex to learn the structure of objects. The mechanism involves pairing location signals with sensory input over time. The framework suggests mechanisms for how the cortex represents object compositionality, object behaviors and even high-level concepts. It leads to the hypothesis that every part of the neocortex learns complete models of objects. Unlike traditional hierarchical ideas where objects are learned only at the top, the paper proposes that there are many models of each object distributed throughout the neocortex. We call this hypothesis the Thousand Brains Theory of Intelligence.

Introduction to Cyclical Learning Rates

Neural networks are no longer an uncommon phrase in the computer science community, or, let's say, in society in general. What makes them so compelling is not just the number of real-world problems they solve, but the sheer variety of those problems: cognitive psychology, cyber security, health care (setting aside computer vision, computer graphics, natural language processing, and so on for the time being), to name some of the less obvious domains. Almost every industry is benefiting tremendously from the intelligence and automation a neural network has to offer.
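The teaser does not show what a cyclical learning rate actually looks like. As a minimal sketch (the parameter values are made up), Leslie Smith's triangular policy oscillates the rate between a lower and an upper bound over a fixed number of iterations:

```python
import math

def triangular_lr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Triangular cyclical learning rate: ramps linearly from base_lr to
    max_lr and back down once every 2 * step_size iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# The rate rises for the first 2000 iterations, peaks, then falls back.
print(triangular_lr(0), triangular_lr(1000), triangular_lr(2000), triangular_lr(4000))
```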

Supercharge Your Subqueries

In this tutorial, you’ll learn how to create subqueries in SQL to better analyze and report data. Once you become familiar with SQL, you realize that all of the really cool analyses require multiple steps. For example, suppose you want to create a histogram of the amount of time each user spends on your website. First, you’ll need to calculate the amount of time spent per user. Then, you’ll want to count the number of users who spend a certain amount of time on your site. There are three options for creating these subqueries:
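As a rough illustration of the two-step analysis described above (the table and column names are hypothetical, not from the tutorial), the per-user aggregation can be nested as a subquery inside the histogram query. The sketch below runs the whole thing with Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (user_id INTEGER, seconds_on_site INTEGER);
    INSERT INTO visits VALUES (1, 40), (1, 80), (2, 300), (3, 290), (3, 20);
""")

# Inner query: total time per user.  Outer query: count users per time bucket.
histogram = conn.execute("""
    SELECT total_seconds / 60 AS minutes_bucket,
           COUNT(*)           AS num_users
    FROM (
        SELECT user_id, SUM(seconds_on_site) AS total_seconds
        FROM visits
        GROUP BY user_id
    ) AS per_user
    GROUP BY minutes_bucket
    ORDER BY minutes_bucket;
""").fetchall()

print(histogram)  # [(2, 1), (5, 2)] for the toy data above
```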

Machine Learning Benchmarking with SFA in R

Regression analysis is among the most in-demand machine learning methods of 2018. One family of regression methods for measuring economic efficiency is stochastic frontier analysis (SFA). This method is well suited for benchmarking and finding improvements for optimization in companies and organizations. It can, therefore, be used to help design companies so they generate more value for employees and customers.
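For context, a minimal sketch of the standard stochastic frontier specification, in which output is explained by inputs plus symmetric noise minus a non-negative inefficiency term:

$$\log y_i = \mathbf{x}_i^{\top}\beta + v_i - u_i, \qquad v_i \sim \mathcal{N}(0,\sigma_v^2), \qquad u_i \sim \mathcal{N}^{+}(0,\sigma_u^2)$$

Technical efficiency of unit $i$ is then commonly reported as $\mathrm{TE}_i = \exp(-u_i)$, which is what makes the model a natural fit for benchmarking.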

New Research Uncovers $500 Million Enterprise Value Opportunity with Data Literacy

According to a major academic study commissioned by Qlik®, on behalf of the newly launched Data Literacy Project, large enterprises with higher corporate data literacy experience $320-$534 million in higher enterprise value (the total market value of the business). Corporate data literacy is the ability of a company's workforce to read, analyze, communicate, and use data for decision making throughout the organization. Beyond having a data-literate workforce, organizations must ensure these skills are used for decision making across the business to compete in the fourth industrial revolution. Despite the clear correlation between enterprise value and data literacy, there is a gap between how companies perceive the importance and relevance of data and how actively they are increasing workforce data literacy. While 92% of business decision makers believe it is important for employees to be data literate, just 17% report that their business significantly encourages employees to become more confident with data.

Modularize your Shiny Apps: Exercises

Shiny modules are short (well, usually short) server and UI functions that can be connected to each other by a common namespace and embedded within a regular Shiny app. You can't run a Shiny module without a parent Shiny app. Modules can contain both inputs and outputs, and are usually centered around a single operation or theme. The biggest advantage of modules is the ability to efficiently reuse Shiny code, which can save a great deal of time. In addition, modules can help you standardize and scale your Shiny operations. Lastly, even if not reused, Shiny modules can help with organizing the code and breaking it into smaller pieces, which is very much needed in many complex Shiny apps. Some more information on Shiny modules can be found here.

Running R scripts within in-database SQL Server Machine Learning

Having all the R functions, libraries, and definitions of any kind (URLs, links, working directories, environments, memory, etc.) in one file is nothing new, but it is sometimes a lifesaver. The R function source is the way to achieve this. Storing all definitions, functions, and classes in one place can help enterprises achieve faster installation and safer environment usage. So the idea is simple: stack all the needed functions, classes, libraries, and configurations you want to use in a specific environment and save them in an R file. Create a file with all the needed setup of libraries and functions, as seen below. We will call this file util.R.

Warm Starting Bayesian Optimization

Hyper-parameter tuning is required whenever a machine learning model is trained on a new data set. Nevertheless, it is often forgone because it lacks a theoretical framework, which I have previously tried to demystify here.
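A minimal sketch of warm starting with scikit-optimize (my choice of library, not necessarily the post's): gp_minimize accepts previously evaluated points through x0 and y0, so a new tuning run can reuse evaluations from an earlier, related one. The objective and search space below are made up:

```python
import numpy as np
from skopt import gp_minimize

# Hypothetical objective: validation loss as a function of log10(learning rate).
def objective(params):
    log_lr, = params
    return (log_lr + 3.0) ** 2 + 0.1 * np.random.randn()

# Evaluations carried over from a previous tuning run on a related data set.
x_prev = [[-1.0], [-2.0], [-4.0]]
y_prev = [objective(x) for x in x_prev]

# Warm start: pass the old (x, y) pairs so the surrogate model
# does not have to start from scratch.
result = gp_minimize(
    objective,
    dimensions=[(-6.0, 0.0)],  # search space for log10(learning rate)
    x0=x_prev,
    y0=y_prev,
    n_calls=20,
    random_state=0,
)
print(result.x, result.fun)
```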

Knowledge Plus Statistics: Understanding the Emerging World of Deep Probabilistic Programming…

The use of statistics to overcome uncertainty is one of the pillars of a large segment of the machine learning market. Probabilistic reasoning has long been considered one of the foundations of inference algorithms and is represented in all major machine learning frameworks and platforms. Recently, probabilistic reasoning has seen major adoption within tech giants like Uber, Facebook, and Microsoft, helping to push the research and technological agenda in the space. Specifically, probabilistic programming languages (PPLs) have become one of the most active areas of development in machine learning, sparking the release of some new and exciting technologies.
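To give a flavor of what a PPL looks like in practice, here is a toy sketch in Pyro, the probabilistic programming language open-sourced by Uber (one of the companies mentioned above); it assumes Pyro and PyTorch are installed and is not taken from the article:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

# A toy model: an unknown mean with a wide prior, observed through noisy data.
def model(data):
    mu = pyro.sample("mu", dist.Normal(0.0, 10.0))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(mu, 1.0), obs=data)

data = 3.0 + torch.randn(200)          # synthetic observations centered near 3
guide = AutoNormal(model)              # variational approximation to the posterior
svi = SVI(model, guide, Adam({"lr": 0.02}), loss=Trace_ELBO())

for step in range(1000):
    svi.step(data)                     # stochastic variational inference updates

print(guide.median()["mu"])            # posterior estimate of mu, close to 3
```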

Deep Learning Performance Cheat Sheet

Simple and complex tricks that can help you boost your deep learning models' accuracy. The question I get the most from new and experienced machine learning engineers is 'how can I get higher accuracy?' That makes a lot of sense, since the most valuable part of machine learning for business is often its predictive capability. Improving prediction accuracy is an easy way to squeeze more value from existing systems.

Data Science for Startups: PySpark

Spark is a great tool for enabling data scientists to translate from research code to production code, and PySpark makes this environment more accessible. While I’ve been a fan of Google’s Cloud DataFlow for productizing models, it lacks an interactive environment that makes it easy to both prototype and deploy data science models. Spark is a great tool for startups, because it provides both an interactive environment for performing analysis, and scalability for putting models into production. This post discusses how to spin up a cluster on GCP and connect to Jupyter in order to work with Spark in a notebook environment.
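As a minimal sketch of the kind of interactive work described here (the bucket path and schema are hypothetical), once the notebook is attached to a Spark cluster a session and a quick aggregation look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or attach to) a Spark session -- on GCP this runs against the
# Dataproc cluster the Jupyter notebook is connected to.
spark = SparkSession.builder.appName("notebook-prototype").getOrCreate()

# Hypothetical event log in a GCS bucket.
events = spark.read.csv("gs://my-bucket/events.csv", header=True, inferSchema=True)

# The same interactive analysis you would run while prototyping.
(events
    .groupBy("user_id")
    .agg(F.count("*").alias("n_events"))
    .orderBy(F.desc("n_events"))
    .show(10))
```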

How we built Data Science Web App “Route Planner X” on AWS infrastructure

Nowadays it is easier than ever to find companies with millions or billions of rows of data. The problem, however, is not 'how to create more data' but 'how to make use of this huge amount of data'. One way to solve it is to migrate the data to a data warehouse, a database built for analytics tasks. That being said, building a data warehouse is an expensive and time-consuming process. The emergence of public clouds such as AWS, GCP, and Azure offers a better way to approach this problem. In this blog post, I will tell you how our team of five graduates built an end-to-end data science web application to handle 150 million rows of data.

Self Learning AI-Agents Part I: Markov Decision Processes

This is the first article of a multi-part series on self-learning AI agents, or, to call it more precisely, Deep Reinforcement Learning. The aim of the series isn't just to give you an intuition for these topics. Rather, I want to provide you with a more in-depth comprehension of the theory, mathematics, and implementation behind the most popular and effective methods of Deep Reinforcement Learning.

Taking Deep Q Networks a step further

Today's topic is … well, the same as the last one: Q-learning and Deep Q Networks. Last time, we explained what Q-learning is and how to use the Bellman equation to find the Q-values and, as a result, the optimal policy. Later, we introduced Deep Q Networks and how, instead of computing all the values of the Q-table, we let a deep neural network learn to approximate them. Deep Q Networks take as input the state of the environment and output a Q-value for each possible action. The maximum Q-value determines which action the agent will perform. The training of the agent uses the TD error as its loss, which is the difference between the maximum possible value for the next state and the current prediction of the Q-value (as the Bellman equation suggests). As a result, we manage to approximate the Q-table using a neural network. So far so good. But of course, a few problems arise; that's just the way scientific research moves forward. And of course, we have come up with some great solutions.
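A minimal sketch of the TD error described above, with q_net and target_net standing in as hypothetical callables that map a state to a vector of Q-values, one per action:

```python
import numpy as np

def td_error(q_net, target_net, state, action, reward, next_state, done, gamma=0.99):
    """One-step TD error used as the DQN training signal."""
    q_pred = q_net(state)[action]                # current estimate Q(s, a)
    q_next_max = np.max(target_net(next_state))  # max_a' Q(s', a') from the Bellman equation
    td_target = reward + (1.0 - done) * gamma * q_next_max
    return td_target - q_pred                    # squared in practice to form the loss
```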

Training machine learning models online for free (GPU, TPU enabled)!!!

The computational power needed to train machine learning and deep learning models on large datasets has always been a huge hindrance for machine learning enthusiasts. But with Jupyter notebooks that run in the cloud, anyone who has the passion to learn can train models and come up with great results.

BeakerX and Python for Data Visualization

Jupyter Notebooks provide data engineers with a formidable tool for extracting insights from troves of data on the fly. Typically, Pythonistas use the notebooks for quickly compiling code, testing/debugging algorithms, and scaling program executions; a robust JavaScript kernel (IJavascript) is also now available for the notebooks, although even notebook-run JavaScript still adheres to the single-assignment constraint for variable declarations. As a result of Jupyter's ascension alongside Python installation packages, Jupyter Notebooks have rapidly grown into the default playground for Python learning and data science, and data visualization is becoming an ever more accessible window into large-scale datasets and predictive mappings. BeakerX is an open-source project initiated by Two Sigma, an investment management firm that foregrounds machine learning and distributed computing in its investment decisions. Of Two Sigma's many open-source projects, BeakerX is by far the most robust and most actively contributed to by the community, with a multifaceted objective to expand the Jupyter ecosystem.

Realtime prediction using Spark Structured Streaming, XGBoost and Scala

In this article we will discuss building a complete machine learning pipeline. The first part focuses on training a binary classifier in a standard batch mode, and in the second part we will do some real-time prediction. We will use data from Titanic: Machine Learning from Disaster, one of the many Kaggle competitions. Before getting started, please know that you should be familiar with Scala, Apache Spark, and XGBoost.
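The article itself works in Scala with XGBoost; as a rough Python stand-in for the batch-training half only, a Spark ML gradient-boosted-tree classifier on the standard Kaggle Titanic CSV might look like this (column handling kept deliberately minimal):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.appName("titanic-batch").getOrCreate()

# Standard Kaggle "Titanic: Machine Learning from Disaster" training file.
df = spark.read.csv("train.csv", header=True, inferSchema=True).na.drop(
    subset=["Survived", "Pclass", "Sex", "Age", "Fare"])

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="Sex", outputCol="SexIndexed"),
    VectorAssembler(inputCols=["Pclass", "SexIndexed", "Age", "Fare"],
                    outputCol="features"),
    GBTClassifier(labelCol="Survived", featuresCol="features", maxIter=50),
])

model = pipeline.fit(df)
model.write().overwrite().save("titanic-gbt-model")  # reloaded later for scoring the stream
```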
