By Stephen Smith, Eckerson Group.
The hype over big data is on the wane. The cloud, Hadoop and their variants have solved it. But ‘big data’ is still where a lot of people are spending a lot of money, building bigger infrastructures to process, hold and curate these immense databases. This blind pursuit of ‘big’ is generating some sizable and avoidable costs in infrastructure and human resources.
It is now time to change the discussion from ‘big data’ to ‘deep data’. Instead of collecting all possible data in pursuit of ‘big data’, we need to be more thoughtful and judicious. We need to let some data fall on the floor and seek out variety over volume and quality over quantity. This will have many long-term benefits.
The Myths of Big Data
To understand this transition from ‘big’ to ‘deep’, let’s first look at some of the mistaken notions we have about big data. Here are some of the big myths:
- All data can and should be captured and stored.
- More data always helps build a more accurate predictive model.
- The storage cost of more data is nearly zero.
- The computation cost of more data is nearly zero.
Here are the realities:
- Data from the IoT and web traffic still overwhelms our ability to capture it all. Some data must be dropped at ingestion, so we need to be smart and triage our data based on value.
- The same data example repeated a thousand times doesn’t improve the accuracy of a predictive model.
- The cost of storing more data is not just the dollars per terabyte that Amazon Web Services charges you. It is also the additional complexity of finding and managing multiple data sources and the ‘virtual weight’ of your staff moving and using that data. These costs are often higher than the storage and computation charges.
- AI algorithms’ need for computational resources can quickly outstrip even an elastic cloud infrastructure. Computational resources grow linearly, while computational needs can grow superlinearly, or even exponentially if not expertly managed (a toy illustration of this scaling follows this list).
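To make that last point concrete, here is a purely illustrative Python sketch (the row counts are made up): any operation that compares every record to every other record grows quadratically, so ten times the data means roughly a hundred times the computation, while a cloud budget only buys resources linearly.

```python
# Illustrative only: pairwise work grows quadratically with the number of rows,
# so a 10x larger dataset needs ~100x the computation, not 10x.
for n in [1_000, 10_000, 100_000]:
    pairwise_ops = n * (n - 1) // 2  # number of unique row pairs to compare
    print(f"rows={n:>7,}  pairwise comparisons={pairwise_ops:>14,}")
```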
The problem with believing these myths is that you will architect your information systems in ways that look good on paper or in the long run, but are too cumbersome in the immediate time frame to be useful.
Four Problems with Big Data
Here are four problems with blindly believing that ‘more is better’ when it comes to data:
- More of the same doesn’t help. In building machine learning models for AI, diversity of training examples is critically important, because the models are trying to determine concept boundaries. For example, if your model is trying to define the concept of a ‘retired worker’ using age and occupation, then repeated examples of 32-year-old Certified Public Accountants do the model little good, since none of them are retired. It is much more helpful to get examples at the concept boundary around age 65 and to see how retirement varies with occupation (see the sketch after this list).
- Noisy data can hurt a model. If new data has errors in it, or is imprecise, it will only serve to muddy the boundary between two concepts that an AI is trying to learn. More data, in this case, won’t help and could actually decrease the accuracy of your existing models.
- Big data slows everything down. Building a model on a terabyte of data may take one thousand times longer than building a model on a gigabyte of data, or ten thousand times longer, depending on the learning algorithm. Data science is all about rapid experimentation. Better to be nimble and imperfect. Fail fast. Fail forward.
- Big data can produce models that can’t be implemented. The end goal of any predictive model is to create a highly accurate model that can be deployed for the business. Sometimes using more obscure data from the dark recesses of the data lake may result in higher accuracy, but that data might be unreliable for actual deployment. Better to have a less accurate model that runs quickly and can be used by the business.
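As a minimal sketch of the ‘more of the same doesn’t help’ point, the toy Python example below (synthetic data, scikit-learn assumed available, all numbers invented) compares a model trained on thousands of copies of the same 32-year-old example against one trained on a couple of hundred varied ages spread across the concept boundary:

```python
# A toy illustration, not a benchmark: repeated far-from-boundary examples add
# little information, while fewer but more varied examples near the boundary
# let the model locate the concept boundary (retirement around age 65).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def label_retired(age):
    return (age >= 65).astype(int)  # toy "ground truth": retired at 65+

# Held-out test set with ages spread across the whole range.
ages_test = rng.uniform(20, 90, size=2000)
X_test, y_test = ages_test.reshape(-1, 1), label_retired(ages_test)

# Training set A ("big but shallow"): one 32-year-old example repeated 10,000
# times, plus a handful of 80-year-old retirees so both classes are present.
ages_big = np.concatenate([np.full(10_000, 32.0), np.full(10, 80.0)])

# Training set B ("small but deep"): only 200 examples, spread across the range.
ages_deep = rng.uniform(20, 90, size=200)

for name, ages in [("repeated examples", ages_big), ("varied examples", ages_deep)]:
    X, y = ages.reshape(-1, 1), label_retired(ages)
    acc = accuracy_score(y_test, LogisticRegression().fit(X, y).predict(X_test))
    print(f"{name}: n={len(ages):>6,}, test accuracy={acc:.3f}")
```

The point of the toy is simply that the 10,000 duplicated rows carry no more information about where retirement begins than one row does, while 200 varied ages do.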
Four Things to Do Better
There are several things that you can do to combat the ‘dark side’ of big data and move towards a deep data mindset:
- Understand the accuracy/execution trade-off. Too often data scientists assume that more accurate models are the goal. Start your project off with explicit ROI expectations based on both accuracy and speed of deployment.
- Build every model with a random sample. If you’ve got big data, there is no reason to use all of it. With a good random sampling function, you can estimate from small samples how accurate a model built on the entire database will be. Work rapidly with small samples, then build the final model with the entire database (see the sketch after this list).
- Drop some data. If you are overwhelmed with streaming data coming in from IoT devices and other sources, feel free to be smart about throwing some data away. Maybe throw a lot of data away. You can’t buy enough disk to store it all, and it will gum up everything you are working on in later stages of your data science production line.
- Look for more data sources. Many of the recent breakthroughs in AI have come not from larger datasets but from the ability of machine learning algorithms to tap into data that was previously unavailable to them. For example, the large text, image, video and audio datasets that are commonplace today didn’t exist twenty years ago. Be constantly on the lookout for these new data opportunities.
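Below is a minimal sketch of the ‘build every model with a random sample’ idea, using scikit-learn and a synthetic dataset as a stand-in for the full database (all sizes and parameters are illustrative): train on growing random samples and watch the validation accuracy level off; once it plateaus, the remaining data is adding little.

```python
# A minimal sketch, assuming scikit-learn is available; the synthetic dataset
# stands in for "the entire database" and every number here is illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
for n in [500, 2_000, 8_000, len(X_train)]:
    idx = rng.choice(len(X_train), size=n, replace=False)  # random sample
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train[idx], y_train[idx])
    acc = accuracy_score(y_val, model.predict(X_val))
    print(f"sample size {n:>6,}: validation accuracy {acc:.3f}")
```

Each experiment on a small sample runs in seconds rather than hours, which is what makes the rapid experimentation described above possible.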
Four Things that Get Better
If you focus on deep data rather than just on big data you will enjoy many benefits. Here are some of the key ones:
- Everything will be faster. With smaller data, your data movement, experimentation, training and scoring of models will all be much faster.
- Less storage and compute required. A focus on deep data means that you will be smarter about efficiently using a smaller disk and compute footprint in the cloud. This directly translates into lower infrastructure costs. Go hire more data scientists and AI experts with the money you save!
- Less strain on IT and happier data scientists. With a deep data culture, your IT team will find itself less likely to be running errands for the data science team or having to kill runaway jobs that are soaking up all the cloud resources. Likewise, your data scientists will be happier when they can spend more of their time building and testing models rather than moving data around or waiting for long training runs to complete.
- Harder problems can be solved. Building an AI model is not a magical experience that can only be executed by wizard-like researchers; it is much more about logistics than magic. It is similar to the story of an art teacher who told half his class that their grade would be based on the number of pieces of art they produced, while the other half would be graded on the quality of their best piece. Not surprisingly, the quantity students produced the most works of art. Shockingly, they also produced the highest-quality pieces. Quantity sometimes begets quality. In our case, more models tried under the same resource constraints can mean a better best model.
Big data, and the technological breakthroughs that have supported it, has greatly advanced many companies’ drive to become data-driven in their decision-making processes. With the rise of AI and our ability to saturate even these powerful resources, we now need to be more precise about what we need from our data. Building a culture that understands deep data, and not just big data, is what is needed now.
Bio: Stephen Smith is a well-respected expert in the fields of data science, predictive analytics and their application in the education, pharmaceutical, healthcare, telecom and finance industries.