We’ve all heard that data will drastically change the future of business for the better. We’ve discussed how predictive analytics will drive sustainable business growth. But how will all this happen? What kind of data will produce these wonderful changes?
Wikipedia says, “Machine learning (ML) is a field of inquiry devoted to understanding and building methods that ‘learn’, that is, methods that leverage data to improve performance on some set of tasks.”
In other words, ML models learn from data. Without data, you’ll never have a good model. Correction: without sufficient quantity and quality of data, you’ll never have a truly powerful and useful model.
For now, I’m going to focus on data quantity and save the conversation about data quality for a later post. That means for this discussion we’ll assume we always have data of sufficient quality.
Ok, so in very simple terms, an ML model learns to make predictions from a training dataset. Really, it learns much like a newborn baby, but without all the crying and pooping.
For instance, a baby might learn that when it cries (“input data”) its caregiver picks it up (“predicted label”). Imagine how large of a training dataset is needed to teach the baby what it needs to survive in the real world. Some might consider this a lifetime of data. 😉
Seriously though, if our goal is to train an excellent model, our training dataset needs enough information to describe all the variation that might exist for the given prediction task.
With an infinite amount of data about a particular information space, we would only need a simple search algorithm to make perfect predictions for any task. Although this is something that could happen only in my wildest dreams…
Alright, so back to the real world…what does it mean to have a “sufficient quantity of data”? How much data is enough?
Imagine we want to predict housing prices. Some basic information we’d need is location, square footage, yard size, number of bedrooms and bathrooms, and year built. Perhaps we can get a more accurate price with more detailed information. So, we add utilities, house condition, nearby schools, walking score, and whether any celebrities live nearby.
Every piece of information we add is a feature of the task. Each feature will have its own value range. So as the number of features increases, the amount of data needed increases dramatically.
To cover the possible variations of all features across their full range of values, we need more rows of data. As the task becomes more complex, the matrix of rows and columns grows.
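To make that concrete, here's a back-of-envelope sketch in Python. The 10-values-per-feature figure is a made-up assumption purely for illustration; real features like square footage have far wider ranges.

```python
# Back-of-envelope: how fast the feature space grows as we add features.
# Hypothetical assumption: each feature takes one of 10 distinct values.
VALUES_PER_FEATURE = 10

def combinations(num_features: int, values: int = VALUES_PER_FEATURE) -> int:
    """Number of distinct feature combinations for a given column count."""
    return values ** num_features

# 5 basic features: location, square footage, yard size, beds/baths, year built
print(combinations(5))   # 100000 possible distinct rows

# 10 features after adding utilities, condition, schools, walk score, celebrities
print(combinations(10))  # 10000000000 possible distinct rows
```

Doubling the number of columns didn't double the space to cover; it multiplied it a hundred-thousand-fold. That exponential blowup is why each added feature demands disproportionately more rows.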
This is the curse of dimensionality. Or at least part of it.
The more dimensions we have, the more data we need. But the curse is not that straightforward. To handle a high-dimensional dataset, we need a powerful model.
For a very simple task and dataset, a linear regression might be enough. However, a complex task with a very high-dimensional dataset will require more nonlinearity.
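Here's a minimal sketch of that idea with toy data (invented for illustration, not from any real dataset): a straight line underfits a quadratic relationship, but fitting against an engineered nonlinear feature captures it.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return a, my - a * mx  # slope, intercept

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # a truly quadratic relationship

# Fit a plain line: it can't bend, so it misses the curve entirely.
a, b = fit_line(xs, ys)
linear_error = sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

# Fit a line in the engineered nonlinear feature x**2 instead.
a2, b2 = fit_line([x * x for x in xs], ys)
nonlinear_error = sum((a2 * x * x + b2 - y) ** 2 for x, y in zip(xs, ys))

print(linear_error, nonlinear_error)  # the nonlinear fit's error is far lower
```

The hand-rolled least squares here stands in for any linear model; the point is that nonlinearity has to come from somewhere, whether an engineered feature like this or a neural network that learns such transformations itself.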
Deep learning, which builds models from multilayer neural networks, is one solution here. The field has developed very fast in the past decade, producing some incredibly large models.
We use the number of parameters to describe the complexity and power of a model. For instance, AlexNet, the model behind the 2012 ImageNet classification breakthrough, had 60 million parameters. In 2020, the GPT-3 model had 175 billion parameters! And of course, there are other models with even more parameters.
However, this power jump is not free. The neural network model is super greedy when it comes to how much data it consumes. This is especially true for very large models.
Every parameter in the model needs to be optimized using the available data. The 2012 ImageNet model was trained on roughly 1.2 million labeled images. And GPT-3 used 45 terabytes of text data.
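A rough bit of arithmetic with the figures above puts that appetite in perspective. Treat the 45 TB as raw-text bytes; this is a ballpark, not a statement about how the corpus was actually filtered or tokenized.

```python
# Rough back-of-envelope: GPT-3's training corpus size relative to its parameter count.
params = 175e9        # 175 billion parameters
corpus_bytes = 45e12  # 45 terabytes of raw text

bytes_per_param = corpus_bytes / params
print(round(bytes_per_param))  # 257 -- a few hundred bytes of raw text per parameter
```

In other words, even tens of terabytes of text amounts to only a couple hundred bytes per parameter, which is one intuition for why these models are so hungry for data.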
So, in summary, to adequately describe a complex task, we need a lot of data with many rows and many columns. And to accurately understand this high-dimensional dataset, we need a very powerful model. Then to train this model to attain acceptable performance, we need even more data.
As I said, these models are greedy! But this is the reality of the machine learning world.
So again, back to the question at hand. How much data is a “sufficient quantity of data”? I’m afraid I can’t give you a clear answer. Though I very much want to…
Because, you know, the real world. While more data might be helpful for creating a powerful model, it also costs more. So there are tradeoffs.
Collecting, analyzing, and maintaining large-scale datasets is expensive and time-consuming. Data scientists and machine learning engineers spend 40% to 60% of their time working with data. And possibly more if the data is complex and dirty.
Ultimately, choosing the best approach comes down to your data strategy. As I’ve mentioned, the right data strategy depends entirely on which option best meets your business needs.
For instance, sometimes more data will easily solve a problem without spending too much time on modeling. But sometimes—because of limits on budget or time—we must use a small amount of data to create a model.
Check out Tienan's next post:
The Value of Quality Data