Occam’s razor and the law of diminishing data returns

December 12, 2023

Most people have heard of the law of diminishing returns, but many don’t know what it means. Wikipedia defines diminishing returns as:

the decrease in the marginal (incremental) output of a production process as the amount of a single factor of production is incrementally increased, while the amounts of all other factors of production stay constant.

In layman’s terms: nothing scales forever, and it is possible to have too much of a good thing. In the business world, the law of diminishing returns looks something like this: as you put more effort into an initiative, build a bigger factory, or work longer hours, the results stop growing in proportion. Initiatives begin to fail, and eventually you hit a wall.

What does this mean for data in your operations? The same pattern applies: there is a limit to how much additional insight you can get by adding more data to your analyses or reports. This is counterintuitive, so it’s worth repeating: adding more data might not improve your organization’s data analytics. In fact, it might actively undermine their usefulness and accuracy.

The question then is, how do you know when you have enough data to perform useful, robust analyses, but not too much?

Born and raised in a world of data, we at 3AG have some ideas on how to deal with this tricky balancing act. And we will spend the rest of this article discussing how you can be sure you have just enough data to succeed.

Overfitting, underfitting and ill-fitting

In the world of KPIs and data analysis, there are 2 critical concepts worth understanding and then avoiding: overfitting and underfitting.

[Figure: overfitting vs. underfitting]

Overfitting occurs when an analysis “corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.” In other words, you’ve created a metric, KPI, model or some other mathematical relationship that too closely resembles your original data set.

This may not seem like a problem; having the “best fit” for your data sounds positive. However, by overfitting you might also incorporate things like noise into your model, which can skew results in all sorts of inconvenient and unhelpful ways. (“Noisy data” is data that has become distorted, corrupted or inaccurate.)

Every measurement you look at could be pushed up or down by noise, which is exactly why averaging values is more effective than trusting each point individually. Since you can’t tell whether any specific data point is too high or too low, a model that fits every point exactly bakes that noise into the model itself. Why does this matter? Statistical measures of fit tend to look very good on the data an overfitted model was built from (regardless of the model’s actual quality), leading to misplaced confidence in its accuracy.
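
To make this concrete, here is a rough sketch in Python using synthetic numbers (not any real production data): a deliberately flexible model chases the noise in the points it was fitted to, so its fit statistic looks flattering there and typically degrades on measurements it has never seen.

```python
# A rough sketch of overfitting on synthetic data (not any real production
# data set). A very flexible model chases the noise in the points it was
# fitted to, so its fit statistic looks flattering there and typically
# degrades on measurements it has never seen.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40).reshape(-1, 1)
y = 3 * x.ravel() + 1 + rng.normal(scale=0.4, size=40)  # simple trend + measurement noise

# Hold back every other point to check each fit on data it has never seen.
x_train, x_test = x[::2], x[1::2]
y_train, y_test = y[::2], y[1::2]

for degree in (1, 12):  # a simple straight-line fit vs. a wildly flexible one
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(f"degree {degree:2d}: "
          f"R2 on training data = {r2_score(y_train, model.predict(x_train)):.3f}, "
          f"R2 on held-out data = {r2_score(y_test, model.predict(x_test)):.3f}")
# The flexible fit scores better on the points it memorized (noise included)
# than on the held-out ones; the simple fit tends to hold up on both.
```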

Conversely, underfitting occurs when your model is too simple: the approximation fails to capture enough of the structure actually present in the data.
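
Here is the mirror image, again as a rough sketch on invented data: a straight-line model fitted to a clearly curved process scores poorly even on the data it was built from, because it is too simple to express the relationship.

```python
# A rough sketch of underfitting, again on invented data: the process follows
# a curve, but the model we fit is a straight line, so it misses structure
# the data clearly contains.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * x.ravel()) + rng.normal(scale=0.1, size=40)  # curved process + a little noise

for degree, label in ((1, "straight line (underfit)"), (4, "modest curve")):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)
    print(f"{label}: R2 = {r2_score(y, model.predict(x)):.3f}")
# The straight line scores poorly even on the data it was fitted to: it is too
# simple to express the relationship, which is the signature of underfitting.
```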

While both overfitting and underfitting are problematic, they at least speak to concerted efforts to properly develop a model. The problem is, good intentions don’t necessarily produce good results; it’s unfortunately very common for companies to jump to incorrect conclusions when looking at such mishandled data.

Company X case study

Imagine examining production numbers generated from different shifts on the shop floor. Let’s say Company X has 3 crews each working 4 shifts per week, Monday to Saturday. Crew A is the top performer by a wide margin, followed by Crews B and C. Performance quality tends to be at its worst during the Friday and Saturday shifts.

The executive team constantly argues about the underlying cause of these inconsistencies. The CEO is certain Crew A is doing something better than the other crews; if so, a solution might be to have Crews B and C copy Crew A’s methods and approaches. The COO argues that the issue is instead the predictable phenomenon of performance dropping off as the weekend approaches.

Contributing further to this corporate melee, the site GM promotes his QA manager’s theory. She has noticed that Crews B and C don’t perform consistently badly, and that they spend more time on maintenance than Crew A does; she posits this extra maintenance as the cause of B and C’s lower production output. If she’s right, Crews B and C are the reason the equipment is in such good working order when Crew A takes over on Mondays, which in turn is the main reason Crew A can be so productive.

Who’s right? Unless a camp can back up its hypothesis with statistical modeling that mathematically demonstrates the correlation between the proposed cause and effect, none of them can claim to be. The data may appear to support all 3 explanations at once, yet one or more of them may be nothing more than coincidence.
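
As an illustration of the kind of test each camp would need, here is one possible approach: a one-way ANOVA asking whether the gap between crews is larger than chance alone would produce. The production figures below are invented for the example; they are not Company X’s data, and ANOVA is only one of several tests that could do the job.

```python
# A hedged illustration of the kind of test each camp needs: a one-way ANOVA
# asking whether the gap between crews is bigger than chance alone would
# produce. These daily production figures are invented for the example and
# are not Company X's actual data.
from scipy import stats

crew_a = [102, 98, 110, 105, 99, 107, 103, 101]   # hypothetical daily output, Crew A
crew_b = [96, 94, 101, 99, 90, 104, 97, 95]       # hypothetical daily output, Crew B
crew_c = [93, 97, 95, 100, 92, 96, 98, 91]        # hypothetical daily output, Crew C

f_stat, p_value = stats.f_oneway(crew_a, crew_b, crew_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (say, below 0.05) suggests the difference between crews is
# unlikely to be coincidence; a large one means the "Crew A is just better"
# story is not yet supported by the numbers, however plausible it sounds.
```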

It’s entirely possible for anyone to come up with a plausible explanation for any data as presented. As humans, we crave explanations and strive to find patterns in all manner of chaos. A famous 1978 study showed that merely adding the word “because,” followed by a reason, to a request aimed at getting people to perform a specific task resulted in significantly more compliance.

People are more likely to believe something when given reasons to do so; and too often, the mere fact that a reason exists is more powerful than the quality of that reason. This is bad news for organizations relying on data accuracy. And this effect is amplified by the relative power of the source of the reason: Applying this to our earlier example, the CEO’s assessment of Crew A being superior probably carries more weight than the QA manager’s hypothesis, though the latter is totally feasible.

There are 2 points arising from this example:

  1. More data can help distinguish legitimate explanations from those that merely carry the weight of whoever proposed them.
  2. Statistical measurements are critical to determine whether or not data is relevant.

So more data is better?

More data is always going to be better insofar as it provides opportunities to work with additional relevant data sources; but that’s the only reason more data is better. How, then, can you tell whether or not the data you have is relevant? The answer goes back to model fitting: By applying statistical measurements to potential source data, you can quantitatively tell whether it’s worth your time to use it in the first place.

Think about this for a minute. Adding data that has been pre-analyzed to confirm it is relevant is a good thing. Adding data that has no correlation to the underlying phenomenon, on the other hand, is at best unhelpful and at worst leads you to measurably wrong conclusions or to hypotheses that sound good but rest on spurious connections. Nearly every conspiracy theory becomes more convincing the more data points it offers as proof; similarly, sheer data volume, divorced from pre-analysis and from proper collection and storage, is never going to work in your organization’s favor.
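
What might that pre-analysis look like? One simple, hedged example: before folding a candidate data source into a report, check how strongly it actually tracks the metric you care about. The series names and numbers below are made up purely for illustration.

```python
# A sketch of pre-analysis before adding a data source to a report: check how
# strongly the candidate series tracks the metric you already care about.
# The series names and numbers are made up purely for illustration.
from scipy import stats

daily_output   = [101, 97, 104, 99, 108, 95, 103, 100, 106, 98]                 # the metric we already track
candidate_temp = [18.2, 19.1, 17.5, 18.8, 17.9, 18.4, 19.0, 17.6, 18.1, 18.7]   # a proposed new input

r, p_value = stats.pearsonr(daily_output, candidate_temp)
print(f"correlation r = {r:.2f}, p = {p_value:.4f}")
# A weak correlation or a large p-value is a cue that this source would add
# noise rather than insight -- better to leave it out than to let it feed a
# spurious explanation.
```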

Then where should I be measuring the quality of my data?

Ideally, you should be analyzing data quality everywhere you have it. Knowing there are ways to measure data quality before using it, now is the time to abandon the old-school garbage-in/garbage-out approach to data management.

What does this mean in practice? At minimum, you should apply statistical rigor to every model you develop, especially for those used for predictive purposes of any sort. This means it’s best practice to pre-analyze any financial model, any maintenance model (predictive or preventative), and of course any digital simulation or digital twin of your operation.
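
One common way to build that rigor into a predictive model (a sketch, not a prescription) is k-fold cross-validation, which scores the model only on data it was not fitted to, so an overfitted model can’t hide behind a flattering in-sample statistic. The synthetic inputs below stand in for whatever variables your model actually uses.

```python
# A minimal sketch of routine rigor for a predictive model: k-fold
# cross-validation scores the model only on data it was not fitted to, so an
# overfitted model cannot hide behind a flattering in-sample statistic.
# The synthetic inputs and target here stand in for whatever your model uses.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))                                        # five stand-in input variables
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)  # target driven by two of them

scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print("out-of-sample R2 per fold:", np.round(scores, 3))
print(f"mean = {scores.mean():.3f}")
# Report the cross-validated score, not the in-sample one, before anyone
# bases a decision on the model.
```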

Next, you should apply the same statistical rigor to any analysis or investigation. Want to determine whether there is a relationship between shift performance and shift supervisor? Don’t simply plot a graph and make a snap decision. Apply some statistics to determine exactly how correlated these 2 variables are, and whether the relationship is solid or easily dismissed as randomness or noise.
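
For example, rather than eyeballing a chart, a simple permutation test asks how often a performance gap at least as large as the observed one would appear if the grouping (here, supervisor) made no difference at all. The shift figures below are hypothetical.

```python
# Instead of eyeballing a chart: a simple permutation test asking how often a
# performance gap at least this large would appear if the supervisor made no
# difference at all. The shift figures below are hypothetical.
import numpy as np

rng = np.random.default_rng(3)
supervisor_1 = np.array([104, 99, 107, 102, 98, 105])   # hypothetical output per shift
supervisor_2 = np.array([97, 95, 101, 94, 99, 96])

observed_gap = supervisor_1.mean() - supervisor_2.mean()
pooled = np.concatenate([supervisor_1, supervisor_2])

n_shuffles = 10_000
count = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)                  # pretend supervisor labels don't matter
    gap = shuffled[:6].mean() - shuffled[6:].mean()
    if abs(gap) >= abs(observed_gap):
        count += 1

print(f"observed gap = {observed_gap:.1f}, "
      f"chance of a gap this large by luck = {count / n_shuffles:.3f}")
# If that probability is high, the "supervisor effect" is easily dismissed as
# randomness; if it is very low, the relationship deserves a closer look.
```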

Finally, you should apply statistical rigor to all your measured data. While such an exacting approach might not warrant fully discarding questionable data, it should at least clearly identify data of dubious quality—before company stakeholders blindly apply it to their hypotheses.
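
In practice, that can be as simple as flagging measurements that sit far outside the typical spread, so stakeholders see the caveat before relying on them. The sketch below uses a robust z-score with an illustrative cutoff; the readings are invented.

```python
# A sketch of flagging questionable measurements instead of silently trusting
# (or silently deleting) them. The readings and the cutoff are illustrative.
import numpy as np

readings = np.array([101.2, 99.8, 100.5, 98.9, 100.1, 250.0, 99.5, 100.9, 0.0, 100.3])

median = np.median(readings)
mad = np.median(np.abs(readings - median))      # robust measure of spread
robust_z = 0.6745 * (readings - median) / mad   # roughly "standard deviations from the median"

suspect = np.abs(robust_z) > 3.5                # a common rule-of-thumb cutoff
for value, flag in zip(readings, suspect):
    print(f"{value:7.1f}  {'<-- review before using' if flag else ''}")
# Flagged points are not thrown away; they are surfaced so stakeholders see
# the caveat before building a hypothesis on top of them.
```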

Recognizing the diminishing returns resulting from looking at data in bulk regardless of quality will help you focus on the specific, important indicators that apply to your business (and hopefully keep conspiracy theories from developing in the virtual lunch room). 3AG exists to help organizations parse their abundant and invaluable data so it works in everyone’s favor, from those on the shop floor up to the C-suite. Let’s discuss how we can do that for your organization.

Looking to learn more about data engineering? Check out our Guide to Data Engineering with helpful resources on this topic.

Ready to Transform Your Data Management?

Speak to Our Experts

Connect with a 3AG Systems expert today and start your journey towards efficient and effective data management.
