In the early days of the pandemic, when the big COVID-19 dashboards were filled with graphs and upward-sloping lines, the impression was that we had the data under control. Numbers were being tracked, plotted, and forecast. But that wasn’t the reality.
The data was messy: some locations were reporting more slowly than others; some used different testing methodologies; sometimes whole data points were missing. The data looked clean on the charts, but anyone who had to dig into it knew it wasn’t. I learned a lot about AI that year.
The first lesson is that data is often messy in a crisis
In any crisis, the primary concern is not with data collection. Healthcare workers are scrambling to save lives and governments are scrambling to make decisions. It takes time to ramp up any kind of data reporting, and the result is a messy hodgepodge of different locations, different methodologies, and different frequencies.
Throw in changing rules and definitions (what is a “case” anyway?), not to mention human error, and you’ve got a dataset that’s far from tidy. It’s not anyone’s fault. It’s just the nature of a fast-moving world colliding with processes that aren’t designed to keep up.
The second lesson is that missing values are hard to deal with
One of the biggest problems with COVID-19 data wasn’t that it was wrong. It was that much of it was missing. Different locations had very different capacities to report. In some cases, whole regions couldn’t test. Sometimes data was missing simply because it was late.
And sometimes, data was just never collected at all. When you see a zero in a data set, is that a true zero, or is it just the absence of data? This is a critical question for analysts and for AI systems. A missing value is just an empty cell in a spreadsheet, but it represents something very real: a gap in the narrative.
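The zero-versus-missing distinction has concrete consequences. A minimal sketch in Python (using pandas, with made-up daily case counts) shows how treating a missing report as a zero quietly distorts even a simple average:

```python
import numpy as np
import pandas as pd

# Hypothetical daily case counts: Tuesday truly had zero cases,
# but Wednesday's report never arrived (NaN, not zero).
cases = pd.Series([12, 0, np.nan, 9],
                  index=["Mon", "Tue", "Wed", "Thu"])

# Treating the gap as a zero drags the average down...
naive_mean = cases.fillna(0).mean()   # (12 + 0 + 0 + 9) / 4 = 5.25

# ...while pandas skips missing values by default, averaging
# only the days that actually reported.
honest_mean = cases.mean()            # (12 + 0 + 9) / 3 = 7.0

print(naive_mean, honest_mean)
```

Neither number is "the truth," but the second at least acknowledges that Wednesday is a gap rather than a quiet day.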
Lessons Learned: Noisy COVID-19 Data and the Limits of AI
AI got a big dose of real-world reality during COVID-19, and in some ways, it came up a little short. Predictive models that performed beautifully in lab conditions were suddenly confronted with reporting lags, inconsistent definitions, and discrepancies based on who was doing the counting. And in some cases, the forecasts weren’t quite as accurate as expected, not because of the AI, but because of the data going into it. As we’re constantly reminded, AI recognizes patterns in the data it’s trained on, and if the data is noisy, the results will be, too.
This brought to light something AI developers probably already knew, but hadn’t experienced on a global scale: that training AI isn’t just about the AI itself, but the data it’s trained on. And if that data is incomplete or biased, the AI will be, too. In a way, the noise of the COVID-19 data taught us that better AI starts with asking better questions about the data we trust.

Data Cleaning, Labeling, and Validation
As soon as the COVID-19 data started coming in, it became clear that raw data wasn’t quite ready for prime time. Data points needed scrubbing, categories needed standardized labels, and anomalies needed checking before anyone could trust the insights coming out of AI models.
Think of it like a recipe: you can have the greatest recipe in the world, but if your ingredients are dirty or mislabeled, the finished product isn’t going to taste quite right. Data is the same way.
That’s why data cleaning and validation became essential to AI development. Were there duplicate entries in the data? Were different states or countries using the same criteria when counting cases?
Were missing values accounted for rather than simply being filled in? Those kinds of questions may not be as sexy as AI algorithms, but they’re where good AI starts. In my opinion, COVID-19 reinforced that good AI isn’t just about good code, it’s about the more mundane and often painstaking work done on the data behind it.
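Those checks are mundane but mechanical, which means they can be scripted. Here is a hedged sketch (in pandas, on a hypothetical table of regional reports) of the three checks the paragraph names: standardizing labels, dropping duplicates, and accounting for missing values instead of silently filling them:

```python
import pandas as pd

# Hypothetical regional reports with the kinds of problems described
# above: duplicate rows, inconsistent labels, and a missing value.
raw = pd.DataFrame({
    "region": ["North", "north ", "South", "South", "East"],
    "cases":  [100, 100, 55, 55, None],
})

# 1. Standardize labels before comparing across reporters
#    ("north " and "North" are the same place).
raw["region"] = raw["region"].str.strip().str.title()

# 2. Drop exact duplicate entries.
clean = raw.drop_duplicates()

# 3. Count missing values explicitly rather than filling them in.
missing = clean["cases"].isna().sum()

print(len(clean), missing)  # 3 distinct rows, 1 missing report
```

None of this is glamorous, but each line answers one of the validation questions before a model ever sees the data.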
What We Can Do Instead
What does better AI training look like? It starts by asking better questions about data. Where did the data come from? Is it representative or skewed by region or demographics? Are there any data blind spots that will subtly skew the outcome?
These are questions that data scientists must grapple with long before training begins. It’s not sexy, but it’s a crucial step toward producing reliable AI models instead of those that are confidently wrong.
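One of those questions, representativeness, can be checked with a few lines before training begins. A minimal sketch (pure Python, with invented region tags and an arbitrary 80% threshold) that flags a data set dominated by one group:

```python
from collections import Counter

# Hypothetical training records tagged by region; a quick coverage
# check reveals whether one region dominates before any model sees it.
records = ["urban"] * 90 + ["rural"] * 10

coverage = Counter(records)
shares = {region: n / len(records) for region, n in coverage.items()}

# Flag the data set if any single group exceeds an (assumed) 80% share.
skewed = max(shares.values()) > 0.8

print(shares, skewed)  # urban dominates: this data has a blind spot
```

The threshold is a judgment call, not a rule; the point is that the check runs before training, when the blind spot is still cheap to fix.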
Better AI also recognizes that data isn’t always perfect, especially during a pandemic. Rather than assuming data is complete and pristine, better models are designed to account for uncertainty, alert users to anomalies, and bring in human judgment when needed.
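What does "accounting for uncertainty" look like in code? One hedged sketch, with a made-up helper and a deliberately crude interval, is a forecast function that returns a range instead of a point estimate and hands suspicious inputs to a human rather than modeling them blindly:

```python
import statistics

# Hypothetical forecast helper: reports a range, not a single number,
# and flags inputs that look anomalous instead of trusting them.
def forecast_with_caveats(recent_counts, spike_factor=3.0):
    mean = statistics.mean(recent_counts)
    stdev = statistics.stdev(recent_counts)

    # Anything several times the mean may be a reporting artifact
    # (a data dump, a definition change) rather than a real spike.
    anomalies = [c for c in recent_counts if c > spike_factor * mean]

    # A crude interval: estimate plus or minus two standard deviations.
    return {
        "estimate": mean,
        "interval": (mean - 2 * stdev, mean + 2 * stdev),
        "anomalies": anomalies,  # surface these for human review
    }

result = forecast_with_caveats([10, 12, 11, 95, 13])
print(result["anomalies"])  # the 95 is flagged, not silently absorbed
```

This isn’t a real epidemiological model; it’s the shape of the design choice the paragraph describes: wide intervals, loud anomalies, and a human in the loop.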
In my view, this combination of rigorous data hygiene, well-designed models, and a dash of data skepticism is what differentiates AI that’s impressive in a demo from AI that delivers in the wild.
Conclusion: Better Models Start With Better Data
The pandemic was a wake-up call for the tech community. AI is a powerful tool, but it won’t correct for bad data. When data is incomplete, inconsistent, or biased, even the most advanced AI models will spit out seemingly authoritative answers that don’t reflect reality. It was a reminder for data scientists, analysts, and businesses that the performance of an AI model is only as good as the data that trains it.
As we move forward, we need to remember that better AI isn’t about building bigger models; it’s about paying attention to the inputs. Is the data clean? Is it representative? Does it reflect the nuance of the real world rather than an oversimplified abstraction?
These are simple questions, but they’re where good AI starts. In many ways, the pandemic was a global master class in data humility: Sometimes the best way to build a better algorithm is to give it better data to chew on in the first place.