Editor’s note: this article was originally published on the Iteratively blog on December 14, 2020.
At the end of the day, your data analytics needs to be tested like any other code. If you don’t validate this code, and the data it generates, it can be costly (like $9.7-million-per-year costly, according to Gartner).
To avoid this fate, companies and their engineers can leverage a number of proactive and reactive data validation techniques. We strongly recommend the former, as we’ll explain below. A proactive approach to data validation will help companies ensure that the data they have is clean and ready to work with.
Reactive vs. proactive data validation techniques: Solve data issues before they become a problem
“An ounce of prevention is worth a pound of cure.” It’s an old saying that’s true in almost any situation, including data validation techniques for analytics. Another way to say it is that it’s better to be proactive than it is to be reactive.
The purpose of any data validation is to identify where data might be inaccurate, inconsistent, incomplete, or even missing.
By definition, reactive data validation takes place after the fact and uses anomaly detection to identify any issues your data may have and to help ease the symptoms of bad data. While these methods are better than nothing, they don’t solve the core problems causing the bad data in the first place.
Instead, we believe teams should try to embrace proactive data validation techniques for their analytics, such as type safety and schematization, to ensure the data they get is accurate, complete, and in the expected structure (and that future team members don’t have to wrestle with bad analytics code).
While it might seem obvious to choose the more comprehensive validation approach, many teams end up using reactive data validation. This can be for a number of reasons. Often, analytics code is an afterthought for non-data teams and is therefore left untested.
It’s also common, unfortunately, for data to be processed without any validation. In addition, poor analytics code only gets noticed when it’s really bad, usually weeks later when someone notices a report is egregiously wrong or even missing.
While reactive methods may help you treat your data woes (often with objectively great tooling), they won’t fix the root cause of your bad data, such as piecemeal data governance or analytics that are implemented project by project without cross-team communication. That leaves you coming back to them every time.
Reactive data validation alone is not sufficient; you need to employ proactive data validation techniques in order to be truly effective and avoid the costly problems mentioned earlier. Here’s why:
- Data is a team sport. It’s not just up to one department or one individual to ensure your data is clean. It takes everyone working together to ensure high-quality data and solve problems before they happen.
- Data validation should be part of the Software Development Life Cycle (SDLC). When you integrate it into your SDLC, alongside your existing test-driven development and automated QA process (instead of adding it as an afterthought), you save time by preventing data issues rather than troubleshooting them later.
- Proactive data validation can be integrated into your existing tools and CI/CD pipelines. This is easy for your development teams because they’re already invested in test automation and can now quickly extend it to add coverage for analytics as well.
- Proactive data validation testing is one of the best ways fast-moving teams can operate efficiently. It ensures they can iterate quickly and avoid data drift and other downstream issues.
- Proactive data validation gives you the confidence to change and update your code as needed while minimizing the number of bugs you’ll have to squash later on. This proactive process ensures you and your team are only changing the code that’s directly related to the data you’re concerned with.
Now that we’ve established why proactive data validation is important, the next question is: How do you do it? What are the tools and methods teams employ to ensure their data is good before problems arise?
Let’s dive in.
Methods of data validation
Data validation isn’t just one step that happens at a specific point. It can happen at multiple points in the data lifecycle—at the client, at the server, in the pipeline, or in the warehouse itself.
It’s actually very similar to software testing writ large in a lot of ways. There is, however, one key difference. You aren’t testing the outputs alone; you’re also confirming that the inputs of your data are correct.
Let’s take a look at what data validation looks like at each location, examining which are reactive and which are proactive.
Data validation techniques in the client
You can use tools like Amplitude Data to leverage type safety, unit testing, and linting (static code analysis) for client-side data validation.
Now, this is a great jumping-off point, but it’s important to understand what kind of testing this sort of tool is enabling you to do at this layer. Here’s a breakdown:
- Type safety is when the compiler validates the data types and implementation instructions at the source, preventing downstream errors because of typos or unexpected variables.
- Unit testing is when you test a specific selection of code in isolation. Unfortunately, most teams don’t include their analytics code in their unit tests.
- A/B testing is when you test your analytics flow against a golden-state set of data (a version of your analytics that you know was perfect) or a copy of your production data. This helps you figure out if the changes you’re making are good and an improvement on the existing situation.
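To make the first two ideas concrete, here is a minimal Python sketch of a typed analytics event together with a unit test for the tracking call. It is not how Amplitude Data or any other tool implements this; the `SongPlayed` event, its fields, and the `track_song_played` helper are hypothetical names invented for illustration. The assumption is that a static type checker such as mypy runs on the code, so a typo like `source="serach"` is flagged before anything ships.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Literal


@dataclass(frozen=True)
class SongPlayed:
    """Hypothetical typed event; the name and fields are illustrative only."""
    song_id: str
    duration_seconds: int
    # A type checker rejects any value outside this closed set.
    source: Literal["playlist", "search", "radio"]


def track_song_played(send: Callable[[str, dict], None], event: SongPlayed) -> None:
    """Send the event through an injected transport so tests can capture it."""
    send("Song Played", asdict(event))


def test_song_played_is_tracked() -> None:
    """Unit test: the tracking call emits exactly one well-formed event."""
    captured = []
    track_song_played(
        lambda name, props: captured.append((name, props)),
        SongPlayed(song_id="abc123", duration_seconds=215, source="search"),
    )
    assert captured == [
        ("Song Played",
         {"song_id": "abc123", "duration_seconds": 215, "source": "search"})
    ]


test_song_played_is_tracked()
```

Injecting the `send` function keeps the test free of network calls, which is what makes it cheap enough to run alongside the rest of a unit test suite.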
Data validation techniques in the pipeline
Data validation in the pipeline is all about making sure that the data being sent by the client matches the data format in your warehouse. If the two aren’t on the same page, your data consumers (product managers, data analysts, etc.) aren’t going to get useful information on the other side.
Data validation methods in the pipeline may look like this:
- Schema validation to ensure your event tracking matches what has been defined in your schema registry.
- Integration and component testing via relational, unique, and surrogate key utility tests in a tool like dbt to make sure tracking between platforms works well.
- Freshness testing via a tool like dbt to determine how “fresh” your source data is (aka how up-to-date and healthy it is).
- Distributional tests with a tool like Great Expectations to get alerts when datasets or samples don’t match the expected inputs and make sure that changes made to your tracking don’t mess up existing data streams.
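As a rough illustration of the schema validation step, the sketch below checks an incoming event against a small in-memory registry. The `SCHEMA_REGISTRY` contents and the `validate_event` function are assumptions made for this example; a real schema registry would typically be a separate service or a versioned file, and the checks would usually be richer than a simple type comparison.

```python
from typing import Any

# Hypothetical registry: event name -> required property names and types.
SCHEMA_REGISTRY: dict[str, dict[str, type]] = {
    "Song Played": {"song_id": str, "duration_seconds": int},
}


def validate_event(name: str, properties: dict[str, Any]) -> list[str]:
    """Return human-readable violations; an empty list means the event is valid."""
    schema = SCHEMA_REGISTRY.get(name)
    if schema is None:
        return [f"unknown event: {name!r}"]
    errors = []
    for field, expected in schema.items():
        if field not in properties:
            errors.append(f"missing property: {field}")
        elif not isinstance(properties[field], expected):
            actual = type(properties[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Properties that the registry does not know about are also reported.
    for field in sorted(properties.keys() - schema.keys()):
        errors.append(f"unexpected property: {field}")
    return errors


print(validate_event("Song Played", {"song_id": "abc123", "duration_seconds": "215"}))
# ['duration_seconds: expected int, got str']
```

Returning a list of violations instead of raising on the first one lets a pipeline decide whether to reject the event, quarantine it, or just alert.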
Data validation techniques in the warehouse
You can use dbt testing, Dataform testing, and Great Expectations to ensure that data being sent to your warehouse conforms to the conventions you expect and need. You can also do transformations at this layer, including type checking and type safety within those transformations, but we wouldn’t recommend this method as your primary validation technique since it’s reactive.
At this point, the validation methods available to teams include validating that the data conforms to certain conventions, then transforming it to match them. Teams can also use relationship and freshness tests with dbt, as well as value/range testing using Great Expectations.
All of this tool functionality comes down to a few key data validation techniques at this layer:
- Schematization to make sure CRUD data and transformations conform to set conventions.
- Security and privacy testing to ensure data complies with requirements like GDPR.
- Relationship testing in tools like dbt to make sure fields in one model map to fields in a given table (aka referential integrity).
- Freshness and distribution testing (as we mentioned in the pipeline section).
- Range and type checking that confirms the data being sent from the client is within the warehouse’s expected range or format.
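The relationship, range, and freshness checks above each boil down to a query that should return no offending rows. The sketch below shows that idea using Python’s built-in `sqlite3` module as a stand-in warehouse. It does not use dbt or Great Expectations; the table names, columns, sample rows, and thresholds are all hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# In-memory stand-in for a warehouse. Event 2 references a missing user,
# and event 3 has an impossible negative duration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id TEXT PRIMARY KEY);
    CREATE TABLE events (
        event_id INTEGER, user_id TEXT, duration_seconds INTEGER, loaded_at TEXT
    );
    INSERT INTO users VALUES ('u1'), ('u2');
    INSERT INTO events VALUES
        (1, 'u1', 215, '2020-12-14T09:00:00+00:00'),
        (2, 'u3', 180, '2020-12-14T09:05:00+00:00'),
        (3, 'u2', -5,  '2020-12-14T09:10:00+00:00');
""")


def orphaned_events(conn):
    """Relationship test: events whose user_id has no match in users."""
    rows = conn.execute("""
        SELECT e.event_id FROM events e
        LEFT JOIN users u ON u.user_id = e.user_id
        WHERE u.user_id IS NULL
        ORDER BY e.event_id
    """).fetchall()
    return [row[0] for row in rows]


def out_of_range_events(conn, low=0, high=86_400):
    """Range test: durations outside [low, high] seconds."""
    rows = conn.execute(
        "SELECT event_id FROM events "
        "WHERE duration_seconds NOT BETWEEN ? AND ? ORDER BY event_id",
        (low, high),
    ).fetchall()
    return [row[0] for row in rows]


def is_fresh(conn, now, max_age):
    """Freshness test: the newest row must be no older than max_age."""
    (latest,) = conn.execute("SELECT MAX(loaded_at) FROM events").fetchone()
    return now - datetime.fromisoformat(latest) <= max_age


now = datetime(2020, 12, 14, 10, 0, tzinfo=timezone.utc)
print(orphaned_events(conn))                      # [2]
print(out_of_range_events(conn))                  # [3]
print(is_fresh(conn, now, timedelta(hours=1)))    # True
```

Because these checks only run once the data has landed, they catch problems after the fact, which is why they work best as a backstop for the client and pipeline checks rather than as a replacement for them.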
A great example of many of these tests in action can be found by digging into Lyft’s discovery and metadata engine Amundsen. This tool lets data consumers at the company search user metadata to increase both its usability and security. Lyft’s main method of ensuring data quality and usability is a kind of versioning via a graph-cleansing Airflow task that deletes old, duplicate data when new data is added to their warehouse.
Why now is the time to embrace better data validation techniques
In the past, data teams struggled with data validation because their organizations didn’t realize the importance of data hygiene and governance. That’s not the world we live in anymore.
Companies have come to realize that data quality is critical. Just cleaning up bad data in a reactive manner isn’t good enough. Hiring teams of data engineers to clean up the data through transformation or writing endless SQL queries is an unnecessary and inefficient use of time and money.
It used to be acceptable to have data that was 80% accurate (give or take, depending on the use case), leaving a 20% margin of error. That might be fine for simple analysis, but it’s not good enough for powering a product recommendation engine, detecting anomalies, or making critical business or product decisions.
Companies hire engineers to create products and do great work. If they have to spend time dealing with bad data, they’re not making the most of their time. But data validation gives them that time back to focus on what they do best: creating value for the organization.
The good news is that high-quality data is within reach. To achieve it, companies need to help everyone understand its value by breaking down the silos between data producers and data consumers. Then, companies should throw away the spreadsheets and apply better engineering practices to their analytics, such as versioning and schematization. Finally, they should make sure data best practices are followed throughout the organization with a plan for tracking and data governance.
Invest in proactive analytics validation to earn data dividends
In today’s world, reactive, implicit data validation tools and methods are just not enough anymore. They cost you time, money, and, perhaps most importantly, trust.
To avoid this fate, embrace a philosophy of proactivity. Identify issues before they become expensive problems by validating your analytics data from the beginning and throughout the software development life cycle.