Editor’s note: this article was originally published on the Iteratively blog on February 8, 2021.
At some point, you will be working with a messy, disorganized tech stack. Maybe your organization started using new products before considering how they interacted with others. Or you inherited someone else’s code. Mapping data dependencies will show you and your team how data flows and interacts with the systems in your stack.
Without that visibility, data proliferates unchecked, which costs companies money and leaves them more exposed to security vulnerabilities and costly regulatory problems.
Having a data dependency map will not only help you better understand your tech stack, but will also allow you to make more informed decisions going forward.
Here’s what you can do to help clean things up.
The benefits of dependency mapping
At first, it might seem like a lot of extra work to set up—and it can be—but there are clear reasons why you should create a data dependency map.
Data dependency maps offer a holistic view of your data, allowing data teams to design better tracking plans. They also help ensure that updating or removing analytics code won’t break any of your tracking systems. This matters most when you change code at the source, because the effects ripple into downstream systems. Seeing which dependent systems a change might break, before you make it, will save you and your team time.
Sounds great, right? Here are the main benefits that come from making a data dependency map.
Better understanding of the technology environment
A well-designed map allows anyone to easily see how the systems interact, helping you track which systems interact with data and where the data goes, step-by-step.
This helps in planning future products or components as well, as you can see where they can aid in data integration or migration.
Mapping out your data dependencies will help you maintain data accuracy as that data moves from source to destination. And that goes a long way in building confidence in the quality of your data.
By giving your team a complete view of your infrastructure and dependencies, you can track how each component works with the others.
You can also use a data dependency map to identify the root causes of application disruptions. If you’re having an issue with an application, you can start from where the issue surfaced and move back along the map, one dependency at a time, to see if there’s a specific root cause. Is it in the infrastructure? An application? An outside threat?
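The backward walk described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool. It assumes the map is stored as a plain dictionary from each system to the systems it reads from, and every system name in it (`dashboard`, `warehouse`, and so on) is hypothetical.

```python
from collections import deque

# Hypothetical map: each system -> the systems it reads data from (its upstreams).
UPSTREAMS = {
    "dashboard": ["warehouse"],
    "warehouse": ["etl_job"],
    "etl_job": ["web_sdk", "billing_db"],
    "web_sdk": [],
    "billing_db": [],
}

def trace_upstream(start, upstreams):
    """Return every system that `start` depends on, nearest first (breadth-first)."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in upstreams.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

# Systems to inspect, in order, when the dashboard shows bad numbers.
print(trace_upstream("dashboard", UPSTREAMS))
# ['warehouse', 'etl_job', 'web_sdk', 'billing_db']
```

Listing the nearest dependencies first mirrors how you would investigate by hand: check the system right behind the broken one before going further back.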
Easier to identify risks
Mapping out your data dependencies gives users clear visibility into your tech stack, which can help determine possible failure points that put your business at risk. If done properly, data mapping can be an effective tool for your organization, as it typically helps a company in the following areas:
- Data quality: As the sheer volume of data sources increases, data mapping is more complex than ever. Mapping out data dependencies closes the gap between data models, so decision makers can keep track of what happens to data as it moves throughout your stack.
- Cyber attacks and data breaches: As companies drive insights from data, protecting users’ information has become a must. A data map can help an organization identify where key data sets are stored, processed, and transmitted. Once organizations figure this out, they can take the necessary steps to protect sensitive information from ending up in the wrong hands.
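To make the second point concrete, here is a small Python sketch of how a data map can answer "where is this sensitive field stored, processed, and transmitted?" The structure of the map, the system names, and the choice of `email` as the sensitive field are all assumptions made for illustration. Your own map and policy will look different.

```python
# Hypothetical data map: each system lists the fields it handles and how.
DATA_MAP = [
    {"system": "web_sdk", "role": "transmits", "fields": ["email", "page_url"]},
    {"system": "etl_job", "role": "processes", "fields": ["email", "plan"]},
    {"system": "warehouse", "role": "stores", "fields": ["email", "plan", "page_url"]},
    {"system": "dashboard", "role": "processes", "fields": ["plan"]},
]

# Assumption: which fields count as sensitive is decided by your own policy.
SENSITIVE_FIELDS = {"email"}

def sensitive_touchpoints(data_map, sensitive_fields):
    """For each sensitive field, group the systems that handle it by role."""
    report = {}
    for entry in data_map:
        for field in entry["fields"]:
            if field in sensitive_fields:
                by_role = report.setdefault(field, {})
                by_role.setdefault(entry["role"], []).append(entry["system"])
    return report

report = sensitive_touchpoints(DATA_MAP, SENSITIVE_FIELDS)
print(report["email"])
```

Every system that shows up in the report is a place where that field needs protection, and any system missing from it is one you can rule out.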
What to consider before dependency mapping
Sure, you can make a physical map with sticky notes, or you can use one of the many digital tools that help you and your team create a digital version. Either way, before getting started with data mapping, there are two things you should consider:
First, determine the directionality of dependency
When starting with dependency mapping, it’s crucial to know how things will fail. That begins with direction: for each connection, record which system depends on which, because a failure at a source travels downstream to every system that consumes its data. By determining where things will fail, you identify vulnerabilities within your stack. The faster you can identify a failure, the faster you can solve the problem at hand. This will save your team time and your organization money in the long run.
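Here is one way direction can be recorded and put to use, sketched in Python. It assumes each dependency is written as an `(upstream, downstream)` pair, meaning data flows from the first system to the second. The edges and system names are hypothetical.

```python
from collections import defaultdict

# Hypothetical edges, written as (upstream, downstream): data flows left to right.
EDGES = [
    ("web_sdk", "etl_job"),
    ("billing_db", "etl_job"),
    ("etl_job", "warehouse"),
    ("warehouse", "dashboard"),
    ("warehouse", "email_tool"),
]

def build_downstream_index(edges):
    """Map each system to the systems that consume its data directly."""
    index = defaultdict(list)
    for upstream, downstream in edges:
        index[upstream].append(downstream)
    return index

def impacted_by(failed, edges):
    """Return the set of systems that can break if `failed` breaks."""
    index = build_downstream_index(edges)
    impacted, stack = set(), [failed]
    while stack:
        node = stack.pop()
        for child in index.get(node, []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

print(sorted(impacted_by("etl_job", EDGES)))
# ['dashboard', 'email_tool', 'warehouse']
```

If the pairs were written the other way around, the same function would report the wrong systems, which is why settling on a direction comes first.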
Keep it simple
While data maps should be comprehensive enough to account for many data sources, they shouldn’t be complicated to understand. Data maps should contain information relevant to your organization and be updated regularly, but there is no need to go overboard when mapping out your dependencies. A complicated data map can do more harm than good for your organization.
A data map should be simple enough for a layperson to understand, so next time there is a problem within your stack, a colleague can easily find the root of the problem and solve it in a reasonable amount of time.
The three most common data dependency mapping techniques
While data mapping varies by the complexity of your organization’s tech stack, these three data dependency mapping techniques are the most common among companies.
1. Manual mapping
Most data systems have grown to a point where they are now too complicated to track manually. However, manual mapping is a great place to start if your data system is small, and you don’t expect your system to grow.
With manual mapping, developers hand-code the connections between data sources, using languages such as SQL, C++, XSLT, and Java. This approach requires a lot of work upfront and will not be as effective as schema or automated mapping, but it can be done.
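As a rough idea of what hand-coded mapping looks like, here is a short Python sketch (Python is used only to keep the examples in this article consistent). It assumes a source system and a target system that name the same fields differently. All field names are hypothetical.

```python
# Hand-maintained mapping: source field -> target field. All names are hypothetical.
FIELD_MAP = {
    "user_id": "customer_id",
    "signup_ts": "created_at",
    "plan_name": "plan",
}

def map_record(record, field_map):
    """Rename the fields of one source record according to the hand-written map."""
    missing = [src for src in field_map if src not in record]
    if missing:
        # Fail loudly so a renamed or dropped source field is noticed right away.
        raise KeyError(f"source record is missing fields: {missing}")
    return {target: record[src] for src, target in field_map.items()}

source_row = {"user_id": 7, "signup_ts": "2021-02-08", "plan_name": "free"}
print(map_record(source_row, FIELD_MAP))
# {'customer_id': 7, 'created_at': '2021-02-08', 'plan': 'free'}
```

Every new field or renamed column means editing `FIELD_MAP` by hand, which is manageable for a small system and quickly becomes a burden for a large one.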
2. Schema mapping
Schema mapping software compares data sources to the target schema, generating connections. After that is complete, a developer must manually go into the software and verify the information is correct and make changes where needed.
Once the data map is complete, the software generates code to load the data. This is often referred to as a semi-automated strategy as it relies on teams to double-check the work done by the software before moving forward.
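The semi-automated idea can be illustrated with Python’s standard `difflib` module. This is only a sketch of the general approach, not how any particular schema mapping product works. The field names, the `normalize` helper, and the 0.6 similarity cutoff are assumptions chosen for the example.

```python
import difflib

def normalize(name):
    """Lowercase a field name and drop separators so naming styles compare equal."""
    return name.lower().replace("_", "").replace("-", "")

def suggest_mappings(source_fields, target_fields, cutoff=0.6):
    """Propose a target field for each source field; None means a human must decide."""
    normalized_targets = {normalize(t): t for t in target_fields}
    suggestions = {}
    for field in source_fields:
        matches = difflib.get_close_matches(
            normalize(field), list(normalized_targets), n=1, cutoff=cutoff
        )
        suggestions[field] = normalized_targets[matches[0]] if matches else None
    return suggestions

# Hypothetical schemas: the source uses mixed naming styles, the target uses snake_case.
SOURCE_FIELDS = ["User-Id", "signupDate", "zzz"]
TARGET_FIELDS = ["user_id", "signup_date", "plan"]

for source, target in suggest_mappings(SOURCE_FIELDS, TARGET_FIELDS).items():
    status = target if target is not None else "NEEDS REVIEW"
    print(f"{source} -> {status}")
```

The suggestions are a starting point. A developer still reviews each one, fixes the wrong guesses, and fills in the fields marked for review, which is the double-check step described above.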
3. Automated mapping
Automated solutions have become increasingly popular since they don’t require coding experience. Users of this software drag and drop lines between databases, which makes it easier to map out relationships in a reasonable amount of time. While these solutions do most of the heavy lifting, users would still do well to check their work for human error.
Tools for mapping data dependencies
Fortunately, there are many tools available that can aid you when mapping out your data dependencies. Here are a few we recommend:
- Datafold: This data lineage company helps businesses visualize their data ecosystem. It helps companies confirm that a change to the schema of one table will not affect functionality elsewhere. While the company offers a free version for businesses, its paid solution offers various benefits, including Slack integration and live in-product chat support.
- Monte Carlo: A fully automated data lineage solution that covers your whole data stack, Monte Carlo alerts your organization when data breaks. That means you can fix the problem before it reaches the end user. Monte Carlo is a paid solution that allows businesses to start with a free trial.
- Datadog: Datadog’s APM tool enables organizations to understand service dependencies while monitoring them in real time and alerting users when a system is down. The company offers a free trial for up to 14 days.
- Prometheus: This open-source solution enables you to monitor application performance. The solution is known for its high reliability and uptime. Prometheus will alert you to any major changes in behavior in your applications, so you can immediately investigate the cause.
Why data dependency mapping might be right for you
Any company that is truly data-driven should be mapping out its data dependencies. Data that is poorly mapped or not mapped at all will eventually lead to issues downstream as data travels from end to end within your organization. Mapping out your data dependencies can be a daunting task, especially when you rely on data to make informed business decisions.
Think of mapping your data dependencies as a task that future you will thank you for. We are not perfect, and data is bound to break at some point regardless of how flawless we think our current solution is. And you know what? That is okay. The process of mapping out your data dependencies will ensure that when data does break, it does not lead to a bigger problem down the line. Take the time to map out your data dependencies; it will save you lots of time hunting down which other systems were affected by the failure. When done correctly, data mapping helps ensure your organization’s data is not only correct but also reliable.
Has your organization started mapping your data dependencies? Do you have any lessons you’d like to share? Join the Amplitude Community.