TL;DR:
Data contracts are formal agreements between data producers and consumers that define how data should look and behave. This post explains how data contracts prevent pipeline breakages, ensure trust in dashboards, and shift accountability upstream, using tools like dbt and Elementary to enforce those expectations in code.
---
You’ve built a data model for an important product launch. The dashboards are live, metrics look great, until executives flag that something’s off. After hours of debugging, you realize engineering changed the grain of a core table… and no one told you.
This is exactly the kind of issue data contracts are designed to prevent.
What are data contracts?
Data contracts are agreements between data producers and data consumers.
They clarify expectations about how the data should be produced and then used. In our example, a contract would be set between the engineers (producers) and the analytics engineers (consumers). They can also be set between the analytics engineer and data analyst when defining a data model and how that affects dashboard development.
In this scenario, a data contract would protect the integrity of the source data. It would set enforced expectations between engineers and the data team, breaking the data pipeline when those expectations aren’t met. Data contracts prevent downstream models from running when the contract is broken. This means the analytics engineers would be the first to know of any breaking changes, rather than business stakeholders discovering a data quality issue in a dashboard.
You can think of data contracts as pushing the discoverability of any data issues upstream to data and engineering teams. This is exactly how it should be, as it helps maintain trust in the data.
Why Every Team Needs Data Contracts
Data contracts help enforce expectations upstream. They ensure data produced matches expectations and that the data is being used as it should. Without them, it can be easy for mistakes to go undetected in your data pipeline.
In our example, we imagined we were analytics engineers modeling data based on our expectations of how it should look in a table. When the format of that table changed, the data model delivered inaccurate data that the business used to make decisions. Instead of us alerting the business, the business alerted us.
Contracts help you to always be alerted to these changes before the business detects data quality issues. They stop your data pipeline in its tracks, preventing dependencies from running and therefore protecting the integrity of your data. While this may seem like an inconvenience in the moment, these contracts end up saving teams time, allowing data products to scale, and forcing communication between teams when necessary.
Instead of data quality always being the responsibility of the data team, it is being moved as far upstream as possible with the help of data contracts.
Core Components of a Data Contract
Not every data contract will look the same, and that is ok. How they look depends on what you are setting an agreement on.
The most important things to have in a contract are as follows:
- Owner: Who owns the contract and is responsible for keeping it? This person should be clearly stated on any contract. A clear owner makes it easy to assign responsibility to the right person when the contract is broken and needs to be mended or reassessed. This ends up saving tons of time at the most critical moment! We will soon talk about how we use Elementary to assign ownership of our dbt tests.
- A clear expectation: What do you expect the data to look like? This needs to have a clear answer, as contracts are very binary in nature. They either match your expectations or they don’t. For example, “I expect the field `created_at` to be a timestamp” is a clear expectation.
- Documentation: You need to document why a contract exists and what it is protecting against. This tells future users why it is important and needs to exist. This can easily be done in dbt in the form of descriptions and lineage graphs, or in an external documentation tool. Your future self and teammates will thank you.
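Putting these components together, here is a minimal sketch of what owner, expectation, and documentation can look like side by side in a dbt YAML file (the model name and owner handle are illustrative):

```yaml
models:
  - name: fct_daily_orders
    description: >
      Contracted model. Documents why the contract exists: it protects
      the revenue dashboard from upstream grain changes.
    meta:
      owner: "@analytics-engineering"   # who mends the contract when it breaks
    config:
      contract:
        enforced: true                  # the clear, binary expectation
```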
Tools to manage data contracts
dbt and dbt packages like Elementary are powerful ways to add contracts to your existing analytics projects. Together, they create a robust way to ensure your data matches your expectations, stopping your data pipeline in its tracks and alerting whoever is responsible when it doesn’t.
dbt model contracts
Model contracts in dbt enforce data types, as well as other conditions of your data, such as nullness, uniqueness, and keys. Under the hood, dbt includes the specified constraints in the DDL statements submitted to your warehouse, enforcing them when your model is built or updated. Because these checks run before any model is built, dbt data contracts verify that your data matches your expectations before the model ever materializes.
Model contracts are written in the YAML files of your dbt project and defined on a model. Every contract must include every column’s name and data type.
```yaml
models:
  - name: dim_orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: int
        constraints:
          - type: not_null
      - name: ordered_at
        data_type: timestamp
```
dbt contracts will always enforce nullness; however, other conditions like uniqueness and keys may not be enforced, depending on your data warehouse. Because of this, we recommend defining tests that check for these characteristics instead.
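For example, rather than relying on the warehouse to enforce a unique key, you can assert the same expectation with dbt’s built-in tests (the model and column names here mirror the contract example and are illustrative):

```yaml
models:
  - name: dim_orders
    columns:
      - name: order_id
        tests:
          - unique      # fails the run if duplicate order_ids appear
          - not_null    # checked at test time, regardless of warehouse support
```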
While dbt contracts enforce the structure of your data models, they can’t always catch changes in the behavior of the data—like volume drops or unexpected values. That’s where tools like Elementary come in.
Elementary
Data tests in Elementary offer more flexibility in what you can test. They are ideal for defining expectations that relate not to the shape of the data model, but to field values and their frequency.
They offer anomaly tests on:
- Dimensions
- Columns
- Volume
- Freshness
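As a sketch, a volume anomaly test attaches to a model in its YAML file like any other dbt test; the model name is illustrative, and the exact arguments available depend on your Elementary version:

```yaml
models:
  - name: login_events
    tests:
      - elementary.volume_anomalies:
          # alert when row counts deviate from the learned baseline
          tags: ["elementary"]
```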
Schema change test
Elementary also offers a schema change test for when you don’t want the strict enforcement of a dbt contract. With this test, dbt will still build the model, but raise either a warning or a blocking error depending on the severity you have set. This gives you the flexibility of data continuing to flow normally, while checking in with engineering on expectations and how they may have changed.
```yaml
version: 2

models:
  - name: login_events
    tests:
      - elementary.schema_changes:
          tags: ["elementary"]
          config:
            severity: warn
```
Elementary’s schema change test will alert you to any changes in a column, including when one has been added or deleted, or when its data type has changed. This can help you understand which fields are no longer important and what additional information may need to be modeled downstream. As an analytics engineer, it’s helpful to see which data sources are being worked on in real time, even when the changes aren’t breaking anything.
Ownership and subscribers config
My favorite feature that Elementary offers, when it comes to data contracts, is the ownership and subscribers configuration. Before discovering Elementary, I found there was always a gap between engineering and analytics when it came to data quality testing. Analytics engineers would own the data quality tests in dbt, despite not being the ones in charge of fixing the issue. They would then have to tag the engineer responsible for fixing the issue in the source data. This created a time-consuming back-and-forth conversation about who owned what and whose responsibility it was to maintain high-quality data. Models are the responsibility of analytics engineers; source data is the data engineers’ responsibility.
By tagging owners on my data quality alerts using Elementary, I am able to ensure the right person responsible is alerted of a failure, whether it's the responsibility of the analytics or data engineer. This also prevents me from needing to track down the right person as issues occur, as I’ve already discovered the person who owns the data.
```yaml
tests:
  - not_null:
      meta:
        owner: ["@jessica.jones"]
```
Here, Jessica’s Slack username is defined as the owner of the test. When it fails, she will be notified. This acts as a contract between Jessica, the engineer, and the analytics engineer who wrote the test on the model. The analytics engineer expects the field to not be NULL, and if it doesn’t meet that expectation, Jessica is the one expected to fix it.
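Beyond the owner, Elementary also lets you list subscribers under the same `meta` block, so additional people are notified of a failure without being responsible for the fix (the usernames here are illustrative):

```yaml
tests:
  - not_null:
      meta:
        owner: ["@jessica.jones"]          # responsible for the fix
        subscribers: ["@analytics.oncall"] # notified, but not accountable
```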
Benefits of Implementing Data Contracts
There are many benefits to adding data contracts to your data environment. These benefits include higher data quality, a more dependable data pipeline, and time saved communicating between teams.
A protected data pipeline
Data contracts often prevent downstream dependencies from running, like in the case of dbt contracts. They alert you of any issues before they have the ability to affect the data you are sharing with stakeholders. Not only does this prevent stakeholders from using inaccurate data, but it also gives the engineers and the data team more time to investigate what broke the contract and put the proper fix in place.
Clear owners of data tables, models, dashboards, etc.
You’d be shocked how much time is wasted trying to hunt down the person responsible for a certain data product. If you don’t have clear owners defined on tests, models, and contracts, you need to open a line of communication between different teams to help you do so. This wastes everyone's time!
Data contracts save the hassle of hunting down the right people in the most critical of times (when your data is failing!). They create a clear process for getting the issue in front of the person who can solve the problem, giving you more time to resolve it.
Challenges and Pitfalls in Data Contracts
The challenges with contracts come when deciding whether one makes sense at all, and then defining the terms of the contract.
Knowing when to add a contract
With anything related to data quality, it’s always a fine balance between adding value and adding noise. When considering a data contract, make sure you are asserting expectations that, if violated, would genuinely harm the quality of your data or a decision that depends on it.
For example, it doesn’t make sense to add a data contract to a model that isn’t being used downstream by the business or in reporting. Instead, expectations can be asserted when the model is used.
It’s not worth the time and effort to define a contract if the conditions being contracted don’t matter much in the end. If a contract can be removed in the case of failure, without any downstream disruptions, it’s a good sign that it should have never been added at all.
Communicating expectations so everyone is on the same page
Data contracts involve two parties: the producer and the consumer. The hardest part is often getting both sides to agree on what the data should look like. This can involve many meetings on things like database design, data modeling, and dashboard design. Both parties need to be clear on what they need and what is and isn’t possible. Sometimes it can be hard to meet in the middle when both parties see things differently.
This is a common occurrence when defining contracts between engineering and data teams, especially when defining new data tables. Engineers may build things one way, but data may need it another way to properly model the data. Data contracts can only be defined when the data is stable enough for data teams to begin working with it.
FAQ
How do data contracts improve data reliability?
Data contracts alert you when your expectations of your data have changed. In many cases, they prevent models from running on the data that failed the contract. This stops issues from making their way downstream and altering metrics in reports or dashboards used to make key business decisions.
What is the difference between an SLA and a data contract?
Service level agreements are typically between a service provider and a customer. They define expectations of the service that the provider is expected to deliver to the customer. Both SLAs and data contracts define expectations between two parties. Data contracts are focused specifically on maintaining data quality and exist between a data producer and a data consumer.
How do data contracts and observability work together?
Data contracts and observability work together to ensure you provide reliable data downstream to business stakeholders. Data contracts enforce expectations about your data, while data observability catches anomalies and data quality issues in the data pipeline.