Data Quality Best Practices: Avoiding Data Fires

A few years ago, when I started my career as an analytics engineer, I discovered a major issue in how the engineering team collected data. I discovered that they weren’t collecting any data at all, in fact, for one of the main sources in the data team’s sessions model. The business depended on this model for attribution and understanding traffic on our site.

The worst part was that it took me so long to find this out. At the time, I had no dbt tests in my project, let alone any kind of data quality measures. I learned about data quality issues after the fact, when analyzing data sources for other reasons.

When I discovered the issue, it took the engineering team and I two weeks to fully discover the extent of it, understand the issue, and then fix it. Why did it take so long? There were three main reasons:

  1. Nobody knew who was responsible for fixing the issue.
  2. We lacked proper tests and alerts, making it impossible to know when there were any issues.
  3. These types of problems weren’t visible across the company.

In the end, we lost two weeks’ worth of precious data. Had we had the proper anomaly checks in place, we would have been able to understand and fix the root of the problem in a few hours.

In this article, I’ll walk you through what you need to do to ensure you don’t find yourself in the same situation I was in. We will discuss why data quality always needs to be the first thing you think about and the best practices for monitoring data quality with dbt, such as adding tests and assigning resource owners.

Anticipating data quality issues

Data quality should be the first thing you think about when building out your dbt project and data pipeline. The biggest mistake many data teams make is waiting to think about data quality once everything is built out. However, everything should be built to optimize data quality.

When building out your project and pipeline, you need to consider the areas where things can go wrong and how to prevent them.

When evaluating data quality, I always consider these five factors:

  • Accuracy (Do your values look as you expect?)
  • Freshness (Is the data arriving on time?)
  • Governance (Is your data secure? Is PII properly handled?)
  • Completeness (Are there unexpected NULL records?)
  • Uniqueness (Are there duplicates?)

Building your project and pipeline with these five things in mind will save you from future pain. You will make decisions you wouldn’t have made otherwise, such as creating different environments within your dbt project or adding uniqueness tests on all your sources.

When it comes to data quality, it’s not a matter of if there will be problems, it’s a matter of when. Do things the right way and you will know about them sooner rather than later, saving you from potential data loss.

To properly monitor data quality, you’ll want to implement a few best practices such as adding dbt tests, assigning owners to resources, and sending alerts to the right location.

Best Practice #1: Adding dbt tests

dbt offers the ability to add freshness tests to all data sources. Freshness tests allow you to configure the cadence at which you expect data to arrive in your data warehouse. For example, if you ingest data from Postgres hourly, you will expect data to be ingested within the last hour.

Freshness tests

Freshness tests, like all dbt tests, are added to the YAML files where the source or model is defined. With freshness tests, you define them on the source name itself and specify the timestamp field you want to check for freshness (loaded_at_field). A good rule of thumb is to use the timestamp that represents when the record in the table was created.

  - name: example_source
        count: 3
        period: hour
        count: 10
        period: hour
    loaded_at_field: event_created_at

Notice here that I also have different configurations for warnings and errors. Here, dbt will warn me about my test failing if my source hasn’t had new data within 3 hours. However, it will error and prevent downstream models from building after 10 hours.

If you find dbt’s freshness tests don’t give the coverage you need on a model, you can add an anomaly test from a dbt package like Elementary. This compares the freshness of the different time buckets rather than looking for the most recent record.

  - name: example_model
      - elementary.freshness_anomalies:
          timestamp_column: event_created_at
            period: hour
            count: 24

Generic tests

Other generic tests such as uniqueness, not_null, and relationship tests are also available directly within dbt. Each of these is applied to fields within a source or model. I recommend always adding these on primary and foreign keys at a minimum.

  - name: example_source
		description: example source to show how generic tests work
			- name: id 
				description: PK of the source table
					- not_null 
				  - unique

Schema change detection

Schema change detection is ideal for any data team but especially one disconnected from the engineering team. These tests help alert you of any changes at the source such as fields being added or dropped and data type changes.

Elementary makes adding this test seamless with one simple line:

  - name: example_model 
      - elementary.schema_changes

Volume anomaly test

In the scenario mentioned in the introduction, a volume anomaly test would have properly alerted me of a data volume issue. We were getting data each day, but not nearly as much as we should have been. Similar to Elementary’s freshness anomaly test, their volume anomaly test works by using time buckets and comparing the count of rows in each bucket. Had I implemented this at the time, we would have quickly identified the lack of data being collected.

  - name: example_model
      - elementary.volume_anomalies:
          timestamp_column: event_created_at
            period: hourly
            count: 24

Best Practice #2: Assigning owners to sources and models

I mentioned in the introduction that data quality issues take a while to resolve when nobody knows who is supposed to fix the issue. One way to remove confusion is to assign owners to data sources and models. This removes the guessing and ambiguity that occurs when something fails.

I recommend assigning both a technical and business owner if possible. The technical owner should be the person who wrote the model, or in the case of a data source, the engineer who wrote the code that produces the data. The business owner should be the person who relies on this data model and is heavily involved in the business logic encoded into it. This person will help in evaluating the impact of data quality failures.

Elementary offers a few different ways to add owners to your models, but the easiest way is using a meta block like so:

- name: example_model
    owner: ["", "@madison"]

Here, I assigned two owners: one is my email and the other is my Slack email prefix. This makes it so the owner is directly tagged in Elementary alerts on Slack, ensuring they are always notified when a test fails.

Best Practice #3: Sending alerts to the right place

Data quality alerts aren’t very useful unless they are sent to a place the data and engineering team frequently check. To determine where to send alerts, think about how your company and data team operate. What platform is the norm? Where will they see the alert first?

If your company primarily uses Slack, for example, it wouldn’t make sense to send data quality alerts by email. If you are a frequent Slack user like me, I recommend creating an #analytics-monitoring channel where you send all data alerts across different parts of your data pipeline.

This is ideal because it consolidates all failure alerts from different tools in your pipeline to one central location. For example, Airbyte sync failures, Elementary test failures, and Hightouch alerts would all be sent to this one Slack channel where engineers can easily look.

As I showed in the owner code from Elementary, you can tag someone’s Slack prefix and they will automatically be tagged in alerts where they are the owner. You can read more about setting up alerts in Elementary here.

What’s next?

Losing two weeks’ worth of data due to an issue with how we were collecting data was devastating. Why was it so devastating? Because it could have easily been prevented. All of the action steps in this article don’t take a long time to implement, they just take proper planning and an understanding of data quality.

Had we implemented tests built into dbt and anomaly tests like the ones provided by Elementary, we wouldn’t have lost more than a few hours’ worth of data. We would have had all of the testing, and alerting needed to put out the fire fast.

Don’t wait until something goes wrong to think about data quality. Data quality should be at the forefront of everything you do. When you build your dbt project, data pipeline, and data team processes around this, you are bound to create a stable foundation for a data-driven organization.

Talk to an expert about your specific needs