This page explains the expect_column_distinct_count_to_be_less_than
test from the dbt-expectations package. This test checks that the number of distinct values in a specified column is less than a certain threshold. It's useful when you need to enforce a maximum limit on the variability or diversity of the data within a column.
How it Works
The expect_column_distinct_count_to_be_less_than
test verifies that the number of unique entries in a column does not exceed a pre-set limit. This function is important for cases where high diversity might indicate incorrect data grouping, duplicate entries, or other inconsistencies.
Steps and Conditions:
- Column Selection: Choose the column you want to evaluate for distinct value count.
- Define a Maximum Threshold: Establish a maximum allowable number of distinct values using the
value
parameter. - Optional Configuration:
- Quote Values: Use the
quote_values
argument to determine if values should be quoted, default istrue
. - Group By: Utilize the
group_by
parameter to specify grouping by one or more columns, which is valuable for evaluating distinct counts in segmented datasets. - Row Condition: Employ the
row_condition
parameter to filter the rows that will be included in the test.
- Quote Values: Use the
- Execution: After applying any row filters and groups, the test counts the distinct values in the target column and compares this count against the set threshold.
- Outcome:
- Pass: The test passes if the number of distinct values is less than the threshold, confirming control over data variability.
- Fail: If the distinct count exceeds the threshold, the test fails, signaling potential issues in data collection or categorization.
Example Usage: Fintech
For a Fintech company, monitoring the number of unique transaction types processed through an application is critical to ensure that all transactions are within expected boundaries and to prevent any unusual activities.
Consider a scenario where the transaction_types
table records types of transactions processed, and a column type_id
identifies each transaction type.
In this example, the expect_column_distinct_count_to_be_less_than
test is used to ensure that the type_id
column in the transaction_types
model does not have more than 5 distinct transaction types since January 1, 2023. This setup helps maintain confidence in the consistency and categorization of transaction types handled by the fintech platform.