kb3 labs

Top 3 Data Quality Issues and How to Solve Them

Kerry Kurcz • October 26, 2024

Data Quality is one of the most overlooked and expensive issues an organization can have.

Poor Data Quality can be defined as mismanaged expectations between the data producer and the data consumer [1]. To maximize Data Quality, consider whether the dataset is:

 

1. Accurate

2. Reliable

3. Contracted

 

I call this the ARC framework.

 

First, what does it mean to be Accurate? Who knows all the test answers? Who is the subject matter expert (SME) that would know? Are they sufficiently technical to dive into the data themselves to find out? If not, pair the SME with a technical resource so that together they can identify and document accuracy levels. Everyone wants 100% accurate data, but the world is messy and total accuracy does not exist. So, who can determine the appropriate accuracy threshold for this use case?
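One way to make an accuracy level concrete is to compare the dataset against a small sample of values the SME has verified by hand. The sketch below is a minimal illustration, not a prescribed method: the function names, the sample records, and the thresholds are all hypothetical.

```python
def accuracy_rate(records, verified):
    """Share of SME-verified values that the dataset reproduces exactly."""
    if not verified:
        raise ValueError("need at least one SME-verified value")
    matches = sum(
        1 for key, expected in verified.items() if records.get(key) == expected
    )
    return matches / len(verified)


def meets_threshold(records, verified, threshold):
    """True when the measured accuracy is at or above the agreed threshold."""
    return accuracy_rate(records, verified) >= threshold


# Hypothetical data: what the dataset says vs. what the SME confirmed.
dataset = {"order-1": 19.99, "order-2": 5.00, "order-3": 42.10, "order-4": 7.25}
sme_verified = {"order-1": 19.99, "order-2": 5.00, "order-3": 42.00, "order-4": 7.25}

rate = accuracy_rate(dataset, sme_verified)
print(f"accuracy: {rate:.0%}")  # 3 of 4 verified values match
print("meets 95% threshold:", meets_threshold(dataset, sme_verified, 0.95))
```

The threshold itself is a business decision: the code only measures, while the SME and the data consumers decide what level is acceptable for the use case.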

 

Here's where Reliability comes into play. Once data consumers are confident in the accuracy of their data, they should also be able to rest assured that the data will be available when needed. Stale data, meaning information that is no longer relevant but is still accessible, can skew decision-making. Make sure the data is current by implementing checks on the expected number of new inbound records and raising alerts when that number falls below an acceptable threshold.
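A check like this can be very small. The following sketch counts the records that arrived within a recent time window and returns an alert message when the count is below the expected minimum. The function names, the arrival timestamps, and the threshold of 100 records are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def count_recent(arrival_times, now, window):
    """Count records that arrived within `window` before `now`."""
    cutoff = now - window
    return sum(1 for t in arrival_times if cutoff < t <= now)


def volume_alert(arrival_times, now, window, min_expected):
    """Return an alert message if too few records arrived, else None."""
    observed = count_recent(arrival_times, now, window)
    if observed < min_expected:
        return (
            f"ALERT: {observed} new records in the last {window}, "
            f"expected at least {min_expected}"
        )
    return None


# Hypothetical arrivals: 5, 20, 45, 90 and 300 seconds before "now".
now = datetime(2024, 10, 26, 12, 0, tzinfo=timezone.utc)
arrivals = [now - timedelta(seconds=s) for s in (5, 20, 45, 90, 300)]

message = volume_alert(arrivals, now, timedelta(minutes=1), min_expected=100)
if message:
    print(message)
```

In practice the arrival times would come from the dataset's load or ingestion timestamps, and the message would be routed to whatever alerting channel the team already uses.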

 

Last, it is true that datasets can be complex and even encrypted or encoded to save storage space or reduce risk. But someone should know how it works, and it should be sufficiently intuitive to document. And document it you should, securely.

 

A Data Contract is a binding agreement among all data producers and consumers that defines and communicates important events like:

 

  • Expected number of new, inbound records in a given time frame
      • Ex. 100 per minute, 1000 a day
  • Schema and its changes
      • Ex. when adding or dropping a column
  • Changes to the semantic meaning of the data
      • Ex. kilometers to miles, a timestamp changing from local time to UTC
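The three kinds of events in this list can also be written down in a machine-readable form, which makes changes easy to detect before they surprise a consumer. The sketch below is one possible shape, not a standard: the `DataContract` class, its fields, and the example trip dataset are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract covering volume, schema, and semantics."""
    min_records_per_day: int
    columns: dict  # column name -> type name
    units: dict    # column name -> unit or time zone


def contract_changes(old, new):
    """List the differences between two versions of a contract."""
    changes = []
    if old.min_records_per_day != new.min_records_per_day:
        changes.append(
            f"volume changed: {old.min_records_per_day} -> {new.min_records_per_day}"
        )
    for col in sorted(new.columns.keys() - old.columns.keys()):
        changes.append(f"column added: {col}")
    for col in sorted(old.columns.keys() - new.columns.keys()):
        changes.append(f"column dropped: {col}")
    for col in sorted(old.units.keys() & new.units.keys()):
        if old.units[col] != new.units[col]:
            changes.append(
                f"meaning changed: {col}: {old.units[col]} -> {new.units[col]}"
            )
    return changes


v1 = DataContract(
    min_records_per_day=1000,
    columns={"trip_id": "string", "distance": "float", "started_at": "timestamp"},
    units={"distance": "kilometers", "started_at": "local time"},
)
v2 = DataContract(
    min_records_per_day=1000,
    columns={"trip_id": "string", "distance": "float",
             "started_at": "timestamp", "fare": "float"},
    units={"distance": "miles", "started_at": "UTC"},
)

for change in contract_changes(v1, v2):
    print(change)
```

Each reported change is a prompt for a conversation: the producer shares the list, and consumers sign off before the new version goes live.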

 

If your business is moving so fast that documenting all of this seems infeasible, the contract can be as simple as a data mapping document or a data definition document. Just make sure to share it with all data producers and consumers so that all parties can provide their sign-off.

 

[1] Definition by Chad Sanderson, on The SuperDataScience Podcast with Jon Krohn.
