To get the most from a BI investment, make sure the data pipeline is in order first.
There’s an old saying that is often applied to analytics: “Garbage in, garbage out.” Results are only as good as the data that feeds them. In fact, preparing that data is often estimated to take up as much as 80% of the analytics process. Taking shortcuts with data quality is a fast way to undercut business intelligence efforts.
This checklist is a useful guide for evaluating the existing process and making plans for future infrastructure.
Why Is Data Preparation Important?
Data comes in many formats, especially when it comes from different sources. When everything is funneled into a shared database, there may be blank fields, differences in field labels and numbers, and variations in numerical formats that a computer reads differently (dates are one example). Depending on the databases involved, similar records may be duplicated or merged into a single entry.
Messy input like this can produce null or even misleading results. Data that can’t be trusted negates the advantage of business intelligence, so it has to be organized into a consistent format before analysis.
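As a rough illustration of what “a consistent format” can mean for field labels and blank fields, here is a minimal Python sketch. The alias table, the field names, and the `harmonize` helper are all hypothetical and only show the idea; a real pipeline would need its own mapping.

```python
# Hypothetical mapping from source-specific labels to one canonical name.
FIELD_ALIASES = {
    "cust_name": "customer_name",
    "customer": "customer_name",
    "e-mail": "email",
    "email_address": "email",
}


def harmonize(record):
    """Return a copy of `record` with canonical field names.

    Labels are lower-cased and spaces become underscores before the alias
    lookup; blank strings are stored as None so they are easy to count later.
    """
    clean = {}
    for label, value in record.items():
        key = label.strip().lower().replace(" ", "_")
        key = FIELD_ALIASES.get(key, key)
        if isinstance(value, str):
            value = value.strip() or None
        clean[key] = value
    return clean


# Two made-up rows describing the same customer in different systems.
crm_row = {"Customer": "Ada Lopez ", "E-mail": "ada@example.com"}
billing_row = {"cust_name": "Ada Lopez", "Email Address": ""}

print(harmonize(crm_row))
print(harmonize(billing_row))
```

After this step both rows use the same keys, and the blank email in the billing row is an explicit missing value rather than an empty string.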
Data Quality Checklist
There must be enough data to warrant analysis. All critical fields should be filled in, and an acceptable percentage of non-critical fields should be filled in as well.
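One way to put numbers on this check is to measure the fill rate of each field. The sketch below is only an assumption about how that might look: the field names, the sample rows, and the `optional_threshold` parameter are invented for illustration, and what counts as “acceptable” is a business decision.

```python
def is_filled(value):
    """A value counts as filled unless it is None or a blank string."""
    return value is not None and str(value).strip() != ""


def fill_rate(records, field):
    """Fraction of records in which `field` holds a non-blank value."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if is_filled(record.get(field)))
    return filled / len(records)


def completeness_report(records, critical, optional, optional_threshold=0.8):
    """Return per-field fill rates and the fields that fall short.

    Critical fields must be 100% filled; optional fields must reach
    `optional_threshold` (a hypothetical cut-off chosen by the business).
    """
    rates = {field: fill_rate(records, field) for field in critical + optional}
    problems = [field for field in critical if rates[field] < 1.0]
    problems += [field for field in optional if rates[field] < optional_threshold]
    return rates, problems


# Made-up order data: one order is missing its amount.
rows = [
    {"order_id": 1, "amount": 19.99, "coupon": "SPRING"},
    {"order_id": 2, "amount": None, "coupon": ""},
    {"order_id": 3, "amount": 5.00, "coupon": ""},
    {"order_id": 4, "amount": 42.50, "coupon": "VIP"},
]

rates, problems = completeness_report(
    rows, critical=["order_id", "amount"], optional=["coupon"], optional_threshold=0.5
)
print(rates)
print(problems)
```

Here the critical `amount` field is flagged because one of the four rows is missing it, while `coupon` passes because it meets the chosen 50% threshold.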
Data should be validated and come from a reliable source. “Reliable” has different meanings based on the type of data, so use good judgement when it comes to choosing sources. Consider who owns or manages the source as well as how the data is collected.
Low-cost cloud storage has enabled businesses to store more data than ever before. That can be an advantage, as long as the stored data could be used to answer business questions. Also, check whether the data is still current or whether more up-to-date data is available.
Prepare data for analysis in an appropriate format (such as CSV). Data scraped from PDFs and other file types may be in an unstructured state that needs more work to be usable. Follow common text and numerical conventions. Currency and dates, for example, are written differently in the US than in Europe. Check for duplicates and contradictory data as well; this is a common issue when importing from different sources.
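To make the US-versus-Europe point concrete, the following Python sketch normalizes dates and currency amounts written in either convention, then looks for duplicates and contradictions. It is a simplified example under stated assumptions: the invoice records are made up, the helper names are hypothetical, and only one date pattern and one number pattern per region are handled.

```python
from datetime import datetime
from decimal import Decimal


def parse_date(text, day_first):
    """Parse 'MM/DD/YYYY' (US style) or 'DD/MM/YYYY' (European style)."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text.strip(), fmt).date()


def parse_amount(text, european):
    """Parse '$1,234.56' (US style) or '€1.234,56' (European style)."""
    cleaned = text.strip().lstrip("$€").strip()
    if european:
        # Dots group thousands and the comma marks the decimals.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # Commas group thousands and the dot marks the decimals.
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)


def dedupe(records, key):
    """Keep the first record per key; report keys whose records disagree."""
    unique, conflicts = {}, set()
    for record in records:
        k = record[key]
        if k not in unique:
            unique[k] = record
        elif unique[k] != record:
            conflicts.add(k)
    return list(unique.values()), conflicts


# The same invoice exported from a US system and a European system.
us = {"invoice": "A-1",
      "date": parse_date("03/04/2024", day_first=False),
      "amount": parse_amount("$1,234.56", european=False)}
eu = {"invoice": "A-1",
      "date": parse_date("04/03/2024", day_first=True),
      "amount": parse_amount("€1.234,56", european=True)}
# A second invoice whose two copies contradict each other on the amount.
first = {"invoice": "A-2",
         "date": parse_date("05/04/2024", day_first=False),
         "amount": parse_amount("$10.00", european=False)}
clash = {"invoice": "A-2",
         "date": parse_date("05/04/2024", day_first=False),
         "amount": parse_amount("$12.00", european=False)}

unique, conflicts = dedupe([us, eu, first, clash], key="invoice")
print(len(unique), conflicts)
```

Once both copies of invoice A-1 are normalized they compare as equal and collapse into one record, while invoice A-2 is reported as a conflict for someone to reconcile.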
All concerned end users should be able to access the company’s data, provided it’s legal and ethical for them to do so (for example, records covered by HIPAA should be protected). Make sure this can happen in real or near-real time; when staff have to wait days for data requests to come back, they tend to move ahead with less informed choices instead.
Make sure there’s a designated data steward who is empowered to maintain the data pipeline. It doesn’t have to be a separate position, but the steward should be able to speak to company leadership when there’s an issue.
Think in terms of “data lakes” as opposed to “data silos”, too. Data lakes put the entirety of the company’s data in the hands of those looking for innovative ways to improve operations. They can make decisions based on all available information without worrying that some hidden bit of data might derail their plans. (Automaker Nissan has seen great success from this strategy.)
Options for Data Preparation
When it comes to data preparation, the options boil down to manual versus automated techniques.
In manual data preparation, employees go through data to check its accuracy, reconcile conflicts, and structure it for analytics. It’s suitable for small batches of data or when there are unusual data requirements, but the labor investment is high.
Pros:
- Less obvious investment (labor goes up instead of a technology outlay)
- Low training burden
- Granular control
- In-house data security
Cons:
- Staff could be working on higher-value tasks that are harder to automate
- Prone to human error
- Expensive when labor is considered
With automated data preparation, software is used to sort, validate, and arrange data before analysis. Automation can handle large datasets and near real-time processing.
Pros:
- Fast enough to prepare data for streaming analytics
- Highly accurate
- Removes labor burden
- Works on both the front and back end of collection
Cons:
- Staff must be trained on the software
- Initial investment required
- Working with outside vendors requires extra vigilance for security purposes
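For a sense of what automated preparation does under the hood, here is a toy Python sketch of the sort-and-validate step. It does not represent any particular product; the rules, the field names, and the sample sales records are hypothetical and chosen only to show the pattern of applying the same checks to every record.

```python
from datetime import date


# Hypothetical validation rules: each returns an error message or None.
def require(field):
    def rule(record):
        if record.get(field) in (None, ""):
            return f"missing {field}"
        return None
    return rule


def non_negative(field):
    def rule(record):
        value = record.get(field)
        if isinstance(value, (int, float)) and value < 0:
            return f"negative {field}"
        return None
    return rule


RULES = [require("sku"), require("sold_on"), non_negative("quantity")]


def prepare(records, rules=RULES, sort_key="sold_on"):
    """Split records into accepted and rejected, then sort the accepted ones.

    `sort_key` should be a field that the rules require, so every accepted
    record is guaranteed to have it.
    """
    accepted, rejected = [], []
    for record in records:
        errors = [msg for msg in (rule(record) for rule in rules) if msg]
        if errors:
            rejected.append((record, errors))
        else:
            accepted.append(record)
    accepted.sort(key=lambda record: record[sort_key])
    return accepted, rejected


# Made-up sales records, two of which break a rule.
raw = [
    {"sku": "B-2", "sold_on": date(2024, 5, 3), "quantity": 2},
    {"sku": "", "sold_on": date(2024, 5, 1), "quantity": 1},
    {"sku": "A-1", "sold_on": date(2024, 5, 2), "quantity": -4},
    {"sku": "C-3", "sold_on": date(2024, 5, 1), "quantity": 7},
]

accepted, rejected = prepare(raw)
print([record["sku"] for record in accepted])
print([errors for _, errors in rejected])
```

Rejected records are kept together with the reasons they failed, so a person can review the exceptions instead of re-checking every row by hand.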
Data quality may be the least exciting part of business intelligence, but it’s the only way to get reliable results. Take the time to build a strong foundation for your data intelligence process and you’ll be rewarded with more reliable, better-targeted insights.