
Data quality is the iceberg of BI and Information Management. It is in your best interests to know that it's there and start breaking it up, because the chances of avoiding it totally are remote.
We have known about data quality issues for over fifteen years, since the inception of data warehousing and BI. Data quality is a potential data warehouse killer – poor data in equals useless information out. As businesses become more and more automated, and high volumes of data are used for analysis in more strategic ways, business decisions are more dependent on the quality of underlying data. Understanding of the need to look at data quality is generally growing, but because of the increasing complexity of systems and corporate change in areas like M&A, data quality continues to present problems.
Defining terms and resources
We are running out of words in the English language to describe all the types of technology and business processes that we need. ‘Data quality’ is a term that can be used in a variety of ways. Here it is used in its broadest sense – to refer to all the software tools, technologies and business processes that contribute to making sure data is clean, consistent, correct and timely.
These attributes of high data quality are defined below:
‘Clean’ means that there are no obvious defects: for example, surnames in English start with a capital letter, genders are male or female only, postcodes are in the correct format, and there are no duplicate records.
‘Correct’ means that Mr A Smith is not Mr B Smith – the data is an accurate reflection of the actual circumstances.
‘Consistent’ means that different data items that relate to each other agree. This could mean that Mr A Smith has gender male, or that Mr A Smith is only referred to as Mr A Smith and not as Mr Alan Smith. Consistency also refers to the need for codes to be matched when integrating data between different systems. For example, product information from a legacy application must be matched with the new ERP system, even if the systems use different product codes.
‘Timely’ means that the data is updated accurately – closed accounts are reflected as closed, new credit statuses are reflected quickly etc.
The above examples demonstrate the range of problems that data quality can present. Although some of these problems can be solved mainly within the IT department, such as different product codes between systems, others require the involvement of the business. The broad definition of data quality must include business processes.
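As a rough illustration of how the ‘clean’ and ‘consistent’ attributes defined above translate into checks, the following Python sketch flags defects in a handful of made-up customer records. The field names, the simplified UK-style postcode pattern and the title-to-gender mapping are all assumptions made for this example, not rules taken from any particular system.

```python
import re
from collections import Counter

# Hypothetical customer records; field names are illustrative only.
records = [
    {"id": 1, "title": "Mr", "surname": "Smith", "gender": "M", "postcode": "SW1A 1AA"},
    {"id": 2, "title": "Mr", "surname": "smith", "gender": "F", "postcode": "SW1A 1AA"},
    {"id": 3, "title": "Ms", "surname": "Jones", "gender": "", "postcode": "12345"},
    {"id": 3, "title": "Ms", "surname": "Jones", "gender": "", "postcode": "12345"},
]

# Simplified postcode pattern, assumed for this sketch; real formats are richer.
POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$")
# Assumed mapping used to check that title and gender agree.
TITLE_GENDER = {"Mr": "M", "Ms": "F", "Mrs": "F"}


def clean_defects(rec):
    """Return the 'clean' defects found in a single record."""
    defects = []
    if not rec["surname"][:1].isupper():
        defects.append("surname not capitalised")
    if rec["gender"] not in ("M", "F"):
        defects.append("gender missing or invalid")
    if not POSTCODE.match(rec["postcode"]):
        defects.append("postcode format")
    return defects


def consistency_defects(rec):
    """Return defects where related items in one record disagree."""
    expected = TITLE_GENDER.get(rec["title"])
    if expected and rec["gender"] in ("M", "F") and rec["gender"] != expected:
        return ["title and gender disagree"]
    return []


def duplicate_ids(recs):
    """Return the ids that appear on more than one record."""
    counts = Counter(r["id"] for r in recs)
    return sorted(key for key, n in counts.items() if n > 1)


for rec in records:
    print(rec["id"], clean_defects(rec) + consistency_defects(rec))
print("duplicate ids:", duplicate_ids(records))
```

Checks like these cover only ‘clean’ and ‘consistent’. Whether Mr A Smith really is Mr A Smith (‘correct’), or whether an account has really been closed (‘timely’), cannot be decided from the data alone, which is why the business has to be involved.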
How to approach data quality
A data quality programme will tread a fine line between fitting into an enterprise-wide strategy, and delivering small, frequent chunks of business benefit. Data quality is best addressed on the back of information related projects such as BI, Corporate Performance Management or Master Data Management, as the business benefit of the core project can be extended to the data quality. These projects also make it easy to prove to the business that data quality is an issue, whereas in isolation this message is harder to convey.
Once the company finds out that it has a data quality problem, reactions differ widely. When IT has a good reputation with the business, it may be asked to address data quality and propose further investigations into other areas of data. However where IT is treated more as a service department than a strategic element, the business may be reluctant to accept the news that IT can't fix data quality on its own. Allocating business resource and responsibility in this way needs to be done with a senior and influential business sponsor.
IT can fix existing data by defining business rules for its correction and applying them in batch. However, to ensure that data created on an ongoing basis is of high quality, the business has to be involved in making sure that new data is good. This can include setting rules to validate data in front-end systems and deciding how to handle any exceptions.
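The two approaches described above, batch correction of existing data and validation of new data at the point of entry, could be sketched as follows. This is a minimal sketch with hypothetical rules and field names; in practice the rules would be agreed with the business rather than written by IT alone.

```python
# Assumed mapping of free-text gender values to standard codes.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def fix_surname(rec):
    """Trim whitespace and capitalise the first letter only."""
    surname = rec["surname"].strip()
    return {**rec, "surname": surname[:1].upper() + surname[1:]}


def fix_gender(rec):
    """Map known gender spellings to M or F; leave unknown values untouched."""
    code = GENDER_CODES.get(rec["gender"].strip().lower(), rec["gender"])
    return {**rec, "gender": code}


# The batch rule set: each rule takes a record and returns a corrected copy.
BATCH_RULES = [fix_surname, fix_gender]


def correct_batch(records):
    """Apply every rule to every record and return the corrected copies."""
    corrected = []
    for rec in records:
        for rule in BATCH_RULES:
            rec = rule(rec)
        corrected.append(rec)
    return corrected


def validate_entry(rec):
    """Front-end check: return the exceptions that someone must resolve."""
    problems = []
    if not rec.get("surname", "").strip():
        problems.append("surname is required")
    if rec.get("gender") not in ("M", "F"):
        problems.append("gender must be M or F")
    return problems


existing = [{"surname": " smith ", "gender": "male"}]
print(correct_batch(existing))
print(validate_entry({"surname": "Jones", "gender": ""}))
```

The batch rules deliberately leave unrecognised values alone rather than guessing, and the entry check returns a list of exceptions instead of silently rejecting the record, so that the business can decide how each exception is handled.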
Ongoing data quality requires data stewardship. This means that business people take ownership of specific data items in order to ensure ongoing quality. Data is matched to business owners who have a specific interest in the high quality of that information. For example, customer account managers are made responsible for the data on their own customers and updating addresses, account information and customer groupings. Financial managers reconcile invoicing and payment totals.
Data quality should be part of any BI project plan unless the data is specifically known to be of high quality due to previous activities. Note that "I don't see how the data could be of low quality" does not count as previous activity! By default data is low quality, because data stores build up from the need to work with the application, not the data.
The data quality software market
In the last five years, market maturity for data quality has improved. There is now broad awareness of data quality in the BI space. However, customers who need higher levels of business involvement may find the problem difficult to convey to the business. To IT staff, the fact that the call centre staff get bored of selecting M or F and just leave the Gender field blank is understandable though annoying. To a call centre manager it can be shocking and unwelcome news.
Software for data quality falls into three categories:
Data profiling – this is used to automatically analyse data from legacy systems in order to integrate it with other systems. It can suggest matches between tables and columns in different environments and suggest primary and foreign key relationships. This is often the first stage of ensuring consistency between systems, particularly where the data in one or more systems is not well understood.
Data cleansing – allows users to set up simple rules and run them on datasets. Data cleansing routines are applied in two ways – one is through the correction of bulk data in batch, the other is through the application of routines in real-time at the point of data entry. Tools vary from those used by end-users to set up simple rules to those which typically require IT skills to program more involved routines.
Data integration – allows IT to set up complex transformations for any form of cleansing or integration. These tools usually run in both batch and real-time as required.
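To give a feel for what the data profiling category does, the sketch below computes basic column statistics, suggests primary key candidates and measures how far the values in one column overlap with those in another, which is one simple way of hinting at a foreign key relationship. It is a toy illustration using made-up tables and hypothetical function names, not a description of how any commercial profiling tool works.

```python
def profile_column(rows, column):
    """Count rows, empty values and distinct non-empty values for a column."""
    values = [row[column] for row in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }


def candidate_keys(rows, columns):
    """Columns with no empty values and no repeats are primary key candidates."""
    keys = []
    for col in columns:
        stats = profile_column(rows, col)
        if stats["rows"] > 0 and stats["nulls"] == 0 and stats["distinct"] == stats["rows"]:
            keys.append(col)
    return keys


def overlap(rows_a, col_a, rows_b, col_b):
    """Fraction of distinct values in B's column that also appear in A's column."""
    a = {r[col_a] for r in rows_a if r[col_a] not in (None, "")}
    b = {r[col_b] for r in rows_b if r[col_b] not in (None, "")}
    return len(a & b) / len(b) if b else 0.0


# Made-up tables standing in for a legacy system and a newer one.
legacy_products = [
    {"code": "P1", "name": "Widget"},
    {"code": "P2", "name": "Gadget"},
    {"code": "P3", "name": "Widget"},
]
erp_orders = [
    {"order_id": 10, "product": "P1"},
    {"order_id": 11, "product": "P1"},
    {"order_id": 12, "product": "P3"},
]

print(candidate_keys(legacy_products, ["code", "name"]))
print(overlap(legacy_products, "code", erp_orders, "product"))
```

A high overlap between a column in one table and a key candidate in another suggests a possible relationship, but only a suggestion: someone who understands the data still has to confirm it.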
The leading European companies are slightly in advance of North American ones in terms of data quality. They are more aware that software is only a silver bullet for very small-scale problems and understand that addressing data quality is a business process. They are also aware of the need to integrate data across language barriers, for example handling different postcode systems and different character sets. However, as businesses become more and more automated, and huge volumes of data are used for analysis in more strategic ways, business decisions are more dependent on the quality of underlying data. Data quality will continue to present problems that organisations need to address strategically, while remaining focussed on the information that is important in order to make the best use of resources.
Alys Woodward is program manager for BI and Analytic Applications in Europe
She has been an industry analyst since 2003, providing research and analysis on such diverse areas as software licensing, software asset management and RFID, in addition to BI. Prior to her analyst work she spent eight years as a technical BI consultant working in blue-chip organisations.