Saturday, May 24, 2008

The Impact of Data Quality on New Media Applications

Many of the new media applications discussed in this blog make intensive use of data stored in enterprise repositories. Such repositories include not only traditional customer databases and site-visitor data mining stores, but also blogs and wikis, whose content lives in relational databases. Supply chain applications likewise draw heavily on data stores such as inventory and supplier records.


Data quality problems affect a wide variety of information technology projects, and new media projects are no exception. In a 2005 report, Gartner estimated that data quality problems would compromise 50% of data mining projects or cause them to fail outright.

Donald Carlson, director of data and configuration at Motorola, describes data quality problems in supply chain projects: "We have had major [supply chain] software projects fail for lack of good data." Craig Verran, assistant vice president for supply chain solutions at The Dun & Bradstreet Corp., whose group helps clients improve the quality of their supplier data files, adds, "We see 20% duplicate supplier records." [see ComputerWorld]
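To make the duplicate-supplier problem concrete, here is a minimal Python sketch of one way such duplicates might be flagged: normalize company names before comparing them. The sample records, the normalize() helper, and the suffix list are hypothetical illustrations, not D&B's actual matching logic.

    import re
    from collections import defaultdict

    # Hypothetical supplier records; a real run would read the supplier master file.
    suppliers = [
        {"id": 101, "name": "Acme Industrial, Inc."},
        {"id": 102, "name": "ACME INDUSTRIAL INC"},
        {"id": 103, "name": "Baxter Tooling LLC"},
        {"id": 104, "name": "Baxter Tooling, L.L.C."},
        {"id": 105, "name": "Crestview Metals"},
    ]

    # Common legal suffixes to ignore when comparing names (illustrative list).
    SUFFIXES = {"inc", "incorporated", "corp", "corporation", "llc", "ltd", "co"}

    def normalize(name):
        """Lowercase, drop punctuation, and strip common legal suffixes."""
        cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower().replace(".", ""))
        return " ".join(t for t in cleaned.split() if t not in SUFFIXES)

    # Group records by normalized name; any group with more than one id is a likely duplicate.
    groups = defaultdict(list)
    for rec in suppliers:
        groups[normalize(rec["name"])].append(rec["id"])

    for key, ids in groups.items():
        if len(ids) > 1:
            print("possible duplicates:", key, ids)
    # possible duplicates: acme industrial [101, 102]
    # possible duplicates: baxter tooling [103, 104]

Even this crude grouping surfaces the kind of 20% duplication rate Verran describes; production matching tools add fuzzy comparison, address matching, and human review.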


How should the project manager protect a social media project from data quality torpedoes? At a minimum, a data-cleansing phase should be part of the project plan. Depending on the criticality of the project and the extent of the problem, that effort may warrant a separate, preliminary project of its own. In either case, a business case must justify the extent of the clean-up effort.

What types of data errors might a project manager encounter? Jack Olson (2003), in his book "Data Quality," explains the concept of data profiling in great depth. Here is his error typology (a brief profiling sketch follows the list):

  1. Column Property Analysis: invalid values in individual fields.

  2. Structure Analysis: invalid combinations of valid values, looking at how fields relate to one another to form records.

  3. Simple Data Rule Analysis: invalid combinations of valid values, looking at how values across multiple fields in one file must relate to form valid sets.

  4. Complex Data Rule Analysis: invalid combinations of valid values, looking at how values across multiple fields in several files must relate to form valid sets.

  5. Value Rule Analysis: results that are unreasonable.
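As a concrete illustration of the first and simplest category, column property analysis, here is a minimal Python sketch that checks each field of a record against simple per-column rules. The column names, reference values, and sample rows are hypothetical; Olson's book describes far more thorough profiling than this.

    from datetime import date

    AS_OF = date(2008, 5, 24)                         # profiling date (illustrative)
    VALID_COUNTRIES = {"US", "CA", "MX", "UK", "DE"}  # hypothetical reference list

    # Hypothetical site-visitor rows; a real profile would scan the full repository.
    rows = [
        {"visitor_id": "V-1001", "country": "US", "visits": 14, "signup": date(2007, 3, 2)},
        {"visitor_id": "V-1002", "country": "ZZ", "visits": -3, "signup": date(2009, 1, 15)},
        {"visitor_id": "",       "country": "CA", "visits": 5,  "signup": date(2006, 11, 8)},
    ]

    def column_property_errors(row):
        """Flag values that break simple per-column rules (column property analysis)."""
        errors = []
        if not row["visitor_id"]:
            errors.append("missing visitor_id")
        if row["country"] not in VALID_COUNTRIES:
            errors.append(f"invalid country code {row['country']!r}")
        if row["visits"] < 0:
            errors.append("negative visit count")
        if row["signup"] > AS_OF:
            errors.append("signup date in the future")
        return errors

    for row in rows:
        problems = column_property_errors(row)
        if problems:
            print(row["visitor_id"] or "<blank>", problems)
    # prints the two problem rows with their lists of errors

The later categories in Olson's typology apply the same idea across fields, records, and files rather than to one column at a time.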

What are our options for bad data? There are three: 1) delete it; 2) keep it as is; or 3) fix it. Statutory or standards requirements may preclude deleting the data. Not all data is fit for use by the business or operational process your project is improving or introducing, so you cannot always keep it as it is. And the cost of fixing the data, or the staff time it would take, may be prohibitive. Where you draw the line depends on the impact of the bad data.
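One way to make that line explicit is to encode the triage decision as a simple rule, as in this hypothetical Python sketch. The retention rule, the fix logic, and the sample records are illustrative assumptions, not a general policy.

    # Triage flagged records into the three options discussed above.
    RETENTION_REQUIRED = {"invoice", "contract"}   # record types a statute says must be kept

    def triage(record, errors):
        """Return 'keep', 'fix', or 'delete' plus the (possibly repaired) record."""
        if not errors:
            return "keep", record
        if record["type"] in RETENTION_REQUIRED:
            return "keep", record                  # statutory reasons preclude deletion
        if errors == ["missing country"]:          # cheap, well-understood fix
            return "fix", dict(record, country="US")
        return "delete", record                    # repair cost judged prohibitive

    flagged = [
        ({"type": "visitor", "country": None}, ["missing country"]),
        ({"type": "invoice", "country": None}, ["missing country"]),
        ({"type": "visitor", "country": "US", "visits": -3}, ["negative visit count"]),
        ({"type": "visitor", "country": "US"}, []),
    ]

    for record, errors in flagged:
        action, result = triage(record, errors)
        print(action, result)
    # -> fix, keep, delete, keep for the four sample records

The point is not the particular rules but that the delete/keep/fix decision is written down, reviewable, and tied to the business case rather than made ad hoc by whoever touches the file.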


References
Gartner (2005). "Salvaging a Failed CRM Initiative." Gartner: SPA-15-4007.

Olson, Jack (2003). Data Quality. Morgan Kaufmann Publishers.
