Showing posts with label Data Mining. Show all posts

Monday, July 20, 2009

ColdFusion Program to Read XML File

Markup encodes and transfers metadata about information such as its structure and format. XML is a markup language, derived from the much earlier SGML, which uses strings of short words to surround the data it is describing. These strings of short words are known as tags. An example would be <name>George</name> <phone>555-1212</phone>.

With XML, a structural model of the data in a file can be encoded along with the data. The structure can be as simple as name and phone, or much more complex, with fields like name and phone embedded in other fields like employee. A style sheet, an XSL, can be used to transform the tagged data in an XML file into a Web page. Likewise, a schema file, an XSD, can describe the expected structure of the data, which is useful for validation and for database operations.

ColdFusion provides a set of functions that enable a programmer to operate on XML files. A typical operation is to read an XML file, work on it, and write the results, perhaps to a database. ColdFusion defines a datatype for XML known as the XML document object. By doing this, Adobe extends the reach of its existing ColdFusion structure functions to encompass XML data as well.

Structures consist of objects, properties, and objects embedded in other objects; name and phone embedded in employee is an example. The dot operator delimits which part of a structure you want to access. The general syntax is object.object or object.property, and these can be combined to get, for instance, object.childObject.property. So to access an employee's phone number, we would write employee.phone.

The following ColdFusion code uses some of the XML functions to read an XML file. It starts by validating the XML data file to ensure that it is consistent with its schema.

<cfset myResults = XmlValidate("wits.xml", "wits.xsd")>
<cfoutput>Is Valid? #myResults.status#!</cfoutput>


Next, a file handle is obtained.

<cffile action="read" file="wits.xml" variable="gmrXML">

Then we get a ColdFusion XML Document Object datatype using our file handle and the xmlParse() function:

<cfset myXML = XmlParse(gmrXML)>

Now we are ready to loop through the data and display different field values. This test file consists of a series of incidents embedded in an incident list. In addition to its own properties, an incident will have an embedded list, or object, of one or more incident types. It will also have an embedded facilities list of zero or more facilities involved in the incident. These embedded lists correspond to embedded objects, or to child tables in a database.
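For concreteness, a wits.xml file with this shape might look like the following sketch. The element names match those used in the loop code in this post, but the values are invented for illustration:

```xml
<incidentList>
  <Incident>
    <ICN>200901234</ICN>
    <Subject>Example incident</Subject>
    <Summary>Invented summary text</Summary>
    <IncidentDate>2009-07-01</IncidentDate>
    <EventTypeList>
      <EventType>Bombing</EventType>
    </EventTypeList>
    <FacilityList>
      <Facility>
        <FacilityType>Transportation</FacilityType>
        <Indicator>Yes</Indicator>
      </Facility>
    </FacilityList>
  </Incident>
</incidentList>
```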


<cfoutput>
<cfloop index="i" from="1" to="#arrayLen(myXML.incidentList.Incident)#">
Incident<br>
#myXML.incidentList.Incident[i].ICN#<br>
#myXML.incidentList.Incident[i].Subject#<br>
#myXML.incidentList.Incident[i].Summary#<br>
#myXML.incidentList.Incident[i].IncidentDate#<br>
Event Type<br>
<!--- Inner loop to get all Event Types for the Incident --->
<cfloop index="j" from="1"
        to="#arrayLen(myXML.incidentList.Incident[i].EventTypeList.EventType)#">
#myXML.incidentList.Incident[i].EventTypeList.EventType[j]#<br>
</cfloop>
<br>Facilities<br>
<cfif StructKeyExists(myXML.incidentList.Incident[i].FacilityList, "Facility")>
<!--- Inner loop to get all Facilities for the Incident --->
<cfloop index="j" from="1"
        to="#arrayLen(myXML.incidentList.Incident[i].FacilityList.Facility)#">
#myXML.incidentList.Incident[i].FacilityList.Facility[j].FacilityType#<br>
#myXML.incidentList.Incident[i].FacilityList.Facility[j].Indicator#<br>
</cfloop>
<cfelse>
No facility for this one
</cfif>
</cfloop>
</cfoutput>



The ColdFusion arrayLen function is used in the "to" parameter of our loop to return the number of Incidents in this incidentList. There is one incidentList per file. Our dot-operator expression starts with the XML document object, then refers to the incidentList we know to be in the file (because we validated), and the particular incident is referenced with our index variable [i]:

myXML.incidentList.Incident[i].ICN

The incident number property, ICN, is accessed from the record just read from the file. For EventType, it is very similar. Since we know we will have at least one Event Type in every incident (one or more), the following code is sufficient:

myXML.incidentList.Incident[i].EventTypeList.EventType[j]

On the other hand, we may not have a facility (zero or more) and will get a null pointer exception if we try to dereference a facility in an empty facilitylist. We need to use the ColdFusion StructKeyExists function to test if a facility object is embedded in the facilitylist for the current record.

<cfif StructKeyExists(myXML.incidentList.Incident[i].FacilityList, "Facility")>

If so, then we loop through the facilities involved in the incident, else we indicate no facility.

Friday, September 5, 2008

My Coke Rewards

There has always been a rolling might to Coca-Cola marketing, and the company is now stealing a march with the My Coke Rewards loyalty program as well. The enormous success of the program (see Promo) has created an equivalently enormous data store chock full of useful information. In awarding it the 2007 Interactive Marketing Award for "Best Loyalty Marketing," Promo.com also noted that Coke has unleashed advanced technology to exploit this new information.

"Coke has invested in the collection and mining of consumer information. This data is already fueling customization on the site, and is also being used for e-mail and mobile promotions and other types of communication."

Coke is using Enterprise Decision Management (EDM) software with its new data store to automate operational decisions concerning promotional activities. Radan and Taylor (2008, ¶ 3) describe EDM as a new approach that integrates Business Intelligence data analysis with business processes, combining operational and analytical processing. This is in contrast to the separation of data from business process inherent in data warehousing.

EDM is avant-garde and Coke is being applauded for its vision and mastery (see EDM Blog). Taylor (2007, ¶ 2) notes that data generated from loyalty programs can be infused with energy from an EDM “to improve marketing, store-layout and many other decisions.” One example he gives is its application to decide what rewards or rebates actually result in a change in customer behavior.

References
Radan, N. and J. Taylor (June 2008). Enterprise decision management uses BI to power up operational systems. Teradata Magazine. Retrieved September 3, 2008, from http://www.teradata.com/tdmo/v08n02/Viewpoints/EnterpriseView/Choices.aspx

Taylor, J. (July 26, 2007). Growing your business with decision management. Retrieved September 3, 2008, from http://edm.findtechblogs.com/default.asp?item=656748

Hallmark Crown Rewards

Duncan (2004, p 226) says that Hallmark does not rely much on demographics in its marketing analysis but instead "places much more emphasis on psychographics." An artist with Hallmark explained that the relationship, rather than the age, is the essential element in their work. A program like Crown Rewards can help build an informative database of customer attributes and behavior patterns and add supersonic energy to their creative work, marketing communications, and strategic planning. It can even keep the company's market share intact, as it did for a troubled Hallmark in the early 1990s.

They were hurting in the 1990s (see Hallmark History) because the world had changed and caught them unawares. They “had fallen victim to changing buying patterns in particular among women, who still bought 90 percent of all cards sold.”

Since implementing the program in 1994, the company has avoided a dire decline. Hallmark earns twice as much revenue from Crown Rewards members as from general customers. An internal study by Philip Morris examined Hallmark's use of the consumer database and the uplifting effect the Crown Rewards program had on the company (see Philip Morris on Crown Rewards Database).

In addition to helping Hallmark, the Crown Rewards consumer database also supports the marketing efforts by Hallmark retail franchise stores, such as Mark’s Hallmark Stores (see iPass Case Study). Besides access to the Crown Rewards database, Hallmark also sells access to its high-speed data communications network named Hallmark/iPass. It is also useful for Hallmark subsidiaries such as Crayola.

When we create an account for the Crown Rewards program, Hallmark asks if its affiliated companies can e-mail us about special offers (see Hallmark Registration). This extends Hallmark's psychographic profiling capabilities to companies such as Crayola, which probably could not afford to maintain such sophisticated data analytics functionality on their own (see Crayola History).

Hallmark is a great study because it shows a hidden motive – the data motive – in loyalty programs.


References
Duncan, Tom (2005). Principles of Advertising and IMC. New York: McGraw-Hill Irwin.

Tuesday, August 19, 2008

Requiem for a Rolodex

Denise Schoenbachler, Geoff Gordon, Dawn Foley and Linda Spellman published attractive and informative guidance (see http://web.cba.neu.edu/~fsultan/Database%20Marketing.pdf ) for creating a customer database useful in Database Marketing. As a first step, they recommend that a vision document be prepared to explain the corporate need, to record user profiles, and to designate a project sponsor.

If the primary corporate use is list management, they suggest that a service bureau, such as All Media (see http://www.allmediainc.com/index.html), would be the most cost-effective approach. On the other hand, if the corporate need is for segmenting customers, correlating customer traits with purchasing behavior, or identifying which customer personas are the most profitable, then a database development effort is necessary.

Sources of Feed Data
To build such a repository, both internal and external sources of data must be filtered and merged. What types of information are typically stored, and where can we get them? Spiller and Baier (2005, p 73) list the usual suspects:

  • Demographics
  • Psychographics
  • Financial history
  • Prospect interactions
  • Prospect Interaction Dates
  • Address and phone number
  • Profitability or net financial value

Psychographics is synonymous with lifestyle and personality. Such a profile can add supersonic energy to a direct marketing campaign by segmenting prospects for more effective communications (see Spiller and Baier, 2005, p 39). It may be derived from the appearance of a customer on various lists from different publications of note. A fine and handy source, however, remains The Lifestyle Selector by Equifax, which is fed by responses to surveys and completed product registration cards.

Spiller and Baier (pp 37-39) further disclose other sources of powerful information, including the CensusCD Neighborhood Change Database. Census data (p 38) is a good source of demographic data about our prospects: the usual necessary demographics of age, gender, education level, income level, occupation, and type of housing. They (p 13) suggest that customer lists from prominent publications may have relevance not only as data to overlay on the customer database but as channels for direct-response print advertising.

There is also (p 13) the implication that web site audit logs are useful sources for data mining to see where visitors to the web site came from, as well as what landing pages they viewed and how long they stayed at the web site. The server's logs will record each visitor's IP address, and the Internet Service Provider (ISP) can be traced from that. It may be possible to purchase a list to get the demographic data and other monitoring data held by the ISP.

Additionally, there are compiled lists from third parties such as the Department of Motor Vehicles, Birth records, and other state and county court house data. Regarding information from governmental sources, Phelps and Bunker (2001, p 34) caution that

“Although the individual-level nature of public records information increases its utility for marketers, the use of such individual-specific information has contributed to consumer privacy concerns.”

Is all lost? Direct Marketers may be able to use access statutes to retrieve the desideratum. The Freedom of Information Act (FOIA) is a popular choice and there is no specific exclusion for commercial use of the information (p 35).

However, they summarize legal arguments (p 44) on the topic with the assertion that while FOIA may not restrict use based on motivation for obtaining the information, it does not guarantee access for commercial purposes. They conclude (p 46) that

“the creation of provisions that discriminate based upon the motivation of the requester seems a slippery slope that legislatures should avoid starting down. Marketers must realize, however, that public opinion, like gravity, can work to push legislation down this slope.”

How to Grow the Database
Spiller and Baier (2005, p 52) suggest starting with a simple name-and-address database and then incorporating geographic, demographic, and psychographic data about each prospect. They explain (pp 75-79) how to calculate the Lifetime Value (LTV, a.k.a. PAR) of a customer and decide whether the cost of obtaining and maintaining the customer's data in our database is a worthy exercise. The PAR ratio is cost/PAR; if this ratio is less than one, the customer's record pays its freight.
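As a back-of-the-envelope sketch of that PAR-ratio test, here is a generic discounted-margin LTV model in Python. The formula and all numbers are illustrative assumptions, not Spiller and Baier's exact worksheet:

```python
# Illustrative lifetime-value (LTV/PAR) and PAR-ratio check.
# Generic discounted-margin model with invented numbers.

def lifetime_value(annual_margin, retention_rate, discount_rate, years):
    """Sum the expected, discounted margin a customer contributes."""
    value = 0.0
    for year in range(1, years + 1):
        survival = retention_rate ** (year - 1)   # chance the customer is still active
        value += (annual_margin * survival) / ((1 + discount_rate) ** year)
    return value

def par_ratio(cost, par):
    """cost / PAR: below 1.0 means the customer's record pays its freight."""
    return cost / par

ltv = lifetime_value(annual_margin=120.0, retention_rate=0.8,
                     discount_rate=0.1, years=5)
print(round(ltv, 2))                      # expected value of this customer
print(par_ratio(90.0, ltv) < 1.0)         # cheap to keep: worth the record
```

A customer whose acquisition and upkeep cost exceeds the computed PAR would fail the test and be a candidate for purging.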

According to them (p 7) the process should start with defining personas for the prospects themselves:

  • Who they are
  • What they need and want
  • How the college fits into that
  • Where they are located
  • When they are ready to interact with the college
  • Why they interact
  • Their level of commitment
  • What channels for distribution are most efficient.

Crucial facts from the databases would be the recency, frequency, and monetary value of interactions with prospects (p 10). These will help profile the expected relationship between the prospect and your company. It is vitally important that data obtained externally be guaranteed fresh (p 44). Stale data results not only in waste but in antagonizing non-prospects who get spammed.
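A recency/frequency/monetary score can be sketched in a few lines of Python. The cut-offs and bands below are invented for illustration; a real program would tune them against its own customer file:

```python
# Minimal RFM scoring sketch with invented cut-offs.
from datetime import date

def rfm_score(last_purchase, n_purchases, total_spend, today):
    """Band each dimension 1/3/5 and sum: 3 (cold) .. 15 (best prospects)."""
    recency_days = (today - last_purchase).days
    r = 5 if recency_days <= 30 else 3 if recency_days <= 180 else 1
    f = 5 if n_purchases >= 10 else 3 if n_purchases >= 3 else 1
    m = 5 if total_spend >= 500 else 3 if total_spend >= 100 else 1
    return r + f + m

# A recent, frequent, high-spend customer scores at the top of the range.
print(rfm_score(date(2008, 8, 1), 12, 640.0, today=date(2008, 8, 15)))
```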

Lost Your Keys?
Keys are a critical part of joining data from different sources so you can be sure that you are adding the correct information from one source record to its corresponding record in another source (see Olson, 2003, p 176). Keys are fields or attributes that uniquely identify a record, such as an SSN, unlike a name. However, having such keys across all lists is a luxury, although Spiller and Baier have a remedy to take the place of unavailable keys (p 65) – a match code.

Match codes are formulated from name and address information and can be used to match records from different sources so they can be properly merged (p 66). They can also be used to purge duplicates. A counter of the duplicates discovered should be incremented on the retained record before the duplicate is deleted (p 68), because showing up on two or more lists is an indicator of the strength of interest of that prospect.
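A toy merge/purge along those lines can be sketched in Python. The key recipe here (surname fragment plus house number plus ZIP) is a common pattern, not Spiller and Baier's exact layout, and the records are invented:

```python
# Toy match-code merge/purge with a duplicate counter.

def match_code(name, street, zip_code):
    """Crude key: first 4 letters of surname + house number + ZIP."""
    surname = name.split()[-1].upper()
    house_no = street.split()[0]
    return f"{surname[:4]}{house_no}{zip_code}"

def merge_purge(records):
    """Keep one record per match code; count purged duplicates as a
    strength-of-interest signal on the retained record."""
    kept = {}
    for rec in records:
        key = match_code(rec["name"], rec["street"], rec["zip"])
        if key in kept:
            kept[key]["dup_count"] += 1
        else:
            kept[key] = dict(rec, dup_count=0)
    return list(kept.values())

lists = [
    {"name": "George Roberts", "street": "12 Elm St", "zip": "26501"},
    {"name": "G. Roberts",     "street": "12 Elm Street", "zip": "26501"},
    {"name": "Ann Chu",        "street": "9 Oak Ave", "zip": "26505"},
]
print(len(merge_purge(lists)))  # 2 unique records survive the purge
```

Note how the two Roberts records collide on the same match code even though the name and street strings differ, which is exactly the behavior a key field like SSN would not need.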

The database should also be updated with results from our own integrated communication campaigns to increase efficiencies. Schoenbachler et al. say that the different promotions, or specific motivational actions, and the responses to them are

“perhaps, the most valuable benefit of the database. Marketers no longer have to accept John Wanamaker’s lament that ‘half of what I spend on advertising is wasted – the problem is which half?’”



They also recommend that a marketing analysis be conducted on the customer database to serve two important functions: 1) profitability and 2) trends. Trend analysis is a dramatic synthesis of our most productive through least productive customer profiles – i.e., who are the whales.

What problems can confront us? Our old friend, data quality, is a usual suspect (see http://gmrwvu.blogspot.com/2008/07/impact-of-data-quality-on-new-media.html). Traditional project management failings can torpedo this type of effort, but they can work their powerful wreckage on any type of information technology effort.

Schoenbachler et al. sum up that database marketing is a necessity, not an option.

References

Olson, Jack (2003). Data Quality; Morgan Kaufmann Publishers.

Phelps, J. and M. Bunker (Winter 2001). Direct Marketers' Use of Public Records: Current Legal Environment and Outlook for the Future. Journal of Interactive Marketing.

Spiller, L. and M. Baier (2005). Contemporary Direct Marketing. Pearson/Prentice-Hall.


Thursday, July 31, 2008

Prisoner’s Dilemma in Web Data Mining and Direct Marketing

Morse and Morse have recommended a virtue theory framework to govern the impact Internet data mining and direct marketing have on privacy. They argue that corporate financial interests must be moderated by obligations to the society that gives the corporations their existence.

Their solution is for business to be temperate, a key virtue factor in their proposed framework. Temperate behavior for Internet data mining and direct marketing involves a balance between the two roles inherent in the new business empowering technologies. These technologies act as both social agents and as economic agents (Morse and Morse, 2002, p 93).

In a standard MBA text on ethics, Business Ethics, DeGeorge (2005, pp 498-500) reinforces the Morse and Morse argument for temperance with the observation that the assumption of anonymity is not valid on the Internet, yet it is the intuitive expectation of consumers based on their brick-and-mortar experience.

Additionally, like Morse and Morse, DeGeorge advocates informed consent as a primary prerequisite for Internet marketing activity in general, and this of course includes data tracking and data mining in particular. Both DeGeorge, and Morse and Morse, conclude that privacy risks on the Internet are serious enough to take precedence over economic benefits.

I believe that the Prisoner’s Dilemma model can provide a complementary perspective to clarify aspects of this argument. Data mining and its resultant direct marketing have created a Prisoner’s Dilemma between corporate marketing and individuals who surf the Web.

Game Theory has been integrated into economics and allows for the expression of social goods in equilibrium analysis. Velasquez (1992, p 321) gives a good definition of one model in Game Theory, the Prisoner’s Dilemma: a relationship between at least two parties in which each party is faced with two choices.

They can cooperate with the interests of the other party(s), or they can compete against them. If everyone competes, the payoff is no gain, a stalemate. If all parties cooperate with each other, there is a moderate but steady gain. If one party chooses to cooperate but the other chooses to compete, the competitor makes a substantial gain.
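That payoff pattern can be made concrete with a small Python sketch. The payoff numbers are invented; what matters is their ordering, which shows why, in a one-shot game, a purely self-interested party competes no matter what the other party does:

```python
# Prisoner's Dilemma payoffs as (marketer, consumer) pairs. Numbers are
# invented; only the ordering matters.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),   # moderate, steady mutual gain
    ("cooperate", "compete"):   (0, 5),   # cooperator exploited
    ("compete",   "cooperate"): (5, 0),   # defector's substantial gain
    ("compete",   "compete"):   (1, 1),   # near-stalemate
}

def best_response(opponent_move, player_index):
    """What a purely self-interested party plays against a fixed move.
    player_index 0 is the marketer, 1 is the consumer."""
    def payoff(my_move):
        pair = ((my_move, opponent_move) if player_index == 0
                else (opponent_move, my_move))
        return PAYOFFS[pair][player_index]
    return max(["cooperate", "compete"], key=payoff)

# Whatever the consumer does, one-shot logic tells the marketer to compete.
print(best_response("cooperate", 0), best_response("compete", 0))
```

Both parties defecting leaves everyone with the stalemate payoff even though mutual cooperation pays each side more, which is the dilemma.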

This seems to be the state of Internet tracking and data mining today. Consumers are implicitly cooperating with Internet data mining companies. These companies on the other hand are choosing to compete and risking the cooperator’s privacy for the sake of greater, one-sided financial reward.

According to Cooter (see Berkeley), repetitive transactions with informed consent are less likely to result in competitive behavior. Instead, they will have the more efficient long-term solution of both parties cooperating. I believe that here is where informed consent will change the nature of the tracking and data mining activities on the Internet.

As this relationship matures, consumers will become more aware of the risks data mining is taking at their expense. Tracking and data mining are repetitive transactions, with a potential benefit and risk. The benefit is an economic good while the risk is a social good. When both sides are aware of what is taking place, the cooperative option of modest but mutual gain is favored in the Prisoner’s Dilemma model. The temperance suggested by Morse and Morse is the favored outcome that grows out of full disclosure.

References

De George, Richard (2005). Business Ethics. Pearson/Prentice-Hall.

Morse, John and Suzanne Morse (2002). Teaching Temperance to the “Cookie Monster”: Ethical Challenges to Data Mining and Direct Marketing. Business and Society Review, Vol. 107, No. 1, pp. 76-97, Spring 2002

Velasquez, Manuel (January 1992). International Business Morality and the Common Good. Business Ethics Quarterly. Reprinted in Taking Sides, 9th Edition. Newton, Lisa and Maureen Ford, Editors. McGraw-Hill.

Saturday, May 24, 2008

The Impact of Data Quality on New Media Applications

Many of the new media applications discussed in this blog make intensive use of data stored in enterprise repositories. Such repositories can include not only traditional customer databases and site-visitor data mining stores but also blogs and wikis, which are stored in relational databases. Supply chain applications make intensive use of data stores such as inventories and suppliers.


Data quality problems impact a wide variety of information technology projects, and of course this includes those involving new media. In a 2005 report, Gartner estimated that data quality problems would compromise 50% of data mining projects or cause their outright failure.

Donald Carlson, director of data and configuration at Motorola, discusses data quality problems with supply chain projects: "We have had major [supply chain] software projects fail for lack of good data." Craig Verran, assistant vice president for supply chain solutions at The Dun & Bradstreet Corp., whose group assists clients with improving the quality of their supplier data files, says, "We see 20% duplicate supplier records." [see ComputerWorld]


How should the project manager protect the social media project from data quality torpedoes? At minimum, a data cleanse phase should be part of the project. Depending on the criticality of the project and the extent of the problem, such an effort should be a separate, preliminary project. A business case must be made that justifies the extent of project clean-up effort.

What types of data errors might a project manager encounter? Jack Olson (2003), in his book Data Quality, explains the concept of data profiling in great depth. Here is his error typology:

  1. Column Property Analysis: Invalid values.
  2. Structure Analysis: Invalid combinations of valid values, in this case how fields relate to each other to form records.
  3. Simple Data Rule Analysis: Invalid combinations of valid values, in this case how values across multiple fields in one file relate together for valid sets of values.
  4. Complex Data Rule Analysis: Invalid combinations of valid values, in this case how values across multiple fields in several files relate together for valid sets of values.
  5. Value Rule Analysis: Results are unreasonable.
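A minimal sketch of the first category, column property analysis, in Python; the field names and domain rules below are hypothetical, not from Olson:

```python
# Scan a column for values outside its declared domain (column property
# analysis). Rules and fields are invented for illustration.
import re

RULES = {
    "state": lambda v: v in {"WV", "PA", "OH", "VA"},          # toy domain
    "zip":   lambda v: re.fullmatch(r"\d{5}", v) is not None,  # 5-digit ZIP
    "age":   lambda v: v.isdigit() and 0 < int(v) < 120,       # plausible age
}

def profile_column(column, values):
    """Return the values that violate the column's property rule."""
    check = RULES[column]
    return [v for v in values if not check(v)]

print(profile_column("zip", ["26501", "2650", "ABCDE", "26505"]))
# flags the truncated and alphabetic ZIPs
```

The same shape extends to Olson's later categories by checking rules across fields and across files rather than one column at a time.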

What are our options with bad data? There are three choices: 1) delete it; 2) keep it as is; or 3) fix it. There may be statutory or standards reasons that preclude you from deleting the data. Not all data is fit for use by the business or operational process you are improving or introducing with your project, so you cannot always keep it as it is. And the cost of fixing the data, or the staff time it would take, may be prohibitive. Where you draw the line depends on the impact of the bad data.


References
Gartner (2005). “Salvaging a Failed CRM Initiative”; Gartner: SPA-15-4007.

Olson, Jack (2003). Data Quality; Morgan Kaufmann Publishers.