Predictive Analytics Data Prep Types

Created by Eugene Asahara
Created: 12/27/2010
Last Update: 01/27/2011

Overview  

This is a growing catalog of very brief discussions of data prep types. Data prep is the ETL (Extract, Transform, Load) of data mining and predictive analytics. It could be considered "ETL+": many of the same ETL concepts and techniques apply, but there are more kinds of transformation involved. It's said that ETL takes up about 80% of a BI development effort. The same goes for predictive analytics.

If you have other data prep types, please feel free to submit them to me at eugene@softcodedlogic.com. Include a subject line and about 250-500 characters of text describing the technique, its utilization (when to use it), and any challenges. Optionally, include your name, a web site, and/or other contact information if you'd like me to credit you properly for the contribution, link to you, and point readers to further information on your submission.

Gini Coefficient

Added: 01/27/2011  Last Update: 01/27/2011  Contributed by: Eugene Asahara

Description

The Gini Coefficient is a value that reflects the inequality of a distribution. It ranges from 0, which means the measure is evenly distributed across the population, to 1, which means everything is held by a single entity in the population. It's widely used to reflect the wealth or income distribution of a population. For example, the value for income distribution in the United States as of 2009 is 0.468 (often written as 46.8 on a 0-100 scale). Contrast that with a communist country, where the value would theoretically be near 0, or a dictatorship, where the value would theoretically be near 1 (like my home, where Laurie owns everything).

Utilization

Uneven distribution of a measure among an entity population would usually be a good indicator of behavior. Thinking of wealth distribution again, I would be able to guess things about how an "average" person from a country with a low value, a moderate value, and a high value would feel about given topics. What about products within a category where there is uneven distribution of sales? Perhaps one is very dominant. Or a group of doctors where one or two see the bulk of the "tougher" cases?

Challenges

As of this writing (1/27/2011), there is no out-of-the-box Gini calculation in T-SQL, MDX, DMX, Excel, or .NET. C# code for it would be only moderately complex.
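To give a feel for the complexity, here is a minimal sketch in Python (not one of the platforms listed above; it was chosen only for brevity). The function name gini and the sorted, rank-weighted form of the formula are choices made for this sketch, and it assumes all values are non-negative.

```python
def gini(values):
    """Gini coefficient of a list of non-negative numbers.

    0 means the measure is evenly distributed; values approaching 1
    mean it is concentrated in a single entity.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        # Assumption of this sketch: an empty or all-zero population
        # is treated as perfectly even.
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


even = gini([1, 1, 1, 1])           # everyone holds the same amount
concentrated = gini([0, 0, 0, 10])  # one entity holds everything
```

Note that with a finite population of n entities, this form tops out at (n - 1) / n rather than exactly 1, so the concentrated example with four entities comes out at 0.75.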

Aggregation

Added: 01/03/2011  Last Update: 01/03/2011  Contributed by: Eugene Asahara

Description

"Aggregations" are summary values (totals, maximums, counts, and so on) for entities and/or their attributes. For example, the total sales for a salesperson, the highest (max) value of a heart rate, or the number of tickets purchased.

Such values are the sort retrieved through the browsing of an OLAP cube.

They may also be discretized into a number of buckets.
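As a rough illustration of the kinds of values described above, the following Python sketch computes a total, a max, and a count per salesperson. The function name aggregate_sales, the (salesperson, amount) row layout, and the sample data are all hypothetical.

```python
from collections import defaultdict


def aggregate_sales(rows):
    """Return {salesperson: {"total": ..., "max": ..., "count": ...}}.

    rows is an iterable of (salesperson, amount) pairs; this layout is
    an assumption of the sketch, not a prescribed schema.
    """
    result = defaultdict(
        lambda: {"total": 0.0, "max": float("-inf"), "count": 0}
    )
    for person, amount in rows:
        agg = result[person]
        agg["total"] += amount                # running total of sales
        agg["max"] = max(agg["max"], amount)  # largest single sale
        agg["count"] += 1                     # number of sales
    return dict(result)


# Made-up sample rows for illustration only.
rows = [("Ann", 100.0), ("Bob", 40.0), ("Ann", 250.0)]
summary = aggregate_sales(rows)
```

Each of the three values per salesperson could then serve as an input attribute to a mining model.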

Utilization

 

Challenges

 

Discretization (Bucketizing)

Added: 01/03/2011  Last Update: 01/03/2011  Contributed by: Eugene Asahara

Description

Grouping a continuous range of values into a small set of discrete ranges (buckets).
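A minimal Python sketch of the idea, assuming hand-picked cut points: the function name bucketize, the age edges, and the labels are all hypothetical. The standard library's bisect module finds the bucket for each value.

```python
import bisect


def bucketize(value, edges, labels):
    """Return the label of the bucket that value falls into.

    edges must be sorted ascending; each edge is the inclusive lower
    bound of the next bucket, so len(labels) == len(edges) + 1.
    """
    return labels[bisect.bisect_right(edges, value)]


# Hypothetical age buckets, chosen only for illustration.
EDGES = [18, 35, 65]
LABELS = ["under 18", "18-34", "35-64", "65+"]

ages = [17, 18, 34, 35, 90]
buckets = [bucketize(age, EDGES, LABELS) for age in ages]
```

Fixed, hand-picked edges are only one option; equal-width or equal-count buckets derived from the data are common alternatives.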

Utilization

 

Challenges

 

Baseline Comparison

Added: 12/27/2010  Last Update: 12/27/2010  Contributed by: Eugene Asahara

Description

The main idea is to capture changes in behavior and patterns.

For example, a drop in a customer's average visits to a supermarket from last quarter to this quarter may signal imminent attrition. We don't know why, but it is a big clue. Perhaps a competitor opened nearer to that customer. Initially, the customer may go there out of convenience but be unable to find many things he needs. Over time, however, that newer store learns how to better serve that customer's needs. Eventually, that customer may stop coming altogether.
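The supermarket example can be sketched in Python as a relative change against the prior quarter. The function names, the data layout, and the -30% threshold are assumptions made for illustration, not recommended values.

```python
def baseline_change(baseline, current):
    """Relative change from baseline to current; None if baseline is 0."""
    if baseline == 0:
        return None  # no baseline to compare against
    return (current - baseline) / baseline


def flag_declines(visits, threshold=-0.3):
    """Return {customer: change} for customers at or below threshold.

    visits maps customer -> (last_quarter_avg, this_quarter_avg).
    The default threshold of -30% is an arbitrary placeholder.
    """
    flagged = {}
    for customer, (before, now) in visits.items():
        change = baseline_change(before, now)
        if change is not None and change <= threshold:
            flagged[customer] = change
    return flagged


# Made-up average weekly visits: (last quarter, this quarter).
visits = {"c1": (10.0, 4.0), "c2": (8.0, 8.0), "c3": (0.0, 2.0)}
flagged = flag_declines(visits)
```

Here only c1 is flagged (a 60% drop). The change value itself, rather than a yes/no flag, could also be fed to a model as an attribute.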

Utilization

 

Challenges

 

Cluster Entities

Added: 12/27/2010  Last Update: 01/27/2011  Contributed by: Eugene Asahara

Description

Segmenting the entities involved in a data mining system can simplify a model by distilling many attributes into a single attribute. For example, all of the characteristics of a "NASCAR Dad" or a "Soccer Mom" can be distilled into a single attribute that captures the essence of the customer.

See Cluster to Find the Relevant Categorizations.
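As a toy illustration of distilling several attributes into one, here is a bare-bones k-means sketch in Python. K-means is just one possible segmentation algorithm and is not named by this catalog; the function name, the starting centroids, and the sample points are all hypothetical, and real work would normally use a tested library or mining tool instead.

```python
def kmeans(points, centroids, iterations=10):
    """Cluster points (tuples of numbers) around the given centroids.

    Returns (labels, centroids), where labels[i] is the index of the
    cluster assigned to points[i].
    """
    def dist2(a, b):
        # Squared Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid for each point.
        labels = [
            min(range(len(centroids)), key=lambda k: dist2(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


# Two made-up attributes per customer, two obvious groups.
points = [(1, 1), (1, 2), (9, 9), (10, 9)]
labels, centers = kmeans(points, centroids=[(0, 0), (10, 10)])
```

The resulting label (0 or 1 here) is the single attribute that stands in for the original attributes in a downstream model.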

Utilization

 

Challenges

 

Categorization

Added: 12/27/2010  Last Update: 01/01/2011  Contributed by: Eugene Asahara

Description

The goal of categorization is to group the entities of a system so that we can focus on their salient aspects. For example, if we track animals, for some applications we may only care about an animal's role as a predator. If we didn't create the categorization, the activities of predators would be spread so thin across individual species as to not be noticeable.

On the other hand, if we over-categorize, we may lose the aspects of individual entities that are relevant. For example, if bears, cougars, and eagles were only known as predators, we wouldn't notice aspects of avian predators (eagles).

It's interesting to ponder as well that all attributes of entities are a categorization of some sort. The street address of a person isn't just a bit of data. It categorizes that person into a particular geographic area, a household unit, etc.

Categorization differs from Cluster Entities only in that the latter uses a segmentation algorithm to automatically find previously unknown similarities between entities and/or measure their degree.

Utilization

 

Challenges

 

Relabel

Added: 12/27/2010  Last Update: 01/01/2011  Contributed by: Eugene Asahara

Description

This is the task of standardizing values that refer to the same thing. For example, in a Product table, a product may be named many different ways (e.g., Coke, Coca Cola).

This is really a data cleansing activity as opposed to a transformation. It may also sound similar to Categorization, since one could think of relabeling as a form of categorization, and it could be. The difference is that in categorizing we group different entities by their similarities, whereas in relabeling we state that two things are really the same thing.
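A minimal Python sketch of relabeling with a lookup table follows. The CANONICAL mapping, the relabel function, and the normalization rules (lowercasing and collapsing whitespace) are assumptions for illustration; a real mapping would have to be built and maintained for the data at hand.

```python
# Hypothetical mapping from normalized variants to one canonical label.
CANONICAL = {
    "coke": "Coca-Cola",
    "coca cola": "Coca-Cola",
    "coca-cola": "Coca-Cola",
}


def relabel(name, mapping=CANONICAL):
    """Return the canonical label for name, or name itself if unknown."""
    # Normalize case and whitespace before the lookup.
    key = " ".join(name.lower().split())
    return mapping.get(key, name)


products = ["Coke", "  Coca   Cola ", "Pepsi"]
cleaned = [relabel(p) for p in products]
```

Unknown names pass through unchanged, so new variants show up in the output and can be added to the mapping later.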

Utilization

 

Challenges

 

Object/Value Pairs

Added: 01/01/2011  Last Update: 01/01/2011  Contributed by: Eugene Asahara

Description

Unpivot columns from across multiple tables into a single table of object/value pairs.
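A rough Python sketch of the unpivot, assuming each table is a list of dictionaries that share a key column. The function name unpivot, the "table.column" naming of objects, and the sample customer and order rows are all hypothetical.

```python
def unpivot(table_name, rows, key_column):
    """Turn rows of one table into (entity, object, value) triples.

    Each non-key column becomes an object named "table.column".
    None values are skipped, which is a choice of this sketch.
    """
    pairs = []
    for row in rows:
        entity = row[key_column]
        for column, value in row.items():
            if column == key_column or value is None:
                continue
            pairs.append((entity, f"{table_name}.{column}", value))
    return pairs


# Made-up rows from two tables that share the key column "id".
customers = [{"id": 1, "age": 40, "city": "Boise"}]
orders = [{"id": 1, "total": 99.5}]

pairs = unpivot("customer", customers, "id") + unpivot("order", orders, "id")
```

All columns from both tables end up in one three-column structure, keyed by the entity they describe.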

 

Utilization

 

Challenges