Predictive Analytics FAQ

Created by Eugene Asahara
Created: 01/01/2011
Last Update: 01/09/2011


This is a growing catalog of mini-blogs that lists FAQs relating to Predictive Analytics. At least for my contributions, these items reflect questions I've encountered multiple times through my experience implementing Predictive Analytics Systems. Many of these FAQs are of course just the opinion of the contributor (which will mostly be Eugene). But it's important to keep in mind that Predictive Analytics is a very new field, still in its infancy. What I write about is based on what I've actually experienced, which may differ from pure theory.

These FAQs are intended to be easily consumable by readers as well as easily writeable by the contributors (not too long, providing an example or two). Many of these posts will perhaps someday become a full-blown article or blog. Since full-blown blogs take many hours, this forum provides a way to share ideas quickly.

If you have data mining or predictive analytics FAQs that you've encountered through your implementation experience, please feel free to submit them to me at, along with a subject line, text of around 250 characters, and optionally your name, a web site and/or other contact information - if you'd like me to add a link to you and properly credit you for the contribution.

What is Self-Service Predictive Analytics?

Added: 01/09/2011  Last Update: 01/09/2011  Contributed by: Eugene Asahara

"Self-Service" is the ability to do for yourself the simpler things that we would have normally asked experts to do for us. In the case of self-service Predictive Analytics, Information Workers can perform a wide range of analytics without the need for IT to prepare the data for analysis or the need for PhDs to perform the analysis. We can't all be experts at everything, but for vital services (and the ability to provide superior best guesses based on empirical data on your own is vital if you wish to compete as an information worker these days) we can turn to technologies that let us independently do 95% of what we would like to do. We don't need to become experts and can focus on our real jobs. The ability to perform our own analysis lets us become better doctors, farmers, engineers, managers, etc.

That other 5% beyond our capabilities can be triaged to the experts, or we can settle for an answer that could be better when the improvement really wouldn't add much value. Triaging that 5% we can't do on our own allows the experts to focus on only what really requires their expert knowledge. And if we aim for "good enough" answers rather than "the best possible answers", we can take care of more things without getting bogged down chasing the best result when good enough is good enough.

I was told once that what it takes to get an A+ isn't worth it in the real world; an A is good enough for the vast majority of purposes. We can be super-optimized getting one A+ and two Bs or we can get three A's and be better equipped to handle the unpredictable stuff life throws at us. Is the well-rounded fighter with brown belts in judo and karate a better fighter than one with just a 2nd degree black belt in judo? Of course, it depends, but you get my meaning. Note as well that, on the other hand, the guy with just a 2nd degree black belt in only judo would probably whip the guy with orange belts (beginner) in judo and karate.

In the end, an enterprise where everyone's best guesses are at least "good enough" (no more shots in the dark) is better than one where only the guys with quants at their service can do better than taking shots in the dark (hopefully, the guys with the quants make excellent decisions).

Cluster (Segment) vs Classify (Decision Tree)

Added: 01/03/2011  Last Update: 01/03/2011  Contributed by: Eugene Asahara

These are similar in how they are actually utilized. We use them to consider a set of characteristics of something and infer (predict) another characteristic. They differ in that the cluster algorithm actually invents the predicted characteristic.

A good example of segmenting: suppose we are a car salesman engaging a customer roaming the floor. We don’t know anything about him except for what is outwardly apparent: how he dresses, the fact it’s a “he”, how he carries himself, how he speaks (his accent, vocabulary). Our minds then plug in these attributes and compare them to those of others we’ve encountered. How we’ve come to develop these templates of people depends upon our unique experiences. We each invent our unique classifications of people: "Just like John", "Reminds me of Mom", etc.

Decision Trees are more formal. We generally don't invent the classifications. A good example of classifying is the process by which a doctor diagnoses our problem. We start with our complaints: cough, fatigue, can't sleep, etc. The doctor plugs these symptoms into her brain and comes up with a list of possibilities matching those symptoms. She then iteratively narrows the list through a series of tests. Those tests take the form of questions posed to us, simple tests such as taking our blood pressure, and expensive tests such as EEGs. Eventually the list of possibilities is narrowed to a diagnosis.
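The narrowing process described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a trained decision tree algorithm: the conditions, findings, and the `narrow` helper are all made up for the example and are not medical guidance. The point is that the classifications (the condition names) exist before we start; each test only eliminates candidates.

```python
# Hypothetical knowledge base: condition -> expected findings.
# Every entry here is invented for illustration only.
CONDITIONS = {
    "common cold": {"cough": True, "fever": False, "high_blood_pressure": False},
    "flu": {"cough": True, "fever": True, "high_blood_pressure": False},
    "hypertension": {"cough": False, "fever": False, "high_blood_pressure": True},
}


def narrow(candidates, finding, value):
    """Keep only the candidate conditions consistent with one test result."""
    return [c for c in candidates if CONDITIONS[c].get(finding) == value]


# Start with every known condition, then apply one test at a time.
candidates = list(CONDITIONS)
test_results = [("cough", True), ("fever", True)]  # assumed answers from the patient

for finding, value in test_results:
    candidates = narrow(candidates, finding, value)
    print(f"after {finding}={value}: {candidates}")
```

A real decision tree algorithm would also work out which question to ask first (the one that splits the remaining candidates best); here the order of tests is simply given.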

Because clustering actually invents attributes, it’s usually used earlier in the modeling process. Segmenting our customers into distinct groups helps us identify what drives their behavior. These behavior drivers are attributes plugged into other data mining models.
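To make "inventing an attribute" concrete, here is a minimal sketch using a bare-bones k-means written in plain Python. The customer numbers, the choice of two segments, and the starting centroids are all assumptions made for illustration. Note that the input data has no "segment" column at all; the algorithm creates it, and we can then attach it to each customer as a new attribute for a later model.

```python
# Hypothetical customers as (age, annual_spend). Values are made up.
customers = [(22, 300), (25, 350), (24, 280), (51, 2100), (48, 1900), (55, 2300)]


def distance_sq(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: distance_sq(p, centroids[c]))
            for p in points
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean_point(members) if members else centroids[c])
        centroids = new_centroids
    return labels, centroids


# Assumed starting centroids: one customer from each end of the list.
labels, centroids = k_means(customers, centroids=[customers[0], customers[3]])

# The invented segment becomes a new attribute for downstream models.
enriched = [
    {"age": age, "annual_spend": spend, "segment": label}
    for (age, spend), label in zip(customers, labels)
]
for row in enriched:
    print(row)
```

One caveat on this sketch: the features are not scaled, so annual spend dominates the distance calculation. In practice the attributes would usually be normalized first.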

Data Mining vs Predictive Analytics

Added: 01/03/2011  Last Update: 01/03/2011  Contributed by: Eugene Asahara

Think of the difference of Data Mining vs Predictive Analytics as the difference between a gold miner digging through thousands of cubic yards of granite to find gold towards the goal of becoming financially secure and a car salesman modifying his tactics to best adapt to the tastes of the customer in front of him.

Data Mining is the exploration of a set of data for insights that may help towards goals. For example, we attempt to identify different segments of customers so that we can market to these groups in customized fashions. We utilize many technologies to assist in this exploration. Of course, we need a way to retrieve, store, and access the data (databases, Excel Spreadsheets). This data can also be massaged to extract helpful values such as sums, highest and lowest values, statistics, etc. The most sophisticated of the "massaging" are the data mining algorithms such as Segmenting, Classifying, Associating, and Forecasting. We also need tools to present the data in ways we can readily understand. These include visualizations such as bar charts, line graphs, and grids (spreadsheets) and tools to filter such as OLAP browsers or the many functions in Excel.

Predictive Analytics are best guess values for factors involved in a decision. For example, a dentist in a private practice may be thinking about raising her fees. Mathematically, that will increase revenue, but it will also drive away a number of customers. But how many? What will play into their decision? What is the threshold by which they will balance their familiarity with this dentist versus saving some money? Or balance their desire for non-emergency care (cleaning) vs saving money? These are fuzzy values playing into a decision that will help towards some goal.
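The dentist's question can be framed as simple arithmetic once we plug in a best guess for how many patients leave. The sketch below is only an illustration: the fees, the patient count, and the attrition rates are invented numbers, and the `revenue` and `break_even_attrition` helpers are hypothetical names. It assumes every patient pays the same fee, which a real practice would not.

```python
def revenue(fee, patients, attrition_rate=0.0):
    """Revenue if a fraction `attrition_rate` of patients leaves."""
    return fee * patients * (1 - attrition_rate)


def break_even_attrition(old_fee, new_fee):
    """Fraction of patients that can leave before the raise loses money.

    Solves new_fee * (1 - a) = old_fee for a.
    """
    return 1 - old_fee / new_fee


# Assumed figures, for illustration only.
old_fee, new_fee, patients = 100, 110, 500

current = revenue(old_fee, patients)
print(f"current revenue: {current:.0f}")
print(f"break-even attrition: {break_even_attrition(old_fee, new_fee):.1%}")

# Try several best guesses for the fuzzy value (the attrition rate).
for rate in (0.05, 0.10, 0.15):
    projected = revenue(new_fee, patients, rate)
    verdict = "better" if projected > current else "worse"
    print(f"attrition {rate:.0%}: projected {projected:.0f} ({verdict})")
```

The arithmetic is the easy part. The value of predictive analytics is in supplying a defensible best guess for the attrition rate, which is the fuzzy factor the questions above are asking about.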