How to do Predictive Analytics – Part 3

This post originally appeared on Applied Insights’ blog. Foviance acquired Applied Insights in November 2008, with Neil Mason joining us as Director of Analytical Consulting. As part of this acquisition, we’ve incorporated Applied Insights’ blog into our own.

Step 2 – Data Understanding

The 2nd understanding step (and the 2nd CRISP step) is about getting your hands on the data to understand how it can potentially be analysed to meet the business objectives.

Data: you can’t analyse it if you haven’t got it

For us – as consultants – this is often the point when we look at a client’s data for the first time, and there can be some problems with that. If we follow CRISP to the letter we might be tempted to ignore the data until we’ve finished the business understanding.

Data, however, is rather important to a data-driven process like predictive analytics (or data mining) and you might get a nasty shock when you come to look for it.

I can think of a couple of occasions down the years where, when we got to the data cupboard, it was all but bare. In one instance there was just about enough “data” in paper format, which we managed to enter manually into an electronic format. In another instance there really wasn’t enough volume or detail to proceed (luckily we hadn’t spent too much time on business understanding at that point).

Hence the lesson for us is to try and find out enough about the data in the initial understanding step: formats, volumes and fields should give enough up-front information to see whether the available data is likely to be sufficient. Better still, look to perform the data understanding in parallel with the business understanding.
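As a rough illustration of that up-front check, the sketch below profiles a small CSV extract for volume, fields and a crude type guess. The sample data, the field names and the `profile_csv` helper are all invented for the example; a real extract would need more careful handling of dates, encodings and so on.

```python
import csv
import io

def profile_csv(text):
    """Return row count plus, per field, the non-empty count and a rough type guess."""
    reader = csv.DictReader(io.StringIO(text))
    fields = reader.fieldnames or []
    summary = {name: {"non_empty": 0, "numeric": 0} for name in fields}
    rows = 0
    for row in reader:
        rows += 1
        for name in fields:
            value = (row[name] or "").strip()
            if not value:
                continue
            summary[name]["non_empty"] += 1
            try:
                float(value)
                summary[name]["numeric"] += 1
            except ValueError:
                pass
    for stats in summary.values():
        # A field counts as numeric only if every non-empty value parses as a number.
        all_numeric = stats["non_empty"] > 0 and stats["numeric"] == stats["non_empty"]
        stats["type"] = "numeric" if all_numeric else "text"
    return rows, summary

# Hypothetical sample extract, invented for illustration.
sample = """customer_id,tariff,monthly_calls
1001,standard,120
1002,premium,
1003,standard,95
"""

rows, summary = profile_csv(sample)
print("rows:", rows)
for name, stats in summary.items():
    print(name, stats["type"], stats["non_empty"])
```

Even a summary this crude can show early on whether a key field is sparsely populated or arrives in an unexpected format.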

Slowly, slowly …

There is also a tendency in predictive analysis, data mining and statistical analysis to not spend too much time on this step, and rather to jump ahead to CRISP step 4 and start to build predictive models. This can have consequences: you might miss something important that would significantly improve your understanding, and hence the modelling, or, in the worst case, you might end up producing a model which is invalid because there was something wrong with the data, or at least with your understanding of it.

For me this step is mainly about data exploration. John Tukey was the pioneer in this field; he probably coined the phrase Exploratory Data Analysis (EDA), and at the very least he wrote the book.

The types of analyses we are talking about in this sense are actually close to the ones that are undertaken most often in the world at large.

Most Business Intelligence/Reporting tools allow you to view data summaries and graphics which we might describe as predominantly exploratory, e.g. BI software from the likes of Business Objects (including Crystal Reports these days), Microsoft (ProClarity), Cognos, etc. If you are new to the data, examining the reports contained in these tools (typically they are already installed and configured) is a good starting point.

Where there isn’t a BI tool in place, and even if there is, there is often value in looking at pertinent reports emanating from the incumbent operational tools, e.g. CRM, ERP and HR tools and applications. Here we are thinking of the usual suspects: SAP, Siebel, ePiphany, Peoplesoft, Softscape, etc., as well as any tools from smaller niche vendors and internally developed applications.

When looking at visitor and customer behaviour on web sites there are a number of similar tools (which deal with the specificity of that data) that we talk about in other blogs. I tend to think of these as BI tools for digital channels.

This isn’t a hard and fast rule, but because predictive analysis tends to look at the data in a different way we almost always end up using other tools to explore the data. If you are planning to use either SPSS or SAS modules for the predictive effort then they will probably have all the exploratory gadgets you need for this step. I often think of these tools as the Swiss army knives of analytics in that they contain a wide array of data management and exploratory techniques, as well as the more multivariate/predictive ones. The trade-off is that they require somewhat more expertise than you would need to use a pre-configured BI tool. In my experience this is usually within the reach of numerate researchers and business users.

For me this exploratory phase has 3 prime objectives:

  1. Identify any potential issues like inordinately high or low values (outliers/extremes). These may or may not be errors in the data but they will potentially influence the analysis inordinately
  2. Begin the detective work of trying to establish which factors (or variables) may influence the predictive outcome we are interested in
  3. More generally ‘get to know’ the data. Once we know it better we can identify which modelling approaches are appropriate e.g. does the data have certain properties which might make specific modelling techniques appropriate, or inappropriate
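For the first objective, one common rule of thumb (a Tukey-style fence, swapped in here purely as an illustration and not the only option) is to flag values lying more than 1.5 times the interquartile range outside the quartiles. The sketch below uses invented call counts and a hypothetical `iqr_outliers` helper; flagged values are candidates for investigation, not automatic deletions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical monthly call counts, invented for illustration.
monthly_calls = [95, 100, 105, 110, 115, 120, 125, 130, 135, 900]

flagged = iqr_outliers(monthly_calls)
print("values to investigate:", flagged)
```

The multiplier `k` is a convention rather than a law; widening it flags fewer values and narrowing it flags more.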

More specifically this means we would aim to:

  • Identify in detail the data sources, fields (variables), formats and definitions
  • Examine the distributions of individual fields
  • Look for relationships between the fields
  • Identify and test transformations that might be more powerful expressions of the data, as well as those which might be needed to clean the data
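To make the last two points concrete, here is a minimal sketch that measures the linear relationship between two fields and then tests whether a log transformation expresses that relationship more strongly. The fields `tenure_months` and `monthly_spend` and their values are invented, and the `pearson` helper is a plain textbook implementation rather than any particular tool's method.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical fields, invented for illustration.
tenure_months = [2, 6, 12, 18, 24, 36, 48, 60]
monthly_spend = [5, 8, 12, 20, 35, 70, 150, 400]

raw_r = pearson(tenure_months, monthly_spend)
log_r = pearson(tenure_months, [math.log(v) for v in monthly_spend])

print(f"raw spend vs tenure:  r = {raw_r:.2f}")
print(f"log(spend) vs tenure: r = {log_r:.2f}")
```

With this made-up data the log-transformed field tracks tenure more closely than the raw one, which is the kind of evidence that might justify carrying the transformed version forward.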

All of the above can be enacted by generating summary metrics/statistics and by graphical visualisations. Ideally by both. In the 2 illustrations below we see anonymous data from 2 mobile operators showing monthly call levels (the vertical axis shows the number of customers at each monthly call level shown in the horizontal axis). The mean call level for both is the same – approximately 130 calls a month – but we can see that the distribution (shape) is quite different.

Call pattern 1 has two inordinately high (probably extreme) sets of call values. As a rule of thumb we should always look at the median (the middle value) as well as the mean. In this case the median is closer to 120, which is arguably a better measure of the central/typical value than the mean in this example – and that can often be the case! We may investigate the higher values and find that they are data errors or special customers who are not interesting in the context of our analysis. Either way, if we include them, they are likely to inordinately influence any analysis that we perform.
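The effect is easy to reproduce. The figures below are invented (they are not the operator data from the illustration), but they show the same pattern: a couple of extreme values drag the mean upwards while the median barely moves.

```python
import statistics

# Hypothetical monthly call counts: eight typical customers plus two extreme ones.
calls = [110, 115, 118, 120, 120, 122, 125, 130, 200, 240]

mean_calls = statistics.mean(calls)
median_calls = statistics.median(calls)

# Recompute the mean with the two highest values set aside for investigation.
typical = sorted(calls)[:-2]
mean_typical = statistics.mean(typical)

print(f"mean with extremes:    {mean_calls}")
print(f"median with extremes:  {median_calls}")
print(f"mean without extremes: {mean_typical}")
```

Setting values aside like this is only for comparison; whether they are errors or genuinely special customers still has to be established.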

Call pattern 2 is a different animal altogether (though the mean value is more-or-less the same). It looks like we may have 2 specific groups of callers: one centred around 100 calls per month and the other around 230. If we couldn’t explain this ourselves we would need to dig deeper into the data, refer back to the business understanding already gleaned, or consult the domain experts further to try and understand the phenomenon. It may well be that we want to classify these 2 groups separately and treat them as such in any ensuing analysis. For example, we may need 2 separate models for them depending on our objectives.
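A quick text histogram is often enough to reveal this kind of two-humped shape. In the sketch below the call counts, the bin width of 20 and the cut point of 160 are all assumptions chosen for the example; in practice the cut would come from inspecting the data and talking to the domain experts.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin, keyed by each bin's lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)

# Hypothetical call counts with one cluster near 100 and another near 230.
calls = [85, 92, 95, 98, 100, 101, 104, 108, 112, 118,
         210, 222, 225, 228, 230, 233, 236, 241, 250]

bins = histogram(calls, bin_width=20)
for edge in range(min(bins), max(bins) + 20, 20):
    print(f"{edge:>3}-{edge + 19:<3} {'#' * bins[edge]}")

# Cut point picked by eye from the empty bins between the two humps.
cut = 160
low_group = [v for v in calls if v < cut]
high_group = [v for v in calls if v >= cut]
print(len(low_group), "customers below the cut,", len(high_group), "at or above it")
```

Once the two groups are separated, each can be summarised, and if necessary modelled, on its own.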

This is a simple example but the message is that it pays to look at the data in a variety of ways; in tables and graphics, to really get to understand its shape and to identify the characteristics which may prove problematic, or conversely may help us in our predictive quest. Another potentially interesting observation from these distributions is that – once we address the data issues discussed – it would appear that we have some evidence of the presence of normal distributions in the calling patterns. This in itself opens up the possibility of using statistical analyses where this shape is one of the assumptions, certain significance tests for example. We’ll discuss different modelling techniques in more detail later but some make more assumptions about the data than others.

How clean is my data?

The sources for this data exploration can vary from flat (text/csv) files and spreadsheets to data warehouses, and all points in-between. The more ‘raw’ the data is, the more that integrity can be a major issue for the predictive effort. It can mean that the next step – data transformation – is a significant one.

One thing to bear in mind is that we don’t usually need to clean every single data item for this kind of analysis as some are more important than others. Part of this step is to make that assessment.
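One way to make that assessment is to write simple validity checks only for the fields expected to matter and to measure how often each one fails. Everything in the sketch below is hypothetical: the records, the choice of which fields get a rule, and the `failure_rates` helper.

```python
# Hypothetical records, invented for illustration.
records = [
    {"customer_id": "1001", "monthly_calls": 120, "postcode": "AB1 2CD"},
    {"customer_id": "1002", "monthly_calls": None, "postcode": ""},
    {"customer_id": "1003", "monthly_calls": -5, "postcode": "EF3 4GH"},
]

# Only fields expected to feed the analysis get a rule; postcode is left unchecked here.
rules = {
    "customer_id": lambda v: bool(v),
    "monthly_calls": lambda v: v is not None and v >= 0,
}

def failure_rates(rows, checks):
    """Share of rows failing each field's check."""
    return {
        field: sum(not check(row.get(field)) for row in rows) / len(rows)
        for field, check in checks.items()
    }

rates = failure_rates(records, rules)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} of rows fail the check")
```

Fields with high failure rates that also matter to the prediction are the ones worth cleaning first.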

At the other end of the scale a well structured – and almost inevitably already cleaned – data warehouse is usually the best place to start as a lot of the work has been done. Even here, however, we need to exercise some caution. In a data warehouse the data will have been cleaned with specific operational/analytical functions in mind. When we start to look at the data with a view to prediction we are, in effect, holding up a new lens to the data. In other words we shouldn’t assume that even the most well ordered and diligently cleaned data warehouse is totally clean for our purposes.

Is there a data expert in the house?

I mentioned domain expertise in the previous blog as an essential ingredient in the analytical process. This subject matter expertise, which we tend to think of in the context of understanding the business problem space, also extends to the data. Depending on the context you might be the person who knows the data best, in which case it is you. Often, though, there are those in the business who know the data more intimately; they could be Database Administrators (DBAs) or data analysts, for example. If so they should be an integral part of the team.

Statistically speaking

Another good way to think of data understanding/exploration is that, in the analytical sense, we are typically looking at a small number of variables at a time. In statistical terms we are more often thinking of univariate (one variable – e.g. just looking at the distributions of call volume as we saw earlier) and bivariate (two variable – e.g. investigating patterns of call volume over time) analyses, rather than multivariate (multiple variable) analyses. Multivariate analytical techniques are typically used later when we get into the main predictive effort. That is not to say that multivariate techniques cannot be exploratory – cluster analysis is both – but in the main we tend to explore data using techniques which allow us to visualise and quantify patterns looking at one, two and perhaps three variables at a time.

Can I do my analysis now please?

So by now our appetite for the main task – building and deploying predictive models – has been whetted. However, up to now, we have been looking at the data in an ad-hoc way. It is almost inevitable that we will need to use the learning from this phase to build a data set, or sets, ready for analysis. Hence before we get to the fun part we need to do a bit more work to transform the data into the right shape for that step.

Comments

  1. Just found this blog, and am favorably impressed! I especially like the amplification of the CRISP-DM process related to data. I’ve found the exact same problem with the data at some companies: they think they have data, but when you look, either (1) there really isn’t any data populated that is interesting, or (2) because the warehouse wasn’t built with data mining in mind, you can’t join together the tables (and hence the key fields) you would like for data mining.

    As with most of the data mining process, it is iterative and you must continually look ahead and then back. Of course, this makes explaining the process to novices more challenging.

    Dean Abbott
  2. Thanks for the amplification Dean. I certainly concur with your second point as well. I can think of quite a few cases where we have had to go back to the operational data feeds coming into the warehouse to get at some of the data for mining. In fact I’m on a project right now which is somewhat like that.

    John McConnell