By Richard Lewis (Director @ Model Citizen)
Last week I was fortunate enough to be invited to talk at the inaugural Data!Data!Data! networking event. It was a great evening and I’d like to thank everyone for raising some really interesting points, challenges and questions. I haven’t captured all of that below; this is just a very brief, high-level summary of the main part of the talk.
Data!Data!Data! is a quote from Sherlock Holmes; the concluding part of the quote is “I can’t make bricks without clay”. There’s a seemingly obvious meaning behind it, but to the data scientist it’s more complex. Why should we want bricks anyway? To make houses? (“I can’t make houses without bricks!”) Why should we want houses? To sell; to live in?
In our world of data science there is invariably more than one layer between data and solution. We sometimes need multiple conclusions from previous insight (treated as new raw data) to solve a new problem. This is perhaps easier to discuss with a more practical example: a problem I call “The Pomelo Problem.”
Imagine a scenario in a supermarket, where we have run a very simple pairwise correlation analysis (or product affinity analysis). The analysis has identified a correlation between vanilla pods (an infrequently purchased item) and eggs. This correlation shows something useful: maybe the customer is on a baking-related shopping mission. I am able to see beyond the pattern and begin to form my conclusion.
Unfortunately, the analysis also shows exactly the same numbers for pomelos and milk. This is a problem: pomelos are an obscure fruit from South-East Asia. They’re only in season for one or two months a year and are likely to be added only as a top-up item to big shops. The trouble is, the analysis so far cannot differentiate between the two examples.
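To make this concrete, here is a toy sketch of the kind of pairwise affinity analysis described above, using lift as an example metric. The baskets, product names and numbers are all invented for illustration (not the analysis actually run), and the toy data is deliberately constructed so that the baking pair and the pomelo pair come out with identical scores:

```python
# Toy pairwise product-affinity sketch using lift.
# All basket data here is invented; it is constructed so that
# (vanilla pods, eggs) and (pomelo, milk) score identically.
from itertools import combinations
from collections import Counter

baskets = [
    {"vanilla pods", "eggs", "flour"},
    {"vanilla pods", "eggs", "sugar"},
    {"pomelo", "milk", "bread"},
    {"pomelo", "milk", "tea"},
    {"eggs", "bread"},
    {"milk", "flour"},
]

n = len(baskets)
item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(basket, 2))

def lift(a, b):
    """Lift > 1 means a and b co-occur more often than chance would suggest."""
    p_a = item_counts[a] / n
    p_b = item_counts[b] / n
    p_ab = pair_counts[frozenset((a, b))] / n
    return p_ab / (p_a * p_b)

print(lift("vanilla pods", "eggs"))  # the "baking mission" pair
print(lift("pomelo", "milk"))        # the pomelo pair: same number
```

Both calls return the same lift, which is exactly the Pomelo Problem: the numbers alone give the algorithm no way to tell a genuine shopping mission from seasonal top-up noise.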
We need to try to improve our algorithm. A second tier of correlation is introduced. Vanilla pods are also found to be correlated with flour. This is good news, as it further supports a shopping mission I can believe in. Yet my analysis also correlates (with exactly the same numbers) pomelos and bread. These two are only likely to appear correlated because of the randomness of additional items added to big baskets, and I’m back in the same position.
The challenge the Pomelo poses to my algorithm is that it produces a useless scenario with exactly the same numbers as a useful one. The task is to improve my algorithm to the point where it can distinguish between the two without needing additional human intervention. I need to provide the algorithm with additional data that isn’t fed from a source system. I need to give it my (human) knowledge.
One possible method is to provide the algorithm with additional association data. For example, I can link vanilla pods and baking, and eggs and baking, as metadata. Or, as one delegate suggested on the evening, maintain recipes in the data. This could be great, but it causes additional problems. Who is going to write this data? Who is going to maintain it? Who is going to look after data quality and consistency? Across tens of thousands of product lines this could be an exceedingly costly task. Would the benefit gained from using this data (and the resulting algorithm improvement) be enough to build a business case for the investment?
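As a sketch of how such association metadata might be used, a simple shared-category check could filter the correlated pairs. The categories dictionary below is hypothetical hand-maintained knowledge, exactly the kind of costly-to-curate data discussed above:

```python
# Hypothetical human-maintained category metadata. In practice this is
# the expensive artefact: someone must write and maintain it across
# tens of thousands of product lines.
categories = {
    "vanilla pods": {"baking"},
    "eggs": {"baking", "dairy"},
    "flour": {"baking"},
    "pomelo": {"fruit"},
    "milk": {"dairy"},
    "bread": {"bakery"},
}

def shares_mission(a, b):
    """Keep a correlated pair only if the two items share a category."""
    return bool(categories.get(a, set()) & categories.get(b, set()))

print(shares_mission("vanilla pods", "eggs"))  # both tagged "baking"
print(shares_mission("pomelo", "milk"))        # no shared shopping mission
```

Under these assumed categories, the vanilla-and-eggs pair survives the filter while the pomelo-and-milk pair is discarded; the question raised above is whether maintaining dictionaries like this across the full product range can be justified.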
There are, of course, countless ways in which the basic algorithm in The Pomelo Problem could be improved to resolve this issue. The point is not the specific techniques we should be using, but that there comes a point where no algorithm can differentiate between scenarios; they reach their limit.
We can never say the algorithm is complete or the solution fully optimised, partly because of the issue The Pomelo Problem highlights, but also because we must assume our competitors have implemented the same basic algorithm, meaning competitive advantage has not yet been attained, only competitive parity. The challenge for the data scientist is to recognise The Pomelo Problem, locate examples and identify solutions to them.
This is the essential human aspect of analytics, and it is more of an art than a science. Introducing the human element also introduces potential bias into the results, so it needs to be considered very carefully. But we’re not yet in a position where an algorithm can be allowed to run indefinitely on the assumption that it’s the best it can be. There may be a disadvantage to introducing the human element, but doing so will enable better and faster progress. In my view it remains an essential input to any model.