Oh data, that I may see you, use you and… wait, I missed that one…

The amount of data at your fingertips is staggering. However, the reality of data is its use, not its presence. What data do you want to measure? What data do you want to capture? What data do you need to analyze? Then map those three into a four-quadrant system that places the data in the quadrants relative to the “time” allotted for retrieving it.
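One way to picture that mapping is a small sketch. Everything here (the names, the 10-second urgency threshold, the quadrant labels) is hypothetical, just to make the measure/capture/analyze-versus-time idea concrete:

```python
# Hypothetical sketch: place a data item in a quadrant by whether it
# needs analysis and by the "time" allotted for retrieving it.
from dataclasses import dataclass

@dataclass
class DataItem:
    name: str
    measured: bool          # do we measure it?
    captured: bool          # do we capture it?
    needs_analysis: bool    # does it need analysis?
    retrieval_secs: float   # time allotted to retrieve it

def quadrant(item: DataItem) -> str:
    """Quadrant = (needs analysis?) x (urgent retrieval?)."""
    urgent = item.retrieval_secs <= 10  # threshold is an assumption
    if item.needs_analysis and urgent:
        return "analyze now"
    if item.needs_analysis and not urgent:
        return "analyze later"
    if urgent:
        return "serve raw, fast"
    return "archive"

print(quadrant(DataItem("engine-temp", True, True, True, 2.0)))  # analyze now
```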

Like any system, not all data can be available all the time. So, in the system where you are capturing and measuring, what data do you need to have available in real time? What data do you need to make available in near real time, where some lag is allowed? You get the idea: the next level is data that is available but with noticeable lag, and then of course archival data. Those doomed to having relived history once want archived data so they don’t relive history twice. The problem is the reality of the amount of data you can collect very quickly and the impact that has on your ability to get at the data you need.
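Those tiers can be sketched as a simple classifier. The latency budgets below are illustrative assumptions, not standards; pick the cutoffs your own system actually needs:

```python
# Hypothetical availability tiers, with assumed latency budgets.
def availability_tier(lag_seconds: float) -> str:
    """Classify data by how quickly it must be made available."""
    if lag_seconds < 1:
        return "real-time"
    if lag_seconds < 60:
        return "near real-time"   # some lag allowed
    if lag_seconds < 3600:
        return "lagged"           # available, but noticeably behind
    return "archival"             # so we don't relive history twice

for lag in (0.2, 30, 900, 86400):
    print(lag, "->", availability_tier(lag))
```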

Data scientists often define “Big Data” as data larger than the system can handle in the time required, i.e., if an action is required in less than 10 seconds and it takes more than a minute to process the data, you have a big data analysis problem. You don’t have a problem with data that is big, rather data that is too big for the analysis system to handle in the required response time. Now, I would argue, and I’ve heard data scientists argue as well, that without automation “big data” is simply too big for the human mind to process in a timely fashion. So effectively, data presents two problems in the measure-and-capture range: how quickly can the automated analysis be completed compared to the required time of data production? Are there human bottlenecks in the system? And finally, does the overall system have any areas that can be sped up?
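That definition reduces to a single comparison: “big” is relative to the deadline, not the byte count. A minimal sketch, with the function name and example numbers as assumptions:

```python
# "Big data" here means: too big for THIS system in THIS response window.
def is_big_data_problem(processing_secs: float, required_secs: float) -> bool:
    """True when analysis takes longer than the action allows."""
    return processing_secs > required_secs

# Action required in under 10 seconds, analysis takes a minute:
print(is_big_data_problem(processing_secs=60, required_secs=10))  # True
# Same data, but a whole day to act on it: no big data problem.
print(is_big_data_problem(processing_secs=60, required_secs=86400))  # False
```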

The reality of data, however, is critical: if you speed up a system, that system must benefit from the change. I have a great example of this from a real situation. There is a company that builds engines. Certain parts of the engine they do not build in house. Of those parts, some require lubrication of a maximum age, i.e., they cannot have lubricant that is older than four days or that has sat on the part for more than four days. However, in speeding up the supply chain, the part in question arrived at the factory earlier and sat on a warehouse floor. The manufacturing process for the entire factory had to be modified at significant cost to add a step: removing and reapplying lubricant to the part in question. Where the old system worked, cost less, and delivered the same product in less time with fewer steps, speeding up one part of the process actually delayed everything and added complexity.

Now the third dimension of data is revealed: process improvement has to run along the entire chain. Speeding up a process because you can doesn’t help if you force process change throughout the system. Data delivered before the system is ready sits. If that data has a TTL for its relevance, you may, in building a faster delivery system, force the data to become obsolete before it’s used.
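The TTL trap is easy to state in code. A minimal sketch, assuming a simple delivered-versus-used timestamp check, with the four-day window borrowed from the lubricant example above:

```python
from datetime import datetime, timedelta

def still_relevant(delivered_at: datetime, used_at: datetime,
                   ttl: timedelta) -> bool:
    """Faster delivery only helps if the data is used within its TTL."""
    return (used_at - delivered_at) <= ttl

ttl = timedelta(days=4)  # like the lubricant's four-day maximum age
delivered = datetime(2020, 1, 1)
print(still_relevant(delivered, datetime(2020, 1, 3), ttl))  # True
print(still_relevant(delivered, datetime(2020, 1, 6), ttl))  # False: arrived early, sat too long
```

The point of the sketch: delivering earlier moves `delivered_at` back, and if the consumer isn’t ready, `used_at` stays put, so the same speed-up that looks like an improvement is exactly what expires the data.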

A systems view of “input, process, output” requires that you take a systems view of improvement. Input that just sits and waits isn’t always desirable. Balance what you are capturing against what you are measuring: of the data you are measuring, what do you want to capture, and finally, what needs analysis? Not all data needs to be analyzed to be used effectively. But data that requires analysis isn’t useful if the system delivers it at the wrong time.


Too much data in my inbox to wade through…