Pros and cons of powering your data pipeline with CSV files

Invented, or at least first encountered, as early as 1972 (according to Wikipedia), the CSV (Comma-Separated Values) file format still remains an important part of data integration. Because even early software written in Fortran supported it and it is easily human-readable, it remains an all-too-frequent go-to way of exporting and importing structured data.

Today most PC operating systems propose Excel as the default application for opening them, and Excel's native files can in turn easily be exported as CSVs. In doing so, the human factor is introduced into the data processing chain, with all the error potential that it entails.

Human vs. Machine

As an example, let's consider a situation where revenue and spending data is consolidated weekly from 5 countries with 10 cities each. If p represents the probability of human error in each of the source files and P the probability of obtaining a faulty annual report aggregating the weekly ones, then P = 1 − (1 − p)^(5 × 10 × 52) = 1 − (1 − p)^2600.
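For the curious, the formula is easy to evaluate in a couple of lines of Python (the 2,600 figure is simply 5 countries × 10 cities × 52 weeks):

```python
# Probability that at least one of the weekly city files contains a human
# error, for a given per-file error probability p: P = 1 - (1 - p)^n.
n_files = 5 * 10 * 52  # countries x cities x weeks = 2,600 files per year

for p in (0.01, 0.001):
    P = 1 - (1 - p) ** n_files
    print(f"p = {p:.1%} per file -> P = {P:.10%} for the yearly report")
```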

In other words, if you tolerate a 1% error rate at the weekly city level, you would have to live with a 99.999999999% chance of error in the yearly report (getting an error-free report would be about 11,700 times less likely than winning the big lottery prize in France!). Even if you tolerate just a 0.1% human error rate at the lowest level, you are left with only about an 8% chance of an error-free yearly report, i.e. roughly a 92% chance that it is wrong.

Validate…

As much as humanism motivates many of us, we are not perfect machines. An engineer's natural reaction to lower the risk of error would be to introduce redundancy and decrease p by having someone else check the work of the first person. That is a good approach, as it reduces p down to p², which turns 1% into 0.01%. The other approach is to define a set of formulas that detect errors and to create automated mechanisms for applying them. In fact, well-trained accountants usually build into their spreadsheets formulas that flag possible errors, and the accounting practice of "balancing the books", which consists of checking that all revenues and spendings are accounted for by computing the difference between the two, is the ultimate example of such number checking.
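As a trivial illustration of turning such a formula into an automated check, here is a minimal sketch in the double-entry spirit of balancing the books; the "debit" and "credit" column names and the rounding tolerance are invented for the example, not taken from any real file layout:

```python
import csv

def balance_gap(path, tolerance=0.01):
    """Compute the gap between total debits and total credits in a CSV file.

    A gap larger than the rounding tolerance is the kind of indicator an
    accountant would flag as a possible error before the numbers move on.
    """
    total_debit = total_credit = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total_debit += float(row["debit"])
            total_credit += float(row["credit"])
    gap = total_debit - total_credit
    return gap if abs(gap) > tolerance else 0.0
```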

Platforms such as ForePaaS are well equipped to be part of hybrid data processing pipelines where humans are required to upload .csv files through a dedicated interface (Drop Zones) and Python scripts implement checks at different levels. For one of our large customers, who requires files to be uploaded by a large number of third-party subcontractors, we implemented a set of rules that work at both the "structural" and "semantic" levels.

A structural rule will check things such as the existence of a value and its data type within the file, and is primarily geared towards avoiding clerical typing or manipulation errors. A file will be rejected or quarantined if a variable is missing or not properly formed, and an error message will be displayed immediately for correction.
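To make that concrete, here is a minimal sketch of what such a structural check could look like in Python; the required columns and their types are hypothetical, not the actual rule set used for that customer:

```python
import csv

# Hypothetical structural rules: column name -> type its values must parse as.
REQUIRED_COLUMNS = {"country": str, "city": str, "week": int,
                    "revenue": float, "spending": float}

def structural_errors(path):
    """Return the structural problems found in a CSV file:
    missing columns, empty values, or values of the wrong type."""
    errors = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = set(REQUIRED_COLUMNS) - set(reader.fieldnames or [])
        if missing:
            return [f"missing column(s): {', '.join(sorted(missing))}"]
        for line_no, row in enumerate(reader, start=2):  # header is line 1
            for column, expected_type in REQUIRED_COLUMNS.items():
                value = row.get(column)
                if value is None or value == "":
                    errors.append(f"line {line_no}: empty value for '{column}'")
                    continue
                try:
                    expected_type(value)
                except ValueError:
                    errors.append(f"line {line_no}: '{value}' is not a valid "
                                  f"{expected_type.__name__} for '{column}'")
    return errors
```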

A semantic rule will apply logic to the content of the files in order to accept or reject them. For example, a "Degree of Completion" indicator is expected only to go up towards 100%, and any decrease from one week to the next would prompt an error requiring a correction or a stronger justification.
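A minimal sketch of that semantic rule, assuming the indicator arrives as a weekly series of percentages (the data layout is invented for the example):

```python
def completion_regressions(weekly_completion):
    """Flag any week where a 'Degree of Completion' indicator decreases.

    weekly_completion: list of (week, percent_complete) tuples, sorted by
    week. Returns the weeks that need a correction or a justification.
    """
    flagged = []
    for (_, prev_pct), (week, pct) in zip(weekly_completion, weekly_completion[1:]):
        if pct < prev_pct:
            flagged.append((week, prev_pct, pct))
    return flagged

# Example: completion drops from 60% to 55% in week 14 -> flagged.
print(completion_regressions([(12, 50.0), (13, 60.0), (14, 55.0), (15, 70.0)]))
```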

A third level of validation can occur by comparing the incoming data to an "expected value". A predictive model can estimate what a normal value would be based on historical data and other sources such as the weather or public holidays. Any deviation of a certain magnitude from this nominal value would in turn generate an alert prompting a confirmation of the data quality before it is injected into the production pipeline.
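The gist of such a check fits in a few lines; the expected value would come from the predictive model, and the 20% tolerance below is purely illustrative:

```python
def deviation_alert(observed, expected, max_relative_deviation=0.2):
    """Return an alert message if the observed value deviates from the
    model's expected value by more than the tolerated relative margin,
    otherwise None."""
    if expected == 0:
        return None  # no baseline to compare against
    deviation = abs(observed - expected) / abs(expected)
    if deviation > max_relative_deviation:
        return (f"observed {observed} deviates {deviation:.0%} from the "
                f"expected {expected}; please confirm before ingestion")
    return None

print(deviation_alert(observed=45_000, expected=80_000))
```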

… and automate!

Automation, on the other hand, takes care of error-prone repetitive manipulations and calculations, applying the same logic to all data as long as the structure is as expected. It needs to be properly orchestrated: tasks can be serialized or parallelized for performance, and conditional triggering can help capture complex logic that deals with a wide variety of situations. For example, a business unit that does not sell certain product lines will not be included in the allocation of their fixed costs, while those that do sell a product line will be impacted proportionally to their share of overall sales volumes in any given period. This means that sales numbers need to be crunched before the fixed-cost allocation steps, and if any later revision of the sales numbers reveals product line sales that did not impact a business unit before, then the numbers of all impacted business units need to be recalculated. If done manually, such a recalculation rule would be hard to impose, given the unplanned workload it would require from other people.
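To make the allocation rule concrete, here is a simplified sketch (business unit names and figures are invented): the fixed costs of a product line are split only across the business units that actually sold it, proportionally to their share of that line's sales.

```python
def allocate_fixed_costs(fixed_cost, sales_by_unit):
    """Split a product line's fixed cost across business units in
    proportion to their sales of that product line. Units with zero
    sales receive no allocation.

    sales_by_unit: dict mapping business unit -> sales volume for the period.
    """
    total_sales = sum(sales_by_unit.values())
    if total_sales == 0:
        return {unit: 0.0 for unit in sales_by_unit}
    return {unit: fixed_cost * sales / total_sales
            for unit, sales in sales_by_unit.items()}

# Unit C does not sell the product line, so it bears none of the cost.
print(allocate_fixed_costs(100_000, {"Unit A": 300, "Unit B": 100, "Unit C": 0}))
# -> {'Unit A': 75000.0, 'Unit B': 25000.0, 'Unit C': 0.0}
```

A revision of Unit C's sales from 0 to a positive number would simply mean re-running this step for the period concerned, which is exactly the kind of recalculation that is painless when automated and painful when done by hand.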

Long live CSV

So CSV files still have some beautiful days ahead of them, and as long as they are processed within a platform like ForePaaS that offers adequate capabilities, they will keep delivering the goods for both visualization and data science!

If any of this sounds familiar, don’t hesitate to contact us and share your own experience with CSV files.