How to Use CSV Files With Machine Learning and Analytics


Understanding CSV files: what a CSV file is, and how to work with CSV files for Machine Learning and Analytics projects.


Dating back to at least 1972 (according to Wikipedia), the CSV (Comma-Separated Values) format remains an integral part of data integration. Because even early software written in Fortran supported it, and because the files are easily human-readable, CSV remains an all-too-common go-to for exporting and importing structured data. Today, most PC operating systems propose Excel as the default application for opening CSV files, and native Excel files can easily be exported as CSV (not to be confused with CVS Pharmacy). In doing so, the human factor is introduced into the data processing chain, with all its error potential.


Human vs. Machine


For example, consider a situation where revenue and spending data are consolidated weekly as CSV files from 5 countries with ten cities each. If p is the probability of a human error in each source file and P the probability of a faulty Annual Report aggregating the weekly ones, then P = 1 - (1 - p)^(5×10×52) = 1 - (1 - p)^2600. In other words, if you tolerate a 1% error rate at the weekly-city level, you have to live with a 99.999999999% (is that enough nines?) chance of an error in the yearly report; an error-free report is about 11,700 times less likely than winning the big lottery prize in France. Even if you tolerate just 0.1% human error at the lowest level, the yearly report still has roughly a 93% chance of containing an error.
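The arithmetic above is easy to sketch in a few lines of Python (the 5 × 10 × 52 = 2600 file count comes straight from the example; the function name is mine):

```python
# Probability that at least one of n independently prepared files
# contains a human error, given a per-file error probability p.
def faulty_report_probability(p: float, n: int = 5 * 10 * 52) -> float:
    """P = 1 - (1 - p)**n : chance the aggregated yearly report is wrong."""
    return 1 - (1 - p) ** n

# 5 countries x 10 cities x 52 weeks = 2600 source files per year
print(faulty_report_probability(0.01))   # ~0.999999999995 with 1% per-file error
print(faulty_report_probability(0.001))  # ~0.93 with 0.1% per-file error
```

Running it confirms how brutally the per-file error rate compounds across 2600 files.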


How to validate CSV files


As much as humanism is a motivation for many of us, we are not perfect machines. The engineer's natural reaction to lower the risk of error would be to introduce redundancy and decrease p by having someone else check the first person's work. That's a good approach, as it reduces p down to p², turning 1% into 0.01%. The other method is to define a set of formulas that detect errors and create automated mechanisms to apply them. Well-trained accountants usually build procedures into their spreadsheets that generate indicators of possible mistakes. The accounting practice of "balancing the books," which checks that all revenues and spending are accounted for by computing the difference between the two, is the ultimate example of such number checking.


Platforms such as ForePaaS are well equipped to be part of hybrid data processing pipelines where humans upload .csv files through a dedicated interface (Drop Zones) and Python scripts implement checks at different levels. For one of our large customers, who requires many third-party subcontractors to upload files, we implemented a set of rules that work at both the "structural" and "semantic" levels. These work well with CSV files.


A structural rule checks the existence of a value and its data type within the CSV file. It is primarily geared toward catching clerical typing or manipulation errors. A file is rejected or quarantined if a variable is missing or malformed, and an error message is displayed immediately for correction.
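A structural check of this kind can be sketched with Python's standard csv module. The column names and types below are illustrative, not the actual schema from the customer project:

```python
import csv

# Hypothetical schema: each required column mapped to a parser that
# raises ValueError when the cell is malformed.
SCHEMA = {
    "city": str,
    "week": int,
    "revenue": float,
    "spending": float,
}

def structural_errors(path: str) -> list:
    """Return a list of (row_number, message) structural problems."""
    errors = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = set(SCHEMA) - set(reader.fieldnames or [])
        if missing:
            return [(0, f"missing columns: {sorted(missing)}")]
        for i, row in enumerate(reader, start=2):  # line 1 is the header
            for col, parse in SCHEMA.items():
                value = (row.get(col) or "").strip()
                if not value:
                    errors.append((i, f"empty value in '{col}'"))
                else:
                    try:
                        parse(value)
                    except ValueError:
                        errors.append((i, f"bad {parse.__name__} in '{col}': {value!r}"))
    return errors
```

An empty error list means the file passes the structural gate; anything else can drive the reject/quarantine decision and the error message shown to the uploader.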


A semantic rule applies logic to the file's content to accept or reject it. For example, a "Degree of Completion" indicator is expected to only move up toward 100%, and any decrease from one week to the next prompts an error requiring a correction or a more substantial justification.
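The "Degree of Completion" rule reduces to a monotonicity check; here is a minimal sketch (the data shape is assumed, not taken from the actual project):

```python
def completion_violations(history):
    """history: list of (week, degree_of_completion) pairs sorted by week.
    A completion indicator should only move toward 100%, so flag any drop."""
    violations = []
    for (w_prev, c_prev), (w, c) in zip(history, history[1:]):
        if c < c_prev:
            violations.append((w, f"completion fell from {c_prev}% to {c}%"))
    return violations

# completion_violations([(1, 40), (2, 55), (3, 50)]) flags week 3
```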


A third level of validation compares the data to an "expected value." A predictive model can estimate a nominal value based on historical data in CSV format and other sources such as the weather or public holidays. Any deviation of a certain magnitude from this nominal value generates an alert, prompting confirmation of the data quality before it is injected into the production pipeline.
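As a stand-in for a real predictive model, even a historical mean with a tolerance band illustrates the idea (the threshold of 20% is an arbitrary assumption for the sketch):

```python
def deviates_from_expected(value, history, tolerance=0.2):
    """Flag `value` when it strays more than `tolerance` (as a fraction)
    from the mean of past observations, a simple stand-in for the
    expected value a trained predictive model would produce."""
    expected = sum(history) / len(history)
    return abs(value - expected) > tolerance * abs(expected)
```

In production the `expected` value would come from the model fed with historical CSV data plus external signals (weather, public holidays), and a `True` result would raise the alert rather than silently rejecting the file.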


How to automate your CSV files input


Automation, on the other hand, takes care of error-prone repetitive manipulations and calculations and applies the same logic to all data, as long as the structure is as expected. It needs to be correctly orchestrated, and tasks can be serialized or parallelized for performance. Conditional triggering on CSV files can help capture complex logic that deals with various situations. For example, a business unit that doesn't sell a specific product line will not be included in allocating that line's fixed costs. Those that do sell the product line are impacted proportionally to their share of overall sales volume in any given period.


This means sales numbers need to be crunched before the fixed-cost allocation steps. If a later revision of the sales numbers reveals product-line sales in a business unit that wasn't impacted before, the numbers of all affected business units need to be recalculated. If done manually, such a change of calculation rules would be hard to impose, given the unplanned workload it would require from other people.
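The allocation rule described above, proportional to sales share and skipping units with no sales, can be sketched as follows (unit names and figures are illustrative):

```python
def allocate_fixed_costs(fixed_cost, sales_by_unit):
    """Split a product line's fixed cost across business units in
    proportion to their sales of that line; units with zero sales
    (i.e. that don't sell the line) receive nothing."""
    total = sum(sales_by_unit.values())
    if total == 0:
        return {unit: 0.0 for unit in sales_by_unit}
    return {unit: fixed_cost * s / total for unit, s in sales_by_unit.items()}

# allocate_fixed_costs(1000, {"FR": 300, "DE": 700, "US": 0})
# -> {"FR": 300.0, "DE": 700.0, "US": 0.0}
```

Because the output depends entirely on the sales inputs, rerunning this step after a sales revision automatically recalculates every affected business unit, exactly what would be painful to impose manually.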


Long live the king, long live CSV files


So CSV files still have plenty of good days ahead. As long as they are processed within a platform like ForePaaS that offers adequate capabilities, they will keep delivering the visualization and data science goods! If any of this sounds familiar, don't hesitate to contact us and share your own experience with CSV files.


For more articles on data, analytics, machine learning, and data science, follow me (Paul) on Towards Data Science.


Get started with ForePaaS for FREE!


End-to-end Machine Learning and Analytics Platform

Discover how to make your journey toward successful ML/Analytics painless