Why Data Scientists Should Gather Their Own Raw Data

What is good raw data for data scientists? Why shouldn’t data scientists rely on others to provide them with ready-to-crunch data? How can they gather their own raw data without introducing bias?


Why good raw data is important for data scientists


Good raw data for data scientists is hard to find. Data scientists’ frustration rarely comes from the limitations of available modeling algorithms. The most common excuse a data scientist gives for failing to deliver at the expected level is “data is not good enough” or “there’s not enough of it.”


Yet they rarely give precise answers to the questions “what is good enough raw data?” and “how much is enough?” Regardless of whether the objections are valid, they underscore a recurring problem: all too often, aspiring data scientists don’t see data quality and volume as a core part of their mission and tend to rely on others to provide ready-to-crunch data. More senior professionals will tell you that in real life, 80% of their work is about collecting and preparing the correct data. They prefer starting with the rawest data possible because when others prepare the data for them, those people tend to introduce their own biases.


Data scientists should understand the business


Data scientists who are willing to eat humble pie (it might taste bitter, but it’s the best pie) think of their craft as augmenting professionals’ ability to do their job rather than replacing them. Forgetting this is behind some of the common problems with machine learning. Data science is just a lever for moving bigger rocks; it is not meant to replace rock-moving! So collaborating with the people who know which rock to move is usually a sound approach. For example, shopping mall managers may not know every reason why shoppers visit their locations. Still, they have good judgment about new ideas to explore and raw data features to consider when modeling to predict shopping traffic. The exchange works in both directions: once algorithmic capabilities are explained to business people, they are often nudged into suggesting or exploring new avenues, and that insight is lost when data scientists are left out of the conversation.


In other instances, even the business people may have misconceptions about what they’re trying to accomplish. Recently, a lawyer described how she uses an online service and raw data to get input on subjects that are not part of her core expertise. The service’s ML built a profile of her as highly involved in those particular subjects. The reality is the opposite: she doesn’t care enough about those subjects to become an expert in them, and she is recognized among her peers in other areas that she never inquires about. A proper machine learning profiling approach for advertising should have inferred a different persona.


How can data scientists broaden their raw data horizon


Too many data science projects stem from the desire to “do something with our raw data.” Yet most B2B or B2C businesses are sensitive to external conditions such as weather, stock markets, or other factors for which economic indexes can be good proxy indicators. Some companies, like electrical power generators, rely critically on weather and economic activity forecasts. Many other businesses can also improve their predictions by incorporating external raw data and projections. Shopping mall traffic turned out to be decently forecastable when considering the weather, day of the week, yearly celebrations, and school vacations. If you add public event announcements made on social networks by the shops and by the competition, you get excellent predictive capabilities.
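To make the shopping mall example more concrete, here is a minimal Python sketch of how internal traffic counts might be enriched with external raw data before modeling. Everything in it is an assumption made for illustration: the visitor counts, weather readings, holiday and vacation dates, and the build_features helper are hypothetical and do not come from any real mall or from a specific platform.

```python
from datetime import date

# Hypothetical internal raw data: daily visitor counts keyed by date.
traffic = {
    date(2023, 12, 22): 4200,
    date(2023, 12, 23): 6100,
    date(2023, 12, 25): 900,
}

# Hypothetical external raw data: weather observations keyed by date.
# Note that 2023-12-25 has no observation on purpose.
weather = {
    date(2023, 12, 22): {"rain_mm": 0.0, "temp_c": 6.5},
    date(2023, 12, 23): {"rain_mm": 12.4, "temp_c": 4.0},
}

# Hypothetical calendar data: yearly celebrations and school vacations.
holidays = {date(2023, 12, 25)}
school_vacations = [(date(2023, 12, 23), date(2024, 1, 7))]


def build_features(day, visitors):
    """Combine one day's visitor count with external and calendar features."""
    obs = weather.get(day, {})
    return {
        "date": day.isoformat(),
        "visitors": visitors,
        "day_of_week": day.weekday(),  # 0 = Monday, 6 = Sunday
        "is_weekend": day.weekday() >= 5,
        "is_holiday": day in holidays,
        "is_school_vacation": any(
            start <= day <= end for start, end in school_vacations
        ),
        # Missing observations stay None instead of being silently filled in.
        "rain_mm": obs.get("rain_mm"),
        "temp_c": obs.get("temp_c"),
    }


rows = [build_features(day, count) for day, count in sorted(traffic.items())]

for row in rows:
    print(row)
```

Leaving missing weather readings as None, rather than guessing a value, keeps the data raw and leaves the decision about how to handle gaps to the person building the model.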


The ForePaaS Platform was designed with these realities in mind. An extensive connector marketplace puts all sorts of raw data within reach. The data engineering capabilities allow mixing data of various temporalities, from data captured asynchronously in real time to data processed periodically. The platform also encourages dialog between data scientists and business users by offering a comprehensive point-and-click environment for building production-grade web applications, which promote human consumption of, and feedback on, the information produced.


For more articles on data, analytics, machine learning, and data science, follow Paul Sinaï on Towards Data Science.