Feature Engineering, the not-so-secret sauce of data scientists


Speaking at a conference in New York earlier this year, Richard Pook, an executive search consultant at Dore Partnership, explained: “The only data scientists who can demand the really big pay are those that can bridge the gap between the analytics team and the C-suite. Everyone else gets a lot less.” So if you think data scientists are hard to come by, wait until you look for one who can take care of business!

While it’s tempting to imagine data scientists building and manipulating complex algorithms, reality is far more down-to-earth. The algorithm is only a fraction of a data scientist’s work. Most of their time goes to transforming and enriching data, adding features that bring out the relevant elements hidden inside it: this is what is called feature engineering. A simple algorithm fed with more data and better features can thus perform far better than a complex model built on weak assumptions.

 

Closer to home, in preparation for end-of-year gift giving, I tried to use machine learning to determine which dresses my teenage daughters would like. To that end, I scraped pictures from a popular e-commerce site (exclusively for personal use!) and asked them to label each picture “I like” or “I don’t like” to create a training set. Using a couple of popular algorithms, I only produced predictions that couldn’t beat a coin flip!

Sure, the “I’m special, and unpredictable” teenage reaction played a role, but when prompted, their explanations came down to tiny details like “the shapes of the straps”, “the drape”, “the hemlines”…

Yes, my teenage daughters had put their finger on the concept of feature engineering and learning spaces!

Vladimir Vapnik, one of the pioneers of machine learning, explains that “…When musicians are training in master classes, the teacher does not show exactly how to play. He or she talks to students and gives some images transmitting hidden information”. Vapnik talks of “Gestalt description” and non-inductive approaches. Indeed, a feature is a complementary piece of information that helps improve a prediction’s relevance.

Another simple example is a rock-paper-scissors game that I modeled to predict my opponent’s next move. I found that a prediction based on the history of moves alone is very bad, but players become very predictable once you add the feature “who wins”!
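To make the rock-paper-scissors point concrete, here is a minimal sketch in Python. It is not the model I actually built: the opponent is a hypothetical simulated player (it repeats a winning move, otherwise switches to the move that would have beaten mine, with a bit of random noise), and the "model" is a plain lookup table. The names simulate and accuracy, and the noise level, are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value


def outcome(opponent, me):
    """Result of a round, seen from the opponent's side."""
    if opponent == me:
        return "tie"
    return "win" if BEATS[opponent] == me else "lose"


def simulate(n_rounds, rng, noise=0.2):
    """Hypothetical opponent: keeps a winning move, otherwise switches to the
    move that would have beaten my last move. I play uniformly at random."""
    rounds = []
    opp = rng.choice(MOVES)
    for _ in range(n_rounds):
        me = rng.choice(MOVES)
        result = outcome(opp, me)
        rounds.append((opp, result))
        if rng.random() < noise:
            opp = rng.choice(MOVES)  # occasional unpredictable move
        elif result != "win":
            opp = next(m for m in MOVES if BEATS[m] == me)
    return rounds


def accuracy(rounds, key):
    """Fit a lookup table on the first half, score it on the second half."""
    pairs = [(key(prev), nxt[0]) for prev, nxt in zip(rounds, rounds[1:])]
    split = len(pairs) // 2
    table = defaultdict(Counter)
    for features, next_move in pairs[:split]:
        table[features][next_move] += 1
    hits = 0
    for features, next_move in pairs[split:]:
        seen = table.get(features)
        guess = seen.most_common(1)[0][0] if seen else MOVES[0]
        hits += guess == next_move
    return hits / (len(pairs) - split)


rounds = simulate(5000, random.Random(0))
without_feature = accuracy(rounds, key=lambda r: r[0])  # last move only
with_feature = accuracy(rounds, key=lambda r: r)        # last move + who won
print(f"last move only:      {without_feature:.2f}")
print(f"last move + who won: {with_feature:.2f}")
```

With this simulated opponent, the table that only sees the last move should stay close to chance, while the one that also sees who won the round should do markedly better. The algorithm is the same in both cases; only the feature changes.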

In essence, for the magic of data science to happen, one first needs to understand the users’ needs and business goals, then enter and embrace their semantic space: the way they apprehend and judge the subject matter.

 

It is vital to capture these additional features, including ones that come from third-party data sources. For example, public holidays and other local celebrations enrich date series and help to understand and predict shopping activity.
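As an illustration of enriching a date series, the sketch below turns a raw date into a few calendar features. The HOLIDAYS dictionary stands in for a third-party calendar and the function name date_features is hypothetical; a real project would load the holidays and local celebrations relevant to its own market.

```python
from datetime import date, timedelta

# Stand-in for a third-party calendar (assumption for this example);
# in practice it would come from an external source for the right region.
HOLIDAYS = {
    date(2023, 12, 25): "Christmas Day",
    date(2024, 1, 1): "New Year's Day",
}


def date_features(day, holidays=HOLIDAYS):
    """Turn a raw date into features a shopping-activity model could use."""
    upcoming = [h for h in holidays if h >= day]
    days_to_next = (min(upcoming) - day).days if upcoming else None
    return {
        "day_of_week": day.weekday(),        # Monday = 0
        "is_weekend": day.weekday() >= 5,
        "is_holiday": day in holidays,
        "days_to_next_holiday": days_to_next,
    }


start = date(2023, 12, 23)
rows = [date_features(start + timedelta(days=i)) for i in range(4)]
for row in rows:
    print(row)
```

The raw date alone tells a model very little. Columns such as is_holiday or days_to_next_holiday carry the kind of knowledge a shopkeeper already has in mind when looking at a calendar.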

The ForePaaS platform was designed with such usage in mind, with the aim of rolling out this kind of configuration in a continuous and robust production setting.

 

Want to learn more about the Data Science features of our platform?

Schedule a demo