Let there be light!

[Image: sunrise.png]

Colossal changes

For more than 20 years, I worked for a large group, a world leader in building materials, mainly in R&D, Innovation, or Technical Expertise positions in Europe and North America. Very early on, I was immersed in the world of data. Throughout my career, I participated in, and later led, functional expertise and training groups focused on data. It was the activity I enjoyed most, which largely explains my change of career path. In a large group where performance was paramount, indicators and data were omnipresent and made up much of my daily life. What I didn't realize then was that I was using 20th century techniques and tools. As the saying goes, "in the kingdom of the blind, the one-eyed man is king". Yet in the early 2000s, enormous theoretical advances were made in statistics and machine learning. These advances, combined with the increase in computing power and, above all, the development of open source tools, are the three forces that have completely transformed the discipline.


Everything is data

Perhaps the most significant change concerns the raw material itself: data. In my former world, data was either numerical (indicators, results of physicochemical analyses from R&D or quality control, sensor measurements, etc.) or categorical (a product type, a customer segment, a type of production equipment), and the data processing techniques we used in industry applied almost exclusively to this kind of data. On the manufacturing side of the business, we used control charts to monitor processes, detect shifts, and so on.
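For readers less familiar with control charts, here is a minimal illustrative sketch in Python of the underlying idea (an individuals chart with limits at three standard deviations). The numbers are simulated for the example and have nothing to do with my former employer's data.

```python
import numpy as np

def shewhart_limits(samples, sigma_mult=3.0):
    """Compute the center line and control limits for an individuals control chart."""
    samples = np.asarray(samples, dtype=float)
    center = samples.mean()
    # Estimate short-term variability from the average moving range
    # (standard practice for individuals charts: sigma ~= mean moving range / 1.128).
    moving_range = np.abs(np.diff(samples))
    sigma = moving_range.mean() / 1.128
    return center, center - sigma_mult * sigma, center + sigma_mult * sigma

# Simulated quality-control measurements with a process shift in the last 10 points.
rng = np.random.default_rng(42)
measurements = np.concatenate([rng.normal(50.0, 1.0, 40), rng.normal(53.0, 1.0, 10)])

center, lcl, ucl = shewhart_limits(measurements[:40])  # limits from the in-control period
out_of_control = np.where((measurements < lcl) | (measurements > ucl))[0]
print(f"center={center:.2f}, LCL={lcl:.2f}, UCL={ucl:.2f}")
print("points outside the limits:", out_of_control)
```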

Today, everything is "data", and virtually anything can be leveraged to generate knowledge or improve productivity. Whether for images, videos, or text, the three forces mentioned above have enabled colossal progress in exploiting this kind of data (generally called unstructured data). Theoretical advances and the increase in computing power created the right conditions for deep learning techniques to evolve and be applied to this data. The spread of open source tools has democratized these techniques and made them accessible to everyone, not just the "big tech" players. At Videns, for example, we have developed models that automatically classify scanned documents treated as images and extract specific information depending on the type of document (the supplier on an invoice, for example).
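To make this concrete, here is a deliberately tiny sketch, in Python with PyTorch, of the kind of model involved: a small convolutional network that takes a scanned page as a grayscale image and outputs a document class. The class names, the architecture, and the random input are hypothetical illustrations, not the actual Videns models, and a real system would of course be trained on labeled documents first.

```python
import torch
import torch.nn as nn

# Hypothetical document classes, purely for illustration.
CLASSES = ["invoice", "purchase_order", "contract"]

class DocumentClassifier(nn.Module):
    """A small convolutional network that classifies a scanned page treated as an image."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)      # (batch, 32, 1, 1)
        x = torch.flatten(x, 1)   # (batch, 32)
        return self.classifier(x) # raw class scores (logits)

# One grayscale 128x128 "scanned page" of random pixels, just to show the shapes.
model = DocumentClassifier(num_classes=len(CLASSES))
page = torch.rand(1, 1, 128, 128)
logits = model(page)
print("predicted class:", CLASSES[logits.argmax(dim=1).item()])
```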


Structured data is no exception

But it would be wrong to think that recent advances only concern unstructured data and deep learning. In the early 2000s, our ability to make predictions from structured (numerical or categorical) data also improved tremendously, especially with the development of ensemble methods. This family of techniques, which includes "random forests" and "gradient boosting", can model strongly non-linear phenomena, which are commonplace around us. At Videns, we have used these techniques, for example, to predict a customer's propensity to adopt a new financial product or to estimate the price of airline tickets based on the characteristics of the trip.
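As an illustration only, here is a short Python sketch of the ensemble idea using scikit-learn: a random forest trained on a purely synthetic "ticket price" dataset whose pricing rule is deliberately non-linear. This is not Videns's model or data, just the shape of the approach.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic "airline ticket" data: distance (km), days booked in advance, peak-season flag.
rng = np.random.default_rng(0)
n = 2000
distance = rng.uniform(200, 8000, n)
days_ahead = rng.integers(1, 180, n)
peak = rng.integers(0, 2, n)

# A made-up, deliberately non-linear price rule plus noise.
price = (
    0.08 * distance
    + 300 * np.exp(-days_ahead / 20)  # last-minute premium that decays quickly
    + 120 * peak
    + rng.normal(0, 30, n)
)

X = np.column_stack([distance, days_ahead, peak])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.25, random_state=0)

# An ensemble of decision trees captures the non-linear structure without manual feature work.
model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
```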


Cultural obstacles

In my former company, which had a strong engineering culture, one of the major obstacles to the development of predictive models was resistance to purely statistical models, by which I mean models that are not based on engineering rules or laws of physics and chemistry. One justification for this resistance may be the (perhaps unconscious) belief that these "statistical" models will generalize less well than models grounded in physical laws, for example. Yet one of the core principles in developing predictive machine learning models is precisely to ensure that performance is not only observed on the training data but that the model generalizes to new data with satisfactory performance.

Another form of resistance sometimes observed in business contexts is the difficulty of accepting "black box" models, that is, models for which, despite an adequate capacity for generalization, it is difficult to explain the precise path from the input data to a given prediction. Perhaps this resistance is related to our need to mentally represent how a model works, or at least to have a good intuition about it - or do we simply need to mature before trusting these kinds of solutions? Regardless, this is an area of intense R&D in data science. At Videns, depending on the business context, one of our practices is to develop two predictive models in parallel: an explainable model and a "black box" model. Depending on the performance gap, the customer can then decide to use one or the other (or both!).
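Here is a minimal sketch of that "two models in parallel" practice, again in Python with scikit-learn on synthetic data (the real features, models, and business data are of course not shown): an easily explainable logistic regression is trained alongside a gradient boosting "black box", and their performance on held-out data is compared.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for, say, product-adoption propensity.
X, y = make_classification(n_samples=3000, n_features=12, n_informative=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# "Explainable" candidate: a linear model whose coefficients can be read and discussed.
explainable = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# "Black box" candidate: a boosted tree ensemble, often stronger but harder to interpret.
black_box = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

print(f"explainable model accuracy: {explainable.score(X_test, y_test):.3f}")
print(f"black-box model accuracy:   {black_box.score(X_test, y_test):.3f}")
# The gap between the two scores is what the customer weighs against interpretability.
```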


Give it a try!

The hype around artificial intelligence and machine learning is extremely high today, but there are good reasons for it. As I have described in these few lines, the field has gone through an intense transformation over the past twenty years. This puts the ability to make better use of your data and create value within reach of everyone, or at least of many.