Tools for Productionalizing Data Science
This meetup focused on useful tools for data science in production. Why is data science in production so important, and what does it even mean? Well, data scientists often get caught up in building models that produce accurate results, but what happens after the model is created is equally important. Data science in production is simply the deployment of models: if a model is not properly deployed, or is not even accessible, then guess what? No one can use it.
The first speaker of the evening, Jed Dougherty, Lead Data Scientist at Dataiku, focused heavily on this idea during his talk: “So you built a model, now what?”
The typical data science workflow consists of ingesting, cleaning, and enriching data, then training a model on that data to produce accurate results. Jed reminded us that accuracy alone is not enough: how fast your model can return results matters as well. If someone can get results faster from another model, they are not going to wait around for yours. Not only that, but your model needs to return fast results across many simultaneous requests. So how do you make sure your model runs quickly? Test and adapt.
Jed introduced Apache Bench, a common program for measuring the performance of web servers. With a single command, you can send many concurrent requests to a URL and get a summary of performance results, including throughput and latency percentiles. These numbers tell you where to adapt your model so it can run faster at scale.
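As a concrete illustration, a hypothetical load test against a model served at a local endpoint might look like this (the URL, port, payload file, and request counts are made up for the example):

```shell
# Send 1,000 GET requests, 10 at a time, to a (hypothetical)
# prediction endpoint and report throughput/latency statistics.
ab -n 1000 -c 10 http://localhost:8000/predict

# For a model that accepts JSON input, POST a payload file instead:
# -p gives the file to POST, -T sets its Content-Type.
ab -n 1000 -c 10 -p payload.json -T application/json http://localhost:8000/predict
```

The output includes requests per second, mean time per request, and a table of response-time percentiles, so you can see how latency degrades as you raise the concurrency level.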
Another useful tool for data science in production is Apache Airflow, a platform for programmatically authoring, scheduling, and monitoring workflows. Brian Lavery, Senior Data Engineer, and John Paletto, Data Scientist, both from The New York Times, presented it in the second half of the evening. Their talk was called “Intro to Airflow for Data Analysts and Data Scientists”.
Brian first outlined the ways data scientists and engineers at the NYT use Airflow in their projects. In the model creation stage, they use Airflow to create feature sets, which allow users to interact with a model, and to actually run their models and make predictions. In the production stage, Airflow lets them backfill or reload data so they can rerun models, as well as set up alerts to notify them of any anomalies that may arise.
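To make that concrete, here is a minimal sketch of what such an Airflow DAG could look like. The DAG name, schedule, email address, and task stubs are all invented for illustration, not taken from the NYT's actual pipelines:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def build_features():
    """Stub: compute and store the feature set for the model."""


def run_predictions():
    """Stub: load the model and write out predictions."""


with DAG(
    dag_id="model_pipeline",          # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=True,                     # allows backfilling missed past runs
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "email_on_failure": True,     # alerting when a task fails
        "email": ["team@example.com"],
    },
) as dag:
    features = PythonOperator(task_id="build_features",
                              python_callable=build_features)
    predict = PythonOperator(task_id="run_predictions",
                             python_callable=run_predictions)

    features >> predict  # run feature creation before predictions
```

With `catchup` enabled, past date ranges can also be rerun on demand via Airflow's built-in backfill command, and the `email_on_failure` setting covers the basic alerting case Brian described.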
John wrapped up the evening with a brief use case of Airflow in the NYT's “Project Feels”. The NYT started “Project Feels” so that advertisers could place more relevant ads next to the articles users searched for. Ultimately, the NYT built models that predicted the emotion of an article, then pushed the appropriate ads to users depending on which articles they searched for and the emotions those articles evoked. How interesting!
While I went into this meetup with very little knowledge of data science in production, I left with a high-level understanding of model performance testing and data science workflows. I look forward to learning more in future meetups!