Overview: Deployment of Quantitative Workflows to Production
June 4, 2017
Original Link: https://medium.com/datmo/part-1-deploying-machine-learning-in-production-5b00d3a92553
Building and deploying machine learning models for enterprise use cases can be time-consuming, tricky, and even political within an organization. The technology is exciting — predictive intelligence and classification, for example, have great disruptive potential and enable organizations to build competitive advantages with their own data. Before putting data and models to work, however, one must consider the iterative development and versioning, deployment orchestration, and feedback cycle in the context of the application and the teams responsible for making it work.
The Ideal Lifecycle of Model Deployment & Iteration
For example, let’s say I run an e-commerce business and have been tracking conversion data to build a recommendation algorithm. I might want to answer the question: what products should I send to a user via email or social media to inspire a new purchase? I can then define the problem as taking in user information and outputting the likelihood that a given customer will convert on a particular product. Once I have the problem definition, I have to identify and mobilize the right teams to figure out a solution. Part of finding the right people means making sure I have people who can gather the data, select the models, format the data, train the models, and iterate until there is a satisfactory solution.
Once I have that, I finally get to deployment, where the goal is to expose my model as a RESTful API or a microservice so it can be easily plugged into the existing enterprise workflow. I can run analyses in batch to determine conversions, or in some cases I may also want to continuously train my system using historical and new user data along with a ground-truth value of whether or not each user converted. While doing all of this, I want to run my model on real data, iterate, and analyze the results to see if my recommendations increase user engagement and conversion. Based on the data I generate from the results, I can then reassess and refine my problem, or I can feed the data into a continuous learning loop.
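To make this concrete, here is a minimal sketch of what such a scoring microservice could look like in Python, assuming a scikit-learn classifier serialized to model.pkl; the file name, feature names, and route are illustrative only, not a prescription or Datmo's API:

```python
# A minimal sketch of a conversion-scoring microservice, assuming a trained
# scikit-learn classifier serialized to "model.pkl". Feature names and the
# route are hypothetical placeholders.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.pkl")  # the trained conversion model
FEATURES = ["age", "past_purchases", "days_since_last_visit"]  # hypothetical inputs

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Build the feature vector in the order the model expects.
    row = [[payload[name] for name in FEATURES]]
    # predict_proba returns class probabilities; take P(convert).
    conversion_probability = model.predict_proba(row)[0][1]
    return jsonify({"conversion_probability": float(conversion_probability)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Packaged this way, the same endpoint can serve batch jobs and real-time requests alike, which is what lets it slot into an existing enterprise workflow.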
Companies like Google, Facebook, and Microsoft have already assembled, or are in the process of assembling, teams that include people responsible for machine learning development, infrastructure, and product, so that they can create these loops and take full advantage of machine learning across their products. In our experience, even the biggest and most innovative organizations in the world haven’t fully figured out how to coordinate and collaborate between these teams so they can effectively build and deploy machine learning solutions to production. The ones who have addressed this collaboration problem head-on are starting to reap the benefits: think of products like Google Search and Photos, Facebook’s News Feed and recommendations, and Uber’s routing algorithms.
Our definition of a model deployment in this article encapsulates two major concepts: 1) the static machine learning model you might choose during model selection, and, most importantly, 2) the features, weights, and meta information associated with it. This means that every time you train a model, a new snapshot is created which includes all of that meta information. When we deploy a model, we really mean we are taking one snapshot of it and running it in production; when we go through the loop one more (or many more) times, we create new snapshots that improve performance over time.
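As a rough illustration of the snapshot idea (not Datmo's internal format), a training run could persist the model artifact together with its features, hyperparameters, and evaluation metrics, so that any run can be reproduced or redeployed later:

```python
# A minimal sketch of saving a "snapshot": the trained model plus the meta
# information (features, hyperparameters, metrics) that describes it.
# Directory layout and field names are illustrative.
import json
import time
from pathlib import Path

import joblib

def save_snapshot(model, features, hyperparameters, metrics, root="snapshots"):
    snapshot_id = time.strftime("%Y%m%d-%H%M%S")
    snapshot_dir = Path(root) / snapshot_id
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, snapshot_dir / "model.pkl")  # weights / estimator
    meta = {
        "snapshot_id": snapshot_id,
        "features": features,              # input columns the model expects
        "hyperparameters": hyperparameters, # training configuration used
        "metrics": metrics,                 # e.g. validation AUC on a held-out set
    }
    (snapshot_dir / "meta.json").write_text(json.dumps(meta, indent=2))
    return snapshot_id
```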
As for the process, Model Deployment is a two-part process (and both parts are equally important):
- Production Deployment: Generally, a machine learning model must be scalable, accurate, and simple in order to work in production.
- Unfortunately, in many enterprises, data science and DevOps teams are separate entities, which results in a lag in deploying RESTful APIs or microservices.
- Analysis and Monitoring: Track the performance of AI models on a continuous stream of data to find out what works best and iterate.
- Again, because DevOps and analytics teams might be separate, there is no single “source of truth” to tell each team what is working.
This two-part process must happen in unison while keeping teams on the same page. We have observed that getting the right teams to work together in this collaborative process, with all parties in sync, is the biggest barrier to deploying effective machine learning systems in production.
What Enterprise Machine Learning Deployment Really Looks Like Today
Many large, data-driven companies have built (or are building) extensive pipelines and internal workflow tools to handle this lifecycle. Unfortunately, the process is often disjointed across teams, and maintaining the workflow is not scalable. Below are the key problems faced in creating such a pipeline today and the teams that deal with them:
Data Preprocessing
- Data gathering and munging are redone for every analysis, with no easy way to reproduce the work.
Model Development
- Creating and training models is manual and slow, and performance is evaluated individually. Checking hyperparameters is manual and not well documented. Other stakeholders are unable to see results on the fly.
Resource Coordination
- Model training requires orchestration and quick experimentation using different configuration parameters.
Model Validation
- Validate that the model is working and meets the KPIs needed for the product by testing it on real live validation sets and improving the model on the fly.
Production Analytics
- Feedback data from production must be gathered and continuously fed back to retrain models at scale and iterate quickly.
Deployment
- Deploying models in production is slow, and scaling them up is manual. Multiple teams track models and how they perform in separate locations, so there is no single “source of truth” for all stakeholders to monitor performance KPIs.
Best Practices for Deployment & How Datmo Helps
Datmo’s Realtime Dashboard. More about this in Part 2 :)
Luckily, for each of the problems above, there is a better way to help the teams work together toward a common business goal.
Data Preprocessing
- Ensure configuration parameters are separated out from the model definition, so you can quickly preprocess your data without touching the model (see the sketch after this list). Datmo keeps track of these configurations.
- Keep any feature engineering on data separate from model definitions. Datmo monitors feature engineering through model snapshots.
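As a simple illustration of keeping configuration separate from the model definition, preprocessing and training parameters can live in a plain config file that the training script loads at run time; the keys below are made up for the example:

```python
# Keep preprocessing and training parameters out of the model code so they can
# be changed, versioned, and compared independently. The keys are illustrative;
# any structured format (JSON, YAML, etc.) works the same way.
import json

config = {
    "preprocessing": {
        "drop_columns": ["user_id"],
        "fill_missing_with": "median",
        "scale_numeric": True,
    },
    "model": {
        "type": "random_forest",
        "n_estimators": 200,
        "max_depth": 8,
    },
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

# Elsewhere, the training script only reads the file and never hard-codes values.
with open("config.json") as f:
    cfg = json.load(f)
```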
Model Development
- Version your model snapshots/iterations. Datmo automatically keeps track of your model iterations, including data, features, and snapshots, as well as any data preprocessing done above.
- Run your code within containers. Datmo makes your models automatically scalable with containerization to ensure you can run any snapshot from anywhere.
Resource Coordination
- Run a number of training tasks in parallel with different configurations to create performant models faster. Datmo orchestrates isolated parallel tasks on any machine.
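For a rough sense of what parallel experimentation looks like without any orchestration tooling, here is a sketch that trains one model per configuration using the standard library's process pool; the dataset and configurations are synthetic placeholders:

```python
# A minimal sketch of running several training tasks in parallel, one per
# hyperparameter configuration, on a single machine. A synthetic dataset
# stands in for real conversion data.
from concurrent.futures import ProcessPoolExecutor

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

configs = [
    {"n_estimators": 100, "max_depth": 4},
    {"n_estimators": 200, "max_depth": 8},
    {"n_estimators": 400, "max_depth": None},
]

def train_and_score(params):
    # Train one candidate model and report its cross-validated AUC.
    model = RandomForestClassifier(random_state=0, **params)
    score = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    return params, score

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        for params, score in pool.map(train_and_score, configs):
            print(params, f"AUC={score:.3f}")
```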
Model Validation
- Pipe real, live data into your models to quickly assess performance and validate your models. Datmo enables validation orchestration on live data and batched historical data.
- Create automated validation tests for the production code at regular intervals on subsets of data. Datmo enables models deployed in production to be tested at regular intervals.
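A scheduled validation check can be as simple as the following sketch: score the deployed model on a recent labeled subset of data and raise a flag when a KPI drops below an agreed threshold. The threshold, interval, and data loader are placeholders, not part of any particular product:

```python
# A minimal sketch of an automated validation test run at regular intervals.
import time

from sklearn.metrics import roc_auc_score

AUC_THRESHOLD = 0.75           # hypothetical KPI agreed with the product team
CHECK_INTERVAL_SECONDS = 3600  # e.g. run once an hour

def validate(model, X_recent, y_recent):
    """Score the model on a recent labeled subset and check the KPI."""
    auc = roc_auc_score(y_recent, model.predict_proba(X_recent)[:, 1])
    if auc < AUC_THRESHOLD:
        print(f"ALERT: validation AUC {auc:.3f} fell below {AUC_THRESHOLD}")
    return auc

# In a scheduler or a simple loop:
# while True:
#     X_recent, y_recent = fetch_recent_labeled_data()  # hypothetical loader
#     validate(deployed_model, X_recent, y_recent)
#     time.sleep(CHECK_INTERVAL_SECONDS)
```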
Production Analytics
- Monitor a realtime dashboard to see the performance on live data and sift through different time windows to analyze historical trends. Datmo has a single dashboard to monitor all models.
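Under the hood, this kind of monitoring boils down to logging each prediction with its eventual outcome and aggregating over time windows. Here is a small pandas sketch with made-up log entries and column names:

```python
# A minimal sketch of time-windowed performance monitoring over a prediction log.
import pandas as pd

# Hypothetical prediction log: each row is one scored user and the outcome.
log = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-06-01 10:00", "2017-06-01 11:00",
                                 "2017-06-02 09:30", "2017-06-02 14:00"]),
    "predicted_probability": [0.82, 0.15, 0.67, 0.40],
    "converted": [1, 0, 1, 1],
})

# Mark each prediction correct or not at a 0.5 threshold, then aggregate by day;
# the same pattern works for hourly or weekly windows.
log["correct"] = ((log["predicted_probability"] >= 0.5).astype(int)
                  == log["converted"])
daily_accuracy = log.set_index("timestamp")["correct"].resample("1D").mean()
print(daily_accuracy)
```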
Deployment
- Package your trained model into an easy-to-run container and monitor all of the statistics in one place so all stakeholders can view results. Datmo’s Realtime Dashboard keeps track of all deployments and enables you to diagnose and fix issues in minutes.
- Swap in the latest version of the snapshot to ensure the highest performance in production. Datmo enables a simple, one-click replacement of snapshots.
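One common way to make such a swap cheap, outside of Datmo's one-click flow, is to point a single "current" pointer at the newest snapshot directory so the serving code always loads one well-known path; the directory layout below is illustrative:

```python
# A minimal sketch of promoting a snapshot by atomically updating a "current"
# symlink. Assumes the snapshot directory layout from the earlier sketch and a
# POSIX filesystem; names are illustrative.
import os
from pathlib import Path

def promote_snapshot(snapshot_id, root="snapshots"):
    target = Path(root) / snapshot_id
    current = Path(root) / "current"
    tmp = Path(root) / "current.tmp"
    if tmp.exists() or tmp.is_symlink():
        tmp.unlink()
    tmp.symlink_to(target.resolve(), target_is_directory=True)
    os.replace(tmp, current)  # atomic swap of the pointer

# promote_snapshot("20170604-153000")
```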
Conclusion
The process for creating, deploying, and managing machine learning in everyday applications may seem complex, but it’s really an iterative loop. It involves a massive amount of coordination and collaboration, but it can be boiled down to a few essentials that can be solved. We’re working to simplify collaboration and improve this process to make it seamless while fitting into a familiar workflow for developers. Datmo enables the people building and maintaining AI to work together, and simplifies this iterative loop.
Stay tuned for part 2 of our series on Deployment to find out how Datmo makes deployment a piece of cake and learn about how you can use those features to monitor your own work.
To take advantage of our free tools, sign up for our beta at https://datmo.com. We would also love for you to get in touch with us if you have any questions and/or thoughts on what we’ve described above, via email at anand@datmo.com, on Twitter at @datmoAI, or on Facebook.
Happy Building 🙂
Signup to our newsletter at https://datmo.com/ to get updates on our product releases and resources.
P.S. Thanks for reading this far! If you found value in this, we’d really appreciate it if you recommend this post (by clicking the ❤ button) so other people can see it!