This blog was originally posted on the Datmo Blog on June 9, 2018

Pushing the needle forward for AI deployments

June 9, 2018

Original Link: https://medium.com/datmo/pushing-the-needle-forward-for-ai-deployments-633750c946be

At Datmo, we spend a lot of time working to build a better product for customers deploying AI to production, and part of that is ensuring that we are offering the most seamless developer experience possible. While each customer has a unique set of constraints between their model type and production needs, there is one thing consistent for everyone — people want to spend less time on deployments, and more time working on their models and experiments.

This revelation is nothing particularly new. As explained in more detail by James Lindenbaum (Co-Founder of Heroku), more effective deployment/provisioning developer tools are a critical force multiplier in the fight to accelerate current software lifecycles.

In the recent history of modern software development, we’ve seen vast improvements in efficiency for developers writing code. Yet despite great advances in generalized ops and deployment tools, deployment and provisioning have come to constitute an uncomfortably large portion of the model development lifecycle. This is due, in no small part, to the unique problems AI deployments pose when compared to conventional software.

When it comes to deploying models, these difficulties are attributable to three major causes unique to models:

First is the tracking and reproduction of environments. When you consider the number of model frameworks, language-level packages, and system-level (GPU/OS) drivers, compounded by compatibility issues across all of their respective versions, it becomes extremely difficult to track, standardize, and reproduce runtime environments.

Second is the need to create a model interface (API) for serving requests, a process almost completely independent of the training that produced the model. To build a truly efficient inference system, a team would need to effectively master the intricacies of the underlying machine learning algorithms as well as the available deployment options.

Third is the sheer number of deployment platforms and methods, each with its own setup and management pains. While one deployment method may be optimal for one model, that won’t necessarily be the case for future models. Yet in the status quo, each deployment method has its own learning curve, hindering companies from seamlessly switching between them.

The combination of these three factors creates an extremely large barrier to entry for performant and effective model deployment.

As a result of these complexities, there have been a startling number of system-level liabilities surfacing in the form of brittle glue code, unmanageable pipeline jungles, dead experimental code paths, improper abstraction debt, and more. In the face of these unique barriers, we see companies struggling to operationalize their first models, or realizing that their approach to continuously deploying models has become unscalable at the speed or breadth they require.

To us, the path forward is clear:
Create a tool that helps bring standardization, speed, and seamlessness to the AI deployment world.

In setting out to solve the pain points of AI deployment, we’ve optimized our tool, Datmo, around one concept that encapsulates the value we bring to developers: minimizing Time to Deployment (TTD). Deployment is often seen as the last mile problem in AI, and we know that empowering data scientists and AI engineers to deploy and manage their models will close the development-deployment-iteration loop in a way that unlocks immense value for companies.

To show how this system helps, let’s talk about where deployment is today. The manual method of deployment is a long and arduous process. Even once you have your model code, there are still many operationalization steps standing between data scientists and using their models in production.

  1. Install a microframework (Falcon, Flask, Tornado, etc) and write an API wrapper for your model (see the sketch after this list)
  2. Push your model + API code to a remote source (GitHub, Bitbucket, custom git repo, etc)
  3. Provision a server on the cloud provider of your choice
  4. SSH into the remote server
  5. Pull code from the source remote onto the remote server
  6. Install GPU/system-level utilities/drivers (if applicable)
  7. Install project dependencies
  8. Install a WSGI server and/or reverse proxy (Gunicorn, NGINX, etc)
  9. Instantiate the WSGI server pointed at the API app
  10. Push the process to the background for persistence, then exit SSH
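
To make step 1 concrete, here is a minimal sketch of what an API wrapper might look like with Flask. The model file (`model.pkl`), the `/predict` route, and the request format are illustrative assumptions rather than anything prescribed above:

```python
# Illustrative Flask wrapper around a trained model (step 1 above).
# "model.pkl", the /predict route, and the JSON payload shape are
# hypothetical; adapt them to your own model and API contract.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:  # assumed pre-trained, pickled model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})
```

Steps 8-10 would then typically point a WSGI server such as Gunicorn at this app (for example, `gunicorn app:app --bind 0.0.0.0:8000`) and background the process.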

Our first step in minimizing this long process was building datmo snapshots, the initial level of abstraction and the foundational building block for model deployment. We built snapshots to serve as a standalone record of the state of a given model’s code, files, data, metrics, and environment: all of the components necessary to reproduce your model elsewhere, unified within a single entity.
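
For a sense of the workflow, recording a snapshot with the open-source datmo CLI looked roughly like the commands below. These follow datmo’s publicly documented commands at the time; treat the exact syntax as an assumption rather than a guarantee.

```bash
$ datmo init                                  # set up datmo tracking in the project
$ datmo snapshot create -m "baseline model"   # record code, files, environment, and metrics
$ datmo snapshot ls                           # list recorded snapshots
```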

Snapshots are extremely valuable and are the key to being able to log and reproduce your experiments (both locally and remotely), but after going through the status-quo process ourselves and writing a blog post on converting models to a cloud API on AWS, we knew that there was more that could be done on the “operationalization” front.

This leads us to today:

With our datmo deployment CLI, we have abstracted resource (server/infrastructure) provisioning and model deployment into one simple command. No manual resource provisioning or server-side code meddling required, only a few local config files.

In fact, it’s so easy that you can do it in two minutes or less. See it in action here:

From our experience, there are two optimal vectors for deployment: a containerized, microservice-based solution and a serverless approach.

For containerized microservice deployments:

  1. Add a one-line designator to denote which function the API route should point to
  2. Create a Dockerfile (handles system utilities, environments, and package installations); a sketch follows this list
    * Here’s an example Dockerfile we’ve made that covers system-level utility and package installations; feel free to customize it for your own use cases!
    * Alternatively, here’s a handy guide to writing one from scratch, no Docker knowledge required!
  3. Create a datmo configuration file (path to the Dockerfile and model, as well as the desired API route name)
  4. Run the datmo deploy command on the CLI
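
Because the example Dockerfile above is linked rather than reproduced here, the following is only a rough sketch of the kind of Dockerfile step 2 refers to, assuming a Python model served with Flask and Gunicorn; the base image, `requirements.txt`, and `app:app` module are assumptions, not the contents of the linked file:

```dockerfile
# Illustrative Dockerfile sketch for a Python model API (step 2 above).
# The base image, requirements.txt, and app module are assumptions;
# customize for your own frameworks and system-level dependencies.
FROM python:3.6-slim

# System-level utilities (swap in a CUDA base image and drivers if you need GPUs)
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Project dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Model + API code
COPY . .

EXPOSE 8000
CMD ["gunicorn", "app:app", "--bind", "0.0.0.0:8000"]
```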

For serverless deployments:

  1. Add a one-line designator to denote which function the API route should point to
  2. Create a serverless deployment package (a sketch of a minimal handler follows this list)
    * Here’s a guide to creating one for AWS Lambda, one of the most popular serverless deployment infrastructure choices.
  3. Create a serverless datmo configuration file (path to the Dockerfile and model, as well as the desired API route name)
  4. Run the datmo deploy command on the CLI
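
As a point of reference for step 2, a bare-bones Lambda-style prediction handler might look like the sketch below. The handler name, the `model.pkl` artifact, and the event shape (the common API Gateway proxy format) are assumptions, not the exact package from the linked guide:

```python
# Illustrative AWS Lambda handler for model predictions (step 2 above).
# "model.pkl", the event/body shape, and the handler name are hypothetical;
# the real deployment package from the linked guide may differ.
import json
import pickle

with open("model.pkl", "rb") as f:  # model artifact bundled into the package
    model = pickle.load(f)

def handler(event, context):
    features = json.loads(event["body"])["features"]
    prediction = model.predict([features])
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": prediction.tolist()}),
    }
```

Loading the model at module scope (outside the handler) lets warm invocations reuse it instead of reloading it on every request.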

Once you’ve chosen your method and followed those steps, you’re good to go! Datmo’s tool handles all of the provisioning, environment setup, and app deployment based on your Dockerfile and configuration file. All you need to do is run `$ datmo deploy` and datmo will handle the rest.

Time to Deployment benchmarks:

For Time to Deployment, we count the time from the moment you hit deploy to the moment your API is live and queryable. Below are our baseline Time to Deployment numbers from a cold start (no previously configured/provisioned resources) on AWS for both serverless (AWS Lambda) and containerized server deployments (on-demand EC2 instances).

There are two primary deployment types — an initial deployment (which requires provisioning the infrastructure for the code to live on while it’s running), and updates/redeployments (which leverage existing infrastructure you’ve provisioned already). Naturally, the initial deployment will take a little longer, but once that’s out of the way, redeploying new versions of your models is extremely fast.

Serverless deployments

One of the best parts about serverless applications on modern cloud platforms is infinite and instantaneous scalability, without having to think about the number of resources you’re provisioning. With one deploy command, datmo helps you reach this goal even faster.

The deployed serverless prediction code is available here, along with the additional serverless-specific config file.

Microservice cluster deployments (CPU)

While a single deployment can seem very daunting, scaling an application out to a large number of servers is an even more difficult problem. Have no fear, because we’ve designed datmo’s CLI to handle this with ease. With a parameter passed to the CLI at deploy time, datmo will automatically spin up a cluster with any number of servers, all without any additional legwork.

The full scikit-learn model source code is available here.

Microservice cluster deployments (GPU)

Models leveraging GPU-powered prediction or training are even more difficult and time-intensive to deploy than their CPU alternatives. However, the same set of steps enables you to achieve similar deployment times despite the added complexity that datmo is handling behind the scenes. Shown below are our results for a standard MXNet implementation of ResNet-152:

MXNet model source code is taken from the official Apache tutorial, and available as-used in our tests here.

*Note: Lambda deployments are not possible for models of this size due to memory limits, and current cloud serverless offerings do not support GPUs.*

Summary

[Chart: TTD comparison for the Iris CPU model across many servers]

Regardless of what a given model’s optimal deployment method is, datmo’s deployment tools can help you every step of the way — from initial deployment to continuous deployment at scale.

Enterprise-ready:
In short, we care a lot about making the lives of data scientists and AI engineers better, and optimizing TTD is one of the biggest ways we’ve found to do that, offering a consistent and seamless experience whether you’re deploying your first model or your hundredth. And as your needs grow over time, we’re confident we’re building for the future of the industry, growing right alongside you by offering:

- A serving architecture decoupled from model predictions through a layered deployment architecture, simplifying model packaging and deployment for the data scientist.
- Caching, authentication, and scalability for deployed models with varying throughputs.
- Support for concurrent serving of multiple models, using any number of AI frameworks or cloud providers.

We know that when it comes to the enterprise stage, the process of deployment isn’t binary — there’s so much more involved. That’s why all of our benchmarks are run on our turn-key enterprise solution, where each deployment comes out of the box with:

* Job queueing (microservice deployment method)
* Task scheduling
* REST API/stream/batch request building
* Environment image reproducibility (containerization)
* App initialization
* Request load balancing

We sweat the small details because we know from firsthand experience that an AI pipeline is only as strong as its weakest link, and business value can only be realized if the entire system works like a well-oiled machine from end to end. Interested in our product, or having trouble with machine learning in production? We’d love to chat.
