Overview: Tracking and Reproducibility in Quantitative Workflows

May 5, 2017

Original Link: https://medium.com/datmo/tracking-and-reproducibility-in-data-projects-af64ce109774

Documenting your work is necessary but boring, regardless of the type of work you do. While tracking and reproducing work is becoming more standardized for most generic web-connected applications and workflows (e.g., document state-saving and tracking through Google Docs, or code version control through systems like GitHub), there is currently no widely accepted standard or simple automation for data science and machine learning. This is not to say developers and data scientists don’t track their work, but their process tends to be repetitive, time-consuming, and rarely automated.


Keeping track of the state of your work is an important part of getting things done, and automating that tracking while observing best practices also drives better productivity and collaboration. Below, we propose some best practices and an open-source system for tracking and reproducibility when working with data and machine learning, along with a new way to automate the process. Let’s start with the goals that we strive to achieve.

Tracking Goals

  1. Prevent lost history of your trained models, configurations, metrics, and environments
  2. Ensure accurate reporting of results
  3. Avoid errors when repeating, learning from, and reproducing someone else’s work (or even your own work!)

Broadly, we can categorize tracking in data projects into five components: code, configurations, environments, performance metrics, and files. Below we break down the current workflow and its problems, some best practices for handling these workflows, and the Datmo equivalents of those best practices.

Tracking Problems with Existing Solutions

Best Practices with Existing Solutions

Datmo Solution

Keeping track of workflows while working with and modeling data shouldn’t be like pulling teeth. Datmo’s simple command-line interface takes into account the common workflows and best practices that data scientists and developers are used to, and automates the entire tracking process.

More specifically, Datmo enables tracking all 5 components above via Snapshots of models. In addition, Datmo enables simple orchestration of tasks and runs to ensure replicability of snapshots across machines. Datmo provides 2 main value propositions:

  • Tracks models via snapshots, which are point-in-time records of a trained model
  • Enables orchestration for running tasks in parallel while keeping their results isolated
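To make the snapshot idea more concrete, here is a minimal sketch in Python of a record that captures the five components (code, configurations, environments, performance metrics, and files) and derives an ID from them. This is not Datmo's API or implementation: the `Snapshot` class, its field names, and the `take_snapshot` helper are all hypothetical, and the code version is assumed to be a string (such as a commit hash) supplied by the caller.

```python
import hashlib
import json
import platform
from dataclasses import dataclass, asdict
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical record type; the field names are illustrative only.
    code_version: str   # e.g. a commit hash supplied by the caller
    config: dict        # hyperparameters and other settings
    environment: dict   # interpreter and OS details
    metrics: dict       # performance metrics such as accuracy
    files: dict         # file path -> content hash

    def snapshot_id(self):
        """Hash the whole record so identical state yields an identical ID."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def capture_environment():
    """Collect a few environment details; a fuller version would also pin packages."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }


def take_snapshot(code_version, config, metrics, paths):
    """Build a Snapshot, hashing each tracked file by its contents."""
    files = {str(p): file_digest(p) for p in paths}
    return Snapshot(code_version, config, capture_environment(), metrics, files)
```

Because the ID is a hash over all five components, two snapshots taken on the same machine share an ID only if nothing tracked has changed, which is one simple way to notice when a reported result no longer matches the code, configuration, or files that produced it.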

Datmo’s CLI, along with its GUI platform, allows you to build, track, share, and collaborate on data projects. Collaborators can see all of the work you did, including snapshots and task runs, and can then improve on it, comment, and start new experiments from there, improving the way everyone works on their data projects.

If you want to get access and play around with our Datmo CLI and GUI, click here, sign up, and send an email to me at anand@datmo.com and I’ll be sure to send you an invite code :).

We hope no one will have to track their work manually ever again. Looking forward to your feedback!

Check out our simple Iris Classification demo below:

Sign up for our newsletter at https://datmo.com/ to get updates on our progress and tips on how to improve your workflow.

P.S. Thanks for reading this far! If you found value in this, we’d really appreciate it if you recommended this post (by clapping below) so other people can see it!
