Anand Sampat

Learning Data Science? You May Already Know The Basics

September 15, 2017

Original Link: https://medium.com/datmo/learning-data-science-you-may-already-know-the-basics-64c9d764bcbb

Countless people are trying to get into data science lately, but many are intimidated by the idea of learning a new workflow. In the last year alone, the number of people worldwide reading about and interacting with machine learning has jumped from ~100k in 2016 to over 10 million in 2017.

The two biggest camps of people learning data science are software engineers looking to learn about quantitative development, and quantitatively oriented mathematicians and statisticians looking to scale their analysis and impact using machine learning (ML) and advanced analysis.

For software engineers, there are actually a lot of similarities to the “plan-build-test-deploy” (PBTD) process that you already know well.

The PBTD loop you already know and love

We’re going to break down the data science development and production pipeline, and show how you can leverage the knowledge you already have about the PBTD workflow to hit the ground running.

Data Science follows the same PBTD loop you already know and love.

Data Science and Software Development follow the same loops.

While the process looks the same at the high level, there are some specifics that make data science slightly different.

Plan

“Every extra minute spent planning is 10 minutes less spent debugging”

Planning with a team

Data science, just like any task, requires planning; much like software engineering, developing a specification is part of the game. Here are the key steps:

Identify business needs and create metrics: leverage business stakeholder knowledge to identify business needs and translate those needs into metrics that can be reliably measured. A machine learning model needs a concrete value to minimize or maximize so that its effectiveness can be assessed.

Design experiments: come up with a set of experiments to generate the metrics you care about and identify the algorithms you need to use to get the desired results.
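As a minimal sketch of these two planning steps, suppose the business need translates into a binary classification problem and the agreed metric is accuracy. The toy data, the two candidate "experiments", and every function name below are hypothetical and only illustrate the idea of comparing experiments against one shared metric.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (the metric to maximize)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(train_labels):
    """Experiment A: always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda x: most_common


def threshold_rule(threshold):
    """Experiment B: predict 1 when the single feature exceeds a threshold."""
    return lambda x: 1 if x > threshold else 0


# Hypothetical toy data: one numeric feature per example and a binary label.
train_x, train_y = [0.1, 0.4, 0.6, 0.9, 0.2], [0, 0, 1, 1, 0]
test_x, test_y = [0.3, 0.7, 0.8, 0.05], [0, 1, 1, 0]

experiments = {
    "majority_baseline": majority_baseline(train_y),
    "threshold_0.5": threshold_rule(0.5),
}

# Score every experiment with the same metric on the same held-out data.
results = {
    name: accuracy(test_y, [model(x) for x in test_x])
    for name, model in experiments.items()
}
print(results)
```

Fixing the metric before writing any model code plays the same role as a specification in software engineering: every later experiment is judged against it.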

Build

Building and iterating is a constant process

Once you’ve come up with a plan, you can start to build your system. In software engineering, this is where you write your app’s code. In data science, it takes just a few more steps:

Environment setup: Based on the algorithm and experiment architecture you planned for, identify the library and framework dependencies you have (e.g. numpy, pandas, Python 2.7, TensorFlow, etc.). Then set the environment up on the required hardware (local vs. distributed solutions).

Data Exploration: Explore the data tables or sources that can help generate the metrics you care about. This is where you clean your data, extract relevant variables, and engineer features that will be fed as inputs into your machine learning algorithm.

Model Generation: Write the code that executes the algorithm you have chosen. This is the crux of the data science process and has a few key components:
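To make the data exploration and model generation steps concrete, here is a minimal sketch that cleans a few records, engineers one feature, and fits a simple model. It uses only Python's standard library so that it runs anywhere; in a real project the libraries named in the environment step (pandas, numpy) would likely do this work. The records, field names, and the choice of a least-squares line are assumptions made purely for illustration.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"length": 10, "width": 10, "price": 200.0},
    {"length": 15, "width": 10, "price": 300.0},
    {"length": None, "width": 10, "price": 310.0},
    {"length": 20, "width": 10, "price": 400.0},
]


def clean(rows):
    """Data cleaning: drop rows that have any missing field."""
    return [r for r in rows if all(v is not None for v in r.values())]


def add_features(rows):
    """Feature engineering: derive an area feature from length and width."""
    return [dict(r, area=r["length"] * r["width"]) for r in rows]


def fit_line(xs, ys):
    """Model generation: ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar


rows = add_features(clean(raw_rows))
slope, intercept = fit_line([r["area"] for r in rows], [r["price"] for r in rows])


def predict(area):
    """Apply the fitted line to a new input."""
    return slope * area + intercept


print(len(rows), slope, intercept, predict(250))
```

The engineered `area` column is the only input the model sees, which shows why the cleaning and feature steps come before any algorithm code.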

Test

Testing your work

Testing in software comes in many different forms. For data science, there are a few types which are particularly relevant.
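As one illustration (not an exhaustive list), the sketch below pairs an ordinary unit test for a preprocessing helper with a model-quality check that fails when accuracy on held-out data drops below a floor. The `normalize` helper, the toy model, and the 0.7 threshold are all hypothetical choices.

```python
def normalize(values):
    """Scale values linearly into [0, 1]; a hypothetical preprocessing step."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_normalize_bounds():
    """Unit test, as in ordinary software: check a deterministic helper."""
    assert normalize([2, 4, 6]) == [0.0, 0.5, 1.0]
    assert normalize([3, 3]) == [0.0, 0.0]


def check_model_meets_threshold(model, inputs, labels, min_accuracy):
    """Model-quality test: fail if held-out accuracy drops below a floor."""
    correct = sum(model(x) == y for x, y in zip(inputs, labels))
    accuracy = correct / len(labels)
    assert accuracy >= min_accuracy, f"accuracy {accuracy:.2f} < {min_accuracy}"
    return accuracy


def toy_model(x):
    """Stand-in for a trained model: predict 1 when the feature exceeds 0.5."""
    return int(x > 0.5)


test_normalize_bounds()
held_out_accuracy = check_model_meets_threshold(
    toy_model, [0.2, 0.9, 0.7, 0.4], [0, 1, 1, 1], min_accuracy=0.7
)
print(held_out_accuracy)
```

The first test is deterministic, like any software unit test. The second is statistical, so the threshold is a judgment call tied to the metrics chosen during planning.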

Deploy

Once we have tested the trained model, we push it to production, similar to software engineering. There are many tools for deploying production code, all built around the ability to track changes with git commits. For data science projects, there is a huge void in deployment tools, as only the largest technology companies have the resources required to build proprietary in-house solutions. Datmo is building the platform and CLI that solve these problems specifically for data science and machine learning development. Drawing parallels to SWE, here are the key components
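To picture what commit-style tracking could look like for a trained model, the sketch below bundles model parameters and metrics together and derives a version id from their content, much as a git commit hash identifies a code change. It uses only Python's standard library. `snapshot` and `load` are hypothetical names and are not part of Datmo's CLI or any other deployment tool.

```python
import hashlib
import json


def _digest(payload):
    """Short content hash used as a version id (first 12 hex chars of SHA-256)."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def snapshot(params, metrics):
    """Bundle model parameters and metrics with a content-derived version id."""
    payload = json.dumps({"params": params, "metrics": metrics}, sort_keys=True)
    return {"version": _digest(payload), "payload": payload}


def load(record):
    """Restore parameters and metrics, verifying the version id first."""
    if _digest(record["payload"]) != record["version"]:
        raise ValueError("snapshot payload does not match its version id")
    return json.loads(record["payload"])


# Hypothetical model parameters and evaluation metrics.
record = snapshot({"slope": 2.0, "intercept": 0.0}, {"accuracy": 0.75})
restored = load(record)
print(record["version"], restored["params"])
```

Because the id depends only on content, the same parameters and metrics always map to the same version, and any change produces a new one.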

Let’s see it in action

Harry Potter Sorting Hat

We’ll walk you through this process for a simple model — a sorting hat! You will provide a selfie / picture of your face and the sorting hat will sort you into the appropriate Hogwarts house!


Concept Dependencies