Case Study

Learn how Iguazio provided Quadient with an out-of-the-box data science platform and next-level MLOps automation capabilities, saving their data scientists and developers precious time and resources

The Easiest Way to Track Data Science Experiments with MLRun

Adi Hirschtein | December 2, 2019

Almost every customer I meet is in a certain stage of developing an ML-based application. Some are just at the beginning of their journey while others are heavily invested. It’s fascinating to see how data science, a once commonly used buzz word, is becoming a real strategy for almost any company.

In the following post, I’ll address one of challenges that customers bring up time and again – running and tuning experiment tracking. With a step by step tutorial I’ll cover complexity concerns and show how to solve them with MLRun, a new open source framework which optimizes the management of machine learning operations.

MLRun is an open source framework which provides a generic and easy to use mechanism for data scientists and developers/engineers to describe and track code, metadata, inputs and outputs of machine learning related tasks (executions).

MLRun tracks various elements, stores them in a database and presents all running jobs as well as historical jobs in a single report.

The database location is configurable and users run queries to search for specific jobs based on criteria. Users work on MLRun on a local IDE or notebook and then run the same code on a larger cluster using scale-out containers or functions.

Install MLRun

Install the MLRun library, run import commands and set the path for the MLRun database:

!pip install git+https://github.com/mlrun/mlrun.git
from mlrun import new_function, RunTemplate, NewTask, get_run_db
import yaml
import pandas as pd
# set mlrun db path (can also be specified in run_start command)
%env MLRUN_META_DBPATH=./

Key elements

  • Task (run) – enables you to define the desired parameters, inputs, outputs and tracking of a run. A run can be created from a template and run over different runtimes or functions.
  • Function - a runtime specific software package and attributes (e.g. image, command, args, environment, etc.). A function can run one or many runs/tasks and can be created from templates.
  • Runtime - a computation framework. MLRun supports multiple runtimes such as a local Kubernetes job, DASK, Nuclio, Spark and mpijob (Horovod). Runtimes may support parallelism and clustering (i.e. distributing the work among processes/containers).

Add MLRun hooks to the code

To capture the data mentioned above, add hooks in the job itself. You can track the results of the job, artifacts such as the model file or csv file, source code, labels and more.

MLRun introduces the concept of an ML context. Code is instrumented to get parameters and inputs from the context as well as log outputs, artifacts, tags and time-series metrics.

from mlrun.artifacts import ChartArtifact, TableArtifact, PlotArtifact
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# define a function with spec as parameter
import time
def handler(context, p1=1, p2='xx'):
# access input metadata, values, and inputs
    print(f'Run: {context.name} (uid={context.uid})')
    print(f'Params: p1={p1}, p2={p2}')
    time.sleep(1)

# log the run results (scalar values)
    context.log_result('accuracy', p1 * 2)
    context.log_result('loss', p1 * 3)

    # add a lable/tag to this run
    context.set_label('category', 'tests')

    # log a simple artifact + label the artifact
    context.log_artifact('model.txt', body=b'abc is 123', labels={'framework': 'xgboost'})

    # create a matplot figure and store as artifact
    fig, ax = plt.subplots()
    np.random.seed(0)
    x, y = np.random.normal(size=(2, 200))
    color, size = np.random.random((2, 200))
    ax.scatter(x, y, c=color, s=500 * size, alpha=0.3)
    ax.grid(color='lightgray', alpha=0.7)
    context.log_artifact(PlotArtifact('myfig', body=fig))

    # create a dataframe artifact
    df = pd.DataFrame([{'A':10, 'B':100}, {'A':11,'B':110}, {'A':12,'B':120}])
    context.log_artifact(TableArtifact('mydf.csv', df=df, visible=True))

    return 'my resp'

Running the job

Specifically for this example, we are using an inline code withing Jupyter.  but note that you can also save and run it as a python file.

Next, define the task and its parameters, such as task name, input parameters, secret file (if needed) and labels.

Create a function, specify the command (i.e. our Python script) and assign it to the task that was defined previously.

Once the job is done it will display the information that had been captured in the MLRun tracking database while the job was running.

task = NewTask(handler=handler,name='demo', params={'p1': 5}).set_label('type', 'demo')
run = new_function().run(task)

The job is captured in the MLRun database.

For each run you can view the name, status, labels and all parameters and artifacts that are associated with the run.

View the database at any point by using the following commands:

db_path = %env MLRUN_META_DBPATH
db = get_run_db(db_path).connect()
db.list_runs(labels='demo').show()

Running jobs with hyper parameters

Hyper parameters are important because they directly control the behavior of the training algorithm and have a significant impact on the performance of the model being trained. Data scientists often want to run the same model with different parameters to figure out which configuration is the best fit.

First, create a template (see below) and then run it with the hyper_params flag.

In this case, MLRun tracks the information for each individual running instance, making it easy to compare between the different runs.

Based on the example below, run three iterations of the training job, each with a different parameter and see how the job identifies the iteration with the lowest loss.

task = NewTask(handler=handler).with_hyper_params({'p1': [5, 2, 3]}, 'min.loss')
run = new_function().run(task)

View job results

Once the job is done the results are displayed. Click on the ’iteration_results’ artifact to open a window with detailed information. In our case, iteration number two has the lowest loss score, which is defined as our best fit function, and therefore is displayed as the ’best_iteration’.

Showing the iteration_results

To conclude

Running and tracking data science jobs may require lots of DevOps and labor-intensive tasks if you’re trying to do everything on your own. MLRun provides the flexibility to run any job either locally on your laptop or on a distributed cluster at scale in a much simpler way. Users easily run reports to get details on each and every job and If needed, any previous job can be reproduced and run again with all its artifacts and meta data.

For more details on MLRun go to the MLRun repo https://github.com/mlrun/mlrun

Next steps

Running on Iguazio’s Data Science Platform provides the ability to pick and choose various built-in frameworks and run jobs at scale, accelerating performance and eliminating infrastructure overhead throughout the entire machine learning pipeline.

To try MLRun on a scalable cluster check out the Iguazio Data Science Platform.