Data Science Automation (MLOps) Services

Overview

The platform has pre-deployed services for data science and machine-learning operations (MLOps) automation and tracking:

MLRun

MLRun is Iguazio's open-source MLOps orchestration framework, which offers an integrative approach to managing machine-learning pipelines from early development through model development to full pipeline deployment in production. MLRun offers a convenient abstraction layer over a wide variety of technology stacks while empowering data engineers and data scientists to define the features and models. MLRun also integrates seamlessly with other platform services, such as Kubeflow Pipelines, Nuclio, and V3IO Frames.

The MLRun server is provided as a default (pre-deployed) shared single-instance tenant-wide platform service (mlrun), including a graphical user interface ("the MLRun dashboard" or "the MLRun UI"), which is integrated as part of the Projects area of the platform dashboard.

The MLRun client API is available via the MLRun Python package (mlrun), including a command-line interface (mlrun). You can easily install and update this package from the Jupyter Notebook service by using the /User/align_mlrun.sh script, which is available in your running-user directory after you create a Jupyter Notebook platform service. For more information, see Installing and Updating the MLRun Python Package in the platform introduction.
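For example, from a Jupyter notebook cell you can run the update script and then verify the installed client version; a minimal sketch, assuming a recent MLRun client that exposes mlrun.get_version():

# From a Jupyter notebook cell on the platform: update the MLRun client
# to match the MLRun server version (script path as documented above)
!/User/align_mlrun.sh

# Verify the installed client version
import mlrun
print(mlrun.get_version())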

The MLRun library features a generic and simplified mechanism for helping data scientists and developers describe and run scalable ML and other data science tasks in various runtime environments while automatically tracking and recording execution code, metadata, inputs, and outputs. The capability to track and view current and historical ML experiments along with the metadata that is associated with each experiment is critical for comparing different runs, and eventually helps to determine the best model and configuration for production deployment.
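As a minimal sketch of this tracking, the following runs a hypothetical handler locally while MLRun records its execution code, parameters, and results:

import mlrun

def train(context, p1: int = 1):
    # log a result; MLRun records it together with the run metadata
    context.log_result("accuracy", 0.85 + 0.01 * p1)

# run locally; the execution, parameters, and results are tracked
run = mlrun.new_function(name="trainer", kind="local").run(
    handler=train, params={"p1": 2}
)
print(run.outputs)  # e.g. {'accuracy': 0.87}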

MLRun is runtime and platform independent, providing a flexible and portable development experience. It allows you to develop functions for any data science task from your preferred environment, such as a local IDE or a web notebook; execute and track the execution from the code or using the MLRun CLI; and then integrate your functions into an automated workflow pipeline (such as Kubeflow Pipelines) and execute and track the same code on a larger cluster with scale-out containers or functions.
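As a sketch of this portability, the same handler can later be packaged as a Kubernetes job and executed on the cluster; the project name and the trainer.py file below are hypothetical:

import mlrun

project = mlrun.get_or_create_project("demo", context="./")

# package the same code (assumed to be saved in trainer.py) as a scalable job
fn = project.set_function(
    "trainer.py", name="trainer", kind="job",
    image="mlrun/mlrun", handler="train",
)
run = fn.run(params={"p1": 4}, local=False)  # runs on the cluster; tracked in the MLRun UI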

For detailed MLRun information and examples, including an API reference, see the MLRun documentation, which is available also in the Data Science and MLOps section of the platform documentation. See also the MLRun restrictions in the platform's Software Specifications and Restrictions.

You can find full MLRun end-to-end use-case demo applications, as well as getting-started and how-to tutorials, in the MLRun-demos repository. These demos and tutorials are pre-deployed in each user's /User/demos directory for the first Jupyter Notebook service created for the user. You can also find a pre-deployed /User/update-demos.sh script for updating the demo files. For details, see End-to-End Use-Case Applications and How-To Demos in the platform introduction. In addition, check out the MLRun functions marketplace, a centralized location for open-source contributions of function components that are commonly used in machine-learning development.

Configuring MLRun Services

Pod Priority for User Jobs

Pods (services, or jobs created by those services) can have priorities, which indicate the relative importance of one pod to the other pods on the node. The priority is used for scheduling: a lower-priority pod can be evicted to allow scheduling of a higher-priority pod. Pod priority is relevant for all pods created by the service.
When pods are evicted, these priority values are considered in conjunction with each pod's quality of service. See more details in Interactions between Pod priority and quality of service.

Pod priority is specified through Priority classes, which map to a priority value. The priority values are: High, Medium, Low. The default is Medium.

Configure the default priority for user functions of a service, which is applied to the service itself and to all subsequently created user jobs, in the service's Common Parameters tab, User jobs defaults section, Priority class drop-down list.

The priority applies to the jobs created by MLRun.
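The priority of an individual MLRun function can also be set in code before it runs; a minimal sketch, assuming the platform's default priority-class names (igz-workload-low, igz-workload-medium, igz-workload-high):

# assumes fn is an MLRun function object (e.g. returned by project.set_function)
fn.with_priority_class("igz-workload-high")
fn.run(params={"p1": 2})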

Resources for User Jobs

Warning
Do not change the MLRun service resource parameters unless Iguazio support recommends that you change them.
Note
Configuring the number of workers is relevant if the platform is running with MLRun v1.1.0 or higher.

When you create a pod in an MLRun job, the pod has default CPU and memory limits, and a default number of workers (2). When the job runs, it can consume resources up to the limits defined. The CPU and memory configurations are applied to each replica.
You can configure the default limits and the number of workers at the service level.
When creating a service, set the Memory and CPU in the Common Parameters tab, under User jobs defaults.
When creating a job, you can overwrite the default Memory and/or CPU in the Configuration tab, under Resources.
You cannot overwrite the number of workers defined at the service level.

See more about Resource Management for Pods and Containers.
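The defaults can likewise be overridden per function in code; a minimal sketch using the MLRun resource helpers, with arbitrary values:

# assumes fn is an MLRun function object
fn.with_requests(mem="1G", cpu=1)  # resources guaranteed to each replica
fn.with_limits(mem="2G", cpu=2)    # upper bounds that the job cannot exceed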

Service Account

You can add a custom service account to an MLRun service. The annotations associated with the account persist across service restarts and upgrades. The service account must already exist on the cluster. (If you do not specify a name, the service uses a default account.)

Node Selection

You can assign jobs and functions to a specific node or node group, to manage your resources and to differentiate between processes and their respective nodes. A typical example is a workflow that you want to run only on dedicated servers.

When specified, the service or the pods of a function can only run on nodes whose labels match the node selector entries configured for the service. You can also specify labels that were assigned to app nodes by an Iguazio IT Admin user. See Setting Labels on App Nodes.

Configure the key-value node selector pairs in the Custom Parameters tab of the service.

If node selection for the service is not specified, the selection criteria defaults to the default Kubernetes scheduling behavior, and jobs can run on any available node.
Node selection is relevant for all cloud services.

See more about Kubernetes nodeSelector.

You can also configure the node selection for individual MLRun jobs by going to Platform dashboard | Projects | New Job | Resources | Node selector, and adding or removing Key:Value pairs.
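The same selection can be applied to a function in code; a minimal sketch, assuming the nodes carry the (hypothetical) label used below:

# assumes fn is an MLRun function object and that the label exists on the nodes
fn.with_node_selection(node_selector={"app.iguazio.com/node-group": "gpu"})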

Modify the priority for an ML function by pressing ML functions, then the name of the function, then Edit | Resources | Pods Priority drop-down list.

External System Docker Registry

Note
External docker registry is relevant if the platform is running with MLRun v1.0.5 or higher.

By default, the MLRun service on the default tenant is configured to work with a predefined, tenant-wide Docker Registry service, which uses a pre-deployed, local, on-cluster Docker Registry. You can change the configuration of the MLRun service to work with an off-cluster external system Docker Registry service.

The system registry ensures that all container images required by the system to operate are pulled from the given address. This is supported for managed Kubernetes integrated with a cloud-provided container registry (for example, AKS with ACR, or EKS with ECR). The URL is the container registry address. When deploying multiple systems to the same container registry, use a different URL for each system, for example xyz.my-ecr.amazon.com/some-unique-name. This is recommended to avoid overwriting existing container images.

Either select an external Docker Registry from the drop-down list, or press Create new.

Parameters:

  • URL: Required.
  • Username and password: Optional.
  • Image prefix: Optional. When defined, the prefix is added to the names of the container images (that were built in MLRun) when they are pushed by an Iguazio service (e.g. MLRun) to the registry.

The external registry does not support explicit authentication. You must ensure that Kubernetes is deployed with a role that allows it to read from and write to the given registry.

Tip
If you're using ECR/ACR for both the user custom registry and the external system Docker Registry, you can distinguish between the registries with suffixes. For example:
  • my-ecr-address.ecr.com/my-igz-system-runtime for the custom system container registry
  • my-ecr-address.ecr.com/my-igz-system for the custom user registry

When creating a registry on ECR:

  • If the permissions for the ECR are already set as part of the cluster deployment (using the EC2 IAM policy), then use ecm.com as the URL and leave the username and password blank. (EC2 instances are attached with roles that allow them to work with the ECR.)
  • If the ECR was not used for the cluster installation:
    • URL: The ECR URL (in the format <aws_account_id>.dkr.ecr.<region>.amazonaws.com).
    • Username: AWS access key ID
    • Password: AWS secret access key
Note
When using ECR as the external container registry, make sure that the project secrets AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY have read/write access to ECR.
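These project secrets can also be set programmatically; a minimal sketch using the MLRun project API, with placeholder values:

# assumes project is an MLRun project object; the values are placeholders
project.set_secrets({
    "AWS_ACCESS_KEY_ID": "<access-key-id>",
    "AWS_SECRET_ACCESS_KEY": "<secret-access-key>",
})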

The access keys or the EC2 IAM policy must have these permissions:

{ 
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:CreateRepository",
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:CompleteLayerUpload",
                "ecr:GetDownloadUrlForLayer",
                "ecr:InitiateLayerUpload",
                "ecr:PutImage",
                "ecr:UploadLayerPart"
            ],
            "Resource": "*"
        }
    ]
}

See more details in EKS and AWS Vanilla Kubernetes.

Kubeflow Pipelines

Google Kubeflow Pipelines is an open-source framework for building and deploying portable, scalable ML workflows based on Docker containers. For detailed information, see the Kubeflow Pipelines documentation.

Kubeflow Pipelines is provided as a default (pre-deployed) shared single-instance tenant-wide platform service (pipelines), which can be used to create and run ML pipeline experiments. The pipeline artifacts are stored in a pipelines directory in the "users" data container, and the pipeline metadata is stored in an mlpipeline directory in the same container. The Pipelines dashboard can be accessed by selecting the Pipelines option in the platform dashboard's navigation side menu. (This option is available to users with a Service Admin, Application Admin, or Application Read Only management policy.)
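As a minimal sketch of how a pipeline step can be defined in an MLRun project workflow (the pipeline name, the "trainer" function, and the workflow.py file are hypothetical):

# workflow.py -- assumes an MLRun project with a function named "trainer"
from kfp import dsl
import mlrun

@dsl.pipeline(name="demo-pipeline")
def pipeline(p1: int = 2):
    # resolves "trainer" from the active MLRun project and runs it as a step
    mlrun.run_function("trainer", params={"p1": p1})

Such a workflow is typically registered and executed through the MLRun project (for example, with project.set_workflow() and project.run()), and the resulting runs appear in the Pipelines dashboard.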

See Also