Session #11

Handling Large Datasets in Data Preparation & ML Training Using MLOps


In this technical training session, we’ll explore how to use Dask, Kubernetes, and MLRun to scale data preparation and model training efficiently.

Dask is an open-source Python library for parallel computing. Used together with MLRun, an open-source MLOps orchestration tool, it can run over Kubernetes to handle large-scale datasets.
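To illustrate the parallel-computing model Dask provides, here is a minimal sketch using `dask.delayed`, which wraps ordinary Python functions into a task graph that Dask can execute in parallel (the `inc`/`total` functions are hypothetical examples, not part of any MLRun workflow; this assumes `dask` is installed):

```python
import dask


@dask.delayed
def inc(x):
    # An ordinary Python function; @dask.delayed defers its execution
    return x + 1


@dask.delayed
def total(values):
    # Aggregates the results of all the deferred inc() calls
    return sum(values)


# Build the task graph lazily, then execute it with .compute();
# Dask schedules the independent inc() calls in parallel
result = total([inc(i) for i in range(10)]).compute()
# result == 55, i.e. sum of 1..10
```

The same code runs unchanged on a laptop or, with a distributed scheduler, across a Kubernetes cluster.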

In this session, we will demonstrate how to use these tools to scale your data prep and ML training with ease.

Watch this session to explore:

  1. An overview of the tools available for large-scale data processing in Python (PySpark, Dask, Vaex, and more), and how they are used with existing ML frameworks
  2. How Dask lets you run the same native Python code at scale, without the need to learn other technologies like Spark
  3. How to run Dask in a distributed and elastic way over Kubernetes to improve resource utilization
  4. How to deploy Dask-based data engineering and ML pipelines with MLRun and Kubeflow, in one click
  5. Further optimizations for handling large-scale data effectively and efficiently