Distributed computing tools make it practical to train models on very large datasets. Many ML use cases require big data to train the model, which complicates both development and production:
- When the data is too large to fit into the local machine’s memory, you can use native Python file streaming and similar tools to iterate through the dataset without loading it all at once. This approach works, but it is slow, since the job runs on a single thread.
- Parallelization across processes or threads can speed things up, but you’re still limited by the resources of a single machine.
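As a concrete illustration of the single-machine baseline, the streaming approach described above can be sketched in plain Python. This is a minimal example; the CSV path and schema are hypothetical:

```python
def count_rows(path):
    """Stream a large CSV one line at a time, so the whole file
    never needs to fit in memory at once."""
    total = 0
    with open(path) as f:
        next(f)  # skip the header row
        for _ in f:  # the file object yields one line at a time
            total += 1
    return total
```

Because the loop touches one line at a time on a single thread, memory stays flat but throughput is bounded by one core, which is exactly the limitation distributed computing addresses.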
Distributed computing tackles this challenge by distributing tasks to multiple independent worker machines, each handling part of the dataset in its own memory with its own dedicated processors. With distributed training, data scientists can scale their code to run in parallel across several workers on very large datasets, going beyond what parallelization in multiple processes or threads on a single machine can achieve.
There are two main options for transitioning to a distributed compute environment:
- Spark: A very mature and popular tool for data scientists who prefer a more SQL-oriented approach, especially in enterprises with legacy tech stacks and JVM infrastructure
- Dask: The go-to tool for large-scale data processing for those who prefer Python or native code. It has seamless integration with common Python data tools that data scientists already use. Dask is great for complex use cases or applications that don’t neatly fit into the Spark computing model.
More on Dask:
- This technical training session explores how to use Dask, Kubernetes, and MLRun to scale data preparation and training with maximum performance.
- This demo shows a simple task, demonstrating that Dask is about five times faster than Spark in processing large workloads.
- This article covers how to use the Snowflake Connector for Python with Dask.