In-Depth Platform Overview
This document provides an in-depth overview of the Iguazio Data Science Platform (“the platform”) and how to use it to implement a full data science workflow.
The platform uses Kubernetes (k8s) as the baseline cluster manager, and deploys various application microservices on top of Kubernetes to address different data science tasks. Most of the provided services support scaling out and GPU acceleration, and have secure, low-latency access to the platform’s shared data store and file system, enabling high performance and scalability with efficient use of resources.
The platform makes extensive use of Nuclio serverless functions to automate tasks such as data collection, extract-transform-load (ETL) processes, model serving, and batch jobs. A Nuclio function describes the code and includes all the resource definitions and configuration required to run it. Functions scale automatically and can be versioned. The platform supports several methods for generating Nuclio functions (the graphical dashboard, Docker, Git, or Jupyter Notebook), as demonstrated in the platform tutorials.
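As a rough illustration of what such a function looks like, the following is a minimal sketch of a Python handler in the `handler(context, event)` form that Nuclio's Python runtime uses. The payload fields (`sensor`, `temperature`), the `TEMPERATURE_LIMIT` threshold, and the `fake_context` / `fake_event` stand-ins are assumptions made up for this example so that it can run outside the platform; they are not part of the platform or of Nuclio.

```python
import json
from types import SimpleNamespace

# Hypothetical threshold, used only for this illustration.
TEMPERATURE_LIMIT = 30.0


def handler(context, event):
    """Nuclio-style entry point: parse a JSON reading and flag high values."""
    body = event.body
    # The body may arrive as bytes, as text, or already parsed into a dict.
    if isinstance(body, (bytes, bytearray)):
        body = body.decode("utf-8")
    reading = body if isinstance(body, dict) else json.loads(body)

    alert = reading["temperature"] > TEMPERATURE_LIMIT
    context.logger.info(f"sensor={reading['sensor']} alert={alert}")
    return json.dumps({"sensor": reading["sensor"], "alert": alert})


# Local stand-ins so the handler can be exercised without a Nuclio runtime.
fake_context = SimpleNamespace(logger=SimpleNamespace(info=print))
fake_event = SimpleNamespace(body=b'{"sensor": "s1", "temperature": 31.5}')
result = handler(fake_context, fake_event)
```

When deployed, the runtime supplies the real `context` and `event` objects, so the two stand-ins at the bottom are needed only for local experimentation.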
The following sections detail how you can use the platform to implement all stages of a data science workflow from research to production.
Collecting and Ingesting Data
There are many ways to collect and ingest data from various sources into the platform:
- Streaming data in real time from sources such as Kafka, Kinesis, Azure Event Hubs, or Google Pub/Sub.
- Loading data directly from external databases using an event-driven or periodic/scheduled implementation. See the explanation and examples in the read-external-db tutorial.
- Loading files (objects), in any format (for example, CSV, Parquet, JSON, or a binary image), from internal or external sources such as Amazon S3 or Hadoop. See, for example, the file-access tutorial.
- Importing time-series telemetry data using a Prometheus-compatible scraping API.
- Ingesting (writing) data directly into the system using RESTful AWS-like simple-object, streaming, or NoSQL APIs. See the platform’s Web-API References.
- Scraping or reading data from external sources (such as Twitter, weather services, or stock-trading data services) using serverless functions. See, for example, the stocks demo use-case application.
For more information and examples of data collection, ingestion, and preparation with the platform, see the basic-data-ingestion-and-preparation tutorial Jupyter notebook.
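The periodic/scheduled approach to loading data from an external database can be sketched as an incremental read that remembers the last row it has seen. The example below is only an illustration under stated assumptions: an in-memory SQLite database stands in for the external source, and the `trades` table, its columns, its sample rows, and the `fetch_new_rows` helper are all hypothetical. Writing the fetched rows into the platform is not shown.

```python
import sqlite3


def fetch_new_rows(conn, last_id):
    """Return rows with id greater than last_id, plus the new watermark."""
    cursor = conn.execute(
        "SELECT id, symbol, price FROM trades WHERE id > ? ORDER BY id",
        (last_id,),
    )
    rows = [dict(zip(("id", "symbol", "price"), row)) for row in cursor]
    new_last_id = rows[-1]["id"] if rows else last_id
    return rows, new_last_id


# An in-memory SQLite database stands in for the external source.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE trades (id INTEGER PRIMARY KEY, symbol TEXT, price REAL)"
)
source.executemany(
    "INSERT INTO trades (symbol, price) VALUES (?, ?)",
    [("AAA", 10.0), ("BBB", 20.5)],  # made-up sample rows
)

# First scheduled run: every row is new.
batch1, watermark = fetch_new_rows(source, last_id=0)

# A row arrives between runs; the second run picks up only that row.
source.execute(
    "INSERT INTO trades (symbol, price) VALUES (?, ?)", ("CCC", 30.25)
)
batch2, watermark = fetch_new_rows(source, watermark)
```

Keeping the watermark between runs means each scheduled invocation transfers only the rows added since the previous one, rather than rereading the whole table.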
Exploring and Processing Data
The platform includes a wide range of integrated open-source data query and exploration tools, including the following:
- Apache Spark data-processing engine — including the Spark SQL and Datasets, MLlib, R, and GraphX libraries — with real-time access to the platform’s NoSQL data store and file system.
- Presto distributed SQL query engine, which can be used to run interactive SQL queries over platform NoSQL tables or other object (file) data sources.
- pandas Python analysis library, including structured DataFrames.
- Dask parallel-computing Python library, including scaled pandas DataFrames.
- Iguazio V3IO Frames — Iguazio’s open-source data-access library, which provides a unified high-performance API for accessing NoSQL, stream, and time-series data in the platform’s data store and features native integration with pandas and RAPIDS. See, for example, the frames tutorial.
- Built-in support for ML packages such as scikit-learn, Pyplot, NumPy, PyTorch, and TensorFlow.
All these tools are integrated with the platform’s Jupyter Notebook service, allowing users to access the same data from Jupyter through different interfaces with minimal configuration overhead. Users can easily install additional Python packages by using the Conda binary package and environment manager and the pip Python package installer, which are both available as part of the Jupyter Notebook service. This design, coupled with the platform’s unified data model, enables users to store and access data using different formats — such as NoSQL (“key/value”), time series, stream data, and files (simple objects) — and leverage different tools and APIs for accessing and manipulating the data, all from a single development environment (namely, Jupyter Notebook).
Building and Training Models
You can develop and test data science models in the platform’s Jupyter Notebook service or in your preferred external editor. When your model is ready, you can train it in Jupyter Notebook or by using scalable cluster resources such as Nuclio functions, Dask, Spark ML, or Kubernetes jobs. For more information about the platform demos and for download instructions, see the platform introduction. You can find model-training examples in the following demos:
- The NetOps demo demonstrates predictive infrastructure monitoring using scikit-learn.
- The image-classification demo demonstrates image recognition using TensorFlow and Horovod with MLRun.
If you’re a beginner, you might find the following ML guide useful: Machine Learning Algorithms In Layman’s Terms.
One of the most important and challenging areas of managing a data science environment is the ability to track experiments. Data scientists need a simple way to track and view current and historical experiments along with the metadata that is associated with each experiment. This capability is critical for comparing different runs, and eventually helps to determine the best model and configuration for production deployment. The platform leverages the open-source MLRun library to help tackle these challenges. You can find examples of using MLRun in the MLRun demos. See the information about getting additional demos in the platform introduction.
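To make the idea of experiment tracking concrete, here is a toy, standard-library-only sketch that records the parameters and metrics of each run and then picks the best one. It is not MLRun's API and does not reflect how MLRun stores its data; the `Run` and `RunLog` classes, the run names, and the parameter and metric values are all made up for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Run:
    """One experiment: its inputs (params) and its outcomes (metrics)."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)
    started: float = field(default_factory=time.time)


class RunLog:
    """In-memory list of runs with a simple best-run lookup."""

    def __init__(self):
        self.runs = []

    def record(self, name, params, metrics):
        run = Run(name=name, params=params, metrics=metrics)
        self.runs.append(run)
        return run

    def best(self, metric, higher_is_better=True):
        # Only runs that reported the metric are comparable.
        candidates = [r for r in self.runs if metric in r.metrics]
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])

    def to_json(self):
        return json.dumps([asdict(r) for r in self.runs], indent=2)


# Placeholder values, not real results.
log = RunLog()
log.record("run-1", {"max_depth": 3}, {"accuracy": 0.81})
log.record("run-2", {"max_depth": 5}, {"accuracy": 0.86})
log.record("run-3", {"max_depth": 8}, {"accuracy": 0.84})
best_run = log.best("accuracy")
```

Storing parameters next to metrics is what makes runs comparable later: `best_run.params` gives the configuration to promote, and `log.to_json()` gives a record that can be archived with the model.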
Deploying Models to Production
The platform allows you to easily deploy your models to production in a reproducible way by using the open-source Nuclio serverless framework. You provide Nuclio with code or Jupyter notebooks, resource definitions (such as CPU, memory, and GPU), environment variables, package or software dependencies, data links, and trigger information. Nuclio uses this information to automatically build the code, generate custom container images, and connect them to the relevant compute or data resources. The functions can be triggered by a wide variety of event sources, including the most commonly used streaming and messaging protocols, HTTP APIs, scheduled (cron) tasks, and batch jobs.
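The inputs listed above (code entry point, resource definitions, environment variables, and trigger information) can be pictured as one declarative configuration. The sketch below assembles such a configuration as a plain Python dictionary. The field names follow the layout of Nuclio's function configuration as best recalled here and should be verified against the Nuclio configuration reference before use; the `build_function_spec` helper, the function name, the model path, and the resource values are hypothetical.

```python
import json


def build_function_spec(name, handler, env, cpu_limit, memory_limit, interval):
    """Assemble a Nuclio-style function configuration as a plain dict."""
    return {
        "apiVersion": "nuclio.io/v1",
        "kind": "NuclioFunction",
        "metadata": {"name": name},
        "spec": {
            "runtime": "python",
            # "module:function" entry point of the code to run.
            "handler": handler,
            "env": [{"name": k, "value": v} for k, v in env.items()],
            "resources": {
                "limits": {"cpu": cpu_limit, "memory": memory_limit}
            },
            # A scheduled (cron) trigger that fires at a fixed interval.
            "triggers": {
                "periodic": {
                    "kind": "cron",
                    "attributes": {"interval": interval},
                }
            },
        },
    }


# All values below are illustrative assumptions.
spec = build_function_spec(
    name="score-model",
    handler="main:handler",
    env={"MODEL_PATH": "/models/example.pkl"},
    cpu_limit="1",
    memory_limit="512Mi",
    interval="60s",
)
print(json.dumps(spec, indent=2))
```

Because everything the function needs is declared in one place, the same configuration can be redeployed to reproduce the function elsewhere.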
Nuclio functions can be created from the platform dashboard or by using standard code IDEs, and can be deployed on your platform cluster. A convenient way to develop and deploy Nuclio functions is by using Jupyter Notebook and Python tools. For detailed information about Nuclio, visit the Nuclio website and see the product documentation.
For an overview of Nuclio and how to develop, document, and deploy serverless Python Nuclio functions from Jupyter Notebook, see the nuclio-jupyter documentation. You can also find examples in the platform tutorials. For example, the NetOps demo demonstrates how to deploy a network-operations model as a function; for more information about this demo and how to get it, see the platform introduction.
Visualization, Monitoring, and Logging
Data in the platform (including collected data, internal or external telemetry and logs, and program-output data) can be analyzed and visualized in several ways at the same time. The platform supports multiple standard data analytics and visualization tools, including SQL, Prometheus, Grafana, and pandas. For example, you can plot or chart data within Jupyter Notebook using Matplotlib; use your favorite BI visualization tools, such as Tableau, to query data in the platform over a Java Database Connectivity (JDBC) connector; or build real-time dashboards in Grafana.
The data analytics and visualization tools and services generate telemetry and log data that can be stored using the platform’s time-series database (TSDB) service or by using external tools such as Elasticsearch. Platform users can easily instrument code and functions to collect various statistics or logs, and explore the collected data in real time.
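One simple way to instrument code is to wrap a function so that every call emits a timestamped measurement. The following is a minimal sketch under stated assumptions: the `timed` decorator, the `score` function, and the metric name are hypothetical, and the `samples` list is an in-memory stand-in for a time-series store. Sending the samples to the platform's TSDB service is not shown.

```python
import functools
import time

# In-memory stand-in for a time-series store.
samples = []


def timed(metric_name):
    """Decorator that records the latency of every call as one sample."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the sample even when the wrapped call raises.
                samples.append({
                    "time": time.time(),
                    "metric": metric_name,
                    "value": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@timed("score_latency_seconds")
def score(values):
    """Hypothetical workload: average a list of numbers."""
    return sum(values) / len(values)


average = score([1.0, 2.0, 3.0])
```

Each sample carries a timestamp, a metric name, and a value, which is the general shape a time-series database expects and that dashboards can chart over time.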
The Grafana open-source analytics and monitoring framework is natively integrated into the platform, allowing users to create dashboards that provide access to platform NoSQL tables and time-series databases from different dashboard widgets. You can also create Grafana dashboards programmatically (for example, from Jupyter Notebook) using wizard scripts. For information on how to create Grafana dashboards to monitor and visualize data in the platform, see Adding a Custom Grafana Dashboard.