Introducing the Platform
The Iguazio Data Science Platform (“the platform”) is a fully integrated and secure data science platform as a service (PaaS), which simplifies development, accelerates performance, facilitates collaboration, and addresses operational challenges. The platform incorporates the following components:
- A data science workbench that includes Jupyter Notebook, integrated analytics engines, and Python packages
- Model management with experiment tracking and automated pipeline capabilities
- Managed data and machine-learning (ML) services over a scalable Kubernetes cluster
- A real-time serverless functions framework — Nuclio
- An extremely fast and secure data layer that supports SQL, NoSQL, time-series databases, files (simple objects), and streaming
- Integration with third-party data sources such as Amazon S3, HDFS, SQL databases, and streaming or messaging protocols
- Real-time dashboards based on Grafana
The platform uses Kubernetes (k8s) as the baseline cluster manager and deploys various application microservices on top of Kubernetes to address different data science tasks. Most of the provided services support scaling out and GPU acceleration, and have secure, low-latency access to the platform’s shared data store and file system, which enables high performance and scalability with efficient use of resources.
The platform makes extensive use of Nuclio serverless functions to automate tasks such as data collection, extract-transform-load (ETL) processes, model serving, and batch jobs. A Nuclio function describes the code and includes all the resource definitions and configuration required to run it. Functions scale automatically and can be versioned. The platform supports several methods for generating Nuclio functions (the graphical dashboard, Docker, Git, or Jupyter Notebook), as demonstrated in the platform tutorials.
For a more in-depth introduction to the platform, see the following resources:
- Components, Services, and Development Ecosystem
- Introduction video (available also as part of the Iguazio Trial Quick-Start tutorial)
- Creating and deploying Nuclio functions with Python and Jupyter Notebook
A good place to start your development is with the platform tutorial Jupyter notebooks, which are available in the home directory of the platform’s Jupyter Notebook service; see especially the getting-started examples and full use-case demo applications. You can find a tutorials overview in the Jupyter Notebook Basics section of this document.
Data Science Workflow
The Iguazio Data Science Platform provides a complete data science workflow in a single ready-to-use platform that includes all the required building blocks for creating data science applications from research to production:
- Collect, explore, and label data from various real-time or offline sources
- Run ML training and validation at scale over multiple CPUs and GPUs
- Deploy models and applications into production with serverless functions
- Log, monitor, and visualize all your data and services
Collecting and Ingesting Data
There are many ways to collect and ingest data from various sources into the platform:
- Streaming data in real time from sources such as Kafka, Kinesis, Azure Event Hubs, or Google Pub/Sub.
- Loading data directly from external databases using an event-driven or periodic/scheduled implementation. See the explanation and examples in the read-external-db tutorial.
- Loading files (objects), in any format (for example, CSV, Parquet, JSON, or a binary image), from internal or external sources such as Amazon S3 or Hadoop. See, for example, the file-access tutorial.
- Importing time-series telemetry data using a Prometheus-compatible scraping API.
- Ingesting (writing) data directly into the system using RESTful AWS-like simple-object, streaming, or NoSQL APIs. See the platform’s Web-API References.
- Scraping or reading data from external sources — such as Twitter, weather services, or stock-trading data services — using serverless functions. See, for example, the stocks demo use-case application.
For more information and examples of data collection and ingestion with the platform, see the collect-n-explore tutorial Jupyter notebook.
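As a rough illustration of the periodic/scheduled loading pattern listed above, the following sketch pulls only rows that were added since the previous run, using a "watermark" (the last ingested row ID). It is not taken from the read-external-db tutorial: the in-memory SQLite database, the `trades` table, the sample rows, and the `sink` list are all hypothetical stand-ins for an external database and a platform table.

```python
import sqlite3


def fetch_new_rows(conn, last_id):
    """Return rows whose id is greater than last_id, oldest first."""
    cur = conn.execute(
        "SELECT id, symbol, price FROM trades WHERE id > ? ORDER BY id",
        (last_id,),
    )
    return cur.fetchall()


def ingest(rows, sink):
    """Append rows to the sink and return the highest id seen (None if no rows)."""
    sink.extend(rows)
    return rows[-1][0] if rows else None


# Stand-in for an external database (hypothetical schema and made-up data).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE trades (id INTEGER PRIMARY KEY, symbol TEXT, price REAL)"
)
source.executemany(
    "INSERT INTO trades (symbol, price) VALUES (?, ?)",
    [("AAA", 10.0), ("BBB", 20.5)],
)

sink = []      # stand-in for the destination table
watermark = 0  # id of the last row that was ingested

# First scheduled run: picks up both existing rows.
watermark = ingest(fetch_new_rows(source, watermark), sink) or watermark

# New data arrives in the source; the second run picks up only the new row.
source.execute("INSERT INTO trades (symbol, price) VALUES (?, ?)", ("CCC", 7.25))
watermark = ingest(fetch_new_rows(source, watermark), sink) or watermark
```

In a real deployment, the body of each run would typically live in a scheduled or event-driven function, and the watermark would need to be stored somewhere durable between runs.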
Exploring and Processing Data
The platform includes a wide range of integrated open-source data query and exploration tools, including the following:
- Apache Spark data-processing engine — including the Spark SQL and Datasets, MLlib, R, and GraphX libraries — with real-time access to the platform’s NoSQL data store and file system.
- Presto distributed SQL query engine, which can be used to run interactive SQL queries over platform NoSQL tables or other object (file) data sources.
- pandas Python analysis library, including structured DataFrames.
- Dask parallel-computing Python library, including scaled pandas DataFrames.
- Iguazio V3IO Frames [Tech Preview]: Iguazio’s open-source data-access library, which provides a unified high-performance API for accessing NoSQL, stream, and time-series data in the platform’s data store and features native integration with pandas and RAPIDS. See, for example, the frames tutorial.
- Built-in support for ML packages such as scikit-learn, Pyplot, NumPy, PyTorch, and TensorFlow.
All these tools are integrated with the platform’s Jupyter Notebook service, allowing users to access the same data from Jupyter through different interfaces with minimal configuration overhead. Users can easily install additional Python packages by using the Conda binary package and environment manager and the pip Python package installer, which are both available as part of the Jupyter Notebook service. This design, coupled with the platform’s unified data model, enables users to store and access data using different formats — such as NoSQL (“key/value”), time series, stream data, and files (simple objects) — and leverage different tools and APIs for accessing and manipulating the data, all from a single development environment (namely, Jupyter Notebook).
For more information and examples of data exploration with the platform, see the collect-n-explore tutorial Jupyter notebook.
Building and Training Models
You can develop and test data science models in the platform’s Jupyter Notebook service or in your preferred external editor. When your model is ready, you can train it in Jupyter Notebook or by using scalable cluster resources such as Nuclio functions, Dask, Spark ML, or Kubernetes jobs. You can find model-training examples in the platform’s tutorial Jupyter notebooks:
- The NetOps demo tutorial demonstrates predictive infrastructure-monitoring using scikit-learn.
- The image-classification demo tutorial demonstrates image recognition using TensorFlow and Horovod with MLRun.
If you’re a beginner, you might find the following ML guide useful: Machine Learning Algorithms In Layman’s Terms.
One of the most important and challenging areas of managing a data science environment is the ability to track experiments. Data scientists need a simple way to track and view current and historical experiments along with the metadata that is associated with each experiment. This capability is critical for comparing different runs, and eventually helps to determine the best model and configuration for production deployment. The platform leverages the open-source MLRun library to help tackle these challenges. You can find examples of using MLRun in the experiment tracking directory of the platform tutorial Jupyter notebooks.
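To make the idea of experiment tracking concrete, here is a minimal standard-library sketch of what a tracker records (run name, parameters, metrics) and how runs can be compared. This is not the MLRun API; the `Run` and `ExperimentLog` classes, the parameter names, and the accuracy values are invented purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run: its name, input parameters, and resulting metrics."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Toy in-memory tracker (hypothetical; a real tracker persists its runs)."""

    def __init__(self):
        self.runs = []

    def log_run(self, name, params, metrics):
        run = Run(name, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric, higher_is_better=True):
        """Return the run with the best value for the given metric."""
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run recorded the metric {metric!r}")
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


# Made-up runs that differ in a single hyperparameter.
log = ExperimentLog()
log.log_run("run-1", {"max_depth": 3}, {"accuracy": 0.81})
log.log_run("run-2", {"max_depth": 5}, {"accuracy": 0.86})
log.log_run("run-3", {"max_depth": 8}, {"accuracy": 0.84})

best_run = log.best("accuracy")
```

Keeping parameters and metrics together per run is what makes the later comparison possible; a dedicated library adds persistence, artifacts, and a UI on top of the same basic record.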
Deploying Models to Production
The platform allows you to easily deploy your models to production in a reproducible way by using the open-source Nuclio serverless framework. You provide Nuclio with code or Jupyter notebooks, resource definitions (such as CPU, memory, and GPU), environment variables, package or software dependencies, data links, and trigger information. Nuclio uses this information to automatically build the code, generate custom container images, and connect them to the relevant compute or data resources. The functions can be triggered by a wide variety of event sources, including the most commonly used streaming and messaging protocols, HTTP APIs, scheduled (cron) tasks, and batch jobs.
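The following is a minimal sketch of what a Python model-serving function might look like. It assumes the standard Nuclio Python entry point, `handler(context, event)`, with the request payload in `event.body` and a logger at `context.logger`. The "model" is a hypothetical fixed linear scorer, and the `fake_context` and `fake_event` objects are local stubs used only to exercise the handler outside Nuclio; resource definitions, triggers, and dependencies are configured separately and are not shown.

```python
import json
from types import SimpleNamespace

# Hypothetical stand-in for a trained model: a fixed linear scorer.
WEIGHTS = [0.5, -0.25]
BIAS = 1.0


def predict(features):
    """Score a feature vector with the stand-in linear model."""
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


def handler(context, event):
    """Entry point: read a JSON body, score it, and return a JSON string."""
    body = event.body
    # The body may arrive as raw bytes/text or as an already-decoded object.
    if isinstance(body, (bytes, bytearray, str)):
        body = json.loads(body)
    score = predict(body["features"])
    context.logger.info("scored request")
    return json.dumps({"score": score})


# Local smoke test with stub objects (assumption: not part of a deployment).
fake_context = SimpleNamespace(
    logger=SimpleNamespace(info=lambda *args, **kwargs: None)
)
fake_event = SimpleNamespace(body=b'{"features": [2.0, 4.0]}')
result = handler(fake_context, fake_event)
```

Keeping the scoring logic in a plain function (`predict`) separate from the handler makes it easy to test the model code on its own before it is wrapped in a function.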
Nuclio functions can be created from the platform dashboard or by using standard code IDEs, and can be deployed on your platform cluster. A convenient way to develop and deploy Nuclio functions is by using Jupyter Notebook and Python tools. For detailed information about Nuclio, visit the Nuclio website and see the product documentation.
For an overview of Nuclio and how to develop, document, and deploy serverless Python Nuclio functions from Jupyter Notebook, see the nuclio-jupyter documentation. You can also find examples in the platform tutorial Jupyter notebooks; for example, the NetOps demo tutorial demonstrates how to deploy a network-operations model as a function.
Visualization, Monitoring, and Logging
Data in the platform (including collected data, internal or external telemetry and logs, and program-output data) can be analyzed and visualized in different ways simultaneously. The platform supports multiple standard data analytics and visualization tools, including SQL, Prometheus, Grafana, and pandas. For example, you can plot or chart data within Jupyter Notebook using Matplotlib; use your favorite BI visualization tools, such as Tableau, to query data in the platform over a Java Database Connectivity (JDBC) connector; or build real-time dashboards in Grafana.
The data analytics and visualization tools and services generate telemetry and log data that can be stored using the platform’s time-series database (TSDB) service or by using external tools such as Elasticsearch. Platform users can easily instrument code and functions to collect various statistics or logs, and explore the collected data in real time.
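As one possible way to instrument code, the sketch below uses a decorator to record how long each call to a function takes. The `samples` list is a hypothetical stand-in for a time-series sink, and the metric name and the `normalize` function are invented for the example; writing the samples to the platform’s TSDB service is not shown.

```python
import time
from functools import wraps

# Stand-in for a time-series sink; each entry is (timestamp, metric, value).
samples = []


def timed(metric):
    """Decorator that records the wall-clock duration of each call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the sample even if the wrapped call raises.
                elapsed = time.perf_counter() - start
                samples.append((time.time(), metric, elapsed))
        return wrapper
    return decorator


@timed("normalize_seconds")
def normalize(values):
    """Scale values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


shares = normalize([1, 1, 2])
```

Each recorded tuple has the shape of a time-series sample (time, metric name, value), so a collection like `samples` could later be flushed in batches to whichever store is used for monitoring.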
The Grafana open-source analytics and monitoring framework is natively integrated into the platform, allowing users to create dashboards that provide access to platform NoSQL tables and time-series databases from different dashboard widgets. You can also create Grafana dashboards programmatically (for example, from Jupyter Notebook) using wizard scripts. For information on how to create Grafana dashboards to monitor and visualize data in the platform, see Adding a Custom Grafana Dashboard.
Jupyter Notebook Basics
The platform’s Jupyter Notebook service displays the JupyterLab UI, which consists of a collapsible left sidebar, a main work area (on the right), and a top menu bar. For details, see the JupyterLab documentation.
The main work area contains tabs of documents and activities for creating, viewing, editing, and running interactive notebooks, shell terminals, or consoles, as well as for viewing and editing other common file types.
To create a new notebook or terminal, select the + icon from the top action toolbar in the left sidebar.
The top menu bar exposes available top-level actions, such as exporting a notebook in a different format.
The left-sidebar menu contains commonly used tabs, including a
The home directory of the platform’s Jupyter Notebook service contains the following files and directories:
- A v3io directory, which displays the contents of the v3io platform cluster data mount for browsing the contents of the cluster’s data containers. You can also browse the contents of the data containers from the Data page of the dashboard.
- The contents of the running-user home directory, users/<running user>. This directory contains the platform’s tutorial Jupyter notebooks:
  - welcome.ipynb / README.md: a short introduction to the platform and how to use it to implement a full data science workflow, similar to the documentation on this page.
  - getting-started: a directory containing getting-started tutorials that explain and demonstrate how to perform different platform operations using the platform APIs and integrated tools.
  - demos: a directory containing end-to-end use-case application demos. See the README.md file in this directory (available also as a Jupyter notebook) for a full overview of these applications.
- A script and related notebook for updating the tutorial notebooks. See Updating the Tutorial Notebooks to the Latest Version.
For information about the predefined data containers and how to reference data in these containers, see the Working with Data Containers tutorial.
Creating Virtual Environments in Jupyter Notebook
A virtual environment is a named, isolated, working copy of Python that maintains its own files, directories, and paths so that you can work with specific versions of libraries or Python itself without affecting other Python projects.
Virtual environments make it easy to cleanly separate projects and avoid problems with different dependencies and version requirements across components.
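The isolation described above can be seen in a small sketch that creates an environment programmatically. Note the substitution: the platform’s Jupyter Notebook service provides Conda and pip, but this example uses Python’s standard `venv` module instead, only because it runs without extra tooling. The directory name `demo-env` is arbitrary, and pip is deliberately left out of the new environment to keep the example fast.

```python
import os
import tempfile
import venv

# Create the environment under a temporary directory (arbitrary location).
base = tempfile.mkdtemp()
env_dir = os.path.join(base, "demo-env")

# with_pip=False skips bootstrapping pip into the new environment.
venv.create(env_dir, with_pip=False)

# Every venv has its own configuration file and its own executables directory.
config_path = os.path.join(env_dir, "pyvenv.cfg")
bin_dir = os.path.join(env_dir, "Scripts" if os.name == "nt" else "bin")
```

The environment keeps its own interpreter entry point and package directories under `env_dir`, which is what allows different projects to pin different library versions without interfering with each other.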
Updating the Tutorial Notebooks to the Latest Version
- Most sections of the documentation site are versioned. Make sure that you’re reading the documentation that matches your version of the platform. You can see the current documentation version and select a different version from the version-selection menu at the top of the section-navigation side menu.
- The documentation in the Specifications and Release Notes sections is confidential and restricted to registered users only. For more information, contact email@example.com.
- Setup and Configuration
- Getting-Started Tutorials and Guides
- Components, Services, and Development Ecosystem
- APIs Overview and References
- Specifications — including a support matrix and software specifications and restrictions
- Iguazio sample data set public AWS S3 bucket
General Development Resources
- JupyterLab documentation
- 10 Minutes to pandas
- Machine Learning Algorithms In Layman’s Terms
- Registry of Open Data on AWS
The Iguazio support team will be happy to assist with any questions.