Components, Services, and Development Ecosystem
The Iguazio Data Science Platform (“the platform”) is an open platform that includes standard open-source analytics and machine-learning tools and integrates seamlessly with popular and industry-standard data frameworks and applications. The platform features an extremely fast multi-model data layer, but also allows access to external data sources (such as relational database management systems (RDBMSs), traditional Hadoop “data lakes”, and AWS S3) and provides a variety of interfaces for accessing your data. All this, combined with its core driver’s indexing, processing, and pre-aggregation capabilities and a friendly user interface, amounts to a powerful high-performance data science platform, built for production, with platform as a service (PaaS) ease of use. The provided turnkey solution is used by customers looking for a single platform that supports, facilitates, and expedites the implementation of a full data science pipeline, including data collection and exploration, building and training of machine-learning (ML) models, and deployment of models and applications into production. The following sections introduce the platform’s main components and services and the overall development ecosystem:
The dashboard is the platform’s graphical user interface and is also your entry point to your platform cluster. Go to the dashboard URL from any web browser and log in with your platform login credentials. The dashboard allows you to manage and monitor platform activity — including the ability to view and browse the contents of data containers, create and delete containers, add and delete data objects, manage users and application services, set security rules, and monitor performance. Note especially the following dashboard pages:
Services: displays information about application services for the logged-in user, with options to create, run, configure, monitor, enable, disable, restart, and delete services from a single interface. (Note that the available capabilities depend on the permissions of the logged-in user.) For more information, see the Application Services and Tools section in this document.
Except where otherwise specified, when you open a service from the dashboard you are automatically logged into the service as the active dashboard user. Note that you need to add a security exception upon the first login to any of the HTTPS URLs.
- Browse the contents of your cluster’s data containers, organized in a hierarchical order, including directories and files, objects, data streams, and time-series and NoSQL tables with their partitions, items (rows), and attributes (columns); you can also view related metadata. For more information, see the Working with Data Containers tutorial.
- Create and delete data containers and container directories, and upload (ingest) and download (consume) data files. For more information, see the Working with Data Containers and Ingesting and Consuming Files tutorials.
- Define and manage data-access policies. For more information, see Security.
- View performance statistics for your data containers.
Functions: allows you to view, define, and deploy Nuclio serverless functions. For more information, see the Nuclio Serverless Framework overview in this document.
Identity: allows security administrators to create and delete platform users and user groups and edit their management policies, which determine their permissions to access data and perform different tasks. For more information, see Platform Users and Security.
The platform has a built-in multi-model data layer (a.k.a. “data store” or “database”) for storing and analyzing various types of data structures — such as NoSQL (“key/value”) tables, time-series databases, data streams, binary objects, and files. The data can be accessed through multiple industry-standard and industry-compatible programming interfaces; users can ingest data through one interface and consume it through another interface, depending on their preferences and needs. This unique unified data model eliminates the need for multiple data stores, constant synchronization, complex pipelines, and painful extract-transform-load (ETL) processes. The following table shows the provided programming interfaces for working with different types of data in the platform’s data store:
|NoSQL (Wide-Column Key/Value) Data||The platform’s NoSQL data store was built to take advantage of a distributed cluster of physical and virtual machines that use flash memory to deliver in-memory performance while keeping flash economy and density. You can access NoSQL data through these interfaces:|
|SQL Data||You can work with SQL data in the platform through these interfaces:|
|Time-Series Data||You can create and manage time-series databases in the platform’s data store through these interfaces:|
|Streaming Data||You can stream data directly into the platform and consume data from platform streams through the following interfaces:|
|File / Simple Data Object||You can work with data files and simple data objects (such as CSV, Parquet, or Avro files, or binary image or video files) through these interfaces:|
- See APIs Overview for a summary of the platform APIs; the Platform Fundamentals tutorial for explanations on how to set the data paths for each API; and References for comprehensive references.
- See the platform’s tutorial Jupyter notebooks for code examples and full use-case applications that demonstrate how to use the different APIs.
- The platform’s web APIs (for working with NoSQL, streaming, and simple-object data) are exposed as an application service. The API endpoint URL of this service is available from the dashboard Services page. For an introduction to working with the web APIs, see Sending Web-API Requests in the Platform Fundamentals tutorial.
- See also the Application Services and Tools section in this document for information on related application services — and specifically Spark, Presto, Time-Series Database (TSDB) Services, pandas and Dask, and V3IO Frames.
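As a rough illustration of ingesting data through one of these interfaces, the following sketch builds (but does not send) a NoSQL web-API request with the Python standard library. The endpoint URL, container, table, and attribute names are placeholders. The HTTP method, the `X-v3io-function` header, and the JSON layout are assumptions about the general shape of a PutItem request, so check the web-API reference for the exact format and for the required authentication headers.

```python
import json
import urllib.request


def build_put_item_request(api_url, container, table, key, attributes):
    """Build (but do not send) a PutItem request for one NoSQL table item.

    Assumption: the request targets <api_url>/<container>/<table>/ and selects
    the operation with an "X-v3io-function" header; verify this against the
    NoSQL web-API reference before use.
    """
    def encode(value):
        # Numbers are sent as strings under "N"; everything else under "S".
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return {"N": str(value)}
        return {"S": str(value)}

    body = {
        "Key": {name: encode(value) for name, value in key.items()},
        "Item": {name: encode(value) for name, value in attributes.items()},
    }
    url = f"{api_url.rstrip('/')}/{container}/{table}/"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-v3io-function": "PutItem",
        },
        method="PUT",
    )


# Placeholder endpoint: copy the real web-APIs URL from the dashboard.
request = build_put_item_request(
    "https://webapi.example.com", "mycontainer", "sensors",
    key={"sensor_id": "s-17"},
    attributes={"temperature": 21.5, "site": "lab"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Once an item is written this way, the same table could be read through another interface, such as Presto or Spark, without copying the data.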
Application Services and Tools
Dask | Docker Registry | Elasticsearch | Frames | GPU | Grafana | Hadoop | Hive | Horovod | Jupyter | Log Forwarder | Looker | Monitoring | Nuclio | NumPy | pandas | Presto | Prometheus | Pyplot | PyTorch | QlikView | RAPIDS | scikit-learn | Spark | Tableau | TensorFlow | TSDB CLI | TSDB Nuclio Functions | Web Shell | Zeppelin
In addition to its core data services, the platform comes pre-deployed with essential and useful proprietary and third-party open-source tools and libraries that facilitate the implementation of a full data science workflow, from data collection to production (see Introducing the Platform). Both built-in and integrated tools are exposed to the user as application services that are managed by the platform using Kubernetes. Each application is packaged as a logical unit within a Docker container and is fully orchestrated by Kubernetes, which automates the deployment, scaling, and management of each containerized application. This provides users with the ability and flexibility to run any application anywhere, as part of their operational pipeline.
The application services can be viewed and managed from the dashboard
The platform’s application development ecosystem includes the following:
- Distributed data frameworks and engines — such as Spark, Presto, Horovod, and Hadoop.
- The Nuclio serverless framework.
- Enhanced support for time-series databases (TSDBs) — including a CLI tool, serverless functions, and integration with Prometheus.
- Jupyter Notebook and Zeppelin interactive web notebooks for development and testing of data science and general data applications.
- A web-based shell service and Jupyter terminals, which provide bash command-line shells for running application services and performing basic file-system operations.
- Integration with popular machine-learning and scientific-computing packages for development of ML and artificial intelligence (AI) applications — such as TensorFlow, Keras, scikit-learn, PyTorch, Pyplot, and NumPy.
- Integration with common Python libraries that enable high-performance Python-based data processing, such as pandas, Dask, and RAPIDS.
- The V3IO Frames open-source unified high-performance DataFrame API library for working with NoSQL, stream, and time-series data in the platform.
- Support for executing code over GPUs.
- Integration with data analytics, monitoring, and visualization tools, including built-in integration with the open-source Grafana metric analytics and monitoring tool and easy integration with commercial business-intelligence (BI) analytics and visualization tools such as Tableau, Looker, and QlikView.
- Logging and monitoring services for monitoring, indexing, and viewing application-service logs — including a log-forwarder service and integration with Elasticsearch.
Following are overviews of the main application services and tools that can be run on the platform.
Nuclio Serverless Framework
Iguazio’s Nuclio Enterprise Edition serverless-functions framework — a leading open-source project for converging and simplifying data management — is integrated into the platform. Nuclio is a high-performance low-latency framework that supports execution over CPUs or GPUs and a wide array of tools and event triggers, providing users with a complete cloud experience of data services, ML, AI, and serverless functionality — all delivered in a single integrated and self-managed offering at the edge, on-premises (“on-prem”), or in a hosted cloud.
You can use Nuclio functions, for example, to
- Collect and ingest data into the platform and consume (query) the data on an ongoing basis. Nuclio offers built-in function templates for collecting data from common sources, such as Apache Kafka streams or databases, including examples of data enrichment and data-pipeline processing.
- Run machine-learning models in the serving layer, supporting high throughput on demand and elastic resource allocation.
Nuclio can be easily integrated with Jupyter Notebook, enabling users to develop their entire code (model, feature vector, and application) in Jupyter Notebook and use a single command to deploy the code as a serverless function that runs in the serving layer. For examples, see the platform’s tutorial Jupyter notebooks. For more information about Nuclio, see the platform’s Serverless Functions (Nuclio) introduction.
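To give a feel for what such a function looks like, here is a minimal sketch of a Python handler with the `handler(context, event)` entry-point signature that Nuclio uses for Python functions. The sensor payload, the alert threshold, and the local stand-in objects are hypothetical, and the sketch assumes that the event body arrives as raw JSON bytes.

```python
import json
from types import SimpleNamespace


def handler(context, event):
    """Nuclio-style entry point: parse a JSON reading and flag high values."""
    reading = json.loads(event.body)
    context.logger.info("received reading from " + reading["sensor_id"])
    # Hypothetical enrichment rule, used only for illustration.
    reading["alert"] = reading["temperature"] > 30.0
    return json.dumps(reading)


# Local stand-ins for the context and event objects that the Nuclio runtime
# would normally pass to the handler.
fake_context = SimpleNamespace(logger=SimpleNamespace(info=print))
fake_event = SimpleNamespace(body=b'{"sensor_id": "s-17", "temperature": 31.2}')
result = handler(fake_context, fake_event)
print(result)
```

Keeping the handler free of platform-specific setup makes it easy to exercise locally, for example in a notebook cell, before deploying it as a serverless function.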
The platform exposes Nuclio as a default tenant-wide service and allows users to disable and restart the service as well as configure the Docker Registry for storing the Nuclio function images.
Configuring the Docker Registry for the Nuclio Service
By default, the Nuclio service on the default tenant is configured to work with a predefined default tenant-wide “Docker Registry” service, which uses a pre-deployed local on-cluster Docker Registry. However, you can create your own Docker Registry service for working with a remote off-cluster Docker Registry, and change the configuration of the Nuclio service to work with your Docker Registry service and use your registry to store the Nuclio function images.
Jupyter is a project for development of open-source software, standards, and services for interactive computing across multiple programming languages. The platform comes preinstalled with the JupyterLab web-based user interface, including Jupyter Notebook and JupyterLab Terminals, which are available via a Jupyter Notebook user application service.
Jupyter Notebook is an open-source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text; it’s currently the leading industry tool for data exploration and training. Jupyter Notebook supports integration with all key analytics services, enabling users to perform all stages of the data science flow, from data collection to production, from a single interface using various APIs and tools to concurrently access the same data without having to move the data. Your Jupyter Notebook code can execute Spark jobs (for example, using Spark DataFrames); run SQL queries using Presto; define, deploy, and trigger Nuclio serverless functions; send web-API requests; use pandas and V3IO Frames DataFrames; use the Dask library to scale the use of pandas DataFrames; and more. You can use Conda and pip, which are available as part of the Jupyter Notebook service, to easily install Python packages such as pandas and Dask and machine-learning packages. In addition, you can use Jupyter terminals to execute shell commands, such as file-system and installation commands. As part of the configuration of the platform’s Jupyter Notebook service you select a specific Jupyter flavor and you can optionally define environment variables for the service.
Iguazio provides tutorial Jupyter notebooks with code examples ranging from getting-started examples to full end-to-end demo applications, including detailed documentation.
Start out by reading the introductory
In version 2.3.1 of the platform, you can set the custom
- Jupyter: a basic version of Jupyter for execution over CPUs, which includes basic Python libraries (such as pandas and scikit-learn) and Presto.
- Jupyter Spark: includes all components of the “Jupyter” flavor as well as Spark.
- Jupyter Deep Learning: includes all components of the “Jupyter” flavor as well as deep-learning Python libraries (such as TensorFlow, PyTorch, and Keras).
- Jupyter Full Stack (default): a full version of Jupyter for execution over CPUs, which includes all components of the “Jupyter”, “Jupyter Spark”, and “Jupyter Deep Learning” flavors.
- Jupyter Deep Learning + GPU [Tech Preview]: includes all components of the “Jupyter Deep Learning” flavor as well as support for executing code over GPUs using NVIDIA CUDA.
NOTE: In the current release, to use this flavor with the Horovod framework for running distributed ML training jobs over GPUs, you must first contact Iguazio’s customer-success team; the team will deploy the Horovod service and provide you with additional guidance. For more information about the platform’s support for Horovod, see the GPU Support section.
- Jupyter Deep Learning with Rapids [Tech Preview]: includes all components of the “Jupyter Deep Learning + GPU” flavor as well as additional software for using NVIDIA RAPIDS to execute code over GPUs (including the cuDF and cuML RAPIDS libraries and the NVIDIA CUDA Toolkit).
NOTE: In the current release, you must contact Iguazio’s customer-success team before first selecting this flavor, otherwise the service will fail to deploy; the team will make the required changes for using RAPIDS and provide you with additional guidance. For more information about the platform’s support for RAPIDS, see the GPU Support section.
Machine-Learning and Scientific-Computing Packages
You can easily install ML and scientific-computing packages (such as TensorFlow, Keras, scikit-learn, PyTorch, Pyplot, and NumPy) on your platform cluster, for example, from Jupyter Notebook. The platform’s architecture was designed to deploy computation to one or more CPUs or GPUs with a single Python API.
For example, you can install the TensorFlow open-source library for numerical computation using data-flow graphs. You can use TensorFlow to train a logistic regression model for prediction or a deep-learning model, and then deploy the same model in production over the same platform instance as part of your operational pipeline. The data science and training portion can be developed using recent field data, while the development-to-production workflow is automated and time to insights is significantly reduced. All the required functionality is available on a single platform with enterprise-grade security and a fine-grained access policy, providing you with visibility into the data based on the organizational needs of each team. The following Python code sample demonstrates the simplicity of using the platform to train a TensorFlow model and evaluate the quality of the model’s predictions:
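    # Train the model (model, input_fn, and the data variables are defined earlier).
    model.train(
        input_fn=lambda: input_fn(train_data, num_epochs, True, batch_size))
    # Evaluate the trained model on the test data.
    results = model.evaluate(
        input_fn=lambda: input_fn(test_data, 1, False, batch_size))
    # Print each evaluation metric.
    for key in sorted(results):
        print('%s: %s' % (key, results[key]))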
Time-Series Database (TSDB) Services
Time-series databases (TSDBs) are used for storing time-series data, which is a series of time-based data points. The platform features enhanced built-in support for working with TSDBs, which includes a rich set of features for efficiently analyzing and storing time-series data. The platform uses the Iguazio V3IO TSDB open-source library, which exposes a high-performance API for working with TSDBs, including creating and deleting TSDB instances (tables) and ingesting and consuming (querying) TSDB data. This API can be consumed in various ways:
- Use the V3IO TSDB command-line interface (CLI) tool (tsdbctl), which is pre-deployed in the platform, to easily create, delete, and manage TSDB instances (tables) in the platform’s data store, ingest metrics into such tables, and issue TSDB queries. The CLI can be run locally on a platform cluster (from a command-line shell interface, such as the web-based shell or a Jupyter terminal or notebook) or remotely from any computer with a network connection to the cluster. The platform’s web shell and Jupyter terminal environments predefine a tsdbctl alias to the native CLI that preconfigures the URL of the web-APIs service and the authentication access key for the running user of the parent shell or Jupyter Notebook service.
- Use the platform’s TSDB Nuclio functions service to generate serverless V3IO TSDB Nuclio functions for ingesting and consuming TSDB data.
- Use the Prometheus service to run TSDB queries. Prometheus is an open-source systems monitoring and alerting toolkit that features a dimensional data model and a flexible query language. The platform’s Prometheus service uses the pre-deployed Iguazio V3IO Prometheus distribution, which packages Prometheus with the V3IO TSDB library for a robust, scalable, and high-performance TSDB solution.
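For the Prometheus option, a query is typically sent to the standard Prometheus HTTP API. The following sketch only builds a range-query URL for the `/api/v1/query_range` endpoint. The service URL, the metric name `cpu`, and the `host` label are placeholders, and how TSDB tables map to metric names in the platform's Prometheus service is not covered here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_range_query_url(prometheus_url, promql, start, end, step):
    """Return a URL for the standard Prometheus /api/v1/query_range endpoint."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{prometheus_url.rstrip('/')}/api/v1/query_range?{params}"


# Placeholder service URL and metric; start and end are Unix timestamps.
url = build_range_query_url(
    "https://prometheus.example.com",
    'avg_over_time(cpu{host="A"}[5m])',
    start=1569888000, end=1569891600, step="60s",
)
print(url)

# Decoding the query string recovers the original PromQL expression.
print(parse_qs(urlsplit(url).query)["query"][0])
```

The resulting URL can be fetched with any HTTP client that can reach the Prometheus service.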
pandas and Dask
pandas is an open-source Python library for high-performance data processing using structured DataFrames (“pandas DataFrames”). Dask is a parallel-computing Python library that features scaled pandas DataFrames. You can easily install these tools on the platform (for example, by using pip or Conda, which are pre-deployed as part of the platform’s Jupyter Notebook service) and use them to perform fast Python-based data processing. For more information and examples, see the platform’s tutorial Jupyter notebooks.
Iguazio V3IO Frames is an open-source data-access library that provides a unified high-performance DataFrame API for working with NoSQL, stream, and time-series (TSDB) data in the platform’s data store.
To use this library, create a shared tenant-wide instance of the V3IO Frames service.
You can find many examples of using this library in the platform’s tutorial Jupyter notebooks.
See specifically the
Presto SQL Engine
Presto is an open-source distributed SQL query engine for running interactive analytic queries. The platform has a pre-deployed tenant-wide Presto service that can be used to run SQL queries and perform high-performance low-latency interactive data analytics. You can ingest data into the platform using your preferred method — such as using Spark, the NoSQL Web API, a Nuclio function, or V3IO Frames — and use Presto to analyze the data interactively with the aid of your preferred visualization tool. Running Presto over the platform’s data services allows you to filter data as close as possible to the source.
You can run SQL commands that use ANSI SQL
The Iguazio Presto connector enables you to use Presto to run queries on data in the platform’s NoSQL store — including support for partitioning, predicate pushdown, and column pruning, which enables users to optimize their queries. You can also use Presto’s built-in Hive connector to query data of the supported file types, such as Parquet or ORC, or to save table-query views to the default Hive schema. Note that to use the Hive connector, you first need to create a Hive Metastore by enabling Hive for the platform’s Presto service. For more information, see Using the Hive Connector in the Presto reference overview.
The platform also has a built-in process that uses Presto SQL to create a Hive view that monitors both real-time data in the platform’s NoSQL store and historical data in Parquet or ORC tables
The platform is integrated with the Apache Spark data engine for large-scale data processing, which is available as a user application service. You can use Spark together with other platform services to run SQL queries, stream data, and perform complex data analysis — both on data that is stored in the platform’s data store and on external data sources such as RDBMSs or traditional Hadoop “data lakes”. The support for Spark is powered by a stack of Spark libraries that include Spark SQL and DataFrames for working with structured data, Spark Streaming for streaming data, MLlib for machine learning, and GraphX for graphs and graph-parallel computation. You can combine these libraries seamlessly in the same application.
Spark is fully optimized when running on top of the platform’s data services, including data filtering as close as possible to the source by implementing predicate pushdown and column-pruning in the processing engine. Predicate pushdown and column pruning can optimize your query, for example, by filtering data before it is transferred over the network, filtering data before loading it into memory, or skipping reading entire files or chunks of files.
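The effect of predicate pushdown and column pruning can be illustrated with a small, self-contained simulation. This is a conceptual sketch in plain Python, not Spark or platform code: the `scan` function stands in for a data source that can filter and project rows itself, and the "cost" is simply the number of values that would leave the source.

```python
ROWS = [
    {"id": 1, "city": "Paris", "temp": 21, "humidity": 40},
    {"id": 2, "city": "Oslo", "temp": 9, "humidity": 70},
    {"id": 3, "city": "Cairo", "temp": 35, "humidity": 20},
    {"id": 4, "city": "Lima", "temp": 19, "humidity": 80},
]


def scan(rows, predicate=None, columns=None):
    """Simulate a data source that can apply a filter and a projection itself.

    Returns the rows "sent over the network" and the number of values sent.
    """
    sent = []
    for row in rows:
        if predicate is not None and not predicate(row):
            continue  # predicate pushdown: the row never leaves the source
        if columns is not None:
            row = {name: row[name] for name in columns}  # column pruning
        sent.append(row)
    values_sent = sum(len(row) for row in sent)
    return sent, values_sent


def is_warm(row):
    return row["temp"] > 20


# Without pushdown: fetch everything, then filter and project in the engine.
all_rows, cost_naive = scan(ROWS)
naive = [{"city": r["city"]} for r in all_rows if is_warm(r)]

# With pushdown and pruning: the source filters and projects before sending.
pushed, cost_pushed = scan(ROWS, predicate=is_warm, columns=["city"])

print(naive == pushed, cost_naive, cost_pushed)
```

Both approaches return the same result, but in this toy data set the second one transfers 2 values instead of 16.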
The platform supports the standard Spark Dataset and DataFrame APIs in Scala, Java, Python, and R. In addition, it extends and enriches these APIs via the Iguazio Spark connector, which features a custom NoSQL data source that enables reading and writing data in the platform’s NoSQL store using Spark DataFrames — including support for table partitioning, data pruning and filtering (predicate pushdown), performing “replace” mode and conditional updates, defining and updating counter table attributes (columns), and performing optimized range scans. The platform also supports the Spark Streaming API. For more information, see the Spark APIs reference.
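A read-filter-write flow over the NoSQL data source might be sketched as follows. The container and table paths are placeholders, and the data-source name `io.iguaz.v3io.spark.sql.kv` and the `key` write option are assumptions to confirm in the Spark APIs reference. The Spark calls are wrapped in a function that is not invoked here, so only the path helper runs outside a cluster.

```python
def v3io_path(container, table_path):
    """Return a v3io://<container>/<path> URI for a table in a data container."""
    return f"v3io://{container}/{table_path.strip('/')}"


def copy_matching_items(spark, source, target, key, condition):
    """Copy the items of a NoSQL table that match a SQL condition.

    `spark` is an existing SparkSession. The data-source name below is an
    assumption to verify against the Spark APIs reference.
    """
    nosql_format = "io.iguaz.v3io.spark.sql.kv"
    df = spark.read.format(nosql_format).load(source)
    # A simple filter such as "age > 30" is a candidate for predicate pushdown.
    filtered = df.filter(condition)
    filtered.write.format(nosql_format).mode("overwrite").option("key", key).save(target)


source = v3io_path("mycontainer", "/examples/people/")
target = v3io_path("mycontainer", "examples/people_over_30")
print(source, target)
# On a cluster with a Spark service, a call might look like this:
# copy_matching_items(spark, source, target, key="id", condition="age > 30")
```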
You can run Spark jobs on your platform cluster from a Jupyter or Zeppelin web notebook; for details, see Running Spark Jobs from a Web Notebook.
You can also run Spark jobs by executing
spark variable) and stop the session at the end of the flow to release resources (for example, by calling
GPU Support
The platform supports accelerated code execution over NVIDIA graphics processing units (GPUs):
You can run Nuclio serverless functions on GPUs.
You can run Jupyter Notebook code on GPUs: the platform’s Jupyter Notebook service has two flavors that support GPU (see Jupyter Flavors) and features enhanced support for the following GPU libraries:
Horovod [Tech Preview]: you can use Uber’s Horovod distributed deep-learning framework to convert a single-GPU TensorFlow, Keras, or PyTorch model-training program to a distributed program that trains the model simultaneously over multiple GPUs. The objective is to speed up your model training with minimal changes to your existing single-GPU code and without complicating the execution. The platform supports Horovod with the “Jupyter Deep Learning + GPU” flavor of the Jupyter Notebook service.
RAPIDS [Tech Preview]: you can use NVIDIA’s RAPIDS open-source libraries suite to execute end-to-end data science and analytics pipelines entirely on GPUs. The platform supports RAPIDS with the “Jupyter Deep Learning with Rapids” flavor of the Jupyter Notebook service. Note that RAPIDS supports GPUs with the NVIDIA Pascal architecture or better and compute capability 6.0+.
The platform tutorial Jupyter notebooks include a demos/gpu-demos directory with full GPU use-case application demos.
To use Horovod or RAPIDS with version 2.3.1 of the platform, contact Iguazio’s customer-success team before selecting the respective GPU Jupyter flavor and running GPU code; the team will perform the required initialization for these services and provide you with additional guidance.
When configuring the number of GPUs for your Jupyter Notebook service (by setting the common Resources > GPU > Limit service parameter), note that while the service is enabled it monopolizes the allocated GPUs even when they’re not in use.
Data Analytics, Monitoring, and Visualization Tools
There are various tools that allow you to monitor and query your data and produce graphical interactive representations that make it easy to quickly analyze the data and begin discovering new actionable insights in a matter of seconds, with no programming effort.
The Grafana open-source platform for data analytics, monitoring, and visualization is pre-integrated in the platform and available as a user application service.
You can use the Grafana service to define custom Grafana dashboards for monitoring, visualizing, and understanding data stored in the platform, such as time-series metrics and NoSQL data.
This can be done by using the custom iguazio data source, or by using a Prometheus data source for running Prometheus queries on platform TSDB tables.
You can also issue data alerts and create, explore, and share dashboards.
Remote Visualization Tools
All leading BI data visualization tools can be installed remotely and configured to run on top of the data services of the Iguazio Data Science Platform over a Java database connectivity (JDBC) connector. The following images display data visualization using the popular Tableau, QlikView, and Looker visualization tools:
Logging and Monitoring Services
The platform features a default tenant-wide log-forwarder service (“Log Forwarder”) for forwarding application-service logs to an instance of the Elasticsearch open-source search and analytics engine by using the open-source Filebeat log-shipper utility.
In addition, the platform has a default tenant-wide monitoring service (“Monitoring”), which is disabled by default, for monitoring Nuclio serverless functions and gathering performance statistics.
For detailed information about these services and how to use them, as well as information about additional platform monitoring, logging, and debugging tools, see Logging, Monitoring, and Debugging.
The platform includes a web-based command-line shell (“web shell”) service for running application services and performing basic file-system operations from a web browser.
For example, you can use the Presto CLI to run SQL queries on your data; use the TSDB CLI to work with TSDBs; use
The custom web-shell service parameters allow you to optionally associate the service with a Spark service and select a Kubernetes service account that will determine the permissions for using the
- “None” — no permission to use
- “Log Reader” — list pods and view logs.
- “Application Admin” — list pods; view logs; and create or delete secrets and ConfigMaps.
- “Service Admin” — list pods; view logs; create or delete secrets and ConfigMaps; and create, delete, list, or get jobs and cron jobs.
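The service-account options above are cumulative: each level grants everything the previous one does, plus more. A small sketch of that structure follows. The action strings are informal labels chosen for this example, not Kubernetes API names, and the helper functions are hypothetical.

```python
# Permission sets transcribed from the list above; the action strings are
# informal labels, not Kubernetes API names.
LOG_READER = {"list pods", "view logs"}
APPLICATION_ADMIN = LOG_READER | {"manage secrets", "manage ConfigMaps"}
SERVICE_ADMIN = APPLICATION_ADMIN | {"manage jobs", "manage cron jobs"}

SERVICE_ACCOUNTS = {
    "None": set(),
    "Log Reader": LOG_READER,
    "Application Admin": APPLICATION_ADMIN,
    "Service Admin": SERVICE_ADMIN,
}


def is_allowed(service_account, action):
    """Return True if the given service account grants the given action."""
    return action in SERVICE_ACCOUNTS[service_account]


def minimal_account(required_actions):
    """Return the least-privileged account that covers all required actions."""
    for name, granted in SERVICE_ACCOUNTS.items():  # ordered from least to most
        if set(required_actions) <= granted:
            return name
    raise ValueError("no service account grants all of: %s" % sorted(required_actions))


print(is_allowed("Log Reader", "manage secrets"))
print(minimal_account({"view logs", "manage secrets"}))
```

Choosing the least-privileged account that covers the tasks you need is a reasonable default when configuring the service.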
- The web shell isn’t a fully functional Linux shell. See Software Specifications and Restrictions for specific restrictions.
- To log out of the web shell, run the exit command in the shell.
Apache Zeppelin is an open-source web platform for performing interactive data analytics. Zeppelin is pre-installed in the platform and available as a user application service. You can use Zeppelin to create data-driven and interactive Scala, Java, Python, or R applications that run over Apache Spark, as well as run Spark SQL queries and file-system commands. You can also easily visualize the execution output with Zeppelin for quick insights. You can save your Zeppelin notebook, with its code, and share it with members of your organization.
The platform’s Zeppelin service includes a pre-deployed getting-started note (“Iguazio Getting Started Example”) that demonstrates how to use the platform APIs.
The platform provides a self-service and open-source Apache Hadoop framework that makes it easy, fast, and cost-effective to process and analyze vast amounts of data for operational, analytics, and data-engineering needs. You can run Hadoop commands from any platform command-line interface — such as a web-based shell, a Jupyter notebook or terminal, or a Zeppelin notebook — as demonstrated, for example, in the Working with Data Containers tutorial and in the getting-started ingestion examples in the platform’s Jupyter and Zeppelin tutorial notebooks.