Application Services and Tools
In addition to its core data services, the platform comes pre-deployed with essential and useful proprietary and third-party open-source tools and libraries that facilitate the implementation of a full data science workflow, from data collection to production (see Introducing the Platform). Both built-in and integrated tools are exposed to the user as application services that are managed by the platform using Kubernetes. Each application is packaged as a logical unit within a Docker container and is fully orchestrated by Kubernetes, which automates the deployment, scaling, and management of each containerized application. This provides users with the ability and flexibility to run any application anywhere, as part of their operational pipeline.
The application services can be viewed and managed from the dashboard
The platform’s application development ecosystem includes:
- Distributed data frameworks and engines — such as Spark, Presto, Horovod, and Hadoop.
- The Nuclio serverless framework.
- Enhanced support for time-series databases (TSDBs) — including a CLI tool, serverless functions, and integration with Prometheus.
- Jupyter Notebook and Zeppelin interactive web notebooks for development and testing of data science and general data applications.
- A web-based shell service and Jupyter terminals, which provide bash command-line shells for running application services and performing basic file-system operations.
- Integration with popular machine-learning and scientific-computing packages for development of ML and artificial intelligence (AI) applications — such as TensorFlow, Keras, scikit-learn, PyTorch, Pyplot, and NumPy.
- Integration with common Python libraries that enable high-performance Python-based data processing, such as pandas, Dask, and RAPIDS.
- Support for automating and tracking data science tasks and workflows using the MLRun library and Kubeflow Pipelines — including defining, running, and tracking managed, scalable, and portable ML tasks and full workflow pipelines.
- The V3IO Frames open-source unified high-performance DataFrame API library for working with NoSQL, stream, and time-series data in the platform.
- Support for executing code over GPUs.
- Integration with data analytics, monitoring, and visualization tools, including built-in integration with the open-source Grafana metric analytics and monitoring tool and easy integration with commercial business-intelligence (BI) analytics and visualization tools such as Tableau, Looker, and QlikView.
- Logging and monitoring services for monitoring, indexing, and viewing application-service logs — including a log-forwarder service and integration with Elasticsearch.
For basic information about how to manage and create services in the dashboard, see the Platform Fundamentals guide. For detailed service specifications, see the platform’s Support and Certification Matrix.
Following are overviews of the main application services and tools that can be run on the platform.
If you’re looking for a specific service or tool, you can also use these alphabetical links:
Dask | Docker Registry | Elasticsearch | Frames | GPU | Grafana | Hadoop | Hive | Horovod | Jupyter | Kubeflow | Log Forwarder | Looker | MLRun | Monitoring | Nuclio | NumPy | pandas | Pipelines | Presto | Prometheus | Pyplot | PyTorch | QlikView | RAPIDS | scikit-learn | Spark | Tableau | TensorFlow | TSDB CLI | TSDB Nuclio Functions | Web Shell | Zeppelin
Jupyter is a project for development of open-source software, standards, and services for interactive computing across multiple programming languages. The platform comes preinstalled with the JupyterLab web-based user interface, including Jupyter Notebook and JupyterLab Terminals, which are available via a Jupyter Notebook user application service.
Jupyter Notebook is an open-source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text; it’s currently the leading industry tool for data exploration and training. Jupyter Notebook supports integration with all key analytics services, enabling users to perform all stages of the data science flow, from data collection to production, from a single interface using various APIs and tools to concurrently access the same data without having to move the data. Your Jupyter Notebook code can execute Spark jobs (for example, using Spark DataFrames); run SQL queries using Presto; define, deploy, and trigger Nuclio serverless functions; send web-API requests; use pandas and V3IO Frames DataFrames; use the Dask library to scale the use of pandas DataFrames; and more. You can use Conda and pip, which are available as part of the Jupyter Notebook service, to easily install Python packages such as pandas and Dask and machine-learning packages. In addition, you can use Jupyter terminals to execute shell commands, such as file-system and installation commands. As part of the configuration of the platform’s Jupyter Notebook service you select a specific Jupyter flavor and you can optionally define environment variables for the service.
Iguazio provides tutorial Jupyter notebooks with code examples ranging from getting-started examples to full end-to-end demo applications, including detailed documentation.
Start out by reading the introductory
In version 2.5.4 of the platform, you can set the custom
- Jupyter Full Stack: a full version of Jupyter for execution over central processing units (CPUs).
- Jupyter Full Stack with GPU: a full version of Jupyter for execution over graphics processing units (GPUs). This flavor is available only in environments with GPUs and is sometimes referred to in the documentation as the Jupyter “GPU flavor”. For more information about the platform’s GPU support, see the GPU Support section.
Note: When configuring the resources for a Jupyter Notebook service with a GPU flavor, note that for NVIDIA GPUs, a limit of 0 means unlimited GPU allocation (and not 0 GPUs); that is, all available GPUs will be consumed.
Nuclio Serverless Framework
Iguazio’s Nuclio Enterprise Edition serverless-functions framework — a leading open-source project for converging and simplifying data management — is integrated into the platform. Nuclio is a high-performance low-latency framework that supports execution over CPUs or GPUs and a wide array of tools and event triggers, providing users with a complete cloud experience of data services, ML, AI, and serverless functionality — all delivered in a single integrated and self-managed offering at the edge, on-premises (“on-prem”), or in a hosted cloud.
You can use Nuclio functions, for example, to:
- Collect and ingest data into the platform and consume (query) the data on an ongoing basis. Nuclio offers built-in function templates for collecting data from common sources, such as Apache Kafka streams or databases, including examples of data enrichment and data-pipeline processing.
- Run machine-learning models in the serving layer, supporting high throughput on demand and elastic resource allocation.
Nuclio can be easily integrated with Jupyter Notebook, enabling users to develop their entire code (model, feature vector, and application) in Jupyter Notebook and use a single command to deploy the code as a serverless function that runs in the serving layer. For examples, see the platform’s tutorial Jupyter notebooks. For more information about Nuclio, see the platform’s Serverless Functions (Nuclio) introduction.
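As a rough illustration of what such a function can look like, the following minimal sketch follows the Nuclio Python convention of a handler that receives a context and an event. The payload layout (`sensor`, `readings`) and the `fake_context` and `fake_event` stand-ins are hypothetical and exist only so that the handler can be exercised locally, outside the Nuclio runtime.

```python
import json
from types import SimpleNamespace


def handler(context, event):
    """Nuclio-style entry point: parse a JSON event body and return a summary."""
    body = event.body
    # Event bodies can arrive as bytes; decode them before parsing.
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    record = json.loads(body)

    readings = record.get("readings", [])
    context.logger.info("received %d readings" % len(readings))

    summary = {
        "sensor": record.get("sensor"),
        "count": len(readings),
        "mean": sum(readings) / len(readings) if readings else None,
    }
    return json.dumps(summary)


# Local stand-ins for the objects that the runtime passes to the handler.
# These are hypothetical test doubles, not part of Nuclio.
fake_context = SimpleNamespace(logger=SimpleNamespace(info=lambda msg: None))
fake_event = SimpleNamespace(body=b'{"sensor": "s1", "readings": [1.0, 2.0, 3.0]}')

result = json.loads(handler(fake_context, fake_event))
```

Keeping the handler free of notebook-specific state is what makes it practical to develop the code interactively and deploy the same function to the serving layer.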
The platform exposes Nuclio as a default tenant-wide service and allows users to disable and restart the service as well as configure the Docker Registry for storing the Nuclio function images.
Configuring the Docker Registry for the Nuclio Service
By default, the Nuclio service on the default tenant is configured to work with a predefined default tenant-wide “Docker Registry” service, which uses a pre-deployed local on-cluster Docker Registry. However, you can create your own Docker Registry service for working with a remote off-cluster Docker Registry, and change the configuration of the Nuclio service to work with your Docker Registry service and use your registry to store the Nuclio function images.
Machine-Learning and Scientific-Computing Packages
You can easily install ML and scientific-computing packages on your platform cluster (for example, from Jupyter Notebook), such as TensorFlow, Keras, scikit-learn, PyTorch, Pyplot, and NumPy. The platform’s architecture was designed to deploy computation to one or more CPUs or GPUs with a single Python API.
For example, you can install the TensorFlow open-source library for numerical computation using data-flow graphs. You can use TensorFlow to train a logistic regression model for prediction or a deep-learning model, and then deploy the same model in production over the same platform instance as part of your operational pipeline. The data science and training portion can be developed using recent field data, while the development-to-production workflow is automated and time to insights is significantly reduced. All the required functionality is available on a single platform with enterprise-grade security and a fine-grained access policy, providing you with visibility into the data based on the organizational needs of each team. The following Python code sample demonstrates the simplicity of using the platform to train a TensorFlow model and evaluate the quality of the model’s predictions:
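    # Train the model. The model, input_fn, train_data, num_epochs, and
    # batch_size names are defined elsewhere in the training script.
    model.train(
        input_fn=lambda: input_fn(train_data, num_epochs, True, batch_size))

    # Evaluate the trained model with a single pass over the test data.
    results = model.evaluate(
        input_fn=lambda: input_fn(test_data, 1, False, batch_size))

    # Print the evaluation metrics sorted by name.
    for key in sorted(results):
        print('%s: %s' % (key, results[key]))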
Data Science Automation and Tracking
MLRun is Iguazio’s open-source library for automating and tracking data science tasks and full workflows, including integration with Kubeflow Pipelines and the Nuclio serverless framework. The library features a generic and simplified mechanism for helping data scientists and developers describe and run scalable ML and other data science tasks in various runtime environments while automatically tracking and recording execution code, metadata, inputs, and outputs. The capability to track and view current and historical ML experiments along with the metadata that is associated with each experiment is critical for comparing different runs, and eventually helps to determine the best model and configuration for production deployment.
MLRun is runtime and platform independent, providing a flexible and portable development experience. It allows you to develop functions for any data science task from your preferred environment, such as a local IDE or a web notebook; execute and track the execution from the code or using the MLRun CLI; and then integrate your functions into an automated workflow pipeline (such as Kubeflow Pipelines) and execute and track the same code on a larger cluster with scale-out containers or functions.
You can easily install MLRun by running the following command, for example, from a Jupyter notebook (the leading ! applies to notebook cells; omit it when running the command from a terminal):
!pip install git+https://github.com/mlrun/mlrun.git
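The tracking idea described above can be sketched without a running MLRun service. In the following minimal example, the task handler receives a context object and records its parameters and metrics through a `log_result(key, value)` call, which mirrors the way MLRun handlers log results. The `StubContext` class, the `evaluate` handler, and the toy data are hypothetical and are used only to keep the sketch self-contained; a real run would receive its context from MLRun.

```python
class StubContext:
    """Hypothetical stand-in that records results the way a run context would."""

    def __init__(self):
        self.results = {}

    def log_result(self, key, value):
        # Store each logged metric under its key so that runs can be compared.
        self.results[key] = value


def evaluate(context, threshold=0.5):
    """Task handler: score toy predictions and log the outcome to the context."""
    labels = [1, 0, 1, 1, 0]
    scores = [0.9, 0.2, 0.4, 0.8, 0.6]

    # Convert scores to class predictions using the configured threshold.
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))

    context.log_result("threshold", threshold)
    context.log_result("accuracy", correct / len(labels))


ctx = StubContext()
evaluate(ctx, threshold=0.5)
```

Because the handler only talks to the context, the same function can be run locally for debugging and later executed as a tracked task.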
For detailed information and examples, see the
Google Kubeflow Pipelines is an open-source framework for building and deploying portable, scalable ML workflows based on Docker containers.
For detailed information, see the Kubeflow Pipelines documentation.
The platform has a pre-deployed tenant-wide Kubeflow Pipelines service (“Pipelines”) that can be used to create and run ML pipeline experiments.
The pipeline artifacts are stored in a
Time-Series Database (TSDB) Services
Time-series databases (TSDBs) are used for storing time-series data, that is, a series of time-based data points. The platform features enhanced built-in support for working with TSDBs, which includes a rich set of features for efficiently analyzing and storing time-series data. The platform uses the Iguazio V3IO TSDB open-source library, which exposes a high-performance API for working with TSDBs, including creating and deleting TSDB instances (tables) and ingesting and consuming (querying) TSDB data. This API can be consumed in various ways:
- Use the V3IO TSDB command-line interface (CLI) tool (tsdbctl), which is pre-deployed in the platform, to easily create, delete, and manage TSDB instances (tables) in the platform’s data store, ingest metrics into such tables, and issue TSDB queries. The CLI can be run locally on a platform cluster (from a command-line shell interface, such as the web-based shell or a Jupyter terminal or notebook) or remotely from any computer with a network connection to the cluster. The platform’s web shell and Jupyter terminal environments predefine a tsdbctl alias to the native CLI that preconfigures the URL of the web-APIs service and the authentication access key for the running user of the parent shell or Jupyter Notebook service.
- Use the platform’s TSDB Nuclio functions service to generate serverless V3IO TSDB Nuclio functions for ingesting and consuming TSDB data.
- Use the Prometheus service to run TSDB queries. Prometheus is an open-source systems monitoring and alerting toolkit that features a dimensional data model and a flexible query language. The platform’s Prometheus service uses the pre-deployed Iguazio V3IO Prometheus distribution, which packages Prometheus with the V3IO TSDB library for a robust, scalable, and high-performance TSDB solution.
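To make the notion of querying time-based data points more concrete, the following sketch shows a typical time-series operation: downsampling raw samples into fixed-width time buckets and averaging each bucket. It is plain Python and does not use the V3IO TSDB library, tsdbctl, or Prometheus; the `downsample` function and the sample data are hypothetical.

```python
from collections import defaultdict


def downsample(points, step_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        bucket_start = ts - (ts % step_seconds)
        buckets[bucket_start].append(value)
    # Return one (bucket_start, mean) pair per bucket, in time order.
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Timestamps are seconds since an arbitrary origin.
samples = [(0, 10.0), (20, 20.0), (59, 30.0), (60, 40.0), (119, 60.0), (120, 5.0)]
per_minute = downsample(samples, 60)
```

A TSDB performs this kind of aggregation close to the stored data, so a query returns the aggregated series instead of every raw sample.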
pandas and Dask
pandas is an open-source Python library for high-performance data processing using structured DataFrames (“pandas DataFrames”). Dask is a parallel-computing Python library that features scaled pandas DataFrames. You can easily install these tools on the platform (for example, by using pip or Conda, which are pre-deployed as part of the platform’s Jupyter Notebook service) and use them to perform fast Python-based data processing. For more information and examples, see the platform’s tutorial Jupyter notebooks.
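The idea behind scaling a DataFrame computation can be sketched with the standard library alone: split the data into partitions, compute a partial result for each partition in parallel, and combine the partial results. The following example is a conceptual illustration only and does not use the pandas or Dask APIs; the function names and the thread pool are assumptions made for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split a list into up to n_parts contiguous chunks of similar size."""
    size = -(-len(rows) // n_parts)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_stats(chunk):
    """Per-partition work: return the pieces needed to combine a mean."""
    return sum(chunk), len(chunk)


def parallel_mean(rows, n_parts=4):
    """Compute a mean by combining per-partition sums and counts."""
    if not rows:
        raise ValueError("cannot compute the mean of an empty sequence")
    chunks = partition(rows, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = list(range(1, 101))
mean_value = parallel_mean(values, n_parts=4)
```

The mean is combined from sums and counts, not from per-partition means, so the result stays correct when partitions have different sizes.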
Iguazio V3IO Frames is an open-source data-access library that provides a unified high-performance Python DataFrame API for working with NoSQL, stream, and time-series (TSDB) data in the platform’s data store.
For more information and detailed usage instructions, see the Frames API reference.
As indicated in this reference, you can find many examples of using the Frames API in the platform’s tutorial Jupyter notebooks; see specifically the
Presto SQL Engine
Presto is an open-source distributed SQL query engine for running interactive analytic queries. The platform has a pre-deployed tenant-wide Presto service that can be used to run SQL queries and perform high-performance low-latency interactive data analytics. You can ingest data into the platform using your preferred method — such as using Spark, the NoSQL Web API, a Nuclio function, or V3IO Frames — and use Presto to analyze the data interactively with the aid of your preferred visualization tool. Running Presto over the platform’s data services allows you to filter data as close as possible to the source.
You can run SQL commands that use ANSI SQL
The Iguazio Presto connector enables you to use Presto to run queries on data in the platform’s NoSQL store — including support for partitioning, predicate pushdown, and column pruning, which enables users to optimize their queries. You can also use Presto’s built-in Hive connector to query data of the supported file types, such as Parquet or ORC, or to save table-query views to the default Hive schema. Note that to use the Hive connector, you first need to create a Hive Metastore by enabling Hive for the platform’s Presto service. For more information, see Using the Hive Connector in the Presto reference overview.
The platform also has a built-in process that uses Presto SQL to create a Hive view that monitors both real-time data in the platform’s NoSQL store and historical data in Parquet or ORC tables
The platform is integrated with the Apache Spark data engine for large-scale data processing, which is available as a user application service. You can use Spark together with other platform services to run SQL queries, stream data, and perform complex data analysis — both on data that is stored in the platform’s data store and on external data sources such as RDBMSs or traditional Hadoop “data lakes”. The support for Spark is powered by a stack of Spark libraries that include Spark SQL and DataFrames for working with structured data, Spark Streaming for streaming data, MLlib for machine learning, and GraphX for graphs and graph-parallel computation. You can combine these libraries seamlessly in the same application.
Spark is fully optimized when running on top of the platform’s data services, including data filtering as close as possible to the source by implementing predicate pushdown and column-pruning in the processing engine. Predicate pushdown and column pruning can optimize your query, for example, by filtering data before it is transferred over the network, filtering data before loading it into memory, or skipping reading entire files or chunks of files.
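The effect of predicate pushdown and column pruning can be illustrated with a small, engine-independent sketch. The `scan` function below plays the role of a data source that can apply a filter and a column list before it returns rows. The table contents and all names are hypothetical, and the sketch does not use Spark or the platform’s connector.

```python
TABLE = [
    {"id": 1, "city": "Paris", "temp": 21.5, "notes": "x" * 100},
    {"id": 2, "city": "Oslo", "temp": 9.0, "notes": "y" * 100},
    {"id": 3, "city": "Cairo", "temp": 33.2, "notes": "z" * 100},
    {"id": 4, "city": "Lima", "temp": 18.4, "notes": "w" * 100},
]


def scan(table, columns=None, predicate=None):
    """Source-side scan: filter rows and prune columns before returning them."""
    for row in table:
        if predicate is not None and not predicate(row):
            continue  # the row is skipped at the source and never transferred
        if columns is None:
            yield dict(row)
        else:
            yield {c: row[c] for c in columns}


# Without pushdown: every full row leaves the source, then the engine filters.
transferred_naive = list(scan(TABLE))
result_naive = [
    {"id": r["id"], "temp": r["temp"]} for r in transferred_naive if r["temp"] > 20
]

# With pushdown and pruning: only matching rows and needed columns are returned.
transferred_pushed = list(
    scan(TABLE, columns=["id", "temp"], predicate=lambda r: r["temp"] > 20)
)
```

Both approaches produce the same result, but the second one moves two narrow rows instead of four wide ones, which is where the savings in network transfer and memory come from.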
The platform supports the standard Spark Dataset and DataFrame APIs in Scala, Java, Python, and R. In addition, it extends and enriches these APIs via the Iguazio Spark connector, which features a custom NoSQL data source that enables reading and writing data in the platform’s NoSQL store using Spark DataFrames — including support for table partitioning, data pruning and filtering (predicate pushdown), performing “replace” mode and conditional updates, defining and updating counter table attributes (columns), and performing optimized range scans. The platform also supports the Spark Streaming API. For more information, see the Spark APIs reference.
You can run Spark jobs on your platform cluster from a Jupyter or Zeppelin web notebook; for details, see Running Spark Jobs from a Web Notebook.
You can also run Spark jobs by executing
spark variable) and stop the session at the end of the flow to release resources (for example, by calling
GPU Support
The platform supports accelerated code execution over NVIDIA graphics processing units (GPUs):
- You can run Nuclio serverless functions on GPUs.
- You can run GPU applications that use one of the following supported GPU libraries from a platform Jupyter Notebook service with the GPU flavor. You can find full use-case application demos in the demos/gpu directory of the platform’s tutorial Jupyter notebooks; see the overview in the README.ipynb notebook or README.md file in this directory.
  Jupyter GPU Resources Note: In environments with GPUs, you can use the common Resources > GPU > Limit parameter of the Jupyter Notebook service to guarantee the configured number of GPUs for use by each service replica. While the Jupyter Notebook service is enabled, it monopolizes the configured number of GPUs even when the GPUs aren’t in use. RAPIDS applications use the GPUs that were allocated for the Jupyter Notebook service from which the code is executed, while Horovod applications allocate GPUs dynamically and don’t use the GPUs of the parent Jupyter Notebook service. Take this into account when configuring the GPU resources for your Jupyter Notebook service. For example, on systems with limited GPU resources you might need to reduce the number of GPUs allocated to the Jupyter Notebook service, or set it to zero, to successfully run the Horovod code over GPUs.
  - Horovod: the platform has a default tenant-wide “Horovod” service for using Uber’s Horovod distributed deep-learning framework for creating machine-learning modules that are trained simultaneously over multiple GPUs. You can use Horovod to convert a single-GPU TensorFlow, Keras, or PyTorch model-training program to a distributed multi-GPU program. The objective is to speed up your model training with minimal changes to your existing single-GPU code and without complicating the execution. Note that you can also run Horovod code over CPUs with only minor modifications. For sample Horovod GPU and CPU applications, see the demos/gpu/horovod directory of the platform’s Jupyter Notebook tutorials.
    Note:
    - To run Horovod code, ensure that the Horovod platform service is enabled. (This service is enabled by default.)
    - Horovod applications allocate GPUs dynamically from among the available GPUs in the system; they don’t use the GPU resources of the parent Jupyter Notebook service. See also the Jupyter GPU resources note.
  - RAPIDS: you can use NVIDIA’s RAPIDS open-source libraries suite to execute end-to-end data science and analytics pipelines entirely on GPUs. For sample RAPIDS GPU applications, see the demos/gpu/rapids directory of the platform’s Jupyter Notebook tutorials.
    Note: RAPIDS applications use the GPU resource of the parent Jupyter Notebook service. Therefore, you must configure at least one GPU resource for this service: from the dashboard Services page, select to edit your Jupyter Notebook service, select the Common Parameters tab, and set the Resources > GPU > Limit field to a value greater than zero. See also the Jupyter GPU resources note.
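Horovod’s distributed training is data parallel: each worker computes gradients on its own batch, the gradients are averaged across all workers, and every worker applies the same averaged update. The following plain-Python sketch shows only that averaging step. It does not use the Horovod API, it replaces Horovod’s ring-allreduce communication with a simple in-process average, and the function names and numbers are hypothetical.

```python
def allreduce_average(worker_gradients):
    """Average per-worker gradient vectors element by element."""
    n_workers = len(worker_gradients)
    return [sum(parts) / n_workers for parts in zip(*worker_gradients)]


def sgd_step(weights, gradient, learning_rate):
    """Apply one gradient-descent update to the weights."""
    return [w - learning_rate * g for w, g in zip(weights, gradient)]


weights = [1.0, 2.0]

# Each worker computed a gradient from a different batch of the data.
gradients = [
    [0.5, 1.0],  # worker 0
    [1.5, 3.0],  # worker 1
]

averaged = allreduce_average(gradients)
# Every worker applies the same update, so the model replicas stay in sync.
new_weights = sgd_step(weights, averaged, learning_rate=0.5)
```

Because all replicas end each step with identical weights, adding workers increases the amount of data processed per step without changing the structure of the single-GPU training loop.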
Data Analytics, Monitoring, and Visualization Tools
There are various tools that allow you to monitor and query your data and produce graphical interactive representations that make it easy to quickly analyze the data and begin discovering new actionable insights in a matter of seconds, with no programming effort.
The Grafana open-source platform for data analytics, monitoring, and visualization is pre-integrated in the platform and available as a user application service.
You can use the Grafana service to define custom Grafana dashboards for monitoring, visualizing, and understanding data stored in the platform, such as time-series metrics and NoSQL data.
This can be done by using the custom iguazio data source, or by using a Prometheus data source for running Prometheus queries on platform TSDB tables.
You can also issue data alerts and create, explore, and share dashboards.
Remote Visualization Tools
All leading BI data visualization tools can be installed remotely and configured to run on top of the data services of the Iguazio Data Science Platform over a Java database connectivity (JDBC) connector. The following images display data visualization using the popular Tableau, QlikView, and Looker visualization tools:
Logging and Monitoring Services
The platform features a default tenant-wide log-forwarder service (“Log Forwarder”) for forwarding application-service logs to an instance of the Elasticsearch open-source search and analytics engine by using the open-source Filebeat log-shipper utility.
In addition, the platform has a default tenant-wide monitoring service (“Monitoring”), which is disabled by default, for monitoring Nuclio serverless functions and gathering performance statistics.
For detailed information about these services and how to use them, as well as information about additional platform monitoring, logging, and debugging tools, see Logging, Monitoring, and Debugging.
The platform includes a web-based command-line shell (“web shell”) service for running application services and performing basic file-system operations from a web browser.
For example, you can use the Presto CLI to run SQL queries on your data; use the TSDB CLI to work with TSDBs; use
The custom web-shell service parameters allow you to optionally associate the service with a Spark service and select a Kubernetes service account that will determine the permissions for using the
- “None” — no permission to use
- “Log Reader” — list pods and view logs.
- “Application Admin” — list pods; view logs; and create or delete secrets and ConfigMaps.
- “Service Admin” — list pods; view logs; create or delete secrets and ConfigMaps; and create, delete, list, or get jobs and cron jobs.
- The web shell isn’t a fully functional Linux shell. See Software Specifications and Restrictions for specific restrictions.
- To log out of the web shell, run the exit command in the shell.
Apache Zeppelin is an open-source web platform for performing interactive data analytics. Zeppelin is pre-installed in the platform and available as a user application service. You can use Zeppelin to create data-driven and interactive Scala, Java, Python, or R applications that run over Apache Spark, as well as run Spark SQL queries and file-system commands. You can also easily visualize the execution output with Zeppelin for quick insights. You can save your Zeppelin notebook, with its code, and share it with members of your organization.
The platform’s Zeppelin service includes a pre-deployed getting-started note (“Iguazio Getting Started Example”) that demonstrates how to use the platform APIs.
The platform provides a self-service and open-source Apache Hadoop framework that makes it easy, fast, and cost-effective to process and analyze vast amounts of data for operational, analytics, and data-engineering needs. You can run Hadoop commands from any platform command-line interface — such as a web-based shell, a Jupyter notebook or terminal, or a Zeppelin notebook — as demonstrated, for example, in the Working with Data Containers tutorial and in the getting-started ingestion examples in the platform’s Jupyter and Zeppelin tutorial notebooks.