Application Services and Tools

Overview

In addition to its core data services, the platform comes pre-deployed with essential and useful proprietary and third-party open-source tools and libraries that facilitate the implementation of a full data science workflow, from data collection to production (see Introducing the Platform). Both built-in and integrated tools are exposed to the user as application services that are managed by the platform using Kubernetes. Each application is packaged as a logical unit within a Docker container and is fully orchestrated by Kubernetes, which automates the deployment, scaling, and management of each containerized application. This provides users with the ability and flexibility to run any application anywhere, as part of their operational pipeline.

The application services can be viewed and managed from the dashboard Services page using a self-service model. This approach enables users to quickly get started with their development and focus on the business logic without having to spend precious time on deploying, configuring, and managing multiple tools and services. In addition, users can independently install additional software — such as real-time data analytics and visualization tools — and run them on top of the platform services.

The platform’s application development ecosystem includes:

  • Distributed data frameworks and engines — such as Spark, Presto, Horovod, and Hadoop.
  • The Nuclio serverless framework.
  • Enhanced support for time-series databases (TSDBs) — including a CLI tool, serverless functions, and integration with Prometheus.
  • Jupyter Notebook and Zeppelin interactive web notebooks for development and testing of data science and general data applications.
  • A web-based shell service and Jupyter terminals, which provide bash command-line shells for running application services and performing basic file-system operations.
  • Integration with popular machine-learning and scientific-computing packages for development of ML and artificial intelligence (AI) applications — such as TensorFlow, Keras, scikit-learn, PyTorch, Pyplot, and NumPy.
  • Integration with common Python libraries that enable high-performance Python-based data processing — such as pandas, Dask, and RAPIDS.
  • Support for automating and tracking data science tasks and workflows using the MLRun library and Kubeflow Pipelines — including defining, running, and tracking managed, scalable, and portable ML tasks and full workflow pipelines.
  • The V3IO Frames open-source unified high-performance DataFrame API library for working with NoSQL, stream, and time-series data in the platform.
  • Support for executing code over GPUs.
  • Integration with data analytics, monitoring, and visualizations tools — including built-in integration with the open-source Grafana metric analytics and monitoring tool and easy integration with commercial business-intelligence (BI) analytics and visualization tools such as Tableau, Looker, and QlikView.
  • Logging and monitoring services for monitoring, indexing, and viewing application-service logs — including a log-forwarder service and integration with Elasticsearch.

For basic information about how to manage and create services in the dashboard, see the Platform Fundamentals guide. For detailed service specifications, see the platform’s Support and Certification Matrix.

DNS-Configuration Prerequisite
As a prerequisite to using the platform’s application services, you need to configure conditional forwarding for your cluster’s DNS server. For more information and step-by-step instructions, see Configuring the DNS Server.

Following are overviews of the main application services and tools that can be run on the platform.
If you’re looking for a specific service or tool, you can also use these alphabetical links:
Dask | Docker Registry | Elasticsearch | Frames | GPU | Grafana | Hadoop | Hive | Horovod | Jupyter | Kubeflow | Log Forwarder | Looker | MLRun | Monitoring | Nuclio | NumPy | pandas | Pipelines | Presto | Prometheus | Pyplot | PyTorch | QlikView | RAPIDS | scikit-learn | Spark | Tableau | TensorFlow | TSDB CLI | TSDB Nuclio Functions | Web Shell | Zeppelin

Jupyter

Project Jupyter develops open-source software, standards, and services for interactive computing across multiple programming languages. The platform comes preinstalled with the JupyterLab web-based user interface, including Jupyter Notebook and JupyterLab Terminals, which are available via a Jupyter Notebook user application service.

Jupyter Notebook is an open-source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text; it’s currently the leading industry tool for data exploration and training. Jupyter Notebook integrates with all of the platform’s key analytics services, enabling you to perform all stages of the data science flow, from data collection to production, from a single interface, using various APIs and tools to concurrently access the same data without having to move it. Your Jupyter Notebook code can execute Spark jobs (for example, using Spark DataFrames); run SQL queries using Presto; define, deploy, and trigger Nuclio serverless functions; send web-API requests; use pandas and V3IO Frames DataFrames; use the Dask library to scale the use of pandas DataFrames; and more. You can use Conda and pip, which are available as part of the Jupyter Notebook service, to easily install Python packages such as pandas, Dask, and machine-learning packages, as shown in the example below. In addition, you can use Jupyter terminals to execute shell commands, such as file-system and installation commands. As part of the configuration of the platform’s Jupyter Notebook service, you select a specific Jupyter flavor, and you can optionally define environment variables for the service.
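
For example, the following notebook-cell commands use pip and Conda to install packages; the package selection is illustrative:

!pip install pandas dask
!conda install -y -c conda-forge scikit-learn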

Iguazio provides tutorial Jupyter notebooks with code examples ranging from getting-started examples to full end-to-end demo applications, including detailed documentation. Start out by reading the introductory welcome.ipynb notebook (available also as a Markdown README.md file).

Jupyter Flavors

In version 2.5.4 of the platform, you can set the custom Flavor parameter of the Jupyter Notebook service to one of the following flavors to install a matching Jupyter Docker image:

Jupyter Full Stack
A full version of Jupyter for execution over central processing units (CPUs).
Jupyter Full Stack with GPU
A full version of Jupyter for execution over graphics processing units (GPUs). This flavor is available only in environments with GPUs and is sometimes referred to in the documentation as the Jupyter “GPU flavor”. For more information about the platform’s GPU support, see the GPU Support section.

Note
When configuring the resources for a Jupyter Notebook service with a GPU flavor, note that for NVIDIA GPUs, a limit of 0 means unlimited GPU allocation (and not 0 GPUs) — i.e., all available GPUs will be consumed.

Nuclio Serverless Framework

Iguazio’s Nuclio Enterprise Edition serverless-functions framework — a leading open-source project for converging and simplifying data management — is integrated into the platform. Nuclio is a high-performance low-latency framework that supports execution over CPUs or GPUs and a wide array of tools and event triggers, providing users with a complete cloud experience of data services, ML, AI, and serverless functionality — all delivered in a single integrated and self-managed offering at the edge, on-premises (“on-prem”), or in a hosted cloud.

You can use Nuclio functions, for example, to

  • Collect and ingest data into the platform and consume (query) the data on an ongoing basis. Nuclio offers built-in function templates for collecting data from common sources, such as Apache Kafka streams or databases, including examples of data enrichment and data-pipeline processing.

  • Run machine-learning models in the serving layer, supporting high throughput on demand and elastic resource allocation.

Nuclio can be easily integrated with Jupyter Notebook, enabling users to develop their entire code (model, feature vector, and application) in Jupyter Notebook and use a single command to deploy the code as a serverless function that runs in the serving layer. For examples, see the platform’s tutorial Jupyter notebooks. For more information about Nuclio, see the platform’s Serverless Functions (Nuclio) introduction.
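
For reference, a minimal Nuclio Python handler looks like this; the function receives a context object (logging and shared state) and an event object (the trigger payload):

def handler(context, event):
    # context provides logging and shared state; event carries the trigger payload
    context.logger.info('Processing an event')
    return 'Hello from Nuclio'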

The platform exposes Nuclio as a default tenant-wide service and allows users to disable and restart the service as well as configure the Docker Registry for storing the Nuclio function images.

Configuring the Docker Registry for the Nuclio Service

By default, the Nuclio service on the default tenant is configured to work with a predefined default tenant-wide “Docker Registry” service, which uses a pre-deployed local on-cluster Docker Registry. However, you can create your own Docker Registry service for working with a remote off-cluster Docker Registry, and change the configuration of the Nuclio service to work with your Docker Registry service and use your registry to store the Nuclio function images.

Machine-Learning and Scientific-Computing Packages

You can easily install ML and scientific-computing packages — such as TensorFlow, Keras, scikit-learn, PyTorch, Pyplot, and NumPy — on your platform cluster (for example, from Jupyter Notebook). The platform’s architecture was designed to deploy computation to one or more CPUs or GPUs with a single Python API.

For example, you can install the TensorFlow open-source library for numerical computation using data-flow graphs. You can use TensorFlow to train a logistic regression model for prediction or a deep-learning model, and then deploy the same model in production over the same platform instance as part of your operational pipeline. The data science and training portion can be developed using recent field data, while the development-to-production workflow is automated and time to insights is significantly reduced. All the required functionality is available on a single platform with enterprise-grade security and a fine-grained access policy, providing you with visibility into the data based on the organizational needs of each team. The following Python code sample demonstrates the simplicity of using the platform to train a TensorFlow model and evaluate the quality of the model’s predictions:

# Assumes a pre-built tf.estimator model and an input_fn data feeder,
# defined earlier (omitted here), as well as train/test data sets and a batch size
model.train(
    input_fn=lambda: input_fn(train_data, num_epochs, True, batch_size))
# Evaluate prediction quality on the test data (one epoch, no shuffling)
results = model.evaluate(input_fn=lambda: input_fn(
    test_data, 1, False, batch_size))
# Print the evaluation metrics
for key in sorted(results):
    print('%s: %s' % (key, results[key]))

The image-classification demo application in the platform’s tutorial Jupyter notebook demonstrates how to build and train an image recognition and classification ML model by using Keras, TensorFlow, and scikit-learn.

Data Science Automation and Tracking

MLRun

MLRun is Iguazio’s open-source library for automating and tracking data science tasks and full workflows, including integration with Kubeflow Pipelines and the Nuclio serverless framework. The library features a generic and simplified mechanism for helping data scientists and developers describe and run scalable ML and other data science tasks in various runtime environments while automatically tracking and recording execution code, metadata, inputs, and outputs. The capability to track and view current and historical ML experiments along with the metadata that is associated with each experiment is critical for comparing different runs, and eventually helps to determine the best model and configuration for production deployment.

MLRun is runtime and platform independent, providing a flexible and portable development experience. It allows you to develop functions for any data science task from your preferred environment, such as a local IDE or a web notebook; execute and track the execution from the code or using the MLRun CLI; and then integrate your functions into an automated workflow pipeline (such as Kubeflow Pipelines) and execute and track the same code on a larger cluster with scale-out containers or functions.

You can easily install MLRun by running the following command (for example, from a Jupyter notebook or terminal):

!pip install git+https://github.com/mlrun/mlrun.git
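
Once MLRun is installed, you can define and track a simple local run. The following is a hedged sketch; the new_task and run_local helpers reflect one version of the MLRun Python API, and their names and signatures may differ in your installation:

from mlrun import new_task, run_local  # assumed API; verify against your MLRun version

def handler(context, p1=1):
    # log a result metric that MLRun records and tracks with the run
    context.log_result('accuracy', p1 * 2)

task = new_task(name='demo-task', params={'p1': 5})
run = run_local(task, handler=handler)  # execute locally with full run tracking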

For detailed information and examples, see the README.md file and examples in the MLRun GitHub repository. You can also find platform MLRun examples in the experiment tracking directory of the platform tutorial Jupyter notebooks.

Kubeflow Pipelines

Google Kubeflow Pipelines is an open-source framework for building and deploying portable, scalable ML workflows based on Docker containers. For detailed information, see the Kubeflow Pipelines documentation. The platform has a pre-deployed tenant-wide Kubeflow Pipelines service (“Pipelines”) that can be used to create and run ML pipeline experiments. The pipeline artifacts are stored in a pipelines directory in the “users” data container, and pipeline metadata is stored in an mlpipeline directory in the same container. The pipelines dashboard can be accessed by selecting the Pipelines option in the platform dashboard’s navigation side menu.
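
As a hedged illustration, the following sketch uses the kfp SDK to define a one-step pipeline and compile it into an archive that can be uploaded to the pipelines dashboard; the container image and command are placeholders:

import kfp.compiler
import kfp.dsl as dsl

@dsl.pipeline(name='demo-pipeline', description='A minimal one-step example')
def demo_pipeline(message='hello'):
    # A single step that runs a container and echoes the message parameter
    dsl.ContainerOp(name='echo', image='alpine:3.11', command=['echo', message])

# Compile the pipeline into an archive for upload via the Pipelines UI or SDK client
kfp.compiler.Compiler().compile(demo_pipeline, 'demo_pipeline.tar.gz')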

Time-Series Database (TSDB) Services

Time-series databases (TSDBs) are used for storing time-series data — a series of time-based data points. The platform features enhanced built-in support for working with TSDBs, including a rich set of features for efficiently storing and analyzing time-series data. The platform uses the Iguazio V3IO TSDB open-source library, which exposes a high-performance API for working with TSDBs — including creating and deleting TSDB instances (tables) and ingesting and consuming (querying) TSDB data. This API can be consumed in various ways:

  • Use the V3IO TSDB command-line interface (CLI) tool (tsdbctl), which is pre-deployed in the platform, to easily create, delete, and manage TSDB instances (tables) in the platform’s data store, ingest metrics into such tables, and issue TSDB queries; see the CLI sketch after this list. The CLI can be run locally on a platform cluster — from a command-line shell interface, such as the web-based shell or a Jupyter terminal or notebook — or remotely from any computer with a network connection to the cluster. The platform’s web-shell and Jupyter terminal environments predefine a tsdbctl alias to the native CLI, which preconfigures the URL of the web-APIs service and the authentication access key for the running user of the parent shell or Jupyter Notebook service.
  • Use the platform’s TSDB Nuclio functions service to generate serverless V3IO TSDB Nuclio functions for ingesting and consuming TSDB data.
  • Use the platform’s Prometheus service to run TSDB queries. Prometheus is an open-source systems-monitoring and alerting toolkit that features a dimensional data model and a flexible query language. The platform’s Prometheus service uses the pre-deployed Iguazio V3IO Prometheus distribution, which packages Prometheus with the V3IO TSDB library for a robust, scalable, and high-performance TSDB solution.
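
For example, the following commands create a TSDB table and query it from a Jupyter cell (drop the leading ! when running from a web shell or terminal). The table name is a placeholder, and the exact flags are assumptions that can vary between tsdbctl versions; run tsdbctl help to verify:

!tsdbctl create -t mytsdb -r 1/s
!tsdbctl query -t mytsdb cpu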

For more information and examples, see the TSDB tutorials and guides and the TSDB section in the frames.ipynb tutorial Jupyter notebook.

pandas and Dask

pandas is an open-source Python library for high-performance data processing using structured DataFrames (“pandas DataFrames”). Dask is a parallel-computing Python library that features scaled pandas DataFrames. You can easily install these tools on the platform — for example, by using pip or Conda, which are pre-deployed as part of the platform’s Jupyter Notebook service — and use them to perform fast Python based data processing. For more information and examples, see the platform’s tutorial Jupyter notebooks.
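
As a small illustration, the following sketch scales a pandas DataFrame with Dask:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'key': list('abcabc'), 'value': range(6)})
ddf = dd.from_pandas(pdf, npartitions=2)          # split the DataFrame into partitions
print(ddf.groupby('key').value.mean().compute())  # build a lazy task graph, then execute it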

V3IO Frames [Tech Preview]

Iguazio V3IO Frames is an open-source data-access library that provides a unified high-performance Python DataFrame API for working with NoSQL, stream, and time-series (TSDB) data in the platform’s data store. For more information and detailed usage instructions, see the Frames API reference. As indicated in this reference, you can find many examples of using the Frames API in the platform’s tutorial Jupyter notebooks; see specifically the frames.ipynb notebook. See also the Frames restrictions in the Software Specifications and Restrictions documentation.
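
For instance, the following hedged sketch writes a pandas DataFrame to a platform NoSQL (kv) table and reads it back; the framesd address, container, and table path are illustrative assumptions:

import pandas as pd
import v3io_frames as v3f

# Connect to the Frames service (the address and container are environment-specific)
client = v3f.Client('framesd:8081', container='users')

df = pd.DataFrame({'color': ['blue', 'red']}, index=['item1', 'item2'])
client.write('kv', table='examples/mytable', dfs=df)  # the index becomes the item key
rdf = client.read('kv', table='examples/mytable')     # read the table back into a DataFrame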

Presto SQL Engine

Presto is an open-source distributed SQL query engine for running interactive analytic queries. The platform has a pre-deployed tenant-wide Presto service that can be used to run SQL queries and perform high-performance low-latency interactive data analytics. You can ingest data into the platform using your preferred method — such as using Spark, the NoSQL Web API, a Nuclio function, or V3IO Frames — and use Presto to analyze the data interactively with the aid of your preferred visualization tool. Running Presto over the platform’s data services allows you to filter data as close as possible to the source.

You can run SQL commands that use ANSI SQL SELECT statements, which will be executed using Presto, from Jupyter Notebook, a serverless Nuclio function, or a local or remote Presto client. The platform comes pre-deployed with the native Presto CLI client (presto-cli), a convenience wrapper to this CLI that preconfigures some options for local execution (presto), and the Presto web interface (UI) — which you can log into from the dashboard’s Services page. You can also integrate the platform’s Presto service with a remote Presto client — such as Tableau or QlikView — to remotely query and analyze data in the platform over a Java database connectivity (JDBC) connector.
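
For example, the following notebook cell uses the preconfigured presto wrapper to run a query. The v3io.<container>."<table-path>" naming follows the Iguazio Presto connector convention, and the table path is a placeholder:

!presto --execute 'SELECT * FROM v3io.users."examples/mytable" LIMIT 10'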

The Iguazio Presto connector enables you to use Presto to run queries on data in the platform’s NoSQL store — including support for partitioning, predicate pushdown, and column pruning, which enables users to optimize their queries. You can also use Presto’s built-in Hive connector to query data of the supported file types, such as Parquet or ORC, or to save table-query views to the default Hive schema. Note that to use the Hive connector, you first need to create a Hive Metastore by enabling Hive for the platform’s Presto service. For more information, see Using the Hive Connector in the Presto reference overview.

The platform also has a built-in process that uses Presto SQL to create a Hive view that monitors both real-time data in the platform’s NoSQL store and historical data in Parquet or ORC tables [Tech Preview].

For more information about using Presto in the platform, see the Presto Reference. See also the Presto and Hive restrictions in the Software Specifications and Restrictions documentation.

Spark

The platform is integrated with the Apache Spark data engine for large-scale data processing, which is available as a user application service. You can use Spark together with other platform services to run SQL queries, stream data, and perform complex data analysis — both on data that is stored in the platform’s data store and on external data sources such as RDBMSs or traditional Hadoop “data lakes”. The support for Spark is powered by a stack of Spark libraries that include Spark SQL and DataFrames for working with structured data, Spark Streaming for streaming data, MLlib for machine learning, and GraphX for graphs and graph-parallel computation. You can combine these libraries seamlessly in the same application.

The Apache Spark stack

Spark is fully optimized when running on top of the platform’s data services, including data filtering as close as possible to the source by implementing predicate pushdown and column-pruning in the processing engine. Predicate pushdown and column pruning can optimize your query, for example, by filtering data before it is transferred over the network, filtering data before loading it into memory, or skipping reading entire files or chunks of files.

The platform supports the standard Spark Dataset and DataFrame APIs in Scala, Java, Python, and R. In addition, it extends and enriches these APIs via the Iguazio Spark connector, which features a custom NoSQL data source that enables reading and writing data in the platform’s NoSQL store using Spark DataFrames — including support for table partitioning, data pruning and filtering (predicate pushdown), performing “replace” mode and conditional updates, defining and updating counter table attributes (columns), and performing optimized range scans. The platform also supports the Spark Streaming API. For more information, see the Spark APIs reference.

You can run Spark jobs on your platform cluster from a Jupyter or Zeppelin web notebook; for details, see Running Spark Jobs from a Web Notebook. You can also run Spark jobs by executing spark-submit from a web-based shell or Jupyter terminal or notebook; for details, see Running Spark Jobs with spark-submit. You can find many examples of using Spark in the platform’s Jupyter and Zeppelin tutorial notebooks, getting-started tutorials, and Spark APIs reference. See also the Spark restrictions in the Software Specifications and Restrictions documentation.

Note
It’s good practice to create a Spark session at the start of the execution flow (for example, by calling SparkSession.builder and assigning the result to a spark variable) and stop the session at the end of the flow to release resources (for example, by calling spark.stop()).
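
The following sketch illustrates this practice together with the Iguazio Spark connector’s NoSQL data source; the container and table path are placeholders:

from pyspark.sql import SparkSession

# Create a Spark session at the start of the execution flow
spark = SparkSession.builder.appName('nosql-demo').getOrCreate()

# Read a platform NoSQL table via the connector's custom data source
df = spark.read.format('io.iguaz.v3io.spark.sql.kv') \
    .load('v3io://users/examples/mytable')
df.show()

# Stop the session at the end of the flow to release its resources
spark.stop()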

GPU Support

The platform supports accelerated code execution over NVIDIA graphics processing units (GPUs):

  • You can run Nuclio serverless functions on GPUs.

  • You can run GPU applications that use one of the following supported GPU libraries from a platform Jupyter Notebook service with the GPU flavor. You can find full use-case application demos in the demos/gpu directory of the platform’s tutorial Jupyter notebooks; see the overview in the README.ipynb notebook or README.md file in this directory.

    • Horovod — the platform has a default tenant-wide “Horovod” service for using Uber’s Horovod distributed deep-learning framework to create machine-learning models that are trained simultaneously over multiple GPUs. You can use Horovod to convert a single-GPU TensorFlow, Keras, or PyTorch model-training program into a distributed multi-GPU program. The objective is to speed up your model training with minimal changes to your existing single-GPU code and without complicating the execution. Note that you can also run Horovod code over CPUs with just minor modifications. For sample Horovod GPU and CPU applications, see the demos/gpu/horovod directory of the platform’s Jupyter Notebook tutorials. A minimal conversion sketch follows the GPU-resources note at the end of this list.

      Note
      • To run Horovod code, ensure that the Horovod platform service is enabled. (This service is enabled by default.)
      • Horovod applications allocate GPUs dynamically from among the available GPUs in the system; they don’t use the GPU resources of the parent Jupyter Notebook service. See also the Jupyter GPU resources note.

    • RAPIDS — you can use NVIDIA’s RAPIDS open-source libraries suite to execute end-to-end data science and analytics pipelines entirely on GPUs. For sample RAPIDS GPU applications, see the demos/gpu/rapids directory of the platform’s Jupyter Notebook tutorials.

      Note
      • RAPIDS supports GPUs with the NVIDIA Pascal architecture or better and compute capability 6.0+.

      • RAPIDS applications use the GPU resource of the parent Jupyter Notebook service. Therefore, you must configure at least one GPU resource for this service: from the dashboard Services page, select to edit your Jupyter Notebook service, select the Common Parameters tab, and set the Resources > GPU > Limit field to a value greater than zero. See also the Jupyter GPU resources note.

    Jupyter GPU Resources Note
    In environments with GPUs, you can use the common Resources > GPU > Limit parameter of the Jupyter Notebook service to guarantee the configured number of GPUs for each service replica. While the Jupyter Notebook service is enabled, it monopolizes the configured number of GPUs even when they aren’t in use. RAPIDS applications use the GPUs that were allocated to the Jupyter Notebook service from which the code is executed, while Horovod applications allocate GPUs dynamically and don’t use the GPUs of the parent Jupyter Notebook service. Take this into account when configuring the GPU resources for your Jupyter Notebook service. For example, on systems with limited GPU resources, you might need to reduce the number of GPUs allocated to the Jupyter Notebook service or set it to zero to successfully run Horovod code over GPUs.
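
The following is a minimal hedged sketch of converting single-GPU Keras training to a distributed Horovod program; the model and data definitions are omitted placeholders:

import keras
import horovod.keras as hvd

hvd.init()                                            # initialize Horovod across the workers
opt = keras.optimizers.Adadelta(lr=1.0 * hvd.size())  # scale the learning rate by the worker count
opt = hvd.DistributedOptimizer(opt)                   # wrap the optimizer for gradient allreduce
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights from rank 0
# model.compile(optimizer=opt, loss=..., metrics=...)
# model.fit(..., callbacks=callbacks)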

Data Analytics, Monitoring, and Visualization Tools

Various tools allow you to monitor and query your data and to produce interactive graphical representations, making it easy to quickly analyze the data and begin discovering actionable insights in a matter of seconds, with no programming effort.

Grafana

The Grafana open-source platform for data analytics, monitoring, and visualization is pre-integrated in the platform and available as a user application service. You can use the Grafana service to define custom Grafana dashboards for monitoring, visualizing, and understanding data stored in the platform, such as time-series metrics and NoSQL data. This can be done by using the custom iguazio data source, or by using a Prometheus data source for running Prometheus queries on platform TSDB tables. You can also issue data alerts and create, explore, and share dashboards.

Note
In cloud platform environments, Grafana is currently available as a shared single-instance tenant-wide service.

For more information about the Grafana service, see the Adding a Custom Grafana Dashboard tutorial. See also the Grafana restrictions in the Software Specifications and Restrictions documentation.

Remote Visualization Tools

All leading BI data visualization tools can be installed remotely and configured to run on top of the data services of the Iguazio Data Science Platform over a Java database connectivity (JDBC) connector. The following images display data visualization using the popular Tableau, QlikView, and Looker visualization tools:

Tableau

Data visualization in Tableau

QlikView

A QlikView dashboard analyzing social-media buzz

Looker

A Looker Pulse dashboard displaying snowplow traffic data

Note
Other integrated services might also contain data-visualization tools.

Logging and Monitoring Services

The platform features a default tenant-wide log-forwarder service (“Log Forwarder”) for forwarding application-service logs to an instance of the Elasticsearch open-source search and analytics engine by using the open-source Filebeat log-shipper utility.

In addition, the platform has a default tenant-wide monitoring service (“Monitoring”), which is disabled by default, for monitoring Nuclio serverless functions and gathering performance statistics.

For detailed information about these services and how to use them, as well as information about additional platform monitoring, logging, and debugging tools, see Logging, Monitoring, and Debugging.

Web Shell

The platform includes a web-based command-line shell (“web shell”) service for running application-service CLIs and performing basic file-system operations from a web browser. For example, you can use the Presto CLI to run SQL queries on your data; use the TSDB CLI to work with TSDBs; use spark-submit to run Spark jobs; run local and Hadoop file-system commands; or use the kubectl CLI to run commands against the platform’s application clusters.

The custom web-shell service parameters allow you to optionally associate the service with a Spark service and select a Kubernetes service account that will determine the permissions for using the kubectl CLI from the web shell. Following is a list of the kubectl operations that can be performed with each service account:

  • “None” — no permission to use kubectl.
  • “Log Reader” — list pods and view logs.
  • “Application Admin” — list pods; view logs; and create or delete secrets and ConfigMaps.
  • “Service Admin” — list pods; view logs; create or delete secrets and ConfigMaps; and create, delete, list, or get jobs and cron jobs.

Zeppelin

Apache Zeppelin is an open-source web platform for performing interactive data analytics. Zeppelin is pre-installed in the platform and available as a user application service. You can use Zeppelin to create data-driven and interactive Scala, Java, Python, or R applications that run over Apache Spark, as well as to run Spark SQL queries and file-system commands. You can also easily visualize the execution output with Zeppelin for quick insights. You can save your Zeppelin notebook, with its code, and share it with members of your organization.

The platform’s Zeppelin service includes a pre-deployed getting-started note (“Iguazio Getting Started Example”) that demonstrates how to use the platform APIs.

Note
See the Zeppelin notes in the Spark-APIs reference for more information about running Spark code from Zeppelin, and see the Software Specifications and Restrictions for general Zeppelin service restrictions in the platform.

Hadoop

The platform provides a self-service and open-source Apache Hadoop framework that makes it easy, fast, and cost-effective to process and analyze vast amounts of data for operational, analytics, and data-engineering needs. You can run Hadoop commands from any platform command-line interface — such as a web-based shell, a Jupyter notebook or terminal, or a Zeppelin notebook — as demonstrated, for example, in the Working with Data Containers tutorial and in the getting-started ingestion examples in the platform’s Jupyter and Zeppelin tutorial notebooks.
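
For example, the following notebook-cell commands use the Hadoop FileSystem CLI with the platform’s v3io:// data-path scheme; the container and paths are placeholders:

!hadoop fs -ls v3io://users/examples
!hadoop fs -copyFromLocal /tmp/local-file.csv v3io://users/examples/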

See Also