Ingesting and Preparing Data

Learn about different methods for ingesting data into the Iguazio Data Science Platform, analyzing the data, and preparing it for the next step in your data pipeline.

Note

A version of this notebook is also available as part of the pre-deployed platform tutorial Jupyter notebooks — see the data-ingestion-and-preparation/README.ipynb notebook.

Overview

The Iguazio Data Science Platform (“the platform”) allows storing data in any format. The platform’s multi-model data layer and related APIs provide enhanced support for working with NoSQL (“key-value”), time-series, and stream data. Various steps of the data science life cycle (pipeline) might require different tools and frameworks for working with data, especially when it comes to the different mechanisms required during the research and development phase versus the operational production phase. The platform features a wide set of methods for manipulating and managing data, of different formats, in each step of the data life cycle, using a variety of frameworks, tools, and APIs — such as Spark SQL and DataFrames, Spark Streaming, Presto SQL queries, pandas DataFrames, Dask, the V3IO Frames Python library, and web APIs.

This tutorial provides an overview of various methods for collecting, storing, and manipulating data in the platform, and refers to sample tutorial notebooks that demonstrate how to use these methods.
For an in-depth overview of the platform and how it can be used to implement a full data science workflow, see In-Depth Platform Overview. For full end-to-end platform use-case application demos, see the demos tutorial notebooks directory.

[Figure: pipeline-diagram]

Platform Data Containers

Data is stored within data containers in the platform’s distributed file system (DFS), which serves as the platform’s data store. There are two predefined containers — the default “bigdata” container and the “users” container — and you can also create additional custom containers. For detailed information about the platform data containers and how to reference the contained data, see the Platform Fundamentals tutorial.

Basic Flow

The basic-data-ingestion-and-preparation tutorial walks you through basic scenarios of ingesting data from external sources into the platform’s data store and manipulating the data using different data formats. The tutorial includes an example of ingesting a CSV file from an AWS S3 bucket; converting it into a NoSQL table using Spark DataFrames; running SQL queries on the table; and converting the table into a Parquet file.

Reading Data from External Databases

You can use different methods to read data from external databases into the platform’s data store, such as Spark over JDBC or SQLAlchemy.

Using Spark over JDBC

Spark SQL includes a data source that can read data from other databases using Java database connectivity (JDBC). The results are returned as a Spark DataFrame that can easily be processed using Spark SQL, or joined with other data sources. The spark-jdbc tutorial includes several examples of using Spark JDBC to ingest data from various databases — such as MySQL, Oracle, and PostgreSQL.
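
As a minimal sketch, a Spark-over-JDBC read from a notebook might look like the following. The MySQL host, database, credentials, and table name are placeholder assumptions, and the MySQL JDBC driver is assumed to be available to the Spark service:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-ingest").getOrCreate()

# Placeholder MySQL connection details; replace with your own database settings
jdbc_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/mydb")
    .option("dbtable", "transactions")
    .option("user", "db_user")
    .option("password", "db_password")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load())

# The result is a regular Spark DataFrame that can be analyzed or joined with other data
jdbc_df.show(5)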

Using SQLAlchemy

The read-external-db tutorial outlines how to ingest data using SQLAlchemy — a Python SQL toolkit and Object Relational Mapper, which gives application developers the full power and flexibility of SQL — and then use Python DataFrames to work on the ingested data set.
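
As a minimal sketch, the following code uses SQLAlchemy together with pandas to pull a table from an external database into a DataFrame. The PostgreSQL connection URL, the table name, and the availability of a suitable database driver (such as psycopg2) are assumptions:

import pandas as pd
import sqlalchemy as db

# Placeholder connection URL; replace with your own database, credentials, and driver
engine = db.create_engine("postgresql://db_user:db_password@db-host:5432/mydb")

# Read the query results into a pandas DataFrame and continue working on them in Python
df = pd.read_sql_query("SELECT * FROM transactions LIMIT 1000", engine)
print(df.head())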

Working with Spark

The platform has a default pre-deployed Spark service that enables ingesting, analyzing, and manipulating data using different Spark APIs:

  • Using Spark SQL and DataFrames
  • Using the Spark Streaming API — see Using Spark Streaming under “Working with Streams”.

Using Spark SQL and DataFrames

Spark lets you write and query structured data inside Spark programs by using either SQL or a familiar DataFrame API. DataFrames and SQL provide a common way to access a variety of data sources. You can use the Spark SQL and DataFrames API to ingest data into the platform, for both batch and micro-batch processing, and analyze and manipulate large data sets, in a distributed manner.

The platform’s custom NoSQL Spark DataFrame implements the Spark data-source API to support a custom data source that enables reading and writing data in the platform’s NoSQL store using Spark DataFrames, including enhanced features such as data pruning and filtering (predicate push down); queries are passed to the platform’s data store, which returns only the relevant data. This allows accelerated and high-speed access from Spark to data stored in the platform.
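
For example, the following sketch writes a Spark DataFrame to a NoSQL table and reads it back through this data source. The table path in the “users” container and the key column are hypothetical, and the example assumes the platform’s documented io.iguaz.v3io.spark.sql.kv data-source name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nosql-ingest").getOrCreate()

df = spark.createDataFrame([("AAPL", 172.5), ("GOOG", 135.2)], ["symbol", "price"])

# Write to a hypothetical NoSQL table; "key" sets the table's primary-key attribute
(df.write
    .format("io.iguaz.v3io.spark.sql.kv")
    .option("key", "symbol")
    .mode("overwrite")
    .save("v3io://users/iguazio/examples/stocks_kv/"))

# Read the table back; filters such as this one are pushed down to the data store
stocks_df = (spark.read
    .format("io.iguaz.v3io.spark.sql.kv")
    .load("v3io://users/iguazio/examples/stocks_kv/"))
stocks_df.filter("price > 150").show()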

The spark-sql-analytics tutorial demonstrates how to use Spark SQL and DataFrames to access objects, tables, and unstructured data that persists in the platform’s data store.

For more information and examples of data ingestion with Spark DataFrames, see Getting Started with Data Ingestion Using Spark.
For more about running SQL queries with Spark, see Running Spark SQL Queries under “Running SQL Queries on Platform Data”.

Working with Streams

The platform supports various methods for working with data streams, including the following:

Using Nuclio to Get Data from Common Streaming Engines

The platform has a default pre-deployed Nuclio service that uses Iguazio’s Nuclio serverless framework, which provides a mechanism for analyzing and processing real-time events from various streaming engines. Nuclio currently supports the following streaming frameworks — Kafka, Kinesis, Azure Event Hubs, platform streams (a.k.a. V3IO streams), RabbitMQ, and MQTT.

Using Nuclio functions to retrieve and analyze streaming data in real time is a very common practice when building a real-time data pipeline. You can stream any type of data — such as telemetry (NetOps) metrics, financial transactions, web clicks, or sensor data — in any format, including images and videos. You can also implement your own logic within the Nuclio function to manipulate or enrich the consumed stream data and prepare it for the next step in the pipeline.
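
As a minimal sketch, a Python Nuclio function that consumes records from a stream trigger might look like the following. The enrichment logic and the JSON record format are hypothetical; in practice the function is deployed with a stream trigger (for example, a Kafka or V3IO-stream trigger) configured for it:

import json

def handler(context, event):
    # event.body holds the raw record data delivered by the stream trigger
    record = json.loads(event.body)

    # Hypothetical enrichment: tag each record before passing it on
    record["enriched"] = True
    context.logger.info(f"Processed record: {record}")

    # Return the enriched record for the next step in the pipeline
    return json.dumps(record)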

Nuclio serverless functions can sustain high workloads with very low latencies, making them very useful for building event-driven applications with strict latency requirements.

For more information about Nuclio, see the platform’s serverless introduction.

Using the Platform’s Streaming Engine

The platform features a custom streaming engine and a related stream format — a platform stream (a.k.a. V3IO stream). You can use the platform’s streaming engine to write data into a queue in a real-time data pipeline, or as a standard streaming engine (similar to Kafka and Kinesis), so you don’t need to use an external engine.

The platform’s streaming engine is currently available via the platform’s Streaming Web API.
In addition, the platform’s Spark-Streaming Integration API enables using the Spark Streaming API to work with platform streams, as explained in the next section (Using Spark Streaming).
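
As a rough sketch, records can be written to a platform stream from Python by sending a PutRecords request to the Streaming Web API. The web-API address, container, stream path, and the use of the V3IO_ACCESS_KEY environment variable for authentication are assumptions that should be adjusted to your environment:

import base64
import json
import os

import requests

# Assumed in-cluster web-API endpoint and a hypothetical stream path in the "users" container
WEBAPI_URL = "http://v3io-webapi:8081"
STREAM_PATH = "users/iguazio/examples/mystream/"

# Record data is sent base64-encoded in the request body
payload = {"Records": [{"Data": base64.b64encode(b'{"symbol": "AAPL", "price": 172.5}').decode()}]}

response = requests.post(
    f"{WEBAPI_URL}/{STREAM_PATH}",
    headers={
        "Content-Type": "application/json",
        "X-v3io-function": "PutRecords",
        # Authenticate with a platform access key (assumed to be set in the environment)
        "X-v3io-session-key": os.environ["V3IO_ACCESS_KEY"],
    },
    data=json.dumps(payload),
)
response.raise_for_status()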

The stream-enrich demo application includes an example of a Nuclio function that uses platform streams.

Using Spark Streaming

You can use the Spark Streaming API to ingest, consume, and analyze data using data streams. The platform features a custom Spark-Streaming Integration API to allow using the Spark Streaming API with platform streams.

Running SQL Queries on Platform Data

You can run SQL queries on NoSQL and Parquet data in the platform’s data store, using any of the following methods:

Running Full ANSI Presto SQL Queries

The platform has a default pre-deployed Presto service that enables using the Presto open-source distributed SQL query engine to run interactive SQL queries and perform high-performance low-latency interactive analytics on data that’s stored in the platform. To run a Presto query from a Jupyter notebook, all you need to do is use the %sql magic command followed by your Presto query. Such queries are executed as distributed queries across the platform’s application nodes. The basic-data-ingestion-and-preparation tutorial demonstrates how to run Presto queries using SQL magic.
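
For example, a query such as the following can be run from a notebook cell; the table path inside the “users” container is hypothetical, and NoSQL tables are referenced using the v3io.<container>."<table path>" naming form:

%sql SELECT * FROM v3io.users."iguazio/examples/stocks_kv" LIMIT 10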

Note that for running queries on Parquet tables, you need to work with Hive tables. The csv-to-hive tutorial includes a script that converts a CSV file into a Hive table.

Running Spark SQL Queries

The spark-sql-analytics tutorial demonstrates how to run Spark SQL queries on data in the platform’s data store.

For more information about the platform’s Spark service, see Working with Spark in this tutorial.

Running SQL Queries from Nuclio Functions

In some cases, you might need to run an SQL query as part of an event-driven application. The nuclio-read-via-presto tutorial demonstrates how to run an SQL query from a serverless Nuclio function.

Working with Parquet Files

Parquet is a columnar storage format that provides high-density high-performance file organization.
The parquet-read-write tutorial demonstrates how to create and write data to a Parquet table in the platform and read data from the table.
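
As a small sketch, the following code writes a pandas DataFrame to a Parquet file in the running user’s home directory and reads it back. The file path is hypothetical, and a Parquet engine such as pyarrow is assumed to be installed:

import pandas as pd

df = pd.DataFrame({"symbol": ["AAPL", "GOOG"], "price": [172.5, 135.2]})

# Hypothetical file path in the running user's home directory ("users" container)
parquet_path = "/User/examples/stocks.parquet"
df.to_parquet(parquet_path)

# Read the Parquet file back into a DataFrame
df_back = pd.read_parquet(parquet_path)
print(df_back)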

After you ingest Parquet files into the platform, you might want to create related Hive tables and run SQL queries on these tables.
The parquet-to-hive tutorial demonstrates how you can do this using Spark DataFrames.
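
A minimal sketch of this flow with Spark DataFrames might look like the following; the Parquet path and the Hive table name are hypothetical, and Hive support is assumed to be enabled for the Spark session:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("parquet-to-hive")
    .enableHiveSupport()
    .getOrCreate())

# Read a hypothetical Parquet file from the "users" container and register it as a Hive table
prqt_df = spark.read.parquet("v3io://users/iguazio/examples/stocks.parquet")
prqt_df.write.mode("overwrite").saveAsTable("stocks_tab")

# The table can now be queried with SQL (for example, with Spark SQL or Presto)
spark.sql("SELECT * FROM stocks_tab LIMIT 5").show()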

Accessing Platform NoSQL and TSDB Data Using the Frames Library

V3IO Frames (“Frames”) is a multi-model open-source data-access library, developed by Iguazio, which provides a unified high-performance DataFrame API for working with data in the platform’s data store. Frames currently supports the NoSQL (key-value) and time-series (TSDB) data models via its NoSQL (nosql|kv) and TSDB (tsdb) backends. The frames tutorial provides an introduction to Frames and demonstrates how to use it to work with NoSQL and TSDB data in the platform. See also the Frames API reference.
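
As a short sketch, the following code connects to the Frames service and uses the NoSQL backend to write and read a table. The Frames service address, container, and table path are assumptions based on a default platform setup:

import pandas as pd
import v3io_frames as v3f

# Connect to the platform's Frames service (framesd) and work with the "users" container
client = v3f.Client("framesd:8081", container="users")

# The DataFrame index is used as the table's primary key
df = pd.DataFrame({"symbol": ["AAPL", "GOOG"], "price": [172.5, 135.2]}).set_index("symbol")

# Write the DataFrame to a hypothetical NoSQL (key-value) table and read it back
client.write("kv", table="iguazio/examples/stocks_kv", dfs=df)
stocks_df = client.read("kv", table="iguazio/examples/stocks_kv")
print(stocks_df)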

Getting Data from AWS S3 Using curl

A simple way to ingest data from the Amazon Simple Storage Service (S3) into the platform’s data store is to run a curl command that sends an HTTP request to the relevant AWS S3 bucket, as demonstrated in the following code cell. For more information and examples, see the basic-data-ingestion-and-preparation tutorial.

%%sh
# Download a sample CSV file from a public AWS S3 bucket into the running user's home directory
CSV_PATH="/User/examples/stocks.csv"
mkdir -p "$(dirname "${CSV_PATH}")"
curl -L "https://iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv" > "${CSV_PATH}"

Running Distributed Python Code with Dask

Dask is a flexible library for parallel computing in Python, which is useful for computations that don’t fit into a DataFrame. Dask exposes low-level APIs that enable you to build custom systems for in-house applications. This helps parallelize Python processes and dramatically accelerates their performance. The dask-cluster tutorial demonstrates how to use Dask with platform data.
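
As a minimal sketch, the following code starts a Dask distributed client and processes a CSV file in parallel with a Dask DataFrame. With no scheduler address, a local cluster is created inside the notebook; on the platform you would typically pass the address of your Dask service’s scheduler. The CSV path is hypothetical:

import dask.dataframe as dd
from dask.distributed import Client

# Connect to a Dask scheduler (a local cluster is created when no address is given)
client = Client()

# Read a hypothetical CSV file and compute summary statistics in parallel
ddf = dd.read_csv("/User/examples/stocks.csv")
print(ddf.describe().compute())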

Running DataFrames on GPUs using NVIDIA cuDF

The platform allows you to use NVIDIA’s RAPIDS open-source libraries suite to execute end-to-end data science and analytics pipelines entirely on GPUs. cuDF is a RAPIDS GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. This library features a pandas-like API that will be familiar to data engineers and data scientists, who can use it to easily accelerate their workflows without going into the details of CUDA programming. The gpu-cudf-vs-pd tutorial demonstrates how to use the cuDF library and compares performance benchmarks between pandas and cuDF.
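
As a quick sketch, typical cuDF usage mirrors pandas; the CSV path is hypothetical, and a GPU with a RAPIDS environment (see the note below) is assumed:

import cudf

# Load a hypothetical CSV file into a GPU DataFrame; the API mirrors pandas
gdf = cudf.read_csv("/User/examples/stocks.csv")

# Typical pandas-style operations, executed on the GPU
print(gdf.head())
print(gdf.describe())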

Note
To use the cuDF library, you need to create a RAPIDS Conda environment. For more information, see the virtual-env tutorial.

What’s Next?

  • Review and run from a Jupyter Notebook service the data-ingestion-and-preparation tutorial notebooks that are most relevant to your development needs. (If you don’t already have a Jupyter Notebook service, create one.) A good place to start is the basic-data-ingestion-and-preparation.ipynb tutorial.
  • See the Ingesting and Consuming Files tutorial for more data-ingestion and consumption methods — using the platform’s dashboard, web API, and file-system interface.
  • Check out the other platform getting-started tutorials and guides. For example, if you want to use Spark DataFrames, check out Getting Started with Data Ingestion Using Spark.
  • See the detailed references for the APIs and tools that you wish to use.