Overview of Spark Datasets

Using Spark DataFrames

A Spark Dataset is an abstraction of a distributed data collection that provides a common way to access a variety of data sources. A DataFrame is a Dataset that is organized into named columns ("attributes" in the platform's unified data model). See the Spark SQL, DataFrames and Datasets Guide. You can use the Spark SQL Datasets/DataFrames API to access data that is stored in the platform.
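
For example, the following is a minimal sketch, assuming a Spark session running in an environment where the platform's v3io paths are accessible; the container name and file name are illustrative placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("platform-dataframes").getOrCreate()

# Read a CSV file that is stored in the platform into a DataFrame;
# "mycontainer" and "mydata.csv" are hypothetical names
df = spark.read.csv("v3io://mycontainer/mydata.csv", header=True, inferSchema=True)
df.show()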

In addition, the platform's Iguazio Spark connector defines a custom data source that enables reading and writing data in the platform's NoSQL store using Spark DataFrames. The connector supports table partitioning, data pruning and filtering (predicate pushdown), "replace"-mode and conditional item updates, defining and updating counter table attributes (columns), and optimized range scans. For more information, see The NoSQL Spark DataFrame.
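
As a sketch of such a read, the following code uses the connector's custom NoSQL data source (the io.iguaz.v3io.spark.sql.kv format identifier is described in the NoSQL Spark DataFrame reference) and applies a Spark filter that is eligible for predicate pushdown; the table path and filtered column are hypothetical:

# Read a NoSQL table through the Iguazio Spark connector;
# the filter on the hypothetical "age" attribute can be pushed
# down so that the matching is performed in the data store
df = (spark.read
      .format("io.iguaz.v3io.spark.sql.kv")
      .load("v3io://mycontainer/mytable")
      .filter("age > 30"))
df.show()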

See Spark DataFrame Data Types for the data types that are currently supported in the platform.

Spark DataFrames and Tables

A DataFrame consumes and updates data in a table, which is a collection of data objects — items (rows) — and their attributes (columns). The attribute name is the column name, and its value is the data stored in the relevant item (row). See also Working with NoSQL Data. As with all data in the platform, the tables are stored within data containers.

When writing (ingesting) data to a table with a Spark DataFrame, you need to set the key option to a column (attribute) that identifies the table's sharding key — the sharding-key attribute. With the NoSQL DataFrame, you can also optionally set the custom sorting-key write option to an attribute that identifies the table's sorting key — the sorting-key attribute. The combination of these keys makes up the table's primary key and is used to uniquely identify items in the table. When no sorting-key attribute is defined, the sharding-key attribute is also the table's primary-key attribute (identity column). See also Object Names and Primary Keys.
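
For example, the following sketch writes a DataFrame to a NoSQL table with both keys set; the "username" and "login-date" attribute names are illustrative:

# Write a DataFrame to a NoSQL table; "username" serves as the
# sharding key and "login-date" as the sorting key (illustrative
# names), and together they make up the table's primary key
(df.write
    .format("io.iguaz.v3io.spark.sql.kv")
    .option("key", "username")
    .option("sorting-key", "login-date")
    .save("v3io://mycontainer/mytable"))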

Data Paths

When using Spark DataFrames to access data in the platform's data containers, provide the path to the data as a fully qualified v3io path of the following format — where <container name> is the name of the parent data container and <data path> is the relative path to the data within the specified container:

v3io://<container name>/<data path>

You pass the path as a string parameter to the relevant Spark method for the operation that you're performing — such as load or csv for read or save for write. For example, save("v3io://mycontainer/mytable") or csv("v3io://mycontainer/mycoldata.csv").
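
The following sketch passes such fully qualified v3io paths to standard read and write methods; the container and data-path names are placeholders:

# Load a CSV file from the "mycontainer" container into a DataFrame
df = spark.read.csv("v3io://mycontainer/mycoldata.csv", header=True)

# Save the DataFrame back to the container in Spark's default
# (Parquet) output format
df.write.mode("overwrite").save("v3io://mycontainer/myoutput")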
