The Spark Service


The platform integrates the Apache Spark data engine for large-scale data processing, which is available as a user application service. You can use Spark together with other platform services to run SQL queries, stream data, and perform complex data analysis — both on data that is stored in the platform's data store and on external data sources such as RDBMSs or traditional Hadoop "data lakes". Spark support is powered by a stack of Spark libraries that includes Spark SQL and DataFrames for working with structured data, Spark Streaming for streaming data, MLlib for machine learning, and GraphX for graphs and graph-parallel computation. You can combine these libraries seamlessly in the same application.

Apache Spark stack

Spark is fully optimized when it runs on top of the platform's data services: the processing engine implements predicate pushdown and column pruning to filter data as close to the source as possible. These optimizations can, for example, filter data before it is transferred over the network, filter data before it is loaded into memory, or skip reading entire files or chunks of files.
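As a minimal sketch of how these optimizations surface in application code, the following PySpark query selects only two columns (column pruning) and applies a row filter (predicate pushdown); the table path and column names are hypothetical examples, not platform defaults:

```python
def pruned_query(spark, table_path):
    """Build a query that reads only the `passengers` and `origin`
    columns (column pruning) and only rows matching the filter
    (predicate pushdown), so the data source can skip the rest
    instead of shipping full rows and columns to Spark."""
    return (spark.read.parquet(table_path)
                 .select("passengers", "origin")  # column pruning
                 .filter("origin = 'SFO'"))       # predicate pushdown
```

On a live session, calling `.explain()` on the resulting DataFrame shows the pushed-down predicates in the physical plan (for Parquet sources, under a `PushedFilters` entry).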

The platform supports the standard Spark Dataset and DataFrame APIs in Scala, Java, Python, and R. In addition, it extends and enriches these APIs via the Iguazio Spark connector, which features a custom NoSQL data source that enables reading and writing data in the platform's NoSQL store using Spark DataFrames — including support for table partitioning, data pruning and filtering (predicate pushdown), performing "replace" mode and conditional updates, defining and updating counter table attributes (columns), and performing optimized range scans. The platform also supports the Spark Streaming API. For more information, see the Spark APIs reference.
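The following is a hedged sketch of reading and writing a platform NoSQL table with Spark DataFrames. The data-source name `io.iguaz.v3io.spark.sql.kv` and the `key` option are taken from the Iguazio Spark connector; the `v3io://` table path, column name, and filter in the usage comments are hypothetical examples:

```python
# Data-source name of the Iguazio NoSQL (key/value) Spark connector.
NOSQL_FORMAT = "io.iguaz.v3io.spark.sql.kv"

def write_nosql_table(df, table_path, key_column):
    """Write a DataFrame to a platform NoSQL table, using `key_column`
    as the table's primary-key attribute (the connector's "key" option)."""
    (df.write
       .format(NOSQL_FORMAT)
       .option("key", key_column)
       .mode("overwrite")
       .save(table_path))

def read_nosql_table(spark, table_path):
    """Read a platform NoSQL table into a DataFrame; filters applied to
    the result are pushed down to the data store where possible."""
    return spark.read.format(NOSQL_FORMAT).load(table_path)

# Usage on a platform cluster (path and columns are examples):
# write_nosql_table(df, "v3io://users/examples/cars", key_column="reg_license")
# blue = read_nosql_table(spark, "v3io://users/examples/cars") \
#            .filter("color = 'blue'")
```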

You can run Spark jobs on your platform cluster from a Jupyter or Zeppelin web notebook; for details, see Running Spark Jobs from a Web Notebook. You can also run Spark jobs by executing spark-submit from a web-based shell, or from a Jupyter terminal or notebook; for details, see Running Spark Jobs with spark-submit. You can find many examples of using Spark in the platform's Jupyter and Zeppelin tutorial notebooks, Spark data-ingestion quick-start tutorial, and Spark APIs reference. See also the Spark restrictions in the Software Specifications and Restrictions documentation.

You can also use the Spark SQL and DataFrames API to run Spark over a Java Database Connectivity (JDBC) connector. For more information, see Ingesting and Preparing Data.
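A minimal sketch of loading an external RDBMS table over JDBC with the standard Spark `jdbc` data source; the JDBC URL, table name, and credentials in the usage comment are placeholders, and the database's JDBC driver must be available on the Spark classpath:

```python
def read_jdbc_table(spark, jdbc_url, table, user, password):
    """Load an RDBMS table over JDBC into a Spark DataFrame."""
    return (spark.read.format("jdbc")
                 .option("url", jdbc_url)
                 .option("dbtable", table)
                 .option("user", user)
                 .option("password", password)
                 .load())

# Usage (URL, table, and credentials are placeholders):
# orders = read_jdbc_table(
#     spark, "jdbc:mysql://db-host:3306/shop", "orders", "myuser", "mypass")
```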

Note
It's good practice to create a Spark session at the start of the execution flow (for example, by calling SparkSession.builder and assigning the result to a spark variable) and stop the session at the end of the flow to release resources (for example, by calling spark.stop()).
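The session lifecycle described in the note can be sketched as a small helper that always stops the session when the flow completes; the application name is an example, and the import is done lazily so the sketch can be loaded without a Spark installation:

```python
def run_with_spark(job, app_name="my-app"):
    """Create a SparkSession at the start of the flow, run `job(spark)`,
    and always stop the session afterwards to release its resources."""
    # Imported here so the helper can be defined without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    try:
        return job(spark)
    finally:
        spark.stop()

# Usage:
# run_with_spark(lambda spark: spark.read.parquet("/path/to/data").count())
```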

See Also