Overview of the Spark APIs

Introduction

The platform supports the following standard Apache Spark APIs, together with custom extensions and APIs, for working with data over the Spark engine:

Spark Datasets — you can consume and update data in the platform by using Apache Spark SQL Datasets/DataFrames. You can also extend the standard Spark DataFrames functionality by using the platform's custom NoSQL Spark DataFrame data source. See Spark Datasets API Reference.
Spark Streaming API — you can use the platform's Spark-Streaming Integration Scala API to map platform streams to Spark input streams, and then use the Apache Spark Streaming API to consume data and metadata from these streams. See Spark-Streaming Integration API Reference.

Note that the platform's NoSQL Web API extends the functionality provided by the Spark APIs and related platform extensions. This API supports various item update modes, conditional-update logic and the use of update expressions, and the ability to define counter attributes. For more information, see NoSQL Web API Reference.

You can run Spark jobs in the platform by using standard industry tools. For example, you can run spark-submit from a web-based shell service or from a Jupyter Notebook service, or run Spark jobs from a web notebook such as Jupyter Notebook, provided the service is connected to a Spark service. All of these platform interfaces have a predefined SPARK_HOME environment variable that maps to the Spark installation directory. The Spark binaries directory ($SPARK_HOME/bin) contains the binaries and shell scripts required for running Spark; this directory is included in the environment path ($PATH), so you can run these programs from any directory. The installation also includes the library files required for using the platform's Spark APIs and the built-in Spark examples.
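
As a quick sanity check of the environment described above, the following Python sketch resolves the Spark binaries directory from SPARK_HOME and looks up spark-submit on the environment path. The helper names (spark_bin_dir, find_spark_submit) are illustrative only; they are not part of the platform or of Spark.

```python
import os
import shutil


def spark_bin_dir(env=None):
    """Return $SPARK_HOME/bin, or None if SPARK_HOME is not set."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return None
    return os.path.join(spark_home, "bin")


def find_spark_submit(env=None):
    """Return the full path of spark-submit as found on PATH, or None."""
    env = os.environ if env is None else env
    return shutil.which("spark-submit", path=env.get("PATH"))


if __name__ == "__main__":
    print("Spark binaries directory:", spark_bin_dir())
    print("spark-submit resolves to:", find_spark_submit())
```

If either value is reported as None, the service is probably not running in one of the preconfigured platform interfaces.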

Note

It's good practice to create a Spark session at the start of the execution flow (for example, by calling SparkSession.builder.getOrCreate() and assigning the result to a spark variable) and to stop the session at the end of the flow to release its resources (for example, by calling spark.stop()).
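
One possible way to follow this practice in PySpark is to wrap the session in a small context manager, so that the session is stopped even when the job fails. This is a minimal sketch and not a platform API: managed_session, main, and the "example-app" application name are illustrative, and the job body is a placeholder.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(builder):
    """Create a session from `builder` and always stop it on exit."""
    spark = builder.getOrCreate()
    try:
        yield spark
    finally:
        # Runs even if the job raises, so the session's resources are released.
        spark.stop()


def main():
    # Imported here so that the helper above can be reused and tested
    # without a Spark installation.
    from pyspark.sql import SparkSession

    # The application name is illustrative.
    builder = SparkSession.builder.appName("example-app")
    with managed_session(builder) as spark:
        # Placeholder job: count the numbers 0..999.
        print(spark.range(1000).count())
```

To run the flow, call main() at the end of the script, for example when submitting the file with spark-submit.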

Running Spark Jobs with spark-submit

You can run Spark jobs by executing spark-submit from the UI of a web-based shell service or from a terminal or notebook in the UI of a Jupyter Notebook service, provided the service is connected to a Spark service. For detailed information about spark-submit, see the Submitting Applications Spark documentation. The spark-submit command is mapped to the location of the script ($SPARK_HOME/bin/spark-submit), so you can run it without specifying the path.

The master URL of the Spark cluster is preconfigured in the environments of the platform web-based shell and Jupyter Notebook services. Do not use the --master option to override this configuration.

The library files for the built-in Spark examples are found at $SPARK_HOME/examples/jars. For example, the following command runs the SparkPi example, which calculates an approximate value of pi:

spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples*.jar 10

When the command succeeds, the output should contain a line similar to the following; the exact value varies between runs:

Pi is roughly 3.1432911432911435
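
The SparkPi example estimates pi with a Monte Carlo method, which is why the printed value is only approximate and differs from run to run. The following plain-Python sketch illustrates the same sampling idea on a single machine. It is not the SparkPi source code, and the function name and sample count are illustrative.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Estimate pi from the fraction of random points inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        # Draw a point uniformly from the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # The circle-to-square area ratio is pi/4, so scale the hit rate by 4.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print("Pi is roughly", estimate_pi(100_000))
```

Increasing the number of samples narrows the spread of the estimate, at the cost of a longer run.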

To run your own Spark application or library (JAR) file with spark-submit, upload the file to one of your cluster's data containers, and then specify the path to the file by using the v3io cluster data mount: /v3io/<container name>/<path to file>. For example, the following command runs a myapp.py Python application that is found in a pyspark_apps directory in the "projects" container:

spark-submit /v3io/projects/pyspark_apps/myapp.py
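
When you launch jobs from a script or a notebook cell, the /v3io/<container name>/<path to file> pattern can be assembled programmatically. The sketch below shows one way to do this; the helper names v3io_path and submit_command are illustrative and not part of the platform.

```python
import posixpath


def v3io_path(container, path_in_container):
    """Build /v3io/<container name>/<path to file> for a file in a data container."""
    return posixpath.join("/v3io", container, path_in_container.lstrip("/"))


def submit_command(container, path_in_container):
    """Return the spark-submit argument list for an application file."""
    return ["spark-submit", v3io_path(container, path_in_container)]


if __name__ == "__main__":
    # Reproduces the command from the example above.
    print(" ".join(submit_command("projects", "pyspark_apps/myapp.py")))
```

The returned list can be passed to subprocess.run(cmd, check=True) to launch the job from Python.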

Deploy Modes

Client Deployment

By default, spark-submit launches applications using the client deploy mode. In this mode, the driver is launched in the same process as the client that submits the application (such as Jupyter Notebook or a web shell).

Cluster Deployment

You can optionally submit Spark jobs using the cluster deploy mode by adding --deploy-mode=cluster to the spark-submit call. In this mode, the driver is launched from a worker process in the cluster. This mode is supported for Scala and Java applications; Spark doesn't currently support the cluster deploy mode for Python applications in standalone clusters.

Cluster deployment offers several advantages, such as the ability to automate job execution and to run Spark jobs remotely on the cluster. This is useful, for example, for long-running Spark jobs such as streaming jobs.
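
The following sketch shows one way to assemble a spark-submit call for either deploy mode while enforcing the restrictions described in this section: Python applications are rejected in cluster mode, and no --master option is added. The function name build_submit_args is illustrative, and the JAR path and class name in the usage lines are hypothetical placeholders.

```python
def build_submit_args(app_path, deploy_mode="client", main_class=None, app_args=()):
    """Return a spark-submit argument list for the given deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    if deploy_mode == "cluster" and app_path.endswith(".py"):
        # Cluster deploy mode is described above as Scala/Java only.
        raise ValueError("cluster deploy mode is not supported for Python applications")
    args = ["spark-submit", "--deploy-mode", deploy_mode]
    if main_class:
        args += ["--class", main_class]
    # No --master option: the master URL is preconfigured by the platform.
    args.append(app_path)
    args.extend(str(a) for a in app_args)
    return args


if __name__ == "__main__":
    # Hypothetical JAR path and main class, for illustration only.
    cmd = build_submit_args(
        "/v3io/projects/spark_apps/myapp.jar",
        deploy_mode="cluster",
        main_class="com.example.MyApp",
    )
    print(" ".join(cmd))
```

Building the call as a list rather than a single string avoids shell-quoting problems when the command is passed to subprocess.run.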

Running Spark Jobs from a Web Notebook

One way to run Spark jobs is from a web notebook for interactive analytics. The platform comes preinstalled with an open-source web-notebook application, Jupyter Notebook. (See Support and Certification Matrix and The Platform's Application Services.) For more information about this tool and how to use it to run Spark jobs, see the third-party product documentation.