Overview

Introduction

The platform supports the following standard Apache Spark APIs, as well as custom extensions and APIs, for working with data over the Spark engine:

  • Spark Datasets — you can consume and update data in the platform by using Apache Spark SQL Datasets/DataFrames; a generic DataFrame sketch follows this list. You can also extend the standard Spark DataFrames functionality by using the platform’s custom NoSQL Spark DataFrame data source. See Spark Datasets.

  • Spark Streaming API — you can use the platform’s Spark-Streaming Integration Scala API to map platform streams to Spark input streams, and then use the Apache Spark Streaming API to consume data and metadata from these streams. See Spark-Streaming Integration API.
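
For illustration, the following minimal Scala sketch uses only standard Spark DataFrame operations on a small in-memory dataset; it doesn’t use the platform’s custom NoSQL data source, and it assumes a predefined spark session (as in a Spark shell or notebook):

// Standard Spark DataFrame operations on an in-memory dataset.
// `spark` is the SparkSession that is predefined in a Spark shell or notebook.
import spark.implicits._

// Build a small DataFrame and apply a standard transformation
val df = Seq(("alice", 34), ("bob", 28), ("carol", 45)).toDF("name", "age")
df.filter($"age" > 30).show()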

Note that the platform’s NoSQL Web API extends the functionality provided by the Spark APIs and related platform extensions. This API supports various item update modes, conditional-update logic and the use of update expressions, and the ability to define counter attributes. For more information, see NoSQL Web API.

You can run Spark jobs in the platform by using standard industry tools. For example, you can run spark-submit from a web-based shell or Jupyter Notebook service, or run Spark jobs from a web notebook such as Jupyter Notebook or Zeppelin, provided the service is connected to a Spark service. All of these platform interfaces have a predefined SPARK_HOME environment variable that maps to the Spark installation directory. The Spark installation’s binaries directory ($SPARK_HOME/bin) contains the binaries and shell scripts that are required to run Spark; this directory is included in the environment path ($PATH), so you can run Spark from any directory. The installation also includes the library files that are required for using the platform’s Spark APIs, as well as the built-in Spark examples.

Note
It’s good practice to create a Spark session at the start of the execution flow (for example, by calling SparkSession.builder.getOrCreate() and assigning the returned session to a spark variable) and to stop the session at the end of the flow to release its resources (for example, by calling spark.stop()).
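
For example, a minimal Scala sketch of this pattern (the application name is arbitrary):

import org.apache.spark.sql.SparkSession

// Create (or get) a Spark session at the start of the execution flow
val spark = SparkSession.builder.appName("MyApp").getOrCreate()

// ... application logic that uses the session ...

// Stop the session at the end of the flow to release its resources
spark.stop()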

Running Spark Jobs with spark-submit

You can run Spark jobs by executing spark-submit from the UI of a web-based shell service, or from a terminal or notebook in the UI of a Jupyter Notebook service, provided the service is connected to a Spark service. For detailed information about spark-submit, see the Submitting Applications page in the Spark documentation. The spark-submit command is mapped to the location of its script ($SPARK_HOME/bin/spark-submit), so you can run it without specifying the full path.

The master URL of the Spark cluster is preconfigured in the environments of the platform web-based shell and Jupyter Notebook services. Do not use the --master option to override this configuration.

The library files for the built-in Spark examples are located in $SPARK_HOME/examples/jars. For example, you can run the following command to execute the SparkPi example, which computes an approximation of pi:

spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples*.jar 10

When the command succeeds, the output should contain a line similar to the following (the exact value varies between runs):

Pi is roughly 3.1432911432911435 

To refer spark-submit to your own Spark application or library (JAR) file, upload the file to one of your cluster’s data containers, and then specify the path to the file by using the v3io cluster data mount — /v3io/<container name>/<path to file>. For example, the following command runs a myapp.py Python application that is found in a pyspark_apps directory in the “bigdata” container:

spark-submit /v3io/bigdata/pyspark_apps/myapp.py

Running Spark Jobs from a Web Notebook

One way to run Spark jobs is from a web notebook for interactive analytics. The platform comes preinstalled with two open-source web-notebook applications — Jupyter Notebook and Apache Zeppelin (see Support and Certification Matrix and Application Services and Tools). For more information about these tools and how to use them to run Spark jobs, see the respective third-party product documentation.

Zeppelin Notes

When running Spark jobs from a Zeppelin notebook, note the following:

  • To use Spark, be sure to bind your Zeppelin note with the Spark interpreter. When you open a note, if you see a Settings | Interpreter binding section with a list of interpreters (starting with spark), optionally deselect any interpreters that you don’t need, and then scroll to the end of the list and select Save.

  • In some versions of Zeppelin, such as version 0.8.0, the Spark interpreter (%spark) produces an “illegal start of definition” error for the standard multi-line Scala command syntax used in this documentation, whereby new lines begin with a period (‘.’). This is a known Zeppelin issue — see the ZEPPELIN-1620 Apache bug. You can bypass the error by doing any of the following:

    • Embed the paragraph code within curly braces ({...}), as shown in the sketch after this list. Note that this applies a local scope to the code.
    • Join multiple lines into a single line for each command.
    • Edit the code to move all start-of-line periods to the end of the previous lines.
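
For example, the following Scala paragraph (illustrative only) wraps a multi-line command, in which continuation lines begin with a period, within curly braces so that the %spark interpreter evaluates the paragraph as a single block; because of the local scope, values defined inside the braces aren’t visible to other paragraphs:

{
  // Chained calls split across lines that begin with a period ('.'),
  // wrapped in curly braces so that Zeppelin parses the paragraph as one block
  val df = spark.range(1, 101)
    .toDF("value")
    .filter("value % 2 == 0")
  df.show(5)
}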