Ingesting and Consuming Files

Overview

This tutorial demonstrates how to ingest (write) a new file object to a data container in the platform, and consume (read) an ingested file, either from the dashboard or by using the Simple-Object Web API. The tutorial also demonstrates how to convert a CSV file to a NoSQL table by using the Spark SQL and DataFrames API.

The examples are for CSV and PNG files that are copied to example directories in the platform’s predefined data containers. You can use the same methods to process other types of files, and you can also modify the file and container names and the data paths that are used in the tutorial examples, but it’s up to you to ensure the existence of the elements that you reference.

Before You Begin

Follow the Working with Data Containers tutorial to learn how to create and delete container directories and access container data. As suggested in that tutorial, it’s recommended that you first review the Platform Fundamentals guide, and specifically the sections pertaining to the interfaces used in the current tutorial — the web APIs and Spark DataFrames.

Note that to send web-API requests, you need to have the URL of the parent tenant’s Web APIs service, and either a platform username and password or an access key for authentication. To learn more and to understand how to structure the web-API requests, see Sending Web-API Requests. (The tutorial Postman examples use the username-password authentication method, but if you prefer, you can replace this with access-key authentication, as explained in the documentation.)

Using the Dashboard to Ingest and Consume Files

You can easily ingest and consume files from the dashboard.

Ingesting Files Using the Dashboard

Follow these steps to ingest (upload) files to the platform from the dashboard:

  1. On the Data page, select a container (for example, bigdata) to display the container Browse data tab (default), which allows you to browse the contents of the container. To upload a file to a specific container directory, navigate to this directory as explained in the Working with Data Containers tutorial.

  2. Use either of the following methods to upload a file to the container:

    • Simply drag and drop a file into the main-window area (which displays the container or directory contents), as demonstrated in the following image:

      Dashboard - drag-and-drop a file to a container directory

    • Select the “upload” icon from the action toolbar and then, when prompted, browse to the location of the file in your local file system.

    For example, you can download the example bank.csv file and upload it to a mydata directory in the predefined “bigdata” container. Creating the directory is as simple as selecting the new-folder icon in the dashboard and entering the directory name; for detailed instructions, see the Working with Data Containers tutorial.

When the upload completes, you should see your file in the dashboard, as demonstrated in the following image:

Dashboard - container directory with an uploaded file

Consuming Files Using the Dashboard

Follow these steps to retrieve (download) an uploaded file from the dashboard:

  1. On the Data page, select a container (for example, bigdata) to display the container Browse data tab (default), and then use the side navigation tree to navigate to the directory that contains the file.

  2. Check the check box next to the file that you want to download, select the download icon from the action toolbar, and then select the location to which to download the file.

Using the Simple-Object Web API to Ingest and Consume Files

You can ingest and consume files by sending Simple-Object Web API HTTP requests using your preferred method, such as Postman or curl.

Ingesting Files Using the Simple-Object Web API

You can use Postman, for example, to send a Simple-Object Web API PUT Object request that uploads a file object to the platform:

  1. Create a new request and set the request method to PUT.

  2. In the request URL field, enter the following; replace <web-APIs URL> with the URL of the parent tenant’s Web APIs service, replace <container name> with the name of the container to which you want to upload the data, and replace <image-file path> with the relative path within this container to which you want to upload the file:

    <web-APIs URL>/<container name>/<image-file path>
        

    For example, the following URL sends a request to the web-APIs service URL https://default-tenant.app.mycluster.iguazio.com:8443 to upload a file named igz_logo.png to a mydata directory in the “bigdata” container:

    https://default-tenant.app.mycluster.iguazio.com:8443/bigdata/mydata/igz_logo.png
        

    Any directories in the specified <image-file path> that don’t already exist are created automatically, but the container itself must already exist.

  3. In the Authorization tab, set the authorization type to "Basic Auth" and enter your username and password in the respective credential fields.

  4. In the Body tab —

    1. Select the body format that matches the file that you’re uploading. For an image file, select binary.

    2. Select Choose Files and browse to the location of the file in your local file system.

  5. Select Send to send the request, and then check the response.

For a successful request, you should be able to see the uploaded image file from the dashboard: in the side navigation menu, select Data, and then select the container to which you uploaded the file (for example, “bigdata”). In the container’s Browse data tab, use the left navigation tree to navigate to the directory to which you uploaded the file (for example, mydata), and verify that the directory contains the uploaded file (for example, igz_logo.png).
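
If you prefer to work from the command line, you can send the same PUT Object request with curl. The following is a minimal sketch that reuses the example URL and the username-password (Basic) authentication method from the previous steps; the credentials and the local file path are placeholders that you need to replace with your own values:

    # Upload a local file as a file object in the "bigdata" container (curl -T sends the file as a PUT request body)
    curl -u <username>:<password> \
         -T /path/to/igz_logo.png \
         "https://default-tenant.app.mycluster.iguazio.com:8443/bigdata/mydata/igz_logo.png"

A success status code in the response indicates that the upload completed, and you can then verify the file from the dashboard as described above.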

Consuming Files Using the Simple-Object Web API

After you ingest a file, you can send a Simple-Object Web API GET Object request to retrieve (consume) it from the platform. For example, to retrieve the image file that you uploaded in the previous steps, define the following Postman request:

  1. Set the request method to GET.

  2. Enter the following as the request URL, replacing the <...> placeholders with the same data that you used in Step #2 of the ingest example:

    <web-APIs URL>/<container name>/<image-file path>
        

    For example:

    https://default-tenant.app.mycluster.iguazio.com:8443/bigdata/mydata/igz_logo.png
        

  3. In the Authorization tab, set the authorization type to "Basic Auth" and enter your username and password in the respective credential fields.

  4. Select Send to send the request, and then check the response Body tab. The response body should contain the contents of the uploaded file.
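
As with ingestion, you can also send the GET Object request with curl instead of Postman. The following is a minimal sketch along the same lines, again assuming the example URL and Basic authentication; the -o option saves the retrieved object to a local file:

    # Retrieve the uploaded file object and save it locally as igz_logo.png
    curl -u <username>:<password> \
         -o igz_logo.png \
         "https://default-tenant.app.mycluster.iguazio.com:8443/bigdata/mydata/igz_logo.png"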

Converting a CSV File to a NoSQL Table

The unified data model of the platform allows you to ingest a file in one format and consume it in another format. The following example uses the Spark SQL and DataFrames API to convert a bank.csv CSV file in a zeppelin_getting_started_example directory in the “bigdata” container to a “bank_nosql” NoSQL table in the same directory. A version of this example is included in the pre-deployed “Iguazio Getting Started Example” note of the platform Zeppelin service.

Note
  • The getting-started/collect-n-explore.ipynb platform tutorial Jupyter notebook contains similar Python example code for converting a CSV file to a NoSQL table. However, it uses a different example directory (users/<username>/examples), a different CSV file (stocks.csv), and a different NoSQL table (stocks_tab).
  • To run either the Zeppelin or Jupyter Notebook example, you first need to create the respective notebook service. See Creating a New Service.

The Step 1 shell paragraph in the example Zeppelin note uses Hadoop FS and curl commands to create the required sample directory and to download and copy the CSV file to this directory. The Step 2 Spark paragraph in this note includes CSV-to-NoSQL-table conversion code that is similar to the Scala code snippet in this tutorial. Therefore, you can run the example code by logging into your Zeppelin service, selecting the sample note, and running the code in these paragraphs sequentially. If you prefer to run the conversion code from a new dedicated note (for example, if you want to use Python instead of Scala), do the following:

  1. Download the example bank.csv file, if you have not already done so as part of the dashboard file-ingestion tutorial.
  2. From the dashboard Data > bigdata > Browse tab, create a new zeppelin_getting_started_example directory and upload bank.csv to this directory.
  3. Log into your Zeppelin service and create a new note (for example, “my_iguazio_trial”) that is bound to the “spark” interpreter.
  4. Copy the following Scala or Python sample code to a paragraph in your new Zeppelin note and run the Spark job from the note.

    Scala:

    %spark
    {
    import org.apache.spark.sql.functions.monotonically_increasing_id

    val myDF = spark.read.option("header", "true").option("delimiter", ";").option("inferSchema", "true")
        .csv("v3io://bigdata/zeppelin_getting_started_example/bank.csv")
    val nosqlDF = myDF.withColumn("id", monotonically_increasing_id() + 1)
    nosqlDF.show()

    nosqlDF.write.format("io.iguaz.v3io.spark.sql.kv")
        .mode("append").option("key", "id")
        .save("v3io://bigdata/zeppelin_getting_started_example/bank_nosql/")
    }

    Python:

    %pyspark
    import sys
    from pyspark.sql import SparkSession
    from pyspark.sql import *
    from pyspark.sql.functions import *

    myDF = spark.read.option("header", "true").option("delimiter", ";") \
        .csv("v3io://bigdata/zeppelin_getting_started_example/bank.csv",
             inferSchema="true")
    nosqlDF = myDF.withColumn("id", monotonically_increasing_id() + 1)
    nosqlDF.show()

    nosqlDF.write.format("io.iguaz.v3io.spark.sql.kv") \
        .mode("append").option("key", "id") \
        .save("v3io://bigdata/zeppelin_getting_started_example/bank_nosql/")

    Code Walkthrough
    • First, the contents of the bigdata/zeppelin_getting_started_example/bank.csv CSV file are read into a new myDF Spark DataFrame.
      The header option is set to "true" because the CSV file has a header line that specifies the column names, and the delimiter option is set to a semicolon (;) because the CSV file uses a semicolon column delimiter ("age";"job";"marital";...).
      The inferSchema option is set to "true" to automatically infer the data schema from the CSV data instead of defining the schema manually.
      [Code Lines: Scala 5–6 ; Python 7–9]
    • A NoSQL table must have a column (“attribute”) with unique values that serves as the table’s primary key. The value of a row’s primary-key attribute is also the name of the row (“item”). The example CSV file doesn’t have such a unique-value column. Therefore, before creating the NoSQL table, the code defines a new primary-key column (attribute) named id and assigns it unique, monotonically increasing numeric values starting with 1 (the values are consecutive when the data resides in a single Spark partition). The updated CSV content, with the new id column, is stored in a nosqlDF DataFrame.
      [Code Lines: Scala 7 ; Python 10]
    • For demonstration and testing purposes, the DataFrame show command is used to display the contents of the DataFrame before the following write operation. (This isn’t a necessary step in the conversion process.) [Code Lines: Scala 8 ; Python 11]
    • To complete the conversion, the content of the nosqlDF DataFrame is written to a zeppelin_getting_started_example/bank_nosql NoSQL table in the “bigdata” container. The save mode is set to "append", which means that if the table already exists, the contents of the DataFrame are appended to it (adding new items and updating existing items that have the same primary-key attribute values as rows in the DataFrame).
      [Code Lines: Scala 10–12 ; Python 13–15]
      Note
      Paths to data containers and their contents should be specified as absolute paths of the format v3io://<container name>[<data path>], as demonstrated in the example code ("v3io://bigdata/zeppelin_getting_started_example/bank.csv"; "v3io://bigdata/zeppelin_getting_started_example/bank_nosql/").
    • Step 2 in the example “Iguazio Getting Started Example” note also has the following code, which isn’t part of the CSV-to-NoSQL-table conversion:

        nosqlDF.printSchema()
        nosqlDF.createOrReplaceTempView("bank")

      The printSchema command prints the schema of the new “bank_nosql” table, as inferred by the platform as part of the DataFrame write operation. For more information, see Defining the Table Schema in the NoSQL DataFrame reference documentation.
      The createOrReplaceTempView command creates a temporary “bank” view for querying the DataFrame using Spark SQL, as demonstrated in the subsequent paragraphs of the “Iguazio Getting Started Example” note.

After the job completes successfully, verify that the table was created. You can do this by adding a new %sh paragraph with the following command to your Zeppelin note and running it to list the contents of the zeppelin_getting_started_example/bank_nosql directory in the “bigdata” container. The command should succeed, and the output should contain a .#schema file (which defines the table schema) and a file for each table item (row), named after the item’s id primary-key value (for example, 1). You can also run this list-directory command from another command-line interface, such as a Jupyter notebook, a terminal, or a web-based shell service:

    ls -lFA /v3io/bigdata/zeppelin_getting_started_example/bank_nosql/

You should also be able to see the new table in the dashboard: navigate to the Browse data tab of the target container and then browse to the zeppelin_getting_started_example/bank_nosql directory. You should see the schema and item files described above, as demonstrated in the following image:

Dashboard - container directory with a CSV file and a matching NoSQL table
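
Because the schema file and the item files are just file objects in the container, you can also fetch them with the Simple-Object Web API GET Object request demonstrated earlier in this tutorial. The following curl sketch assumes the same example web-APIs service URL and Basic authentication used in the previous examples and retrieves the generated .#schema file; note that the '#' character in the file name must be URL-encoded as %23:

    # Retrieve the inferred table-schema file via the Simple-Object Web API
    curl -u <username>:<password> \
         "https://default-tenant.app.mycluster.iguazio.com:8443/bigdata/zeppelin_getting_started_example/bank_nosql/.%23schema"

The response body should contain the table-schema definition that the platform inferred during the DataFrame write operation.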

For more information and examples of using Spark to ingest and consume data in different formats and convert data formats, see the Getting Started with Data Ingestion Using Spark tutorial.

What’s Next?

  • Select a getting-started tutorial or guide that best suits your interests and requirements.
  • Create a Jupyter Notebook service (if you don’t already have one) and read the provided introductory welcome.ipynb platform tutorial notebook (available also as a Markdown README.md file). Then, proceed to review and run the code in the tutorial getting-started or demo-application examples, according to your development needs. A good place to start is the data collection, ingestion, and exploration getting-started notebook, getting-started/collect-n-explore.ipynb.