Ingesting and Consuming Files
This tutorial demonstrates how to ingest (write) a new file object to a data container in the platform, and consume (read) an ingested file, either from the dashboard or by using the Simple-Object Web API. The tutorial also demonstrates how to convert a CSV file to a NoSQL table by using the Spark SQL and DataFrames API.
The examples are for CSV and PNG files that are copied to example directories in the platform’s predefined data containers. You can use the same methods to process other types of files, and you can also modify the file and container names and the data paths that are used in the tutorial examples, but it’s up to you to ensure the existence of the elements that you reference.
Before You Begin
Follow the Working with Data Containers tutorial to learn how to create and delete container directories and access container data. As suggested in that tutorial, it’s recommended that you first review the Platform Fundamentals guide, and specifically the sections pertaining to the interfaces used in the current tutorial — the web APIs and Spark DataFrames.
Note that to send web-API requests, you need to have the URL of the parent tenant’s webapi service, and either a platform username and password or an access key for authentication. To learn more and to understand how to structure the web-API requests, see Sending Web-API Requests. (The tutorial Postman examples use the username-password authentication method, but if you prefer, you can replace this with access-key authentication, as explained in the documentation.)
Using the Dashboard to Ingest and Consume Files
Ingesting Files Using the Dashboard
Follow these steps to ingest (upload) files to the platform from the dashboard:
On the Data page, select a container (for example, “bigdata”) to display the container Browse data tab (default), which allows you to browse the contents of the container. To upload a file to a specific container directory, navigate to this directory as explained in the Working with Data Containers tutorial.
Use either of the following methods to upload a file to the container:
Simply drag and drop a file into the main-window area (which displays the container or directory contents), as demonstrated in the following image:
Select the “upload” icon from the action toolbar and then, when prompted, browse to the location of the file in your local file system.
For example, you can download the example bank.csv file and upload it to a mydata directory in the default “bigdata” container. To create the directory, select the new-folder icon in the dashboard and enter the directory name; for detailed instructions, see the Working with Data Containers tutorial.
When the upload completes, you should see your file in the dashboard, as demonstrated in the following image:
Consuming Files Using the Dashboard
Follow these steps to retrieve (download) an uploaded file from the dashboard:
Using the Simple-Object Web API to Ingest and Consume Files
Ingesting Files Using the Simple-Object Web API
You can use Postman, for example, to send a Simple-Object Web API
In the request URL field, enter the following; replace <web-APIs URL> with the URL of the parent tenant’s webapi service, replace <container name> with the name of the container to which you want to upload the data, and replace <image-file path> with the relative path within this container to which you want to upload the file:
<web-APIs URL>/<container name>/<image-file path>
For example, the following URL sends a request to web-API service URL https://default-tenant.app.mycluster.iguazio.com:8443 to upload a file named igz_logo.png to a mydata directory in the “bigdata” container:
Any container directories in the specified <image-file path> that don’t already exist will be created automatically, but the container itself must already exist.
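The URL template above can also be assembled programmatically. The following Python sketch joins the three parts and percent-encodes the path. The helper name build_object_url is hypothetical (it is not part of any platform library), and the sample values are the ones used in this tutorial's example.

```python
from urllib.parse import quote


def build_object_url(webapi_url: str, container: str, object_path: str) -> str:
    """Join <web-APIs URL>/<container name>/<image-file path> into one URL.

    build_object_url is a hypothetical helper used for illustration only.
    """
    base = webapi_url.rstrip("/")
    # Percent-encode each part, keeping "/" as the directory separator
    # inside the object path.
    container_part = quote(container.strip("/"), safe="")
    path_part = quote(object_path.strip("/"), safe="/")
    return f"{base}/{container_part}/{path_part}"


# Values taken from the tutorial example.
url = build_object_url(
    "https://default-tenant.app.mycluster.iguazio.com:8443",
    "bigdata",
    "mydata/igz_logo.png",
)
print(url)
```

Encoding the path keeps the request valid when a directory or file name contains spaces or other reserved characters.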
Select the file format that matches the uploaded file. For an image file, select
Choose Files and browse to the file to upload in your local file system.
For a successful request, you should be able to see the uploaded image file from the dashboard:
in the side navigation menu, select
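Outside Postman, the same upload can be sketched in Python with the standard library. This is a minimal sketch under two assumptions that this tutorial does not spell out: that the upload is an HTTP PUT to the object URL with the file bytes as the request body, and that username-password authentication is sent as an HTTP Basic Authorization header. Check Sending Web-API Requests before relying on either. The function names are hypothetical.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Encode credentials as an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_upload_request(url, data, username, password):
    """Build (but do not send) the upload request.

    Assumption: the object is written with an HTTP PUT whose body is the
    raw file content.
    """
    req = urllib.request.Request(url, data=data, method="PUT")
    req.add_header("Authorization", basic_auth_header(username, password))
    req.add_header("Content-Type", "application/octet-stream")
    return req


def upload_file(url, local_path, username, password):
    """Read a local file, send it to the object URL, return the HTTP status."""
    with open(local_path, "rb") as f:
        data = f.read()
    req = build_upload_request(url, data, username, password)
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

For example, upload_file(url, "igz_logo.png", "<username>", "<password>") would send the local image to the object URL; replace the placeholders with your own credentials.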
Consuming Files Using the Simple-Object Web API
After you ingest a file, you can send a Simple-Object Web API
Enter the following as the request URL, replacing the <...> placeholders with the same data that you used in Step #2 of the ingest example:
<web-APIs URL>/<container name>/<image-file path>
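The download can be sketched in Python in the same way. This sketch assumes that the object is read with an HTTP GET on the same URL and that credentials are sent as an HTTP Basic Authorization header; both are assumptions rather than details confirmed by this tutorial, and the function names are hypothetical. The opener parameter exists only so that the function can be exercised without a live cluster.

```python
import base64
import urllib.request


def fetch_object(url, username, password, opener=urllib.request.urlopen):
    """Return the bytes of the object at url.

    Assumptions: HTTP GET reads the object, and Basic auth is accepted.
    """
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", "Basic " + token.decode("ascii"))
    with opener(req) as resp:
        return resp.read()


def save_object(url, dest_path, username, password, opener=urllib.request.urlopen):
    """Download the object and write it to dest_path; return the byte count."""
    data = fetch_object(url, username, password, opener)
    with open(dest_path, "wb") as f:
        f.write(data)
    return len(data)
```

Writing the response to a local file with save_object mirrors saving the response body from Postman.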
Converting a CSV File to a NoSQL Table
The unified data model of the platform allows you to ingest a file in one format and consume it in another format.
The following example uses the Spark SQL and DataFrames API to convert a
- The basic-data-ingestion-and-preparation platform tutorial Jupyter notebook has similar Python example code for converting a CSV file to a NoSQL table. However, it uses a different example directory (users/<username>/examples), a different CSV file (stocks.csv), and a different NoSQL table (stocks_tab).
- To run either the Zeppelin or Jupyter Notebook example, you first need to create the respective notebook service. See Creating a New Service.
The Step 1 shell paragraph in the example Zeppelin note uses Hadoop FS and curl commands to create the required sample directory and download and copy the CSV file to this directory. The Step 2 Spark paragraph in this note includes CSV to NoSQL-table conversion code that is similar to the Scala code snippet in this tutorial. Therefore, you can run the example code by logging into your Zeppelin service, selecting the sample note, and sequentially running the Spark jobs in these paragraphs. If you prefer to run the conversion code from a new dedicated note (for example, if you prefer to use Python instead of Scala), do the following:
- Download the example bank.csv file, if you have not already done so as part of the dashboard file-ingestion tutorial.
- From the dashboard Data | bigdata > Browse tab, create a new zeppelin_getting_started_example directory and upload bank.csv to this directory.
- Log into your Zeppelin service and create a new note (for example, “my_iguazio_trial”) that is bound to the “spark” interpreter.
- Copy the following Scala or Python sample code to a paragraph in your new Zeppelin note and run the Spark job from the note.
- First, the contents of the bigdata/zeppelin_getting_started_example/bank.csv CSV file are read into a new DataFrame. The header option is set to "true" because the CSV file has a header line that specifies the column names, and the delimiter option is set to a semicolon (;) because the CSV file uses a semicolon, rather than the default comma, as its column delimiter. The inferSchema option is set to "true" to automatically infer the data schema from the CSV data instead of defining the schema manually.
[Code Lines: Scala 5–6 ; Python 7–9]
- A NoSQL table must have a column (“attribute”) with unique values that serves as the table’s primary key.
The value of a row’s primary-key attribute is also the name of the row (“item”).
The example CSV file doesn’t have a uniquely identified column.
Therefore, before creating the NoSQL table, the code defines a new primary-key column (attribute) named id and assigns its values as sequential numbers starting with 1. The updated CSV content with the new id column is stored in the nosqlDF DataFrame.
[Code Lines: Scala 7 ; Python 10]
- For demonstration and testing purposes, the DataFrame show command is used to display the contents of the DataFrame before the following write operation. (This isn’t a necessary step in the conversion process.) [Code Lines: Scala 8 ; Python 11]
- To complete the conversion, the content of the nosqlDF DataFrame is written to a zeppelin_getting_started_example/bank_nosql NoSQL table in the “bigdata” container. The save mode is set to "append", which means that if the table already exists, the contents of the DataFrame are appended to the existing table (adding new items and updating existing items with the same primary-key attribute values as in the DataFrame).
[Code Lines: Scala 10–12 ; Python 13–15]
Note: Paths to data containers and their contents should be specified as absolute paths of the format v3io://<container name>[<data path>], as demonstrated in the example code.
Step 2 in the example “Iguazio Getting Started Example” note also has the following code, which isn’t part of the CSV to NoSQL conversion:
- The printSchema command prints the schema of the new “bank_nosql” table, as inferred by the platform as part of the DataFrame write operation. For more information, see Defining the Table Schema in the NoSQL DataFrame reference documentation.
- The createOrReplaceTempView command creates a temporary “bank” view for querying the DataFrame using Spark SQL, as demonstrated in the subsequent paragraphs of the “Iguazio Getting Started Example” note.
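The four conversion steps described above (read the CSV file, add an id column, show the DataFrame, write the NoSQL table) can be sketched in PySpark as follows. This is an illustrative sketch and not the sample code from the Zeppelin note. The function and helper names are hypothetical. The data-source name "io.iguaz.v3io.spark.sql.kv" and the "key" write option are assumptions that do not appear in this tutorial's text; verify both against the NoSQL DataFrame reference. Note also that monotonically_increasing_id() guarantees unique, increasing values but produces consecutive numbers only within a single partition.

```python
def v3io_path(container: str, data_path: str = "") -> str:
    """Build an absolute path of the format v3io://<container name>[<data path>]."""
    if data_path and not data_path.startswith("/"):
        data_path = "/" + data_path
    return f"v3io://{container}{data_path}"


CSV_PATH = v3io_path("bigdata", "/zeppelin_getting_started_example/bank.csv")
TABLE_PATH = v3io_path("bigdata", "/zeppelin_getting_started_example/bank_nosql/")

# Assumption: name of the platform's NoSQL Spark data source.
NOSQL_FORMAT = "io.iguaz.v3io.spark.sql.kv"


def convert_csv_to_nosql(spark, csv_path=CSV_PATH, table_path=TABLE_PATH):
    """Hypothetical helper: convert the CSV file to a NoSQL table."""
    # Imported here so that this module loads even where PySpark is absent.
    from pyspark.sql.functions import monotonically_increasing_id

    # 1. Read the semicolon-delimited CSV file, using its header line for
    #    the column names and inferring the column types from the data.
    my_df = (
        spark.read
        .option("header", "true")
        .option("delimiter", ";")
        .option("inferSchema", "true")
        .csv(csv_path)
    )

    # 2. Add a unique "id" primary-key column. The generated values start
    #    at 0, so adding 1 makes the first ID 1. Values are consecutive
    #    only within a single partition.
    nosql_df = my_df.withColumn("id", monotonically_increasing_id() + 1)

    # 3. Optional: display the DataFrame before writing it.
    nosql_df.show()

    # 4. Write the table in "append" mode.
    (
        nosql_df.write
        .format(NOSQL_FORMAT)
        .mode("append")
        .option("key", "id")  # Assumption: option that names the primary key.
        .save(table_path)
    )
    return nosql_df
```

In a Zeppelin PySpark paragraph, where the interpreter is assumed to provide a ready spark session object, the call would be convert_csv_to_nosql(spark).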
After the job completes successfully, verify that the table was created.
You can do this by adding a new %sh paragraph with the following code to your Zeppelin note, and then running it to list the contents of the table directory:
ls -lFA /v3io/bigdata/zeppelin_getting_started_example/bank_nosql/
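The same check can be sketched in Python, assuming that the /v3io mount used by the shell command is also visible to the Python process. The helper name list_table_dir is hypothetical.

```python
from pathlib import Path


def list_table_dir(table_dir):
    """Return the sorted entry names in table_dir, or raise if it is missing."""
    path = Path(table_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"NoSQL table directory not found: {path}")
    return sorted(entry.name for entry in path.iterdir())
```

For example, list_table_dir("/v3io/bigdata/zeppelin_getting_started_example/bank_nosql/") returns the names of the files in the table directory and raises an error if the table was not created.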
You should also be able to see the new table in the dashboard:
navigate to the
For more information and examples of using Spark to ingest and consume data in different formats and convert data formats, see the Getting Started with Data Ingestion Using Spark tutorial.
- Select a getting-started tutorial or guide that best suits your interests and requirements.
- Create a Jupyter Notebook service (if you don’t already have one) and read the provided introductory welcome.ipynb platform tutorial notebook (available also as a Markdown README.md file). Then, proceed to review and run the code in the tutorial getting-started or demo-application examples, according to your development needs. A good place to start is the data collection, ingestion, and exploration getting-started notebook: data-ingestion-and-preparation/basic-data-ingestion-and-preparation.ipynb.