The Presto CLI

On This Page

Overview

The web-based shell service and the terminals of the Jupyter Notebook service are automatically connected to the Presto service and include a copy of the native Presto command-line interface (CLI) — presto-cli — which you can use to query data in the platform. The native Presto CLI is found in the /usr/local/bin directory, which is included in the environment path ($PATH) to simplify execution from any directory. To facilitate using Presto with the Iguazio Presto connector to query NoSQL tables in the platform’s data containers, the environment path also contains a presto wrapper that preconfigures your cluster’s Presto server URL, the v3io catalog, the Presto user’s username and password (platform access key), and the Presto Java TrustStore file and password. For detailed information about Presto and its CLI, refer to the Presto documentation.

You start the Presto CLI by running either presto (recommended) or presto-cli from a web shell or JupyterLab terminal. For information about the supported CLI options, see CLI Options. For example:

presto

You can stop the CLI, at any time, by running the exit command.

CLI Options

When starting the CLI, you can specify any supported native Presto CLI option. Use the --help option to see a full list. The following options are especially relevant when using the CLI in the platform:

--server

Sets the location of the Presto server. The presto wrapper already preconfigures the server location for your platform cluster. However, when running presto-cli, you must set –server to the location of the Presto server in your cluster. The Presto server URL is the API URL of the predefined Presto service (presto), which you can copy from the Services page of the platform dashboard. The following command demonstrates setting the Presto server URL to https://presto-api-presto.default-tenant.app.mycluster.iguazio.com:

presto-cli --server https://presto-api-presto.default-tenant.app.mycluster.iguazio.com
--catalog

Sets the default Presto-connector catalog. If you don’t configure a default catalog, you need to specify the catalog in the FROM string of each Presto command; for table commands, the catalog is specified at the start of the table path. (You can override the default configuration by specifying another catalog in specific Presto commands.)

To use the Iguazio Presto connector to query platform NoSQL tables, you need to use the v3io catalog (see The v3io Catalog). The presto wrapper already preconfigures the catalog to v3io. When running presto-cli, you can optionally use the –catalog option to set the default Presto-connector catalog to v3io. For example:

presto-cli --catalog v3io --server https://presto-api-presto.default-tenant.app.mycluster.iguazio.com
--schema

Sets the default Presto schema. In the Iguazio Data Science Platform, the Presto schema is the name of the data container that contains the queried tables. If you don’t configure a default container, you need to include the container name as part of the table path in each Presto command. (You can override the default configuration by specifying another container name in specific Presto commands.) This is true for both presto-cli and presto. For example, the following command configures the CLI to query tables in the default “bigdata” container using presto:

presto --schema bigdata

And this is an example of a similar command using the native Presto CLI (presto-cli):

presto-cli --schema bigdata --catalog v3io --server https://presto-api-presto.default-tenant.app.mycluster.iguazio.com

The v3io Catalog

To configure Presto to work with the Iguazio Presto connector for querying data in the platform’s NoSQL store, you need to use the connector’s custom v3io Presto catalog. The presto CLI wrapper already preconfigures this catalog. When running the native Presto CLI (presto-cli), you can set the –catalog option to configure v3io as the default catalog. For example:

presto-cli --catalog v3io --server https://presto-api-presto.default-tenant.app.mycluster.iguazio.com

If you don’t configure the v3io catalog when starting the CLI (either by using presto or by explicitly setting –catalog v3io), you need to specify the catalog in the FROM string of each Presto command; for commands that reference platform NoSQL tables, the table path must begin with v3io (see Table Paths for details). The following example queries a NoSQL “mytable” table in a “mycontainer” data container:

SELECT * from v3io.mycontainer.mytable;

Supported Commands

After starting the Presto CLI, you can run supported commands for your selected catalog from the Presto command line. Version 2.10.0 of the Iguazio Presto connector’s v3io catalog supports the Presto CREATE VIEW, DROP VIEW, SELECT, SHOW CATALOGS, SHOW CREATE VIEW, SHOW FUNCTIONS, SHOW SCHEMAS, and SHOW TABLES queries and the custom v3io.schema.infer command. See the v3io query examples.

Note
  • SHOW TABLES returns only tables that reside in the container’s root directory, provided the access key includes data-access permissions for this directory.

  • To use the view commands (CREATE VIEW, DROP VIEW, and SHOW CREATE VIEW), you first need to enable Hive for the Presto service. See Enabling Hive. You can then save views of platform NoSQL tables, as well as other supported file types, to the default schema of the Hive presto connector (hive.default).

Table Paths

When using the Iguazio Presto connector, you can specify table paths in one of two ways:

  • Table name — this is the standard Presto syntax and is currently supported only for tables that reside directly in the root directory of the configured data container (Presto schema).

    • When using built-in Presto commands, such as SELECT, you specify the path as v3io.<container name>.<table name>. For example, SELECT * FROM v3io.mycontainer.mytable;.
    • When using the custom v3io.schema.infer command, you pass the container and table names as separate parameters — v3io.schema.infer('<container name>', '<table name>');. For example, call v3io.schema.infer ('mycontainer', 'mytable');.
  • File path — the relative path to the table within the configured data container (/path/to/table). Currently, nested tables in the platform’s data containers must be referenced using this syntax.

    • When using built-in Presto commands, such as SELECT, you specify the path as v3io.<container name>."/path/to/table". For example, SELECT * FROM v3io.mycontainer."/mytables/cars/vendors";. Note that the table path must be embedded within double quotes.
    • When using the custom v3io.schema.infer command, you pass the container name and table path as separate parameters — v3io.schema.infer('<container name>', '/path/to/table');. For example, call v3io.schema.infer ('mycontainer', '/mytables/cards/vendors');.
Note
  • For both syntax variations, in standard Presto commands you can optionally omit the catalog and container (schema) names if they’re already preconfigured; see the CLI –catalog and –schema options. The presto wrapper preconfigures the v3io catalog.

  • Tables in a data container’s root directory can be accessed by using either the table-name or file-path syntax. The table-name syntax is simpler but slower. Therefore, it’s recommended that you use the path syntax when you need to frequently repeat a specific query.

  • Table-path letter case —

    • The table-name syntax (which is supported for tables in the root container directory) ignores the letter case in the table path. Therefore, it also supports uppercase letters in the path; (note that the table names will appear in lowercase letters in query results).
    • The file-path syntax doesn’t currently support uppercase letters in the table path.

Defining the NoSQL Table Schema

Presto handles structured data. Therefore, it needs to be aware of the schema of the data structure. (Don’t confuse this with native Presto schemas, which are used for organizing tables — as explained, for example, for the –schema option.) When writing NoSQL data in the platform using Frames or a Spark DataFrame, the schema of the data table is automatically identified and saved and then retrieved when using Frames, Spark DataFrames, or Presto to read data from the same table (unless you select to explicitly define the schema for the read operation). However, to use Presto, Frames, or Spark DataFrames to read NoSQL data that was written to a table in another way, you first need to define the table schema. You can do this by using the platform’s custom Presto v3io.schema.infer command, which generates the required schema file. For more information, see the NoSQL Table Schema Reference.

The v3io.schema.infer Command

The Iguazio Presto connector exposes a v3io.schema.infer command that can be used to infer the schema of an existing table by analyzing its data. The command has the following syntax — where <container name> is the name of the data container (schema) that contains the table and <relative table path> is the relative path to the table within the container (see Table Paths):

call v3io.schema.infer('<container name>', '<relative table path>');

When the table resides in the container’s root directory, the relative path can be the table name. For example, the following command infers the schema of a “mytable” table in the root directory of a “mycontainer” data container:

call v3io.schema.infer('mycontainer', 'mytable');

For nested tables, you need to specify the table path as '/path/to/table'. For example, the following command infers the schema of a “mytable” table in a mydata directory of a “mycontainer” data container:

call v3io.schema.infer('mycontainer', '/mydata/mytable');

The infer-schema command creates a JSON schema file (.#schema) in the table directory. You can find more information about this file in the NoSQL Table Schema Reference, although note that you don’t typically need to edit this file.

Partitioned Tables

Table partitioning is a common technique for optimizing physical data layout and related queries. In a partitioned table, some item attributes (columns) are used to create partition directories within the root table directory using the format <table path>/<attribute>=<value>[/<attribute>=<value>/…], and each item is then saved to the respective partition directory based on its attribute values. For example, for a "mytable" table with year and month attribute partitions, an item with attributes year = 2018 and month = 1 will be saved to a mytable/year=2018/month=1/ directory. This allows for more efficient data queries that search for the data only in the relevant partition directories instead of scanning the entire table. This technique is used, for example, by Hive, and is supported for all the built-in Spark Dataset file-based data sources (such as Parquet, CSV, and JSON). See also the Partitioned Tables documentation in the NoSQL Databases concepts page, including best practices.

The Iguazio Presto connector supports querying of partitioned NoSQL tables: a partitioned table is queried like any other table, with the table path set to the root table directory and not to a specific partition directory. When processing queries, the platform searches the root table directory that is specified in the read command for nested directories of the format <attribute>=<value>. If it finds such directories, it searches only the partition directories that match the query. For example, for a “mytable” table in a “mycontainer” data container that’s partitioned by year and month attributes, a SELECT * FROM v3io.mycontainer.mytable WHERE month = 12; query will return only the items from the month=12 partition directories in all year=* directories.

Read Optimization

The Iguazio Presto connector supports the following optimized table queries (reads), which are more efficient compared to the standard full table scan:

For more information about these query types, see NoSQL Read Optimization.

Faster Item-Specific Queries

The fastest Presto NoSQL table queries are those that uniquely identify a specific item by its primary-key value. See NoSQL Faster Item-Specific Queries .

Item-Specific Query Operators
The Iguazio Presto connector executes this faster processing for queries that apply the equal-to (=) or IN (IN) operator to the sharding-key attribute and optionally also apply one of these operators the sorting-key attribute (in the case of a compound primary key).

Faster Item-Specific Query Examples

The following commands all identify a specific item by its primary-key value and will be processed more quickly than table-scan processing; (it is assumed that v3io is configured as the default catalog):

  • Retrieve an item with the simple primary-key value “345”:

    SELECT * FROM mycontainer.mytable" WHERE id = 345;
    
  • Retrieve an item with the compound primary-key value “myfile.txt”:

    SELECT * FROM mycontainer.mytable" WHERE basename = 'myfile' and suffix = 'txt';
    

Range Scans

A Presto NoSQL table query that uses supported sharding-key and optional sorting-key filters to retrieve items with the same sharding-key value, is processed by performing a range scan, which is more efficient than the standard full table scan. See NoSQL Range Scans.

Range-Scan Operators
The Iguazio Presto connector uses range scan for compound primary-key table queries that apply the equal-to (=) or IN (IN) operator to the sharding-key attribute, and optionally also apply a comparison operator (=/>/>=/</<= / BETWEEN) to the sorting-key attribute.

Range-Scan Query Examples

The following commands query a “rides” table in a mytaxis table in a “mycontainer” data container. The table is assumed to have compound <sharding key>.<sorting key> primary key, a driver_id sharding-key attribute, a date sorting-key attribute, and a compatible schema — which enable performing range-scan queries that use a sharding-key and optionally also a sorting-key filter. You can find sample Spark DataFrame code for creating a compatible range-scan table in the NoSQL Spark DataFrame reference. The following commands and outputs are compatible with this sample table; (it is assumed that v3io is configured as the default catalog):

  • Retrieve all items with a driver_id sharding-key attribute value of 1 (regardless of the sorting-key value):

    SELECT * FROM mycontainer."/mytaxis/rides/" WHERE driver_id = 1;
    

    Output

       date   |    avg_ride_km     | avg_ride_passengers | driver_id | total_km | num_rides | total_passengers
    ----------+--------------------+---------------------+-----------+----------+-----------+------------------
     20180601 |                5.0 |                 1.6 |         1 |    125.0 |        25 |               40
     20180602 |                5.3 |                 2.3 |         1 |    106.0 |        20 |               46
     20180701 | 3.8000000000000003 |                 1.5 |         1 |    106.4 |        28 |               42
    (3 rows)
        

  • Retrieve all items with a driver_id sharding-key attribute value of 24 and a date sorting-key attribute value within the first six months of 2018:

    SELECT * FROM mycontainer."/mytaxis/rides/" WHERE driver_id = 24 AND date >= '20180101' AND date < '20180701';
    

    Output

       date   | avg_ride_km | avg_ride_passengers | driver_id | total_km | num_rides | total_passengers
    ----------+-------------+---------------------+-----------+----------+-----------+------------------
     20180602 |        52.0 |                 2.2 |        24 |    260.0 |         5 |               11
     20180601 |        41.5 |                2.25 |        24 |    332.0 |         8 |               18
    (2 rows)
        

  • Retrieve all items with a driver_id sharding-key attribute value of 1, 16, or 24 (regardless of the sorting-key value) and an avg_ride_passengers attribute value that is greater or equal to 3:

    SELECT * FROM mycontainer."/mytaxis/rides/" WHERE driver_id IN (1, 16, 24) AND avg_ride_passengers >= 3;
    

    Output

         date   |    avg_ride_km     | avg_ride_passengers | driver_id | total_km | num_rides | total_passengers
      ----------+--------------------+---------------------+-----------+----------+-----------+------------------
       20180701 | 32.199999999999996 |                 4.0 |        16 |    193.2 |         6 |               24
       20180601 |              224.2 |                 8.0 |        16 |    224.2 |         1 |                8
       20180602 |               24.4 |                 4.5 |        16 |    244.0 |        10 |               45
       20180701 | 50.300000000000004 |                 3.0 |        24 |    352.1 |         7 |               21
      (4 rows)
        

v3io Query Examples

Note
  • The examples in this section assume that v3io has been configured as the default catalog. If this isn’t the case, add “FROM v3io” in SHOW SCHEMAS commands and “v3io.” at the start of the table paths (before the name of the data container) in SHOW TABLES and SELECT commands. For example, replace “bigdata.mytable” with “v3io.bigdata.mytable”; see Table Paths for details.
  • See also the separate range-scan query examples.

The following command lists all the data containers (schemas) in the parent platform tenant whose names end in “data”:

SHOW SCHEMAS LIKE '%data';

The following command lists all the tables in the root directory of the “bigdata” data container. (Remember that the SHOW TABLES command only identifies tables in the container’s root directory.)

SHOW TABLES IN bigdata;

The following command shows the contents of a “mytable” table in the “bigdata” data container:

SELECT * FROM bigdata.mytable;

The following command shows the contents of a nested “tests/nosql/table1” table in the “bigdata” container; (see the file-path syntax):

SELECT * FROM bigdata."/tests/nosql/table1";

If you configured the default Presto schema when starting the CLI to “bigdata”, you can optionally run the commands from the previous examples without explicitly specifying the name of the data container:

SHOW TABLES;
SELECT * FROM mytable;
SELECT * FROM "/tests/nosql/table1";

v3io Query Examples Using Views

The following commands demonstrate how to create and use a query view for a NoSQL table. Note that because the view is saved to the default Hive schema, before you create the view you need to ensure that Hive is enabled for the platform’s Presto service:

Note
You can generate a compatible table by running the first steps of the getting-started example in the basic-data-ingestion-and-preparation.ipynb tutorial notebook. Just remember to replace iguazio in the following commands with your platform username.

  1. Create an iguazio_stocks_tab_etc_view view of a SELECT query for all items with the securitytype attribute value "ETC" in a users/<username>/examples/stocks_tab table for user “iguazio”:

    CREATE VIEW hive.default.iguazio_stocks_tab_etc_view AS SELECT * FROM users."/iguazio/examples/stocks_tab" WHERE securitytype = 'ETC';
    
  2. Show the view’s SQL statement:

    SHOW CREATE VIEW hive.default.iguazio_stocks_tab_etc_view;
    
  3. Use the view to return all items in the table:

    SELECT * FROM hive.default.iguazio_stocks_tab_etc_view;
    

    You can also optionally apply an additional filter to the view query. For example, the following query returns all items with the securitytype attribute value "ETC" (view query) and a numberoftrades attribute value that’s greater than 1:

    SELECT * FROM hive.default.iguazio_stocks_tab_etc_view WHERE numberoftrades > 1;
    

See Also