Overview

Introduction

You can use the Presto open-source distributed SQL query engine to run interactive SQL queries and perform high-performance low-latency interactive analytics on data that is stored in the platform. Running Presto over the platform’s data services enables you to filter data as close as possible to the source. The platform’s Iguazio Presto connector defines a custom data source that enables you to use Presto to query data in the platform’s NoSQL store — including support for table partitioning, predicate pushdown, column pruning, and performing optimized item-specific and range-scan queries. You can also use Presto’s built-in Hive connector to query data of the supported file types, such as Parquet or ORC, that is stored in platform data containers; see Using the Hive Connector.

The default v2.3.1 platform installation includes the following components of Presto version 0.206:

  • The Presto command-line interface (CLI) for running queries. The web-based shell service and the terminals of the Jupyter Notebook platform service are automatically connected to the predefined Presto service and include both the native Presto CLI (presto-cli) and a presto wrapper to simplify working with the Iguazio Presto connector. For more information, see The Presto CLI.
  • The Presto server. The server address is the API URL of the Presto service, which you can copy from the dashboard Services page.
  • The Presto web interface for monitoring and managing queries. You can access this interface by using the HTTP or HTTPS UI URLs of the Presto service, which are available from the dashboard Services page.
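For illustration, the following is a hedged sketch of submitting a single query with the native Presto CLI from a web shell or Jupyter terminal; <presto-api-url> is a placeholder for the API URL of the Presto service, and the v3io catalog name is assumed for the Iguazio Presto connector:

```shell
# Hedged sketch: run one query non-interactively with the native Presto CLI.
# <presto-api-url> stands for the API URL copied from the dashboard Services
# page; the v3io catalog name is an assumption for the Iguazio connector.
presto-cli --server <presto-api-url> \
           --catalog v3io \
           --execute "SHOW SCHEMAS;"
```

In the platform's web-based shell and Jupyter terminals, the presto wrapper sets the server and credentials for you, so such flags are typically only needed with the native presto-cli.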

Table Paths

For information about setting table paths when using the Iguazio Presto connector, see Table Paths in the Presto CLI reference. For information about setting table paths when using the Hive connector, see Using the Hive Connector.

Using the Hive Connector

You can use Presto’s built-in Hive connector to query data of the supported file types, such as Parquet or ORC, that is stored in platform data containers, or to save table-query views to the default Hive schema (hive.default).

Running the Hive CLI
You can start the Hive CLI in the platform by running the hive command from a web shell or Jupyter terminal.
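For example (a hedged sketch; the hive executable is assumed to be on the path, as it is in the platform's web shell and Jupyter terminals):

```shell
# Start an interactive Hive CLI session:
hive
# Or run a single HiveQL statement non-interactively with the -e option:
hive -e "SHOW TABLES;"
```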

Enabling Hive

To use the Presto Hive connector, you first need to create a Hive Metastore by enabling Hive for the platform’s Presto service:

  1. On the Services dashboard page, select to edit the Presto service and navigate to the Custom Parameters tab.
  2. Check the Enable Hive check box and provide the required configuration parameters:

    • Username — the name of a platform user for creating and accessing the Hive Metastore.
    • Container — the name of the data container that contains the Hive Metastore.
    • Hive Metastore path — the relative path to the Hive Metastore within the configured container. If the path doesn’t exist, the platform creates it.

    Select Save Service to save your changes.

  3. Select Apply Changes from the top action toolbar of the Services page to deploy your changes.

Note
If you later select to disable Hive or change the Hive Metastore path, the previously configured Hive Metastore won’t be deleted automatically. You can delete it like any other directory in the platform’s distributed file system, by running a file-system command (such as rm -rf) from a command-line interface (a web shell or a Jupyter notebook or terminal).

Creating External Tables

To use the Hive connector to query data in a platform data container, you first need to use the Hive CLI to run a CREATE EXTERNAL TABLE statement that creates an external table. The statement should map the relevant data path to a unique table name, and define the names and data types of the table’s columns (attributes); the data path in the statement should be specified as a fully qualified v3io path of the format v3io://<container name>/<relative data path>:

CREATE EXTERNAL TABLE <table name> (<column name> <column type>[, <column name> <column type>, ...]) STORED AS <file type> LOCATION '<data path>';

For example, the following command creates an external Hive table that links to a “prqt1” Parquet file in a “mycontainer” container with a string col1 column and a big-integer col2 column:

CREATE EXTERNAL TABLE prqt1 (col1 string, col2 bigint) STORED AS PARQUET LOCATION 'v3io://mycontainer/prqt1';

You can then reference this table by its name from Presto queries that use the hive catalog. For example:

SELECT * FROM hive.mycontainer.prqt1;
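Building on this example table, a filtered query can take advantage of predicate pushdown and column pruning; the column names below match the table defined above, and the threshold value is illustrative:

```sql
-- Hypothetical query against the example table: return only col1,
-- for rows whose col2 value exceeds an illustrative threshold.
SELECT col1
FROM hive.mycontainer.prqt1
WHERE col2 > 100;
```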

Defining Table Partitions

The Hive connector can also be used to query partitioned tables (see Partitioned Tables in the Presto CLI reference), but it doesn’t automatically identify table partitions. Therefore, after creating an external table, you first need to use the Hive CLI to define the table partitions. You can do this by using either of the following methods:

  • Use the MSCK REPAIR TABLE statement to automatically identify the table partitions and update the table metadata in the Hive Metastore:

    MSCK REPAIR TABLE <table name>;
        

    For example, the following command updates the partition metadata for an external “prqt1” table:

    MSCK REPAIR TABLE prqt1;
          

    This is the simplest method, but it only identifies partition directories whose names are of the format <column name>=<column value>.

  • Use the ALTER TABLE ADD PARTITION statement to manually define partitions — where <partition spec> is of the format <partition column> = <partition value>[, <partition column> = <partition value>, ...], and <partition path> is a fully qualified v3io path of the format v3io://<container name>/<relative table-partition path>:

    ALTER TABLE <table name> ADD [IF NOT EXISTS] PARTITION (<partition spec>) LOCATION '<partition path>'[, PARTITION <partition spec> LOCATION '<partition path>'];
        

    For example, the following command defines a partition whose year partition column has the value 2019, and maps it to a partition directory named year=2019 in a “prqt1” table in a “mycontainer” container:

    ALTER TABLE prqt1 ADD PARTITION (year=2019) LOCATION 'v3io://mycontainer/prqt1/year=2019';
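
Both methods assume that the external table was created with one or more partition columns. The following is a hedged sketch of such a CREATE EXTERNAL TABLE statement for the “prqt1” example; the year partition column is illustrative and not taken from the original table definition:

```sql
-- Hypothetical: create an external Parquet table partitioned by year.
-- Partition columns are declared in PARTITIONED BY, not in the column list.
CREATE EXTERNAL TABLE prqt1 (col1 string, col2 bigint)
PARTITIONED BY (year int)
STORED AS PARQUET
LOCATION 'v3io://mycontainer/prqt1';
```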
        

See Also