You can use the Presto open-source distributed SQL query engine to run interactive SQL queries and perform high-performance low-latency interactive analytics on data that is stored in the platform. Running Presto over the platform’s data services enables you to filter data as close as possible to the source. The platform’s Iguazio Presto connector defines a custom data source that enables you to use Presto to query data in the platform’s NoSQL store — including support for table partitioning, predicate pushdown, column pruning, and performing optimized item-specific and range-scan queries. You can also use Presto’s built-in Hive connector to query data of the supported file types, such as Parquet or ORC, that is stored in platform data containers; see Using the Hive Connector.
The default v2.8.0 platform installation includes the following Presto version 323 components:
- The Presto command-line interface (CLI) for running queries.
The web-based shell service and the terminals of the Jupyter Notebook platform service are automatically connected to the predefined Presto service and include both the native Presto CLI (
presto-cli) and a prestowrapper to simplify working with the Iguazio Presto connector. For more information, see The Presto CLI.
- The Presto server.
The server address is the API URL of the Presto service (
presto), which you can copy from the dashboard
- The Presto web UI for monitoring and managing queries.
This interface can be accessed by using the HTTP or HTTPS UI URLs of the Presto service, which are available from the dashboard
For information about setting table paths when using the Iguazio Presto connector, see Table Paths in the Presto CLI reference. For information about setting table paths when using the Hive connector, see Using the Hive Connector.
Using the Hive Connector
You can use Presto’s built-in Hive connector to query data of the supported file types, such as Parquet or ORC, that is stored in platform data containers, or to save table-query views to the default Hive schema (
To use the Presto Hive connector, you first need to create a Hive Metastore by enabling Hive for the platform’s Presto service:
Servicesdashboard page, select to edit the Presto service and navigate to the Custom Parameterstab.
Enable Hivecheck box and provide the required configuration parameters: Username— the name of a platform user for creating and accessing the Hive Metastore. Container— The name of the data container that contains the Hive Metastore. Hive Metastore path— The relative path to the Hive Metastore within the configured container. If the path doesn’t exist, it will be created by the platform.
Save Serviceto save your changes.
Apply Changesfrom the top action toolbar of the Servicespage to deploy your changes.
If you later select to disable Hive or change the Hive Metastore path, the previously configured Hive Metastore won’t be deleted automatically. You can delete it like any other directory in the platform’s distributed file system, by running a file-system command (such as
rm -rf) from a command-line interface (a web shell or a Jupyter notebook or terminal).
You cannot change the Hive user of the Presto service for an existing Hive Metastore. To change the user, you need to either also change the metastore path so as not to point to an existing metastore; or first delete the existing metastore — disable Hive for Presto, apply your changes, delete the current Hive Metastore directory (using a file-system command), and then re-enable Hive for Presto and configure the same metastore path with a new user.
Creating External Tables
To use the Hive connector to query data in a platform data container, you first need to use the Hive CLI to run a
v3io path of the format
v3io://<container name>/<relative data path>:
CREATE EXTERNAL TABLE <table name> (<column name> <column type>[, <column name> <column type>, ...]) stored as <file type> LOCATION '<data path>';
For example, the following command creates an external Hive table that links to a “prqt1” Parquet file in a “mycontainer” container with a string
CREATE EXTERNAL TABLE prqt1 (col1 string, col2 bigint) stored as parquet LOCATION 'v3io://mycontainer/prqt1';
You can then reference this table by its name from Presto queries that use the
SELECT * FROM hive.mycontainer.prqt1;
Defining Table Partitions
The Hive connector can also be used to query partitioned tables (see Partitioned Tables in the Presto CLI reference), but it doesn’t automatically identify table partitions. Therefore, you first need to use the Hive CLI to define the table partitions after creating an external table. You can do this by using either of the following methods
MSCK REPAIR TABLEstatement to automatically identify the table partitions and update the table metadata in the Hive Metastore:
MSCK REPAIR TABLE <table name>;
For example, the following command updates the partition metadata for an external “prqt1” table:
MSCK REPAIR TABLE prqt1;
This is the simplest method, but it only identifies partition directories whose names are of the format
<column name>=<column value>.
ALTER TABLE ADD PARTITIONstatement to manually define partitions — where
<partition spec>is of the format
<partition column> = <partition value>[, <partition column> = <partition value>, ...], and
<partition path>is a fully qualified
v3iopath of the format
v3io://<container name>/<relative table-partition path>:
ALTER TABLE <table name> ADD [IF NOT EXISTS] PARTITION (<partition spec>) LOCATION '<partition path>'[, PARTITION <partition spec> LOCATION '<partition path>'];
For example, the following command defines a partition named “year” whose value is “2019”, which maps to a partition directory named
year=2019in a “prqt1” table in a “mycontainer” container:
ALTER TABLE prqt1 ADD PARTITION (year=2019) LOCATION 'v3io://mycontainer/prqt1/year=2019';