Introduction to Using Trino in the Platform

You can use the Trino open-source distributed SQL query engine to run interactive SQL queries and perform high-performance, low-latency interactive analytics on data that is stored in the platform. Running Trino over the platform's data services enables you to filter data as close as possible to the source.

The platform's Iguazio Trino connector defines a custom data source that enables you to use Trino to query data in the platform's NoSQL store — including support for table partitioning, predicate pushdown, column pruning, and performing optimized item-specific and range-scan queries. You can also use Trino's built-in Hive connector to query data of the supported file types, such as Parquet or ORC, that is stored in platform data containers; see Using the Hive Connector. In addition, it's possible to add an Oracle connector to Trino [Tech Preview]; for more information, contact Iguazio's support team.

The default v3.6.0 platform installation includes the following Trino version 370 components:

  • The Trino command-line interface (CLI) for running queries. The web-based shell service and the terminals of the Jupyter Notebook platform service are automatically connected to the predefined Trino service and include both the native Trino CLI (trino-cli) and a trino wrapper to simplify working with the Iguazio Trino connector. For more information, see The Trino CLI.
  • The Trino server. The server address is the API URL of the Trino service (Trino), which you can copy from the dashboard Services page.
  • The Trino web UI for monitoring and managing queries. This interface can be accessed by using the HTTP or HTTPS UI URLs of the Trino service, which are available from the dashboard Services page.
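For example, you can run a query against the platform's NoSQL store from the Trino CLI in a web shell or Jupyter terminal. The following statement is an illustrative sketch: the v3io catalog name, the "mycontainer" container, and the "mydir/mytable" table path are assumptions; substitute the catalog and path names from your own environment:

    SELECT * FROM v3io.mycontainer."mydir/mytable" LIMIT 10;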

Table Paths

For information about setting table paths when using the Iguazio Trino connector, see Table Paths in the Trino CLI reference. For information about setting table paths when using the Hive connector, see Using the Hive Connector.

Using the Hive Connector

You can use Trino's built-in Hive connector to query data of the supported file types, such as Parquet or ORC, that is stored in platform data containers, or to save table-query views to the default Hive schema (hive.default).

Running the Hive CLI
You can start the Hive CLI in the platform by running the hive command from a web shell or Jupyter terminal.
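Once the CLI starts, you can run standard HiveQL statements at its prompt. The following statements are illustrative examples for orienting yourself in the metastore:

    SHOW DATABASES;
    USE default;
    SHOW TABLES;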

Enabling Hive

To use the Trino Hive connector, you first need to create a Hive Metastore by enabling Hive for the platform's Trino service:

  1. On the Services dashboard page, select to edit the Trino service and navigate to the Custom Parameters tab.

  2. Check the Enable Hive check box and provide the required configuration parameters:

    • Username — the name of a platform user for creating and accessing the Hive Metastore.
    • Container — the name of the data container that contains the Hive Metastore.
    • Hive Metastore path — the relative path to the Hive Metastore within the configured container. If the path doesn't exist, it will be created by the platform.

    Select Save Service to save your changes.

  3. Select Apply Changes from the top action toolbar of the Services page to deploy your changes.

  • If you later select to disable Hive or change the Hive Metastore path, the previously configured Hive Metastore won't be deleted automatically. You can delete it like any other directory in the platform's distributed file system, by running a file-system command (such as rm -rf) from a command-line interface (a web shell or a Jupyter notebook or terminal).

  • You cannot change the Hive user of the Trino service for an existing Hive Metastore. To change the user, you need to either also change the metastore path so as not to point to an existing metastore; or first delete the existing metastore — disable Hive for Trino, apply your changes, delete the current Hive Metastore directory (using a file-system command), and then re-enable Hive for Trino and configure the same metastore path with a new user.

Creating External Tables

To use the Hive connector to query data in a platform data container, you first need to use the Hive CLI to run a CREATE EXTERNAL TABLE statement that creates an external table. The statement should map the relevant data path to a unique table name, and define the names and data types of the table's columns (attributes); the data path in the statement should be specified as a fully qualified v3io path of the format v3io://<container name>/<relative data path>:

CREATE EXTERNAL TABLE <table name> (<column name> <column type>[, <column name> <column type>, ...]) stored as <file type> LOCATION '<data path>';

For example, the following command creates an external Hive table that links to a "prqt1" Parquet file in a "mycontainer" container with a string col1 column and a big-integer col2 column:

CREATE EXTERNAL TABLE prqt1 (col1 string, col2 bigint) stored as parquet LOCATION 'v3io://mycontainer/prqt1';

You can then reference this table by its name from Trino queries that use the hive catalog. For example:

SELECT * FROM hive.default.prqt1;
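To inspect the table definition that was recorded in the Hive Metastore, you can run a DESCRIBE statement from the Hive CLI; the following example uses the "prqt1" table created above:

    DESCRIBE FORMATTED prqt1;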

Defining Table Partitions

The Hive connector can also be used to query partitioned tables (see Partitioned Tables in the Trino CLI reference), but it doesn't automatically identify table partitions. Therefore, after creating an external table (with a PARTITIONED BY clause that declares the partition columns), you first need to use the Hive CLI to define the table partitions. You can do this by using either of the following methods:

  • Use the MSCK REPAIR TABLE statement to automatically identify the table partitions and update the table metadata in the Hive Metastore:

    MSCK REPAIR TABLE <table name>;

    For example, the following command updates the partition metadata for an external "prqt1" table:

    MSCK REPAIR TABLE prqt1;

    This is the simplest method, but it only identifies partition directories whose names are of the format <column name>=<column value>.

  • Use the ALTER TABLE ADD PARTITION statement to manually define partitions — where <partition spec> is of the format <partition column> = <partition value>[, <partition column> = <partition value>, ...], and <partition path> is a fully qualified v3io path of the format v3io://<container name>/<relative table-partition path>:

    ALTER TABLE <table name> ADD [IF NOT EXISTS] PARTITION (<partition spec>) LOCATION '<partition path>'[, PARTITION (<partition spec>) LOCATION '<partition path>', ...];

    For example, the following command defines a year=2019 partition (partition column year with the value 2019), which maps to a partition directory named year=2019 in a "prqt1" table in a "mycontainer" container:

    ALTER TABLE prqt1 ADD PARTITION (year=2019) LOCATION 'v3io://mycontainer/prqt1/year=2019';
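After the partitions are registered, Trino queries can filter on the partition column, which restricts the scan to the matching partition directories. The following sketch assumes the external "prqt1" table from the previous examples, with year declared as an integer partition column:

    SELECT * FROM hive.default.prqt1 WHERE year = 2019;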
