Introduction to Using Trino in the Platform
Introduction
You can use the Trino open-source distributed SQL query engine to run interactive SQL queries and perform high-performance low-latency interactive analytics on data that is stored in the platform.
Running Trino over the platform's data services enables you to filter data as close as possible to the source.
The platform's Iguazio Trino connector defines a custom data source that enables you to use Trino to query data in the platform's NoSQL store — including support for table partitioning, predicate pushdown, column pruning, and performing optimized item-specific and range-scan queries.
You can also use Trino's built-in Hive connector to query data of the supported file types, such as Parquet or ORC, that is stored in platform data containers; see Using the Hive Connector.
In addition, it's possible to add an Oracle connector to Trino
The default v3.6.0 platform installation includes the following Trino version 370 components:
- The Trino command-line interface (CLI) for running queries.
The web-based shell service and the terminals of the Jupyter Notebook platform service are automatically connected to the predefined Trino service and include both the native Trino CLI (
trino-cli ) and atrino wrapper to simplify working with the Iguazio Trino connector. For more information, see The Trino CLI. - The Trino server.
The server address is the API URL of the Trino service (
Trino
), which you can copy from the dashboardServices page. - The Trino web UI for monitoring and managing queries.
This interface can be accessed by using the HTTP or HTTPS UI URLs of the Trino service, which are available from the dashboard
Services page.
Table Paths
For information about setting table paths when using the Iguazio Trino connector, see Table Paths in the Trino CLI reference. For information about setting table paths when using the Hive connector, see Using the Hive Connector.
Using the Hive Connector
You can use Trino's built-in Hive connector to query data of the supported file types, such as Parquet or ORC, that is stored in platform data containers, or to save table-query views to the default Hive schema (hive.default
).
Enabling Hive
To use the Trino Hive connector, you first need to create a Hive Metastore by enabling Hive for the platform's Trino service:
-
On the
Services dashboard page, select to edit the Trino service and navigate to theCustom Parameters tab. -
Check the
Enable Hive check box and provide the required configuration parameters:Username — the name of a platform user for creating and accessing the Hive Metastore.Container — The name of the data container that contains the Hive Metastore.Hive Metastore path — The relative path to the Hive Metastore within the configured container. If the path doesn't exist, it will be created by the platform.
Select
Save Service to save your changes. -
Select
Apply Changes from the top action toolbar of theServices page to deploy your changes.
-
If you later select to disable Hive or change the Hive Metastore path, the previously configured Hive Metastore won't be deleted automatically. You can delete it like any other directory in the platform's distributed file system, by running a file-system command (such as
rm -rf ) from a command-line interface (a web shell or a Jupyter notebook or terminal). -
You cannot change the Hive user of the Trino service for an existing Hive Metastore. To change the user, you need to either also change the metastore path so as not to point to an existing metastore; or first delete the existing metastore — disable Hive for Trino, apply your changes, delete the current Hive Metastore directory (using a file-system command), and then re-enable Hive for Trino and configure the same metastore path with a new user.
Creating External Tables
To use the Hive connector to query data in a platform data container, you first need to use the Hive CLI to run a v3io
path of the format v3io://<container name>/<relative data path>
:
CREATE EXTERNAL TABLE <table name> (<column name> <column type>[, <column name> <column type>, ...]) stored as <file type> LOCATION '<data path>';
For example, the following command creates an external Hive table that links to a "prqt1" Parquet file in a "mycontainer" container with a string
CREATE EXTERNAL TABLE prqt1 (col1 string, col2 bigint) stored as parquet LOCATION 'v3io://mycontainer/prqt1';
You can then reference this table by its name from Trino queries that use the hive
catalog.
For example:
SELECT * FROM hive.mycontainer.prqt1;
Defining Table Partitions
The Hive connector can also be used to query partitioned tables (see Partitioned Tables in the Trino CLI reference), but it doesn't automatically identify table partitions. Therefore, you first need to use the Hive CLI to define the table partitions after creating an external table. You can do this by using either of the following methods
-
Use the
MSCK REPAIR TABLE statement to automatically identify the table partitions and update the table metadata in the Hive Metastore:MSCK REPAIR TABLE <table name>;
For example, the following command updates the partition metadata for an external "prqt1" table:
MSCK REPAIR TABLE prqt1;
This is the simplest method, but it only identifies partition directories whose names are of the format
<column name>=<column value> . -
Use the
ALTER TABLE ADD PARTITION statement to manually define partitions — where<partition spec>
is of the format<partition column> = <partition value>[, <partition column> = <partition value>, ...]
, and<partition path>
is a fully qualifiedv3io
path of the formatv3io://<container name>/<relative table-partition path>
:ALTER TABLE <table name> ADD [IF NOT EXISTS] PARTITION (<partition spec>) LOCATION '<partition path>'[, PARTITION <partition spec> LOCATION '<partition path>'];
For example, the following command defines a partition named "year" whose value is "2019", which maps to a partition directory named
year=2019 in a "prqt1" table in a "mycontainer" container:ALTER TABLE prqt1 ADD PARTITION (year=2019) LOCATION 'v3io://mycontainer/prqt1/year=2019';