The Trino Service (formerly Presto)

Trino is an open-source distributed SQL query engine for running interactive analytic queries. The platform has a pre-deployed tenant-wide Trino service that can be used to run SQL queries and perform high-performance low-latency interactive data analytics. You can ingest data into the platform using your preferred method — such as using Spark, the NoSQL Web API, a Nuclio function, or V3IO Frames — and use Trino to analyze the data interactively with the aid of your preferred visualization tool. Running Trino over the platform's data services allows you to filter data as close as possible to the source.

You can run SQL commands that use ANSI SQL SELECT statements, which are executed by Trino, from Jupyter Notebook, a serverless Nuclio function, or a local or remote Trino client. The platform comes pre-deployed with the native Trino CLI client (trino-cli); a convenience wrapper around this CLI that preconfigures some options for local execution (trino); and the Trino web UI, which you can log into from the dashboard's Services page. You can also integrate the platform's Trino service with a remote Trino client — such as Tableau or QlikView — to remotely query and analyze data in the platform over a Java Database Connectivity (JDBC) connector.
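For example, a basic ANSI SQL SELECT statement of the kind you might run from any of these clients could look as follows (the catalog, container, table, and column names are illustrative placeholders, assuming the platform's NoSQL data is exposed under a v3io catalog):

```sql
-- Hypothetical names: "mycontainer", "mytable", and the columns are placeholders
SELECT customer_id, order_total
FROM v3io.mycontainer."mytable"
WHERE order_total > 100
LIMIT 10;
```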

The Iguazio Trino connector enables you to use Trino to run queries on data in the platform's NoSQL store — including support for partitioning, predicate pushdown, and column pruning, which enables users to optimize their queries.
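As a sketch of how these optimizations apply (the table path and column names are hypothetical), a query such as the following benefits from column pruning (only the selected columns are read) and from predicate pushdown (the WHERE filter is evaluated at the data source rather than in Trino):

```sql
-- Hypothetical table and columns; the catalog name assumes the Iguazio connector
SELECT driver_id, speed          -- column pruning: only these columns are read
FROM v3io.bigdata."taxi/drivers"
WHERE driver_id = '42';          -- predicate pushdown: filtered at the source
```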

You can also use Trino's built-in Hive connector to query data of the supported file types, such as Parquet or ORC, or to save table-query views to the default Hive schema. Note that to use the Hive connector, you first need to create a Hive Metastore by enabling Hive for the platform's Trino service. For more information, see Using the Hive Connector in the Trino overview.
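For illustration, after Hive is enabled you could define and query an external table over Parquet files in roughly the following way (the location, table, and column names are assumptions for this sketch, not platform defaults):

```sql
-- Hypothetical external table over Parquet files in a platform data container
CREATE TABLE hive.default.trips (
    trip_id BIGINT,
    fare    DOUBLE
)
WITH (
    external_location = 'v3io://bigdata/trips/',
    format = 'PARQUET'
);

SELECT avg(fare) AS avg_fare
FROM hive.default.trips;
```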

The platform also has a built-in process that uses Trino SQL to create a Hive view that monitors both real-time data in the platform's NoSQL store and historical data in Parquet or ORC tables [Tech Preview].

For more information about using Trino in the platform, see the Trino Reference. See also the Trino and Hive restrictions in the Software Specifications and Restrictions documentation.

Note
If you delete a Trino service, its Hive and MariaDB services are automatically disabled.

Configuring the Service

Pod Priority

Pods (services, or jobs created by those services) can have priorities, which indicate the relative importance of one pod compared to the other pods on the node. The priority is used for scheduling: a lower-priority pod can be evicted to allow scheduling of a higher-priority pod. Pod priority is relevant for all pods created by the service.
Eviction uses the pods' quality of service (QoS) in conjunction with their priority to determine which pods to evict. For more details, see Interactions between Pod priority and quality of service.

Pod priority is specified through priority classes, which map to priority values: High, Medium, or Low. The default is Medium.

Configure the default priority for a service, which is applied to the service itself and to all subsequently created user jobs, in the service's Common Parameters tab, User jobs defaults section, Priority class drop-down list.
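Under the hood, a selected priority class corresponds to a Kubernetes priorityClassName on the pods. A minimal sketch of the resulting pod spec fragment (the class name shown is hypothetical; the platform maps its High/Medium/Low choices to its own priority classes):

```yaml
# Sketch only: "medium-priority" is a hypothetical class name, not a
# platform-defined one; Kubernetes resolves it to a numeric priority value
spec:
  priorityClassName: medium-priority
```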

Host Path Volumes

You can create host path volumes for use with Spill to Disk.

  1. In the Custom Parameters tab, under Create host path volumes, enter:
    • Host Path: an existing path on the app node with rwx permissions.
    • Container Path: the path of the designated volume in the Trino worker pod (where the volume is mounted in the container). If used for Spill to Disk, this must be the parent folder of the spiller-spill-path.
  2. Repeat for additional volumes.
  3. Save the service.
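For example, a volume intended for Spill to Disk might pair values like these (both paths are illustrative placeholders):

```text
Host Path:      /mnt/trino-spill    (existing directory on the app node, rwx)
Container Path: /spill              (mount point inside the Trino worker pod)
```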

Spill to Disk

The platform supports the Trino Spill to Disk feature: during memory-intensive operations, Trino can offload intermediate operation results to disk. The goal is to enable execution of queries whose memory requirements exceed the per-query or per-node limits.

To configure Spill to Disk:

  1. Create a host path volume, if it doesn't already exist. The iguazio user must have rw access to the Host Path.
  2. In the Trino service configuration, Custom Parameters tab, click Workers and add these two parameters to the config.properties:
    • spiller-spill-path= the path, or paths, to the designated disks in the Trino worker pod. When using multiple spill paths, write a comma-separated list of paths.
    • spill-enabled=true
Note
  • Trino creates the leaf folder that's written in the spiller-spill-path property of the config.properties file.
  • The container path of the host path volume must be the parent of the spiller-spill-path.
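Putting these steps together, the worker config.properties additions could look like this (the paths are illustrative: /spill is assumed to be the container path of a host path volume, and Trino itself creates the trino leaf folder):

```properties
# Illustrative paths: /spill must be the container path of a host path volume;
# Trino creates the "trino" leaf folder automatically
spill-enabled=true
spiller-spill-path=/spill/trino
```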

For more details on the Trino configuration, see Spilling properties.

Node Selection

You can assign jobs and functions to a specific node or a node group, to manage your resources and to differentiate between processes and their respective nodes. A typical example is a workflow that you want to run only on dedicated servers.

When specified, the service or the pods of a function can only run on nodes whose labels match the node selector entries configured for the service. You can also specify labels that were assigned to app nodes by an iguazio IT Admin user. See Setting Labels on App Nodes.

Configure the key-value node selector pairs in the Custom Parameters tab of the service.

If node selection for the service is not specified, the selection criteria default to the Kubernetes default behavior, and jobs can run on any available node.
Node selection is relevant for all cloud services.

See more about Kubernetes nodeSelector.

The node selection also affects any additional services that are directly related to Trino, for example Hive and MariaDB, which are created when Enable Hive is selected in the Trino service configuration.
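The key-value pairs you configure correspond to a Kubernetes nodeSelector on the service's pods. A minimal sketch (the label key and value are hypothetical and must match labels actually set on your app nodes):

```yaml
# Hypothetical label; must match a label assigned to the target app nodes
nodeSelector:
  node-group: trino-dedicated
```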

See Also