Best Practices for Defining Primary Keys and Distributing Data Workloads

On This Page

Overview

When creating a new collection — namely, a NoSQL table or a directory in the file system — you should select a primary key for uniquely identifying the objects in the collection; (for streaming, this is handled implicitly by the platform). An object's primary key serves also as the object's name (i.e., the name of the object file) and is stored by the platform in the __name system attribute. The primary-key value of a data object (such as a table item or a file) is composed of a sharding key — which determines the data shard (slice) on which the object is physically stored — and optionally also a sorting key — which determines how objects with the same sharding key are sorted on the assigned slice. For more information about object keys and names, see Object Names and Primary Keys.

Your aim should be to select a primary key that results in an even distribution of the data and the data-access workload across the available data slices. This should be done by taking into account the collection's expected data set and ingestion and consumption patterns. Following are best-practice guidelines to help you in this mission.

Using a Compound Primary Key for Faster NoSQL Table Queries

A major motivation for using a compound <sharding key>.<sorting key> primary key for a NoSQL-table collection is to optimize query performance. A compound key enables the platform to support optimized range-scan and item-specific queries that include sharding- and sorting-key filters — see NoSQL read optimization. Such queries can be significantly faster than the standard full-table queries, because the platform searches the names of the clustered and sorted object files on the relevant data slice instead of scanning all table items.

Range scans are most beneficial if the majority of the table queries are expected to include a common filter criteria that consists of the same one or two attributes. Therefore, when selecting your keys, consider which queries are likely to be issued most often. For example, if you have a connected-cars application and most of your queries will include car-ID and trip-ID filter criteria — such as "car_id = <value>", "car_id = <value> AND trip_id = <value>", or "car_id = <value> AND trip_id > <value> AND trip_id <= <value>" — it would make sense to use the car-ID attribute as your table's sharding key and the trip-ID attribute as the sorting key, resulting in a <car_id>.<trip_id> primary key.

Using a String Sorting Key for Faster Range-Scan Queries

To support Trino and Spark DataFrame range-scan queries, the table must have sharding- and sorting-key user attributes. For faster range scans, use sorting-key attributes of type string: for a string sorting key, after identifying the relevant data slice by using the query's sharding-key value, the platform scans only the items within the query's sorting-keys range; but for a non-string sorting-key, the platform ignores the sorting key and scans all items on the data slice. The reason for ignoring non-string sorting keys is that the lexicographic sorting-key sort order that's used for storing the items on the data slice might not match the logical sort order for a non-string sorting key.

Selecting a Sharding Key for Even Workload Distribution

As explained, a collection's sharding key affects the distribution of the data among the available storage slices, as all objects with the same sharding-key value are assigned to the same slice. To balance the system's workload and optimize performance, try to select a sharding key for which there are enough distinct values in your data set to distribute evenly across the available slices (1024 slices per container). In addition, consider the expected data-access patterns. Note that even if your data set has many potential distinct sharding-key values, if most of the ingested objects or most of the table queries are for the same few sharding-key values, the performance is likely to suffer. As a general rule, increasing the ratio of accessed sharding-key values to the total number of distinct sharding-key values in the collection improves the efficiency of the data processing.

For example, a username for a data set with many users is likely to be a good sharding key, unless most of the information applies only to a few users. But an area ZIP code for a data set with only a few ZIP codes relative to the number of objects would not make for a good sharding key.

Using a compound <sharding key>.<sorting key> primary key can also help improve your collection's data distribution. For example, you can use a date sharding key and a device-ID sorting key (with values such as 20180523.1273) to store all data generated on the same day on the same data slice, sorted by the ID of the device that generated the data.

If you can't identify a sufficiently good sharding key for your collection, consider recalculating the sharding-key values as part of the objects' ingestion — see Recalculating Sharding-Key Values for Even Workload Distribution.

Recalculating Sharding-Key Values for Even Workload Distribution

To better balance the system's workload and improve performance when working with a non-uniform data set, consider recalculating the sharding-key values during the objects' ingestion so as to split a single object sharding-key value into multiple values. The objective is to increase the number of distinct sharding-key values in the collection and thus distribute the data objects more evenly across the available data slices.

Note

This method relevant only for collections with a compound <sharding key>.<sorting key> primary key, which allows for multiple objects in the collection to have the same sharding-key value.
This method should typically be used with "flat" tables or with tables that have only one or two partitions. Data in multiple-partition tables should already be evenly distributed across the available data slices. For more information on partitioning of NoSQL tables, see Partitioned Tables.

One method for splitting a sharding-key value is to add a random number to the end of the original sharding-key value (separated from the original value by a predefined character, such as an underscore), thus randomizing the data ingestion across multiple slices instead of a single slice. For example, objects with the username sharding-value johnd, such as johnd.20180602 and johnd.20181015, might be ingested as johnd_1.20181002 and johnd_2.20181015.

However, using a random sharding-key suffix makes it difficult to retrieve a specific object by its original primary-key value, because you don't know which random suffix was used when ingesting the specific object. This can be solved by calculating the suffix that is added to the original sharding-key value based on the value of an attribute that you might use for retrieving (reading) objects from the collection. For example, you can apply a hash to the value of the sorting-key attribute, as done by the even-distribution write option of the platform's NoSQL Spark DataFrame (see details in the following subsection). The new object primary-key values look similar to the values for the randomized-suffix method — for example, johnd_1.20180602 and johnd_2.20181015. But because the numeric suffixes in this case were created by using a known formula that is based on an object attribute, you can always apply the same calculation to retrieve a specific object by its original sharding- and sorting-key values.

Note that regardless of the sharding-key recalculation method, retrieving all objects for a given sharding-key value from the original data set requires submitting multiple read requests (queries) — one for each of the new primary-key object values. However, the platform's NoSQL Spark DataFrame and Trino interfaces handle this implicitly, as outlined in the following subsection.

Using a NoSQL Spark DataFrame for Even Workload Distribution

The platform's NoSQL Spark DataFrame has a custom range-scan-even-distribution write option to simplify even distribution of ingested items by allowing the platform to recalculate the sharding- and primary-key values of the ingested items based on the value of the items' sorting-key.

In addition, the NoSQL Spark DataFrame and the Iguazio Trino connector allow you to query such tables by using the original sharding-key value (which remains stored in the item's sharding-key attribute), instead of submitting separate queries for each of the new sharding-key values that are used in the item's primary-key values.

For more information, see the NoSQL DataFrame Even Workload Distribution reference and the following behind-the-scenes implementation details.

Behind the Scenes

The platform calculates the new item sharding-key value for the NoSQL Spark DataFrame even-distribution write option in the following manner:

Apply the xxHash hash function to the item's sorting-key value.
Perform a modulo operation on the sorting-key hash result, using the value that is configured in the platform's v3io.kv.range-scan.hashing-bucket-num configuration property (default = 64) as the modulus.
Append to the original sharding-key value an underscore ('_') followed by the result of the modulo operation on the sorting-key hash.

The result is a new primary-key value of the format <original sharding-key value>_{xxHash(<sorting-key value>) % <preconfigured modulus>}.<sorting-key value>. For example, johnd_1.20180602 for an original johnd sharding-key value. With this design, items with the same original sharing-key value are stored on different data slices, but all items with the same sorting-key value (which remains unchanged) are stored on the same slice.

When you query a table with a NoSQL Spark DataFrame or with the Trino CLI, you use the original sharding-key value. Behind the scenes, the platform can identify whether the original sharding-key value was recalculated using the method of the Spark DataFrame even-distribution option (provided the table has a schema that was inferred with a NoSQL DataFrame or the Iguazio Trino connector. In such cases, the platform searches all relevant slices to locate and return information for all items with the same original sharding-key value.

Using the NoSQL Web API for Even Workload Distribution

The NoSQL Web API doesn't have specific support for even workload distribution, but you can select to implement this manually: when ingesting items with the PutItem or UpdateItem operation, recalculate the sharding-key value (using the provided guidelines) and set the item's primary-key value to <recalculated sharing-key value>.<sorting-key value>.

Note that when submitting a NoSQL Web API GetItem request, you need to provide the full item primary-key value (with the recalculated sharding-key value), and when submitting a GetItems range-scan request, you need to provide the full recalculated sharding-key value. As explained in this guide, to retrieve all items with the same original sharding-key value, you'll need to repeat the requests for each of the newly calculated sharding-key values.

NoSQL Spark DataFrame and Trino Notes

When ingesting an item with the web API, if you wish to also support table reads (queries) using Spark DataFrames or Trino —
- You must define user attributes for the original sharding-key and sorting-key values, regardless of whether you recalculate the sharding-key value. Take care not to modify the values of such attributes after the ingestion.
- If you select to do recalculate the sharding-key value, use the method that is used by the NoSQL Spark DataFrame's even-distribution option (as outlined in the previous subsection) — i.e., use primary-key values of the format <original sharding-key value>_{xxHash(<sorting-key value>) % <modulus set in the v3io.kv.range-scan.hashing-bucket-num property>}.<sorting-key value> — to allow simplified Spark and Trino queries that use the original sharding-key value.
To use the web API to retrieve items that were ingested by using the NoSQL Spark DataFrame's even-distribution write option (or an equivalent manual web-API ingestion), you need to repeat the get-item(s) request with <original sharding key>_1 to <original sharding key>_<n> sharding-key values, where <n> is the value of the v3io.kv.range-scan.hashing-bucket-num configuration property.

For example, for an original sharding-key value of johnd, a sorting-key value of 20180602, and the default configuration-property value of 64 — call GetItem with primary-key values from johnd_1.20180602 to johnd_64.20180602, and call GetItems with sharding-key values from johnd_1 to johnd_64.

Distributing the Data Ingestion Efficiently

It's recommended that you don't write (ingest) multiple objects with the same sharding-key value at once, as this might create an ingestion backlog on the data slice to which the data objects are assigned while other slices remain idle.

An alternative ingestion flow, which better balances the ingestion load among the available data slices, is to order the ingestion based on the values of the objects' sorting key. Objects with the same sorting-key value don't have the same sharding-key value and therefore won't be assigned to the same slice. For example, if you have a collection with a device-ID sharding key and a date sorting key, you can ingest all objects for a given date (sorting-key value) and then proceed to the next date in the data set.