Data Containers, Collections, and Objects

Overview

All data in the platform is stored in user-defined data containers (**“containers”**). Within containers, you can store and retrieve data **objects** of any type, such as files, binary large objects (blobs), table items, or stream records. You can also group data objects of any type into **collections** (such as stream shards, NoSQL tables, or file-system directories) and perform high-level, type-specific data manipulation. In many cases, you can access the same data in different ways by using the various supported APIs. A single container can store different types of data. The best practice is to have a dedicated container per application.

Predefined Containers

The standard platform installations have two predefined containers:

  • “bigdata”, which is the default container.

  • “users”, which is designed to provide individual user development environments and is used by the platform to manage application services.
    When creating a new web-based shell, Jupyter Notebook, or Zeppelin service, a <username> directory for the service’s running user is automatically created in this container (if it doesn’t already exist) and is set as the home directory ($HOME). This directory contains files required for managing the service, such as configuration files or tutorial notebooks.

    Note
    See the restrictions for this container in the Software Specifications and Restrictions.

For more information about the data containers and how to reference the contained data, see the Platform Fundamentals tutorial. For detailed information on how to view and edit the contents of the data container, see the Working with Data Containers and Ingesting and Preparing Data tutorials.

Container Names and IDs

Every container has a name, which is a user-assigned string that uniquely identifies the container within its tenant and in the dashboard. In addition, the platform assigns every container a unique numeric ID.

Container-ID Deprecation
Whenever possible, identify a container by its name (or “alias”) and not by its ID. For platform APIs that support both identification methods, the container-ID option is deprecated and will eventually be removed.

Container-Name Restrictions

Container names are subject to the general file-system naming restrictions and the following additional restrictions:

  • Contain only the following characters:

    • Lowercase letters (a–z) and numeric digits (0–9)
    • Hyphens (-)
    • Underscores (_)
  • Begin and end with a lowercase letter (a–z) or a numeric digit (0–9)

  • Contain at least one lowercase letter (a–z)

  • Not contain multiple successive hyphens (-) or underscores (_)

  • Length of 1–128 characters

Note
Container names cannot contain spaces.
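
The container-name rules above can be sketched as a small validation function. This is an illustrative sketch only, not a platform API: the function name `is_valid_container_name` is hypothetical, the general file-system naming restrictions are not checked, and the rule about successive characters is interpreted as forbidding repeated hyphens (`--`) or repeated underscores (`__`), which is an assumption.

```python
import re

# Allowed characters: lowercase letters, digits, hyphens, and underscores.
_ALLOWED_CHARS = re.compile(r"[a-z0-9_-]+")


def is_valid_container_name(name: str) -> bool:
    """Check a name against the container-name restrictions listed above.

    Hypothetical helper for illustration; the general file-system naming
    restrictions are NOT checked here.
    """
    # Length of 1-128 characters.
    if not 1 <= len(name) <= 128:
        return False
    # Only lowercase letters, digits, hyphens, and underscores.
    if not _ALLOWED_CHARS.fullmatch(name):
        return False
    # Must begin and end with a lowercase letter or a digit.
    if name[0] in "-_" or name[-1] in "-_":
        return False
    # Must contain at least one lowercase letter.
    if not re.search(r"[a-z]", name):
        return False
    # Assumption: "multiple successive" means a repeated hyphen or a
    # repeated underscore.
    if "--" in name or "__" in name:
        return False
    return True
```

For example, under these assumptions `is_valid_container_name("my-app_1")` passes, while `"123"` (no letter), `"-abc"` (leading hyphen), and `"My App"` (uppercase letters and a space) are rejected.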

Object Attributes

All data objects in the platform have attributes. An attribute provides information (metadata) about an object. NoSQL table-item attributes in the platform are the equivalent of columns in standard NoSQL databases. See the NoSQL databases overview, including a terminology comparison. For the supported attribute data types, see the Attribute Data Types Reference.

Attribute Types

Attributes are classified into three logical types:

User attributes
Attributes that the user assigns to a data object using the platform APIs.
System attributes
Attributes that the platform automatically assigns to all objects. The system attributes include the object's name (__name) and miscellaneous information that the system identified for the object, such as the UID (__uid) and GID (__gid) of the object's owner and the object's last-modification time (__mtime_secs). For a full list of the supported system attributes, see the System-Attributes Reference.
Hidden attributes
Attributes that the user or the platform optionally assigns to an object and that are used to store internal information that is not meant to be exposed.
Note
The platform’s naming convention is to prefix the names of system and hidden attributes with two underscores (__) to differentiate them from user attributes.

Attribute Names

Attribute names are subject to the general file-system naming restrictions and the following additional restrictions:

  • Contain only the following characters:

    • Alphanumeric characters (a–z, A–Z, 0–9)
    • Underscores (_)
  • Begin either with a letter (a–z, A–Z) or with an underscore (_)

  • Not identical to a reserved name — see Reserved Names

  • Length of 1–255 characters

Note
Spaces in attribute names are currently not supported.
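
As a rough illustration of the attribute-name rules above, the following sketch validates a candidate name. The function name and the `reserved_names` parameter are hypothetical; the actual reserved names are listed in the Reserved Names reference and are not reproduced here. The reserved-name check is assumed to be an exact, case-sensitive match, and the general file-system naming restrictions are not checked.

```python
import re

# A letter or underscore, followed by alphanumeric characters or underscores.
_ATTR_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_attribute_name(name: str, reserved_names=frozenset()) -> bool:
    """Check a name against the attribute-name restrictions listed above.

    `reserved_names` is a caller-supplied set (hypothetical parameter);
    this sketch does not know the platform's actual reserved names.
    """
    # Length of 1-255 characters.
    if not 1 <= len(name) <= 255:
        return False
    # Allowed characters and allowed first character.
    if not _ATTR_NAME.fullmatch(name):
        return False
    # Assumption: reserved names are compared as exact strings.
    return name not in reserved_names
```

Note that names such as `__name` satisfy these syntactic rules, which is consistent with the double-underscore prefix used for system and hidden attributes.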

Object Names and Primary Keys

When adding a new data object to the platform, you provide the object’s name or the required components of the name. The platform stores the object name in the __name system attribute, which is automatically created for each object, and uses it as the value of the object’s primary key, which uniquely identifies the object within a collection (such as a NoSQL table).

Sharding and Sorting Keys

Primary keys affect the way that objects are stored in the platform, which in turn affects performance. The platform supports two types of object primary keys:

Simple primary key
A simple primary key is composed of a single logical key whose value uniquely identifies the object. This key is known as the object’s sharding key. For example, a collection with a simple username primary key might have an object with the primary-key value “johnd”, which is also the value of the object’s sharding key.
Compound primary key
A compound primary key is composed of two logical keys, a sharding key and a sorting key, whose combined values uniquely identify the object. The value of a compound primary key is of the format <sharding-key value>.<sorting-key value>. All characters before the leftmost period in an object’s primary-key value define the object’s sharding-key value, and the characters to the right of this period (if any) define its sorting-key value. For example, a collection with a compound primary key that is made up of a username sharding key and a date sorting key might have an object with the sharding-key value “johnd”, the sorting-key value “20180602”, and the combined unique compound primary-key value “johnd.20180602”.
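
The leftmost-period rule can be illustrated with a short sketch that splits a primary-key value (object name) into its sharding-key and sorting-key parts. The function name `split_primary_key` is hypothetical and is not part of any platform API.

```python
from typing import Optional, Tuple


def split_primary_key(name: str) -> Tuple[str, Optional[str]]:
    """Split a primary-key value at its leftmost period.

    Returns (sharding_key, sorting_key); sorting_key is None for a
    simple primary key, which has no period.
    """
    # str.partition splits at the first (leftmost) occurrence of ".".
    sharding_key, separator, sorting_key = name.partition(".")
    return sharding_key, (sorting_key if separator else None)


print(split_primary_key("johnd.20180602"))  # ('johnd', '20180602')
print(split_primary_key("johnd"))           # ('johnd', None)
```

Any further periods remain part of the sorting-key value, so a name such as `a.b.c` would split into the sharding key `a` and the sorting key `b.c`.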

The platform divides the physical data storage into multiple units called data slices (also known as data shards or buckets). When a new data object is added, a hash function is applied to the value of its sharding key, and the result determines on which slice the object is stored. All objects with the same sharding-key value are stored in a cluster on the same slice, sorted in ascending lexicographic order according to their sorting-key values (if any). This design enables faster NoSQL table queries that include a sharding-key filter and optionally also a sorting-key filter (see NoSQL Read Optimization).
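
The placement logic described above can be modeled conceptually as follows. This is a simplified sketch and not the platform's implementation: the platform's actual hash function and number of slices are not documented here, so SHA-256 and `NUM_SLICES = 4` are stand-in assumptions, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_SLICES = 4  # hypothetical slice count, for illustration only


def slice_for(sharding_key: str, num_slices: int = NUM_SLICES) -> int:
    """Map a sharding-key value to a slice index.

    SHA-256 is a stand-in; the platform's real hash function is unknown here.
    """
    digest = hashlib.sha256(sharding_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def place_objects(names, num_slices: int = NUM_SLICES):
    """Group object names by slice and sharding key, sorting each group.

    Returns {slice_index: {sharding_key: [sorting-key values, ascending]}}.
    """
    slices = defaultdict(lambda: defaultdict(list))
    for name in names:
        # The leftmost period separates the sharding and sorting keys.
        sharding_key, _, sorting_key = name.partition(".")
        slices[slice_for(sharding_key, num_slices)][sharding_key].append(sorting_key)
    # Within each sharding-key cluster, order by sorting key (lexicographic).
    for clusters in slices.values():
        for sorting_values in clusters.values():
            sorting_values.sort()
    return slices


layout = place_objects(["johnd.20180602", "johnd.20180101", "maryp.20180315"])
print(layout[slice_for("johnd")]["johnd"])  # ['20180101', '20180602']
```

Because every object with the sharding-key value `johnd` hashes to the same slice and is kept in sorted order, a query that filters on that sharding key (and optionally on a sorting-key range) only needs to look at one contiguous cluster in this model.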

For best-practice guidelines for defining primary keys, optimizing data and workload distribution, and improving performance, see Best Practices for Defining Primary Keys and Distributing Data Workloads.

Note
  • The value of a sharding key cannot contain periods, because the leftmost period in an object’s primary-key value (name) is assumed to be a separator between sharding and sorting keys.

  • To work with a NoSQL table using Spark DataFrames or Presto, the table items must have a sharding-key user attribute, and in the case of a compound primary key also a sorting-key user attribute; for more efficient range scans, use a sorting-key attribute of type string (see Best Practices for Defining Primary Keys and Distributing Data Workloads for more information). To work with a NoSQL table using V3IO Frames, the table items must have a primary-key user attribute. The values of such key user attributes must match the value of the item’s primary key (name) and shouldn’t be modified after the initial item ingestion. (The NoSQL Web API doesn’t require such attributes and doesn’t attach any special meaning to them if they exist.) To change an item’s primary key, delete the existing item and create a new item with the desired combination of sharding and sorting keys and matching user key attributes, if required.
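
The two notes above can be combined into a sketch that builds an item name together with matching key user attributes, as needed for the Spark DataFrames and Presto case. The `build_item` helper and the plain-dictionary item representation are hypothetical and do not correspond to any platform API; the attribute names `username` and `date` follow the earlier example.

```python
from typing import Any, Dict, Optional, Tuple


def build_item(
    sharding_attr: str,
    sharding_value: str,
    sorting_attr: Optional[str] = None,
    sorting_value: Optional[str] = None,
    other_attrs: Optional[Dict[str, Any]] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return (item_name, user_attributes) with key attributes matching the name.

    Hypothetical helper: items are modeled as plain dictionaries.
    """
    # A sharding-key value cannot contain periods, because the leftmost
    # period in the name separates the sharding and sorting keys.
    if "." in sharding_value:
        raise ValueError("sharding-key value cannot contain periods")
    attrs = dict(other_attrs or {})
    attrs[sharding_attr] = sharding_value
    name = sharding_value
    if sorting_attr is not None:
        # Compound primary key: <sharding-key value>.<sorting-key value>
        attrs[sorting_attr] = sorting_value
        name = f"{sharding_value}.{sorting_value}"
    return name, attrs


name, attrs = build_item("username", "johnd", "date", "20180602")
print(name)   # johnd.20180602
print(attrs)  # {'username': 'johnd', 'date': '20180602'}
```

Deriving the name and the key attributes from the same inputs keeps them consistent at ingestion time, which is the invariant the note asks you to preserve.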

Object-Name Restrictions

The names of all data objects in the platform (such as items and files) are subject to the general file-system naming restrictions, including a maximum length of 255 characters. In addition —

  • A period in an object name indicates a compound name of the format <sharding key>.<sorting key>. See Sharding and Sorting Keys.
