Data Containers, Collections, and Objects
All data in the platform is stored in user-defined data containers (“containers”). Within containers, you can store and retrieve data objects of any type, such as files, binary large objects (blobs), table items, or stream records. You can also group data objects of any type into collections (such as stream shards, NoSQL tables, or file-system directories) and perform high-level, type-specific data manipulation. In many cases, you can access the same data in different ways by using the various supported APIs. A single container can store different types of data, but the best practice is to have a dedicated container per application.
The standard platform installations have two predefined containers:
- “bigdata”, which is the default container.
- “users”, which is designed to provide individual user development environments and is used by the platform to manage application services. When you create a new web-based shell, Jupyter Notebook, or Zeppelin service, a <username> directory for the service’s running user is automatically created in this container (if it doesn’t already exist) and is set as the home directory ($HOME). This directory contains files required for managing the service, such as configuration files or tutorial notebooks.
  Note: See the restrictions for this container in the Software Specifications and Restrictions.
For detailed information about how to view and edit the contents of the data container, see the Working with Data Containers tutorial.
Container Names and IDs
Every container has a name, which is a user-assigned string that uniquely identifies the container within its tenant and in the dashboard. In addition, the platform assigns every container a unique numeric ID.
Container names are subject to the general file-system naming restrictions and the following additional restrictions. A container name must:
- Contain only the following characters:
  - Lower-case letters (a–z) and numeric digits (0–9)
  - Hyphens (-)
  - Underscores (_)
- Begin and end with a lower-case letter (a–z) or a numeric digit (0–9).
- Contain at least one lower-case letter (a–z).
- Not contain multiple successive hyphens (--) or underscores (__).
- Be between 1 and 128 characters long.
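The container-name rules above can be checked mechanically. The following Python sketch is one possible reading of them and is not part of the platform: the function name `is_valid_container_name` is hypothetical, the general file-system naming restrictions are not checked, and only repeated identical separators (`--` or `__`) are rejected.

```python
import re

# Only lower-case letters, digits, hyphens, and underscores are allowed.
_ALLOWED_CHARS = re.compile(r"[a-z0-9_-]+")


def is_valid_container_name(name: str) -> bool:
    """Check a name against the additional container-name rules (hypothetical helper)."""
    if not 1 <= len(name) <= 128:
        return False
    if not _ALLOWED_CHARS.fullmatch(name):
        return False
    # Must begin and end with a lower-case letter or a digit.
    if name[0] in "-_" or name[-1] in "-_":
        return False
    # Must contain at least one lower-case letter.
    if not re.search(r"[a-z]", name):
        return False
    # No multiple successive hyphens or underscores.
    if "--" in name or "__" in name:
        return False
    return True


print(is_valid_container_name("bigdata"))   # True
print(is_valid_container_name("my--app"))   # False
```

Running such a check before creating a container gives early feedback on a rejected name, but the platform remains the authority on what it accepts.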
All data objects in the platform have attributes. An attribute provides information (metadata) about an object. NoSQL table-item attributes in the platform are the equivalent of columns in standard NoSQL databases. See the NoSQL databases overview, including a terminology comparison. For the supported attribute data types, see the Attribute Data Types Reference.
Attributes are classified into three logical types:
- User attributes: Attributes that the user assigns to a data object by using the platform APIs.
- System attributes: Attributes that the platform automatically assigns to all objects. The system attributes contain the object's name (__name) and miscellaneous information that the system identified for the object, such as the UID (__uid) and GID (__gid) of the object's owner, and the object's last-modification time (__mtime_secs). For a full list of the supported system attributes, see the System-Attributes Reference.
- Hidden attributes: Attributes that the user or the platform optionally assigns to an object and that are used to store internal information that is not meant to be exposed.
The names of system and hidden attributes are prefixed with two underscores (__) to differentiate them from user attributes.
Attribute names are subject to the general file-system naming restrictions and the following additional restrictions. An attribute name must:
- Contain only the following characters:
  - Alphanumeric characters (a–z, A–Z, 0–9)
  - Underscores (_)
- Begin either with a letter (a–z, A–Z) or with an underscore (_).
- Be no longer than 255 characters.
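As an illustration of the attribute-name rules above, the following Python sketch checks a candidate name. The function name `is_valid_attribute_name` is hypothetical, and the general file-system naming restrictions are not covered.

```python
import re

# A letter or underscore first, then letters, digits, or underscores.
_ATTRIBUTE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

MAX_ATTRIBUTE_NAME_LENGTH = 255


def is_valid_attribute_name(name: str) -> bool:
    """Check a name against the additional attribute-name rules (hypothetical helper)."""
    if len(name) > MAX_ATTRIBUTE_NAME_LENGTH:
        return False
    return _ATTRIBUTE_NAME.fullmatch(name) is not None


print(is_valid_attribute_name("first_name"))    # True
print(is_valid_attribute_name("__mtime_secs"))  # True
print(is_valid_attribute_name("1st_name"))      # False
```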
Object Names and Primary Keys
When adding a new data object to the platform, you provide the object’s name or the required components of the name.
The platform stores the object name in the __name system attribute. The name also serves as the object’s primary key.
Sharding and Sorting Keys
Primary keys affect the way that objects are stored in the platform, which in turn affects performance. The platform supports two types of object primary keys:
- Simple primary key: A simple primary key is composed of a single logical key whose value uniquely identifies the object. This key is known as the object’s sharding key. For example, a collection with a simple username primary key might have an object with the primary-key value “johnd”, which is also the value of the object’s sharding key.
- Compound primary key: A compound primary key is composed of two logical keys, a sharding key and a sorting key, whose combined values uniquely identify the object. The value of a compound primary key is of the format <sharding-key value>.<sorting-key value>. All characters before the leftmost period in an object’s primary-key value define the object’s sharding-key value, and the characters to the right of this period (if any) define its sorting-key value. For example, a collection with a compound primary key that is made up of a username sharding key and a date sorting key might have an object with the sharding-key value “johnd”, the sorting-key value “20180602”, and the combined unique compound primary-key value “johnd.20180602”.
The platform divides the physical data storage into multiple units called data slices (also known as data shards or buckets). When a new data object is added, a hash function is applied to the value of its sharding key, and the result determines the slice on which the object is stored. All objects with the same sharding-key value are stored in a cluster on the same slice, sorted in ascending lexicographic order according to their sorting-key values (if any). This design enables faster NoSQL table queries that include a sharding-key filter and, optionally, a sorting-key filter (see NoSQL Read Optimization).
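The placement logic described above can be sketched in a few lines of Python. This is an illustration only: the platform's actual hash function and slice count are not stated here, so `zlib.crc32`, `NUM_SLICES`, and the function names are stand-in assumptions. The sketch splits each primary-key value at the leftmost period, maps the sharding key to a slice, and keeps each sharding key's objects sorted by sorting key.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

NUM_SLICES = 8  # hypothetical slice count, chosen only for this illustration


def split_primary_key(value: str) -> Tuple[str, Optional[str]]:
    """Split at the leftmost period into (sharding key, sorting key or None)."""
    sharding, sep, sorting = value.partition(".")
    return sharding, (sorting if sep else None)


def slice_for(sharding_key: str, num_slices: int = NUM_SLICES) -> int:
    """Map a sharding-key value to a slice index (CRC-32 is a stand-in hash)."""
    return zlib.crc32(sharding_key.encode("utf-8")) % num_slices


def place_objects(primary_keys: Iterable[str]) -> Dict[int, Dict[str, List[str]]]:
    """Group primary keys by slice, then by sharding key, sorted by sorting key."""
    slices: Dict[int, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for key in primary_keys:
        sharding, sorting = split_primary_key(key)
        slices[slice_for(sharding)][sharding].append(sorting or "")
    for clusters in slices.values():
        for sorting_keys in clusters.values():
            sorting_keys.sort()  # ascending lexicographic order
    return slices


print(split_primary_key("johnd.20180602"))  # ('johnd', '20180602')
print(split_primary_key("johnd"))           # ('johnd', None)

layout = place_objects(["johnd.20180602", "johnd.20180101", "maryp.20180301"])
print(layout[slice_for("johnd")]["johnd"])  # ['20180101', '20180602']
```

Because every object with the sharding key “johnd” lands on the same slice in sorted order, a query that filters on the sharding key (and optionally on a sorting-key range) needs to look at only one slice in this model.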
For best-practice guidelines for defining primary keys, optimizing data and workload distribution, and improving performance, see Best Practices for Defining Primary Keys and Distributing Data Workloads.
To work with a NoSQL table using Spark DataFrames or Presto, the table items must have a sharding-key user attribute, and in the case of a compound primary-key also a sorting-key user attribute; for more efficient range scans, use a sorting-key attribute of type string (see Best Practices for Defining Primary Keys and Distributing Data Workloads for more information). To work with a NoSQL table using V3IO Frames, the table items must have a primary-key user attribute. The values of such key user attributes must match the value of the item’s primary key (name) and shouldn’t be modified after the initial item ingestion. (The NoSQL Web API doesn’t require such attributes and doesn’t attach any special meaning to them if they exist.) To change an item’s primary key, delete the existing item and create a new item with the desired combination of sharding and sorting keys and matching user key attributes, if required.
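The requirement that key user attributes match the item's primary key (name) can be verified before ingestion. The following Python sketch assumes an item is represented as a plain dictionary of user attributes. The function `key_attributes_match` and the attribute names `username` and `date` are hypothetical and are not part of any platform API.

```python
from typing import Any, Mapping, Optional


def key_attributes_match(
    item_name: str,
    attributes: Mapping[str, Any],
    sharding_attr: str,
    sorting_attr: Optional[str] = None,
) -> bool:
    """Return True if the key user attributes reproduce the item's name.

    For a simple primary key, the sharding-key attribute must equal the name.
    For a compound primary key, the name must equal
    "<sharding-key value>.<sorting-key value>".
    """
    if sharding_attr not in attributes:
        return False
    expected = str(attributes[sharding_attr])
    if sorting_attr is not None:
        if sorting_attr not in attributes:
            return False
        expected += "." + str(attributes[sorting_attr])
    return expected == item_name


item = {"username": "johnd", "date": "20180602", "city": "London"}
print(key_attributes_match("johnd.20180602", item, "username", "date"))  # True
print(key_attributes_match("johnd.20180603", item, "username", "date"))  # False
```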
In the current release, a NoSQL table with a compound primary key that’s created with a Spark DataFrame cannot be accessed with Frames, and a compound primary-key table that’s created with Frames cannot be accessed with Spark DataFrames or Presto.
The names of all data objects in the platform (such as items and files) are subject to the general file-system naming restrictions, including a maximum length of 255 characters. In addition:
- A period in an object name indicates a compound name of the format <sharding key>.<sorting key>. See Sharding and Sorting Keys.