Data has become a primary business asset in the new digital age, and we use three Vs to characterize big data – Variety, Volume and Velocity. Data is constantly generated by different sources and apps (Variety) at high speeds. It piles up to petabytes (Volume). We want to analyze it in order to extract value from it, which means we need fast access to the data (Velocity). We then share the data while making sure that only the right people or applications have access to the specific data we’re sharing, which requires granular security policies that do not compromise performance. Data is most useful when we’re able to build apps rapidly, without imposing long IT processes or complex integrations. All this means that a data platform must operate as an automated, self-service, cloud-native platform that can be deployed on-prem or in a hybrid fashion.
In order to meet those requirements, customers currently deploy several different data platforms and data services, building a complicated environment that is hard to manage and maintain.
To address this challenge, the Iguazio team had to think out of the box, combining the best technologies from multiple disciplines and re-architecting the stack to build a system that changes the paradigm of how data is stored and analyzed. Iguazio provides superior performance and fine-grained security, operating as a self-service portal and at the industry’s lowest cost per GB.
Redefining the Data Stack, Enabling Magnitudes-Faster Applications at Lower Costs
Access to low-level elements like memory, flash and networks can be extremely fast, yet we only see a fraction of that speed once we layer OS abstractions, middleware and apps on top. This forces us to use a lot more hardware resources and to settle for high and unpredictable latencies.
The main problem stems from mistaken layering and heavy serialization across the stack. Every operation is translated into multiple blocking lower-level operations, each involving context switches. Access to hardware resources goes through legacy OS abstractions that weren’t designed for modern hardware like many-core CPUs, flash, non-volatile memory and fast network stacks.
The Iguazio team benefits from a long legacy of high-performance and real-time software development. We decided to redesign the stack from the ground up to deliver bare-metal application performance: millions of application ops/sec per node, with latencies under 100 microseconds at the 99th percentile of calls. And we do all that at maximum hardware utilization, with total costs as low as a few cents per GB per month.
We achieved the “impossible” by combining several cutting-edge technologies such as:
- Reducing the number of layers and inter-layer chatter to a minimum of one call or fewer per op (V3IO™ is asynchronous and uses micro-batching along with atomic, compound and conditional operations).
- Asynchronous, lock-free, zero-copy, microsecond-level messaging between layers on the same or different nodes, making the system perform like one huge, linearly scaling machine.
- Real-time data processing engines utilizing CPU parallelism and vector instructions, with bare-metal implementations of real-time scheduling, memory management, network and disk IO. This bypasses serialized and blocking layers in the OS and guarantees low latency and low jitter.
- Non-volatile memory for zero-latency write commits, metadata updates and indexes, coupled with fast asynchronous paging to NVMe flash, creating a giant virtual memory space to store indexes, metadata and warm data at 20x the density and 1/20th the cost of DRAM.
- Highly resilient petabyte capacity disk or flash enclosures (JBOD/F) to maximize density and throughput while cutting down the number of systems and cost, delivering higher resiliency and avoiding disk rebalancing in case of node failures.
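The first bullet above (micro-batching plus conditional operations) can be illustrated with a minimal in-memory sketch. This is not the V3IO API: `ToyStore`, `Op` and `apply_batch` are hypothetical names invented here, and the "store" is just a Python dictionary. The sketch only shows why grouping many updates, including conditional ones, into one call reduces the number of round trips between layers.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Op:
    """One update in a batch (hypothetical type, for illustration only)."""
    key: str
    value: Any
    expect: Optional[Any] = None  # compared with the current value if conditional
    conditional: bool = False     # when True, apply only if current value == expect


class ToyStore:
    """In-memory stand-in for a data service; not the real platform."""

    def __init__(self) -> None:
        self.data: dict[str, Any] = {}
        self.calls = 0  # number of client-to-store calls made so far

    def put(self, key: str, value: Any) -> None:
        """One call per update: the chatty pattern."""
        self.calls += 1
        self.data[key] = value

    def apply_batch(self, ops: list[Op]) -> list[bool]:
        """Apply many ops in a single call and return per-op success flags."""
        self.calls += 1
        results = []
        for op in ops:
            if op.conditional and self.data.get(op.key) != op.expect:
                results.append(False)  # condition failed, value left untouched
                continue
            self.data[op.key] = op.value
            results.append(True)
        return results


# Chatty client: 100 updates cost 100 calls.
one_by_one = ToyStore()
for i in range(100):
    one_by_one.put(f"k{i}", i)

# Micro-batched client: the same updates plus two conditional ops in one call.
batched = ToyStore()
ops = [Op(f"k{i}", i) for i in range(100)]
ops.append(Op("k0", "new", expect=0, conditional=True))    # k0 is 0, so this applies
ops.append(Op("k1", "new", expect=999, conditional=True))  # k1 is 1, so this is rejected
results = batched.apply_batch(ops)

print(one_by_one.calls, batched.calls)  # 100 1
```

A real system would also need the conditional check and the update to be atomic with respect to concurrent clients; the single-threaded sketch sidesteps that entirely.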
Iguazio data engines also provide a variety of advanced classification, search, indexing and data manipulation algorithms, implemented in low-level, distributed, real-time code. These algorithms significantly accelerate applications and minimize traffic between the application and the platform.
A Unified Data Model Storing Data Once and Reading Through Any API
A key factor driving complexity is the proliferation of single-model data repositories for files, objects, streams and records. In many cases, we use multiple repositories with the same API model, each optimized for a different access pattern (random, sequential, hierarchical), a different capacity, or a different cost vs. performance tradeoff.
We end up creating complex data pipelines with duplicated data, constant synchronization, ETL processes and API glue logic. The result is a maintenance nightmare, because each repository must be configured, secured, tuned, recovered from failures and upgraded independently.
The Iguazio platform exposes abstract “data container” services which are distributed and replicated across the system. These data containers store and organize data objects serving one or more applications, APIs and users. The applications can read, update, search or manipulate data objects, while the data service provides guaranteed data consistency, durability and availability. Various access control, QoS, or data lifecycle management policies can be applied to objects, groups or an entire data-container.
Data containers store normalized data elements and collections which can be viewed simultaneously as files, objects, streams, or table records by different APIs. The data is indexed, encoded and stored in the most efficient way to reduce the data footprint, while maximizing search and scan performance for each data type. The APIs provide advanced atomic, query and vector operations on the data to offload some of the hardest application tasks, enable application concurrency and minimize IO and communication overhead. Iguazio’s Spark DataFrame API makes heavy use of those offloads (through Spark predicate push-down) to speed up analytics by a factor of 100.
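To make the idea of predicate push-down concrete, here is a small, hedged sketch in plain Python. It does not use Spark or the Iguazio DataFrame API; `ToyTable` and its `rows_sent` counter are hypothetical, and the only thing measured is how many rows cross the boundary between the "store" and the "application" when the filter runs on each side.

```python
from typing import Callable, Iterator, Optional

Row = dict  # a row is a plain dict of column name -> value in this sketch


class ToyTable:
    """Stand-in for a table held by a data service (hypothetical)."""

    def __init__(self, rows: list) -> None:
        self.rows = rows
        self.rows_sent = 0  # rows handed back to the caller

    def scan(self, predicate: Optional[Callable[[Row], bool]] = None) -> Iterator[Row]:
        """Yield rows; if a predicate is given, filter before returning them."""
        for row in self.rows:
            if predicate is None or predicate(row):
                self.rows_sent += 1
                yield row


def is_hot(row: Row) -> bool:
    """Example filter: keep only the hottest readings."""
    return row["temp"] >= 49


rows = [{"id": i, "temp": i % 50} for i in range(10_000)]

# Without push-down: every row is sent, and the application filters them.
no_pushdown = ToyTable(rows)
client_side = [r for r in no_pushdown.scan() if is_hot(r)]

# With push-down: the filter travels to the store, only matches come back.
pushdown = ToyTable(rows)
store_side = list(pushdown.scan(predicate=is_hot))

print(no_pushdown.rows_sent, pushdown.rows_sent)  # 10000 200
```

Both approaches return the same 200 rows; the difference is that the second one moves 50 times less data in this made-up dataset. The actual gain in any real workload depends on how selective the filter is.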
The Iguazio data model maintains synchronized random, sequential and hierarchical indexes to the data, delivering high performance regardless of the access pattern and eliminating the need for multiple data stores and constant synchronization.
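As a rough illustration of keeping random, sequential and hierarchical indexes in sync, the sketch below updates all three on every write to a single in-memory container. It makes no claim about how the platform implements its indexes: `ToyContainer` and its methods are hypothetical, and the structures used (a dict, a list and a map from directory to child names) are just the simplest ones that show the idea.

```python
from collections import defaultdict


class ToyContainer:
    """One copy of the data, three access paths kept in step (illustrative only)."""

    def __init__(self) -> None:
        self.by_key = {}                 # random access: key -> value
        self.log = []                    # sequential access: keys in arrival order
        self.by_dir = defaultdict(set)   # hierarchical access: directory -> child names

    def put(self, key: str, value: bytes) -> None:
        """Store a value and update every index in the same operation."""
        is_new = key not in self.by_key
        self.by_key[key] = value
        if is_new:
            self.log.append(key)
            parts = key.strip("/").split("/")
            for depth, name in enumerate(parts):
                parent = "/" + "/".join(parts[:depth])
                self.by_dir[parent].add(name)

    def get(self, key: str) -> bytes:
        """Random access by full key."""
        return self.by_key[key]

    def scan(self) -> list:
        """Sequential access in arrival order."""
        return [(key, self.by_key[key]) for key in self.log]

    def list_dir(self, path: str) -> list:
        """Hierarchical access: names directly under a directory."""
        normalized = path.rstrip("/") or "/"
        return sorted(self.by_dir.get(normalized, set()))


container = ToyContainer()
container.put("/logs/2024/a.txt", b"1")
container.put("/logs/2024/b.txt", b"2")
container.put("/img/c.png", b"3")

print(container.get("/img/c.png"))       # b'3'
print([k for k, _ in container.scan()])  # arrival order
print(container.list_dir("/"))           # ['img', 'logs']
print(container.list_dir("/logs/2024"))  # ['a.txt', 'b.txt']
```

Because every index is updated inside `put`, no background synchronization job is needed between separate stores, which is the point the paragraph above makes.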
Having high-performance metadata search and a unified catalog enables new applications. For example, customers can run SQL or API queries against file metadata to identify or manipulate specific objects without long, resource-consuming directory traversals, eliminating separate and unsynchronized file metadata databases.
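The kind of query described above can be sketched with Python's built-in `sqlite3` module standing in for the catalog. The table name `file_meta`, its columns and the sample rows are all assumptions made for this example; the platform's actual schema and SQL dialect are not described in this article.

```python
import sqlite3

# In-memory database acting as a stand-in metadata catalog (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_meta ("
    " path TEXT PRIMARY KEY,"
    " owner TEXT,"
    " size_bytes INTEGER,"
    " mtime INTEGER)"
)
conn.executemany(
    "INSERT INTO file_meta VALUES (?, ?, ?, ?)",
    [
        ("/logs/app/a.log", "alice", 5_000_000, 1_700_000_000),
        ("/logs/app/b.log", "bob", 200, 1_700_000_100),
        ("/logs/db/c.log", "alice", 2_000_000, 1_700_000_200),
        ("/img/d.png", "bob", 9_000_000, 1_700_000_300),
    ],
)

# Find every log file over 1 MB with one query, without walking directories.
large_logs = [
    row[0]
    for row in conn.execute(
        "SELECT path FROM file_meta"
        " WHERE path LIKE ? AND size_bytes > ?"
        " ORDER BY path",
        ("/logs/%", 1_000_000),
    )
]

print(large_logs)  # ['/logs/app/a.log', '/logs/db/c.log']
```

The equivalent directory traversal would have to visit every folder under `/logs` and stat each file; the query touches only the metadata records.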