A Look Under the Hood

Data Science Platform Powering
Machine Learning Pipelines

The Iguazio Data Science Platform enables you to develop, deploy and manage real-time AI applications at scale. It provides friendly pipeline orchestration tools, serverless functions and services for automation, and an extremely fast multi-model data layer, all packaged in a managed and open platform.

End-to-End
Pipeline Orchestration

Manage your workflow end-to-end using a full-stack, user-friendly environment featuring fully integrated workflow management, experiment tracking and AutoML tools.

AutoML:

Users run multiple experiments in parallel, each using a different combination of algorithm functions and/or parameter sets (hyper-parameters), to automatically select the best result. By running AutoML over parallel functions (microservices) and data, complex tasks run in a fraction of the time with fewer resources, and the selected implementation is deployed to production in one click.
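The parallel search described above can be illustrated with a minimal sketch. It is not Iguazio's implementation: run_experiment, grid and select_best are hypothetical names, the scoring function is a toy stand-in for a real training run, and local threads stand in for parallel microservices.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_experiment(params):
    """Hypothetical stand-in for one training run; returns (score, params)."""
    lr, depth = params["learning_rate"], params["max_depth"]
    # Toy score that peaks at learning_rate=0.1 and max_depth=4.
    score = 1.0 - abs(lr - 0.1) - 0.05 * abs(depth - 4)
    return score, params


def grid(space):
    """Yield every combination of the hyper-parameter values in `space`."""
    keys = sorted(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def select_best(space, max_workers=4):
    """Run all experiments in parallel and keep the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_experiment, grid(space)))
    return max(results, key=lambda r: r[0])


search_space = {"learning_rate": [0.01, 0.1, 0.5], "max_depth": [2, 4, 8]}
best_score, best_params = select_best(search_space)
```

Each combination is independent of the others, which is what makes the search easy to spread across workers; only the final comparison needs all of the results.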

Experiment Tracking:

Iguazio provides a generic and easy-to-use mechanism to describe and track the code, metadata, inputs and outputs of machine-learning-related tasks (executions). Users track various elements and store them in a database, and the platform presents all running and historical jobs in a single report.
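As a rough illustration of this idea (not the platform's actual tracking API), the sketch below records each execution's code version, parameters, inputs and outputs in a SQLite table and lists running and completed jobs in one report. RunTracker and every name and value in the usage lines are hypothetical.

```python
import json
import sqlite3


class RunTracker:
    """Hypothetical tracker: one row per execution, stored in SQLite."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS runs ("
            "id INTEGER PRIMARY KEY, name TEXT, state TEXT, "
            "code_version TEXT, params TEXT, inputs TEXT, outputs TEXT)"
        )

    def start(self, name, code_version, params, inputs):
        """Register a new execution as 'running' and return its id."""
        cur = self.db.execute(
            "INSERT INTO runs (name, state, code_version, params, inputs, outputs) "
            "VALUES (?, 'running', ?, ?, ?, '{}')",
            (name, code_version, json.dumps(params), json.dumps(inputs)),
        )
        self.db.commit()
        return cur.lastrowid

    def finish(self, run_id, outputs):
        """Mark an execution as completed and store its outputs."""
        self.db.execute(
            "UPDATE runs SET state = 'completed', outputs = ? WHERE id = ?",
            (json.dumps(outputs), run_id),
        )
        self.db.commit()

    def report(self):
        """Return running and historical jobs together, oldest first."""
        rows = self.db.execute(
            "SELECT id, name, state, params, outputs FROM runs ORDER BY id"
        ).fetchall()
        return [
            {"id": r[0], "name": r[1], "state": r[2],
             "params": json.loads(r[3]), "outputs": json.loads(r[4])}
            for r in rows
        ]


# Illustrative usage with made-up run names and values.
tracker = RunTracker()
first = tracker.start("train-a", "abc123", {"max_depth": 4}, {"dataset": "ads.csv"})
tracker.start("train-b", "abc123", {"n_trees": 100}, {"dataset": "ads.csv"})
tracker.finish(first, {"accuracy": 0.91})
report = tracker.report()
```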

Feature Store:

One of the biggest data science challenges is maintaining the same set of features in the training and inferencing (real-time) stages. Iguazio provides a unified feature store, in which datasets are registered, ingested/imported and transformed for training and batch processing, leveraging various managed services (Spark, Dask, Presto, Nuclio, etc.). Enriched features are accessed through low-latency real-time key/value or time-series APIs. The platform reaches the performance of memory with the scale and lower costs of SSD/Flash, eliminating the need for separate in-memory databases and constant synchronization between different online and offline feature stores.
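The core point, that training and serving should read features produced by a single definition, can be sketched as follows. This is an in-process illustration under stated assumptions, not the platform's feature store API: FeatureStore, ad_features and the field names are all hypothetical.

```python
class FeatureStore:
    """Hypothetical in-process store: one transform, two access paths."""

    def __init__(self, transform):
        self._transform = transform  # single definition of the features
        self._rows = {}              # key -> feature dict

    def ingest(self, key, raw):
        """Transform a raw record once and keep the resulting features."""
        self._rows[key] = self._transform(raw)

    def offline_frame(self):
        """Batch view for training: every key with its features."""
        return [{"key": k, **f} for k, f in sorted(self._rows.items())]

    def online_lookup(self, key):
        """Key/value view for serving: the features for one key."""
        return self._rows[key]


def ad_features(raw):
    # Illustrative feature logic; the field names are made up for this sketch.
    return {"ctr": raw["clicks"] / max(raw["impressions"], 1)}


store = FeatureStore(ad_features)
store.ingest("ad-1", {"clicks": 5, "impressions": 100})
store.ingest("ad-2", {"clicks": 0, "impressions": 0})
training_rows = store.offline_frame()
serving_row = store.online_lookup("ad-1")
```

Because both views read the same stored rows, the value served online for a key cannot drift from the value used in training.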

Workflow Management:

Iguazio is natively integrated with Kubeflow Pipelines to compose, deploy and manage end-to-end machine learning workflows with a UI and a set of services. To enable scalability, Kubeflow Pipelines works with Iguazio’s MLRun, orchestrating various horizontally scaling and GPU-accelerated data and ML frameworks. A single logical pipeline step may run on a dozen parallel instances of TensorFlow, Spark, or Nuclio (Iguazio’s serverless functions).
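The relationship between a logical step and its parallel instances can be sketched without any orchestration framework. The example below does not use the Kubeflow Pipelines or MLRun APIs; run_step and run_pipeline are hypothetical helpers, and local threads stand in for the parallel instances.

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(fn, items, replicas):
    """Apply one logical step to every item using `replicas` parallel workers."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        return list(pool.map(fn, items))  # map preserves input order


def run_pipeline(steps, items):
    """Run steps in sequence; each step is (name, function, replicas)."""
    for _name, fn, replicas in steps:
        items = run_step(fn, items, replicas)
    return items


pipeline = [
    ("prepare", lambda x: x * 2, 4),
    ("score", lambda x: x + 1, 12),  # one logical step, a dozen workers
]
result = run_pipeline(pipeline, [1, 2, 3])
```

The pipeline definition names each step once; how many workers execute it is a separate setting, so scaling a step does not change the workflow's shape.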

Serverless Automation and
Managed Services

Data science teams waste months and resources on infrastructure tasks involving data collection, packaging, scaling, tuning and instrumentation. Iguazio’s serverless engines automate DevOps to deploy projects in one week as opposed to months.

Serverless enables developers to write code that automatically transforms into auto-scaling production workloads, significantly cutting time to market and reducing resources. While serverless typically tackles only stateless and event-driven workloads, Nuclio, Iguazio’s open source serverless framework, automates every step of the machine learning pipeline, including packaging, scaling, tuning, instrumentation and continuous delivery. The shift to microservices enables collaboration and code reuse, so functions can be tuned gradually without breaking pipelines while consuming the right amount of CPU, GPU and memory resources. Nuclio delivers out-of-the-box production readiness by automatically managing API security, rolling upgrades, A/B testing, logging and monitoring.

Automating Ingestion, Data Preparation, Training and Serving

Ingestion:

Nuclio was built with machine learning pipelines in mind and tackles ingestion with the following capabilities:

  • Parallelizes work within a single pod so that different workers can ingest data simultaneously
  • Has high throughput and low latency as well as real-time features like zero copy, making it well equipped for data intensive workloads
  • Handles any type of trigger, such as Pub/Sub, Kinesis, Kafka, Cron and HTTP. Users are not limited in the streaming frameworks they work with, as these can be integrated with Nuclio.
  • Is easy to use
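To give a sense of what an ingestion function looks like, here is a minimal sketch of a Nuclio-style Python handler. The handler(context, event) entry point, event.body and context.logger follow Nuclio's Python runtime conventions; the record fields and the SimpleNamespace objects are assumptions added only so the sketch runs locally without a Nuclio deployment.

```python
import json
from types import SimpleNamespace


def handler(context, event):
    """Nuclio-style entry point: parse one ingested record and return it."""
    body = event.body
    # The body may arrive as raw bytes or already parsed, depending on the trigger.
    if isinstance(body, (bytes, bytearray)):
        body = json.loads(body.decode("utf-8"))
    # The "device" and "value" fields are made up for this illustration.
    record = {"device": body["device"], "value": float(body["value"])}
    context.logger.info("ingested record from " + record["device"])
    return json.dumps(record)


# Local stand-ins for the context and event objects the runtime would supply.
fake_context = SimpleNamespace(logger=SimpleNamespace(info=lambda msg: None))
fake_event = SimpleNamespace(body=b'{"device": "sensor-1", "value": "21.5"}')
result = handler(fake_context, fake_event)
```

The same handler body serves any trigger type, since the trigger only determines how the event reaches the function.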

Data Preparation and Training:

With Iguazio, customers still benefit from serverless advantages such as on-demand resource utilization, auto-scaling and automation, even for data-intensive and batch-oriented tasks. Different engines are wrapped around open source tools such as Spark, TensorFlow, Horovod and Nuclio to handle scalability and parallelism. These operate over Iguazio’s real-time data layer and under an abstraction layer that eliminates operational tasks like building, monitoring and artifact tracking, providing the ability to code once and run on different runtimes with two lines of code.

Serving:

Nuclio, Iguazio’s open source serverless project, works with virtually any type of event trigger. Users code in a Jupyter notebook and convert it to a deployable function with a single click. The result is better on-demand use of resources, as models can scale up and down as needed. Nuclio provides optimized utilization of GPUs and CPUs by using half the amount of resources while achieving better performance.

Real-time
Data Layer

Data in the modern digital world comes in many different forms and shapes: it can be structured or unstructured; arrive in streams; or be stored in records or files. This no longer fits the traditional data warehouse or data lake approach.

Iguazio provides fast, secure and shared access to real-time and historical data including NoSQL, SQL, time series and files. It runs as fast as in-memory databases on Flash memory, enabling lower costs and higher density.

Iguazio’s real-time data layer supports simultaneous, consistent and high-performance access through multiple industry-standard APIs. Users can ingest any type of data using a variety of protocols and concurrently read the data using a different method. For example, one application can ingest an event stream while another application reads that data as a table or file.
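The "write once, read through a different model" idea can be sketched with a toy in-memory store. It is an illustration only, not the data layer's API: MultiModelStore and its methods are hypothetical, and a Python list stands in for the shared storage.

```python
class MultiModelStore:
    """Hypothetical store: one copy of the data, several read paths."""

    def __init__(self):
        self._log = []  # append-only list of event dicts

    def put_event(self, event):
        """Stream-style write; returns the offset of the new event."""
        self._log.append(dict(event))
        return len(self._log) - 1

    def read_stream(self, offset=0):
        """Stream view: events in arrival order from a given offset."""
        return list(self._log[offset:])

    def read_table(self, columns):
        """Table view: the same events projected onto named columns."""
        return [tuple(e.get(c) for c in columns) for e in self._log]

    def latest(self, key_field, key):
        """Key/value view: most recent event for a key, or None."""
        for e in reversed(self._log):
            if e.get(key_field) == key:
                return e
        return None


# Illustrative usage with made-up events.
store = MultiModelStore()
store.put_event({"ad": "a1", "clicks": 3})
store.put_event({"ad": "a2", "clicks": 7})
store.put_event({"ad": "a1", "clicks": 4})
table = store.read_table(["ad", "clicks"])
newest_a1 = store.latest("ad", "a1")
tail = store.read_stream(offset=2)
```

No data is copied between the views, so a reader using the table interface sees an event as soon as the stream writer has appended it.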

Iguazio built its solution from the ground up to maximize CPU utilization and leverage the benefits of non-volatile memory, 100GbE RDMA, flash and dense storage, achieving extreme performance with consistency at the lowest cost. It shifts the balance from underutilized systems and inefficient code to extremely parallel, real-time and resource-optimized implementations that require fewer servers.

Enterprise Grade

Iguazio’s Data Science Platform is delivered as an integrated offering with enterprise resiliency and functionality in mind. IT operators do not need to create automation scripts or manage the system closely throughout the day. Instead, they set up the system through wizards, configure administration policies and register for system notifications.

Customers securely share data by providing access directly to it and not to copies. The same data is always accessed, but different users are exposed to different elements of it, according to predefined rules. Granular security is only possible if data has structure and metadata and when identity and security are enforced end to end. The Iguazio real-time data layer classifies data transactions with a built-in data firewall that provides fine-grained policies to control access, service levels, multi-tenancy and data life cycles. Organizations can enable data collaboration and governance across apps and business units without compromising security or performance.
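The principle of exposing different elements of the same data to different users can be sketched as a rule-driven view. This is a simplified illustration and not the data firewall's policy model: the roles, field names and the view_for helper are all hypothetical.

```python
# Hypothetical predefined rules: which fields each role may see.
POLICIES = {
    "analyst": {"campaign", "impressions"},
    "finance": {"campaign", "spend"},
}


def view_for(role, records, policies=POLICIES):
    """Return the shared records with only the fields the role is allowed to read."""
    allowed = policies.get(role, set())  # unknown roles see no fields
    return [{k: v for k, v in r.items() if k in allowed} for r in records]


# One shared dataset; no per-user copies are stored.
shared = [{"campaign": "spring", "impressions": 1200, "spend": 300.0}]
analyst_view = view_for("analyst", shared)
finance_view = view_for("finance", shared)
guest_view = view_for("guest", shared)
```

Filtering at read time depends on the records having named fields, which reflects the point above that granular security requires structured data and metadata.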

Learn More

Data Science Platform Tutorials

Tutorial

Get started with a comprehensive video tutorial

Data Science Platform Documentation

Documentation

Access overviews, tutorials, references and guides