Big Data Must Begin with a Clean Slate

Yaron Haviv | March 6, 2018

More than a decade has passed since we coined the term “big data,” and a decade in the tech world is almost an eternity. Is big data now obsolete?

The short answer is that although big data in itself may still have its place for some apps, the focus has shifted to integrating data-driven insights into business applications, making sure they automate expensive manual operations or generate intelligent actions to help acquire new customers — that is, “actionable insights.” This requires very different tools and methodologies than the ones we used for “big data.”

Actionable Insights Require a Fundamental Change

Common practice involves collecting data from various sources, then running multiple aggregation and join queries on it to create a meaningful, contextual data set. The output data is fed into machine learning algorithms that try to find common patterns or anomalies. In many cases, this is an iterative process involving trial and error, producing an artificial intelligence model that’s used for prediction or classification.

Teams of data scientists and data engineers handle this first step and, frankly, in most cases companies don’t get beyond this phase, because the biggest challenge is putting the pipeline into operation and integrating it into existing business applications or use cases. Having a data lake for data mining, with data scientists writing experiments in R or Python over a year’s worth of historical data, is not the end goal. The goal is stopping fraud, running predictive maintenance and providing real-time product recommendations.

That requires having production-quality code, written by application developers, that can scale, handle failures and address operational challenges such as upgrades and security. Production systems must combine events, fresh data, historical data and AI logic in order to act at the time an event takes place with no visible delay.

Most organizations hit the wall when they figure out elephants can’t fly. Technologies designed for log analysis over immutable column structures, or for unorganized textual and unstructured data, aren’t so useful when data keeps changing and responses are expected immediately. People add micro-batch or streaming solutions, or real-time NoSQL databases, to the unstructured and unindexed data lake, hoping that will solve the problem. Instead, they create a multiheaded beast built from discrete parts that cannot be easily tamed. They spend days tuning performance, resource and memory allocations and handling “occasional” hiccups, fantasizing about a better future.

So let’s begin with a clean slate. Here’s what we want:

  • Simple and continuous development, followed by automated testing and deployment into production systems, without compromising application security, scalability or availability.
  • Analytics as part of a continuous workflow with requests, events and data flowing in on one end and returning responses on the other, driving actions or presenting dashboards as quickly as possible.

This is best served by a continuous analytics approach, combined with a cloud-native microservices-based architecture.

Delivering Actionable Insights with Continuous Analytics

After completing the first step in data science — building a model for predicting behavior or classifying information — we deploy it in a production system and keep enhancing or tuning the model to maximize accuracy. We break the flow into several major steps:

  • Ingest
  • Contextualize
  • Infer (predict)
  • Serve
  • Reinforce learning
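
To make the flow concrete, here is a minimal, purely illustrative skeleton of the five stages as plain Python functions. Every name in it (the in-memory history store, the threshold rule standing in for a model) is a placeholder, not a real framework API.

```python
# Illustrative skeleton of the five stages; every name here is a placeholder,
# not a specific framework API.
history = {}  # stand-in for a real-time, indexed state store

def ingest(raw):
    """1. Accept and validate a raw event."""
    return {"user": raw["user"], "value": float(raw["value"])}

def contextualize(event):
    """2. Enrich the event with historical context (a rolling average here)."""
    past = history.setdefault(event["user"], [])
    recent = past[-3:]
    event["avg_recent"] = sum(recent) / len(recent) if recent else 0.0
    past.append(event["value"])
    return event

def infer(event):
    """3. Predict/classify; a trivial threshold rule stands in for a real model."""
    event["alert"] = event["avg_recent"] > 0 and event["value"] > 2 * event["avg_recent"]
    return event

def serve(event):
    """4. Return the decision to the caller, a dashboard or an external system."""
    print(event)

def reinforce(event, outcome):
    """5. Record the observed outcome so the model can be improved later."""
    history.setdefault("feedback", []).append((event["user"], outcome))

for raw in ({"user": "u1", "value": 1}, {"user": "u1", "value": 5}):
    serve(infer(contextualize(ingest(raw))))
```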

We ingest data from various sources in the first step. This includes web requests from clients, IoT sensor updates, logs, pictures, audio streams or a stream of updated records from operational databases. Ingested data offers only a partial view, since decisions often require a historical perspective, such as a temperature or stock ticker trend over the last hour or day. Other data relates to the requesting user, such as current financial balance or gender, or to the environment, such as the weather. Some data may be unstructured — photos, voice or text — and requires classification, cleansing, decoding or validation.
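
As a rough sketch of what this ingestion step can look like in practice, the snippet below reads JSON events from a message stream using the kafka-python client; the topic name and broker address are assumptions made up for this example.

```python
# Minimal ingestion sketch using the kafka-python client; the topic name and
# broker address are assumptions made up for this example.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor-events",                              # hypothetical topic
    bootstrap_servers="localhost:9092",           # hypothetical broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value                         # e.g. {"sensor_id": "s1", "temp": 71.3}
    # basic validation/cleansing before handing the event to the next stage
    if "sensor_id" in event and "temp" in event:
        print(event)                              # stand-in for the downstream pipeline
```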

The next step is to contextualize or enrich ingested event data with a historical and environmental perspective. In the traditional data lake approach, ingestion and contextualization are processed in batches, using slow pipelines. Preserving raw information, as one would with big data, is no longer viable when systems respond to delay-sensitive customer requests, such as when controlling vehicles on the road, responding to fraud or cybersecurity attacks, or managing machines in a factory.

The key in continuous analytics is to form contextualized data — also known as an enriched feature vector — in real time, followed by immediate decisions or predictions, known as inferencing. This requires an indexed, structured and real-time data layer to search and update the context with minimal latency. We cannot use traditional data lakes, which are immutable — no updates allowed — unindexed and unorganized.
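
A minimal sketch of that pattern is shown below, assuming user context lives in a Redis hash and a scikit-learn-style model has already been trained; the key layout and feature names are illustrative, not a prescribed schema.

```python
# Sketch: build an enriched feature vector in real time, then run inference.
# Redis keys, field names and the trained `model` are illustrative assumptions.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def enrich_and_predict(event, model):
    profile = r.hgetall(f"user:{event['user_id']}")      # historical/user context
    feature_vector = [
        float(event["amount"]),                          # fresh event data
        float(profile.get("avg_amount_30d", 0.0)),       # historical aggregate
        float(profile.get("account_age_days", 0.0)),     # user attribute
    ]
    return model.predict([feature_vector])[0]            # e.g. fraud / not fraud
```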

Once we have state information and predictions, we use a flexible method to serve them to users or external devices. We can alert external systems if we get indications of a hazard, or run AI algorithms to deliver smart responses and up-to-date intelligent dashboards. Outputs range from web user interfaces and external application programming interface calls to chatbots, voice responses or even custom-generated video streams.
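
Serving can be as thin as a small HTTP endpoint in front of the inference step. The Flask sketch below is just one option; the route, payload fields and placeholder scoring rule are invented for illustration.

```python
# Minimal serving sketch with Flask; the route, payload fields and placeholder
# scoring rule are invented for illustration.
from flask import Flask, jsonify, request

app = Flask(__name__)

def score_event(event):
    # placeholder for the real enrichment + inference step sketched earlier
    return 1.0 if float(event.get("amount", 0)) > 1000 else 0.0

@app.route("/score", methods=["POST"])
def score():
    event = request.get_json()
    return jsonify({"user_id": event.get("user_id"), "score": score_event(event)})

if __name__ == "__main__":
    app.run(port=8080)
```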

Finally, we determine the accuracy of our decisions once results are served, or shortly after. For example, we predicted the stock price would go up or the weather would get hotter, but the stock crashed or a blizzard blew into town; we predicted a car would run out of gas and it didn’t. This information is later used to improve our prediction model, factoring it back into future decisions, a process known as reinforcement learning.
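
One simple way to fold those observed outcomes back into a model with standard Python tooling is incremental (online) learning, for example scikit-learn’s partial_fit. This is not full reinforcement learning, but it illustrates the feedback step; the feature layout and labels below are made up.

```python
# Sketch of folding observed outcomes back into a model with incremental
# (online) learning; feature layout and labels are illustrative.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])        # e.g. 0 = prediction held up, 1 = it did not

def record_outcome(feature_vector, actual_label):
    """Called once the real-world result of a prediction is known."""
    model.partial_fit([feature_vector], [actual_label], classes=classes)

# e.g. we predicted this transaction was fine, but it was later charged back
record_outcome([120.0, 35.5, 400.0], 1)
```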

Faster Time to Production

OK, so we figured out a way to integrate intelligence into our workflow using continuous analytics, but our key challenge here is to rapidly develop services and continuously enhance them, just like cloud services or software-as-a-service products.

Let’s break our solutions into three major components:

  • Data services (databases, object store, messaging)
  • Analytics and AI services
  • Custom application microservices

If we’re not Google Inc., Amazon.com Inc. or Microsoft Corp., we had better subscribe to a cloud provider or commercially supported data services so that we can focus on our application logic and continuous analytics. Unfortunately, engineers (and I know this because I am one myself) have a tendency to download various components and stitch them together into a working solution, only to see it break when operations scale, finding themselves unable to diagnose and secure the service in production. The DIY approach can slow us down significantly, and we’re better off skipping this attempt to reinvent the wheel, especially when under the gun to deliver new business services.

The best way is to adopt best practices from cloud companies and embrace a cloud-native architecture. In a nutshell, cloud-native addresses application durability, elasticity and continuous delivery. It leverages microservices, that is, small stateless application fragments that are packaged as Docker containers and deployed and auto-scaled by orchestration software such as Kubernetes to address service elasticity and resiliency. Multiple tiers of microservices are part of a bigger and evolving application. Microservices use cloud-native storage and databases to store state.

Your application should be broken into functional microservices that run within an analytics framework or service, such as Spark or TensorFlow, or use Python data science tools. It is important to run these services as containers on modern clusters such as Kubernetes, which provide management, security, fault recovery, auto-scaling and the like. In many cases you can use pre-integrated AI services and access them via APIs — for example, uploading a picture via an API and getting back information about faces, or sending a voice recording and getting back a natural-language interpretation. In the future we will see API and function marketplaces.
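
Calling such a hosted service usually amounts to a single HTTP request. The sketch below posts an image to a hypothetical vision endpoint; the URL, authorization header and response fields are placeholders, not a specific vendor’s API.

```python
# Sketch of calling a hosted vision API; the URL, authorization header and
# response fields are hypothetical placeholders, not a real vendor's interface.
import requests

with open("photo.jpg", "rb") as f:
    response = requests.post(
        "https://vision.example.com/v1/analyze",        # hypothetical endpoint
        headers={"Authorization": "Bearer <API_KEY>"},  # placeholder credential
        files={"image": f},
    )

print(response.json().get("faces", []))                 # e.g. detected-face metadata
```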

Writing your own application and taking care of persistence, auto-scaling, monitoring, security and versioning — not to mention managing the underlying server infrastructure — can be a long and frustrating task. This is where serverless functions come in.

Serverless platforms let us write code, declare its platform dependencies, and have it built, tested, deployed, secured and auto-scaled automatically. Serverless leads to significant cost savings and, more importantly, allows us to develop services faster. There are cloud provider serverless frameworks such as AWS Lambda and Azure Functions, as well as several multicloud and open-source serverless frameworks, such as OpenWhisk, nuclio and OpenFaaS.
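
As an illustration, a Python function for a framework such as nuclio is little more than a handler that receives an event and returns a response; the payload fields and scoring rule below are invented for the example.

```python
# Sketch of a Python handler in the nuclio handler(context, event) style;
# the payload fields and scoring rule are invented for the example.
import json

def handler(context, event):
    payload = json.loads(event.body)                # e.g. {"user_id": "u1", "amount": 42}
    context.logger.info("scoring event for user %s" % payload.get("user_id"))
    score = 1.0 if payload.get("amount", 0) > 1000 else 0.0   # placeholder inference
    return json.dumps({"user_id": payload.get("user_id"), "score": score})
```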

To conclude, data at rest and data lakes store lots of useless data. Focus on a model that makes continuous use of data to improve the business bottom line: acquire more customers, make sure they are happy and come back, and reduce operational costs through automation. Coupling a continuous analytics flow that generates actionable insights with cloud-native and serverless technologies is the productive way to deliver a smarter business in less time, with lower risk and fewer resources.

(This post by Yaron Haviv was initially published on siliconANGLE).