Data Science in the Post Hadoop Era

Adi Hirschtein | June 18, 2019

With all the turmoil and uncertainty surrounding large Hadoop distributors in the past few weeks, many are wondering what is happening to the data framework we’ve all been working on for years.

Hadoop was formed a decade ago, out of the need to make sense of piles of unstructured weblogs in an age of expensive and non-scalable databases, data warehouses and storage systems. Since then Hadoop has evolved and tried to take on new challenges, adding orchestration (YARN) and endless Apache projects. But times have changed, and businesses are discovering simpler solutions to facilitate their more sophisticated machine learning applications. These applications are real-time and use data in motion, requirements that Hadoop was never designed to handle.

The Modern Data Science Toolkit

Today’s data scientists write code in Python using Jupyter Notebook or PyCharm and work with modern machine learning frameworks like TensorFlow, PyTorch and scikit-learn. All of these tools are available as open-source applications that run on Kubernetes, outside of the Hadoop ecosystem.

Kubernetes is Everywhere

The popularity of Kubernetes is exploding. IBM acquired Red Hat for its commercial Kubernetes version (OpenShift) and VMware purchased Heptio, a company founded by Kubernetes originators. This is a clear indication that more and more companies are betting on Kubernetes as their multi-cloud clustering and orchestration technology. While some still think it makes sense to manage big data as a technology silo on Hadoop, early adopters are realizing that they can run their big data stack (Spark, Presto, Kafka, etc.) on Kubernetes in a much simpler manner. Furthermore, they can run all of the modern post-Hadoop AI tools on the same cluster. Another Kubernetes advantage is its portability, which enables users to build clusters that span multiple clouds or are distributed across locations. Portability also makes it possible to develop and test microservices in the cloud and then deploy them automatically to one or many edge locations.

Machine Learning Pipeline Automation

CI/CD is well known in the software development world. When it comes to data science, however, organizations are still in the dark. Hadoop offers different ETL tools for data engineering, but they are not suitable for machine learning pipelines. Open-source tools like Kubeflow and MLflow enable pipeline automation and are aligned with the elements of the data science lifecycle, including data collection, preparation, model training with hyperparameters and model deployment. Moreover, they enable developers to track experiments and run comparisons.
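To make the lifecycle stages above concrete, here is a minimal sketch in plain Python. It does not use the Kubeflow or MLflow APIs; the `Tracker` class, the stage functions and the toy dataset are all hypothetical stand-ins, chosen only to show how collection, preparation, training over a hyperparameter grid and experiment comparison fit together. Deployment is left out.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Run:
    """One tracked experiment: the hyperparameters used and the resulting error."""
    params: dict
    mse: float


@dataclass
class Tracker:
    """Hypothetical in-memory experiment tracker (stand-in for a tracking server)."""
    runs: list = field(default_factory=list)

    def log(self, params, mse):
        self.runs.append(Run(params=params, mse=mse))

    def best(self):
        # Compare runs by their mean squared error; lower is better.
        return min(self.runs, key=lambda r: r.mse)


def collect():
    # Data collection: toy (x, y) pairs following y = 2x + 1, plus one bad record.
    return [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (None, 9.0)]


def prepare(rows):
    # Data preparation: drop records with a missing feature.
    return [(x, y) for x, y in rows if x is not None]


def train(rows, learning_rate, epochs):
    # Training: fit y = w*x + b with plain gradient descent on the MSE.
    w, b = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in rows) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in rows) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def evaluate(model, rows):
    # Mean squared error of the fitted line on the given rows.
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in rows) / len(rows)


def run_pipeline(grid):
    # Run every hyperparameter combination and record each result.
    tracker = Tracker()
    rows = prepare(collect())
    for lr, epochs in product(grid["learning_rate"], grid["epochs"]):
        model = train(rows, lr, epochs)
        tracker.log({"learning_rate": lr, "epochs": epochs}, evaluate(model, rows))
    return tracker


tracker = run_pipeline({"learning_rate": [0.01, 0.05], "epochs": [50, 500]})
best = tracker.best()
print(best.params, best.mse)
```

A real pipeline tool would add what this sketch omits: running each stage in its own container, persisting runs and artifacts, and triggering deployment of the best model.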

Shifting to the Cloud

Enterprises are increasingly moving to the cloud to manage and analyze data while saving costs and reducing engineering resources. Customers are building their data lakes in the cloud on AWS S3 or Azure Blob storage while leveraging additional services such as EMR and Azure ML Workbench. These solutions enable scalability and give users a managed experience which includes data management, AI and serverless functions with minimal DevOps effort. However, there are some catches. Cloud users still need to glue all these services together to get a streamlined pipeline, and they face vendor lock-in.

Serverless in the Free World

Serverless functions automate infrastructure and server operations, including deployment, scaling and application management. Up until now, serverless was limited to proprietary cloud technologies, locking users into AWS or Azure. However, new open-source and multi-cloud serverless technologies such as OpenWhisk, Nuclio and Fn, designed to run over Kubernetes, are outperforming and out-featuring cloud provider serverless options. Frameworks like Nuclio have added specific features for big data, stream processing and AI workloads. Today anyone can overcome the barriers of code development, testing, scaling and operationalization without committing to one vendor.
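As a rough illustration of what a serverless function looks like from the developer's side, the sketch below defines a single handler that scores one JSON record. The `handler(context, event)` shape follows the convention used by Nuclio's Python runtime; the payload format, the `fake_context` and `fake_event` objects and the scoring logic are assumptions made up for this example so that it runs locally without any platform.

```python
import json
import logging
from types import SimpleNamespace


def handler(context, event):
    """Score one JSON record. Deployment and scaling are left to the platform."""
    record = json.loads(event.body)
    values = record.get("values", [])
    context.logger.info(f"scoring {len(values)} values")
    mean = sum(values) / len(values) if values else 0.0
    return json.dumps({"count": len(values), "mean": mean})


# Local stand-ins for the objects a serverless runtime would pass in
# (hypothetical, for illustration only).
logging.basicConfig(level=logging.INFO)
fake_context = SimpleNamespace(logger=logging.getLogger("handler"))
fake_event = SimpleNamespace(body=json.dumps({"values": [1.0, 2.0, 3.0]}))

result = handler(fake_context, fake_event)
print(result)
```

The function contains no server, queue or scaling code. That separation is what lets the same handler move between clusters and clouds when the runtime itself is open source.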

From Data Platforms to Data Science Platforms

Traditional Hadoop-based data lakes served a different time and purpose, and today they are less than optimal for meeting the new needs of data scientists. In the post-Hadoop data science era, companies are re-envisioning how they approach data, transforming their static and slow-moving data pipelines into dynamic, real-time data science pipelines and creating a new class of intelligent applications. Modern platforms like Iguazio enable scalability and collaboration, and make it easier to move projects from inception to production.