Python Pandas at Extreme Performance

Yaron Haviv | August 8, 2019

Today we all choose between the simplicity of Python tools (pandas, Scikit-learn), the scalability of Spark and Hadoop, and the operational readiness of Kubernetes. We end up using them all. We keep separate teams of Python-oriented data scientists, Java and Scala Spark masters, and an army of DevOps engineers to manage those siloed solutions.

Data scientists explore with pandas. Then separate teams of data engineers re-code the same logic in Spark so that it works at scale or with live streams. We repeat that cycle every time a data scientist needs to change the logic or use a different data set for a model.

In addition to taking care of the business logic, we build clusters on Hadoop or Kubernetes, or even both, and manage them manually along with an entire CI/CD pipeline. The bottom line is that we’re all working hard, without enough business impact to show for it…

What if you could write simple code in Python and run it faster than with Spark, without any re-coding and without the DevOps overhead of deployment, scaling, and monitoring?

Continue reading on Towards Data Science.
