What Are the Tradeoffs Between a Data Lake and a Data Warehouse?

A data warehouse stores structured data for reporting on consistent business data points (for example, metrics like quarterly sales). The structure of the data is well known, and it holds critical business data. A lot of work goes into structuring and cleaning the data up front, so that what is stored is selective and useful. A data warehouse is often housed in a costly proprietary system, like Oracle or Teradata, so it makes sense to store only the most critical business data there.

A data lake offers massive storage at a much lower price point, so you don’t need to know or care about the structure or content of the data before storing it. The approach with a data lake is to “dump it in, and deal with it later.” This kind of setup suits data science work, where the data could potentially hold some value but exploratory analysis is required to uncover it. A data lake requires very little work up front, but heavy data engineering (processing, running transformations and calculations, and so on) to extract the value later.
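The contrast between cleaning data up front and "dump it in, and deal with it later" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event records, the field names, the `quarterly_sales` table, and the helper functions are all hypothetical, an in-memory SQLite database stands in for a warehouse, and a plain list of raw strings stands in for a lake.

```python
import json
import sqlite3

# Hypothetical raw events; the field names are illustrative only.
RAW_EVENTS = [
    '{"region": "EMEA", "quarter": "2023-Q1", "sales": 1200.5}',
    '{"region": "APAC", "quarter": "2023-Q1", "sales": "980"}',
    '{"region": "EMEA", "clicks": 42}',
]


def load_warehouse(conn, raw_events):
    """Warehouse style: validate and clean each record before storing it."""
    conn.execute(
        "CREATE TABLE quarterly_sales ("
        "region TEXT NOT NULL, quarter TEXT NOT NULL, sales REAL NOT NULL)"
    )
    rejected = 0
    for line in raw_events:
        record = json.loads(line)
        try:
            row = (record["region"], record["quarter"], float(record["sales"]))
        except (KeyError, TypeError, ValueError):
            # Records that do not fit the known structure never get stored.
            rejected += 1
            continue
        conn.execute("INSERT INTO quarterly_sales VALUES (?, ?, ?)", row)
    return rejected


def load_lake(raw_events):
    """Lake style: keep every record exactly as it arrived."""
    return list(raw_events)


def sales_from_lake(lake):
    """The parsing and cleaning work happens later, at read time."""
    total = 0.0
    for line in lake:
        record = json.loads(line)
        if "sales" in record:
            total += float(record["sales"])
    return total


conn = sqlite3.connect(":memory:")
rejected = load_warehouse(conn, RAW_EVENTS)
warehouse_total = conn.execute(
    "SELECT SUM(sales) FROM quarterly_sales"
).fetchone()[0]

lake = load_lake(RAW_EVENTS)
lake_total = sales_from_lake(lake)
```

In this sketch the warehouse path drops the record that has no sales figure, while the lake path keeps it, so the `clicks` field is still available if someone later decides it is worth exploring.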

It's worth noting that data infrastructure changes along with the maturity of an organization's data use cases. Data warehouses are primarily for business intelligence, and data lakes are typically built once some kind of data science work has begun. Once multiple models need to be built and maintained, the next step in maturity is a feature store with advanced data transformations, where data scientists and data engineers can work together to own and operate feature engineering and feature serving.
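As a rough illustration of the feature store idea, the sketch below keeps feature definitions in one shared registry so that the same transformation code is used when building training data and when serving a model. It is a toy in-memory example, not how any particular feature store product works; the `FEATURES` registry, the `feature` decorator, the feature names, and the customer records are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory registry mapping feature names to transformations.
FEATURES: Dict[str, Callable[[dict], float]] = {}


def feature(name: str):
    """Register a transformation under a shared feature name."""
    def register(fn: Callable[[dict], float]) -> Callable[[dict], float]:
        FEATURES[name] = fn
        return fn
    return register


@feature("avg_order_value")
def avg_order_value(customer: dict) -> float:
    orders = customer["orders"]
    return sum(orders) / len(orders) if orders else 0.0


@feature("order_count")
def order_count(customer: dict) -> float:
    return float(len(customer["orders"]))


def compute_features(customer: dict, names: List[str]) -> Dict[str, float]:
    """Compute the requested features with the registered definitions."""
    return {name: FEATURES[name](customer) for name in names}


# Training and serving both go through the same registered definitions.
training_rows = [
    compute_features(c, ["avg_order_value", "order_count"])
    for c in [{"orders": [10.0, 30.0]}, {"orders": []}]
]
serving_row = compute_features({"orders": [5.0]}, ["avg_order_value"])
```

Because both paths call `compute_features`, a change to a feature definition applies to training and serving together, which is the kind of shared ownership between data scientists and data engineers described above.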
