What is Data Preprocessing?
Data preprocessing is the process of cleaning and preparing raw data so that feature engineering can begin. After collecting large volumes of data from sources such as databases, object stores, and data lakes, data engineers prepare it so that data scientists can create features. This preparation includes basic cleaning, crunching, and joining of different sets of raw data. In an operational environment, preprocessing typically runs as an ETL job for batch data, or as part of a stream processing pipeline for live data.
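The cleaning and joining described above can be sketched with pandas. This is a minimal illustration, not a production ETL job; the table and column names (orders, customers, amount, region) are hypothetical stand-ins for raw data pulled from the sources.

```python
import pandas as pd

# Hypothetical raw extracts, standing in for data pulled from a
# database and a data lake.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "amount": [50.0, None, 30.0, 20.0],  # contains a missing value
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 30],
    "region": ["EU", "US", "EU"],
})

# Basic cleaning: drop rows with a missing amount.
clean = orders.dropna(subset=["amount"])

# Join the two raw sets on the shared key.
joined = clean.merge(customers, on="customer_id", how="left")

# "Crunching": aggregate per customer into the shape feature
# engineering will consume.
prepared = joined.groupby(["customer_id", "region"], as_index=False)["amount"].sum()
print(prepared)
```

In a real pipeline the same steps would run inside a scheduled ETL job rather than an interactive script.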
Once the data is ready for the data scientist, the feature engineering part begins.
What is Feature Engineering?
Feature engineering is the creation of features from raw data. Feature engineering includes:
- Determining the features required for the ML model
- Analyzing the data to understand its statistics and distributions, and implementing transformations such as one-hot encoding and imputation, typically using Python and its data libraries
- Preparing features for ML model consumption
- Building the models
- Testing if the features achieve what is needed
- Repeating the preparation and testing process by running experiments with different features: adding, removing, and changing them. During this process, the data scientist may discover that data is missing from the sources and request additional preprocessing from the data engineer.
- Deployment to the ML pipeline
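Two of the transformations named in the list above, imputation and one-hot encoding, can be sketched in a few lines of pandas. The column names here are hypothetical examples, not part of any particular dataset.

```python
import pandas as pd

# Hypothetical output of the preprocessing step.
df = pd.DataFrame({
    "region": ["EU", "US", None, "EU"],
    "total_spend": [50.0, 30.0, 20.0, None],
})

# Imputation: fill the missing numeric value with the column median.
df["total_spend"] = df["total_spend"].fillna(df["total_spend"].median())

# One-hot encoding: expand the categorical column into binary
# feature columns, keeping a column for missing categories.
features = pd.get_dummies(df, columns=["region"], dummy_na=True)
print(features.columns.tolist())
```

The resulting frame has only numeric columns, which is the form most ML models expect to consume.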
How Do Data Preprocessing and Feature Engineering Relate?
In preprocessing, data engineers collect and clean data from the sources so it can be used for feature engineering. Feature engineering is then the part where the actual features are created.