
Data Preprocessing Pipeline Using Python

Data preprocessing is a critical step in data science tasks. It ensures that raw data is transformed into a clean, organized, and structured format suitable for analysis. A data preprocessing pipeline automates this complex process through a predefined series of steps, enabling data professionals to preprocess diverse datasets efficiently and consistently.


In this task, I will walk you through building a data preprocessing pipeline using Python and scikit-learn. We will start by defining what a data preprocessing pipeline is and how it can help data professionals. Then, we will discuss the fundamental functions that every data preprocessing pipeline should perform. Finally, we will provide an example of a data preprocessing pipeline in Python.


What is a Data Preprocessing Pipeline?


A data preprocessing pipeline is a systematic and automated approach to data preprocessing. It consists of a series of interconnected steps, each of which is responsible for a specific preprocessing task. For example, a data preprocessing pipeline might include steps to:

  • Handle missing values

  • Standardize numerical features

  • Remove outliers

  • Encode categorical variables


By following a predefined sequence of operations, a data preprocessing pipeline ensures consistency, reproducibility, and efficiency in data preprocessing. This can save data professionals a significant amount of time and effort, and it can also help to ensure that the data is properly prepared for analysis.
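To make the idea of a predefined sequence concrete, here is a minimal sketch using scikit-learn's built-in Pipeline and ColumnTransformer classes. The column names and values below are made up purely for illustration and are not from the dataset used later in this post; the custom function we build further down takes a different, more hands-on approach.


import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names, used purely for illustration
numeric_cols = ["age", "income"]
categorical_cols = ["gender"]

# Each branch is its own predefined sequence of preprocessing steps
numeric_branch = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),   # handle missing values
    ("scale", StandardScaler()),                  # standardize numerical features
])
categorical_branch = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),  # encode categorical variables
])

# ColumnTransformer routes each group of columns through the right branch
preprocessor = ColumnTransformer(transformers=[
    ("numeric", numeric_branch, numeric_cols),
    ("categorical", categorical_branch, categorical_cols),
])

# Tiny made-up dataset to show the pipeline end to end
df = pd.DataFrame({
    "age": [25.0, None, 40.0],
    "income": [50000.0, 60000.0, None],
    "gender": ["F", "M", None],
})
processed = preprocessor.fit_transform(df)  # NumPy array, ready for modeling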


How Does a Data Preprocessing Pipeline Help Data Professionals?


Data preprocessing pipelines can benefit a variety of data science professionals, including:

  • Data engineers: Data engineers can use data preprocessing pipelines to automate the tasks of data cleaning and transformation. This frees up their time to focus on more strategic work, such as designing scalable data architectures and optimizing data pipelines.

  • Data analysts: Data analysts can use data preprocessing pipelines to ensure that their data is clean and well-organized before they begin their analysis. This can help them to save time and effort, and it can also help to ensure that their results are accurate and reliable.

  • Data scientists: Data scientists can use data preprocessing pipelines to prepare their data for machine learning models. This can help them to improve the accuracy and performance of their models.

  • Machine learning engineers: Machine learning engineers can use data preprocessing pipelines to automate the tasks of data preparation and model deployment. This can help them to save time and effort, and it can also help to ensure that their models are deployed consistently and reliably.


How to Build a Data Preprocessing Pipeline in Python

Building a data preprocessing pipeline in Python is relatively straightforward. You can use the following steps as a guide:

  1. Import the necessary libraries.

  2. Load the data.

  3. Identify the preprocessing tasks that need to be performed.

  4. Create a function for each preprocessing task.

  5. Chain the functions together in a sequence (a minimal sketch of steps 4 through 6 follows this list).

  6. Apply the pipeline to the data.
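Here is a minimal sketch of steps 4 through 6. The two single-task functions (fill_missing_numeric and standardize_numeric) are hypothetical placeholders used only to show the chaining pattern; they are not the pipeline we build next.


import pandas as pd

# Hypothetical single-task functions (placeholder names, for illustration only)
def fill_missing_numeric(df):
    numeric = df.select_dtypes(include="number").columns
    df[numeric] = df[numeric].fillna(df[numeric].mean())
    return df

def standardize_numeric(df):
    numeric = df.select_dtypes(include="number").columns
    df[numeric] = (df[numeric] - df[numeric].mean()) / df[numeric].std()
    return df

# Step 5: chain the functions together in a fixed sequence
def simple_pipeline(df):
    for step in (fill_missing_numeric, standardize_numeric):
        df = step(df)
    return df

# Step 6: apply the pipeline to the data (the file name is a placeholder)
# cleaned = simple_pipeline(pd.read_csv("your_dataset.csv"))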


Dataset: U.S._Chronic_Disease_Indicators__CDI_.csv (Download CSV • 359.32 MB)


First, let's import the necessary libraries and define the pipeline function:


import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

def data_preprocessing_pipeline(data):
    #Identify numeric and categorical features
    numeric_features = data.select_dtypes(include=['float', 'int']).columns
    categorical_features = data.select_dtypes(include=['object']).columns

    #Handle missing values in numeric features
    data[numeric_features] = data[numeric_features].fillna(data[numeric_features].mean())

    #Detect and handle outliers in numeric features using IQR
    for feature in numeric_features:
        Q1 = data[feature].quantile(0.25)
        Q3 = data[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - (1.5 * IQR)
        upper_bound = Q3 + (1.5 * IQR)
        data[feature] = np.where((data[feature] < lower_bound) | (data[feature] > upper_bound),
                                 data[feature].mean(), data[feature])

    #Normalize numeric features (mean 0, standard deviation 1)
    scaler = StandardScaler()
    data[numeric_features] = scaler.fit_transform(data[numeric_features])

    #Handle missing values in categorical features
    data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])

    return data


This pipeline is designed to handle various preprocessing tasks on any given dataset. The steps involved are the following (a small worked example on a toy DataFrame follows this list):


  1. Identifying the numeric and categorical features in the dataset.

  2. Addressing any missing values that are present in the numeric features. The missing values are filled with the mean value of each respective numeric feature. This ensures that missing data does not hinder subsequent analysis and computations.

  3. Identifying and handling outliers within the numeric features using the interquartile range (IQR) method. The IQR is calculated by subtracting the first quartile (Q1) from the third quartile (Q3). Any values that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers and are replaced with the mean value of the respective numeric feature. This step helps prevent extreme values from distorting subsequent analyses and model building.

  4. Normalizing the numeric features. Normalization is the process of scaling the numeric features so that they have a mean of 0 and a standard deviation of 1. This ensures that all numeric features contribute equally to subsequent analysis, avoiding biases caused by varying magnitudes.

  5. Handling the missing values in the categorical features. The missing values in the categorical features are filled with the mode value, which is the most frequently occurring category.
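To see these five steps in action on something small, here is a toy DataFrame run through the data_preprocessing_pipeline function defined above. The column names and values are made up purely for illustration, and the function must already be defined in the same session.


import pandas as pd
import numpy as np

# Tiny made-up dataset: one numeric column with a missing value and an
# extreme outlier, plus one categorical column with a missing value
toy = pd.DataFrame({
    "MeasureValue": [10.0, 12.0, 11.0, np.nan, 500.0],
    "Stratification": ["Male", "Female", "Male", None, "Male"],
})

cleaned_toy = data_preprocessing_pipeline(toy)
print(cleaned_toy)
# The NaN is imputed with the column mean, 500.0 falls outside the IQR bounds
# and is replaced with the mean, the numeric column is then standardized to
# mean 0 / standard deviation 1, and the missing category is filled with the
# most frequent value ("Male").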


By following this pipeline, data professionals can automate and streamline the process of preparing data for analysis, ensuring data quality, reliability, and consistency.


data = pd.read_csv("U.S._Chronic_Disease_Indicators__CDI_.csv")

print("Original Data:")
print(data)
 

Output:

Original Data:

         YearStart  YearEnd  ...  StratificationCategoryID3  StratificationID3
0             2014     2014  ...                        NaN                NaN
1             2018     2018  ...                        NaN                NaN
2             2018     2018  ...                        NaN                NaN
3             2017     2017  ...                        NaN                NaN
4             2010     2010  ...                        NaN                NaN
...            ...      ...  ...                        ...                ...
1185671       2020     2020  ...                        NaN                NaN
1185672       2020     2020  ...                        NaN                NaN
1185673       2017     2017  ...                        NaN                NaN
1185674       2020     2020  ...                        NaN                NaN
1185675       2019     2019  ...                        NaN                NaN

[1185676 rows x 34 columns]

 


Now we can employ our pipeline to perform all the preprocessing steps necessary:

#Perform data preprocessing
cleaned_data = data_preprocessing_pipeline(data)

print("Preprocessed Data:")
print(cleaned_data)
 

Output:

Preprocessed Data:

         YearStart   YearEnd  ...  StratificationCategoryID3  StratificationID3
0        -0.332787 -0.548525  ...                        NaN                NaN
1         0.872893  0.785671  ...                        NaN                NaN
2         0.872893  0.785671  ...                        NaN                NaN
3         0.571473  0.452122  ...                        NaN                NaN
4        -1.538467 -1.882720  ...                        NaN                NaN
...            ...       ...  ...                        ...                ...
1185671   1.475733  1.452769  ...                        NaN                NaN
1185672   1.475733  1.452769  ...                        NaN                NaN
1185673   0.571473  0.452122  ...                        NaN                NaN
1185674   1.475733  1.452769  ...                        NaN                NaN
1185675   1.174313  1.119220  ...                        NaN                NaN

[1185676 rows x 34 columns]

 


And now we have our new, cleaned dataset after preprocessing, shown below!



## I ran the algorithm in my DataCamp workspace rather than my usual VS Code because the file is very large, and I would recommend doing the same if you just want to test out or practice this process. ##



Result:

Went from 1,048,576 rows to a truncated 589 (there were 15,369,405 missing data cells in our set beforehand... far too messy to draw clear, data-driven insights for good decisions and storytelling).
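If you want to check the row count and the total number of missing cells on your own copy of the data, a quick sanity check right after loading the CSV looks like the sketch below. Run it before calling the pipeline, since the function above modifies the DataFrame in place.


# Quick sanity checks on the freshly loaded data
print("Rows and columns:", data.shape)
print("Total missing cells:", int(data.isna().sum().sum()))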



Preprocessed and cleaned dataset:


datacamp_workspace_export_2023-06-28 22_39_55.csv (Download CSV • 190 KB)

Summary


Best practices in data science are a lot like good practices in the kitchen: it is important to clean as you cook (but in this case we're cooking up insights and results, so we always use our best recipes)!


Let's review what to take away from this, what it means, and why it's important:


Data preprocessing is the process of transforming and manipulating raw data to improve its quality, consistency, and relevance for analysis. This can involve a variety of tasks, such as:


  • Identifying and handling missing values: Missing values can occur for a variety of reasons, such as human error or incomplete data collection. It is important to identify and handle them (by removing or imputing them) before analysis, as they can skew the results.

  • Formatting data: Data may need to be formatted in a specific way for analysis. For example, dates may need to be converted to a specific format, or categorical data may need to be encoded.

  • Cleaning data: Data may contain errors or inconsistencies that need to be cleaned before analysis. For example, duplicate records may need to be removed, or outliers may need to be identified and addressed.

  • Normalizing data: Normalization is the process of scaling data so that it has a mean of 0 and a standard deviation of 1. This can help to improve the performance of machine learning models.

  • Encoding categorical data: Categorical data is data that falls into discrete groups, such as gender or age group. Encoding involves converting it into numerical values that machine learning models can use (a short sketch follows this list).
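Encoding isn't performed by the pipeline built above, but as a minimal sketch of what that last bullet means, pandas' get_dummies can one-hot encode a categorical column. The column name and values here are made up for illustration.


import pandas as pd

# Made-up categorical column, purely for illustration
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# One-hot encoding: one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=["Gender"])
print(encoded)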


A data preprocessing pipeline is a systematic and automated approach that combines multiple preprocessing steps into a cohesive workflow. This can help to ensure that data is prepared consistently and efficiently, and that the same steps are applied to all datasets.


The pipeline can also serve as a roadmap for data professionals, guiding them through the transformations and calculations needed to cleanse and prepare data for analysis. This can help to reduce the time and effort required for data preprocessing, and it can also help to improve the quality of the data.
