Preprocessing in Machine Learning?
What is data preprocessing?
Data preprocessing is the process of converting raw data into a clean and usable format for machine learning algorithms.
It involves handling missing values, scaling features, encoding categorical variables, and splitting data into training and testing sets.
In simple words, preprocessing makes sure that your data is in the right shape and scale for your model to understand.
Why is preprocessing important?
Without proper preprocessing:
- Models may produce inaccurate results.
- Algorithms can become biased due to inconsistent data.
- Features with large ranges can dominate the training process.
- Missing or invalid data may cause errors or poor predictions.
So, preprocessing is the foundation of reliable machine learning.
Main Steps in Data Preprocessing
Below are the key stages of preprocessing, along with explanations:
1. Data Cleaning
- Handle missing data using mean, median, or mode.
- Remove duplicates or irrelevant features.
- Detect and treat outliers.
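The cleaning steps above can be sketched roughly as follows. The column names and values are hypothetical toy data. The 1.5 × IQR rule used here is only one common convention for flagging outliers, and dropping them is an assumption made for brevity.

```python
import pandas as pd

# Hypothetical toy data: one missing age, one duplicated row, one extreme salary.
raw = pd.DataFrame({
    "age": [25, 30, 30, None, 35, 28],
    "salary": [50000, 54000, 54000, 62000, 58000, 1000000],
})

# 1. Drop exact duplicate rows (keeps the first occurrence).
cleaned = raw.drop_duplicates().reset_index(drop=True)

# 2. Fill missing ages with the median, which is less sensitive to outliers than the mean.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

# 3. Flag salary outliers with the 1.5 * IQR rule (one convention among several).
q1 = cleaned["salary"].quantile(0.25)
q3 = cleaned["salary"].quantile(0.75)
iqr = q3 - q1
is_outlier = (cleaned["salary"] < q1 - 1.5 * iqr) | (cleaned["salary"] > q3 + 1.5 * iqr)

# Here the outliers are simply dropped; capping them is another option.
cleaned = cleaned[~is_outlier].reset_index(drop=True)
print(cleaned)
```

Whether to drop, cap, or keep an outlier depends on the data: an extreme value can be a typo or a genuine observation.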
2. Data Encoding
- Convert categorical (text) data into numerical form using:
  - Label Encoding (e.g., Yes → 1, No → 0)
  - One-Hot Encoding (creates separate columns for categories)
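As a minimal sketch of the two encodings, assuming a hypothetical table with a binary `purchased` column and a multi-category `city` column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: a binary column and a multi-category column.
people = pd.DataFrame({
    "purchased": ["No", "Yes", "No", "Yes"],
    "city": ["Paris", "Tokyo", "Lima", "Tokyo"],
})

# Label encoding: classes are sorted alphabetically, so "No" -> 0 and "Yes" -> 1.
label_encoder = LabelEncoder()
people["purchased"] = label_encoder.fit_transform(people["purchased"])

# One-hot encoding: one 0/1 column per city, so no artificial order is implied.
encoded = pd.get_dummies(people, columns=["city"], dtype=int)
print(encoded)
```

Label encoding suits binary or ordered categories. For unordered categories such as cities, one-hot encoding avoids suggesting that one category is "larger" than another.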
3. Feature Scaling
- Normalize or standardize data so that all features are on a similar scale.
- Common techniques:
  - Min-Max Scaling: scales values between 0 and 1.
  - Standardization: transforms data to have mean = 0 and standard deviation = 1.
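The two techniques can be compared on a single hypothetical feature. This is a small sketch using scikit-learn's `MinMaxScaler` and `StandardScaler`; the salary values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with a wide range (one column, four rows).
salaries = np.array([[50000.0], [54000.0], [58000.0], [62000.0]])

# Min-Max scaling: (x - min) / (max - min), giving values in [0, 1].
min_max_scaled = MinMaxScaler().fit_transform(salaries)

# Standardization: (x - mean) / std, using the population standard deviation.
standardized = StandardScaler().fit_transform(salaries)

print(min_max_scaled.ravel())
print(standardized.ravel())
```

Min-Max scaling is sensitive to extreme values because they define the range, so standardization is often the safer default when outliers are present.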
4. Data Splitting
- Split data into training and testing sets.
  - Example: 80% for training and 20% for testing.
- Ensures the model is evaluated on data it did not see during training.
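A short sketch of an 80/20 split with scikit-learn's `train_test_split`, on hypothetical arrays. The `stratify` argument is an optional extra not mentioned above; it keeps the class ratio the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features, balanced binary labels.
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1] * 5)

# 80/20 split; random_state makes it reproducible and stratify keeps
# the class ratio the same in both subsets.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(features_train.shape, features_test.shape)
```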
Example: Preprocessing in Python
Let’s take an example using Python’s pandas and scikit-learn libraries.
# Importing libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
# Sample dataset
data = {
'Age': [25, 30, 28, None, 35],
'Salary': [50000, 54000, 58000, 62000, None],
'Purchased': ['No', 'Yes', 'No', 'Yes', 'No']
}
df = pd.DataFrame(data)
# Step 1: Handle missing values
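# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas versions and may not modify df
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())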
# Step 2: Encode categorical data
encoder = LabelEncoder()
df['Purchased'] = encoder.fit_transform(df['Purchased'])
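# Step 3: Feature scaling
# (fitted on all rows here for simplicity; in practice, fit the scaler on the
# training split only so that test data does not leak into the statistics)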
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
# Step 4: Split the dataset
X = df[['Age', 'Salary']]
y = df['Purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Preprocessed Data:\n", df)
Output Preview (values rounded to two decimals):
| Age | Salary | Purchased |
|---|---|---|
| -1.38 | -1.50 | 0 |
| 0.15 | -0.50 | 1 |
| -0.46 | 0.50 | 0 |
| 0.00 | 1.50 | 1 |
| 1.69 | 0.00 | 0 |
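The example above imputes and scales all rows before splitting, which lets the test row influence the mean and standard deviation. One way to avoid that is sketched below, assuming scikit-learn's `SimpleImputer` and `make_pipeline`. The names `df_raw` and `preprocess` are illustrative, and this is a variation on the example rather than the original workflow.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same toy dataset as in the example above.
df_raw = pd.DataFrame({
    "Age": [25, 30, 28, None, 35],
    "Salary": [50000, 54000, 58000, 62000, None],
    "Purchased": ["No", "Yes", "No", "Yes", "No"],
})
X_raw = df_raw[["Age", "Salary"]]
y_raw = df_raw["Purchased"].map({"No": 0, "Yes": 1})

# Split first, so the test rows never influence the imputation or scaling statistics.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# Mean imputation followed by standardization, learned from the training rows only.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_train_ready = preprocess.fit_transform(X_train_raw)
X_test_ready = preprocess.transform(X_test_raw)  # reuses the training statistics

print(X_train_ready.shape, X_test_ready.shape)
```

With only five rows the difference is negligible, but on real datasets fitting the preprocessing on the training split gives a more honest estimate of performance on unseen data.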
Flowchart: Data Preprocessing Steps
[Figure: flowchart of the data preprocessing steps]
Conclusion
Preprocessing is not just the first step; it is often one of the most important steps in machine learning.
A well-preprocessed dataset helps your model learn accurately, train efficiently, and generalize better to unseen data.
Remember: “Better data beats a better algorithm.”