Normalization is a data preprocessing technique commonly used in machine learning to scale the features of a dataset so that they all have similar ranges and distributions, preventing any single feature's scale from distorting the learning process.

#### Why do we normalize our data?

Normalization transforms the data to have a `mean` of `0` and a `standard deviation` of `1`, which effectively scales the features to be centered around zero and have a consistent range. This ensures that features with larger magnitudes do not dominate the learning process compared to features with smaller magnitudes.

By normalizing the data, we ensure that all features contribute equally to the learning process, regardless of their original scales. This can improve the performance and convergence of machine learning algorithms, especially those that are sensitive to feature scales, such as gradient descent-based algorithms.

Let’s break down the explanation with an example:

Consider a dataset containing two features of a group of individuals:

- height (in centimeters) and,
- weight (in kilograms)

The height values range from 150 to 190 centimeters, while the weight values range from 50 to 90 kilograms.

Without normalization, the features have different scales and magnitudes: the height values are numerically larger than the weight values. When feeding this data into a machine learning model, features with larger magnitudes can dominate the learning process and have a disproportionate influence on the model’s predictions.

To address this issue, we can normalize the data. In our example:

**Mean Centering**: We subtract the mean of each feature from its values. This shifts the distribution of each feature so that its mean becomes 0.

**Standardization**: We divide each mean-centered feature by its standard deviation. This scales the values of each feature so that they all have a standard deviation of 1.

After normalization, both the height and weight values will be centered around zero and have consistent ranges. For example, a normalized height value of 1.5 corresponds to a height that is 1.5 standard deviations above the mean, while a normalized weight value of -0.5 corresponds to a weight that is half a standard deviation below the mean.
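As a minimal sketch, the two-step transformation can be written in plain Python. The height and weight values below are made up for illustration; they are not from a real dataset.

```python
import statistics

# Made-up height (cm) and weight (kg) samples, for illustration only
heights = [150, 158, 165, 172, 180, 190]
weights = [50, 55, 63, 70, 78, 90]

def normalize(values):
    """Z-score normalization: subtract the mean, then divide by the
    (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

norm_heights = normalize(heights)
norm_weights = normalize(weights)
```

After this transformation, each feature has mean 0 and standard deviation 1, regardless of its original units.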

###### Evaluating a simple example

Imagine you’re training a machine learning model to predict house prices based on two features: square footage (in square meters) and number of bedrooms. Here’s a sample dataset:

| House | Square Footage | Bedrooms |
| --- | --- | --- |
| 1 | 150 | 3 |
| 2 | 300 | 4 |
| 3 | 800 | 5 |

Here, the square footage values (150, 300, 800) have a much larger range compared to the number of bedrooms (3, 4, 5). This can be problematic for some machine learning algorithms because they tend to weigh features with larger values more heavily during training.

Normalization addresses this issue. Let’s see how this works.

**Calculating Mean and Standard Deviation**

**Mean** is the average of a set of numbers, calculated by dividing their sum by the count of values.

And, **Standard deviation** is a statistical measure that tells you how spread out the data is from its average value (mean). A **low standard deviation** indicates that the data points tend to be very close to the mean, while a **high standard deviation** indicates that the data points are spread out over a larger range of values.
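As a quick illustration (with made-up numbers), two datasets can share the same mean while having very different standard deviations:

```python
import statistics

# Two made-up samples with the same mean (10) but different spreads
tight = [9.9, 10.0, 10.1]    # points cluster near the mean -> low std dev
spread = [1.0, 10.0, 19.0]   # points far from the mean -> high std dev

print(statistics.pstdev(tight))   # small value, well under 1
print(statistics.pstdev(spread))  # much larger value
```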

**Square Footage**: **Mean** = (150 + 300 + 800) / 3 ≈ 416.67, **Standard Deviation** ≈ 277.89

**Bedrooms**: **Mean** = (3 + 4 + 5) / 3 = 4, **Standard Deviation** = √(2/3) ≈ 0.82

**Formula for Standard Deviation (σ)**

`σ = √(Σ(x - μ)² / n)`

where,

- Σ (summation) represents the sum of all the values
- x is a data point
- μ (mu) is the mean of the data set
- n is the number of data points in the data set

**Calculating Standard Deviation using the above formula**

Our data set has the following square footage values: 150, 300, and 800.

**Mean**

(150 + 300 + 800) / 3 = 416.67

**Calculating Squared Deviations from the Mean**

(We keep the unrounded mean, 1250/3, in the intermediate steps to avoid rounding error.)

House 1: (150 – 416.67)² ≈ 71111.11

House 2: (300 – 416.67)² ≈ 13611.11

House 3: (800 – 416.67)² ≈ 146944.44

**Calculating Variance (σ²)**

Formula for Variance is given by:

`σ² = Σ(x - μ)² / n`

(71111.11 + 13611.11 + 146944.44) / 3 ≈ 77222.22

**Standard Deviation (σ)**

√(77222.22) ≈ 277.89
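The calculation above can be verified in a few lines of Python (using the exact, unrounded mean throughout, so the results match to two decimal places):

```python
sq_footage = [150, 300, 800]
n = len(sq_footage)

mean = sum(sq_footage) / n                        # ≈ 416.67
squared_devs = [(x - mean) ** 2 for x in sq_footage]
variance = sum(squared_devs) / n                  # ≈ 77222.22
std_dev = variance ** 0.5                         # ≈ 277.89
```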

###### Normalized Data

**Formula for normalization**

`x_norm = (x - μ) / σ`

where,

- x is an individual data point in the feature.
- x_norm is the normalized value of x.
- μ (mu) is the mean of the feature in the dataset.
- σ is the standard deviation of the feature in the dataset.

**New Square Footage (normalized)**

House 1: (150 – 416.67) / 277.89 ≈ -0.96

House 2: (300 – 416.67) / 277.89 ≈ -0.42

House 3: (800 – 416.67) / 277.89 ≈ 1.38

**New Bedrooms (normalized)**

House 1: (3 – 4) / 0.82 ≈ -1.22

House 2: (4 – 4) / 0.82 = 0

House 3: (5 – 4) / 0.82 ≈ 1.22

This is the new normalized data:

| House | Square Footage (Normalized) | Bedrooms (Normalized) |
| --- | --- | --- |
| 1 | -0.96 | -1.22 |
| 2 | -0.42 | 0 |
| 3 | 1.38 | 1.22 |

Now, you can verify that both features (square footage and bedrooms) after normalization have a `mean of 0` and a `standard deviation of 1` (approximately, due to rounding). This ensures that the machine learning algorithm pays similar attention to both features during training, leading to potentially better predictions.
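Putting it all together, a small hand-rolled helper (a sketch, not a library function) reproduces the normalized table and lets us confirm the mean-0 / standard-deviation-1 property:

```python
def z_scores(values):
    """Normalize a feature: subtract its mean, divide by its
    population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

norm_sqft = z_scores([150, 300, 800])   # ≈ [-0.96, -0.42, 1.38]
norm_beds = z_scores([3, 4, 5])         # ≈ [-1.22, 0.0, 1.22]
```

Because every value is shifted by the mean and divided by the standard deviation, each normalized feature has a mean of exactly 0 and a population standard deviation of exactly 1; only the rounded values in the table are approximate.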

In conclusion, normalization prevents features with large scales from overshadowing features with smaller scales, improving the learning process for machine learning models.

In a future post, we’ll learn about performing data normalization in Python using Python ML libraries.