Scaling and Descaling Data in Data Science Problems

Note: This article also appears on my Medium.com account.

In data science problems, it is often necessary to scale data so that learning algorithms train well and meet the learning goals.

How do we scale efficiently? Data can be scaled either manually or with library support.

In this article, I present a simple, quick and accurate way to scale data to a new range and then descale it back to the original range.
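
For intuition, here is a minimal sketch of doing the scaling by hand and inverting it; the helpers minmax_scale and minmax_descale are illustrative names, not library functions.

import numpy as np

def minmax_scale(x, new_min=0.0, new_max=1.0):
    # Map x linearly onto [new_min, new_max]
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    scaled = (x - x_min) / (x_max - x_min) * (new_max - new_min) + new_min
    return scaled, x_min, x_max

def minmax_descale(scaled, x_min, x_max, new_min=0.0, new_max=1.0):
    # Invert minmax_scale using the stored minimum and maximum
    return (np.asarray(scaled) - new_min) / (new_max - new_min) * (x_max - x_min) + x_min

values = np.array([4.9, 5.1, 6.3, 7.0])
scaled, lo, hi = minmax_scale(values)
print(minmax_descale(scaled, lo, hi))  # recovers the original values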

Descaling is mostly required to get the output back into the form it had in the input, for instance a target class; that is what I cover in this article.

Further, on this small dataset we check the errors introduced by scaling and descaling. We do not want this process to add errors of its own, on top of the occasional algorithmic errors from misclassification.

The illustration uses the popular IRIS dataset.

We shall use MinMaxScaler from sklearn.

Let's import the required libraries.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

Read the data using pandas; the data is in Excel format.

df = pd.read_excel('/content/IRIS_Edited.xlsx')
print(df)
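
Note that MinMaxScaler works on numeric columns, so the edited Excel file is assumed to store the species column as numbers. If you do not have IRIS_Edited.xlsx at hand, an equivalent frame can be sketched from sklearn's bundled iris data (an assumption for illustration, not the file used here):

from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['species'] = iris.target  # species coded as integers 0, 1, 2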

Split the data into train and test sets; the MinMaxScaler is fitted on the training portion and then applied to the test portion.

feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
feature = df[feature_names]

target_names = ['species']
target = df[target_names]

x = feature.values
y = target.values

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.20)

Scale the features and the target class using two instances of the scaler class.

scaler1 = MinMaxScaler()
x_train_ = scaler1.fit_transform(train_x)
x_test_ = scaler1.transform(test_x)

print(x_test_[0])

scaler2 = MinMaxScaler()
y_train_ = scaler2.fit_transform(train_y)
y_test_ = scaler2.transform(test_y)

print(y_test_[0])
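
Two separate scaler instances are needed because each one remembers the minima and maxima of the data it was fitted on: scaler2 learns the range of the target column, so its inverse_transform maps back to target values rather than feature values. The learned ranges can be inspected directly:

print(scaler1.data_min_, scaler1.data_max_)  # per-feature minima and maxima
print(scaler2.data_min_, scaler2.data_max_)  # minimum and maximum of the target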

Let's check the test target class for scaling and descaling errors.

inverse_y_test_ = scaler2.inverse_transform(y_test_)
print(inverse_y_test_[0], test_y[0], test_x[0])

Compute the error between the original and the descaled values.

def computeError():
    # Sum of squared differences between the original and descaled target values
    error = 0
    for i in range(len(test_y)):
        error = error + (test_y[i] - inverse_y_test_[i]) * (test_y[i] - inverse_y_test_[i])
    print("error is", error)

computeError()

mse = mean_squared_error(test_y, inverse_y_test_)
print(mse)

Plot the original and descaled values.

def plotResults():
    plt.figure(figsize=(25, 10))
    # Overlay the descaled values on the original test targets
    plt.plot(inverse_y_test_, label='descaled')
    plt.plot(test_y, label='original')
    plt.legend()
    plt.show()

plotResults()

Results

The error was nil in these computations; the scaling and descaling round trip reproduced the original target values.
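
As a quick numerical check (a minimal sketch, assuming the species column is numeric), the round trip can also be verified with numpy:

# True when every descaled value matches its original within floating point tolerance
print(np.allclose(test_y.astype(float), inverse_y_test_))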

The spread of the graph is not uniform because the train-test split shuffles the rows, so the values no longer appear in the sequential order of the input Excel sheet.
