Scaling and Descaling Data in Data Science Problems

In data science problems, data often needs to be scaled so that the algorithms fit well with the learning goals.

How can we scale efficiently? Data can be scaled manually or with the help of a library.
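To make the manual route concrete, here is a minimal sketch of min-max scaling and its inverse written with NumPy. The helper names `minmax_scale` and `minmax_descale` and the sample values are hypothetical and exist only for this illustration; the rest of the article uses sklearn instead.

```python
import numpy as np

def minmax_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values to [new_min, new_max]; also return the original bounds."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    span = old_max - old_min
    if span == 0:
        raise ValueError("cannot scale constant data")
    scaled = (values - old_min) / span * (new_max - new_min) + new_min
    return scaled, (old_min, old_max)

def minmax_descale(scaled, bounds, new_min=0.0, new_max=1.0):
    """Invert minmax_scale using the bounds that were stored while scaling."""
    old_min, old_max = bounds
    scaled = np.asarray(scaled, dtype=float)
    return (scaled - new_min) / (new_max - new_min) * (old_max - old_min) + old_min

sample = np.array([4.3, 5.8, 7.9])           # hypothetical measurements
scaled, bounds = minmax_scale(sample)         # smallest value maps to 0, largest to 1
restored = minmax_descale(scaled, bounds)     # back to the original units
print(scaled, restored)
```

The point of the sketch is that descaling needs the original minimum and maximum, so they have to be kept somewhere after scaling. A library scaler stores them for you.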

In this article, I provide a simple, quick and quite accurate way to scale data to a new range and then descale it back.

Scaling back is mostly required to get the output in the form in which it was given as input, a target class for instance, and that is the case I cover in this article.

Further, we check the descaling errors on this small dataset by computing the error between the original values and the values obtained after scaling and descaling. We do not want this process to add errors of its own on top of the algorithm's occasional misclassification errors.

The illustration is performed on the popular Iris dataset.

We shall use MinMaxScaler from sklearn.
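By default MinMaxScaler maps each feature to the range 0 to 1. As a small sketch of scaling to a different range, its `feature_range` parameter can be set as below. The single-column array is hypothetical data chosen for this example, not the Iris data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data, one row per sample
data = np.array([[1.0], [2.0], [3.0], [5.0]])

# Map the feature to the range [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data)

# inverse_transform uses the minimum and maximum learned during fit
restored = scaler.inverse_transform(scaled)

print(scaled.ravel())
print(restored.ravel())
```

The same fitted scaler object must be used for both directions, because it holds the minimum and maximum learned from the data.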

Let's import the required libraries.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # needed for the split below
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
```

Read the data using pandas; the data is in Excel format.

```python
df = pd.read_excel('/content/IRIS_Edited.xlsx')
print(df)
```

Separate the features and the target, then split the data into train and test sets; this gives the MinMax scaler distributed data to fit on.

```python
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
feature = df[feature_names]

target_names = ['species']
target = df[target_names]

x = feature.values
y = target.values

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.20)
```

Scale the features and the target class using two instances of the scaler class.

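```python
# Scaler for the features: fit on the training data, reuse on the test data
scaler1 = MinMaxScaler()
x_train_ = scaler1.fit_transform(train_x)
x_test_ = scaler1.transform(test_x)
print(x_test_[0])

# Separate scaler for the target, so it can be descaled on its own later
scaler2 = MinMaxScaler()
y_train_ = scaler2.fit_transform(train_y)
y_test_ = scaler2.transform(test_y)
print(y_test_[0])
```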

Let's check the test target class for scaling and descaling errors.

```python
inverse_y_test_ = scaler2.inverse_transform(y_test_)
print(inverse_y_test_[0], test_y[0], test_x[0])
```

Compute the error between the original and the descaled values.

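```python
def computeError():
    # Sum of squared differences between the original and descaled targets
    error = 0
    for i in range(len(test_y)):
        diff = test_y[i] - inverse_y_test_[i]
        error = error + diff * diff
    print("error is", error)

computeError()

# The same check expressed as a mean squared error, using sklearn
mse = mean_squared_error(test_y, inverse_y_test_)
print(mse)
```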

Plot the original and descaled values to inspect the errors.

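```python
def plotResults():
    plt.figure(figsize=(25, 10))
    # The two curves overlap when descaling reproduces the original values
    plt.plot(inverse_y_test_, label='descaled')
    plt.plot(test_y, label='original')
    plt.legend()  # without this call the labels are not displayed
    plt.show()

plotResults()
```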

Results

The error was nil in these computations.
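A round trip through floating-point arithmetic can, in general, leave tiny residuals, so a tolerance-based comparison is a safe way to confirm a result like this. The sketch below uses a small hypothetical target column and assumes the classes are encoded as numbers; the variable names are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical target column with numerically encoded classes (an assumption)
labels = np.array([[0], [1], [2], [1], [0]], dtype=float)

scaler = MinMaxScaler()
scaled = scaler.fit_transform(labels)
restored = scaler.inverse_transform(scaled)

# Largest absolute difference between the original and descaled values
max_abs_error = np.max(np.abs(labels - restored))

# True when every value matches within NumPy's default tolerance
round_trip_ok = np.allclose(labels, restored)

print(max_abs_error, round_trip_ok)
```

Comparing with `np.allclose` rather than strict equality avoids false alarms from rounding on data whose range does not divide evenly.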

The spread of the graph is not uniform because the train-test split shuffles the rows, so the test samples are no longer in the order in which they appear in the input Excel sheet.
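If a plot in the original row order is preferred, one possible approach is to split row indices instead of the rows themselves and sort the test indices before plotting. The sketch below uses hypothetical arrays in place of the Iris data, and `random_state=0` is an arbitrary choice made only so the example is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the feature matrix and target column
x = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float).reshape(10, 1)

# Split the row indices, then sort the test indices to restore sheet order
indices = np.arange(len(y))
train_idx, test_idx = train_test_split(indices, test_size=0.20, random_state=0)
test_idx = np.sort(test_idx)

train_x, train_y = x[train_idx], y[train_idx]
test_x, test_y = x[test_idx], y[test_idx]

print(test_idx)
```

Passing `shuffle=False` to `train_test_split` also preserves order, but it takes the test set from the end of the data, which may contain a single class if the sheet is sorted by species.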