Machine Learning - Scale

Scale Features

When your data has different values, and even different units of measurement, it can be difficult to compare them. What is kilograms compared to meters? Or height compared to time?

The answer to this problem is scaling. We can scale data into new values that are easier to compare.

See the table below, which is the same data set we used in the regression chapter, but this time the volume column contains values in liters instead of cm3 (1.0 instead of 1000).


There are different methods for scaling data. In this tutorial we will use a method called standardization.

The standardization method uses this formula:

z = (x - u) / s

Where z is the new value, x is the original value, u is the mean, and s is the standard deviation.

If you take the weight column from the data set above, the first value is 790, and the scaled value will be:

(790 - 1292.23) / 238.74 = -2.1

If you take the volume column from the data set above, the first value is 1.0, and the scaled value will be:

(1.0 - 1.61) / 0.38 = -1.59

You can now compare -2.1 with -1.59 instead of comparing 790 with 1.0.
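To make the formula concrete, here is a minimal sketch that applies z = (x - u) / s by hand with NumPy. The function name standardize and the sample weights are hypothetical, chosen only for illustration; they are not the actual data set. The sketch assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np

def standardize(values):
    """Return z = (x - u) / s for every x in values."""
    arr = np.asarray(values, dtype=float)
    u = arr.mean()   # u: the mean of the column
    s = arr.std()    # s: the population standard deviation (ddof=0)
    return (arr - u) / s

# Hypothetical weights in kg, for illustration only
weights = [790.0, 1160.0, 929.0, 1365.0, 1705.0]
z = standardize(weights)
print(z)
```

After standardization the column has a mean of 0 and a standard deviation of 1, so columns that started in different units end up on the same scale.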

You do not have to do this manually. The Python sklearn module has a method called StandardScaler() which returns a Scaler object with methods for transforming data sets.

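import pandas
from sklearn.preprocessing import StandardScaler

# Create the scaler object
scale = StandardScaler()

# Load the data set (the Volume column is in liters)
df = pandas.read_csv("cars2.csv")

# Select the columns to scale
X = df[['Weight', 'Volume']]

# Fit the scaler to X and return the standardized values
scaledX = scale.fit_transform(X)

print(scaledX)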
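The following sketch checks that StandardScaler applies the same formula as the manual calculation, and shows how a fitted scaler might be reused on new values. It is an illustration under stated assumptions: the DataFrame rows are hypothetical stand-ins for cars2.csv, and the new car (1000 kg, 1.3 liters) is invented for the example.

```python
import numpy as np
import pandas
from sklearn.preprocessing import StandardScaler

# Hypothetical rows standing in for cars2.csv (not the real data)
df = pandas.DataFrame({
    "Weight": [790, 1160, 929, 1365, 1705],
    "Volume": [1.0, 1.2, 1.0, 1.6, 2.0],
})
X = df[["Weight", "Volume"]]

scale = StandardScaler()
scaledX = scale.fit_transform(X)

# After fitting, mean_ holds u and scale_ holds s for each column,
# so the manual formula z = (x - u) / s should give the same result.
manual = (X.to_numpy() - scale.mean_) / scale.scale_
matches = np.allclose(scaledX, manual)
print(matches)

# Reuse the fitted scaler on a new, hypothetical car.
# transform() applies the stored u and s without refitting.
new_car = pandas.DataFrame({"Weight": [1000], "Volume": [1.3]})
scaled_new = scale.transform(new_car)
print(scaled_new)
```

Calling transform() rather than fit_transform() on new values keeps them on the same scale as the data the scaler was fitted on.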