Before starting this blog, you can explore the Machine Learning blogs and their implementations at https://ainewgeneration.com/category/machine-learning/. In this blog, we will focus on Principal Components Analysis (PCA) and the basic idea behind its implementation. In short, PCA is used to reduce the dimensions, i.e. the number of features or columns, of a dataset. Now let's start in more detail.
Table of Contents
- Principal Components Analysis
- Direction of Spread
- What is Variance?
- Standardization of the data before PCA
- Variance formula
- Example of Variance Calculations
- Math behind Principal Components Analysis
- Implementation of PCA
Principal Components Analysis (PCA)
Principal Components Analysis (PCA) is used to reduce the number of features or columns in your dataset: it finds the directions in the data that carry little information and discards them.
For example, suppose we are working on taxi fare prediction and our dataset has two columns: the first is the distance covered by the cab in kilometers, and the second is the same distance in miles. Both columns describe distance, one in kilometers and the other in miles, so they are strongly related. When two columns carry essentially the same information, they are redundant: along one of the directions PCA finds, the variance is very low, which means one of the two columns adds almost nothing new.
Note – whenever we work on any kind of problem, we want the dataset's features or columns to have high variance, because our algorithm learns from the most informative features.
Direction of Spread
In the above figure, the direction of spread is marked by the blue line, a straight line through the origin called the 1st principal component: the variance is highest in this direction, i.e. the data is most spread out along it, so this line becomes the first principal component.
The 2nd principal component is the green line, which is perpendicular to the 1st principal component (the blue line); together they show that the data is spread out in two directions.
A principal component (PC) is simply a direction in which the data is spread out: the more spread out the data, the higher the variance along that direction.
What is Variance?
Variance measures how far your data is spread out. Looking at the above image, the left side has high variance because the values are spread out, and the right side has low variance because the values are close to each other.
High Variance: the data is spread out more.
Low Variance: the data points are close to each other, i.e. the values are related and carry similar information.
As noted above, the PCA algorithm keeps the directions with high variance.
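As a quick sketch of this idea (with made-up numbers), NumPy's var function makes the difference concrete:

```python
import numpy as np

# Hypothetical toy columns: one spread out, one tightly clustered.
spread_out = np.array([2.0, 8.0, 1.0, 9.0, 5.0])    # values far apart
clustered = np.array([4.9, 5.0, 5.1, 5.0, 4.95])    # values close together

print(np.var(spread_out))   # high variance
print(np.var(clustered))    # low variance
```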
Standardization of the data before PCA
Whenever you want to apply PCA, always standardize your data first: scale the features of the dataset, because PCA is affected by the scale of the data.
In the below image you can see the effect of standardizing the data before performing PCA. On the left side are the features/columns of the iris dataset; we standardize the data using MinMaxScaler or StandardScaler, and the right side of the image shows the output after standardization.
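A minimal sketch of that standardization step on the iris dataset, using StandardScaler (MinMaxScaler would work the same way):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

x = load_iris().data                          # 150 rows, 4 feature columns
x_scaled = StandardScaler().fit_transform(x)

# After standardization every column has mean ~0 and standard deviation 1.
print(x_scaled.mean(axis=0).round(2))
print(x_scaled.std(axis=0).round(2))
```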
Variance formula
To calculate the explained variance ratio of a principal component, we divide the variance along that component by the sum of the variances along all components.
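A short sketch of that calculation on the standardized iris data, checking the manual ratio against what scikit-learn reports:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

x1 = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=4).fit(x1)

# Ratio = variance along each component / total variance over all components.
manual_ratio = pca.explained_variance_ / pca.explained_variance_.sum()

print(manual_ratio.round(2))
print(pca.explained_variance_ratio_.round(2))   # matches the manual ratio
```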
Example of Variance Calculations
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

sc = StandardScaler()
# x is the iris dataset (features only)
x1 = sc.fit_transform(x)

pca = PCA(n_components=3)  # always use n_components <= number of features
pca.fit_transform(x1)

# we never apply PCA to the target variable, and y is our target variable
print(pca.explained_variance_ratio_)
- StandardScaler – standardizes (scales) the data.
- Dataset – the iris dataset has 4 columns/features (sepal length, sepal width, petal length, petal width).
- The first step is to standardize the dataset using StandardScaler's fit_transform().
- The 2nd step, after scaling, is to apply PCA. It takes the number of components as a parameter; how many components we pass is up to us, but remember that the number of components must be less than or equal to the number of features.
- pca.explained_variance_ratio_ tells us how much of the total variance lies in each direction.
- Output: [0.61, 0.33, 0.04], which means the 1st direction holds 61% of the variance, the 2nd direction 33%, and the 3rd direction 4%. We get 3 directions in total because we specified 3 components.
Math behind Principal Components Analysis
Eigenvectors (direction) and Eigenvalues (magnitude of spread)
An eigenvector gives the direction in which the values are spread, and the corresponding eigenvalue measures how much the values are spread (the variance) along that direction. In the below figure, the blue line is the first principal component: its eigenvector points along the line, and its eigenvalue measures the spread from one end of the data to the other. The green line is the 2nd principal component, with the same interpretation of direction and spread.
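To make this concrete, here is a small NumPy sketch (on randomly generated data, an illustration rather than the blog's dataset): the eigenvalues of the data's covariance matrix match the per-component variances that PCA reports.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))

# Eigen-decomposition of the covariance matrix of the data.
cov = np.cov(x, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
eigvals = eigvals[::-1]                  # largest spread (1st PC) first

pca = PCA(n_components=3).fit(x)

print(eigvals)
print(pca.explained_variance_)           # same values: spread per component
```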
Implementation of PCA
Dataset: this dataset is small, with 10 rows, 3 feature columns, and 1 target column. On the basis of these features, we predict whether a person above 70 will survive or not.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
Read the dataset (.csv file)
df = pd.read_csv("Health.csv")
df.head()
Dependent Variable and Independent Variable
# Independent variables
x = df.iloc[:, :3]

# Dependent variable
y = df.iloc[:, 3]
Replace Missing Values in the dataset
from sklearn.impute import SimpleImputer

# fill missing values in the numeric columns with the column mean
mvi = SimpleImputer(missing_values=np.nan, strategy="mean")
x.iloc[:, 1:3] = mvi.fit_transform(x.iloc[:, 1:3])
In the above code, an imputer is used to fill the NaN values in the dataset: it takes a missing_values parameter (NaN) and a strategy that says what to write in place of the missing values (here, the column mean). We apply it only to the numeric columns, skipping the first categorical column, which contains string values that need to be converted to numbers using a LabelEncoder.
Applying Label Encoder
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
x.iloc[:, 0] = le.fit_transform(x.iloc[:, 0])

# label-encode the target as well
y_le = LabelEncoder()
y = y_le.fit_transform(y)
print("Before PCA:")
print(x)

pc = PCA(n_components=2)
pc.fit(x)
x_pca = pc.transform(x)

print("After PCA:")
print(x_pca)
In the above code we have 3 columns/features in x; after performing PCA with n_components=2, the data is reduced to the 2 features/columns along which it is most spread out.
The shape of the dataset before and after PCA
print("original shape :", x.shape)
print("transformed shape :", x_pca.shape)

output:
original shape : (9, 3)
transformed shape : (9, 2)
Calculate the explained variance ratio
print(pc.explained_variance_ratio_)

output: [0.9333, 0.0587]
As the above output shows, about 93% of the variance lies along the first component and about 6% along the 2nd component.
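As a quick check (hard-coding the two ratios printed above), the two components together retain roughly 99% of the variance:

```python
import numpy as np

ratios = np.array([0.9333, 0.0587])   # values from the output above
print(round(float(ratios.sum()), 3))  # close to 1, so little is lost
```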
Make a DataFrame for the 2 components
df = pd.DataFrame(x_pca, columns=["pca1", "pca2"])
df
Plotting the dataset above for the two components, i.e. the 1st and 2nd PCs
import matplotlib.pyplot as plt

plt.plot(df)
plt.show()
- 1st component – holds about 93% of the variance; the blue line shows this high-variance direction.
- 2nd component – holds about 6% of the variance; the orange line shows this low-variance direction.
The task of PCA is to find the directions in which the data is most spread out; more spread means higher variance, and keeping the high-variance directions helps the algorithm learn from the most informative features.
I hope this blog was helpful for understanding the concept behind Principal Components Analysis (PCA). In the next blog, we will go into more depth in the machine learning series.