Telegram-канал datasciencefun - Data Science & Machine Learning: Unsorted

Data Science & Machine Learning

14 Jun 2024 04:21

Refer this for the complete overview on supervised, unsupervised and reinforcement learning

Data Science & Machine Learning

12 Jun 2024 06:32

Let's start with Day 9 today

30 Days of Data Science Series: /channel/datasciencefun/1708

Let's learn about Principal Component Analysis (PCA) today

Concept: Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform a large set of correlated features into a smaller set of uncorrelated features called principal components. These principal components capture the maximum variance in the data while reducing the dimensionality.

The steps involved in PCA are:
1. Standardization: Normalize the data to have zero mean and unit variance.
2. Covariance Matrix Computation: Compute the covariance matrix of the features.
3. Eigenvalue and Eigenvector Decomposition: Compute the eigenvalues and eigenvectors of the covariance matrix.
4. Principal Components Selection: Select the top \(k\) eigenvectors corresponding to the largest eigenvalues to form the principal components.
5. Transformation: Project the original data onto the new subspace formed by the selected principal components.

#### Benefits of PCA
- Reduces Dimensionality: Simplifies the dataset by reducing the number of features.
- Improves Performance: Speeds up machine learning algorithms and reduces the risk of overfitting.
- Uncovers Hidden Patterns: Helps visualize the underlying structure of the data.

#### Implementation

Let's consider an example using Python and its libraries.

##### Example
Suppose we have a dataset with multiple features and we want to reduce the dimensionality using PCA.

# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Example data (Iris dataset)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

# Standardizing the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Applying PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plotting the principal components
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset')
plt.colorbar()
plt.show()

# Explained variance
explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance by Component 1: {explained_variance[0]:.2f}")
print(f"Explained Variance by Component 2: {explained_variance[1]:.2f}")

#### Explanation of the Code

1. Libraries: We import necessary libraries like numpy, pandas, sklearn, and matplotlib.
2. Data Preparation: We use the Iris dataset with four features.
3. Standardization: We standardize the features to have zero mean and unit variance.
4. Applying PCA: We create a PCA object with 2 components and fit it to the standardized data, then transform the data to the new 2-dimensional subspace.
5. Plotting: We scatter plot the principal components with color indicating different classes.
6. Explained Variance: We print the proportion of variance explained by the first two principal components.

#### Explained Variance

- Explained Variance: Indicates how much of the total variance in the data is captured by each principal component. In our example, if the first principal component explains 72% of the variance and the second explains 23%, together they explain 95% of the variance.

#### Applications

PCA is widely used in:
- Data Visualization: Reducing high-dimensional data to 2 or 3 dimensions for visualization.
- Noise Reduction: Removing noise by retaining only the principal components with significant variance.
- Feature Extraction: Deriving new features that capture the essential information.

PCA is a powerful tool for simplifying complex datasets while retaining the most important information. However, it assumes linear relationships among variables and may not capture complex patterns in the data.

Best Data Science & Machine Learning Resources: https://topmate.io/coding/914624

Credits: t.me/datasciencefun

ENJOY LEARNING 👍👍