Implementation of Gaussian Distribution in Machine Learning
Consider the famous Iris dataset, which consists of 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. We can examine the distribution of one of these features, such as sepal length, by plotting a histogram and checking whether it approximately follows a Gaussian distribution.
- x = np.linspace(np.min(sepal_length), np.max(sepal_length), 100): np.linspace creates an array of 100 evenly spaced values between the minimum and maximum sepal length (sepal_length). The Gaussian density is evaluated at these points to draw the fitted curve.
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np
# Load the Iris dataset
iris = load_iris()
sepal_length = iris.data[:, 0] # Extract sepal length (feature at index 0)
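# Estimate the Gaussian parameters (np.std defaults to the population form, ddof=0)
mu, std = np.mean(sepal_length), np.std(sepal_length)
# Evaluate the Gaussian probability density function on an evenly spaced grid
x = np.linspace(np.min(sepal_length), np.max(sepal_length), 100)
y = (1 / (std * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / std)**2)
# Plot the normalized histogram and overlay the fitted curve
plt.figure(figsize=(8, 6))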
plt.hist(sepal_length, bins=20, color='skyblue', edgecolor='black', alpha=0.7, density=True)
plt.plot(x, y, color='red', label='Gaussian Fit')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Density')
plt.title('Distribution of Sepal Length in Iris Dataset with Gaussian Fit')
plt.legend()
plt.show()
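Output: a histogram of sepal length, normalized to a density, with the fitted Gaussian curve overlaid in red. The plot suggests the following: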
- Central Tendency: The fitted curve peaks at the sample mean, about 5.84 cm, so the sepal lengths in the dataset are centered around that value.
- Variability: The spread of the distribution is measured by the standard deviation, about 0.83 cm here, which indicates how much the sepal lengths vary from the mean. A larger standard deviation would imply more variability in sepal lengths among the iris flowers.
- Normality: The histogram roughly follows a bell-shaped curve, which is characteristic of a normal (Gaussian) distribution. This suggests that sepal lengths in the Iris dataset may be approximately normally distributed, although a visual match alone does not establish normality.
- Outliers: The right tail of the histogram extends further from the mean than the left, so the longest sepals (up to 7.9 cm) may look like outliers compared to the rest of the dataset. Such values could be due to measurement errors or represent a distinct subgroup of iris flowers, for example one of the three species in the dataset.
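The visual reading above can also be checked numerically. The following is a minimal sketch rather than part of the original walkthrough: it assumes SciPy is installed, and the Shapiro-Wilk test and the 1.5 × IQR rule are illustrative choices; other normality tests or outlier rules would be equally reasonable.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris

sepal_length = load_iris().data[:, 0]

# Summary statistics behind the central-tendency and variability points
mu = np.mean(sepal_length)
std = np.std(sepal_length)  # population standard deviation (ddof=0), as in the plot code
print(f"mean = {mu:.2f} cm, std = {std:.2f} cm")

# Shapiro-Wilk test: a small p-value is evidence against normality
w_stat, p_value = stats.shapiro(sepal_length)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")

# 1.5 * IQR rule (an assumed, conventional threshold) for flagging outliers
q1, q3 = np.percentile(sepal_length, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sepal_length[(sepal_length < lower) | (sepal_length > upper)]
print(f"fences = [{lower:.2f}, {upper:.2f}], flagged values = {outliers}")
```

Because the dataset pools three species, repeating the same checks per species may give a different picture than the pooled feature does.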
Gaussian distributions are stable under linear combinations: a linear combination of independent Gaussian random variables is itself Gaussian. This property allows analytical solutions for understanding the behavior of random variables and for making predictions from data, which makes the Gaussian distribution a cornerstone of statistical modeling and analysis.
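To illustrate the stability property: if X ~ N(mu1, sigma1^2) and Y ~ N(mu2, sigma2^2) are independent, then a*X + b*Y ~ N(a*mu1 + b*mu2, a^2*sigma1^2 + b^2*sigma2^2). The sketch below checks the mean and standard deviation by simulation; the parameter values, sample size, and seed are arbitrary assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical parameters for two independent Gaussian variables
mu1, sigma1 = 2.0, 1.0
mu2, sigma2 = -1.0, 0.5
a, b = 3.0, 2.0
n = 200_000

x = rng.normal(mu1, sigma1, n)
y = rng.normal(mu2, sigma2, n)
z = a * x + b * y  # linear combination of the two samples

# Theoretical parameters of the combined Gaussian
expected_mean = a * mu1 + b * mu2
expected_std = np.sqrt(a**2 * sigma1**2 + b**2 * sigma2**2)

print(f"empirical mean = {z.mean():.3f}, theoretical mean = {expected_mean:.3f}")
print(f"empirical std  = {z.std():.3f}, theoretical std  = {expected_std:.3f}")
```

The empirical values should agree with the theoretical ones up to sampling noise, which shrinks as n grows.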
Gaussian Distribution In Machine Learning
The Gaussian distribution, also known as the normal distribution, plays a fundamental role in machine learning. It is a key concept used to model the distribution of real-valued random variables and is essential for understanding various statistical methods and algorithms.
Table of Contents
- Gaussian Distribution
- Gaussian Distribution Curve
- Gaussian Distribution Table
- Properties of Gaussian Distribution
- Machine Learning Methods that use Gaussian Distribution
- Implementation of Gaussian Distribution in Machine Learning