Why Calculate SSD
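The sum of squared deviations (SSD) serves several purposes:
- Quantifying Variability: SSD measures how far the data points deviate from the mean, which makes it a basis for analysing dispersion. Dividing SSD by n - 1 gives the sample variance.
- Assessing Data Spread: For a fixed number of observations, a larger SSD indicates that the data points are spread more widely around the mean.
- Identifying Outliers: Points with unusually large squared deviations stand out as potential outliers, which can then be handled so that they do not distort predictions and reduce accuracy.
- Evaluating Model Fit: In regression, the SSD of the response is the total sum of squares, against which the residual sum of squares is compared to judge how well a model fits.
- Statistical Hypothesis Testing: Sums of squared deviations underlie test statistics such as the F-statistic in ANOVA, which are used to decide whether to reject the null hypothesis.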
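As a rough illustration of how SSD quantifies variability, the sketch below checks that dividing the SSD by n - 1 reproduces the sample variance. The vector `x` and the variable names are made up for this example; only `var()`, `sd()`, and `all.equal()` are base R functions.

```r
# Hypothetical values, chosen only for illustration
x <- c(3, 7, 8, 12)

# Sum of squared deviations from the mean
ssd_x <- sum((x - mean(x))^2)

# Sample variance and standard deviation derived from the SSD
var_from_ssd <- ssd_x / (length(x) - 1)
sd_from_ssd <- sqrt(var_from_ssd)

# Compare with R's built-in functions (all.equal tolerates rounding error)
print(all.equal(var_from_ssd, var(x)))
print(all.equal(sd_from_ssd, sd(x)))
```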
Verification of SSD using R and mathematics
Consider a small dataset: [2, 4, 4, 4, 5]. We will calculate its SSD by hand and then in R. First, we calculate the mean.
- $\bar{x} = (2 + 4 + 4 + 4 + 5) / 5 = 19 / 5 = 3.8$
- $\mathrm{SSD} = (2 - 3.8)^2 + (4 - 3.8)^2 + (4 - 3.8)^2 + (4 - 3.8)^2 + (5 - 3.8)^2 = 3.24 + 0.04 + 0.04 + 0.04 + 1.44 = 4.8$
We can also verify it using R:
R
# Manually create the dataset
example_data <- c(2, 4, 4, 4, 5)

# Calculate the mean
mean_example <- mean(example_data)

# Calculate the sum of squared deviations
ssd_example <- sum((example_data - mean_example)^2)

# Print the result
print(paste("Sum of Squared Deviations (SSD) for example data:", ssd_example))
Output:
[1] "Sum of Squared Deviations (SSD) for example data: 4.8"
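The same value can be cross-checked with the algebraically equivalent shortcut formula, SSD = sum(x^2) - n * mean(x)^2. The sketch below is an illustrative check rather than part of the original walkthrough; `ssd_definition` and `ssd_shortcut` are names introduced here.

```r
example_data <- c(2, 4, 4, 4, 5)
n <- length(example_data)

# Definition: sum of squared deviations from the mean
ssd_definition <- sum((example_data - mean(example_data))^2)

# Shortcut: sum of squares minus n times the squared mean
ssd_shortcut <- sum(example_data^2) - n * mean(example_data)^2

# The two agree up to floating-point rounding
print(all.equal(ssd_definition, ssd_shortcut))
```

The shortcut can lose precision when the mean is large relative to the spread, so the definition is usually the safer choice in practice.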
Calculating SSD of Temperature
In this example, we will calculate the SSD of a fictional dataset in more than one way. The dataset represents the daily temperature of a city over a year (365 days).
R
# Creating a fictional dataset for daily temperatures
set.seed(123)  # For reproducibility
days <- 1:365  # Day numbers for a year
temperatures <- rep(75, 365) + rnorm(365, mean = 0, sd = 5)
temperature_data <- data.frame(Day = days, Temperature = temperatures)

# Displaying the dataset
print(head(temperature_data))
Output:
Day Temperature
1 1 72.19762
2 2 73.84911
3 3 82.79354
4 4 75.35254
5 5 75.64644
6 6 83.57532
Calculating SSD using formula
To calculate SSD with the formula, we first compute the mean and then sum the squared deviations from it. The code below prints the mean as well as the SSD.
R
# Calculate the mean of daily temperatures
mean_temperature <- mean(temperature_data$Temperature)

# Calculate the Sum of Squared Deviations (SSD)
ssd_temperature <- sum((temperature_data$Temperature - mean_temperature)^2)

# Print the results
print(paste("Mean Daily Temperature:", mean_temperature))
print(paste("Sum of Squared Deviations (SSD) for Daily Temperature:", ssd_temperature))
Output:
[1] "Mean Daily Temperature: 75.1593605854571"
[1] "Sum of Squared Deviations (SSD) for Daily Temperature: 8520.02456165882"
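If SSD is needed in several places, the calculation can be wrapped in a small helper. The function below is only a sketch: `ssd()` and its `na.rm` argument are names introduced here for illustration, not a base R function.

```r
# Hypothetical helper: sum of squared deviations of a numeric vector
ssd <- function(x, na.rm = FALSE) {
  stopifnot(is.numeric(x))
  if (na.rm) {
    x <- x[!is.na(x)]  # drop missing values before centring
  }
  sum((x - mean(x))^2)
}

# Usage on a small vector, with and without a missing value
print(ssd(c(2, 4, 4, 4, 5)))
print(ssd(c(2, 4, NA, 4, 4, 5), na.rm = TRUE))
```

With the default `na.rm = FALSE`, a vector containing `NA` yields `NA`, which mirrors how `mean()` and `sum()` behave.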
Calculating SSD using Matrix Algebra
We can also calculate SSD using matrix algebra, as the inner product of the centred vector with itself. The result matches the formula-based value up to floating-point rounding.
R
# Centre the temperatures around their mean
centered <- temperature_data$Temperature - mean(temperature_data$Temperature)

# SSD as the inner product of the centred vector with itself (a 1 x 1 matrix)
ssd_matrix <- t(centered) %*% centered

print(paste("Matrix Algebra SSD:", ssd_matrix))
Output:
[1] "Matrix Algebra SSD: 8520.02456165883"
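The difference in the last printed digit between the formula and matrix results comes from floating-point rounding. One way to compare the two robustly is sketched below, using base R's `crossprod()` for the inner product and `all.equal()` for a tolerance-based comparison. The temperatures are regenerated with the same seed so that the block runs on its own; `ssd_formula` and `ssd_crossprod` are names introduced here.

```r
# Recreate the fictional temperatures with the same seed
set.seed(123)
temperatures <- rep(75, 365) + rnorm(365, mean = 0, sd = 5)

# Centre the data once and reuse it
centered <- temperatures - mean(temperatures)

# Formula-based SSD
ssd_formula <- sum(centered^2)

# crossprod(x) computes t(x) %*% x; drop() turns the 1 x 1 matrix into a scalar
ssd_crossprod <- drop(crossprod(centered))

# Compare with a numerical tolerance instead of exact equality
print(all.equal(ssd_formula, ssd_crossprod))
```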
Calculating SSD of mtcars dataset
We can also calculate the SSD for mtcars, a well-known dataset built into R. It contains fuel consumption and several design and performance measurements for 32 car models.
R
# Load the mtcars dataset
data(mtcars)

# Calculate the mean of the mpg column
mean_value <- mean(mtcars$mpg)

# Calculate the sum of squared deviations
ssd <- sum((mtcars$mpg - mean_value)^2)

# Print the result
print(paste("Sum of Squared Deviations (SSD):", ssd))
Output:
[1] "Sum of Squared Deviations (SSD): 1126.0471875"
Here, we calculated the SSD for the mpg column of the dataset. We can also visualize the squared deviation of each point on a scatter plot using the ggplot2 library.
R
# Load ggplot2 (the linewidth argument requires ggplot2 3.4.0 or later)
library(ggplot2)

# Create a scatter plot of each point's squared deviation,
# with a dashed line marking the total SSD
ggplot(mtcars, aes(x = mpg, y = (mpg - mean_value)^2)) +
  geom_point(color = "blue", size = 3) +
  geom_hline(yintercept = ssd, linetype = "dashed", color = "red", linewidth = 1) +
  labs(title = "Sum of Squared Deviations in mtcars Dataset",
       x = "mpg",
       y = "Squared Deviations from Mean") +
  theme_minimal()
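Output:
[Plot: "Sum of Squared Deviations in mtcars Dataset"; x-axis: mpg, y-axis: Squared Deviations from Mean]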
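SSD also appears when judging model fit: for a linear model with an intercept, the SSD of the response is the total sum of squares used in R-squared. The sketch below assumes a simple regression of `mpg` on `wt`, a model chosen here purely for illustration; the names `tss`, `rss`, and `r_squared` are introduced for this example.

```r
data(mtcars)

# Total sum of squares: the SSD of the response variable
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Illustrative model (an assumption for this example)
fit <- lm(mpg ~ wt, data = mtcars)

# Residual sum of squares: squared deviations of observations from fitted values
rss <- sum(residuals(fit)^2)

# R-squared computed by hand from the two sums of squares
r_squared <- 1 - rss / tss

# Should agree with the value reported by summary()
print(all.equal(r_squared, summary(fit)$r.squared))
```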
Conclusion
In this article, we calculated SSD for several datasets in R and verified the result by hand on a small example.
Calculating Sum Of Squared Deviations In R
Statistics plays an important role in data handling and analysis. Many statistical concepts are used to understand the nature of data, one of which is the Sum of Squared Deviations (SSD). It is a fundamental quantity in statistics that describes the variability in a dataset.
In this article, we will learn how to calculate SSD mathematically and in the R programming language.