Quantitative Descriptive Statistical Analysis

Now we will try to analyze different statistical measures of the iris data using functions like summary(), groupby(), fivenum(), and other central tendencies.

R

# iris is inbuilt data set in 
# r studio so we will use that
data(iris)
df <- iris
 
head(df) 
str(df) 

Output:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
'data.frame':    150 obs. of  5 variables:
$ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Now let’s print the minimum and the maximum value of the SepalLength feature in the iris dataset for this we can use min(), max() functions.

R

print(min(df$Sepal.Length))
print(max(df$Sepal.Length))

Output:

[1] 4.3
[1] 7.9

To find particular central tendencies like mean, mode, median, quantile, and percentile we have inbuilt functions for them by their name only which can be used to find a particular measure.

R

# mean
print(mean(df$Sepal.Length))
 
# median
print(median(df$Sepal.Length))
print(quantile(df$Sepal.Length, 0.25))
print(quantile(df$Sepal.Length, 0.75))
print(IQR(df$Sepal.Length))
 
# mode
print(sort(-table(df$Sepal.Length))[1]) 
print(sort(table(df$Sepal.Length),
           decreasing = TRUE)[1]) 

Output:

[1] 5.843333
[1] 5.8
25% 
5.1 
75% 
6.4 
[1] 1.3
  5 
-10 
 5 
10

Similarly for the standard deviation and variance we have std() and var() functions in R Programming Language.

R

print(sd(df$Sepal.Length))
print(var(df$Sepal.Length))

Output:

[1] 0.8280661
[1] 0.6856935

The fivenum function in the R gives the five-number summary of the data which includes a minimum value, maximum value, median, first quartile, and third quartile.

R

# fivenum() provides the basic five number
# summary of the data like the boxplot
data <- iris
 
fivenum(iris$Sepal.Length)
fivenum(iris$Sepal.Width)

Output:

4.35.15.86.47.9
22.833.34.4

We can use the summary() function as well to calculate all the above-mentioned statistical measures as shown below.

R

summary(df)

Output:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

R

by(df, df$Species, summary)

Output:

df$Species: setosa
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  
--------------------------------------------------------------------- 
df$Species: versicolor
  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
 Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
 1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
 Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
 Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
 3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
 Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
--------------------------------------------------------------------- 
df$Species: virginica
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
 1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
 Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
 Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
 3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
 Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500

We also have different packages as well which provide such functions which are mainly to get the descriptive statistics of the dataset. For example, we have stat.desc() function in the pastecs packages which provides all the statistical measures of the dataset.

R

library(pastecs)
 
stat.desc(df)

Output:

             Sepal.Length  Sepal.Width Petal.Length  Petal.Width Species
nbr.val      150.00000000 150.00000000  150.0000000 150.00000000      NA
nbr.null       0.00000000   0.00000000    0.0000000   0.00000000      NA
nbr.na         0.00000000   0.00000000    0.0000000   0.00000000      NA
min            4.30000000   2.00000000    1.0000000   0.10000000      NA
max            7.90000000   4.40000000    6.9000000   2.50000000      NA
range          3.60000000   2.40000000    5.9000000   2.40000000      NA
sum          876.50000000 458.60000000  563.7000000 179.90000000      NA
median         5.80000000   3.00000000    4.3500000   1.30000000      NA
mean           5.84333333   3.05733333    3.7580000   1.19933333      NA
SE.mean        0.06761132   0.03558833    0.1441360   0.06223645      NA
CI.mean.0.95   0.13360085   0.07032302    0.2848146   0.12298004      NA
var            0.68569351   0.18997942    3.1162779   0.58100626      NA
std.dev        0.82806613   0.43586628    1.7652982   0.76223767      NA
coef.var       0.14171126   0.14256420    0.4697441   0.63555114      NA

R

cor(df$Sepal.Length, df$Sepal.Width)

Output:

[1] -0.1175698

Another such example is the describe() function of the psych package which is similar to the describe() method available in the pandas library of Python.

R

install.packages("psych")
library(psych)
 
data <- iris
describe(iris)

Output:

             vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31    -0.61 0.07
Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31     0.14 0.04
Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27    -1.42 0.14
Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10    -1.36 0.06
Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00    -1.52 0.07

Descriptive Statistic in R

Data analysis is a crucial part of any machine learning model development cycle because this helps us get an insight into the data at hand and whether it is suitable or not for the modeling purpose or what are the main key points where we should work to make data cleaner and fit for future uses so, that the valuable insights can be extracted from them. In this article, we will perform the Descriptive Statistical Data Analysis on the iris dataset using R Programming Language.

Quantitative Descriptive Statistical Analysis

R

R

R

R

R

R

R

R

R

R

Descriptive Statistic in R

Categories

Contact US

Quantitative Descriptive Statistical Analysis

R

R

R

R

R

R

R

R

R

R

Descriptive Statistic in R

Similar Reads

Categories

Contact US