Exploratory Data Analysis (EDA)

Understanding and assessing your data is one of the most important steps in preparing to model. It is done through data exploration, visualization, and statistical summaries such as measures of central tendency. In this phase you take a broad view of the data to understand it and get ready for the modeling step.
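As a minimal sketch of the measures mentioned above, the mean and median (central tendency) and the standard deviation (spread) can be computed per numeric column in base R. The names `num_cols` and `stats` are illustrative and not part of the original tutorial.

```r
# Keep only the numeric columns of the built-in iris data
num_cols <- iris[sapply(iris, is.numeric)]

# Mean and median describe central tendency; sd describes spread
stats <- data.frame(
  mean   = sapply(num_cols, mean),
  median = sapply(num_cols, median),
  sd     = sapply(num_cols, sd)
)

print(round(stats, 2))
```

A large gap between a column's mean and median is an early hint of skew or outliers worth checking in the plots that follow.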

R




# Summary statistics of data
summary(iris)


Output:

 

R




# Visualize outliers with box plots.
# Box plots need numeric values, so keep only
# the four numeric columns of iris
df <- subset(iris, select = c(Sepal.Length, 
                              Sepal.Width, 
                              Petal.Length, 
                              Petal.Width))


Using the reshape package, we melt the data into long format and plot it to check for outliers:

R




# plot and see the box plot of each variable
ggplot(data = melt(df), 
       aes(x=variable, y=value)) + 
        geom_boxplot(aes(fill=variable))


Output:

 

We can see that there are outliers in the second variable of the dataset, Sepal.Width.
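The points drawn beyond the whiskers can also be listed numerically. The sketch below applies the common 1.5 × IQR rule to `Sepal.Width`, which is the same rule ggplot2's `geom_boxplot()` uses by default for its whiskers. The variable names (`lower`, `upper`, `outlier_rows`) are illustrative, and quartiles from `quantile()` may differ slightly from the hinges a box plot uses on other datasets.

```r
x <- iris$Sepal.Width

# First and third quartiles, and the interquartile range
q   <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: values outside [lower, upper] are flagged as outliers
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

outlier_rows <- which(x < lower | x > upper)

# Show the flagged observations together with their class
print(iris[outlier_rows, c("Sepal.Width", "Species")])
```

Whether flagged rows should be removed, capped, or kept is a modeling decision; the rule only identifies candidates.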

R




# To plot the barplot for our 
# categorical variable
ggplot(data = iris, 
       aes(x = Species, fill = Species)) +
    geom_bar()


Output:

 

Let’s now use histograms to visualize the distributions of our data’s continuous variables:

R




# Draw a histogram with a density curve for each
# numerical variable of the dataset
a <- ggplot(data = iris, aes(x = Petal.Length)) +
    geom_histogram(color = "red",
                   fill = "blue",
                   alpha = 0.1) + geom_density()
  
b <- ggplot(data = iris, aes(x = Petal.Width)) +
    geom_histogram(color = "red",
                   fill = "blue",
                   alpha = 0.1) + geom_density()
c <- ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_histogram(color = "red",
                   fill = "blue",
                   alpha = 0.1) + geom_density()
  
d <- ggplot(data = iris, aes(x = Sepal.Width)) +
    geom_histogram(color = "red",
                   fill = "blue",
                   alpha = 0.1) + geom_density()
  
# Arrange the four plots in a 2 x 2 grid
ggarrange(a, b, c, d + rremove("x.text"), 
          labels = c("a", "b", "c", "d"),
          ncol = 2, nrow = 2)


Output : 

Histogram plot
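One detail worth checking when a density curve is overlaid on a histogram is that the two share a scale: histogram bars show counts by default, while a density curve integrates to 1. The base-R sketch below (not the tutorial's ggplot2 code; the bin count of 30 is an assumption chosen to resemble ggplot2's default) makes the difference concrete.

```r
x <- iris$Petal.Length

# Compute the histogram without drawing it
h <- hist(x, breaks = 30, plot = FALSE)
bin_width <- diff(h$breaks)

# Counts add up to the number of observations ...
total_count <- sum(h$counts)

# ... while density-scaled bars have a total area of 1
total_area <- sum(h$density * bin_width)

# Kernel density estimate, which is on the density scale
d <- density(x)

cat("sum of counts:       ", total_count, "\n")
cat("area of density bars:", total_area, "\n")
cat("tallest count bar:   ", max(h$counts), "\n")
cat("peak of density:     ", round(max(d$y), 3), "\n")
```

In ggplot2, mapping `y = after_stat(density)` inside `geom_histogram()` puts the bars on the density scale so that the curve and the bars are directly comparable.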

Next, we will move to the Data Preparation phase of our machine learning process. Before that, let's split our dataset into a training partition (80%) and a held-out test partition (20%):

R




# Create a stratified 80/20 split of the data;
# limits holds the row indices of the 80% partition
limits <- createDataPartition(iris$Species, 
                              p=0.80, 
                              list=FALSE)
  
# hold out the remaining 20% of the rows
testiris <- iris[-limits,]
  
# use the selected 80% for training 
# and tuning the models
trainiris <- iris[limits,]
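To illustrate what a stratified split means, the base-R sketch below samples 80% of the rows within each species so that class proportions are preserved. It is not caret's implementation; the seed and the names `rows_by_class`, `train_sketch`, and `test_sketch` are assumptions made for the example.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Row indices grouped by class label
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)

# Sample 80% of the rows inside each class
train_rows <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

train_sketch <- iris[train_rows, ]
test_sketch  <- iris[-train_rows, ]

# Both partitions keep the same class balance as the full data
print(table(train_sketch$Species))
print(table(test_sketch$Species))
```

Calling `set.seed()` before `createDataPartition()` in the same way makes the tutorial's split reproducible across runs.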


Implement Machine Learning With Caret In R

Technology-based solutions to human problems now reach practically every field of knowledge. Everyday activity generates data, and these solutions base their decisions on insights drawn from that data. Machine Learning algorithms and approaches emerged in this setting to build systems that learn from the collected data and address those problems. So, what exactly is Machine Learning? 

Machine Learning is the practice of building computer systems that learn to perform a task from data, using mathematical and statistical models, without being given explicit instructions for that task.

In this article, we’ll examine fundamental machine learning concepts and methods, and walk step by step through developing a machine learning model with the Caret library for the R programming language.

Machine learning algorithms fall into two broad categories: 

  • Supervised learning: works with labeled data and is used for prediction. 
  • Unsupervised learning: works with unlabeled data and is used for description, such as finding groups or structure. 

Depending on the goal and the available data, one selects an algorithm from one of these two categories.
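The difference between the two categories can be sketched in base R with the built-in iris data. The choice of `lm()` and `kmeans()` here is only illustrative and is not taken from the article; any supervised or unsupervised method would make the same point.

```r
# Supervised: the target (Petal.Width) is known during fitting,
# and the fitted model is used to predict it for new inputs
fit   <- lm(Petal.Width ~ Petal.Length, data = iris)
preds <- predict(fit, newdata = data.frame(Petal.Length = c(1.5, 4.5)))
print(preds)

# Unsupervised: only the four measurements are given, no labels;
# k-means groups similar rows into 3 clusters
set.seed(1)  # arbitrary seed, k-means starts from random centers
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

# The labels are used only afterwards, to inspect the clusters
print(table(cluster = clusters$cluster, species = iris$Species))
```

The supervised model is judged by how well its predictions match known targets, while the clustering has no target and is judged by how meaningful the discovered groups are.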

Steps in Machine Learning using R

To get the intended outcomes, data science problems must be decomposed into manageable tasks. In this part, we will walk through each step of implementing machine learning in R using the free and open-source Caret package. The general steps to be followed in any machine learning project are:...

Data collection and Importing

For modeling purposes, machine learning data should be gathered and imported into an R environment. These datasets may be stored in text files, spreadsheets, or SQL databases. To begin your work, you must import the datasets into the R environment. For this tutorial, however, we will import the data from R's datasets package. First, install and load all the prerequisite libraries that the project requires....
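A minimal sketch of both import routes, using only base R: loading the built-in iris data, and reading a text file with `read.csv()`. The temporary CSV file is created only so the example is self-contained; in a real project the path would point to your own file.

```r
# Route 1: load a dataset that ships with R's datasets package
data(iris)
print(dim(iris))
print(head(iris, 3))

# Route 2: import from a text file. A temporary CSV is written
# first so that there is something to read back.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

imported <- read.csv(csv_path, stringsAsFactors = TRUE)
str(imported)
```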

Exploratory Data Analysis (EDA)

...

Data Preprocessing

...

Model training and Evaluation

...

Conclusions

...