Exploratory Data Analysis (EDA)
Understanding and assessing the data you have is one of the most important steps in preparing for modeling. It is accomplished through data exploration, visualization, and statistical summarization, including measures of central tendency. During this phase you build a broad understanding of your data to get ready for the modeling step.
R
# Summary statistics of the data
summary(iris)
Output:
R
# Visualizing the outliers by using a boxplot
# As we use ggplot2, we take only the numerical
# variables by subsetting the dataset
df <- subset(iris,
             select = c(Sepal.Length, Sepal.Width,
                        Petal.Length, Petal.Width))
Using the reshape2 package, we melt the data into long format and plot it to check for the presence of any outliers. We execute the following code:
R
# Plot the box plot of each variable
library(ggplot2)
library(reshape2)

ggplot(data = melt(df), aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable))
Output:
We can see that there are outliers in the second column of the dataset, Sepal.Width.
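The boxplot impression can also be confirmed numerically with the usual 1.5 × IQR rule that boxplots use to flag points. This is a minimal sketch in base R; the helper `iqr_outliers` is our own illustrative function, not part of any package:

```r
# Flag values lying outside 1.5 * IQR beyond the quartiles
# (the same rule geom_boxplot uses to draw outlier points)
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
}

# Sepal.Width is the only iris variable with boxplot outliers
iqr_outliers(iris$Sepal.Width)
```

Running this on the other three columns returns no values, which matches what the boxplot shows.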
R
# Bar plot of our categorical variable, Species
ggplot(data = iris, aes(x = Species, fill = Species)) +
  geom_bar()
Output:
Let’s now use histograms to visualize the distributions of our data’s continuous variables:
R
# Visualize the histogram of all numerical variables of the
# dataset, arranged in a grid with ggarrange() from ggpubr
library(ggpubr)

a <- ggplot(data = iris, aes(x = Petal.Length)) +
  geom_histogram(color = "red", fill = "blue", alpha = 0.1) +
  geom_density()
b <- ggplot(data = iris, aes(x = Petal.Width)) +
  geom_histogram(color = "red", fill = "blue", alpha = 0.1) +
  geom_density()
c <- ggplot(data = iris, aes(x = Sepal.Length)) +
  geom_histogram(color = "red", fill = "blue", alpha = 0.1) +
  geom_density()
d <- ggplot(data = iris, aes(x = Sepal.Width)) +
  geom_histogram(color = "red", fill = "blue", alpha = 0.1) +
  geom_density()

ggarrange(a, b, c, d + rremove("x.text"),
          labels = c("a", "b", "c", "d"),
          ncol = 2, nrow = 2)
Output:
Next, we will move to the Data Preparation phase of our machine learning process. Before that, let’s split our dataset into a training partition and a held-out validation (test) partition:
R
# Create a train-test split of the data
library(caret)

limits <- createDataPartition(iris$Species, p = 0.80, list = FALSE)

# Select 20% of the data for validation
testiris <- iris[-limits, ]

# Use the remaining 80% for training
# and testing the models
trainiris <- iris[limits, ]
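Because createDataPartition samples within each class, the class proportions are preserved in both partitions. This is a quick sanity check, shown here as a self-contained sketch (the set.seed call is our addition, for reproducibility; it is not part of the original code):

```r
library(caret)

set.seed(42)  # reproducible split (an added assumption)
limits <- createDataPartition(iris$Species, p = 0.80, list = FALSE)
trainiris <- iris[limits, ]
testiris <- iris[-limits, ]

# Stratified sampling keeps the three classes equally represented
prop.table(table(trainiris$Species))  # one third each
nrow(trainiris)  # 120 rows (80% of 150)
nrow(testiris)   # 30 rows
```

With 50 rows per species, an 80% stratified split always takes 40 rows from each class, giving exactly 120 training and 30 validation rows.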
Implement Machine Learning With Caret In R
Today, technological answers to human problems are appearing in practically every field of knowledge. Every aspect of daily life generates data, and technology solutions base their decisions on insights drawn from that data. Machine Learning algorithms and approaches emerged in this setting to build systems that learn from collected data and can help address such problems. So, what exactly is Machine Learning?
Machine Learning is the practice of building computer systems that learn to perform a task from data, using mathematical and statistical models, without being given explicit instructions.
In this article, we’ll examine fundamental machine learning ideas and methods, and walk through a step-by-step machine learning model development procedure using the R programming language’s caret library.
Accordingly, there are two categories in which we may place machine learning algorithms:
- Supervised learning: works with labeled data and is used for prediction.
- Unsupervised learning: works with unlabeled data and is used for description, i.e. finding structure in the data.
Depending on the goal and the available data, one may select either of these two approaches.
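As a quick illustration of the supervised case, caret’s train() function can fit a classifier on the labeled iris data in a few lines. This is a minimal sketch, not the article’s full procedure; the choice of a decision tree (method = "rpart") is our own, made here only because it requires no extra tuning:

```r
library(caret)

# Supervised learning: Species is the label we predict
# from the four numeric measurements
model <- train(Species ~ ., data = iris, method = "rpart")

# Predict labels for the first few rows of the data
predict(model, head(iris))
```

Model selection, tuning, and proper evaluation on held-out data are covered in the steps that follow.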