Create model using mtcars dataset
We can understand all these steps easily with the help of an example. For this example, we will use “mtcars”, a built-in R dataset that contains information about various car models. We will apply all the above-mentioned steps to this dataset:
1. Install and load necessary packages
R
# Install packages (only needed once)
install.packages("ggplot2")
# Load packages
library(ggplot2)
2. Load your data
R
# Load the mtcars dataset
data(mtcars)
3. Explore and Understand the data:
R
# Explore and understand the data
# Show the summary of our data
summary(mtcars)
Output:
mpg cyl disp hp drat
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930
wt qsec vs am gear
Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000
1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000
Median :3.325 Median :17.71 Median :0.0000 Median :0.0000 Median :4.000
Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688
3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000
Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000
carb
Min. :1.000
1st Qu.:2.000
Median :2.000
Mean :2.812
3rd Qu.:4.000
Max. :8.000
The summary() function gives us an idea about our data. We could also call head(mtcars), which shows the first 6 rows of the dataset: one row per car model, with columns such as mpg (miles per gallon), cyl (number of cylinders), hp (horsepower), etc. These are the attributes of our dataset. Next we need to specify our model. We define the formula parameter with the dependent variable on the left side of the tilde (~) and the independent variables on the right side, and then explore the fitted model with the summary() function.
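As a minimal sketch of the exploration described above, the snippet below looks at the first rows and the column types of mtcars, then checks how strongly hp and mpg move together. The variable names first_rows and hp_mpg_cor are illustrative and not part of the original example.

```r
# Peek at the first six rows and the column types of the built-in dataset
first_rows <- head(mtcars)
print(first_rows)
str(mtcars)

# Pearson correlation between horsepower and miles per gallon;
# a value close to -1 or 1 suggests a linear model may be a reasonable start
hp_mpg_cor <- cor(mtcars$hp, mtcars$mpg)
print(hp_mpg_cor)
```

A clearly negative correlation here is consistent with choosing hp as the predictor of mpg in the next step.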
4. Create the model
R
# Fit a linear regression model
model <- lm(mpg ~ hp, data = mtcars)
5. Get a model summary:
R
# Get summary
summary(model)
Output:
Call:
lm(formula = mpg ~ hp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.7121 -2.1122 -0.8854 1.5819 8.2360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
hp -0.06823 0.01012 -6.742 1.79e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.863 on 30 degrees of freedom
Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
- For a fitted model, summary() returns the call, a five-number summary of the residuals (minimum, 1st quartile, median, 3rd quartile and maximum), the coefficient table and the overall fit statistics.
- Residuals are the differences between the observed values and the values predicted by the model.
- “Estimate” gives the fitted coefficients: the intercept (30.09886) and the slope for hp (-0.06823), meaning each additional unit of horsepower is associated with about 0.068 fewer miles per gallon.
- “Std. Error” represents the standard error of the coefficient estimates.
- “t value” is the estimate divided by its standard error and is used to test the significance of the coefficients.
- “Pr(>|t|)” represents the p-value associated with each coefficient.
- “Multiple R-squared”, which is 0.6024 here, represents the proportion of variance in the dependent variable (mpg) that is explained by the model.
- “Adjusted R-squared”, 0.5892 in this summary, is a modified version of R-squared that is adjusted for the number of predictors in our model.
- “F-statistic” tests the overall significance of the model. Here its p-value is very small (1.788e-07), so the model is statistically significant.
The summary gives us insight into the fitted model. Now that we understand our data and our model, we can make predictions according to our needs.
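The quantities shown in the printed summary can also be pulled out as numbers for later use. The following is a sketch under the same model as above; the names model_summary, coef_table, r_squared, adj_r_squared and hp_ci are chosen here for illustration.

```r
# Refit the model so this snippet runs on its own
model <- lm(mpg ~ hp, data = mtcars)
model_summary <- summary(model)

# Coefficient table as a matrix with columns
# Estimate, Std. Error, t value and Pr(>|t|)
coef_table <- coef(model_summary)
print(coef_table)

# Goodness-of-fit statistics stored in the summary object
r_squared <- model_summary$r.squared
adj_r_squared <- model_summary$adj.r.squared

# 95% confidence interval for the hp slope
hp_ci <- confint(model, "hp", level = 0.95)
print(hp_ci)
```

Working with the stored values instead of the printed text makes it easier to compare several models or to report results programmatically.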
6. Make Predictions
R
# Predict mpg for a car with 300 horsepower
new_data <- data.frame(hp = 300)
predicted_mpg <- predict(model, new_data)
predicted_mpg
Output:
9.630377
9.630 is the predicted mpg (miles per gallon) for a car with 300 horsepower. We can also plot the model on a graph.
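To see where this number comes from, the prediction can be reproduced by hand from the fitted coefficients (intercept + slope × hp). The sketch below also asks predict() for a 95% prediction interval, which is an addition to the original example. The names manual_mpg and pred_interval are illustrative.

```r
# Refit the model so this snippet runs on its own
model <- lm(mpg ~ hp, data = mtcars)
new_data <- data.frame(hp = 300)

# Manual prediction: intercept + slope * horsepower
b <- coef(model)
manual_mpg <- unname(b["(Intercept)"] + b["hp"] * 300)
print(manual_mpg)

# Same point prediction plus lower and upper bounds of a
# 95% prediction interval (columns fit, lwr, upr)
pred_interval <- predict(model, new_data, interval = "prediction", level = 0.95)
print(pred_interval)
```

The interval is a reminder that a single predicted value carries uncertainty, especially near the edge of the observed horsepower range.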
7. Visualize our model
R
# Create a scatter plot using ggplot2
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", color = "green") +
  labs(x = "Horsepower", y = "Miles Per Gallon", title = "Scatterplot of hp vs. mpg")
Output:
The line is the linear regression fit drawn by the geom_smooth() function, and the dots represent the cars in our dataset. Because se is left at its default value (TRUE), the line is drawn with a shaded confidence band. The plot shows how mpg tends to decrease as horsepower increases, which helps us in decision making.
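The predicted value for a 300 horsepower car can be marked on the same kind of plot. This is a sketch, assuming ggplot2 is installed; the red point, the subtitle text and the names predicted_mpg and p are illustrative choices, not part of the original example.

```r
library(ggplot2)

# Fit the model and compute the prediction to highlight
model <- lm(mpg ~ hp, data = mtcars)
predicted_mpg <- unname(predict(model, data.frame(hp = 300)))

# Scatter plot with the regression line and the predicted point in red
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", formula = y ~ x, color = "green") +
  annotate("point", x = 300, y = predicted_mpg, color = "red", size = 3) +
  labs(x = "Horsepower", y = "Miles Per Gallon",
       title = "Scatterplot of hp vs. mpg",
       subtitle = "Red point: predicted mpg at 300 hp")
p
```

The red point lies exactly on the regression line, since the prediction is just the line evaluated at hp = 300.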
In this example, we used R's built-in “mtcars” dataset, which has information about cars and their attributes. We explored the data and then predicted miles per gallon for a car with 300 horsepower. Such predictions help in product analysis, in comparison with other cars, and in performance analysis of a particular car. The summary statistics help us understand the average attributes of the cars in the dataset. We also visualized the data and the fitted regression line with a scatter plot for better understanding.
In this example we will run linear regression on a small dataset in which we want to understand the relationship between the amount of rainfall and the yield of a specific crop. For this we will create our own fictional dataset and perform linear regression analysis.
Conclusion
In this article, we learned the seven steps needed to run a linear regression analysis in R. We understood the concept with the help of four different examples from different fields: education, weather forecasting, wage estimation, and prediction using a cars dataset. We also worked with data in different ways, from using a dataset built into R to loading a dataset from another website and then analyzing it. We learned to build models and predict values based on historical data, and we dealt with datasets of different sizes. Linear regression analysis helps us understand the relationships and dependencies between variables, which supports smart decision making and helps avoid waste through better preparation.
7 Steps to Run a Linear Regression Analysis using R
Linear Regression is a useful statistical tool for modelling the relationship between a dependent variable and one or more independent variables. It is widely used in many disciplines, such as science, medicine, economics, and education. For instance, several areas of education employ linear regression to estimate student performance or identify the factors influencing student performance. It can also be applied in the healthcare industry to comprehend how different elements, such as age and diet, affect a certain medical condition. This aids in inference and improves the accuracy of forecasts. The seven stages of performing a linear regression analysis will be covered in this post.
The seven steps to run a linear regression analysis are:
- Install and load necessary packages
- Load your data
- Explore and Understand the data
- Create the model
- Get a model summary
- Make predictions
- Plot and visualize your model
We will first go through the syntax of each step and then apply the steps in different examples.
Step 1: Install and load the necessary packages
Before we start our linear regression analysis, we must install the necessary packages, which help us visualize and plot our data. For example, we can install packages like “ggplot2” and “dplyr” in R for better analysis. The syntax to install and load a package is:
# Install and load package
install.packages("ggplot2")
library(ggplot2)
Step 2: Load your data
We need to import our data into R using functions like “read.csv” or “read.table”, ensuring that the data is stored in a data frame. Here, “your_data.csv” is the path of the file you want to read into R.
# Load your data
data <- read.csv("your_data.csv")
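Since “your_data.csv” is only a placeholder, the following sketch shows the same step end to end by first writing the built-in mtcars dataset to a temporary CSV file and then reading it back. The temporary file and the name csv_path are assumptions made purely for illustration.

```r
# Write a built-in dataset to a temporary CSV file (stand-in for your own file)
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the file back; read.csv() returns a data frame
data <- read.csv(csv_path)

# Confirm the type and the dimensions of what was loaded
print(is.data.frame(data))
print(dim(data))
```

Checking the dimensions right after loading is a quick way to catch a wrong separator or a missing header row.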
Step 3: Explore and Understand the data
It is important to understand the data we are dealing with. To get this understanding, we can use the summary(), str(), head() and tail() functions in R. str() displays the structure of our data compactly, which is especially useful when the dataset is large.
summary() gives the minimum, maximum, mean, median, and 1st and 3rd quartiles for our data to get a better understanding.
head() returns the first parts of our data frame whereas the tail() function returns the last part.
# Explore and understand the data
summary(data)
head(data)
tail(data)
str(data)
Step 4: Create the model
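Creating a linear regression model means estimating the relationship between a dependent variable and one or more independent variables. In R, the lm() function is used to create linear regression models. Its two main parameters are formula and data. The formula specifies the model, with the dependent variable on the left side of the tilde (~) and the independent variables on the right side, and data is the data frame that contains those variables.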
# Create a linear regression model
model <- lm(formula, data = data)
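As a concrete sketch of the formula argument, the snippet below fits one model with a single predictor and one with two predictors on the built-in mtcars dataset. The choice of mtcars and the names simple_model and multi_model are illustrative.

```r
# One predictor: mpg explained by horsepower
simple_model <- lm(mpg ~ hp, data = mtcars)

# Two predictors: add weight (wt) on the right side with +
multi_model <- lm(mpg ~ hp + wt, data = mtcars)

# Each model has an intercept plus one coefficient per predictor
print(coef(simple_model))
print(coef(multi_model))
```

Additional predictors are added to the right side of the tilde with +, so the same lm() call covers both simple and multiple linear regression.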
Step 5: Get a model summary
The summary() function in R is used to get a summary of our model. It returns detailed information about the fit, such as the coefficients, R-squared and p-values, along with the minimum, maximum, median, and 1st and 3rd quartiles of the residuals.
# Get model summary
summary(model)
Step 6: Make Predictions
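Once our model is fit, we can make predictions on new data and draw conclusions. We use predict() to get predictions from our model, passing the new observations through its newdata argument.
# Make predictions
# new_data is a data frame with the same predictor columns used to fit the model
predictions <- predict(model, newdata = new_data)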
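The role of the newdata argument can be seen in a small sketch using the built-in mtcars dataset. When newdata is supplied, predict() returns one value per new row; when it is omitted, predict() returns the fitted values for the rows used to train the model. The names new_cars, new_predictions and fitted_predictions are illustrative.

```r
# Fit a simple model on the built-in dataset
model <- lm(mpg ~ hp, data = mtcars)

# New observations must use the same predictor name (hp) as the model
new_cars <- data.frame(hp = c(100, 200))
new_predictions <- predict(model, newdata = new_cars)
print(new_predictions)

# Without newdata, predict() returns fitted values for the training rows
fitted_predictions <- predict(model)
print(length(fitted_predictions))
```

If the column names in the new data frame do not match the predictors in the formula, predict() cannot find the variables, so it is worth checking the names before predicting.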
Step 7: Visualize our model
Visualizing our model helps us understand it and assess the fit, and we can plot as many graphs as we need. The packages we installed in step one help us in this step: “ggplot2” is used for plotting graphs in R. In the code, se = FALSE means we don’t want to include a shaded confidence interval around the line.
# Create a scatterplot with the regression line
# (each line ends with + so that R continues the expression)
ggplot(data, aes(x = predictor_variable, y = response_variable)) +
  # Add points to the plot
  geom_point() +
  # Add a regression line to the scatterplot
  geom_smooth(method = "lm", se = FALSE)