Feature Engineering
Feature engineering derives useful new features from existing ones. Here we convert the “date” column to character format, split each date on “/”, and build a new data frame with columns for date, day, month, and year.
R
# Feature engineering of date column
data1$date <- as.character(data1$date)
splitted <- strsplit(data1$date, "/")
df <- data.frame(
  date = data1$date,
  day = as.integer(sapply(splitted, `[`, 1)),
  month = as.integer(sapply(splitted, `[`, 2)),
  year = as.integer(sapply(splitted, `[`, 3)),
  stringsAsFactors = FALSE
)
head(df)
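As an alternative to splitting strings, the same components can be obtained by parsing the column into a Date with an explicit format. This is only a minimal sketch: it assumes the dates are written as day/month/year (the order implied by the split indices above) and uses a small made-up vector, `dates_chr`, in place of data1$date.

```r
# Hypothetical sample values; the real ones come from data1$date
dates_chr <- c("08/02/2013", "29/03/2013", "31/12/2014")

# Parse with an explicit format so day and month cannot be swapped silently
dates <- as.Date(dates_chr, format = "%d/%m/%Y")

# Extract each component from the parsed Date with base R's format()
parts <- data.frame(
  date  = dates_chr,
  day   = as.integer(format(dates, "%d")),
  month = as.integer(format(dates, "%m")),
  year  = as.integer(format(dates, "%Y")),
  stringsAsFactors = FALSE
)
print(parts)
```

If the file stores dates in a different order, only the `format` string needs to change.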
Next, we extract the day, month, and year values from the date column using the day(), month(), and year() functions from the lubridate package.
R
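# Fill the day, month, and year columns from the date column
library(lubridate)  # provides day(), month(), and year()
# Note: these functions expect a Date/date-time or an unambiguous
# string such as yyyy-mm-dd
see_the_change <- data.frame(
  date = data1$date,
  day = day(data1$date),
  month = month(data1$date),
  year = year(data1$date),
  stringsAsFactors = FALSE
)
head(see_the_change)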
Now we add the day, month, and year columns to the data1 dataset itself.
R
# Add the new columns to the data1 dataset
data1$day <- day(data1$date)
data1$month <- month(data1$date)
data1$year <- year(data1$date)

# View the updated data frame
head(data1)
We create a new column “is_quarter_end” in data1 to flag dates that fall in a quarter-ending month. Using the modulo operator (%%), we check whether the month value is divisible by 3 (March, June, September, or December). If it is, we assign 1; otherwise, we assign 0.
R
# Quarterly data
data1$is_quarter_end <- ifelse(data1$month %% 3 == 0, 1, 0)
head(data1, 5)
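The modulo test sets the flag for every date in March, June, September, and December, not only for the last day of the quarter. If a flag for the last calendar day of a quarter is wanted instead, one possible sketch in base R is shown here. The dates are made up, and `is_quarter_end_day` is a hypothetical name. It checks that the month is a multiple of 3 and that the following day falls in a different month.

```r
# Made-up dates for illustration
d <- as.Date(c("2013-03-15", "2013-03-31", "2013-06-30", "2013-07-31"))

month_num <- as.integer(format(d, "%m"))

# Month-level flag, as in the tutorial: any date in Mar, Jun, Sep, or Dec
is_quarter_end_month <- ifelse(month_num %% 3 == 0, 1, 0)

# Day-level flag: quarter-ending month AND the next day is in a new month
next_month <- as.integer(format(d + 1, "%m"))
is_quarter_end_day <- ifelse(month_num %% 3 == 0 & next_month != month_num, 1, 0)

print(data.frame(date = d, is_quarter_end_month, is_quarter_end_day))
```

Markets are closed on weekends and holidays, so the last trading day of a quarter can fall before the calendar quarter end; this sketch does not account for that.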
First, we select the numeric columns (open, high, low, close) and the date column from “data1” to create a new data frame called “data_num”. We then extract the year from the “date” column using the “lubridate::year” function and add it as a new column in “data_num”. Next, we group the data by year and calculate the mean of each numeric column. Finally, we create a bar plot for each numeric column, displaying the mean value for each year.
R
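# Bar plots of the yearly means
library(dplyr)  # provides %>%, group_by(), summarise(), and across()

nume_column <- c("open", "high", "low", "close")
data_num <- data1[, c("date", nume_column)]

# Keep rows whose values are numeric (a no-op when all four columns are already numeric)
data_num <- data_num[apply(data_num[, nume_column], 1, function(x) all(is.numeric(x))), ]
data_num$year <- lubridate::year(data_num$date)

# Mean of each numeric column per year
data_grouped <- data_num %>%
  group_by(year) %>%
  summarise(across(all_of(nume_column), mean))

# One panel per column in a 2 x 2 grid, with the years as bar labels
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  col <- nume_column[i]
  barplot(data_grouped[[col]], names.arg = data_grouped$year,
          main = col, xlab = "Year", ylab = "Mean")
}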
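To make the grouping step concrete, here is a small sketch of the same "mean per year" calculation using base R's aggregate() instead of dplyr. The `toy` data frame and its values are made up purely for illustration and are not taken from the dataset.

```r
# Made-up prices for illustration; the tutorial uses data1 instead
toy <- data.frame(
  year  = c(2013, 2013, 2014, 2014),
  open  = c(10, 12, 20, 24),
  close = c(11, 13, 21, 25)
)

# Mean of each price column within each year
yearly_means <- aggregate(cbind(open, close) ~ year, data = toy, FUN = mean)
print(yearly_means)
```

Each row of `yearly_means` corresponds to one bar per panel in the plots: one mean value per column per year.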
And that’s a wrap! But remember, this is just the beginning of our data adventure.
S&P 500 Companies Data Analysis Tutorial using R
R is a powerful programming language and environment for statistical computation and data analysis. It is popular with data scientists, accountants, and educators because of its wide range of features and capabilities. This project will use R to explore and analyze stock market data for S&P 500 companies.
tidyverse, ggplot2, and dplyr are just a few of the many packages available for the R programming language that simplify data processing, visualization, and statistical modeling (ggplot2 and dplyr are themselves part of the tidyverse). These packages let us perform tasks such as data cleaning, filtering, aggregation, and visualization.
In this tutorial, we will analyze the S&P 500 stock market dataset using these packages.
Hey! Hey! Hey! Welcome, adventurous data enthusiasts! Grab your virtual backpacks, put on your data detective hats, and get ready to unravel this project with me.
- Dataset introduction – All files contain the following columns.
- Date – In format: yy-mm-dd.
- Open – Price of the stock at the market open (this is NYSE data so everything is in USD).
- High – The highest value achieved for the day.
- Low – The lowest price achieved on the day.
- Close – The price of the stock at the market close.
- Volume – The number of shares traded.
- Name – The stock’s ticker name.
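A quick way to confirm that a loaded file matches this description is to compare its column names against the expected set. The following is a minimal sketch under stated assumptions: the header spellings in `expected_cols` and the four-digit-year date format are guesses to adjust against the actual file, and a single made-up row read from inline text stands in for the real CSV.

```r
# Expected headers (assumed spelling and capitalization; adjust to the actual file)
expected_cols <- c("date", "open", "high", "low", "close", "volume", "Name")

# One made-up row standing in for the real CSV file
csv_text <- "date,open,high,low,close,volume,Name
2013-02-08,10.0,11.0,9.5,10.5,1000,XYZ"

stocks <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Columns that are expected but absent from the file
missing_cols <- setdiff(expected_cols, names(stocks))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

# Parse the date column (assumed year-month-day) so date functions work on it
stocks$date <- as.Date(stocks$date, format = "%Y-%m-%d")
str(stocks)
```

With a real file, the `read.csv(text = ...)` call would be replaced by `read.csv()` on the file path.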