The term “tidy data” refers to a specific way of organizing a table: each variable has its own column, each observation has its own row, and each value has its own cell. “Normal data,” by contrast, is a general term that does not imply any particular layout. Let’s clarify the differences between these two concepts.

Normal (Untidy) Data

In this example, we have a dataset where different attributes (e.g., “Name,” “Age,” “City”) are stored in separate columns, and each row represents an individual’s information.


# Create a normal (untidy) data frame
normal_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 28),
  City = c("New York", "Los Angeles", "Chicago")
)

# Display the normal data
print(normal_data)


     Name Age        City
1   Alice  25    New York
2     Bob  30 Los Angeles
3 Charlie  28     Chicago

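In this representation, each variable (Name, Age, City) has its own column, and each row corresponds to a different individual. Strictly speaking, this wide layout already satisfies the tidy-data rules; it is used here as a simple starting point for showing how a data frame can be reshaped into a long, key-value layout.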

Tidy Data

In tidy data, each variable is stored in its own column, and each row represents a single observation or data point. To reshape the data frame above into a long layout with one row per person and attribute, we can use the gather() function from the tidyr package.


# Load the tidyr package
library(tidyr)
# Reshape the data into long (key-value) format using gather(), keeping Name as the identifier
tidy_data <- gather(normal_data, key = "Variable", value = "Value", -Name)
# Display the reshaped data
print(tidy_data)


     Name Variable       Value
1   Alice      Age          25
2     Bob      Age          30
3 Charlie      Age          28
4   Alice     City    New York
5     Bob     City Los Angeles
6 Charlie     City     Chicago

In the long representation, we have only three columns: “Name,” “Variable” (which stores the variable names), and “Value” (which stores the corresponding values). Each row now holds a single attribute of a single person, so the three people are described by six rows.
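One side effect of this reshape is easy to miss: because ages and city names end up in the same column, the numeric ages are stored as text. The sketch below rebuilds the data frame from this example and checks the column types. The names long_data and ages are illustrative, and stringsAsFactors = FALSE is set explicitly (an assumption) so the result does not depend on the R version.

```r
library(tidyr)

# Same data as in the example above
normal_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 28),
  City = c("New York", "Los Angeles", "Chicago"),
  stringsAsFactors = FALSE
)

# Stack Age and City into key-value pairs, keeping Name as the identifier
long_data <- gather(normal_data, key = "Variable", value = "Value", -Name)

# Age is numeric in the wide frame; Value has to hold both ages and
# city names, so it comes out as a character column
class(normal_data$Age)
class(long_data$Value)

# To compute with the ages again, they must be converted back to numbers
ages <- as.numeric(long_data$Value[long_data$Variable == "Age"])
mean(ages)
```

This is why long layouts tend to work best when the stacked columns share a type, as in the sales example that follows.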

Another Example of Normal Data and Tidy Data

The following example shows the same contrast in R, this time using sales data for different products.

Normal (Untidy) Data

In this normal (untidy) data representation, we have different products as columns, and each row represents a sales record for a specific date.


# Create a normal (untidy) data frame
normal_data <- data.frame(
  Date = as.Date(c("2023-01-01", "2023-01-02", "2023-01-03")),
  ProductA = c(100, 120, 90),
  ProductB = c(80, 75, 95),
  ProductC = c(60, 70, 80)
)

# Display the normal data
print(normal_data)


        Date ProductA ProductB ProductC
1 2023-01-01      100       80       60
2 2023-01-02      120       75       70
3 2023-01-03       90       95       80

In this representation, each product (ProductA, ProductB, ProductC) has its own column, and each row corresponds to sales data for a specific date. This format is not considered tidy because the product names are values of a single variable (product) that have been spread across the column headers, so the sales figures occupy three columns instead of one.
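To make the drawback concrete, the following base R sketch computes total sales per product directly from the wide data frame. It rebuilds the data from this example; the names product_cols and wide_totals are illustrative and not part of the original.

```r
# Same wide data as in the example above
normal_data <- data.frame(
  Date = as.Date(c("2023-01-01", "2023-01-02", "2023-01-03")),
  ProductA = c(100, 120, 90),
  ProductB = c(80, 75, 95),
  ProductC = c(60, 70, 80)
)

# The product names exist only as column headers, so the code has to
# work out which columns are products before it can summarise them
product_cols <- setdiff(names(normal_data), "Date")

# Sum each product column; the result is a named numeric vector
wide_totals <- colSums(normal_data[product_cols])
print(wide_totals)
```

The calculation works, but it depends on knowing which columns are products, and every new product adds another column to the data frame.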

Tidy Data

To make the data tidy, we’ll restructure it so that each variable (date, product, and sales) is stored in its own column and each row represents a single sales record: one product on one date.


# Load the tidyr package
library(tidyr)
# Convert the normal data to tidy format using gather(), keeping Date as the identifier
tidy_data <- gather(normal_data, key = "Product", value = "Sales", -Date)
# Display the tidy data
print(tidy_data)


        Date  Product Sales
1 2023-01-01 ProductA   100
2 2023-01-02 ProductA   120
3 2023-01-03 ProductA    90
4 2023-01-01 ProductB    80
5 2023-01-02 ProductB    75
6 2023-01-03 ProductB    95
7 2023-01-01 ProductC    60
8 2023-01-02 ProductC    70
9 2023-01-03 ProductC    80

In the tidy data representation, we have three columns: “Date,” “Product” (which stores the product names), and “Sales” (which stores the corresponding sales values). Each row represents a single sales record, and the data is now structured in a way that follows the principles of tidy data.
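In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(), although it still works. As a minimal sketch, the same reshape can be written with pivot_longer(), followed by a grouped summary that shows what the tidy layout makes easy. The data frame is rebuilt from this example; the names tidy_sales and totals are illustrative.

```r
library(tidyr)

# Same wide data as in the example above
normal_data <- data.frame(
  Date = as.Date(c("2023-01-01", "2023-01-02", "2023-01-03")),
  ProductA = c(100, 120, 90),
  ProductB = c(80, 75, 95),
  ProductC = c(60, 70, 80)
)

# Move every column except Date into Product/Sales pairs
tidy_sales <- pivot_longer(
  normal_data,
  cols = -Date,
  names_to = "Product",
  values_to = "Sales"
)

# With one row per date and product, a grouped summary needs no
# knowledge of which products exist
totals <- aggregate(Sales ~ Product, data = tidy_sales, FUN = sum)
print(totals)
```

Note that pivot_longer() returns a tibble and orders the rows by date and then by product, so the rows appear in a different order from the gather() output above even though they hold the same nine records.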

Introduction to Tidy Data in R

Tidy data is a way of arranging data in a systematic and consistent manner, which makes it easier to work with and analyze using tools such as R. It is a central idea in Hadley Wickham’s approach to data science, which he popularized through the “tidyverse,” a set of R packages for data manipulation, visualization, and analysis. In this introduction, we’ll look at the basics of tidy data in R and why it matters for good data analysis.

