Handle Missing Values in Time Series in Python

Here’s a step-by-step guide to handling missing values in a time series dataset with Python:

Step 1: Importing the Libraries

Here we are importing all the necessary libraries:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


Step 2: Importing the Dataset

  1. Importing data: It reads the data from the CSV file using pd.read_csv. Passing header=None tells pandas that the first row doesn’t contain column names.
  2. Naming columns: It assigns “Date” and “Customers” as names for the two columns using df.columns.
  3. Converting date format: It converts the “Date” column into a proper datetime format with year, month, and day order using pd.to_datetime and specifying the original format string (%Y-%m).
  4. Setting Date index: It sets the “Date” column as the index of the DataFrame using df.set_index, making it the reference point for time-based operations.
  5. Checking data shape and preview: It checks the final data shape with df.shape and displays the first few rows with df.head().


# import the data; the file has no header row
df = pd.read_csv('/Time-Series.csv', header=None)
# name the columns
df.columns = ['Date', 'Customers']
# represent the Date column as datetimes (the source strings are in year-month format)
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m')
# set the Date column as the index of our dataset
df = df.set_index('Date')
# now check the data shape and preview the first rows
print(df.shape)
print(df.head())



(144, 1)
1949-01-01      114.0
1949-02-01      120.0
1949-03-01      134.0
1949-04-01       67.0
1949-05-01      123.0
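
If the CSV file is not at hand, the same loading steps can be tried on a small in-memory sample. This is only a sketch: the four rows below are hypothetical (one value is deliberately left empty so that it loads as NaN), and the names= argument is used as a shortcut for assigning df.columns afterwards.

```python
import io

import pandas as pd

# Hypothetical sample in the same two-column, header-less layout as the CSV above.
# The third row has an empty value, which pandas reads as NaN.
sample_csv = io.StringIO(
    "1949-01,114\n"
    "1949-02,120\n"
    "1949-03,\n"
    "1949-04,67\n"
)

# names= labels the columns while reading the header-less file
sample = pd.read_csv(sample_csv, header=None, names=["Date", "Customers"])

# parse the year-month strings and use them as the index
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m")
sample = sample.set_index("Date")

print(sample.shape)
print(sample)
```

Each year-month string is parsed to the first day of that month, which is why the index in the output above shows dates such as 1949-01-01.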

The next step is to locate the missing values in the “Customers” column:

  1. Identifying missing values:
    • nul_data = pd.isnull(df['Customers']): This line uses pd.isnull to create a Boolean Series (nul_data) that is True for every missing value in the “Customers” column of df and False otherwise.
  2. Filtering and printing data:
    • df[nul_data]: This line uses Boolean indexing to filter df with the nul_data Series, so only the rows where the “Customers” value is missing are selected.


# flag the rows where Customers is missing
nul_data = pd.isnull(df['Customers'])
# print only the rows where Customers = NaN
print(df[nul_data])



1951-06-01    NaN
1951-07-01    NaN
1954-06-01    NaN
1960-03-01    NaN
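
Beyond listing the missing rows, it often helps to know how many values are missing and how long the gaps are, since long gaps are harder to impute well. The following is a minimal sketch on a hypothetical six-month series (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with a two-month gap
idx = pd.date_range("1951-04-01", periods=6, freq="MS")
s = pd.Series([163.0, 172.0, np.nan, np.nan, 199.0, 184.0],
              index=idx, name="Customers")

n_missing = s.isna().sum()          # number of missing values
share_missing = s.isna().mean()     # fraction of the series that is missing
missing_dates = list(s.index[s.isna()].strftime("%Y-%m"))

# Length of the longest run of consecutive missing values:
# the cumulative count of valid observations stays constant inside a gap,
# so it can serve as a group label for each gap.
gap_id = s.notna().cumsum()
longest_gap = s.isna().groupby(gap_id).sum().max()

print(n_missing, round(share_missing, 3))
print(missing_dates)
print(longest_gap)
```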

Plot the Graph

This creates a line plot of the data in the DataFrame df. It automatically uses the index (assumed to be the date) as the x-axis and the “Customers” column as the y-axis. 


# plot our series; the gaps appear as breaks in the line
plt.plot(df, color='green')
plt.title('Customers visited shop since 1949')
plt.show()



Step 3: Imputing the Missing Values

Here is the explanation of the techniques mentioned for handling missing values in time series data:

  1. Mean Imputation: Replaces missing values with the average of the entire column. Simple and fast, but may not capture trends or local variations.
  2. Median Imputation: Replaces missing values with the median of the entire column. Less sensitive to outliers than mean, but still lacks local context.
  3. Last Observation Carried Forward (LOCF): Replaces missing values with the last known value. Works well when values change slowly between observations, but it flattens any trend across the gap and lags behind changes in direction.
  4. Next Observation Carried Backward (NOCB): Replaces missing values with the next known value. It mirrors LOCF, filling from the future instead of the past, so it is a poor fit when the imputed series feeds a forecast that must not see later values. Both LOCF and NOCB can introduce artificial jumps or dips at the edges of a gap.
  5. Linear Interpolation: Estimates missing values by drawing a straight line between the two nearest known data points. Good for capturing linear trends, but less accurate for complex patterns.
  6. Spline Interpolation: Estimates missing values by fitting a flexible, curved line through the data points. More accurate for capturing complex trends and subtle changes than linear interpolation, but computationally more expensive.
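
To make the differences concrete, the first five techniques can be placed side by side on a toy series. This is a sketch on hypothetical values, not the article's dataset; spline interpolation is left out here because it additionally requires SciPy.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with a two-month gap
idx = pd.date_range("2020-01-01", periods=6, freq="MS")
s = pd.Series([10.0, 20.0, np.nan, np.nan, 50.0, 100.0], index=idx)

compare = pd.DataFrame({
    "original": s,
    "mean": s.fillna(s.mean()),          # same constant for every gap
    "median": s.fillna(s.median()),      # same constant, robust to the 100.0
    "locf": s.ffill(),                   # repeats the value before the gap
    "nocb": s.bfill(),                   # repeats the value after the gap
    "linear": s.interpolate(method="linear"),  # straight line across the gap
})

print(compare)
```

In this toy case the mean (45.0) and median (35.0) ignore where the gap sits, LOCF and NOCB repeat 20.0 and 50.0 respectively, and linear interpolation fills in 30.0 and 40.0.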

1. Mean imputation

It performs mean imputation on the “Customers” column of the DataFrame. It creates a new column named “FillMean” containing the original values where available and the average value of the “Customers” column where missing.


# fill the missing data using the mean of the present observations
df = df.assign(FillMean=df.Customers.fillna(df.Customers.mean()))
# plot the mean-imputed series in green
plt.plot(df['FillMean'], color='green')
plt.title('Mean Imputation')
plt.show()



2. Median imputation

It performs median imputation on the dataset. It copies all existing columns and adds a new column named FillMedian. This new column fills in missing values in the Customers column using the median value of that column (df.Customers.median()).


# fill the missing data using the median of the present observations
dataset = df.assign(FillMedian=df.Customers.fillna(df.Customers.median()))
# plot the median-imputed series in green
plt.plot(dataset['FillMedian'], color='green')
plt.title('Median Imputation')
plt.show()
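
The practical difference between mean and median imputation shows up when the column contains an outlier. A small sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value and one outlier (100.0)
s = pd.Series([10.0, 12.0, np.nan, 11.0, 100.0])

mean_fill = s.fillna(s.mean())      # mean of 10, 12, 11, 100 is 33.25
median_fill = s.fillna(s.median())  # median of 10, 11, 12, 100 is 11.5

print(mean_fill.iloc[2])    # 33.25
print(median_fill.iloc[2])  # 11.5
```

The single outlier pulls the mean far above the typical values, while the median stays close to them.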



3. Last Observation Carried Forward (LOCF)

Last Observation Carried Forward (LOCF) imputes missing values in the “Customers” column by copying the most recent preceding value forward. The code applies it and then plots the resulting time series.


# On the Customers column of our data, impute the missing values with LOCF (forward fill).
# Series.ffill() is used because fillna(method=...) is deprecated in recent pandas releases.
df['Customers_locf'] = df['Customers'].ffill()
# plot our time series with imputed values
plt.plot(df['Customers_locf'], color='green')
plt.title('Last Observation Carried Forward')
plt.show()



4. Next Observation Carried Backward (NOCB)

Next Observation Carried Backward (NOCB) imputes missing values in the “Customers” column by copying the next available observation backward. The code applies it and then plots the resulting time series.


# On the Customers column of our data, impute the missing values with NOCB (backward fill).
# Series.bfill() is used because fillna(method=...) is deprecated in recent pandas releases.
df['Customers_nocb'] = df['Customers'].bfill()
# plot our time series with imputed values
plt.plot(df['Customers_nocb'], color='green')
plt.title('Next Observation Carried Backward')
plt.show()
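
Two edge cases are worth checking with forward and backward fills: gaps at the very start or end of the series, and long gaps. A minimal sketch on hypothetical values, using the limit parameter of ffill to cap how far a value is carried:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, a three-value gap, and a trailing gap
s = pd.Series([np.nan, 5.0, np.nan, np.nan, np.nan, 9.0, np.nan])

locf = s.ffill()            # leading NaN stays: nothing precedes it
nocb = s.bfill()            # trailing NaN stays: nothing follows it
both = s.ffill().bfill()    # forward fill, then backward fill what is left
capped = s.ffill(limit=1)   # carry each value forward by at most one step

print(locf.tolist())    # [nan, 5.0, 5.0, 5.0, 5.0, 9.0, 9.0]
print(nocb.tolist())    # [5.0, 5.0, 9.0, 9.0, 9.0, 9.0, nan]
print(both.tolist())    # [5.0, 5.0, 5.0, 5.0, 5.0, 9.0, 9.0]
print(capped.tolist())  # [nan, 5.0, 5.0, nan, nan, 9.0, 9.0]
```

Capping the fill leaves the rest of a long gap as NaN, which makes it visible for a different technique rather than silently stretching one observation across many periods.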



5. Linear Interpolation

Linear interpolation estimates each missing value in the “Customers” column from the straight line joining the nearest known values on either side of the gap.


# on our data, impute the missing values using linear interpolation
df['Customers_L'] = df['Customers'].interpolate(method='linear')
# plot the complete dataset
plt.plot(df['Customers_L'], color='green')
plt.title('Linear Interpolation')
plt.show()
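
One detail to be aware of: method='linear' treats the observations as equally spaced and ignores the index. When timestamps are unevenly spaced, method='time' weights the estimate by the actual time elapsed. A small sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical series with uneven spacing: 1 day, then 3 days
idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"])
s = pd.Series([0.0, np.nan, 4.0], index=idx)

by_position = s.interpolate(method="linear")  # halfway between neighbours
by_time = s.interpolate(method="time")        # 1 day into a 4-day interval

print(by_position.iloc[1])  # 2.0
print(by_time.iloc[1])      # 1.0
```

For a monthly series the two methods give nearly the same result, differing only because months have different lengths.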



6. Spline Interpolation

Spline interpolation estimates missing values in the “Customers” column by fitting a smooth piecewise polynomial curve through the known data points.


# on our data, impute the missing values using spline interpolation
# (method='spline' requires SciPy and an explicit order; order=3, a cubic spline, is an assumed choice)
df['Customers_Spline'] = df['Customers'].interpolate(method='spline', order=3)
# plot the complete dataset
plt.plot(df['Customers_Spline'], color='green')
plt.title('Spline Interpolation')
plt.show()




Dealing with missing values in your Python time series can be a frustrating experience. However, with careful analysis and the right imputation technique, you can transform fragmented data into a smooth and reliable flow for more accurate analysis. It’s important to note that there is no one-size-fits-all approach to imputation, so it’s essential to assess your data, understand the patterns of missingness, and choose the technique that best preserves the integrity and meaning of your time series. By embracing the power of imputation and bridging the gaps with confidence, you can take your time series analysis to new heights!
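
One way to choose between techniques is to hide values that are actually known, impute them, and measure the error. The sketch below does this on a synthetic series (a hypothetical linear trend plus a yearly cycle) with hand-picked hidden positions, so the ranking it prints says nothing about any real dataset. The same idea can be applied to the observed part of your own series.

```python
import numpy as np
import pandas as pd

# Hypothetical complete series: linear trend plus a 12-month cycle
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
t = np.arange(48)
truth = pd.Series(100 + 2.0 * t + 10 * np.sin(2 * np.pi * t / 12), index=idx)

# Hide a few interior points whose true values are known
hidden = np.array([3, 10, 11, 20, 27, 34, 35, 44])
masked = truth.copy()
masked.iloc[hidden] = np.nan

# Impute with each candidate technique
candidates = {
    "mean": masked.fillna(masked.mean()),
    "median": masked.fillna(masked.median()),
    "locf": masked.ffill(),
    "nocb": masked.bfill(),
    "linear": masked.interpolate(method="linear"),
}

# Mean absolute error on the hidden points only
mae = {
    name: (filled.iloc[hidden] - truth.iloc[hidden]).abs().mean()
    for name, filled in candidates.items()
}

for name, err in sorted(mae.items(), key=lambda kv: kv[1]):
    print(f"{name:>6}: {err:.2f}")
```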


