Computing C.I. using Bootstrapping
Bootstrapping is a resampling method that uses random sampling with replacement. It assigns measures of accuracy (bias, variance, confidence intervals, prediction error, etc.) to sample estimates. It allows the sampling distribution of almost any statistic to be estimated using random sampling methods. It may also be used for constructing hypothesis tests.
Example:
Python3
# import libraries
import numpy
from sklearn.utils import resample
from matplotlib import pyplot as plt

# load dataset
x = numpy.array([180, 162, 158, 172, 168, 150, 171, 183, 165, 176])

# configure bootstrap
n_iterations = 1000  # k = number of bootstrapped samples
n_size = len(x)      # each bootstrapped sample is as large as x

# run bootstrap: resample with replacement and record each median
medians = []
for i in range(n_iterations):
    s = resample(x, n_samples=n_size)
    m = numpy.median(s)
    medians.append(m)

# plot the distribution of the bootstrapped medians
plt.hist(medians)
plt.show()

# confidence intervals (alpha is the confidence level)
alpha = 0.95
p = ((1.0 - alpha) / 2.0) * 100
lower = numpy.percentile(medians, p)
p = (alpha + ((1.0 - alpha) / 2.0)) * 100
upper = numpy.percentile(medians, p)
print(f"\n{alpha*100}% confidence interval: {lower} and {upper}")
After importing the necessary libraries, create a sample S of size n=10 and store it in the variable x. A simple loop then generates k=1000 artificial samples, each of size m=10 (here m=n), by drawing from x with replacement. These are called bootstrapped samples. The median of each one is computed and stored in the list ‘medians’. A histogram of the 1000 bootstrapped medians is plotted with the matplotlib library. Finally, the 2.5th and 97.5th percentiles of the medians are taken as the lower and upper bounds of a 95% confidence interval for the population median.
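The same percentile procedure can also be sketched with NumPy alone. The helper `bootstrap_ci` below is a hypothetical name introduced only for illustration and is not part of any library. It assumes the statistic accepts an `axis` argument, and it uses a seeded random generator (the seed value is arbitrary) so that repeated runs give the same interval.

```python
import numpy as np

def bootstrap_ci(data, stat=np.median, n_iterations=1000, alpha=0.95, seed=0):
    """Percentile bootstrap confidence interval for `stat` (illustrative helper)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Draw all resamples at once: each row holds the indices of one
    # bootstrapped sample of size n, drawn with replacement
    idx = rng.integers(0, len(data), size=(n_iterations, len(data)))
    stats = stat(data[idx], axis=1)
    # Cut off (1 - alpha) / 2 of the bootstrap distribution in each tail
    tail = (1.0 - alpha) / 2.0 * 100
    lower, upper = np.percentile(stats, [tail, 100 - tail])
    return lower, upper

x = np.array([180, 162, 158, 172, 168, 150, 171, 183, 165, 176])
lower, upper = bootstrap_ci(x)
print(f"95% confidence interval for the median: {lower} to {upper}")
```

Passing a different statistic, for example `bootstrap_ci(x, stat=np.mean)`, gives a bootstrap interval for the mean instead of the median.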
How to Plot a Confidence Interval in Python?
A confidence interval is a type of estimate, computed from the observed data, that gives a range of values likely to contain a population parameter at a particular level of confidence.
A confidence interval for the mean is a range of values within which the population mean is likely to lie. If I predicted that tomorrow’s temperature will be somewhere between -100 degrees and +100 degrees, I could be 100% sure that this will be correct. However, if I predicted that it will be between 20.4 and 20.5 degrees Celsius, I would be far less confident. Note how the confidence decreases as the interval narrows. The same applies to statistical confidence intervals, but they also depend on other factors, such as the sample size and the variability of the data.
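The trade-off between confidence and interval width can be illustrated with a small sketch. It reuses the ten values from the bootstrapping example and, as a simplifying assumption, builds each interval for the mean from the normal approximation (mean ± z × standard error). With only ten observations a t-distribution would usually be preferred, so the exact bounds are illustrative only.

```python
import math
from statistics import NormalDist, mean, stdev

sample = [180, 162, 158, 172, 168, 150, 171, 183, 165, 176]
n = len(sample)
centre = mean(sample)
std_err = stdev(sample) / math.sqrt(n)  # standard error of the mean

widths = []
for level in (0.80, 0.90, 0.95, 0.99):
    # Two-sided critical value from the standard normal distribution
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower, upper = centre - z * std_err, centre + z * std_err
    widths.append(upper - lower)
    print(f"{level:.0%} interval: {lower:.1f} to {upper:.1f} (width {upper - lower:.1f})")
```

Each higher confidence level produces a wider interval around the same sample mean.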
A 95% confidence interval tells me that if we took a very large number of samples from the population and calculated the interval each time, about 95% of those intervals would contain the true population mean. So, with one sample we can calculate the sample mean and, from there, an interval around it that will most likely contain the true population mean.
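This repeated-sampling interpretation can be checked with a short simulation. The sketch below assumes a normally distributed population with a made-up mean of 170 and standard deviation of 10, and, to keep the interval formula simple, treats the population standard deviation as known. These values, the sample size and the number of trials are all assumptions chosen for illustration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)
true_mean, true_std = 170.0, 10.0   # assumed population parameters
n, n_trials = 30, 2000              # sample size and number of repeated samples
z = NormalDist().inv_cdf(0.975)     # critical value for a 95% interval

# Each row is one independent sample of size n from the population
samples = rng.normal(true_mean, true_std, size=(n_trials, n))
means = samples.mean(axis=1)
half_width = z * true_std / np.sqrt(n)  # half-width of the known-sigma interval

# Check, for every sample, whether its interval contains the true mean
covered = (means - half_width <= true_mean) & (true_mean <= means + half_width)
coverage = covered.mean()
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95, although any single run will differ slightly because of sampling noise.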
The confidence interval as a concept was put forth by Jerzy Neyman in a paper published in 1937. There are various types of confidence interval. Some of the most commonly used are: CI for the mean, CI for the median, CI for the difference between means, CI for a proportion and CI for the difference in proportions.
Let’s have a look at how this goes with Python.