Introduction to Cox Proportional Hazards Model

The Cox Proportional Hazards Model is one of the most widely used statistical methods in survival analysis, the branch of statistics that studies the relationship between survival (or failure) time and one or more predictor variables. Key Features:

  • Non‐parametric Baseline Hazard: No assumption about the form of the baseline hazard function is needed.
  • Proportional Hazards Assumption: The effect of a covariate on the hazard rate is multiplicative and does not change over time.
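
In symbols, these two features are usually written as follows. The notation here (h_0 for the baseline hazard, beta for the coefficients, x for the covariates) is the standard textbook one and is an assumption of this sketch, not something taken from the code later in this article.

```latex
% Hazard for a subject with covariates x = (x_1, ..., x_p):
% an unspecified baseline hazard h_0(t) times a multiplicative covariate effect
h(t \mid x) = h_0(t)\,\exp(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)

% Hazard ratio between two subjects a and b: h_0(t) cancels,
% so the ratio does not depend on t
\frac{h(t \mid x_a)}{h(t \mid x_b)} = \exp\!\big(\beta^\top (x_a - x_b)\big)
```

The first line shows why no form is needed for the baseline hazard: it is left as an arbitrary function of time. The second shows the proportional hazards assumption: the ratio between two subjects is constant over time.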

Data Preparation for Survival Analysis

Now we will prepare the data for survival analysis. Three kinds of variables are needed:

  • Time-to-Event Variable: A variable indicating the duration between an origin and an event or censor.
  • Censoring Indicator: A variable indicating whether the event occurred or the observation was censored.
  • Predictor Variables: Clean and properly format all of the independent variables.
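
As a rough illustration of the first two items, the sketch below derives a time-to-event variable and a censoring indicator from raw dates. The data frame, its column names, and the study end date are all made up for this example.

```r
# Hypothetical raw records: one row per subject (all names and dates are made up)
raw <- data.frame(
  id         = 1:4,
  start_date = as.Date(c("2020-01-01", "2020-01-15", "2020-02-01", "2020-03-10")),
  event_date = as.Date(c("2020-06-01", NA, "2020-04-20", NA))
)
study_end <- as.Date("2020-12-31")   # assumed administrative end of follow-up

# Censoring indicator: 1 = event observed, 0 = censored at study end
raw$event <- as.integer(!is.na(raw$event_date))

# Time-to-event: days from the origin to the event, or to study end if censored
end_date <- raw$event_date
end_date[is.na(end_date)] <- study_end
raw$time <- as.numeric(end_date - raw$start_date)

raw[, c("id", "time", "event")]
```

Subjects without an event date are treated as censored at the end of the study, which is one common convention. Other designs, such as loss to follow-up at an earlier date, would need their own censoring date.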

Survival Package in R

The R survival package provides a comprehensive toolkit for survival analysis. It includes functions for creating survival objects, fitting Cox proportional hazards models, and plotting survival curves. The sections below use these functions on a real dataset.

R
# Install and load the survival package
install.packages("survival")
library(survival)

# Load a sample dataset (lung is stored in the package's "cancer" data file)
data(cancer, package = "survival")

# View the structure of the dataset
str(lung)

Output:

'data.frame':    228 obs. of  10 variables:
$ inst : num 3 3 3 5 1 12 7 11 1 7 ...
$ time : num 306 455 1010 210 883 ...
$ status : num 2 2 1 2 2 1 2 2 2 2 ...
$ age : num 74 68 56 57 60 74 68 71 53 61 ...
$ sex : num 1 1 1 1 1 1 2 2 1 1 ...
$ ph.ecog : num 1 0 0 1 0 1 2 2 1 2 ...
$ ph.karno : num 90 90 90 90 100 50 70 60 70 70 ...
$ pat.karno: num 100 90 90 60 90 80 60 80 80 70 ...
$ meal.cal : num 1175 1225 NA 1150 NA ...
$ wt.loss : num NA 15 15 11 0 0 10 1 16 34 ...

Create a survival object

Before fitting the Cox model, combine the time-to-event variable and the censoring indicator into a survival object. In the lung dataset, status is coded 1 = censored and 2 = dead, so status == 2 marks the observed events.

R
# Create a survival object
lung$surv_obj <- with(lung, Surv(time, status == 2))

# View the first few rows of the dataset
head(lung)

Output:

  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss surv_obj
1    3  306      2  74   1       1       90       100     1175      NA      306
2    3  455      2  68   1       0       90        90     1225      15      455
3    3 1010      1  56   1       0       90        90       NA      15    1010+
4    5  210      2  57   1       1       90        60     1150      11      210
5    1  883      2  60   1       0      100        90       NA       0      883
6   12 1022      1  74   1       1       50        80      513       0    1022+
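
In the surv_obj column, a trailing "+" marks a censored time (rows 3 and 6, where status is 1). The small sketch below, on made-up toy values, shows the same behaviour in isolation.

```r
library(survival)

# Toy data (made up): four subjects, two with the event and two censored
time   <- c(5, 8, 12, 20)
status <- c(1, 0, 1, 0)        # 1 = event, 0 = censored

s <- Surv(time, status)
print(s)                       # censored times are printed with a trailing "+"

# For right-censored data, a Surv object is a two-column matrix
colnames(s)
```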

Feature Engineering for Survival Analysis

Feature engineering encompasses handling missing values, scaling variables, and creating interaction terms.

R
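# Handle missing values (example: median imputation; age has no missing
# values in lung, so this line is a no-op here and is shown as a pattern)
lung$age[is.na(lung$age)] <- median(lung$age, na.rm = TRUE)

# Scale the numeric predictor (scale() returns a one-column matrix,
# so convert the result back to a plain numeric vector)
lung$age <- as.numeric(scale(lung$age))

# Create an interaction term (example: interaction between age and sex)
lung$age_sex <- lung$age * lung$sex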

This code handles missing values, scales a numeric predictor, and creates an interaction term in preparation for Cox proportional hazards modeling.
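
As a possible alternative to building the interaction column by hand, the sketch below first counts missing values per column and then lets the model formula generate the interaction. It works on a copy of the data, and the names d, age_z, and fit are chosen only for this illustration.

```r
library(survival)

# Work on a copy so the lung data used elsewhere in the article is untouched
d <- survival::lung

# Count missing values per column to see where imputation is actually needed
na_counts <- colSums(is.na(d))
na_counts[na_counts > 0]

# Standardise age as a plain numeric vector
d$age_z <- as.numeric(scale(d$age))

# Let the formula build the interaction: age_z * sex expands to
# age_z + sex + age_z:sex
fit <- coxph(Surv(time, status == 2) ~ age_z * sex + ph.ecog, data = d)
names(coef(fit))
```

Writing the interaction in the formula keeps the main effects and the product term together, so it is harder to drop one of them by accident.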

Fitting Cox Proportional Hazards Models

Fit Cox proportional hazards models using the ‘coxph()’ function.

R
# Fit the Cox proportional hazards model
cox_model <- coxph(surv_obj ~ age + sex + ph.ecog + age_sex, data = lung)

# View the summary of the Cox model
summary(cox_model)

Output:

Call:
coxph(formula = surv_obj ~ age + sex + ph.ecog + age_sex, data = lung)

n= 227, number of events= 164
(1 observation deleted due to missingness)

            coef exp(coef) se(coef)      z Pr(>|z|)
age       0.4021    1.4950   0.2507  1.604  0.10876
sex      -0.5402    0.5826   0.1684 -3.207  0.00134 **
ph.ecog   0.4920    1.6355   0.1160  4.241 2.22e-05 ***
age_sex  -0.2248    0.7986   0.1745 -1.289  0.19752
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.4950     0.6689    0.9146    2.4439
sex        0.5826     1.7164    0.4188    0.8105
ph.ecog    1.6355     0.6114    1.3030    2.0530
age_sex    0.7986     1.2521    0.5673    1.1243

Concordance= 0.651 (se = 0.025 )
Likelihood ratio test= 32.14 on 4 df, p=2e-06
Wald test = 32.65 on 4 df, p=1e-06
Score (logrank) test = 33.33 on 4 df, p=1e-06

The summary of the Cox proportional hazards model provides several key pieces of information:

  • Concordance: 0.651, indicating the model’s ability to correctly rank the order of survival times (where 1 is perfect prediction and 0.5 is random chance).
  • Likelihood ratio test: p-value of 2e-06.
  • Wald test: p-value of 1e-06.
  • Score (logrank) test: p-value of 1e-06. These p-values suggest the overall model is statistically significant.
  • Age: Coefficient (coef) = 0.4021, hazard ratio (exp(coef)) = 1.4950, not statistically significant (p = 0.10876). Because the model includes an age-by-sex interaction, this is the effect of a one-unit (one standard deviation) increase in scaled age at sex = 0, a value that does not occur in the data, so it should not be read as a 49.5% increase in hazard for any actual group of patients.
  • Sex: Coefficient (coef) = -0.5402, hazard ratio = 0.5826, statistically significant (p = 0.00134). In the lung dataset sex is coded 1 = male and 2 = female, so females have about 42% lower hazard than males (evaluated at the mean age, where scaled age is 0).
  • ph.ecog: Coefficient (coef) = 0.4920, hazard ratio = 1.6355, statistically significant (p = 2.22e-05), indicating that each one-point increase in ECOG score increases the hazard by 63.55%.
  • Age_Sex interaction: Coefficient (coef) = -0.2248, hazard ratio = 0.7986, not statistically significant (p = 0.19752).
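
The arithmetic behind these hazard ratios can be reproduced directly from the coefficients in the summary output. The sketch below copies the four reported coefficients by hand; the variable names are chosen only for this illustration, and sex is taken as 1 = male, 2 = female, as coded in the lung dataset.

```r
# Coefficients copied from the summary output above
coefs <- c(age = 0.4021, sex = -0.5402, ph.ecog = 0.4920, age_sex = -0.2248)

# Hazard ratio = exp(coef); percent change in hazard = (HR - 1) * 100
hr  <- exp(coefs)
pct <- (hr - 1) * 100
round(cbind(HR = hr, pct_change = pct), 2)

# With an age x sex interaction, the log-hazard slope for scaled age
# depends on sex: coef(age) + coef(age_sex) * sex
age_slope <- coefs[["age"]] + coefs[["age_sex"]] * c(male = 1, female = 2)
round(exp(age_slope), 3)
```

The last two lines show why the age coefficient should not be read on its own when an interaction is present: the age effect for each sex combines the main effect with the interaction term.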

Overall, the significant variables are sex and ph.ecog, indicating these have a meaningful impact on survival in the dataset.
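
These interpretations rely on the proportional hazards assumption described in the introduction. One common way to check it is cox.zph() from the survival package, which tests for a time trend in the scaled Schoenfeld residuals. The sketch below refits a simplified, main-effects-only model so that it runs on its own, so it is a stand-in for the model fitted above rather than the same model.

```r
library(survival)

# Refit a comparable model on the built-in lung data (main effects only)
fit <- coxph(Surv(time, status == 2) ~ age + sex + ph.ecog, data = lung)

# Test of the proportional hazards assumption:
# one row per covariate plus a GLOBAL row
ph_test <- cox.zph(fit)
print(ph_test)

# Optional visual check: smoothed scaled Schoenfeld residuals against time
# plot(ph_test)
```

A small p-value for a covariate suggests its effect may change over time, in which case the constant hazard ratio reported by the model should be read with caution.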

Visualizing Survival Curves

Survival curves show how the estimated survival probability changes over time.

R
# Plot survival curves
surv_fit <- survfit(cox_model)
plot(surv_fit, xlab = "Time (days)", ylab = "Survival Probability", 
     main = "Survival Curves")

Output:


[Figure: Cox model in R]


  • The x-axis represents the time in days. This is the follow-up period during which the survival of the patients in the dataset was monitored.
  • The y-axis represents the survival probability. This is the estimated probability that a subject will survive beyond a given time point.

Because survfit() is called here without a newdata argument, the plot shows a single estimated survival curve, with its confidence interval, for a subject whose covariates are set to their mean values in the data. If the model contained strata, one curve per stratum would be drawn. The curve summarizes how survival probability declines over the follow-up period; on its own it does not show how individual variables in the Cox model influence survival.
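
To see how a covariate shifts the curve, one option is to pass survfit() a newdata data frame with one row per covariate profile. The sketch below is a minimal illustration under stated assumptions: it refits a simplified main-effects model so that it runs on its own, and the two patient profiles (age 60, ECOG score 1) are hypothetical.

```r
library(survival)

# Simplified main-effects model on the built-in lung data
fit <- coxph(Surv(time, status == 2) ~ age + sex + ph.ecog, data = lung)

# Two hypothetical patients who differ only in sex (1 = male, 2 = female)
profiles <- data.frame(age = 60, sex = c(1, 2), ph.ecog = 1)

# One predicted survival curve per row of profiles
curves <- survfit(fit, newdata = profiles)
plot(curves, col = c("blue", "red"), xlab = "Time (days)",
     ylab = "Survival Probability", main = "Predicted survival by sex")
legend("topright", legend = c("Male", "Female"),
       col = c("blue", "red"), lty = 1)
```

Each row of profiles produces its own curve, so the vertical gap between the curves reflects the estimated effect of sex with the other covariates held fixed.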

Conclusion

The Cox model is one of the strongest tools for understanding time-to-event data and is applied across a wide range of disciplines. Appropriate data preparation, model fitting, and validation are all needed to obtain reliable answers from it. The continued development of statistical methodology and of computational tools such as R makes survival analysis practical and efficient on most real-world datasets.