Survival Analysis: The Basics

What is Survival Analysis?

Survival analysis describes the methods used to analyze data characterized by the time until an event occurs. This could be any event, not just death (which is where the name comes from): diagnosis of a disease, first bite of pizza, finishing a beer, first caught fish, etc. In traditional survival analysis, the event either occurs or does not occur during the time of observation; there is no in-between.

This kind of analysis occurs often in clinical studies, like when we are trying to determine the efficacy of a new drug or therapy. Do the subjects taking the new drug or therapy survive longer than those taking the standard treatment? Survival analysis can help us answer that.

Let’s look at an example of data which can be analyzed in this way.

The Data

Usually and ideally the data is set up somewhat like this:

surv_time  relapse  sex  log_WBC  Rx
35         0        1    1.45     0
34         0        1    1.47     0
32         0        1    2.2      0
32         0        1    2.53     0

Each row is an individual (a patient in this case). The column ‘surv_time’ states the time in weeks until the event of interest occurs or, if it was not observed, until the subject was last observed. The column ‘relapse’ is that event of interest: 0 if the event did not occur and 1 if the event did occur. The column ‘Rx’ records whether the subject received the new treatment, coded as ‘0’, or the standard treatment, coded as ‘1’ (this is the coding used in the R code later in this post). Usually there are other variables in such datasets which are explanatory or confounding variables for control. Here they are ‘sex’ (0 or 1 for male or female) and ‘log_WBC’, which is the log of the subject’s white blood cell count.
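As a rough sketch, the four rows shown above can be typed into R as a data frame to see how the coding works. Only the values printed in the table are used; the object name example_rows is just an illustrative choice and does not come from the book.

```r
# The four example rows from the table above, entered by hand.
# 'example_rows' is an illustrative name, not part of the original dataset code.
example_rows <- data.frame(
  surv_time = c(35, 34, 32, 32),         # weeks of follow-up
  relapse   = c(0, 0, 0, 0),             # 0 = no relapse observed, 1 = relapse
  sex       = c(1, 1, 1, 1),
  log_WBC   = c(1.45, 1.47, 2.20, 2.53),
  Rx        = c(0, 0, 0, 0)              # treatment group code
)

# How many of these subjects had the event recorded?
n_events <- sum(example_rows$relapse == 1)

print(example_rows)
print(n_events)
```

None of these four patients relapsed while they were being observed, so the count of events among them is zero even though each has a ‘surv_time’.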

If you are paying careful attention, you may have noticed something peculiar. Why do we have two variables (‘surv_time’ and ‘relapse’) when we are just interested in when the event of interest occurred? Shouldn’t it be enough to just have the time when the event occurred in the dataset, and if it doesn’t occur, just list the endpoint of the study as the time of occurrence? No. If we did this, we couldn’t analyze the data properly. Here, we need to understand the concept of ‘censoring.’

What is Censoring?

Censoring is a term in survival analysis that we use when the event of interest is not observed for a subject. It usually happens for one of three reasons:

  1. The event of interest doesn’t occur during the study period
  2. The subject leaves the study before the event occurs
  3. The subject is ‘lost to follow-up’ which just means that the subject left the study at some point without telling anyone.

So what’s the commonality between these three points? That the event doesn’t occur during the study period? Well, in points 2 and 3 it could have occurred; we just don’t know. The commonality is that we don’t know the exact survival time in any of these cases. If the event occurs (no censoring), we know when it occurred; if the event does not occur (or if we don’t observe it), we do not know when or if it will occur. (Obviously, if death is the event of interest, we know that it will occur someday.) We’ll look more into censoring in a future post, but in this post we won’t worry about it much. For now, we just need to understand that it is a common situation in survival analysis, that there are many different ways it can affect the analysis, and that there are also many ways to deal with it.
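To make the idea concrete, here is a minimal sketch of how a time variable and an event indicator might be derived from raw follow-up records. Everything in it is hypothetical: the records data frame, its column names event_week and last_seen_week, and the values are made up purely for illustration and are not from the leukemia dataset.

```r
# Hypothetical follow-up records (made-up values, for illustration only).
records <- data.frame(
  id             = 1:4,
  event_week     = c(12, NA, NA, 20),  # week the event was observed; NA = never observed
  last_seen_week = c(12, 35, 9, 20)    # last week the subject was under observation
)

# Event indicator: 1 if the event was observed, 0 if the subject was censored.
records$relapse <- as.integer(!is.na(records$event_week))

# Time variable: the event time if observed, otherwise the last time we saw the subject.
records$surv_time <- ifelse(records$relapse == 1,
                            records$event_week,
                            records$last_seen_week)

print(records[, c("id", "surv_time", "relapse")])
```

In this made-up example, subject 2 reaches week 35 without the event and subject 3 drops out at week 9. Both get a 0 in the indicator column, which is what lets an analysis treat their times as "at least this long" rather than as event times.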

Data Visualization

Now that we’ve got an idea of what censoring is, we can begin to look at how to analyze the survival data. One of the most common ways is with a Kaplan-Meier survival curve. When we plot a survival curve, we are looking at time (x-axis) versus proportion surviving (y-axis). If we are comparing two groups (like new treatment vs. standard treatment), we can display two curves on one plot. This can be a very helpful way to see if there may be a difference between the two groups in terms of survival time. Multiple groups can be compared as well.

What’s it look like? Survival curves look like a step function, or a stair-case going down towards the right. Let’s show a plot of two of them and it will become much more clear.

[Figure: Kaplan-Meier survival curves for the new treatment and standard treatment groups]

We’ll see the R code to produce this plot later. The x-axis is time in weeks and the y-axis is proportion surviving. The blue line represents the survival time of the subjects who received the new treatment and the red line represents the survival time of the subjects who received the standard treatment.

A note on the proportion surviving:

We need to get the proportion surviving so that we can properly compare two or more groups. If the groups differ in number, we can’t just show the number surviving, as it wouldn’t be a proper comparison. To compute this proportion we use the following equation:

S_{t} = \frac{\text{number of subjects living at start} - \text{number of subjects who died up to time } t}{\text{number of subjects living at start}}

Note that the numerator can also be written as the number of subjects surviving past time t.

S_{t} = \frac{\text{number of subjects surviving past time } t}{\text{number of subjects living at start}}

For example, say we have 50 total subjects. At week 3, 5 of them die. At t=0, the proportion surviving is 1, or 50/50. At t=3, the proportion surviving is (50-5)/50 = 0.9. So at week 3, the survival curve would drop from 1 down to 0.9. Then at week 5, 3 more subjects die. Now the survival curve drops to 0.84 because (50-(5+3))/50 = 0.84. We must add the subjects who died previously to the ones who just died in week 5, so at week 5 a total of 8 subjects have died. We recalculate the proportion surviving each time one or more subjects die.
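The arithmetic in this example can be sketched in a few lines of base R. The variable names (n_start, deaths) are only illustrative, and the sketch assumes no censoring, just like the example itself.

```r
# Worked example from the text: 50 subjects, 5 die at week 3, 3 more at week 5.
n_start <- 50
deaths <- data.frame(
  week = c(3, 5),
  died = c(5, 3)
)

# Running total of deaths up to and including each week.
deaths$total_died <- cumsum(deaths$died)

# Proportion surviving = (number at start - number died so far) / number at start.
deaths$prop_surviving <- (n_start - deaths$total_died) / n_start

print(deaths)
```

The prop_surviving column holds 0.9 for week 3 and 0.84 for week 5, matching the hand calculation.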

This is exactly what is happening in the chart above.

Please note that the above calculation changes and becomes more complicated when we have censored data. We’ll explore this further in a future post.

You may have surmised that this dataset has to do with the survival time of patients receiving a new treatment or the standard treatment. Specifically, it is a dataset consisting of remission survival times on 42 leukemia patients along with some explanatory variables.

Visual Analysis of the Kaplan-Meier Curve

Which group ‘survives’ longer? Are there any other interesting things we can get out of the plot?

Clearly it appears that the group who received the new treatment is ‘surviving’ longer than the group who received the standard treatment. It’s also worth noting that no one in the standard treatment group survived longer than 23 weeks, whereas a little less than 50% of those with the new treatment survived until the end of the study (35 weeks).

So is that the end of our analysis? Do we conclude that the new treatment is better?

NOPE! Not just yet. There are, of course, more mathematically rigorous methods which we will look into in future posts, using the same dataset.

A note on the dataset: it comes from an exercise in the textbook “Survival Analysis – A Self-Learning Text” by David G. Kleinbaum and Mitchel Klein. It’s a very good book if you want to get into this subject more deeply. The datasets are available on their website: Survival Analysis Data

R-code

Last, we’ll look at the R-code to produce this plot.

The R code is shown on its own lines between the explanations.

Load the required packages (you may need to install them if you haven’t already):

library(GGally)
library(survival)
library(ggplot2)

Read in the dataset directly from the website as ‘anderson’, which is the name of the dataset, and add column names (the column names came from the book):

anderson=read.table("http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/anderson.dat", sep=" ", col.names=c("surv_time","relapse","sex","log_WBC","Rx"))

Now we must create a ‘survival object’ in order to analyze the data. The R function ‘Surv’ from the survival package does this. The code below adds the survival object to the anderson data frame, telling R that ‘relapse’ is the event of interest and ‘1’ represents an occurrence of the event of interest. It is not death, it is whether or not the patient’s leukemia relapsed. All patients were in remission previously.

anderson$SurvObj = with(anderson, Surv(surv_time, relapse == 1))

The survival object should be a vector like this: 35+ 34+ 32+ 32+ 25+ 23  22  20+ 19+ 17+ 16  13  11+ 10+ 10   9+  7   6+  6   6….

What are the plus signs after some of the numbers? You probably saw them on the plot above too. That means this patient was censored. We’ll ignore that for now and revisit it in a future post.
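As a small sketch of what the plus signs encode, the first eight values of the vector shown above can be rebuilt by hand. The times and the 0/1 status values below are read off that printed vector (a plus sign means status 0); the object names are illustrative only.

```r
library(survival)

# First eight entries of the printed vector: 35+ 34+ 32+ 32+ 25+ 23 22 20+
surv_time <- c(35, 34, 32, 32, 25, 23, 22, 20)
relapse   <- c(0, 0, 0, 0, 0, 1, 1, 0)   # 0 = censored (printed with "+"), 1 = relapse

# Build the survival object the same way as in the post.
s <- Surv(surv_time, relapse == 1)
print(s)

# Count the censored observations via the status column of the Surv object.
n_censored <- sum(s[, "status"] == 0)
print(n_censored)
```

Six of these eight observations are censored, which matches the six plus signs among the first eight printed values.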

The next line of code creates the data necessary for the survival curve. There’s a lot happening behind the scenes here, but basically it’s a summary of what we discussed above (the proportion surviving) for each group plus calculation of confidence intervals. Here the group is defined by the variable ‘Rx’ which tells us who received the new (0) or standard (1) treatment.

and_surv = survfit(SurvObj ~ Rx, data = anderson)
summary(and_surv)

By printing the summary of the and_surv object, we can see what this does.

time  n.risk  n.event  survival  std.err  lower 95% CI  upper 95% CI
6     21      3        0.857     0.0764   0.72          1
7     17      1        0.807     0.0869   0.653         0.996
10    15      1        0.753     0.0963   0.586         0.968

You should see a table like this for Rx = 0 and Rx = 1. Note the ‘survival’ column, which states the proportion surviving at time t. The calculation is straightforward for Rx = 1; however, the Rx = 0 group has censored data, so the calculation of the survival curve is more complicated. We’ll see how to do this in a future post.
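As a quick check on the three rows printed above (a sketch only, not the full treatment of censoring), the ‘survival’ column can be reproduced from ‘n.risk’ and ‘n.event’. At each event time, take the fraction of at-risk subjects who did not have the event, then multiply these fractions together as time goes on. This is the Kaplan-Meier (product-limit) calculation; the data frame name km is just illustrative.

```r
# The three rows of the summary table shown above, entered by hand.
km <- data.frame(
  time    = c(6, 7, 10),
  n.risk  = c(21, 17, 15),   # subjects still under observation just before each time
  n.event = c(3, 1, 1)       # relapses observed at each time
)

# Fraction of at-risk subjects who did NOT have the event at each time.
km$frac_no_event <- 1 - km$n.event / km$n.risk

# Kaplan-Meier estimate: running product of those fractions.
km$survival <- cumprod(km$frac_no_event)

print(round(km$survival, 3))
```

The rounded values agree with the ‘survival’ column in the table. Notice that ‘n.risk’ falls from 21 to 17 between weeks 6 and 7 even though only 3 events occurred; the extra drop is consistent with the censored ‘6+’ observation in the survival object printed earlier, and it is the reason the simple "deaths so far divided by starting count" formula no longer applies.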

Let’s plot the curve!

You can do it very simply by just:

plot(and_surv)

but this will give a pretty ugly and uninformative plot. I prefer using ggplot style plots. The code below uses the package GGally and the function ggsurv to plot the survival curves. Note the ‘scale_colour_manual’ which allows me to set the colors and legend manually. I’ve added theme_bw() for a minimalist plot and to remove the gray background.

p = ggsurv(and_surv,cens.col="blue")+theme_bw()+xlab("Time in Weeks")+
ylab("Proportion Surviving")+
ggtitle("Survival Curves for New Treatment and Standard Treatment")

p = p + guides(linetype = "none")

p = p+ scale_colour_manual(name = 'treatment', breaks = c(0,1), 
values = c("blue","indianred"),labels = c("New","Standard"))+
theme(legend.key = element_blank())

p

You can then save the plot to your working directory by:

ggsave("surv_plot.png",scale=1.2)

Here’s the code all together:

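# Load the required packages
library(GGally)
library(survival)
library(ggplot2)

# Read the dataset from the book's website and name the columns
anderson=read.table("http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/anderson.dat",
sep=" ", col.names=c("surv_time","relapse","sex","log_WBC","Rx"))

# Create the survival object; relapse == 1 marks an observed event
anderson$SurvObj = with(anderson, Surv(surv_time, relapse == 1))

# Estimate a survival curve for each treatment group
and_surv = survfit(SurvObj ~ Rx, data = anderson)

# Plot the curves with ggsurv and label the axes
p = ggsurv(and_surv,cens.col="blue")+theme_bw()+xlab("Time in Weeks")+
ylab("Proportion Surviving")+
ggtitle("Survival Curves for New Treatment and Standard Treatment")

# Hide the line-type legend
p = p + guides(linetype = "none")

# Set the colors and legend labels manually
p = p + scale_colour_manual(name="treatment",breaks = c(0,1),
values=c("blue","indianred"),labels = c("New","Standard"))+
theme(legend.key = element_blank())

p

# Save the most recently displayed plot to the working directory
ggsave("surv_plot.png",scale=1.2)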
