This post summarizes some very basic concepts in survival analysis from this amazing tutorial, which helped me a lot.
Chapter 4: Estimating the Survival or Hazard Function (Non-Parametric)
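If there is no censored data, the empirical estimate of the survival distribution is simply
$$\hat{S}(t) = \frac{\#\{i : T_i \ge t\}}{n}$$
where $n$ is the total sample size.
In practice, most software packages count individuals with $T_i > t$, i.e. they evaluate the estimate just after time $t$. We represent it as $\hat{S}(t^+)$.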
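A minimal sketch of the no-censoring estimate, to make the difference between the two counting conventions concrete. The function name `empirical_survival`, the `strict` flag and the sample data are made up for illustration.

```python
from typing import Sequence


def empirical_survival(times: Sequence[float], t: float, strict: bool = False) -> float:
    """Empirical survival proportion for fully observed (uncensored) death times.

    strict=False counts individuals with T >= t.
    strict=True counts individuals with T > t (the estimate just after t).
    """
    n = len(times)
    if n == 0:
        raise ValueError("need at least one observation")
    if strict:
        alive = sum(1 for x in times if x > t)
    else:
        alive = sum(1 for x in times if x >= t)
    return alive / n


# Hypothetical, fully observed death times (no censoring)
sample = [2, 3, 3, 5, 8]
print(empirical_survival(sample, 3))               # counts T >= 3: four of five
print(empirical_survival(sample, 3, strict=True))  # counts T > 3: two of five
```

The two conventions only differ at times where a death is actually observed.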
There are three common non-parametric methods to estimate the survival function $S(t)$.
- (1) Kaplan-Meier
- (2) Life-table (Actuarial Estimator)
- (3) via the cumulative hazard estimator
The Kaplan-Meier Estimator
This is the most popular and best-known estimator. It can be justified from three perspectives:
- Product-limit estimator (conditional probability)
- Maximum likelihood estimator
- Redistribute-to-the-right estimator
As stated in the previous section, if there is no censoring, calculating the empirical survival proportion is enough.
What about censored data? Let us start from the conditional probability perspective.
Perspective 1: Conditional Probability
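In the previous section, we showed the relationship between $S(t)$ and the hazard $\lambda_j = P(T = a_j \mid T \ge a_j)$ for the discrete r.v. case, using conditional probability.
Suppose $a_k < t \le a_{k+1}$, then
$$S(t) = P(T \ge a_{k+1}) = P(T \ge a_1) \prod_{j=1}^{k} P(T \ge a_{j+1} \mid T \ge a_j) = \prod_{j=1}^{k} (1 - \lambda_j)$$
since $P(T \ge a_1) = 1$.
Therefore,
$$\hat{S}(t) = \prod_{j: a_j < t} \left(1 - \frac{d_j}{r_j}\right)$$
where $d_j$ is the number of deaths at time $a_j$ and $r_j$ is the number of individuals alive right before time $a_j$.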
Censored individuals are removed from $r_j$ if they are censored before $a_j$.
For example, if a patient is censored between $a_j$ and $a_{j+1}$, then
- this patient is counted in $r_j$, as his/her follow-up time is NOT earlier than $a_j$
- this patient is NOT counted in $r_{j+1}$, as his/her follow-up ends earlier than $a_{j+1}$
Formal Description of Kaplan-Meier estimator
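$$\hat{S}(t) = \prod_{j: \tau_j < t} \left(1 - \frac{d_j}{r_j}\right)$$
where
- $\tau_1, \ldots, \tau_K$ is the set of $K$ distinct death times observed in the sample
- $d_j$: number of deaths observed at time $\tau_j$
- $r_j$: number of individuals alive right before time $\tau_j$ (i.e. everyone who died or was censored at or after $\tau_j$)
- $c_j$: number of censored observations in $[\tau_j, \tau_{j+1})$
Two useful formulas:
$$r_j = r_{j-1} - d_{j-1} - c_{j-1}$$
$$r_j = \sum_{l \ge j} (c_l + d_l)$$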
Some interesting facts
- $\hat{S}(t)$ only changes at death times
- $\hat{S}(t)$ is 1 until the first death happens
- $\hat{S}(t)$ goes to 0 only if the last event is a death
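The definitions above translate almost directly into code. Below is a minimal sketch of the product-limit calculation in plain Python, not taken from the tutorial. The function name `kaplan_meier`, the 0/1 event coding and the sample data are assumptions for illustration. Censorings tied with a death time are treated as happening just after the death, so they still count as at risk.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One row per distinct death time: (tau_j, d_j, c_j, r_j, S_hat just after tau_j)
Row = Tuple[float, int, int, int, float]


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Row]:
    """Product-limit estimate; events[i] is 1 for a death and 0 for a censoring."""
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    death_times = sorted(deaths)
    rows: List[Row] = []
    surv = 1.0
    for j, tau in enumerate(death_times):
        next_tau = death_times[j + 1] if j + 1 < len(death_times) else float("inf")
        d = deaths[tau]
        # r_j: everyone who died or was censored at or after tau_j
        r = sum(1 for t in times if t >= tau)
        # c_j: censorings in [tau_j, tau_{j+1}), ties with tau_j included
        c = sum(1 for t, e in zip(times, events) if e == 0 and tau <= t < next_tau)
        surv *= 1.0 - d / r
        rows.append((tau, d, c, r, surv))
    return rows


# Made-up sample: 1 = death, 0 = censored
times = [2, 3, 3, 5, 6, 8]
events = [1, 1, 0, 1, 0, 1]
for tau, d, c, r, s in kaplan_meier(times, events):
    print(f"tau={tau} d={d} c={c} r={r} S_hat={s:.4f}")
```

On this sample the risk sets are 6, 5, 3, 1, and the estimate just after each death time is 5/6, 2/3, 4/9 and 0. It reaches 0 because the last event is a death.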
Perspective 2: Maximum Likelihood
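For a discrete failure time variable, define
- $d_j$: number of failures at $a_j$
- $r_j$: number of individuals alive right before $a_j$
- $\lambda_j$: probability of dying at $a_j$, given being alive right before $a_j$
Suppose that there are $g$ distinct time points $a_1 < \dots < a_g$; then the likelihood of observing $(d_1, \ldots, d_g)$ is
$$L(\lambda) = \prod_{j=1}^{g} \lambda_j^{d_j} (1 - \lambda_j)^{r_j - d_j}$$
which is just a set of independent Bernoulli trials: each of the $r_j$ individuals at risk at $a_j$ dies with probability $\lambda_j$.
Then it is pretty obvious that the maximum likelihood estimator of $\lambda_j$ is
$$\hat{\lambda}_j = \frac{d_j}{r_j}$$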
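For readers who want the "obvious" step spelled out, here is a short sketch of the maximization. It assumes the likelihood is a product of one binomial-type term per time point, with $d_j$ deaths among $r_j$ individuals at risk and death probability $\lambda_j$. The index range $j = 1, \ldots, g$ is my own notation.

$$
\begin{aligned}
\log L(\lambda) &= \sum_{j=1}^{g} \left[ d_j \log \lambda_j + (r_j - d_j) \log (1 - \lambda_j) \right] \\
\frac{\partial \log L}{\partial \lambda_j} &= \frac{d_j}{\lambda_j} - \frac{r_j - d_j}{1 - \lambda_j} = 0 \\
\Rightarrow \quad \hat{\lambda}_j &= \frac{d_j}{r_j}
\end{aligned}
$$

Plugging $\hat{\lambda}_j$ into $S(t) = \prod_{j: a_j < t} (1 - \lambda_j)$ gives back the product-limit form $\prod_{j: a_j < t} (1 - d_j / r_j)$.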
Perspective 3: Redistribute to the Right Estimator
Please refer to p57-58 in the tutorial.
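The tutorial pages are not reproduced here, so what follows is only a sketch of the idea as it is usually described, not the tutorial's own presentation. Start with mass $1/n$ on every observation. Then, moving from left to right, take the mass sitting on each censored observation and spread it equally over all observations to its right. The mass that ends up on each death time is the size of the Kaplan-Meier jump there. The function name and the sample data are made up.

```python
from typing import Dict, Sequence


def redistribute_to_the_right(times: Sequence[float], events: Sequence[int]) -> Dict[float, float]:
    """Return the probability mass that ends up on each distinct death time.

    events[i] is 1 for a death and 0 for a censoring.
    """
    n = len(times)
    if n != len(events):
        raise ValueError("times and events must have the same length")
    # Sort by time; at tied times, deaths come before censorings
    order = sorted(range(n), key=lambda i: (times[i], -events[i]))
    mass = [1.0 / n] * n  # mass[k] belongs to the k-th observation in sorted order
    for k, i in enumerate(order):
        if events[i] == 0:
            remaining = n - k - 1
            if remaining > 0:
                share = mass[k] / remaining
                for m in range(k + 1, n):
                    mass[m] += share
                mass[k] = 0.0
            # If nothing lies to the right, the mass stays where it is:
            # the survival estimate then never reaches 0.
    jumps: Dict[float, float] = {}
    for k, i in enumerate(order):
        if events[i] == 1:
            jumps[times[i]] = jumps.get(times[i], 0.0) + mass[k]
    return jumps


# Made-up sample: 1 = death, 0 = censored
print(redistribute_to_the_right([2, 3, 3, 5, 6, 8], [1, 1, 0, 1, 0, 1]))
```

On this sample the masses on the death times 2, 3, 5, 8 are 1/6, 1/6, 2/9 and 4/9. These are exactly the drops of the product-limit estimate 1, 5/6, 2/3, 4/9, 0 computed from the same data.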
Properties of KM estimator
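If there is no censoring,
$$\hat{S}(t) = \hat{S}_{KM}(t) = \frac{\#\{i : T_i \ge t\}}{n}$$
where $n$ is the number of individuals in the study.
This is like an estimated probability from a binomial distribution, thus we know
$$\hat{S}(t) \simeq N\left(S(t), \frac{S(t)\,[1 - S(t)]}{n}\right)$$
and we know the approximate variance, $S(t)\,[1 - S(t)] / n$.
$\simeq$ means “asymptotically equal”.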
How does censoring affect this?
- $\hat{S}(t)$ is still approximately normal
- The mean of $\hat{S}(t)$ converges to the true $S(t)$
- The variance is a bit more complicated
Once we get the variance, we can construct (pointwise) $(1 - \alpha)$ confidence intervals (not bands) about $\hat{S}(t)$, by
$$\hat{S}(t) \pm z_{1 - \alpha/2} \, \mathrm{se}[\hat{S}(t)]$$
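As a small illustration, here is a sketch of the pointwise interval in the simplest setting, the no-censoring case, where the binomial variance $\hat{S}(t)[1 - \hat{S}(t)]/n$ can be plugged in directly. With censoring the standard error would need a different formula, which is not covered here. The function name, the sample data and the clipping of the interval to $[0, 1]$ are my own choices.

```python
from math import sqrt
from statistics import NormalDist
from typing import Sequence, Tuple


def pointwise_ci_no_censoring(
    times: Sequence[float], t: float, alpha: float = 0.05
) -> Tuple[float, float, float]:
    """Return (S_hat, lower, upper) at time t for fully observed death times."""
    n = len(times)
    if n == 0:
        raise ValueError("need at least one observation")
    s_hat = sum(1 for x in times if x >= t) / n
    # Binomial plug-in standard error; valid only when nothing is censored
    se = sqrt(s_hat * (1.0 - s_hat) / n)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Clipping to [0, 1] is an illustrative choice, not part of the formula
    lower = max(0.0, s_hat - z * se)
    upper = min(1.0, s_hat + z * se)
    return s_hat, lower, upper


# Hypothetical, fully observed death times
sample = [2, 3, 3, 5, 8, 9, 11, 12, 15, 20]
s_hat, lower, upper = pointwise_ci_no_censoring(sample, 8)
print(f"S_hat(8) = {s_hat:.2f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The interval is pointwise: it covers $S(t)$ at one fixed $t$ and says nothing about the whole curve at once.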
(More about pointwise confidence interval: to be continued..)