This post summarizes some very basic concepts in survival analysis from this amazing tutorial, which helped me a lot.

Chapter 4: Estimating the Survival or Hazard Function (Non-Parametric)

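If there is no censored data, the empirical estimate of the survival distribution is simply

$$\hat{S}(t) = \frac{\#\{\text{individuals with } T \ge t\}}{n}$$

where $n$ is the total sample size.

In practice, most software packages evaluate the estimate just after time $t$, so they count individuals with $T > t$. We represent it as $\hat{S}(t^+)$.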
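A minimal sketch of the empirical estimate without censoring, in Python. The function name `empirical_survival` and the toy death times are made up for illustration; the `strict` flag switches between counting individuals with $T \ge t$ and counting those with $T > t$ (the just-after-$t$ convention).

```python
from typing import Sequence

def empirical_survival(times: Sequence[float], t: float, strict: bool = False) -> float:
    """Fraction of individuals still alive at t, assuming no censoring."""
    n = len(times)
    if strict:
        alive = sum(1 for x in times if x > t)   # T > t  (just after t)
    else:
        alive = sum(1 for x in times if x >= t)  # T >= t
    return alive / n

death_times = [2, 3, 3, 5, 8]  # hypothetical, fully observed death times
s_at_3 = empirical_survival(death_times, 3)
s_after_3 = empirical_survival(death_times, 3, strict=True)
print(s_at_3, s_after_3)
```

The two conventions only differ at the observed death times themselves; between death times they give the same value.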

There are three common non-parametric methods for estimating the survival function $S(t)$.

  • (1) Kaplan-Meier
  • (2) Life-table (Actuarial Estimator)
  • (3) via Cumulative hazard estimator

The Kaplan-Meier Estimator

This is the most popular and best-known estimator. It can be justified from three perspectives:

  • Product-limit estimator (conditional probability)
  • Maximum likelihood
  • Redistribute-to-the-right estimator

As stated in the previous section, if there is no censoring, calculating the empirical survival proportion is enough.

What about censored data? Let us start from the conditional probability perspective.

Perspective 1: Conditional Probability

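In the previous section, we showed the relationship between $S(t)$ and $\lambda_j$ for the discrete r.v. case, using conditional probability. Here $a_1 < a_2 < \dots$ are the possible failure times and $\lambda_j = P(T = a_j \mid T \ge a_j)$ is the discrete hazard at $a_j$.

Suppose $a_j < t \le a_{j+1}$, then

$$S(t) = P(T \ge a_{j+1}) = P(T \ge a_1) \prod_{k=1}^{j} P(T \ge a_{k+1} \mid T \ge a_k) = \prod_{k=1}^{j} (1 - \lambda_k)$$

Therefore,

$$\hat{S}(t) = \prod_{k=1}^{j} \left(1 - \frac{d_k}{r_k}\right)$$

where $d_k$ is the number of deaths at time $a_k$ and $r_k$ is the number of individuals alive right before time $a_k$.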

Censored individuals are removed from the at-risk count $r_k$ if they are censored before $a_k$.

For example, if a patient is censored in $[a_j, a_{j+1})$, then

  • this patient is a member of $r_j$, as his/her survival time is NOT earlier than $a_j$
  • this patient is NOT a member of $r_{j+1}$, as his/her follow-up ends earlier than $a_{j+1}$
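The at-risk bookkeeping in this example can be sketched in a few lines of Python. The helper `at_risk` and the four toy observations are hypothetical; the only rule used is that an individual is at risk at a time if his/her observed time (death or censoring) is not earlier than that time.

```python
def at_risk(observed_times, a):
    """Number of individuals whose observed time (death or censoring) is >= a."""
    return sum(1 for x in observed_times if x >= a)

# Hypothetical sample: deaths at 2, 5 and 7, one patient censored at 3.
# The death times a_1 = 2 and a_2 = 5 bracket the censoring time.
observed = [2, 3, 5, 7]

r_1 = at_risk(observed, 2)  # the censored patient is still counted here
r_2 = at_risk(observed, 5)  # the censored patient has dropped out
print(r_1, r_2)
```

Here the patient censored at 3 contributes to the denominator at the first death time but not at the second one.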

Formal Description of Kaplan-Meier estimator

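$$\hat{S}(t) = \prod_{j:\, \tau_j < t} \frac{r_j - d_j}{r_j} = \prod_{j:\, \tau_j < t} \left(1 - \frac{d_j}{r_j}\right)$$

where

  • $\tau_1 < \tau_2 < \dots < \tau_K$ is the set of $K$ distinct death times observed in the sample
  • $d_j$: number of deaths observed at time $\tau_j$
  • $r_j$: number of individuals alive right before time $\tau_j$ (i.e. everyone who died or was censored at or after $\tau_j$)
  • $c_j$: number of censored observations in $[\tau_j, \tau_{j+1})$

Two useful formulas:

$$r_j = r_{j-1} - d_{j-1} - c_{j-1}$$

$$r_j = \sum_{l \ge j} (c_l + d_l)$$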
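The bookkeeping above can be sketched in Python. This is only an illustration under stated assumptions: the function names `km_table` and `km_survival` and the toy data are made up, censored observations are counted in the interval from one death time up to (but not including) the next, and the product runs over death times strictly before $t$. Passing `inclusive=True` also includes a death at exactly $t$, which matches the just-after-$t$ convention.

```python
def km_table(times, events):
    """Return rows (tau_j, d_j, r_j, c_j), one per distinct death time.

    times  : observed follow-up times
    events : 1 if the time is a death, 0 if it is a censoring
    """
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    rows = []
    for j, tau in enumerate(death_times):
        nxt = death_times[j + 1] if j + 1 < len(death_times) else float("inf")
        d = sum(1 for t, e in zip(times, events) if e == 1 and t == tau)
        r = sum(1 for t in times if t >= tau)  # alive right before tau
        c = sum(1 for t, e in zip(times, events) if e == 0 and tau <= t < nxt)
        rows.append((tau, d, r, c))
    return rows

def km_survival(rows, t, inclusive=False):
    """Product of (1 - d_j / r_j) over death times before t (or up to t)."""
    s = 1.0
    for tau, d, r, _ in rows:
        if tau < t or (inclusive and tau == t):
            s *= 1 - d / r
    return s

# Hypothetical sample: 4 deaths and 3 censored observations.
times = [2, 3, 3, 5, 6, 8, 9]
events = [1, 0, 1, 1, 0, 1, 0]
rows = km_table(times, events)

# First formula: r_j = r_{j-1} - d_{j-1} - c_{j-1}
for (_, d_prev, r_prev, c_prev), (_, _, r, _) in zip(rows, rows[1:]):
    assert r == r_prev - d_prev - c_prev

# Second formula: r_j = sum over l >= j of (c_l + d_l)
for j, (_, _, r, _) in enumerate(rows):
    assert r == sum(d + c for _, d, _, c in rows[j:])

s_4 = km_survival(rows, 4)
print(rows)
print(s_4)
```

With this sample the estimate at $t = 4$ is $(1 - 1/7)(1 - 1/6) = 5/7$.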

Some interesting facts

  • $\hat{S}(t)$ only changes at death times
  • $\hat{S}(t)$ is 1 until the first death happens
  • $\hat{S}(t)$ goes to 0 if the last event is a death

Perspective 2: Maximum Likelihood

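For a discrete failure time variable with possible failure times $a_1 < a_2 < \dots$, define

  • $d_j$: number of failures at $a_j$
  • $r_j$: number of individuals alive right before $a_j$
  • $\lambda_j$: probability of dying at $a_j$, given being alive before $a_j$, i.e. $\lambda_j = P(T = a_j \mid T \ge a_j)$

Suppose that there are $K$ distinct time points. Then the likelihood of observing $d_1, \dots, d_K$ is

$$L(\lambda_1, \dots, \lambda_K) = \prod_{j=1}^{K} \lambda_j^{d_j} (1 - \lambda_j)^{r_j - d_j}$$

which is just a product of independent binomial likelihoods: at each $a_j$, each of the $r_j$ individuals at risk is a Bernoulli trial with death probability $\lambda_j$.

It is then easy to see that the maximum likelihood estimator of $\lambda_j$ is

$$\hat{\lambda}_j = \frac{d_j}{r_j}$$

Plugging $\hat{\lambda}_j$ into $S(t) = \prod_{j:\, a_j < t} (1 - \lambda_j)$ gives the Kaplan-Meier estimator.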
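To see why, note that each time point contributes a factor of the form $\lambda^{d}(1 - \lambda)^{r - d}$, whose log has derivative $d/\lambda - (r - d)/(1 - \lambda)$; setting this to zero gives $\lambda = d/r$. The small Python check below confirms this numerically for one time point. The counts `d = 3`, `r = 10` and the grid resolution are arbitrary choices for illustration.

```python
import math

def log_lik(lam, d, r):
    """Log-likelihood of d deaths among r individuals at risk (one time point)."""
    return d * math.log(lam) + (r - d) * math.log(1 - lam)

d, r = 3, 10  # hypothetical counts at a single death time

# Evaluate the log-likelihood on a grid strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
best = max(grid, key=lambda lam: log_lik(lam, d, r))
print(best, d / r)
```

Because the likelihood factorizes over time points, each $\lambda_j$ can be maximized separately in this way.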

Perspective 3: Redistribute to the Right Estimator

Please refer to p57-58 in the tutorial.

Properties of KM estimator

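If there is no censoring,

$$\hat{S}(t) = \frac{\#\{\text{individuals with } T \ge t\}}{n}$$

This is like an estimated probability from a binomial distribution, thus we know

$$\hat{S}(t) \simeq N\!\left(S(t),\ \frac{S(t)\,[1 - S(t)]}{n}\right)$$

and we know the approximate variance.

$\simeq$ means “asymptotically equal”.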

How does censoring affect this?

  • $\hat{S}(t)$ is still approximately normal
  • The mean of $\hat{S}(t)$ converges to the true $S(t)$
  • The variance is a bit more complicated

Once we have the variance, we can construct (pointwise) confidence intervals (not bands) for $S(t)$, by

$$\hat{S}(t) \pm z_{1-\alpha/2}\, \widehat{se}[\hat{S}(t)]$$
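As a rough sketch of such an interval in Python: the helper name `pointwise_ci`, the sample size, and the estimate `0.6` are made up, and the standard error used here is the binomial one, which only applies when there is no censoring. With censoring, a different variance estimate would be plugged in. Clipping the interval to $[0, 1]$ is a convenience choice of this sketch, not part of the formula.

```python
import math
from statistics import NormalDist

def pointwise_ci(s_hat, se, level=0.95):
    """Normal-approximation interval: s_hat +/- z * se, clipped to [0, 1]."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = max(0.0, s_hat - z * se)
    upper = min(1.0, s_hat + z * se)
    return lower, upper

n = 50       # hypothetical sample size, no censoring assumed
s_hat = 0.6  # hypothetical estimate of S(t) at one fixed t

# Binomial standard error, valid only in the no-censoring case.
se = math.sqrt(s_hat * (1 - s_hat) / n)

lo, hi = pointwise_ci(s_hat, se)
print(round(lo, 3), round(hi, 3))
```

The interval is pointwise: it is computed for one fixed $t$ and says nothing about simultaneous coverage over a range of times.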

(More about pointwise confidence interval: to be continued..)
