Talk to biostatisticians about survival analysis and you will find they live and breathe this stuff. Yet I know myriad data scientists who do not know these models, or who are unaware of how relevant they are outside the domain they were originally developed for. Survival analysis models are non-parametric or semi-parametric statistical methods from biostatistics which are also used extensively to model financial events like mortgage defaults (which is where I used them).
Consider a clinical trial where patients have been given a drug, and every month they are tested for the underlying condition and its stage. Let us say there are 4 stages to this disease; additionally there is the hazardous event of death, and of course the wonderful event of remission, for a total of 6 possible states. At time t = 0, someone might come in with a Stage 3 diagnosis. In their test the next month, they can move to any of the other states (including death or remission).
Now many data scientists would think this sounds very similar to a multiclass classification problem. But there are a few key differences, the primary one being that we care deeply about the time to event rather than just the probability of the event. We all know that the hazardous event (death) will eventually come; what we want to know is when it is likely to come, and what the probability is of it coming in the next month, the month after that, and so on.
The second difference is the concept of censoring. We will discuss censoring often in this series, because machine learning algorithms do not handle censoring well, which makes them unsuitable in certain situations. Let us say you are analyzing a dataset today, and the data was collected between January 1st and December 31st of the previous year. In that period, two groups of people were given either a placebo or the trial drug. Within that 12-month window, a few things can happen:
• The person completes the study and survives through it.
• The person experiences the hazardous event during the study period.
• The person drops out of the study before it ends.
Thus, the time-to-event data can be censored, meaning that for some subjects, the exact event time is not observed within the study period. This is a unique feature of survival data that needs to be handled appropriately.
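To make this concrete, here is a minimal, hand-rolled Kaplan-Meier-style survival estimate on a small invented dataset (the durations and event flags below are illustrative, not from the article). The key point: a censored subject stays in the "at risk" pool up to the time they drop out, but never counts as an event.

```python
import pandas as pd

# Toy censored data (invented for illustration):
# duration = observed follow-up time, event = 1 if the event was observed,
# event = 0 if the subject was censored at that time.
df = pd.DataFrame({
    "duration": [2, 3, 3, 5, 6, 7, 7, 8],
    "event":    [1, 1, 0, 1, 0, 1, 0, 1],
})

# Kaplan-Meier product-limit estimate:
# S(t) = product over event times t_i <= t of (1 - d_i / n_i),
# where d_i = events at t_i and n_i = subjects at risk just before t_i.
surv = 1.0
km = {}
for t in sorted(df.loc[df["event"] == 1, "duration"].unique()):
    n_at_risk = (df["duration"] >= t).sum()   # censored subjects still count here
    d = ((df["duration"] == t) & (df["event"] == 1)).sum()
    surv *= 1 - d / n_at_risk
    km[t] = surv

print(km)  # survival probability after each observed event time
```

Notice how the subject censored at t = 3 contributes to the risk set at t = 2 and t = 3 but never to the event counts; simply dropping censored rows (as a naive classifier would) biases the estimate.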
Now if you think about this for a moment, the use cases in finance become obvious. Let us say a loan has been issued. The likelihood that it will default (the event) in any of the 120 months of its term (the time to event) is of great importance to the financial institution that wrote the loan. Mortgages in particular also see prepayments (refinancing), which are terminations but not default events. And just as important, you will have a large number of live loans that may default later but are only observed as of today, which is exactly the censoring problem.
In short, the key concepts here are:
Event of Interest: The specific outcome we're interested in studying. Examples include death, machine failure, default etc.
Time to Event: The primary (dependent) variable, which is the time duration from a defined starting point to the occurrence of the event. This is often called "survival time."
Censoring: This occurs when the exact time-to-event is unknown for some subjects. This can happen if a subject leaves the study early, the study ends before the event occurs, or if the event never happens during the study period. Censoring is a unique challenge in survival analysis and must be handled appropriately.
Additionally, it should be noted that survival models are non-parametric or semi-parametric, so unlike most parametric statistical models there are no distributional assumptions on the data or the error terms. And because they are statistical models (and not machine learning algorithms), they also work well on limited data.
---
Now let us look at a simple example.
| PatientID | Age | Treatment | Start_Time | End_Time | Event |
|-----------|-----|-----------|------------|----------|-------|
| 1 | 68 | A | 0 | 7 | 1 |
| 2 | 58 | B | 0 | 9 | 0 |
| 3 | 44 | A | 0 | 7 | 0 |
| 4 | 72 | B | 0 | 1 | 1 |
| 5 | 37 | B | 0 | 1 | 1 |
| 6 | 50 | B | 0 | 9 | 0 |
| 7 | 68 | A | 0 | 9 | 0 |
| 8 | 48 | A | 0 | 4 | 1 |
| 9 | 52 | A | 0 | 9 | 0 |
| 10 | 40 | A | 0 | 3 | 0 |
The data is largely self-explanatory, but in short:
• PatientID: A unique identifier assigned to each patient in the study.
• Age: The age of the patient at the start of the observation period.
• Treatment: The type of treatment the patient received. There are two categories:
'A': Treatment type A
'B': Treatment type B
• Start_Time: The start time of the observation period, which is set to 0 for all patients as the observation begins uniformly.
• End_Time: The time (in months) at which either the event occurred or the observation was censored. This represents the duration from the start time to the event or censoring.
• Event: A binary indicator of whether the event of interest (e.g., relapse, failure) occurred:
1: The event occurred
0: The data was censored (i.e., the event did not occur during the observation period)
This data dictionary provides an overview of each variable in the dataset, its data type, and a brief description of what it represents.
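The table above can be loaded into a pandas DataFrame for analysis. A minimal sketch (the values are copied directly from the example table):

```python
import pandas as pd

# The example dataset from the table above.
df = pd.DataFrame({
    "PatientID":  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Age":        [68, 58, 44, 72, 37, 50, 68, 48, 52, 40],
    "Treatment":  ["A", "B", "A", "B", "B", "B", "A", "A", "A", "A"],
    "Start_Time": [0] * 10,
    "End_Time":   [7, 9, 7, 1, 1, 9, 9, 4, 9, 3],
    "Event":      [1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
})

print(df["Event"].sum())        # 4 patients had the event observed
print((df["Event"] == 0).sum()) # 6 patients are censored
print(df["End_Time"].max())     # 9, the length of the observation period
```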
---
Now, the longest End_Time is the duration of the study or observation period (in this case, 9 months). The event may have occurred within these 9 months, it may occur after the study ends, or in some cases (like Row 10) the patient may have dropped out of the study early.
The standard Python package for these models is lifelines, and in this case we will fit a Cox Proportional Hazards model.
Let us jump straight into the code and the inferences.