
Creating cohorts for survival analyses
Source: vignettes/a00_Creating_cohorts_for_survival.Rmd
Set up
Let us first load the packages required.
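A minimal set-up, assuming the rest of this vignette only needs CohortSurvival for the estimation and table functions and CDMConnector for cdmDisconnect() (all other calls are namespaced explicitly), could look like this:

```r
# Assumed package list; adjust it to what your own analysis uses.
library(CohortSurvival) # mockMGUS2cdm(), estimateSingleEventSurvival(), tableSurvival()
library(CDMConnector)   # cdmDisconnect()
```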
We will use the example MGUS2 survival dataset included in CohortSurvival. In practice you would create a CDM reference with CDMConnector and then add the target, outcome, and optional competing outcome cohorts needed for the analysis.
cdm <- CohortSurvival::mockMGUS2cdm()
Cohorts needed for survival
A CohortSurvival analysis starts from OMOP cohort tables. Each cohort table needs the standard cohort columns:
cdm$mgus_diagnosis |>
dplyr::select(cohort_definition_id, subject_id, cohort_start_date, cohort_end_date) |>
dplyr::glimpse()
#> Rows: ??
#> Columns: 4
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
#> $ cohort_start_date <date> 1981-01-01, 1968-01-01, 1980-01-01, 1977-01-01, …
#> $ cohort_end_date <date> 1981-01-01, 1968-01-01, 1980-01-01, 1977-01-01, …
The target cohort defines who is at risk and when follow-up starts.
In most analyses, cohort_start_date is the index date.
cohort_end_date can also matter: if
censorOnCohortExit = TRUE, follow-up is censored at target
cohort exit.
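
The effect of censoring at cohort exit can be sketched on toy data. This is an illustration of the idea only, not CohortSurvival's internal implementation: the data, the `observation_end` date, the column names, and the convention that an event on the censoring date still counts are all assumptions made for the example.

```r
library(dplyr)

# Toy target cohort (hypothetical data, not taken from the mock CDM)
target <- tibble(
  subject_id = 1:3,
  cohort_start_date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01")),
  cohort_end_date = as.Date(c("2020-03-01", "2020-12-31", "2020-06-30"))
)

# Toy outcome dates; subject 3 never has the outcome
outcome <- tibble(
  subject_id = c(1L, 2L),
  event_date = as.Date(c("2020-04-01", "2020-02-15"))
)

# Assumed common end of observation for everyone
observation_end <- as.Date("2020-12-31")

follow_up <- target |>
  left_join(outcome, by = "subject_id") |>
  mutate(
    # Censor at cohort exit or end of observation, whichever comes first
    censor_date = pmin(cohort_end_date, observation_end),
    # An event only counts if it happens on or before the censoring date
    status = as.integer(!is.na(event_date) & event_date <= censor_date),
    end_date = if_else(status == 1L, event_date, censor_date),
    time = as.integer(end_date - cohort_start_date)
  )

follow_up |>
  select(subject_id, time, status)
```

Subject 1 has an event after leaving the target cohort, so under this rule the record is censored at cohort exit rather than counted as an event.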
The outcome cohort defines the event of interest. By default
CohortSurvival uses cohort_start_date in the outcome cohort
as the event date, but this can be changed with
outcomeDateVariable.
cdm$death_cohort |>
dplyr::select(cohort_definition_id, subject_id, cohort_start_date, cohort_end_date) |>
dplyr::glimpse()
#> Rows: ??
#> Columns: 4
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 1…
#> $ cohort_start_date <date> 1981-01-31, 1968-01-26, 1980-02-16, 1977-04-03, …
#> $ cohort_end_date <date> 1981-01-31, 1968-01-26, 1980-02-16, 1977-04-03, …
For competing-risk analyses, a third cohort table defines the competing outcome. In this example, disease progression is the event of interest and death is the competing outcome.
cdm$progression |>
dplyr::select(cohort_definition_id, subject_id, cohort_start_date, cohort_end_date) |>
dplyr::glimpse()
#> Rows: ??
#> Columns: 4
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id <int> 56, 81, 83, 111, 124, 127, 147, 163, 165, 167, 18…
#> $ cohort_start_date <date> 1978-01-30, 1985-01-15, 1974-08-17, 1993-01-14, …
#> $ cohort_end_date <date> 1978-01-30, 1985-01-15, 1974-08-17, 1993-01-14, …
Target cohort design
The target cohort should represent the population and time origin for the study question. A common pattern is one record per person at first eligible diagnosis or treatment start, but repeated target cohort entries can be valid if the estimand is episode-based rather than person-based.
When designing a target cohort, decide:
| Question | Why it matters |
|---|---|
| What is the index date? | Survival time starts at target cohort entry. |
| Are people allowed to enter more than once? | Repeated records change the interpretation from people to episodes. |
| Is prior observation required? | Washout and baseline characteristics are only meaningful when prior observation is available. |
| Should follow-up end at cohort exit? | This determines whether censorOnCohortExit = TRUE is appropriate. |
| Which covariates are needed as strata or weights? | Strata and weights must be columns in the target cohort table before estimation. |
For example, a target cohort for survival after MGUS diagnosis might require a first observed MGUS diagnosis after at least 365 days of prior observation. A target cohort for survival after treatment start might instead index on first treatment after diagnosis.
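
The selection logic for the first example can be sketched with dplyr on toy data. The tables `diagnoses` and `observation` and their columns are hypothetical; in a real study the same steps would be applied to CDM tables, for instance with a cohort-building package.

```r
library(dplyr)

# Hypothetical diagnosis records: subject 1 has two diagnoses
diagnoses <- tibble(
  subject_id = c(1L, 1L, 2L, 3L),
  diagnosis_date = as.Date(c("2015-06-01", "2017-02-01", "2016-03-10", "2014-01-20"))
)

# Hypothetical start of each person's observation period
observation <- tibble(
  subject_id = 1:3,
  observation_start = as.Date(c("2010-01-01", "2016-01-01", "2012-05-01"))
)

target <- diagnoses |>
  inner_join(observation, by = "subject_id") |>
  # Step 1: keep the first diagnosis per person
  group_by(subject_id) |>
  slice_min(diagnosis_date, n = 1, with_ties = FALSE) |>
  ungroup() |>
  # Step 2: require at least 365 days of prior observation at that diagnosis
  mutate(prior_observation = as.integer(diagnosis_date - observation_start)) |>
  filter(prior_observation >= 365) |>
  # Shape the result like a cohort table, indexed on the diagnosis date
  transmute(
    cohort_definition_id = 1L,
    subject_id,
    cohort_start_date = diagnosis_date,
    cohort_end_date = diagnosis_date
  )

target
```

The order of the two steps is itself a design choice: taking the first diagnosis and then requiring prior observation drops people whose first diagnosis came too early, whereas reversing the steps would index them on a later diagnosis.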
Outcome cohort design
The outcome cohort should contain the first relevant event dates for the outcome definition you want to study. For a death outcome, you may need to create a cohort from the CDM death table. CohortConstructor provides helpers for this, for example:
cdm <- CohortConstructor::deathCohort(
cdm = cdm,
name = "death_cohort",
subsetCohort = "mgus_diagnosis"
)
For clinical outcomes, the outcome cohort is usually created from diagnosis, procedure, drug, measurement, or observation records. The exact cohort definition is study-specific, but the resulting table should be a normal OMOP cohort table.
Outcome washout is applied relative to target cohort entry. With
outcomeWashout = Inf, people with any outcome before
index are excluded from the survival analysis. With
outcomeWashout = 0, prior outcomes are not used to exclude
target cohort records. A finite value, such as
outcomeWashout = 365, excludes people with an outcome in
the 365 days before index.
estimateSingleEventSurvival(
cdm = cdm,
targetCohortTable = "mgus_diagnosis",
outcomeCohortTable = "death_cohort",
outcomeWashout = 365,
followUpDays = 365
) |>
tableSurvival()
| Data source | Target cohort | Outcome name | Number records | Number events | Median survival (95% CI) | Restricted mean survival (95% CI) |
|---|---|---|---|---|---|---|
| mock | mgus_diagnosis | death_cohort | 1,384 | 962 | 98.00 (92.00, 103.00) | 129.00 (122.00, 135.00) |
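
The three washout settings described above can be sketched on toy data. `apply_washout()` is a hypothetical helper written for this illustration, and the boundary conventions it uses (outcomes on the index date are not treated as prior, and the window is inclusive at its far end) are assumptions that may differ from what CohortSurvival does.

```r
library(dplyr)

# Hypothetical helper: drop target records with an outcome in the washout window
apply_washout <- function(target, outcome, washout) {
  excluded <- target |>
    inner_join(outcome, by = "subject_id") |>
    mutate(days_before = as.numeric(cohort_start_date - outcome_date)) |>
    # Keep only outcomes strictly before index and within the window
    filter(days_before > 0, days_before <= washout) |>
    distinct(subject_id, cohort_start_date)
  anti_join(target, excluded, by = c("subject_id", "cohort_start_date"))
}

target <- tibble(
  subject_id = 1:3,
  cohort_start_date = as.Date("2020-01-01")
)
outcome <- tibble(
  subject_id = 1:3,
  # 92 days before index, about 5 years before index, and after index
  outcome_date = as.Date(c("2019-10-01", "2015-01-01", "2020-05-01"))
)

kept_0 <- apply_washout(target, outcome, washout = 0)     # nobody excluded
kept_365 <- apply_washout(target, outcome, washout = 365) # subject 1 excluded
kept_inf <- apply_washout(target, outcome, washout = Inf) # subjects 1 and 2 excluded
```

Only subject 3, whose outcome occurs after index, is retained under every setting and can contribute an event to the analysis.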
Competing outcome design
A competing outcome should be an event that prevents the event of interest from occurring or changes its interpretation. Death is often a competing outcome for non-fatal clinical events. Competing outcome cohorts use the same cohort-table structure as outcome cohorts.
Competing-risk analyses allow separate washout choices for the event of interest and competing outcome:
estimateCompetingRiskSurvival(
cdm = cdm,
targetCohortTable = "mgus_diagnosis",
outcomeCohortTable = "progression",
competingOutcomeCohortTable = "death_cohort",
outcomeWashout = 365,
competingOutcomeWashout = 0,
followUpDays = 365
) |>
tableSurvival()
| Data source | Target cohort | Outcome type | Outcome name | Number records | Number events | Restricted mean survival (95% CI) |
|---|---|---|---|---|---|---|
| mock | mgus_diagnosis | outcome | progression | 1,384 | 105 | 26.00 (21.00, 31.00) |
| mock | mgus_diagnosis | competing_outcome | death_cohort | 1,384 | 868 | 213.00 (205.00, 221.00) |
Use separate washouts when prior history has a different meaning for the two event processes. For example, you may want to exclude people with prior disease progression but still allow people with prior non-fatal competing events, depending on the study question.
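
To make the competing-risk event coding concrete, the sketch below builds it directly with the survival package on survival::mgus2, the dataset the mock CDM appears to be derived from. It illustrates the concept rather than CohortSurvival's implementation: progression is the event of interest, death without prior progression is the competing event, and everyone else is censored. The object names `etime`, `event`, and `cif` are chosen for this example.

```r
library(survival)

# Time to the first of progression or death/censoring (mgus2 records months)
etime <- with(mgus2, ifelse(pstat == 1, ptime, futime))

# 0 = censored, 1 = progression, 2 = death without prior progression
event <- with(mgus2, ifelse(pstat == 1, 1, 2 * death))
event <- factor(event, levels = 0:2, labels = c("censored", "progression", "death"))

# With a factor status (first level = censoring), survfit() returns
# Aalen-Johansen estimates of the probability of being in each state
cif <- survfit(Surv(etime, event) ~ 1, data = mgus2)

# State probabilities at a few time points
summary(cif, times = c(12, 60, 120))
```

A person who dies after progressing counts as a progression event here, not as a death, which is what makes death a competing rather than a second outcome.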
Adding strata
Stratification variables must be present as columns in the target cohort table. In real studies these columns are often added with packages such as PatientProfiles before calling CohortSurvival.
cdm$target <- cdm$target |>
PatientProfiles::addDemographics(
ageGroup = list(c(0, 64), c(65, 74), c(75, Inf)),
sex = TRUE,
name = "target"
)
The mock MGUS target cohort already contains several columns that can be used for strata.
cdm$mgus_diagnosis |>
dplyr::select(subject_id, cohort_start_date, age, age_group, sex) |>
dplyr::glimpse()
#> Rows: ??
#> Columns: 5
#> $ subject_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
#> $ cohort_start_date <date> 1981-01-01, 1968-01-01, 1980-01-01, 1977-01-01, 197…
#> $ age <dbl> 88, 78, 94, 68, 90, 90, 89, 87, 86, 79, 86, 89, 87, …
#> $ age_group <chr> ">=70", ">=70", ">=70", "<70", ">=70", ">=70", ">=70…
#> $ sex <fct> F, F, M, M, F, M, F, F, F, F, M, F, M, F, M, F, F, M…
Strata are passed as a list. Each element is one stratification requested by the user. The following estimates overall survival, survival by sex, and survival by the combination of age group and sex.
estimateSingleEventSurvival(
cdm = cdm,
targetCohortTable = "mgus_diagnosis",
outcomeCohortTable = "death_cohort",
strata = list("sex", c("age_group", "sex")),
restrictedMeanFollowUp = 365
) |>
tableSurvival()
| Data source | Target cohort | Sex | Age group | Outcome name | Number records | Number events | Median survival (95% CI) | Restricted mean survival (95% CI) |
|---|---|---|---|---|---|---|---|---|
| mock | mgus_diagnosis | overall | overall | death_cohort | 1,384 | 963 | 98.00 (92.00, 103.00) | 129.00 (122.00, 135.00) |
| mock | mgus_diagnosis | F | overall | death_cohort | 631 | 423 | 108.00 (100.00, 121.00) | 139.00 (129.00, 150.00) |
| mock | mgus_diagnosis | M | overall | death_cohort | 753 | 540 | 88.00 (79.00, 97.00) | 120.00 (111.00, 129.00) |
| mock | mgus_diagnosis | F | <70 | death_cohort | 240 | 109 | 215.00 (179.00, 260.00) | 208.00 (189.00, 228.00) |
| mock | mgus_diagnosis | M | <70 | death_cohort | 334 | 184 | 158.00 (139.00, 189.00) | 173.00 (157.00, 189.00) |
| mock | mgus_diagnosis | F | >=70 | death_cohort | 391 | 314 | 82.00 (75.00, 94.00) | – |
| mock | mgus_diagnosis | M | >=70 | death_cohort | 419 | 356 | 61.00 (54.00, 70.00) | – |
Setting restrictedMeanFollowUp is especially important
when comparing strata. If it is left as NULL, the
restricted mean horizon is left to the underlying survival summary,
which can use a common maximum follow-up time across fitted curves. A
group with shorter observed follow-up may then have its last estimate
carried forward beyond its own maximum follow-up, so the restricted mean
can be larger than the observed follow-up time for that group. A common
horizon, such as 365 days, makes the comparison refer to the same
follow-up window for every group where that follow-up is available.
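
The effect of the horizon can be seen with the survival package directly. This sketch assumes that the restricted mean behaves like summary() on a survfit object, which the paragraph above suggests but this example does not verify for CohortSurvival itself. `get_rmean()` is a hypothetical helper, and survival::mgus2 stands in for the mock cohort.

```r
library(survival)

# Overall survival by sex (mgus2 records follow-up in months)
fit <- survfit(Surv(futime, death) ~ sex, data = mgus2)

# Hypothetical helper: pull the restricted mean column from the summary table
get_rmean <- function(fit, rmean) {
  tab <- summary(fit, rmean = rmean)$table
  tab[, grep("rmean", colnames(tab))[1]]
}

# One shared horizon: the longest follow-up time over all curves
rmean_common <- get_rmean(fit, "common")
# Each curve integrated up to its own last follow-up time
rmean_individual <- get_rmean(fit, "individual")
# An explicit horizon of 365 time units for every curve
rmean_365 <- get_rmean(fit, 365)

rbind(rmean_common, rmean_individual, rmean_365)
```

Only the last row is guaranteed to describe the same follow-up window in both groups, which is why an explicit horizon is the safer choice for comparisons.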
Multiple cohorts in one table
CohortSurvival can estimate all combinations of selected target and
outcome cohort IDs in one call, provided those cohorts are in the
supplied cohort tables. When target or outcome cohorts live in separate
tables, you can either run the analysis separately and bind the
summarised_result objects, or create a combined cohort
table before estimation.
cdm <- omopgenerics::bind(
cdm$progression,
cdm$death_cohort,
name = "outcome_cohorts"
)
estimateSingleEventSurvival(
cdm = cdm,
targetCohortTable = "mgus_diagnosis",
outcomeCohortTable = "outcome_cohorts"
)
When combining cohorts, check that the cohort set metadata identifies
each cohort_definition_id clearly. This is what
CohortSurvival uses to label outcomes in plots and tables.
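
One way to inspect that metadata is with the omopgenerics accessors. The sketch below recreates the mock CDM so that it runs on its own, and it assumes the combined table is called outcome_cohorts as in the code above.

```r
# Recreated here so the example is self-contained
cdm <- CohortSurvival::mockMGUS2cdm()
cdm <- omopgenerics::bind(
  cdm$progression,
  cdm$death_cohort,
  name = "outcome_cohorts"
)

# One row per cohort: cohort_definition_id and cohort_name should both be unique
cohort_settings <- omopgenerics::settings(cdm$outcome_cohorts)
cohort_settings

# Records and subjects per cohort_definition_id
omopgenerics::cohortCount(cdm$outcome_cohorts)
```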
Pre-analysis checklist
Before running a survival analysis, check:
- The target, outcome, and competing outcome tables are cohort tables in the same CDM reference.
- The target cohort index date is the intended time zero.
- The outcome date column is the intended event date.
- Prior observation and washout choices match the estimand.
- Censoring choices are explicit: observation period end, cohort exit, calendar date, and maximum follow-up.
- Strata or weight variables are present in the target cohort table and have sensible missingness.
- restrictedMeanFollowUp is set to a common horizon when restricted mean survival will be compared across groups.
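
Some of these checks can be automated. The function below is a hypothetical helper, not part of CohortSurvival, and it works on a local data frame; a database-backed cohort table would first need to be brought into memory, for example with dplyr::collect().

```r
# Hypothetical helper: returns a character vector of problems (empty if none)
check_cohort <- function(cohort, strata = character()) {
  required <- c(
    "cohort_definition_id", "subject_id",
    "cohort_start_date", "cohort_end_date"
  )
  problems <- character()

  # Standard cohort columns and requested strata must all be present
  missing_cols <- setdiff(c(required, strata), names(cohort))
  if (length(missing_cols) > 0) {
    problems <- c(
      problems,
      paste("missing columns:", paste(missing_cols, collapse = ", "))
    )
  }

  # Report missing values in the columns that are present
  for (col in intersect(c(required, strata), names(cohort))) {
    n_na <- sum(is.na(cohort[[col]]))
    if (n_na > 0) {
      problems <- c(problems, sprintf("%s has %d missing value(s)", col, n_na))
    }
  }

  # Cohort exit should never precede cohort entry
  if (all(c("cohort_start_date", "cohort_end_date") %in% names(cohort))) {
    n_bad <- sum(cohort$cohort_end_date < cohort$cohort_start_date, na.rm = TRUE)
    if (n_bad > 0) {
      problems <- c(problems, sprintf("%d record(s) end before they start", n_bad))
    }
  }
  problems
}

# Toy target cohort with one missing stratum value and no age_group column
toy <- data.frame(
  cohort_definition_id = 1L,
  subject_id = 1:3,
  cohort_start_date = as.Date("2020-01-01"),
  cohort_end_date = as.Date("2020-06-30"),
  sex = c("F", NA, "M")
)
problems <- check_cohort(toy, strata = c("sex", "age_group"))
problems
```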
Disconnect from the cdm database connection
We finish by disconnecting from the cdm.
cdmDisconnect(cdm)