Get cohort intersections
Martí Català, Mike Du, Yuchen Guo, Kim López-Güell, Xintong Li, Núria Mercadé-Besora, and Edward Burn
2024-03-17
Source: vignettes/addCohortIntersections.Rmd
Introduction
In this vignette we present how functions from this package can be used to get intersections between cohorts. This can be useful, for instance, if we want to identify patients with previous conditions.
The PatientProfiles package is designed to work with data in the OMOP CDM format, so our first step is to create a reference to the data using the DBI and CDMConnector packages. The connection to a Postgres database would look like:
library(DBI)
library(CDMConnector)
# The input arguments provided are for illustrative purposes only and do not provide access to any database.
con <- DBI::dbConnect(RPostgres::Postgres(),
  dbname = "omop_cdm",
  host = "10.80.192.00",
  user = "user_name",
  password = "user_password"
)
cdm <- CDMConnector::cdm_from_con(con,
  cdm_schema = "main",
  write_schema = "main",
  cohort_tables = "cohort_example"
)
In this vignette we will work with simulated data generated by the mockPatientProfiles() function provided in this package, which mimics a database formatted in OMOP.
library(PatientProfiles)
library(duckdb)
library(dplyr)
cdm <- mockPatientProfiles(
  patient_size = 1000,
  drug_exposure_size = 1000
)
In this mock dataset there are the following cohort tables, cohort1 and cohort2 (shown in that order):
## Rows: ??
## Columns: 4
## Database: DuckDB v0.10.0 [unknown@Linux 6.5.0-1016-azure:R 4.3.3/:memory:]
## $ cohort_definition_id <dbl> 1, 1, 1, 2
## $ subject_id <dbl> 1, 1, 2, 3
## $ cohort_start_date <date> 2020-01-01, 2020-06-01, 2020-01-02, 2020-01-01
## $ cohort_end_date <date> 2020-04-01, 2020-08-01, 2020-02-02, 2020-03-01
## Rows: ??
## Columns: 4
## Database: DuckDB v0.10.0 [unknown@Linux 6.5.0-1016-azure:R 4.3.3/:memory:]
## $ cohort_definition_id <dbl> 1, 1, 2, 3, 1
## $ subject_id <dbl> 1, 3, 1, 2, 1
## $ cohort_start_date <date> 2019-12-30, 2020-01-01, 2020-05-25, 2020-01-01, 2…
## $ cohort_end_date <date> 2019-12-30, 2020-01-01, 2020-05-25, 2020-01-01, 2…
Example: addCohortIntersectFlag and addCohortIntersectCount functions
addCohortIntersectFlag(): adds a binary column indicating whether there is an intersection with a cohort in a given time window.
Suppose cohort2 with cohort_definition_id = 1 contains stroke occurrences. If we want to exclude patients from cohort1 who had a stroke event in the 180 days before entering the cohort, we can use addCohortIntersectFlag() like this:
cohort1WashOut <- cdm$cohort1 %>%
  addCohortIntersectFlag(
    targetCohortTable = "cohort2",
    window = list(c(-180, -1)),
    targetCohortId = 1
  ) %>%
  # keep only the records with no stroke in the 180 days before cohort entry
  filter(cohort_1_m180_to_m1 == 0)
cohort1WashOut %>%
  glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v0.10.0 [unknown@Linux 6.5.0-1016-azure:R 4.3.3/:memory:]
## $ cohort_definition_id <dbl> 2, 1
## $ subject_id <dbl> 3, 2
## $ cohort_start_date <date> 2020-01-01, 2020-01-02
## $ cohort_end_date <date> 2020-03-01, 2020-02-02
## $ cohort_1_m180_to_m1 <dbl> 0, 0
addCohortIntersectCount(): adds a column with the number of intersections in a certain time window.
We can use the function to count the number of occurrences of an event of interest in different time windows before entering the study population. For example, we can look at the number of strokes in the 180 days before cohort entry (days -180 to -1), between 365 and 181 days before entry, and at any earlier time (366 days or more before entry):
cohort1StrokeCounts <- cdm$cohort1 %>%
  addCohortIntersectCount(
    targetCohortTable = "cohort2",
    window = list(c(-Inf, -366), c(-365, -181), c(-180, -1)),
    targetCohortId = 1
  )
cohort1StrokeCounts %>%
  glimpse()
## Rows: ??
## Columns: 7
## Database: DuckDB v0.10.0 [unknown@Linux 6.5.0-1016-azure:R 4.3.3/:memory:]
## $ cohort_definition_id <dbl> 1, 1, 2, 1
## $ subject_id <dbl> 1, 1, 3, 2
## $ cohort_start_date <date> 2020-01-01, 2020-06-01, 2020-01-01, 2020-01-02
## $ cohort_end_date <date> 2020-04-01, 2020-08-01, 2020-03-01, 2020-02-02
## $ cohort_1_minf_to_m366 <dbl> 0, 0, 0, 0
## $ cohort_1_m365_to_m181 <dbl> 0, 0, 0, 0
## $ cohort_1_m180_to_m1 <dbl> 1, 2, 0, 0
Let us comment on the targetEndDate functionality of the addCohortIntersectCount() and addCohortIntersectFlag() functions. In both of them, three reference dates can be specified:
* indexDate: date from the primary cohort table, which contains the individuals for whom we want to find the intersecting events
* targetStartDate: date from the events table that marks the start of each event used for the intersection
* targetEndDate: date from the events table that marks the end of each event used for the intersection
By default, indexDate = "cohort_start_date", targetStartDate = "cohort_start_date" and targetEndDate = "cohort_end_date". This means that, if we intersect two cohorts and specify window = c(-30, -1), we will get any events from the intersecting cohort that overlap the 30 days before the cohort start date of the main cohort. Namely:
# This will be our "main" cohort
cohort1 <- dplyr::tibble(
  cohort_definition_id = 1,
  subject_id = c("1", "2"),
  cohort_start_date = c(
    as.Date("2010-03-01"),
    as.Date("2012-03-01")
  ),
  cohort_end_date = c(
    as.Date("2015-01-01"),
    as.Date("2016-03-01")
  )
)
# This is the cohort with the events we are interested in
cohort2 <- dplyr::tibble(
  cohort_definition_id = 1,
  subject_id = c("1", "1", "1", "2"),
  cohort_start_date = c(
    as.Date("2010-03-03"),
    as.Date("2010-02-27"),
    as.Date("2010-03-25"),
    as.Date("2013-01-03")
  ),
  cohort_end_date = c(
    as.Date("2010-03-03"),
    as.Date("2010-02-27"),
    as.Date("2012-03-25"),
    as.Date("2013-01-03")
  )
)
observation_period <- dplyr::tibble(
  observation_period_id = 1:2,
  person_id = c(1, 2),
  observation_period_start_date = as.Date(c("1990-01-01", "1995-08-16")),
  observation_period_end_date = as.Date(c("2025-01-01", "2030-08-16")),
  period_type_concept_id = 0
)
cdm <- mockPatientProfiles(
  observation_period = observation_period,
  cohort1 = cohort1,
  cohort2 = cohort2
)
cdm$cohort1 <- cdm$cohort1 %>%
  addCohortIntersectCount(
    targetCohortTable = "cohort2",
    window = list(c(-30, -1))
  )
cdm$cohort1
## # Source: table<og_027_1710713493> [2 x 5]
## # Database: DuckDB v0.10.0 [unknown@Linux 6.5.0-1016-azure:R 4.3.3/:memory:]
## cohort_definition_id subject_id cohort_start_date cohort_end_date
## <dbl> <chr> <date> <date>
## 1 1 1 2010-03-01 2015-01-01
## 2 1 2 2012-03-01 2016-03-01
## # ℹ 1 more variable: cohort_1_m30_to_m1 <dbl>
We get one event for subject_id = 1, the one starting and ending on 2010-02-27, which is within the window before the index date 2010-03-01. The individual with subject_id = 2 does not have any of the intersecting events of interest.
Note that, with the default specification, an event does not have to be incident in the window of interest to be picked; it only has to overlap it. Indeed, an event that starts before the window and ends within or after it is regarded as intersecting.
However, we could be interested only in events starting in the window of interest. To screen only for those, we can set targetEndDate = "cohort_start_date".
cdm$cohort1 <- cdm$cohort1 %>%
  addCohortIntersectCount(
    targetCohortTable = "cohort2",
    window = list(c(-30, -1)),
    targetEndDate = "cohort_start_date"
  )
cdm$cohort1
## # Source: table<og_034_1710713494> [2 x 5]
## # Database: DuckDB v0.10.0 [unknown@Linux 6.5.0-1016-azure:R 4.3.3/:memory:]
## cohort_definition_id subject_id cohort_start_date cohort_end_date
## <dbl> <chr> <date> <date>
## 1 1 1 2010-03-01 2015-01-01
## 2 1 2 2012-03-01 2016-03-01
## # ℹ 1 more variable: cohort_1_m30_to_m1 <dbl>
With this setting, an event that starts more than 30 days before the index date of the main cohort, 2010-03-01 (for instance, one starting on 2010-01-25), would not be picked, even if it were still ongoing during the window.
The targetEndDate argument therefore lets us choose whether to perform the intersection in an “overlapping” or an “incident” way.
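The difference between the two modes can be sketched in base R. The helper below, count_in_window(), is hypothetical and not part of PatientProfiles. It assumes that an event intersects the window when it starts on or before the window end and ends on or after the window start; this matches the behaviour described above but is not taken from the package source. The event running from 2010-01-25 to 2010-03-25 is made up for illustration.

```r
# Hypothetical helper (not a PatientProfiles function): count target events
# that intersect a window defined relative to an index date.
count_in_window <- function(index_date, target_start, target_end, window) {
  window_start <- index_date + window[1]
  window_end <- index_date + window[2]
  # Assumed rule: the event interval must overlap the window interval.
  sum(target_start <= window_end & target_end >= window_start)
}

index_date <- as.Date("2010-03-01")

# Illustrative events: a one-day event inside the window, and a long event
# that starts before the window and ends after the index date.
event_start <- as.Date(c("2010-02-27", "2010-01-25"))
event_end <- as.Date(c("2010-02-27", "2010-03-25"))

# "Overlapping": use the event start and end dates.
overlapping <- count_in_window(index_date, event_start, event_end, c(-30, -1))

# "Incident": use the event start date as both start and end.
incident <- count_in_window(index_date, event_start, event_start, c(-30, -1))

print(c(overlapping = overlapping, incident = incident))
```

Under these assumptions the overlapping count is 2 and the incident count is 1, because the long event starts before the window opens on 2010-01-30.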
As for the functions addCohortIntersectDays() and addCohortIntersectDate(), they need a specific date in the target cohort to calculate time outputs. Therefore, only targetDate needs to be specified, which is set to "cohort_start_date" by default.
Example: addCohortIntersectDays function
addCohortIntersectDays(): adds a new column with the number of days between the index date and the subject's intersection with another cohort within a specific time window. If there are multiple intersections, only one is used, either the first or the last one in the time window (order argument).
The function can be used to calculate the time to an event of interest, such as the time until the first stroke after the index date. If the patient did not experience the event, the function returns NA.
cohort1TimeTo <- cdm$cohort1 %>%
  addCohortIntersectDays(
    targetCohortTable = "cohort2",
    targetCohortId = 1,
    order = "first"
  )
cohort1TimeTo %>%
  glimpse()
## Rows: ??
## Columns: 6
## Database: DuckDB v0.10.0 [unknown@Linux 6.5.0-1016-azure:R 4.3.3/:memory:]
## $ cohort_definition_id <dbl> 1, 1
## $ subject_id <chr> "1", "2"
## $ cohort_start_date <date> 2010-03-01, 2012-03-01
## $ cohort_end_date <date> 2015-01-01, 2016-03-01
## $ cohort_1_m30_to_m1 <dbl> 1, 0
## $ cohort_1_0_to_inf <dbl> 2, 308
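To make the arithmetic behind these values explicit, the calculation for a single subject can be sketched in base R. days_to_event() is a hypothetical helper, not a PatientProfiles function. It assumes that the default window runs from day 0 onwards, as the column name cohort_1_0_to_inf suggests, and that order takes the values "first" or "last".

```r
# Hypothetical helper (not a PatientProfiles function): days from the index
# date to the first or last target date that falls inside the window.
days_to_event <- function(index_date, target_dates,
                          window = c(0, Inf), order = "first") {
  # Subtracting Dates gives a difference in days
  days <- as.numeric(target_dates - index_date)
  in_window <- days[days >= window[1] & days <= window[2]]
  if (length(in_window) == 0) {
    return(NA_real_) # no event in the window
  }
  if (order == "first") min(in_window) else max(in_window)
}

# Subject 1: three events, of which two start on or after the index date
subject1 <- days_to_event(
  as.Date("2010-03-01"),
  as.Date(c("2010-03-03", "2010-02-27", "2010-03-25"))
)

# Subject 2: a single event after the index date
subject2 <- days_to_event(as.Date("2012-03-01"), as.Date("2013-01-03"))

# A made-up subject whose only event is before the index date
no_event <- days_to_event(as.Date("2012-03-01"), as.Date("2011-01-03"))

print(c(subject1 = subject1, subject2 = subject2, no_event = no_event))
```

The first two results, 2 and 308, agree with the cohort_1_0_to_inf column above; the third is NA because no event falls in the window.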
Example: addCohortIntersectDate function
addCohortIntersectDate(): appends a column containing the date (by default the start date) of the target cohort entry that falls in a certain window.
This function is handy for obtaining the date of the next occurrence of a specific event. For instance, suppose cohort1 comprises patients who enrolled when they received their first vaccine dose. We could use this function to obtain the date of their second dose if we have a cohort with vaccine records (e.g. cohort2):
cohort1NextEvent <- cdm$cohort1 %>%
  addCohortIntersectDate(
    targetCohortTable = "cohort2",
    order = "first",
    targetCohortId = 1,
    window = c(1, Inf)
  )
cohort1NextEvent %>%
  glimpse()
## Rows: ??
## Columns: 6
## Database: DuckDB v0.10.0 [unknown@Linux 6.5.0-1016-azure:R 4.3.3/:memory:]
## $ cohort_definition_id <dbl> 1, 1
## $ subject_id <chr> "1", "2"
## $ cohort_start_date <date> 2010-03-01, 2012-03-01
## $ cohort_end_date <date> 2015-01-01, 2016-03-01
## $ cohort_1_m30_to_m1 <dbl> 1, 0
## $ cohort_1_1_to_inf <date> 2010-03-03, 2013-01-03
Please note that the new columns added to the table (for all the functions presented) are named with the format cohort_“cohort_definition_id”_“time window”, for example cohort_1_m180_to_m1. If a window bound is negative, an “m” is added in front of the number; positive numbers carry no sign.
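The naming rule can be reproduced with a short base R sketch. Both helpers below are hypothetical and are not PatientProfiles functions; they only rebuild the names seen in the outputs above.

```r
# Hypothetical helpers (not PatientProfiles functions) that rebuild a
# column name from a cohort name and a window.
window_label <- function(x) {
  # Infinite bounds are written as "inf"; finite ones as their absolute value
  label <- if (is.infinite(x)) "inf" else as.character(abs(x))
  # Negative bounds are prefixed with "m"; positive ones carry no sign
  if (x < 0) paste0("m", label) else label
}

intersect_column_name <- function(cohort_name, window) {
  paste0(
    cohort_name, "_",
    window_label(window[1]), "_to_", window_label(window[2])
  )
}

names_built <- c(
  intersect_column_name("cohort_1", c(-180, -1)),
  intersect_column_name("cohort_1", c(-Inf, -366)),
  intersect_column_name("cohort_1", c(0, Inf))
)
print(names_built)
```

This yields cohort_1_m180_to_m1, cohort_1_minf_to_m366 and cohort_1_0_to_inf, the same names that appear in the glimpse() outputs of the earlier examples.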
Example: addCohortIntersect function
addCohortIntersect(): computes the intersection with a target cohort. It can return the number of occurrences, a flag of presence, a date and/or the difference in days.
We can use this function to compute all the intersection information with a target cohort in one call. By default it returns the output of addCohortIntersectCount(), addCohortIntersectFlag(), addCohortIntersectDate() and addCohortIntersectDays() for a given time window. For details on what each of these functions does, see the examples above.
cohort1CohortIntersect <- cdm$cohort1 %>%
  addCohortIntersect(
    targetCohortTable = "cohort2",
    order = "first",
    targetCohortId = 1,
    window = c(1, Inf)
  )
cohort1CohortIntersect %>%
  glimpse()
## Rows: ??
## Columns: 9
## Database: DuckDB v0.10.0 [unknown@Linux 6.5.0-1016-azure:R 4.3.3/:memory:]
## $ cohort_definition_id <dbl> 1, 1
## $ subject_id <chr> "1", "2"
## $ cohort_start_date <date> 2010-03-01, 2012-03-01
## $ cohort_end_date <date> 2015-01-01, 2016-03-01
## $ cohort_1_m30_to_m1 <dbl> 1, 0
## $ count_cohort_1_1_to_inf <dbl> 2, 1
## $ flag_cohort_1_1_to_inf <dbl> 1, 1
## $ date_cohort_1_1_to_inf <date> 2010-03-03, 2013-01-03
## $ days_cohort_1_1_to_inf <dbl> 2, 308
We can also control which columns are appended by using the flag, count, date and days arguments of the function, if we do not want everything. For example, if we only want the cohort count and flag, we can do the following:
cohort1CohortIntersect <- cdm$cohort1 %>%
  addCohortIntersect(
    targetCohortTable = "cohort2",
    order = "first",
    targetCohortId = 1,
    window = c(1, Inf),
    flag = TRUE,
    count = TRUE,
    date = FALSE,
    days = FALSE
  )
cohort1CohortIntersect %>%
  glimpse()
## Rows: ??
## Columns: 7
## Database: DuckDB v0.10.0 [unknown@Linux 6.5.0-1016-azure:R 4.3.3/:memory:]
## $ cohort_definition_id <dbl> 1, 1
## $ subject_id <chr> "1", "2"
## $ cohort_start_date <date> 2010-03-01, 2012-03-01
## $ cohort_end_date <date> 2015-01-01, 2016-03-01
## $ cohort_1_m30_to_m1 <dbl> 1, 0
## $ count_cohort_1_1_to_inf <dbl> 2, 1
## $ flag_cohort_1_1_to_inf <dbl> 1, 1