PregnancyIdentifier pipeline overview

This vignette describes the PregnancyIdentifier pipeline, which implements the HIPPS approach to identify pregnancy episodes in OMOP CDM data. The pipeline has four main algorithm components, run in sequence:

HIP — Outcome-anchored episode builder: identifies pregnancy episodes from high-specificity pregnancy outcomes and gestational-age evidence. See the HIP vignette.
PPS — Gestational-timing concept episode builder: constructs episodes from pregnancy-related concepts and expected timing windows. See the PPS vignette.
Merge — Combines HIP and PPS episodes into a unified per-person timeline (HIPPS) by temporal overlap and best-match deduplication. See the Merge vignette.
ESD — Episode start date refinement: uses gestational timing evidence (GW and GR3m) to infer pregnancy start dates and precision. See the ESD vignette.

Running the full pipeline

The primary entry point is runPregnancyIdentifier(), which runs all steps end-to-end. It writes person-level and episode-level data (RDS, logs) to outputFolder and shareable aggregated CSV files to exportFolder (default file.path(outputFolder, "export")). These CSVs can be used as input to the Shiny app.

Main parameters:

- outputFolder — Directory for pipeline outputs (RDS artifacts, logs, runStart.csv). Created if missing.
- exportFolder — Directory for shareable CSVs. Optional; defaults to file.path(outputFolder, "export").
- startDate, endDate — Study window (defaults: 1900-01-01 to today).
- minCellCount — Small-cell suppression threshold for exported summaries (default 5).
- conformToValidation — FALSE (default), TRUE, or "both"; see step 5 and the Export vignette.
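To illustrate what a small-cell threshold such as minCellCount typically does, here is a minimal base R sketch. The helper suppress_small_counts() is hypothetical and is not part of PregnancyIdentifier; it assumes that counts strictly between 0 and the threshold are masked as NA, which may differ from how the package's export step encodes suppressed cells.

```r
# Hypothetical helper for illustration only; not a PregnancyIdentifier function.
suppress_small_counts <- function(n, minCellCount = 5L) {
  # Assumption: counts strictly between 0 and minCellCount are hidden as NA;
  # zeros and counts at or above the threshold are left unchanged.
  n[n > 0 & n < minCellCount] <- NA_integer_
  n
}

counts <- c(0L, 3L, 5L, 12L)
suppress_small_counts(counts)
#> [1]  0 NA  5 12
```

See the Export vignette for how suppressed values actually appear in the exported CSVs.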

library(PregnancyIdentifier)
library(CDMConnector)

# A cdm_reference from CDMConnector (e.g. mock or real CDM)
cdm <- mockPregnancyCdm()

runPregnancyIdentifier(
  cdm                   = cdm,
  outputFolder          = "pregnancy_identifier_output",
  # exportFolder         = NULL,   # default: outputFolder/export
  startDate             = as.Date("2000-01-01"),
  endDate               = Sys.Date(),
  justGestation         = TRUE,
  minCellCount          = 5L,
  conformToValidation   = FALSE
)

Outputs written to disk

The pipeline writes intermediate RDS artifacts to outputFolder and always runs export: shareable CSVs are written to exportFolder (default outputFolder/export).

File	Description
`runStart.csv`	Run timestamp used in exported CSV files
`log.txt`	Run log
`hip_episodes.rds`	HIP episodes (outcome and/or gestation-derived)
`pps_episodes.rds`	PPS episodes with inferred outcomes (input to merge)
`pps_concept_counts.csv`	PPS concept record counts per concept (used by export)
`pps_gest_timing_episodes.rds`	(Only when `debugMode = TRUE`) Record-level PPS concept rows with episode assignment.
`pps_min_max_episodes.rds`	(Only when `debugMode = TRUE`) Episode-level PPS min/max date summaries.
`hipps_episodes.rds`	Merged HIP + PPS episode table (standardized columns)
`final_pregnancy_episodes.rds`	Final episode table: one row per episode, with inferred start/end and outcomes
`esd.rds`	(Only when `debugMode = TRUE`) ESD episode-level start inference before merging back to HIPPS.
`export/`	De-identified summary CSVs (default: `outputFolder/export`). See the Export vignette for file names, columns, and analysis use.
`export/{date}_{cdm_name}_incidence.csv`	Incidence estimates (overall and yearly), written by `computeIncidencePrevalence()`.
`export/{date}_{cdm_name}_prevalence.csv`	Period prevalence estimates (overall and yearly), written by `computeIncidencePrevalence()`.
`export/{date}_{cdm_name}_characteristics.csv`	Cohort characteristics stratified by age group and outcome, written by `computeIncidencePrevalence()`.
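Because the export file names carry a {date} and {cdm_name} prefix, it can be convenient to locate them by their fixed suffix. The sketch below uses base R only. The helper find_export(), the demo folder, and the placeholder values "2024-01-01" and "mock" are assumptions for illustration; the actual date format used in file names is not specified here.

```r
# Demo folder with empty placeholder files named like the table above.
outputFolder <- file.path(tempdir(), "pregnancy_output_demo")
exportFolder <- file.path(outputFolder, "export")   # the documented default
dir.create(exportFolder, recursive = TRUE, showWarnings = FALSE)

# Assumption: "2024-01-01" and "mock" stand in for {date} and {cdm_name}.
placeholders <- paste0(
  "2024-01-01_mock_",
  c("incidence", "prevalence", "characteristics"),
  ".csv"
)
invisible(file.create(file.path(exportFolder, placeholders)))

# Hypothetical helper: match on the fixed suffix so the lookup does not
# depend on how the date or CDM name is formatted.
find_export <- function(folder, kind) {
  list.files(folder, pattern = paste0("_", kind, "\\.csv$"), full.names = TRUE)
}

basename(find_export(exportFolder, "incidence"))
#> [1] "2024-01-01_mock_incidence.csv"
```

With a real run, the same helper could be pointed at the exportFolder passed to runPregnancyIdentifier() and the result handed to read.csv().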

Pipeline steps in order

1. initPregnancies — Loads pregnancy concept sets into the CDM, builds preg_hip_records and preg_pps_records from condition, procedure, observation, and measurement tables within the study window.
2. runHip — Builds outcome episodes, then gestation episodes, merges and cleans them, and writes hip_episodes.rds.
3. runPps — Builds PPS episodes from pregnancy-related concepts, assigns outcomes, and writes pps_episodes.rds.
4. mergeHipps — Reads hip_episodes.rds and pps_episodes.rds, merges by temporal overlap, deduplicates many-to-many matches, and writes hipps_episodes.rds.
5. runEsd — Reads hipps_episodes.rds, pulls gestational timing concepts (GW/GR3m), infers start dates and precision, harmonizes end and outcome from HIP vs PPS, and writes final_pregnancy_episodes.rds. When conformToValidation = TRUE, episode output is modified to remove overlapping episodes and episodes longer than 308 days.
6. computeIncidencePrevalence — Constructs outcome-specific cohort tables (HIP-only, PPS-only, and combined HIPPS windows), generates denominator cohorts stratified by age group, and estimates incidence, period prevalence, and cohort characteristics. Writes {date}_{cdm_name}_incidence.csv, {date}_{cdm_name}_prevalence.csv, and {date}_{cdm_name}_characteristics.csv to exportFolder.
7. exportPregnancies — Reads final_pregnancy_episodes.rds and writes shareable CSVs to exportFolder (default file.path(outputFolder, "export")). See the Export vignette.
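The effect of conformToValidation = TRUE described for runEsd can be approximated on a toy table. This is a base R sketch, not the package's code: the toy data are invented, and the choice to drop the later of two overlapping episodes is an assumption, since the rule for which overlapping episode is removed is not stated here.

```r
# Toy episode table; column names follow the final episode table.
episodes <- data.frame(
  person_id = c(1L, 1L, 2L),
  final_episode_start_date = as.Date(c("2020-01-01", "2020-06-01", "2019-01-01")),
  final_episode_end_date   = as.Date(c("2020-09-15", "2021-02-20", "2020-01-15"))
)

# Episode length in days.
episodes$duration_days <- as.integer(
  episodes$final_episode_end_date - episodes$final_episode_start_date
)

# Flag episodes that start on or before the same person's previous episode end.
episodes <- episodes[order(episodes$person_id, episodes$final_episode_start_date), ]
prev_end <- ave(
  as.numeric(episodes$final_episode_end_date),
  episodes$person_id,
  FUN = function(x) c(NA, head(x, -1))
)
episodes$overlaps_previous <- !is.na(prev_end) &
  as.numeric(episodes$final_episode_start_date) <= prev_end

# Assumption: keep only non-overlapping episodes of at most 308 days.
kept <- episodes[episodes$duration_days <= 308 & !episodes$overlaps_previous, ]
kept[, c("person_id", "duration_days")]
#>   person_id duration_days
#> 1         1           258
```

Here the second episode of person 1 is dropped as overlapping and the episode of person 2 is dropped because it spans 379 days.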

What each step contributes to final start, end, and outcome

The pipeline’s most important analysis variables are final_episode_start_date, final_episode_end_date, and final_outcome_category. Here is what each step contributes to them.

Variable	Step	Contribution
Final start	initPregnancies	Supplies raw pregnancy-related concepts used later by HIP, PPS, and ESD.
	HIP	Produces estimated start (e.g. from outcome + term range); used in merge interval and as context for ESD.
	PPS	Produces episode bounds (min/max dates) from gestational timing; used in merge interval.
	Merge	Produces merge_episode_start / merge_episode_end, which define the evidence window ESD uses to pull GW/GR3m concepts. Does not set the final start.
	ESD	Sets final start. Infers final_episode_start_date from GW/GR3m concepts within the merged window; if no start is inferred, uses final end − max_term (Matcho term for the chosen outcome). See ESD vignette.
Final end	HIP	Produces hip_end_date (outcome or gestation end).
	PPS	Produces pps_end_date (outcome or episode max date).
	Merge	Puts HIP and PPS on one row per episode; produces merge_episode_end (used only as evidence window by ESD). Does not set the final end.
	ESD	Sets final end. Chooses final_episode_end_date by picking between hip_end_date and pps_end_date using harmonization rules (e.g. outcome match, or “take the one that occurs second”). Does not use merge_episode_end for the reported end. See ESD vignette.
Final outcome	HIP	Produces hip_outcome_category.
	PPS	Produces pps_outcome_category.
	Merge	Keeps both hip_outcome_category and pps_outcome_category on each row. Does not choose a single outcome.
	ESD	Sets final outcome. Chooses final_outcome_category by picking between hip_outcome_category and pps_outcome_category using the same rules as for the end date. See ESD vignette.

So: final start comes from ESD (GW/GR3m inference or end − max_term). Final end and final outcome come from ESD’s choice between HIP and PPS; the merge only puts both values on the same row so ESD can pick one.
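The two routes to the final start can be sketched in a few lines of base R. The function infer_start() and its argument names are hypothetical, and the sketch covers only gestational-week (GW) evidence and the max_term fallback; GR3m midpoint inference and precision handling are left to the ESD vignette.

```r
# Hypothetical helper; not the package's implementation.
infer_start <- function(end_date, max_term_days,
                        gw_date = as.Date(NA), gw_weeks = NA_real_) {
  if (!is.na(gw_date) && !is.na(gw_weeks)) {
    # Gestational-week evidence: concept date minus 7 x weeks.
    gw_date - 7 * gw_weeks
  } else {
    # Fallback: final end minus the maximum term for the chosen outcome.
    end_date - max_term_days
  }
}

# With a 10-week gestation record dated 2021-03-01.
infer_start(as.Date("2021-10-01"), 308,
            gw_date = as.Date("2021-03-01"), gw_weeks = 10)
#> [1] "2020-12-21"

# Without timing evidence, using a 308-day maximum term.
infer_start(as.Date("2021-10-01"), 308)
#> [1] "2020-11-27"
```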

Interpreting the final episode table

The primary analysis dataset is final_pregnancy_episodes.rds, with one row per pregnancy episode. Key fields include:

final_episode_start_date — Best estimate of pregnancy start for this episode. Set by ESD: when ESD has gestational timing (GW and/or GR3m) within the merged evidence window, the start is inferred from that (e.g. concept_date − 7×weeks for GW, or GR3m midpoint). When ESD has no inferred start, it is set to final_episode_end_date − max_term (Matcho term for the chosen outcome, e.g. 308 days for live birth).
final_episode_end_date — Best estimate of pregnancy end for this episode. Not taken from the merged interval (merge_episode_end). Set by ESD by picking one of hip_end_date or pps_end_date: when HIP and PPS agree (same outcome and end within 14 days), ESD uses hip_end_date; when one side is missing, ESD uses the side present; when they disagree, ESD uses rules (e.g. take the one that occurs second, or HIP when timing is similar). See the ESD vignette for the full logic.
final_outcome_category — Outcome category for this episode (e.g. LB, SB, PREG). Set by ESD by picking one of hip_outcome_category or pps_outcome_category using the same priority rules as for the end date.

Together, final_episode_start_date, final_episode_end_date, and final_outcome_category are the primary analysis variables: the pipeline’s final inferred pregnancy interval and outcome.
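The end-date choice described above can be written as a small decision function. This base R sketch is deliberately simplified: choose_end() is a hypothetical helper, and the disagreement branch always takes the later date, whereas the actual ESD rules have more cases (see the ESD vignette).

```r
# Hypothetical helper; the real rules live in the ESD step and are more detailed.
choose_end <- function(hip_end, pps_end, hip_outcome, pps_outcome) {
  if (is.na(hip_end)) return(pps_end)   # PPS-only episode: use the side present
  if (is.na(pps_end)) return(hip_end)   # HIP-only episode: use the side present
  gap_days <- abs(as.integer(hip_end - pps_end))
  if (identical(hip_outcome, pps_outcome) && gap_days <= 14) {
    return(hip_end)                     # agreement: use the HIP end
  }
  max(hip_end, pps_end)                 # simplification: the one that occurs second
}

# Same outcome, ends 7 days apart: HIP end is used.
choose_end(as.Date("2022-05-10"), as.Date("2022-05-03"), "LB", "LB")
#> [1] "2022-05-10"

# No HIP episode: PPS end is used.
choose_end(as.Date(NA), as.Date("2022-05-03"), NA_character_, "LB")
#> [1] "2022-05-03"
```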

merge_episode_start — Start of the evidence window for this episode from the merge step. It is the minimum of: HIP’s first gestation date in the episode, PPS episode min date, and HIP estimated start. So it is the earliest date implied by either algorithm for this merged episode.
merge_episode_end — End of the evidence window for this episode from the merge step. It is the maximum of: PPS episode max date and HIP pregnancy end (visit/outcome date). So it is the latest date implied by either algorithm for this merged episode.

merge_episode_start and merge_episode_end describe the span of raw evidence (HIP + PPS) that was merged for this episode, before ESD refines start/end into the inferred interval.
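As a rough illustration of how the evidence window is derived, the following base R sketch takes the minimum and maximum over the contributing dates and ignores missing ones. The function merge_window() and its argument names are hypothetical stand-ins for the HIP and PPS columns involved, not names from the package.

```r
# Hypothetical helper; argument names are illustrative, not package columns.
merge_window <- function(hip_first_gest_date, hip_estimated_start, pps_min_date,
                         hip_end_date, pps_max_date) {
  starts <- c(hip_first_gest_date, pps_min_date, hip_estimated_start)
  ends   <- c(pps_max_date, hip_end_date)
  list(
    merge_episode_start = min(starts, na.rm = TRUE),  # earliest implied date
    merge_episode_end   = max(ends, na.rm = TRUE)     # latest implied date
  )
}

w <- merge_window(
  hip_first_gest_date = as.Date("2022-02-01"),
  hip_estimated_start = as.Date("2022-01-10"),
  pps_min_date        = as.Date("2022-01-20"),
  hip_end_date        = as.Date("2022-09-30"),
  pps_max_date        = as.Date("2022-10-05")
)
w$merge_episode_start
#> [1] "2022-01-10"
w$merge_episode_end
#> [1] "2022-10-05"
```

Using na.rm = TRUE lets the same function handle one-sided episodes, where the HIP or PPS dates are NA.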

hip_end_date — Episode end date from the HIP algorithm: either the outcome date (e.g. delivery) or the latest gestation date when there is no outcome. NA when the episode was found only by PPS (no overlapping HIP episode).
pps_end_date — Episode end from the PPS algorithm: the inferred outcome date when PPS assigned an outcome (LB, SB, etc.), otherwise the PPS episode max date. NA when the episode was found only by HIP (no overlapping PPS episode).

hip_end_date and pps_end_date are the end dates produced by each algorithm. At least one is always non-NA: both are populated when both algorithms contributed to the episode, and exactly one is NA for one-sided (HIP-only or PPS-only) episodes.
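The NA pattern of these two columns can be used to label where each episode came from. A minimal base R sketch follows; episode_source() is a hypothetical helper and the three rows are invented for illustration.

```r
# Hypothetical helper: label each episode by which algorithm supplied an end date.
episode_source <- function(hip_end_date, pps_end_date) {
  ifelse(
    !is.na(hip_end_date) & !is.na(pps_end_date), "HIP and PPS",
    ifelse(!is.na(hip_end_date), "HIP only", "PPS only")
  )
}

# Invented rows for illustration.
ends <- data.frame(
  hip_end_date = as.Date(c("2021-04-02", NA, "2022-08-19")),
  pps_end_date = as.Date(c("2021-04-05", "2020-11-30", NA))
)

episode_source(ends$hip_end_date, ends$pps_end_date)
#> [1] "HIP and PPS" "PPS only"    "HIP only"
```

Applied to the hip_end_date and pps_end_date columns of final_pregnancy_episodes.rds, wrapping the result in table() gives a quick count of one-sided versus two-sided episodes.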

An example case

library(PregnancyIdentifier)
library(CDMConnector)
library(dplyr, warn.conflicts = FALSE)

cdm <- mockPregnancyCdm()
#> 
#> Download completed!

cdm <- cdmSubset(cdm, personId = 24)

cdm %>% 
  cdmFlatten() %>% 
  select(-"type_concept_id", -"domain") %>% 
  arrange(start_date)
#> # Source:     SQL [?? x 6]
#> # Database:   DuckDB 1.5.2 [unknown@Linux 6.17.0-1010-azure:R 4.5.3//tmp/Rtmpzetqp8/file235a773a72a1.duckdb]
#> # Ordered by: start_date
#>   person_id observation_concept_id start_date end_date observation_concept_name 
#>       <int>                  <int> <date>     <date>   <chr>                    
#> 1        24                4132434 2023-03-15 NA       Gestation period, 8 weeks
#> 2        24                 437611 2023-03-15 NA       Ectopic pregnancy        
#> 3        24                4094910 2023-01-28 NA       Pregnancy test positive  
#> # ℹ 1 more variable: type_concept_name <chr>

td <- tempdir()
if (!dir.exists(td)) dir.create(td, recursive = TRUE, showWarnings = FALSE)
outputFolder <- file.path(td, "pregnancy_output")
dir.create(outputFolder, recursive = TRUE, showWarnings = FALSE)

invisible(capture.output(
  runPregnancyIdentifier(cdm, outputFolder = outputFolder)  # export runs by default to outputFolder/export
))

list.files(outputFolder)
#>  [1] "attrition.csv"                "esd_concept_counts.csv"      
#>  [3] "export"                       "final_pregnancy_episodes.rds"
#>  [5] "hip_concept_counts.csv"       "hip_episodes.rds"            
#>  [7] "hipps_episodes.rds"           "log.txt"                     
#>  [9] "pps_concept_counts.csv"       "pps_episodes.rds"            
#> [11] "runStart.csv"

readRDS(file.path(outputFolder, "final_pregnancy_episodes.rds"))
#> # A tibble: 1 × 45
#>   person_id merge_episode_number final_episode_start_date final_episode_end_date
#>       <int>                <int> <date>                   <date>                
#> 1        24                    1 2023-01-18               2023-03-15            
#> # ℹ 41 more variables: final_outcome_category <chr>,
#> #   merge_episode_start <date>, merge_episode_end <date>, hip_end_date <date>,
#> #   pps_end_date <date>, hip_outcome_category <chr>,
#> #   pps_outcome_category <chr>, esd_precision_days <dbl>,
#> #   esd_precision_category <chr>, esd_gestational_age_days_calculated <int>,
#> #   esd_gw_flag <dbl>, esd_gr3m_flag <dbl>, esd_outcome_match <dbl>,
#> #   esd_term_duration_flag <dbl>, esd_outcome_concordance_score <dbl>, …
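
As a quick check on the printed episode, the interval between the final start and end can be computed directly. The dates below are copied from the output above. The interpretation that the start was inferred from the "Gestation period, 8 weeks" record is a reading of these values rather than something the output states; the esd_gw_flag column of the real result would confirm it.

```r
final_start <- as.Date("2023-01-18")
final_end   <- as.Date("2023-03-15")

# Length of the inferred episode in days and in weeks.
gestation_days <- as.integer(final_end - final_start)
gestation_days
#> [1] 56
gestation_days / 7
#> [1] 8
```

An episode of exactly 8 weeks ending on 2023-03-15 matches the rule concept_date − 7×weeks applied to the 8-week gestation record on that date.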