Merging HIP and PPS episodes into HIPPS • PregnancyIdentifier

The merge step combines pregnancy episodes from HIP (outcome-anchored) and PPS (timing-anchored) into a single HIPPS episode table. It is run via mergeHipps() and writes hipps_episodes.rds, which is the input to the ESD step. See the pipeline overview, HIP, PPS, and ESD vignettes.

What the merge does

The merge takes two episode tables—one from HIP and one from PPS—and produces one table with one row per pregnancy episode per person. For each person:

Overlap join: HIP and PPS episodes are matched within person by temporal overlap of their intervals. Episodes that overlap become candidate merged rows; episodes that do not overlap any other are kept as one-sided rows (HIP-only or PPS-only).
Many-to-many: One HIP episode can overlap several PPS episodes, and one PPS episode can overlap several HIP episodes. The merge resolves this by selecting a single best HIP–PPS pair for each episode identifier. An identifier combines person, episode number, and algorithm source: hip_episode_id = {person_id}_{hip_episode}_hip for HIP episodes and pps_episode_id = {person_id}_{pps_episode_number}_pps for PPS episodes. After deduplication, no identifier appears in more than one row of the merged table.
Final table: Merged dates and outcome fields are standardized, and episodes are ordered within person. The result is written to hipps_episodes.rds.

No rows are dropped except duplicate overlap candidates; one-sided episodes and one-to-one matches are preserved.

Point of the merge. The merge serves three purposes for the rest of the pipeline:

Define which episodes exist — One row per episode after overlap and deduplication. The set of episodes (and their identity) is fixed here.
Put HIP and PPS on the same row — Each row carries both hip_end_date / hip_outcome_category and pps_end_date / pps_outcome_category (when present). The next step, ESD, does not do another overlap-merge of HIP and PPS; it takes this single row and picks one end date and one outcome from the two values already on the row using harmonization rules.
Supply the evidence window — merge_episode_start and merge_episode_end define the interval ESD uses to pull gestational timing concepts (GW/GR3m) from the CDM for inferring pregnancy start. The final episode end date and outcome in the pipeline are not taken from this merged interval—they are chosen by ESD from hip_end_date and pps_end_date (and the two outcomes). See the ESD vignette for how ESD chooses final start, end, and outcome.

Inputs

mergeHipps(outputFolder, logger) reads from outputFolder:

File	Source	Main columns
hip_episodes.rds	`runHip()`	`person_id`, `hip_episode`, `hip_pregnancy_start`, `hip_pregnancy_end`, `hip_first_gest_date`, `hip_outcome_category`, etc.
pps_episodes.rds	`runPps()`	`person_id`, `pps_episode_number`, `pps_episode_min_date`, `pps_episode_max_date`, `pps_episode_max_date_plus_two_months`, outcome columns, etc.

Stage 1: Standardized episode intervals and IDs

HIP and PPS use different date columns. The merge standardizes them to common interval names and adds stable identifiers.

HIP (algorithm 1)
- pregnancy_start = hip_pregnancy_start (internal name); columns such as hip_episode and hip_outcome_category are carried through to the output under their own names
- pregnancy_end = hip_pregnancy_end (HIP episode end)
- first_gest_date = hip_first_gest_date (first gestation date in the episode)
- hip_episode_id = {person_id}_{hip_episode}_hip

PPS (algorithm 2)
- pps_episode_min_date = start of PPS episode evidence
- pps_episode_max_date_plus_two_months = PPS episode end extended by two months (allows delayed outcome capture)
- pps_episode_id = {person_id}_{pps_episode_number}_pps

Overlap is computed on:

HIP: [pregnancy_start, pregnancy_end]
PPS: [pps_episode_min_date, pps_episode_max_date_plus_two_months]
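The standardization above can be sketched in a few lines. This is an illustrative Python sketch, not the package's R code: the helper names (`hip_episode_id`, `pps_episode_id`, `standardize_hip`) and the sample row are hypothetical. Only the identifier formats and the column mapping come from the text.

```python
from datetime import date

def hip_episode_id(person_id, hip_episode):
    # Format taken from the text: {person_id}_{hip_episode}_hip
    return f"{person_id}_{hip_episode}_hip"

def pps_episode_id(person_id, pps_episode_number):
    # Format taken from the text: {person_id}_{pps_episode_number}_pps
    return f"{person_id}_{pps_episode_number}_pps"

def standardize_hip(row):
    """Hypothetical helper: map HIP columns to the common interval names."""
    return {
        "person_id": row["person_id"],
        "hip_episode_id": hip_episode_id(row["person_id"], row["hip_episode"]),
        "pregnancy_start": row["hip_pregnancy_start"],
        "pregnancy_end": row["hip_pregnancy_end"],
        "first_gest_date": row["hip_first_gest_date"],
    }

# Made-up input row, for illustration only.
hip_row = {
    "person_id": 100,
    "hip_episode": 1,
    "hip_pregnancy_start": date(2020, 1, 1),
    "hip_pregnancy_end": date(2020, 9, 15),
    "hip_first_gest_date": date(2020, 3, 1),
}
standardized = standardize_hip(hip_row)
print(standardized["hip_episode_id"], pps_episode_id(100, 2))
```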

Stage 2: Merge by overlap (full join)

Episodes are merged within person using a full join on overlap:

Overlap condition:
pregnancy_start <= pps_episode_max_date_plus_two_months and
pregnancy_end >= pps_episode_min_date

So two episodes match if their intervals intersect (including containment).

HIP:  [pregnancy_start ---------------- pregnancy_end]
PPS:              [pps_episode_min_date -------- pps_episode_max_date_plus_two_months]
                  <-------- overlap ------->

Rows that match: one merged row per (HIP episode, PPS episode) pair.
Rows that do not match: HIP-only or PPS-only rows (one side is NA) are kept via the full join.
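The overlap condition and the full-join behavior can be illustrated with a small sketch. It is written in plain Python for illustration and is not the package's R implementation; `overlaps`, `full_overlap_join`, and the sample episodes are hypothetical. Pairs are returned as `(hip, pps)` tuples with `None` standing in for the NA side.

```python
from datetime import date

def overlaps(hip, pps):
    # Closed intervals intersect (touching endpoints count as overlap).
    return (hip["pregnancy_start"] <= pps["pps_episode_max_date_plus_two_months"]
            and hip["pregnancy_end"] >= pps["pps_episode_min_date"])

def full_overlap_join(hip_rows, pps_rows):
    """Within-person full join: matched pairs plus HIP-only and PPS-only rows."""
    out, matched_pps = [], set()
    for h in hip_rows:
        hit = False
        for i, p in enumerate(pps_rows):
            if h["person_id"] == p["person_id"] and overlaps(h, p):
                out.append((h, p))
                matched_pps.add(i)
                hit = True
        if not hit:
            out.append((h, None))      # HIP-only row
    for i, p in enumerate(pps_rows):
        if i not in matched_pps:
            out.append((None, p))      # PPS-only row
    return out

# Made-up episodes for one person.
hip_rows = [{"person_id": 1, "hip_episode_id": "1_1_hip",
             "pregnancy_start": date(2020, 1, 1), "pregnancy_end": date(2020, 9, 15)}]
pps_rows = [
    {"person_id": 1, "pps_episode_id": "1_1_pps",
     "pps_episode_min_date": date(2020, 2, 1),
     "pps_episode_max_date_plus_two_months": date(2020, 11, 10)},
    {"person_id": 1, "pps_episode_id": "1_2_pps",
     "pps_episode_min_date": date(2021, 6, 1),
     "pps_episode_max_date_plus_two_months": date(2022, 2, 1)},
]
pairs = full_overlap_join(hip_rows, pps_rows)
```

Here the first PPS episode pairs with the HIP episode, and the second PPS episode survives as a PPS-only row.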

For each row (matched or one-sided), the merge computes:

merged_episode_start = min(first_gest_date, pps_episode_min_date, pregnancy_start)
merged_episode_end = max(pps_episode_max_date, pregnancy_end)
merged_episode_length = (merged_episode_end − merged_episode_start) in months (days / 30.25)
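As a quick check of the three formulas, here is a minimal Python sketch (not the package's R code). The function name `merged_interval` is hypothetical, and skipping missing values (`None`) before taking the min and max is an assumption of this sketch: the text says the interval is computed for one-sided rows too, but does not say how NA is handled.

```python
from datetime import date

def merged_interval(first_gest_date, pps_episode_min_date, pregnancy_start,
                    pps_episode_max_date, pregnancy_end):
    # Assumption: missing values (None) are ignored on one-sided rows.
    starts = [d for d in (first_gest_date, pps_episode_min_date, pregnancy_start)
              if d is not None]
    ends = [d for d in (pps_episode_max_date, pregnancy_end) if d is not None]
    start, end = min(starts), max(ends)
    length_months = (end - start).days / 30.25   # months as days / 30.25
    return start, end, length_months

# Made-up dates for a matched HIP/PPS row.
start, end, length = merged_interval(
    first_gest_date=date(2020, 3, 1),
    pps_episode_min_date=date(2020, 2, 1),
    pregnancy_start=date(2020, 1, 1),
    pps_episode_max_date=date(2020, 9, 10),
    pregnancy_end=date(2020, 9, 15),
)
print(start, end, round(length, 2))
```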

Duplicate flags (used in Stage 3):

duplicated_hip_episode_id = 1 if that hip_episode_id appears in more than one row (HIP episode matched to multiple PPS).
duplicated_pps_episode_id = 1 if that pps_episode_id appears in more than one row (PPS episode matched to multiple HIP).
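The two flags amount to counting how often each identifier occurs in the joined table. The following Python sketch is illustrative only (`add_dup_flags` and the sample rows are hypothetical). Leaving the flag as `None` when the identifier is missing is inferred from the later statement that one-sided rows have an NA flag.

```python
from collections import Counter

def add_dup_flags(rows):
    """Flag ids that occur in more than one row; None where the id is missing."""
    hip_counts = Counter(r["hip_episode_id"] for r in rows
                         if r["hip_episode_id"] is not None)
    pps_counts = Counter(r["pps_episode_id"] for r in rows
                         if r["pps_episode_id"] is not None)
    for r in rows:
        h, p = r["hip_episode_id"], r["pps_episode_id"]
        r["duplicated_hip_episode_id"] = None if h is None else int(hip_counts[h] > 1)
        r["duplicated_pps_episode_id"] = None if p is None else int(pps_counts[p] > 1)
    return rows

# Made-up joined rows: one HIP episode matched to two PPS episodes, plus a HIP-only row.
rows = [
    {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_1_pps"},
    {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_2_pps"},
    {"hip_episode_id": "100_2_hip", "pps_episode_id": None},
]
flagged = add_dup_flags(rows)
```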

Stage 3: Deduplication loop

When many HIP and PPS episodes overlap, the overlap join can produce many-to-many matches: one HIP episode paired with several PPS episodes, and/or one PPS episode paired with several HIP episodes. The deduplication step keeps one best match per HIP episode and per PPS episode (and keeps one-to-one and one-sided rows unchanged).

Split: non-duplicates vs duplicates

baseKeep: Rows where neither episode is duplicated:
(duplicated_hip_episode_id == 0 & duplicated_pps_episode_id == 0) or one-sided (one of the dup flags is NA).
These are left unchanged (one-to-one matches and HIP-only or PPS-only episodes).
dupDf: Rows where at least one side is duplicated and the row is an overlap (both HIP and PPS present):
(duplicated_hip_episode_id == 1 & pps_episode_id present) or (duplicated_pps_episode_id == 1 & hip_episode_id present).
Only these rows go through the “pick best” logic.
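The two conditions can be written as predicates over a flagged row. This Python sketch is for illustration and is not the package's R code; `in_base_keep`, `in_dup_df`, and the sample rows are hypothetical, and `None` stands in for NA.

```python
def in_base_keep(r):
    dh, dp = r["duplicated_hip_episode_id"], r["duplicated_pps_episode_id"]
    # One-to-one match (both flags 0) or one-sided row (a flag is missing).
    return dh is None or dp is None or (dh == 0 and dp == 0)

def in_dup_df(r):
    # At least one side duplicated, and the row is a genuine overlap.
    return ((r["duplicated_hip_episode_id"] == 1 and r["pps_episode_id"] is not None)
            or (r["duplicated_pps_episode_id"] == 1 and r["hip_episode_id"] is not None))

# Made-up flagged rows.
rows = [
    {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_1_pps",
     "duplicated_hip_episode_id": 1, "duplicated_pps_episode_id": 0},
    {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_2_pps",
     "duplicated_hip_episode_id": 1, "duplicated_pps_episode_id": 0},
    {"hip_episode_id": "100_2_hip", "pps_episode_id": None,
     "duplicated_hip_episode_id": 0, "duplicated_pps_episode_id": None},
    {"hip_episode_id": "100_3_hip", "pps_episode_id": "100_3_pps",
     "duplicated_hip_episode_id": 0, "duplicated_pps_episode_id": 0},
]
base_keep = [r for r in rows if in_base_keep(r)]
dup_df = [r for r in rows if in_dup_df(r)]
```

With these sample rows the two sets partition the table: every row lands in exactly one of them.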

Scoring overlap rows

For each candidate row in dupDf, the algorithm computes:

date_diff = |pregnancy_end − pps_episode_max_date| (days).
Smaller values mean HIP and PPS agree on episode end; the best match minimizes this.
Missing PPS outcome (when choosing by HIP)
When selecting the best PPS match for a duplicated HIP episode, rows where pps_outcome_category is missing are penalized: date_diff is set to a large value (10000) so they are chosen only if no other match exists.
pps_days = |pps_episode_max_date − pps_episode_min_date| (PPS episode length in days).
Used for tie-breaking: PPS episodes longer than 310 days are treated as implausible and get pps_days = -1, so they lose ties. Among plausible episodes, longer PPS duration is preferred (larger pps_days wins).
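The scoring rules can be condensed into one small function. This is a hedged Python sketch rather than the package's R code: `score`, the constant names, and the sample dates are hypothetical, while the values 10000 and 310 come from the text.

```python
from datetime import date

MISSING_OUTCOME_PENALTY = 10000   # date_diff used when the PPS outcome is missing
MAX_PLAUSIBLE_PPS_DAYS = 310      # longer PPS episodes get pps_days = -1

def score(row, penalize_missing_outcome):
    """Return (date_diff, pps_days) for one overlap row."""
    date_diff = abs((row["pregnancy_end"] - row["pps_episode_max_date"]).days)
    if penalize_missing_outcome and row["pps_outcome_category"] is None:
        date_diff = MISSING_OUTCOME_PENALTY    # only when choosing by HIP
    pps_days = abs((row["pps_episode_max_date"] - row["pps_episode_min_date"]).days)
    if pps_days > MAX_PLAUSIBLE_PPS_DAYS:
        pps_days = -1                          # implausible length loses ties
    return date_diff, pps_days

# Made-up rows sharing one HIP episode end date.
plausible = {"pregnancy_end": date(2020, 9, 15), "pps_outcome_category": "LB",
             "pps_episode_min_date": date(2020, 2, 1),
             "pps_episode_max_date": date(2020, 9, 10)}
no_outcome = dict(plausible, pps_outcome_category=None)
too_long = dict(plausible, pps_episode_min_date=date(2019, 10, 1))

print(score(plausible, True), score(no_outcome, True), score(too_long, True))
```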

Selection rule (“pick best”)

For a set of rows that share the same duplicated identifier (e.g. same hip_episode_id):

Keep rows with smallest date_diff.
Among ties, keep rows with largest pps_days (so plausible, longer PPS episodes win).

This is implemented as: slice_min(date_diff, n = 1) then slice_max(pps_days, n = 1). The first round uses with_ties = TRUE (keep all ties); later rounds use with_ties = FALSE for a deterministic single winner.
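For readers less familiar with `slice_min` and `slice_max`, the same selection can be sketched in plain Python. This is an approximation for illustration, not the package's code: `pick_best` and the pre-scored rows are hypothetical. The text does not spell out how `with_ties = FALSE` interacts with the two-step slice, so this sketch assumes both filters are applied first and the first surviving row is kept.

```python
def pick_best(rows, key, with_ties):
    """Per group of `key`: smallest date_diff, then largest pps_days."""
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r)
    out = []
    for group in groups.values():
        best_diff = min(r["date_diff"] for r in group)
        group = [r for r in group if r["date_diff"] == best_diff]
        best_days = max(r["pps_days"] for r in group)
        group = [r for r in group if r["pps_days"] == best_days]
        # Assumption: without ties, keep the first row that survives both filters.
        out.extend(group if with_ties else group[:1])
    return out

# Made-up, already-scored candidates for one duplicated HIP episode.
candidates = [
    {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_1_pps",
     "date_diff": 5, "pps_days": 222},
    {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_2_pps",
     "date_diff": 5, "pps_days": 172},
    {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_3_pps",
     "date_diff": 40, "pps_days": 250},
]
winner = pick_best(candidates, "hip_episode_id", with_ties=False)
```

The third candidate has the longest PPS episode but never reaches the tie-break, because `date_diff` is compared first.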

Iterative loop (why multiple rounds?)

After picking the best match by HIP (hip_episode_id), some pps_episode_id values can still appear in more than one row. After picking the best by PPS (pps_episode_id), some hip_episode_id values can again appear more than once. So one round is not enough.

The algorithm:

Initial pass: From dupDf, pick best by hip_episode_id (with missing-outcome penalty) and best by pps_episode_id (no penalty), combine them, then recompute duplicated_hip_episode_id and duplicated_pps_episode_id on this set.
Up to 10 rounds: Again pick best by hip_episode_id among rows still with duplicated_hip_episode_id == 1, and best by pps_episode_id among rows still with duplicated_pps_episode_id == 1; recombine and recompute dup flags. After each round, the rows that are no longer duplicated (dup flags 0 or NA) are retained for the final table.
Final set: baseKeep (unchanged) plus all rows that are non-duplicated after the last round, then distinct().

So the loop repeatedly reduces duplicate HIP and PPS ids until no overlap row is duplicated on either side (or the 10-round cap is reached), then combines those resolved rows with the non-duplicate rows.
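One plausible reading of the loop is sketched below in Python. It is an interpretation for illustration, not the package's R code. The assumptions are: all function names are hypothetical; rows arrive with `date_diff` and `pps_days` already computed; "pick best" is applied only to rows whose id is currently duplicated; rows still duplicated after the last round are not returned.

```python
from collections import Counter

PENALTY = 10000  # date_diff stand-in for rows with no PPS outcome (by-HIP only)

def pick_best(rows, key, with_ties, penalize):
    """Per `key` group: smallest date_diff, then largest pps_days."""
    def diff(r):
        missing = r["pps_outcome_category"] is None
        return PENALTY if penalize and missing else r["date_diff"]
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r)
    out = []
    for group in groups.values():
        best = min(diff(r) for r in group)
        group = [r for r in group if diff(r) == best]
        top = max(r["pps_days"] for r in group)
        group = [r for r in group if r["pps_days"] == top]
        out.extend(group if with_ties else group[:1])
    return out

def duplicated(rows, key):
    counts = Counter(r[key] for r in rows)
    return {k for k, n in counts.items() if n > 1}

def one_pass(rows, with_ties):
    """Pick best by HIP id and by PPS id among duplicated rows, then combine."""
    dup_hip = duplicated(rows, "hip_episode_id")
    dup_pps = duplicated(rows, "pps_episode_id")
    by_hip = pick_best([r for r in rows if r["hip_episode_id"] in dup_hip],
                       "hip_episode_id", with_ties, penalize=True)
    by_pps = pick_best([r for r in rows if r["pps_episode_id"] in dup_pps],
                       "pps_episode_id", with_ties, penalize=False)
    seen, combined = set(), []
    for r in by_hip + by_pps:            # drop rows picked by both selections
        pair = (r["hip_episode_id"], r["pps_episode_id"])
        if pair not in seen:
            seen.add(pair)
            combined.append(r)
    return combined

def split_resolved(rows):
    """Recompute dup flags; return (no longer duplicated, still duplicated)."""
    dup_hip = duplicated(rows, "hip_episode_id")
    dup_pps = duplicated(rows, "pps_episode_id")
    resolved, pending = [], []
    for r in rows:
        is_dup = r["hip_episode_id"] in dup_hip or r["pps_episode_id"] in dup_pps
        (pending if is_dup else resolved).append(r)
    return resolved, pending

def deduplicate(dup_df, max_rounds=10):
    current = one_pass(dup_df, with_ties=True)     # initial pass keeps ties
    kept = []
    for _ in range(max_rounds):
        resolved, current = split_resolved(current)
        kept.extend(resolved)                      # retained for the final table
        if not current:
            break
        current = one_pass(current, with_ties=False)
    resolved, _ = split_resolved(current)          # rows resolved by the last round
    kept.extend(resolved)
    return kept
```

The final table would then be the untouched baseKeep rows plus the result of `deduplicate(dup_df)`.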

Deduplication example

Setup: One HIP episode overlaps two PPS episodes.

Person 100, HIP episode 1:  [2020-01-01 -------- 2020-09-15]  (pregnancy_end = 2020-09-15)
Person 100, PPS episode 1:     [2020-02-01 --- 2020-09-10]   (pps_episode_max_date = 2020-09-10)
Person 100, PPS episode 2:              [2020-04-01 --- 2020-09-20]   (pps_episode_max_date = 2020-09-20)

After the overlap join there are 2 rows (HIP 1–PPS 1 and HIP 1–PPS 2). Both have duplicated_hip_episode_id == 1 (same HIP in two rows).

Scoring:

Row	pregnancy_end	pps_episode_max_date	date_diff	pps_days
HIP1–PPS1	2020-09-15	2020-09-10	5	222
HIP1–PPS2	2020-09-15	2020-09-20	5	172

Pick best by hip_episode_id: Same date_diff; tie-break by pps_days. Row HIP1–PPS1 wins (222 > 172). So the merged table keeps one row for HIP episode 1: the pair (HIP1, PPS1).
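The numbers in the scoring table can be reproduced directly from the dates in the setup. The short Python check below is only an illustration (variable names are hypothetical). It skips the 310-day plausibility rule because both PPS episodes are shorter than that.

```python
from datetime import date

pregnancy_end = date(2020, 9, 15)                      # HIP episode 1
pps_episodes = {                                       # (min_date, max_date)
    "PPS1": (date(2020, 2, 1), date(2020, 9, 10)),
    "PPS2": (date(2020, 4, 1), date(2020, 9, 20)),
}

scores = {}
for name, (min_date, max_date) in pps_episodes.items():
    date_diff = abs((pregnancy_end - max_date).days)   # agreement on episode end
    pps_days = (max_date - min_date).days              # PPS episode length
    scores[name] = (date_diff, pps_days)

# Smallest date_diff first, then largest pps_days.
winner = min(scores, key=lambda n: (scores[n][0], -scores[n][1]))
print(scores, winner)
```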

Another case: One PPS episode overlaps two HIP episodes.

Person 200, HIP episode 1: [2020-03-01 -- 2020-10-01]
Person 200, HIP episode 2:        [2020-05-01 -- 2020-11-15]
Person 200, PPS episode 1:    [2020-04-01 -------- 2020-10-20]

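PPS 1 overlaps both HIP 1 and HIP 2, so the join yields 2 rows, both with duplicated_pps_episode_id == 1. The row with the smaller |pregnancy_end − pps_episode_max_date| is kept, which chooses the single best HIP for this PPS episode. Because both rows share the same PPS episode, pps_days is identical on both and cannot break a tie here. If the PPS end shown (2020-10-20) is pps_episode_max_date, date_diff is 19 days for HIP 1 and 26 days for HIP 2, so the pair (HIP1, PPS1) is kept.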

Post-deduplication: standardized columns

After deduplication, addMergedEpisodeDetails():

Fills PPS outcome when missing: if a row has pps_episode_id but pps_outcome_category is NA, set pps_outcome_category = "PREG" and set pps_outcome_date = pps_episode_max_date.
Renames to the final names used in the output: pregnancy_end → hip_end_date, pps_outcome_date → pps_end_date, merged_episode_start → merge_episode_start, merged_episode_end → merge_episode_end, merged_episode_length → merge_episode_length, pregnancy_start → merge_pregnancy_start, first_gest_date → merge_first_gest_date. HIP supplies hip_outcome_category; PPS outcome column remains pps_outcome_category.
Adds hip_flag (1 if the row has a HIP episode, 0 otherwise) and pps_flag (1 if it has a PPS episode, 0 otherwise).
Recomputes merged episode dates from the retained rows, then orders episodes within person by merge_episode_start and assigns merge_episode_number (1, 2, 3, … per person).
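The fill, rename, flag, and numbering steps can be sketched as follows. This is a simplified Python illustration, not the R function addMergedEpisodeDetails(): the function name `add_merged_episode_details` and the sample rows are hypothetical, `None` stands in for NA, and the step that recomputes merged episode dates is omitted.

```python
from datetime import date

RENAMES = {
    "pregnancy_end": "hip_end_date",
    "pps_outcome_date": "pps_end_date",
    "merged_episode_start": "merge_episode_start",
    "merged_episode_end": "merge_episode_end",
    "merged_episode_length": "merge_episode_length",
    "pregnancy_start": "merge_pregnancy_start",
    "first_gest_date": "merge_first_gest_date",
}

def add_merged_episode_details(rows):
    out = []
    for r in rows:
        r = dict(r)
        # Fill a missing PPS outcome on rows that do have a PPS episode.
        if r["pps_episode_id"] is not None and r["pps_outcome_category"] is None:
            r["pps_outcome_category"] = "PREG"
            r["pps_outcome_date"] = r["pps_episode_max_date"]
        r = {RENAMES.get(k, k): v for k, v in r.items()}
        r["hip_flag"] = int(r["hip_episode_id"] is not None)
        r["pps_flag"] = int(r["pps_episode_id"] is not None)
        out.append(r)
    # Order within person by merge_episode_start, then number 1, 2, 3, ...
    out.sort(key=lambda r: (r["person_id"], r["merge_episode_start"]))
    counters = {}
    for r in out:
        counters[r["person_id"]] = counters.get(r["person_id"], 0) + 1
        r["merge_episode_number"] = counters[r["person_id"]]
    return out

# Made-up deduplicated rows: a PPS-only episode and an earlier matched episode.
rows = [
    {"person_id": 1, "hip_episode_id": None, "pps_episode_id": "1_2_pps",
     "pregnancy_end": None, "pps_outcome_category": None, "pps_outcome_date": None,
     "pps_episode_max_date": date(2021, 9, 1), "merged_episode_start": date(2021, 1, 1)},
    {"person_id": 1, "hip_episode_id": "1_1_hip", "pps_episode_id": "1_1_pps",
     "pregnancy_end": date(2020, 9, 15), "pps_outcome_category": "LB",
     "pps_outcome_date": date(2020, 9, 12),
     "pps_episode_max_date": date(2020, 9, 10), "merged_episode_start": date(2020, 1, 1)},
]
final = add_merged_episode_details(rows)
```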

Output file: hipps_episodes.rds

mergeHipps() writes a single file to outputFolder:

File	Description
hipps_episodes.rds	One row per merged pregnancy episode per person; input to `runEsd()`.

If there are no episodes, the same path is written with an empty tibble that has the correct column schema (see emptyHippsEpisodes()).

Output columns

The saved data frame includes the following. Columns in groups marked “always present” are in every run, although their values can be NA on one-sided rows; columns under “Optional / when applicable” may be NA or omitted depending on whether the row came from HIP, PPS, or both.

Identifiers and ordering

Column	Type	Description
`person_id`	integer	Person identifier.
`merge_episode_number`	integer	Within-person episode index (1, 2, 3, …) by `merge_episode_start`.

Merged episode interval (always present)

Column	Type	Description
`merge_episode_start`	Date	Start of the merged episode: min of `merge_first_gest_date`, `pps_episode_min_date`, `merge_pregnancy_start`.
`merge_episode_end`	Date	End of the merged episode: max of `pps_episode_max_date`, `hip_end_date`.
`merge_episode_length`	numeric	Length in months (days / 30.25).

Algorithm-specific end dates (always present)

Column	Type	Description
`hip_end_date`	Date	HIP episode end (`hip_pregnancy_end`). NA if row is PPS-only.
`pps_end_date`	Date	PPS outcome/end date (inferred or `pps_episode_max_date`). NA if row is HIP-only.

Outcome categories (always present)

Column	Type	Description
`hip_outcome_category`	character	Outcome category from HIP (e.g. LB, SB, PREG). NA if PPS-only.
`pps_outcome_category`	character	Outcome category from PPS; set to `"PREG"` when PPS has no inferred outcome. NA if HIP-only.

Algorithm presence flags (always present)

Column	Type	Description
`hip_flag`	integer	1 if this row has a HIP episode, 0 otherwise.
`pps_flag`	integer	1 if this row has a PPS episode, 0 otherwise.

Optional / when applicable

These columns may be included when present in the merged data (e.g. for ESD or debugging). They can be NA for one-sided episodes.

Column	Type	Description
`merge_pregnancy_start`	Date	HIP `hip_pregnancy_start` (standardized name).
`merge_first_gest_date`	Date	First gestation date in the HIP episode (from `hip_first_gest_date`).
`pps_episode_min_date`	Date	PPS episode start.
`pps_episode_max_date`	Date	PPS episode end (before 2-month extension).
`pps_episode_max_date_plus_two_months`	Date	PPS end + 2 months (used for overlap).
`hip_episode`	integer	HIP episode index.
`pps_episode_number`	integer	PPS episode index.
`hip_episode_id`	character	HIP episode id: `{person_id}_{hip_episode}_hip`.
`pps_episode_id`	character	PPS episode id: `{person_id}_{pps_episode_number}_pps`.

Downstream, runEsd() expects at least: person_id, merge_episode_number, merge_pregnancy_start, merge_episode_start, merge_episode_end. See the ESD vignette and the pipeline overview.
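A pre-flight check for these columns could look like the sketch below. It is a hypothetical Python helper for illustration, not part of the package, and runEsd() may perform its own validation. Only the list of required column names comes from the text.

```python
REQUIRED_FOR_ESD = (
    "person_id",
    "merge_episode_number",
    "merge_pregnancy_start",
    "merge_episode_start",
    "merge_episode_end",
)

def missing_esd_columns(columns):
    """Return the required columns that are absent from `columns`."""
    present = set(columns)
    return [c for c in REQUIRED_FOR_ESD if c not in present]

# Made-up column list with one required column left out.
example_columns = ["person_id", "merge_episode_number", "merge_episode_start",
                   "merge_episode_end", "hip_flag", "pps_flag"]
print(missing_esd_columns(example_columns))
```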