
The merge step combines pregnancy episodes from HIP (outcome-anchored) and PPS (timing-anchored) into a single HIPPS episode table. It is run via mergeHipps() and writes hipps_episodes.rds, which is the input to the ESD step. See the pipeline overview, HIP, PPS, and ESD vignettes.

What the merge does

The merge takes two episode tables—one from HIP and one from PPS—and produces one table with one row per pregnancy episode per person. For each person:

  1. Overlap join: HIP and PPS episodes are matched within person by temporal overlap of their intervals. Episodes that overlap become candidate merged rows; episodes that do not overlap any other are kept as one-sided rows (HIP-only or PPS-only).
  2. Many-to-many: One HIP episode can overlap several PPS episodes, and one PPS episode can overlap several HIP episodes. To resolve this, the merge selects a single best HIP–PPS pair for each episode identifier. Identifiers are unique per person and algorithm: hip_episode_id = {person_id}_{hip_episode}_hip for HIP episodes and pps_episode_id = {person_id}_{pps_episode_number}_pps for PPS episodes. The merged table keeps one matched pair per identifier, so each episode appears only once in the output.
  3. Final table: Merged dates and outcome fields are standardized, and episodes are ordered within person. The result is written to hipps_episodes.rds.

No rows are dropped except duplicate overlap candidates; one-sided episodes and one-to-one matches are preserved.

Point of the merge. The merge serves three purposes for the rest of the pipeline:

  1. Define which episodes exist — One row per episode after overlap and deduplication. The set of episodes (and their identity) is fixed here.
  2. Put HIP and PPS on the same row — Each row carries both hip_end_date / hip_outcome_category and pps_end_date / pps_outcome_category (when present). The next step, ESD, does not do another overlap-merge of HIP and PPS; it takes this single row and picks one end date and one outcome from the two values already on the row using harmonization rules.
  3. Supply the evidence window: merge_episode_start and merge_episode_end define the interval ESD uses to pull gestational timing concepts (GW/GR3m) from the CDM for inferring pregnancy start. The final episode end date and outcome in the pipeline are not taken from this merged interval; they are chosen by ESD from hip_end_date and pps_end_date (and the two outcomes). See the ESD vignette for how ESD chooses final start, end, and outcome.

Inputs

mergeHipps(outputFolder, logger) reads from outputFolder:

| File | Source | Main columns |
|---|---|---|
| hip_episodes.rds | runHip() | person_id, hip_episode, hip_pregnancy_start, hip_pregnancy_end, hip_first_gest_date, hip_outcome_category, etc. |
| pps_episodes.rds | runPps() | person_id, pps_episode_number, pps_episode_min_date, pps_episode_max_date, pps_episode_max_date_plus_two_months, outcome columns, etc. |

Stage 1: Standardized episode intervals and IDs

HIP and PPS use different date columns. The merge standardizes them to common interval names and adds stable identifiers.

HIP (algorithm 1)
- pregnancy_start = hip_pregnancy_start (internal name); hip_episode, hip_outcome_category, and the other HIP columns are carried through to the output
- pregnancy_end = hip_pregnancy_end (HIP episode end)
- first_gest_date = hip_first_gest_date (first gestation date in the episode)
- hip_episode_id = {person_id}_{hip_episode}_hip

PPS (algorithm 2)
- pps_episode_min_date = start of PPS episode evidence
- pps_episode_max_date_plus_two_months = PPS episode end extended by two months (allows delayed outcome capture)
- pps_episode_id = {person_id}_{pps_episode_number}_pps

Overlap is computed on:

  • HIP: [pregnancy_start, pregnancy_end]
  • PPS: [pps_episode_min_date, pps_episode_max_date_plus_two_months]
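The standardization above can be sketched as follows. This is an illustrative Python sketch, not the package's R code: the helper names standardize_hip and standardize_pps and the example values are hypothetical, while the column names and identifier formats come from the description above.

```python
from datetime import date

def standardize_hip(ep):
    """Map HIP columns to the common interval names and add hip_episode_id."""
    return {
        "person_id": ep["person_id"],
        "pregnancy_start": ep["hip_pregnancy_start"],
        "pregnancy_end": ep["hip_pregnancy_end"],
        "first_gest_date": ep["hip_first_gest_date"],
        "hip_episode_id": f'{ep["person_id"]}_{ep["hip_episode"]}_hip',
    }

def standardize_pps(ep):
    """Keep the PPS interval columns and add pps_episode_id."""
    return {
        "person_id": ep["person_id"],
        "pps_episode_min_date": ep["pps_episode_min_date"],
        "pps_episode_max_date": ep["pps_episode_max_date"],
        "pps_episode_max_date_plus_two_months": ep["pps_episode_max_date_plus_two_months"],
        "pps_episode_id": f'{ep["person_id"]}_{ep["pps_episode_number"]}_pps',
    }

# Hypothetical input rows (values are made up for illustration).
hip = standardize_hip({
    "person_id": 100, "hip_episode": 1,
    "hip_pregnancy_start": date(2020, 1, 1),
    "hip_pregnancy_end": date(2020, 9, 15),
    "hip_first_gest_date": date(2020, 1, 20),
})
pps = standardize_pps({
    "person_id": 100, "pps_episode_number": 1,
    "pps_episode_min_date": date(2020, 2, 1),
    "pps_episode_max_date": date(2020, 9, 10),
    "pps_episode_max_date_plus_two_months": date(2020, 11, 10),
})
print(hip["hip_episode_id"], pps["pps_episode_id"])  # 100_1_hip 100_1_pps
```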

Stage 2: Merge by overlap (full join)

Episodes are merged within person using a full join on overlap:

  • Overlap condition:
    pregnancy_start <= pps_episode_max_date_plus_two_months and
    pregnancy_end >= pps_episode_min_date

So two episodes match if their intervals intersect (including containment).

HIP:  [pregnancy_start ---------------- pregnancy_end]
PPS:              [pps_episode_min_date -------- pps_episode_max_date_plus_two_months]
                  <-------- overlap ------->
  • Rows that match: one merged row per (HIP episode, PPS episode) pair.
  • Rows that do not match: HIP-only or PPS-only rows (one side is NA) are kept via the full join.
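A minimal Python sketch of the overlap condition and the full join, for illustration only (the package itself is R). The function names overlaps and full_overlap_join are hypothetical, a missing key stands in for NA on one-sided rows, and the example episodes are made up.

```python
from datetime import date

def overlaps(hip, pps):
    """Overlap condition from the text: the two closed intervals intersect."""
    return (hip["pregnancy_start"] <= pps["pps_episode_max_date_plus_two_months"]
            and hip["pregnancy_end"] >= pps["pps_episode_min_date"])

def full_overlap_join(hip_rows, pps_rows):
    """Return matched pairs plus HIP-only and PPS-only rows (full join)."""
    rows, matched_hip, matched_pps = [], set(), set()
    for h in hip_rows:
        for p in pps_rows:
            if h["person_id"] == p["person_id"] and overlaps(h, p):
                rows.append({**h, **p})  # one row per overlapping pair
                matched_hip.add(h["hip_episode_id"])
                matched_pps.add(p["pps_episode_id"])
    # Unmatched episodes survive as one-sided rows.
    rows += [dict(h) for h in hip_rows if h["hip_episode_id"] not in matched_hip]
    rows += [dict(p) for p in pps_rows if p["pps_episode_id"] not in matched_pps]
    return rows

hip_rows = [
    {"person_id": 100, "hip_episode_id": "100_1_hip",
     "pregnancy_start": date(2020, 1, 1), "pregnancy_end": date(2020, 9, 15)},
    {"person_id": 200, "hip_episode_id": "200_1_hip",
     "pregnancy_start": date(2019, 5, 1), "pregnancy_end": date(2020, 1, 10)},
]
pps_rows = [
    {"person_id": 100, "pps_episode_id": "100_1_pps",
     "pps_episode_min_date": date(2020, 2, 1),
     "pps_episode_max_date_plus_two_months": date(2020, 11, 10)},
    {"person_id": 100, "pps_episode_id": "100_2_pps",
     "pps_episode_min_date": date(2021, 6, 1),
     "pps_episode_max_date_plus_two_months": date(2022, 1, 15)},
]
joined = full_overlap_join(hip_rows, pps_rows)
for row in joined:
    print(row.get("hip_episode_id"), row.get("pps_episode_id"))
```

In this example the first HIP episode pairs with the first PPS episode, while the second HIP episode and the second PPS episode are kept as one-sided rows.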

For each row (matched or one-sided), the merge computes:

  • merged_episode_start = min(first_gest_date, pps_episode_min_date, pregnancy_start)
  • merged_episode_end = max(pps_episode_max_date, pregnancy_end)
  • merged_episode_length = (merged_episode_end - merged_episode_start) in months (days / 30.25)
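The three computed fields can be illustrated with a short Python sketch (not the package's R code). It assumes that missing dates on one-sided rows are simply ignored by min and max; the helper name merged_interval and the first_gest_date value are hypothetical.

```python
from datetime import date

def _present(*dates):
    """Drop missing (None) dates; assumes NA values are ignored."""
    return [d for d in dates if d is not None]

def merged_interval(row):
    """Compute merged start, end, and length in months (days / 30.25)."""
    start = min(_present(row.get("first_gest_date"),
                         row.get("pps_episode_min_date"),
                         row.get("pregnancy_start")))
    end = max(_present(row.get("pps_episode_max_date"),
                       row.get("pregnancy_end")))
    return start, end, (end - start).days / 30.25

row = {
    "pregnancy_start": date(2020, 1, 1),
    "pregnancy_end": date(2020, 9, 15),
    "first_gest_date": date(2020, 1, 20),   # hypothetical value
    "pps_episode_min_date": date(2020, 2, 1),
    "pps_episode_max_date": date(2020, 9, 10),
}
start, end, length_months = merged_interval(row)
print(start, end, round(length_months, 2))  # 258 days between start and end
```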

Duplicate flags (used in Stage 3):

  • duplicated_hip_episode_id = 1 if that hip_episode_id appears in more than one row (HIP episode matched to multiple PPS).
  • duplicated_pps_episode_id = 1 if that pps_episode_id appears in more than one row (PPS episode matched to multiple HIP).
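As an illustration (Python, not the package's R code), the flags can be derived by counting how often each identifier occurs. The function name add_duplicate_flags is hypothetical; the flag is left as None (NA) when the corresponding identifier is missing, which matches the later note that one-sided rows have an NA dup flag.

```python
from collections import Counter

def add_duplicate_flags(rows):
    """Flag identifiers that occur in more than one joined row."""
    hip_counts = Counter(r["hip_episode_id"] for r in rows
                         if r.get("hip_episode_id") is not None)
    pps_counts = Counter(r["pps_episode_id"] for r in rows
                         if r.get("pps_episode_id") is not None)
    flagged = []
    for r in rows:
        hid, pid = r.get("hip_episode_id"), r.get("pps_episode_id")
        flagged.append({
            **r,
            "duplicated_hip_episode_id": None if hid is None else int(hip_counts[hid] > 1),
            "duplicated_pps_episode_id": None if pid is None else int(pps_counts[pid] > 1),
        })
    return flagged

rows = [
    {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_1_pps"},
    {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_2_pps"},
    {"hip_episode_id": None, "pps_episode_id": "100_3_pps"},  # PPS-only row
]
flagged = add_duplicate_flags(rows)
for r in flagged:
    print(r["duplicated_hip_episode_id"], r["duplicated_pps_episode_id"])
```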

Stage 3: Deduplication loop

When many HIP and PPS episodes overlap, the overlap join can produce many-to-many matches: one HIP episode paired with several PPS episodes, and/or one PPS episode paired with several HIP episodes. The deduplication step keeps one best match per HIP episode and per PPS episode (and keeps one-to-one and one-sided rows unchanged).

Split: non-duplicates vs duplicates

  • baseKeep: Rows where neither episode is duplicated:
    (duplicated_hip_episode_id == 0 & duplicated_pps_episode_id == 0) or one-sided (one of the dup flags is NA).
    These are left unchanged (one-to-one matches and HIP-only or PPS-only episodes).

  • dupDf: Rows where at least one side is duplicated and the row is an overlap (both HIP and PPS present):
    (duplicated_hip_episode_id == 1 & pps_episode_id present) or (duplicated_pps_episode_id == 1 & hip_episode_id present).
    Only these rows go through the “pick best” logic.
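The split can be sketched in Python as below. This is an illustration rather than the package's R code: the function name split_rows is hypothetical, and None stands in for NA.

```python
def split_rows(rows):
    """Split joined rows into base_keep (left unchanged) and dup_df (to resolve)."""
    base_keep, dup_df = [], []
    for r in rows:
        dup_hip = r.get("duplicated_hip_episode_id")
        dup_pps = r.get("duplicated_pps_episode_id")
        has_hip = r.get("hip_episode_id") is not None
        has_pps = r.get("pps_episode_id") is not None
        # Overlap rows where at least one side is duplicated.
        if (dup_hip == 1 and has_pps) or (dup_pps == 1 and has_hip):
            dup_df.append(r)
        # One-to-one matches and one-sided rows.
        elif (dup_hip == 0 and dup_pps == 0) or dup_hip is None or dup_pps is None:
            base_keep.append(r)
    return base_keep, dup_df

rows = [
    {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_1_pps",
     "duplicated_hip_episode_id": 1, "duplicated_pps_episode_id": 0},
    {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_2_pps",
     "duplicated_hip_episode_id": 1, "duplicated_pps_episode_id": 0},
    {"hip_episode_id": "300_1_hip", "pps_episode_id": "300_1_pps",
     "duplicated_hip_episode_id": 0, "duplicated_pps_episode_id": 0},
    {"hip_episode_id": None, "pps_episode_id": "100_3_pps",
     "duplicated_hip_episode_id": None, "duplicated_pps_episode_id": 0},
]
base_keep, dup_df = split_rows(rows)
print(len(base_keep), len(dup_df))  # 2 2
```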

Scoring overlap rows

For each candidate row in dupDf, the algorithm computes:

  1. date_diff = |pregnancy_end - pps_episode_max_date| (days).
    Smaller values mean HIP and PPS agree on episode end; the best match minimizes this.

  2. Missing PPS outcome (when choosing by HIP)
    When selecting the best PPS match for a duplicated HIP episode, rows where pps_outcome_category is missing are penalized: date_diff is set to a large value (10000) so they are chosen only if no other match exists.

  3. pps_days = |pps_episode_max_date - pps_episode_min_date| (PPS episode length in days).
    Used for tie-breaking: PPS episodes longer than 310 days are treated as implausible and get pps_days = -1, so they lose ties. Among plausible episodes, longer PPS duration is preferred (larger pps_days wins).
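A small Python sketch of the scoring rules, for illustration only (the package is R). The function name score_row and its penalize_missing_outcome argument are hypothetical; the constants 10000 and 310 are the values stated above, and the dates are those of the Person 100 example in the deduplication example section.

```python
from datetime import date

MISSING_OUTCOME_PENALTY = 10000   # date_diff for rows with no PPS outcome
MAX_PLAUSIBLE_PPS_DAYS = 310      # longer PPS episodes get pps_days = -1

def score_row(row, penalize_missing_outcome):
    """Return (date_diff, pps_days) for one HIP-PPS overlap row."""
    date_diff = abs((row["pregnancy_end"] - row["pps_episode_max_date"]).days)
    if penalize_missing_outcome and row.get("pps_outcome_category") is None:
        date_diff = MISSING_OUTCOME_PENALTY
    pps_days = abs((row["pps_episode_max_date"] - row["pps_episode_min_date"]).days)
    if pps_days > MAX_PLAUSIBLE_PPS_DAYS:
        pps_days = -1
    return date_diff, pps_days

hip1_pps1 = {"pregnancy_end": date(2020, 9, 15),
             "pps_episode_min_date": date(2020, 2, 1),
             "pps_episode_max_date": date(2020, 9, 10),
             "pps_outcome_category": "LB"}   # outcome value is hypothetical
hip1_pps2 = {"pregnancy_end": date(2020, 9, 15),
             "pps_episode_min_date": date(2020, 4, 1),
             "pps_episode_max_date": date(2020, 9, 20),
             "pps_outcome_category": None}

print(score_row(hip1_pps1, penalize_missing_outcome=True))   # (5, 222)
print(score_row(hip1_pps2, penalize_missing_outcome=False))  # (5, 172)
print(score_row(hip1_pps2, penalize_missing_outcome=True))   # (10000, 172)
```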

Selection rule (“pick best”)

For a set of rows that share the same duplicated identifier (e.g. same hip_episode_id):

  1. Keep rows with smallest date_diff.
  2. Among ties, keep rows with largest pps_days (so plausible, longer PPS episodes win).

This is implemented as: slice_min(date_diff, n = 1) then slice_max(pps_days, n = 1). The first round uses with_ties = TRUE (keep all ties); later rounds use with_ties = FALSE for a deterministic single winner.
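The selection rule can be mimicked in plain Python as below. This is a sketch, not the package's dplyr code. How with_ties = FALSE is applied across the two slices is not spelled out above, so the sketch assumes it only truncates the final result to the first remaining row; pick_best is a hypothetical name and the rows carry precomputed scores.

```python
def pick_best(rows, with_ties):
    """Keep the smallest date_diff, then the largest pps_days among those."""
    best_diff = min(r["date_diff"] for r in rows)
    closest = [r for r in rows if r["date_diff"] == best_diff]
    best_days = max(r["pps_days"] for r in closest)
    winners = [r for r in closest if r["pps_days"] == best_days]
    # Assumption: without ties, the first remaining row is the single winner.
    return winners if with_ties else winners[:1]

candidates = [
    {"pair": "HIP1-PPS1", "date_diff": 5, "pps_days": 222},
    {"pair": "HIP1-PPS2", "date_diff": 5, "pps_days": 172},
    {"pair": "HIP1-PPS3", "date_diff": 9, "pps_days": 300},
]
print([r["pair"] for r in pick_best(candidates, with_ties=True)])  # ['HIP1-PPS1']
```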

Iterative loop (why multiple rounds?)

After picking the best match by HIP (hip_episode_id), some pps_episode_id values can still appear in more than one row. After picking the best by PPS (pps_episode_id), some hip_episode_id values can again appear more than once. So one round is not enough.

The algorithm:

  1. Initial pass: From dupDf, pick best by hip_episode_id (with missing-outcome penalty) and best by pps_episode_id (no penalty), combine them, then recompute duplicated_hip_episode_id and duplicated_pps_episode_id on this set.
  2. Up to 10 rounds: Again pick best by hip_episode_id among rows still with duplicated_hip_episode_id == 1, and best by pps_episode_id among rows still with duplicated_pps_episode_id == 1; recombine and recompute dup flags. After each round, the rows that are no longer duplicated (dup flags 0 or NA) are retained for the final table.
  3. Final set: baseKeep (unchanged) plus all rows that are non-duplicated after the last round, then distinct().

So the loop repeatedly reduces duplicate HIP and PPS ids until no overlap row is duplicated on either side, then merges those resolved rows with the non-duplicate rows.
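One possible reading of the loop, sketched in Python for illustration (the package itself is R and may differ in detail). The sketch assumes that each pick only considers rows still flagged as duplicated on that side, that the HIP-side penalty also applies in later rounds, and that "combine" means a union of the two picks with exact duplicates removed. All function names are hypothetical.

```python
from collections import Counter
from datetime import date

PENALTY, MAX_PPS_DAYS, MAX_ROUNDS = 10000, 310, 10

def score(row, penalize):
    diff = abs((row["pregnancy_end"] - row["pps_episode_max_date"]).days)
    if penalize and row.get("pps_outcome_category") is None:
        diff = PENALTY
    days = abs((row["pps_episode_max_date"] - row["pps_episode_min_date"]).days)
    return diff, (-1 if days > MAX_PPS_DAYS else days)

def pick_best_by(rows, key, penalize, with_ties):
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r)
    winners = []
    for group in groups.values():
        scored = [(score(r, penalize), r) for r in group]
        best_diff = min(s[0] for s, _ in scored)
        closest = [(s, r) for s, r in scored if s[0] == best_diff]
        best_days = max(s[1] for s, _ in closest)
        tied = [r for s, r in closest if s[1] == best_days]
        winners.extend(tied if with_ties else tied[:1])
    return winners

def flag(rows):
    hip = Counter(r["hip_episode_id"] for r in rows)
    pps = Counter(r["pps_episode_id"] for r in rows)
    return [{**r, "dup_hip": int(hip[r["hip_episode_id"]] > 1),
             "dup_pps": int(pps[r["pps_episode_id"]] > 1)} for r in rows]

def one_round(rows, with_ties):
    by_hip = pick_best_by([r for r in rows if r["dup_hip"] == 1],
                          "hip_episode_id", True, with_ties)
    by_pps = pick_best_by([r for r in rows if r["dup_pps"] == 1],
                          "pps_episode_id", False, with_ties)
    unique = {(r["hip_episode_id"], r["pps_episode_id"]): r for r in by_hip + by_pps}
    return flag(list(unique.values()))  # recompute dup flags on the combined set

def is_resolved(r):
    return r["dup_hip"] == 0 and r["dup_pps"] == 0

def deduplicate(dup_rows):
    current = one_round(flag(dup_rows), with_ties=True)   # initial pass
    resolved = [r for r in current if is_resolved(r)]
    for _ in range(MAX_ROUNDS):
        pending = [r for r in current if not is_resolved(r)]
        if not pending:
            break
        current = one_round(pending, with_ties=False)
        resolved += [r for r in current if is_resolved(r)]
    return resolved

p1 = {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_1_pps",
      "pregnancy_end": date(2020, 9, 15), "pps_episode_min_date": date(2020, 2, 1),
      "pps_episode_max_date": date(2020, 9, 10), "pps_outcome_category": "LB"}
p2 = {"hip_episode_id": "100_1_hip", "pps_episode_id": "100_2_pps",
      "pregnancy_end": date(2020, 9, 15), "pps_episode_min_date": date(2020, 4, 1),
      "pps_episode_max_date": date(2020, 9, 20), "pps_outcome_category": "LB"}
kept = deduplicate([p1, p2])
print([r["pps_episode_id"] for r in kept])  # ['100_1_pps']
```

The cap of 10 rounds guarantees termination in this sketch; any row still duplicated after the last round is not returned, matching the description of the final set.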

Deduplication example

Setup: One HIP episode overlaps two PPS episodes.

Person 100, HIP episode 1:  [2020-01-01 -------- 2020-09-15]  (pregnancy_end = 2020-09-15)
Person 100, PPS episode 1:     [2020-02-01 --- 2020-09-10]   (pps_episode_max_date = 2020-09-10)
Person 100, PPS episode 2:              [2020-04-01 --- 2020-09-20]

After the overlap join there are 2 rows (HIP 1–PPS 1 and HIP 1–PPS 2). Both have duplicated_hip_episode_id == 1 (same HIP in two rows).

Scoring:

| Row | pregnancy_end | pps_episode_max_date | date_diff | pps_days (e.g.) |
|---|---|---|---|---|
| HIP1–PPS1 | 2020-09-15 | 2020-09-10 | 5 | 222 |
| HIP1–PPS2 | 2020-09-15 | 2020-09-20 | 5 | 172 |

Pick best by hip_episode_id: Same date_diff; tie-break by pps_days. Row HIP1–PPS1 wins (222 > 172). So the merged table keeps one row for HIP episode 1: the pair (HIP1, PPS1).

Another case: One PPS episode overlaps two HIP episodes.

Person 200, HIP episode 1: [2020-03-01 -- 2020-10-01]
Person 200, HIP episode 2:        [2020-05-01 -- 2020-11-15]
Person 200, PPS episode 1:    [2020-04-01 -------- 2020-10-20]

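PPS 1 overlaps both HIP 1 and HIP 2, so the join yields 2 rows, both with duplicated_pps_episode_id == 1. The merge keeps the row with the smaller |pregnancy_end - pps_episode_max_date| (then the larger pps_days if tied), which selects the single best HIP for this PPS episode. If the PPS interval shown ends at pps_episode_max_date = 2020-10-20, date_diff is 19 days for HIP 1 (pregnancy_end = 2020-10-01) and 26 days for HIP 2 (pregnancy_end = 2020-11-15), so the pair (HIP1, PPS1) is kept.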


Post-deduplication: standardized columns

After deduplication, addMergedEpisodeDetails():

  • Fills PPS outcome when missing: if a row has pps_episode_id but pps_outcome_category is NA, set pps_outcome_category = "PREG" and set pps_outcome_date = pps_episode_max_date.
  • Renames to the final names used in the output: pregnancy_end → hip_end_date, pps_outcome_date → pps_end_date, merged_episode_start → merge_episode_start, merged_episode_end → merge_episode_end, merged_episode_length → merge_episode_length, pregnancy_start → merge_pregnancy_start, first_gest_date → merge_first_gest_date. HIP supplies hip_outcome_category; PPS outcome column remains pps_outcome_category.
  • Adds hip_flag (1 if the row has a HIP episode, 0 otherwise) and pps_flag (1 if it has a PPS episode, 0 otherwise).
  • Recomputes merged episode dates from the retained rows, then orders episodes within person by merge_episode_start and assigns merge_episode_number (1, 2, 3, … per person).
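The post-deduplication steps can be sketched in Python as follows. This is an illustration, not the body of addMergedEpisodeDetails(): the function name add_merged_episode_details is hypothetical, None stands in for NA, and the step that recomputes merged episode dates is omitted for brevity.

```python
from datetime import date

RENAMES = {
    "pregnancy_end": "hip_end_date",
    "pps_outcome_date": "pps_end_date",
    "merged_episode_start": "merge_episode_start",
    "merged_episode_end": "merge_episode_end",
    "merged_episode_length": "merge_episode_length",
    "pregnancy_start": "merge_pregnancy_start",
    "first_gest_date": "merge_first_gest_date",
}

def add_merged_episode_details(rows):
    out = []
    for r in rows:
        r = dict(r)
        # Fill the PPS outcome when the row has a PPS episode but no outcome.
        if r.get("pps_episode_id") is not None and r.get("pps_outcome_category") is None:
            r["pps_outcome_category"] = "PREG"
            r["pps_outcome_date"] = r["pps_episode_max_date"]
        r = {RENAMES.get(k, k): v for k, v in r.items()}
        r["hip_flag"] = int(r.get("hip_episode_id") is not None)
        r["pps_flag"] = int(r.get("pps_episode_id") is not None)
        out.append(r)
    # Order within person and number episodes 1, 2, 3, ...
    out.sort(key=lambda r: (r["person_id"], r["merge_episode_start"]))
    counters = {}
    for r in out:
        counters[r["person_id"]] = counters.get(r["person_id"], 0) + 1
        r["merge_episode_number"] = counters[r["person_id"]]
    return out

rows = [
    {"person_id": 1, "hip_episode_id": "1_2_hip", "pps_episode_id": None,
     "merged_episode_start": date(2021, 5, 1), "pregnancy_end": date(2022, 1, 10),
     "pps_outcome_category": None},
    {"person_id": 1, "hip_episode_id": None, "pps_episode_id": "1_1_pps",
     "merged_episode_start": date(2019, 3, 1), "pps_episode_max_date": date(2019, 11, 20),
     "pps_outcome_category": None, "pps_outcome_date": None},
]
final = add_merged_episode_details(rows)
for r in final:
    print(r["merge_episode_number"], r["hip_flag"], r["pps_flag"], r["pps_outcome_category"])
```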

Output file: hipps_episodes.rds

mergeHipps() writes a single file to outputFolder:

| File | Description |
|---|---|
| hipps_episodes.rds | One row per merged pregnancy episode per person; input to runEsd(). |

If there are no episodes, the same path is written with an empty tibble that has the correct column schema (see emptyHippsEpisodes()).

Output columns

The saved data frame includes the following. Columns listed as “always present” are in every run; “present when applicable” may be NA or omitted depending on whether the row came from HIP, PPS, or both.

Identifiers and ordering

| Column | Type | Description |
|---|---|---|
| person_id | integer | Person identifier. |
| merge_episode_number | integer | Within-person episode index (1, 2, 3, …) by merge_episode_start. |

Merged episode interval (always present)

| Column | Type | Description |
|---|---|---|
| merge_episode_start | Date | Start of the merged episode: min of merge_first_gest_date, pps_episode_min_date, merge_pregnancy_start. |
| merge_episode_end | Date | End of the merged episode: max of pps_episode_max_date, hip_end_date. |
| merge_episode_length | numeric | Length in months (days / 30.25). |

Algorithm-specific end dates (always present)

| Column | Type | Description |
|---|---|---|
| hip_end_date | Date | HIP episode end (hip_pregnancy_end). NA if row is PPS-only. |
| pps_end_date | Date | PPS outcome/end date (inferred or pps_episode_max_date). NA if row is HIP-only. |

Outcome categories (always present)

| Column | Type | Description |
|---|---|---|
| hip_outcome_category | character | Outcome category from HIP (e.g. LB, SB, PREG). NA if PPS-only. |
| pps_outcome_category | character | Outcome category from PPS; set to "PREG" when PPS has no inferred outcome. NA if HIP-only. |

Algorithm presence flags (always present)

| Column | Type | Description |
|---|---|---|
| hip_flag | integer | 1 if this row has a HIP episode, 0 otherwise. |
| pps_flag | integer | 1 if this row has a PPS episode, 0 otherwise. |

Optional / when applicable

These columns may be included when present in the merged data (e.g. for ESD or debugging). They can be NA for one-sided episodes.

| Column | Type | Description |
|---|---|---|
| merge_pregnancy_start | Date | HIP hip_pregnancy_start (standardized name). |
| merge_first_gest_date | Date | First gestation date in the HIP episode (from hip_first_gest_date). |
| pps_episode_min_date | Date | PPS episode start. |
| pps_episode_max_date | Date | PPS episode end (before 2-month extension). |
| pps_episode_max_date_plus_two_months | Date | PPS end + 2 months (used for overlap). |
| hip_episode | integer | HIP episode index. |
| pps_episode_number | integer | PPS episode index. |
| hip_episode_id | character | HIP episode id: {person_id}_{hip_episode}_hip. |
| pps_episode_id | character | PPS episode id: {person_id}_{pps_episode_number}_pps. |

Downstream, runEsd() expects at least: person_id, merge_episode_number, merge_pregnancy_start, merge_episode_start, merge_episode_end. See the ESD vignette and the pipeline overview.
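A quick sanity check on the column requirement can be sketched as below. This is an illustrative Python helper (missing_esd_columns is a hypothetical name, not part of the package) that works on a list of column names however the table was loaded; the required names are those listed above.

```python
REQUIRED_FOR_ESD = (
    "person_id",
    "merge_episode_number",
    "merge_pregnancy_start",
    "merge_episode_start",
    "merge_episode_end",
)

def missing_esd_columns(columns):
    """Return the required columns that are absent, in the documented order."""
    present = set(columns)
    return [c for c in REQUIRED_FOR_ESD if c not in present]

columns = ["person_id", "merge_episode_number", "merge_episode_start",
           "merge_episode_end", "hip_end_date", "pps_end_date"]
print(missing_esd_columns(columns))  # ['merge_pregnancy_start']
```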