The merge step combines pregnancy episodes from
HIP (outcome-anchored) and PPS
(timing-anchored) into a single HIPPS episode table. It is run via
mergeHipps() and writes hipps_episodes.rds,
which is the input to the ESD step. See the pipeline overview, HIP,
PPS, and ESD
vignettes.
What the merge does
The merge takes two episode tables—one from HIP and one from PPS—and produces one table with one row per pregnancy episode per person. For each person:
- Overlap join: HIP and PPS episodes are matched within person by temporal overlap of their intervals. Episodes that overlap become candidate merged rows; episodes that do not overlap any other are kept as one-sided rows (HIP-only or PPS-only).
- Many-to-many: One HIP episode can overlap several PPS episodes, and one PPS episode can overlap several HIP episodes. To resolve this, the merge deduplicates overlapping episodes by selecting a single best HIP–PPS pair for each episode identifier. An episode identifier is a unique combination of person, within-person episode number, and algorithm source: for HIP episodes it is hip_episode_id = {person_id}_{hip_episode}_hip, and for PPS episodes it is pps_episode_id = {person_id}_{pps_episode_number}_pps. The merged table keeps at most one matched pair per identifier, so each episode is represented only once in the output.
- Final table: Merged dates and outcome fields are standardized, and episodes are ordered within person. The result is written to hipps_episodes.rds.
No rows are dropped except duplicate overlap candidates; one-sided episodes and one-to-one matches are preserved.
Point of the merge. The merge serves three purposes for the rest of the pipeline:
- Define which episodes exist — One row per episode after overlap and deduplication. The set of episodes (and their identity) is fixed here.
- Put HIP and PPS on the same row — Each row carries both hip_end_date / hip_outcome_category and pps_end_date / pps_outcome_category (when present). The next step, ESD, does not do another overlap-merge of HIP and PPS; it takes this single row and picks one end date and one outcome from the two values already on the row using harmonization rules.
- Supply the evidence window — merge_episode_start and merge_episode_end define the interval ESD uses to pull gestational timing concepts (GW/GR3m) from the CDM for inferring pregnancy start. The final episode end date and outcome in the pipeline are not taken from this merged interval—they are chosen by ESD from hip_end_date and pps_end_date (and the two outcomes). See the ESD vignette for how ESD chooses final start, end, and outcome.
Inputs
mergeHipps(outputFolder, logger) reads from
outputFolder:
| File | Source | Main columns |
|---|---|---|
| hip_episodes.rds | runHip() | person_id, hip_episode, hip_pregnancy_start, hip_pregnancy_end, hip_first_gest_date, hip_outcome_category, etc. |
| pps_episodes.rds | runPps() | person_id, pps_episode_number, pps_episode_min_date, pps_episode_max_date, pps_episode_max_date_plus_two_months, outcome columns, etc. |
Stage 1: Standardized episode intervals and IDs
HIP and PPS use different date columns. The merge standardizes them to common interval names and adds stable identifiers.
HIP (algorithm 1)
- pregnancy_start = hip_pregnancy_start (internal name); hip_episode, hip_outcome_category, etc. are kept in the output
- pregnancy_end = hip_pregnancy_end (HIP
episode end)
- first_gest_date = hip_first_gest_date (first
gestation date in the episode)
- hip_episode_id =
{person_id}_{hip_episode}_hip
PPS (algorithm 2)
- pps_episode_min_date = start of PPS episode
evidence
- pps_episode_max_date_plus_two_months = PPS episode end
extended by two months (allows delayed outcome capture)
- pps_episode_id =
{person_id}_{pps_episode_number}_pps
Overlap is computed on:
- HIP: [pregnancy_start, pregnancy_end]
- PPS: [pps_episode_min_date, pps_episode_max_date_plus_two_months]
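The standardization above can be sketched in base R. This is a minimal illustration on toy data, not the package's code: standardizeHip() and standardizePps() are hypothetical helper names, and only the column names and ID formats are taken from this section.

```r
# Toy inputs; column names follow the input tables described in this vignette.
hip <- data.frame(
  person_id = c(100L, 100L),
  hip_episode = c(1L, 2L),
  hip_pregnancy_start = as.Date(c("2020-01-01", "2021-03-01")),
  hip_pregnancy_end = as.Date(c("2020-09-15", "2021-11-20")),
  hip_first_gest_date = as.Date(c("2020-02-10", "2021-04-05"))
)
pps <- data.frame(
  person_id = 100L,
  pps_episode_number = 1L,
  pps_episode_min_date = as.Date("2020-02-01"),
  pps_episode_max_date = as.Date("2020-09-10"),
  pps_episode_max_date_plus_two_months = as.Date("2020-11-10")
)

# Hypothetical helper: rename HIP date columns and build the HIP identifier.
standardizeHip <- function(hip) {
  data.frame(
    person_id = hip$person_id,
    hip_episode = hip$hip_episode,
    pregnancy_start = hip$hip_pregnancy_start,
    pregnancy_end = hip$hip_pregnancy_end,
    first_gest_date = hip$hip_first_gest_date,
    hip_episode_id = paste(hip$person_id, hip$hip_episode, "hip", sep = "_"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: PPS date columns keep their names; only the ID is added.
standardizePps <- function(pps) {
  pps$pps_episode_id <- paste(pps$person_id, pps$pps_episode_number, "pps", sep = "_")
  pps
}

hipStd <- standardizeHip(hip)
ppsStd <- standardizePps(pps)
```
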
Stage 2: Merge by overlap (full join)
Episodes are merged within person using a full join on overlap:
- Overlap condition: pregnancy_start <= pps_episode_max_date_plus_two_months and pregnancy_end >= pps_episode_min_date
So two episodes match if their intervals intersect (including containment).
HIP: [pregnancy_start ---------------- pregnancy_end]
PPS: [pps_episode_min_date -------- pps_episode_max_date_plus_two_months]
<-------- overlap ------->
- Rows that match: one merged row per (HIP episode, PPS episode) pair.
- Rows that do not match: HIP-only or PPS-only rows (one side is NA) are kept via the full join.
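A full join on interval overlap can be sketched in base R as follows. This is an illustrative approach under stated assumptions, not the package's implementation: overlapFullJoin() is a hypothetical name, and the sketch builds all within-person pairs, filters them by the overlap condition, then appends the unmatched episodes from each side with NA for the other side's columns.

```r
hip <- data.frame(
  person_id = c(1L, 1L, 2L),
  hip_episode_id = c("1_1_hip", "1_2_hip", "2_1_hip"),
  pregnancy_start = as.Date(c("2020-01-01", "2021-06-01", "2020-01-01")),
  pregnancy_end = as.Date(c("2020-09-15", "2022-02-01", "2020-08-01")),
  stringsAsFactors = FALSE
)
pps <- data.frame(
  person_id = c(1L, 3L),
  pps_episode_id = c("1_1_pps", "3_1_pps"),
  pps_episode_min_date = as.Date(c("2020-02-01", "2020-05-01")),
  pps_episode_max_date_plus_two_months = as.Date(c("2020-11-10", "2021-01-01")),
  stringsAsFactors = FALSE
)

# Hypothetical helper implementing the overlap condition from this section.
overlapFullJoin <- function(hip, pps) {
  # All within-person HIP x PPS pairs.
  pairs <- merge(hip, pps, by = "person_id")
  overlaps <- pairs$pregnancy_start <= pairs$pps_episode_max_date_plus_two_months &
    pairs$pregnancy_end >= pairs$pps_episode_min_date
  matched <- pairs[overlaps, , drop = FALSE]

  # Add the other side's columns as typed NAs so rbind() keeps Date/character types.
  padToMatched <- function(df) {
    for (col in setdiff(names(matched), names(df))) {
      df[[col]] <- matched[[col]][rep(NA_integer_, nrow(df))]
    }
    df[, names(matched), drop = FALSE]
  }

  hipOnly <- hip[!hip$hip_episode_id %in% matched$hip_episode_id, , drop = FALSE]
  ppsOnly <- pps[!pps$pps_episode_id %in% matched$pps_episode_id, , drop = FALSE]
  out <- rbind(matched, padToMatched(hipOnly), padToMatched(ppsOnly))
  rownames(out) <- NULL
  out
}

joined <- overlapFullJoin(hip, pps)
```

In this toy data, person 1 contributes one matched row and one HIP-only row, person 2 a HIP-only row, and person 3 a PPS-only row.
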
For each row (matched or one-sided), the merge computes:
- merged_episode_start = min(first_gest_date, pps_episode_min_date, pregnancy_start)
- merged_episode_end = max(pps_episode_max_date, pregnancy_end)
- merged_episode_length = (merged_episode_end − merged_episode_start) in months (days / 30.25)
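The three derived fields can be computed row-wise with pmin() and pmax(). The sketch below is a minimal illustration on toy data; ignoring NA on one-sided rows (na.rm = TRUE) is an assumption made here so that HIP-only and PPS-only rows still get an interval.

```r
rows <- data.frame(
  first_gest_date = as.Date(c("2020-02-10", NA)),
  pregnancy_start = as.Date(c("2020-01-01", NA)),
  pregnancy_end = as.Date(c("2020-09-15", NA)),
  pps_episode_min_date = as.Date(c("2020-02-01", "2021-03-01")),
  pps_episode_max_date = as.Date(c("2020-09-10", "2021-10-01"))
)

# Row-wise earliest start and latest end; NA handling is an assumption.
rows$merged_episode_start <- pmin(
  rows$first_gest_date, rows$pps_episode_min_date, rows$pregnancy_start,
  na.rm = TRUE
)
rows$merged_episode_end <- pmax(
  rows$pps_episode_max_date, rows$pregnancy_end,
  na.rm = TRUE
)

# Length in months, using the days / 30.25 convention from this section.
rows$merged_episode_length <-
  as.numeric(rows$merged_episode_end - rows$merged_episode_start) / 30.25
```

The first row is a matched pair (258 days between start and end); the second is a PPS-only row whose interval comes from the PPS dates alone.
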
Duplicate flags (used in Stage 3):
- duplicated_hip_episode_id = 1 if that hip_episode_id appears in more than one row (HIP episode matched to multiple PPS).
- duplicated_pps_episode_id = 1 if that pps_episode_id appears in more than one row (PPS episode matched to multiple HIP).
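A short base R sketch of the duplicate flags follows. addDupFlags() is a hypothetical helper name; the flag is set to NA when the identifier itself is missing, matching the description of one-sided rows in Stage 3.

```r
merged <- data.frame(
  hip_episode_id = c("1_1_hip", "1_1_hip", "1_2_hip", NA),
  pps_episode_id = c("1_1_pps", "1_2_pps", NA, "3_1_pps"),
  stringsAsFactors = FALSE
)

# 1 if the id occurs in more than one row, 0 if it occurs once, NA if the id is missing.
flagDup <- function(id) {
  flag <- as.integer(id %in% id[duplicated(id)])
  flag[is.na(id)] <- NA_integer_
  flag
}

# Hypothetical helper that adds both flags to the joined table.
addDupFlags <- function(df) {
  df$duplicated_hip_episode_id <- flagDup(df$hip_episode_id)
  df$duplicated_pps_episode_id <- flagDup(df$pps_episode_id)
  df
}

flagged <- addDupFlags(merged)
```
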
Stage 3: Deduplication loop
When many HIP and PPS episodes overlap, the overlap join can produce many-to-many matches: one HIP episode paired with several PPS episodes, and/or one PPS episode paired with several HIP episodes. The deduplication step keeps one best match per HIP episode and per PPS episode (and keeps one-to-one and one-sided rows unchanged).
Split: non-duplicates vs duplicates
- baseKeep: Rows where neither episode is duplicated: (duplicated_hip_episode_id == 0 & duplicated_pps_episode_id == 0) or one-sided (one of the dup flags is NA). These are left unchanged (one-to-one matches and HIP-only or PPS-only episodes).
- dupDf: Rows where at least one side is duplicated and the row is an overlap (both HIP and PPS present): (duplicated_hip_episode_id == 1 & pps_episode_id present) or (duplicated_pps_episode_id == 1 & hip_episode_id present). Only these rows go through the “pick best” logic.
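The split can be expressed as a single logical mask. The sketch below assumes that baseKeep is exactly the complement of dupDf, which holds when one-sided rows never carry a duplicate flag of 1.

```r
merged <- data.frame(
  hip_episode_id = c("1_1_hip", "1_1_hip", "1_2_hip", NA),
  pps_episode_id = c("1_1_pps", "1_2_pps", NA, "3_1_pps"),
  duplicated_hip_episode_id = c(1L, 1L, 0L, NA),
  duplicated_pps_episode_id = c(0L, 0L, NA, 0L),
  stringsAsFactors = FALSE
)

# %in% treats NA flags as "not duplicated", so one-sided rows fall into baseKeep.
isDup <-
  (merged$duplicated_hip_episode_id %in% 1 & !is.na(merged$pps_episode_id)) |
  (merged$duplicated_pps_episode_id %in% 1 & !is.na(merged$hip_episode_id))

dupDf <- merged[isDup, , drop = FALSE]      # goes through "pick best"
baseKeep <- merged[!isDup, , drop = FALSE]  # kept unchanged
```
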
Scoring overlap rows
For each candidate row in dupDf, the algorithm
computes:
- date_diff = |pregnancy_end − pps_episode_max_date| (days). Smaller values mean HIP and PPS agree on episode end; the best match minimizes this.
- Missing PPS outcome (when choosing by HIP): When selecting the best PPS match for a duplicated HIP episode, rows where pps_outcome_category is missing are penalized: date_diff is set to a large value (10000) so they are chosen only if no other match exists.
- pps_days = |pps_episode_max_date − pps_episode_min_date| (PPS episode length in days). Used for tie-breaking: PPS episodes longer than 310 days are treated as implausible and get pps_days = -1, so they lose ties. Among plausible episodes, longer PPS duration is preferred (larger pps_days wins).
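The scoring rules can be sketched as one small function. scoreCandidates() and its penalizeMissingOutcome argument are hypothetical names; the constants 10000 and 310 are the ones stated above.

```r
dupDf <- data.frame(
  pregnancy_end = as.Date(c("2020-09-15", "2020-09-15", "2020-09-15")),
  pps_episode_min_date = as.Date(c("2020-02-01", "2020-04-01", "2019-09-01")),
  pps_episode_max_date = as.Date(c("2020-09-10", "2020-09-20", "2020-09-14")),
  pps_outcome_category = c("LB", NA, "LB"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: adds date_diff and pps_days to the candidate rows.
scoreCandidates <- function(dupDf, penalizeMissingOutcome = FALSE) {
  dupDf$date_diff <- abs(as.numeric(dupDf$pregnancy_end - dupDf$pps_episode_max_date))
  if (penalizeMissingOutcome) {
    # Used when choosing by HIP: rows without a PPS outcome are a last resort.
    dupDf$date_diff[is.na(dupDf$pps_outcome_category)] <- 10000
  }
  ppsDays <- abs(as.numeric(dupDf$pps_episode_max_date - dupDf$pps_episode_min_date))
  # PPS episodes longer than 310 days are treated as implausible and lose ties.
  dupDf$pps_days <- ifelse(ppsDays > 310, -1, ppsDays)
  dupDf
}

scoredByPps <- scoreCandidates(dupDf)
scoredByHip <- scoreCandidates(dupDf, penalizeMissingOutcome = TRUE)
```

The third toy row spans 379 days, so it receives pps_days = -1 even though its date_diff is the smallest.
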
Selection rule (“pick best”)
For a set of rows that share the same duplicated identifier
(e.g. same hip_episode_id):
- Keep rows with smallest date_diff.
- Among ties, keep rows with largest pps_days (so plausible, longer PPS episodes win).
This is implemented as: slice_min(date_diff, n = 1) then
slice_max(pps_days, n = 1). The first round uses
with_ties = TRUE (keep all ties); later rounds use
with_ties = FALSE for a deterministic single winner.
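A grouped version of this selection rule might look like the sketch below, which requires the dplyr package. pickBest() is a hypothetical wrapper; how with_ties is distributed over the two calls in later rounds is not spelled out here, so the sketch simply passes the same value to both.

```r
library(dplyr)

# Hypothetical wrapper around the slice_min() / slice_max() rule.
pickBest <- function(df, idCol, withTies = TRUE) {
  df %>%
    group_by(across(all_of(idCol))) %>%
    slice_min(date_diff, n = 1, with_ties = withTies) %>%
    slice_max(pps_days, n = 1, with_ties = withTies) %>%
    ungroup()
}

# One HIP episode matched to two PPS episodes with equal date_diff.
candidates <- data.frame(
  hip_episode_id = c("100_1_hip", "100_1_hip"),
  pps_episode_id = c("100_1_pps", "100_2_pps"),
  date_diff = c(5, 5),
  pps_days = c(222, 172),
  stringsAsFactors = FALSE
)

best <- pickBest(candidates, "hip_episode_id", withTies = TRUE)
```

With with_ties = TRUE, both rows survive slice_min() and the longer PPS episode is then chosen by slice_max().
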
Iterative loop (why multiple rounds?)
After picking the best match by HIP
(hip_episode_id), some pps_episode_id values
can still appear in more than one row. After picking the best by
PPS (pps_episode_id), some
hip_episode_id values can again appear more than once. So
one round is not enough.
The algorithm:
- Initial pass: From dupDf, pick best by hip_episode_id (with missing-outcome penalty) and best by pps_episode_id (no penalty), combine them, then recompute duplicated_hip_episode_id and duplicated_pps_episode_id on this set.
- Up to 10 rounds: Again pick best by hip_episode_id among rows still with duplicated_hip_episode_id == 1, and best by pps_episode_id among rows still with duplicated_pps_episode_id == 1; recombine and recompute dup flags. After each round, the rows that are no longer duplicated (dup flags 0 or NA) are retained for the final table.
- Final set: baseKeep (unchanged) plus all rows that are non-duplicated after the last round, then distinct().
So the loop repeatedly reduces duplicate HIP and PPS ids until no overlap row is duplicated on either side, then merges those resolved rows with the non-duplicate rows.
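One possible reading of this loop is sketched below in dependency-free base R. It is not the package's implementation: all function names are hypothetical, "combine" is interpreted as a row union followed by unique(), and the missing-outcome penalty is assumed to apply every time the pick is made by HIP.

```r
flagDup <- function(id) as.integer(id %in% id[duplicated(id)])

recomputeFlags <- function(df) {
  df$duplicated_hip_episode_id <- flagDup(df$hip_episode_id)
  df$duplicated_pps_episode_id <- flagDup(df$pps_episode_id)
  df
}

# One row per id: smallest date_diff, ties broken by largest pps_days.
pickBest <- function(df, idCol, penalizeMissingOutcome = FALSE) {
  score <- df$date_diff
  if (penalizeMissingOutcome) score[is.na(df$pps_outcome_category)] <- 10000
  df <- df[order(df[[idCol]], score, -df$pps_days), , drop = FALSE]
  df[!duplicated(df[[idCol]]), , drop = FALSE]
}

resolveDuplicates <- function(dupDf, maxRounds = 10) {
  # Initial pass: best by HIP (penalized) plus best by PPS, then re-flag.
  current <- recomputeFlags(unique(rbind(
    pickBest(dupDf, "hip_episode_id", penalizeMissingOutcome = TRUE),
    pickBest(dupDf, "pps_episode_id")
  )))
  for (round in seq_len(maxRounds)) {
    hipDup <- current$duplicated_hip_episode_id == 1
    ppsDup <- current$duplicated_pps_episode_id == 1
    if (!any(hipDup | ppsDup)) break
    # Keep resolved rows; re-pick among rows that are still duplicated.
    current <- recomputeFlags(unique(rbind(
      current[!(hipDup | ppsDup), , drop = FALSE],
      pickBest(current[hipDup, , drop = FALSE], "hip_episode_id",
               penalizeMissingOutcome = TRUE),
      pickBest(current[ppsDup, , drop = FALSE], "pps_episode_id")
    )))
  }
  keep <- current$duplicated_hip_episode_id == 0 &
    current$duplicated_pps_episode_id == 0
  current[keep, , drop = FALSE]
}

# Chain of overlaps: HIP 1 matches PPS 1 and PPS 2; PPS 2 also matches HIP 2.
chain <- data.frame(
  hip_episode_id = c("1_1_hip", "1_1_hip", "1_2_hip"),
  pps_episode_id = c("1_1_pps", "1_2_pps", "1_2_pps"),
  date_diff = c(10, 2, 7),
  pps_days = c(200, 180, 180),
  pps_outcome_category = c("LB", "LB", "LB"),
  stringsAsFactors = FALSE
)
resolved <- resolveDuplicates(chain)
```

In this toy chain the first pass leaves all three rows in play, because the best-by-HIP and best-by-PPS picks disagree; the second pass settles on the single pair (HIP 1, PPS 2), which illustrates why one round is not enough.
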
Deduplication example
Setup: One HIP episode overlaps two PPS episodes.
Person 100, HIP episode 1: [2020-01-01 -------- 2020-09-15] (pregnancy_end = 2020-09-15)
Person 100, PPS episode 1: [2020-02-01 --- 2020-09-10] (pps_episode_max_date = 2020-09-10)
Person 100, PPS episode 2: [2020-04-01 --- 2020-09-20]
After the overlap join there are 2 rows (HIP 1–PPS 1 and HIP 1–PPS
2). Both have duplicated_hip_episode_id == 1 (same HIP in
two rows).
Scoring:
| Row | pregnancy_end | pps_episode_max_date | date_diff | pps_days (e.g.) |
|---|---|---|---|---|
| HIP1–PPS1 | 2020-09-15 | 2020-09-10 | 5 | 222 |
| HIP1–PPS2 | 2020-09-15 | 2020-09-20 | 5 | 172 |
Pick best by hip_episode_id: Same
date_diff; tie-break by pps_days. Row
HIP1–PPS1 wins (222 > 172). So the merged table keeps
one row for HIP episode 1: the pair (HIP1, PPS1).
Another case: One PPS episode overlaps two HIP episodes.
Person 200, HIP episode 1: [2020-03-01 -- 2020-10-01]
Person 200, HIP episode 2: [2020-05-01 -- 2020-11-15]
Person 200, PPS episode 1: [2020-04-01 -------- 2020-10-20]
PPS 1 overlaps both HIP 1 and HIP 2. We have 2 rows; both have
duplicated_pps_episode_id == 1. We pick the row with
smaller |pregnancy_end − pps_episode_max_date|
(and then by pps_days if tied). That chooses the single
best HIP for this PPS episode.
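The arithmetic for Person 200 can be checked directly. The sketch assumes that the right end of the PPS interval shown above (2020-10-20) is pps_episode_max_date, as in the Person 100 example.

```r
candidates <- data.frame(
  hip_episode = c(1L, 2L),
  pregnancy_end = as.Date(c("2020-10-01", "2020-11-15")),
  pps_episode_min_date = as.Date("2020-04-01"),
  pps_episode_max_date = as.Date("2020-10-20")  # assumed to be the un-extended PPS end
)

candidates$date_diff <-
  abs(as.numeric(candidates$pregnancy_end - candidates$pps_episode_max_date))
candidates$pps_days <-
  as.numeric(candidates$pps_episode_max_date - candidates$pps_episode_min_date)

# Row with the smallest date_diff.
best <- candidates[which.min(candidates$date_diff), ]
```

Under that assumption, date_diff is 19 days for HIP 1 and 26 days for HIP 2, so HIP 1 is kept. Both rows share the same PPS episode and therefore the same pps_days, so the tie-breaker cannot separate candidates when selecting by pps_episode_id.
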
Post-deduplication: standardized columns
After deduplication, addMergedEpisodeDetails():
- Fills PPS outcome when missing: if a row has pps_episode_id but pps_outcome_category is NA, set pps_outcome_category = "PREG" and set pps_outcome_date = pps_episode_max_date.
- Renames to the final names used in the output: pregnancy_end → hip_end_date, pps_outcome_date → pps_end_date, merged_episode_start → merge_episode_start, merged_episode_end → merge_episode_end, merged_episode_length → merge_episode_length, pregnancy_start → merge_pregnancy_start, first_gest_date → merge_first_gest_date. HIP supplies hip_outcome_category; PPS outcome column remains pps_outcome_category.
- Adds hip_flag (1 if the row has a HIP episode, 0 otherwise) and pps_flag (1 if it has a PPS episode, 0 otherwise).
- Recomputes merged episode dates from the retained rows, then orders episodes within person by merge_episode_start and assigns merge_episode_number (1, 2, 3, … per person).
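The post-deduplication steps can be sketched in base R as below. sketchEpisodeDetails() is a hypothetical stand-in written for illustration, not the body of addMergedEpisodeDetails(); it omits the recomputation of merged episode dates and covers the outcome fill, flags, renames, and numbering only.

```r
sketchEpisodeDetails <- function(df) {
  # Fill PPS outcome and date when a PPS episode has no inferred outcome.
  noOutcome <- !is.na(df$pps_episode_id) & is.na(df$pps_outcome_category)
  df$pps_outcome_category[noOutcome] <- "PREG"
  df$pps_outcome_date[noOutcome] <- df$pps_episode_max_date[noOutcome]

  # Presence flags.
  df$hip_flag <- as.integer(!is.na(df$hip_episode_id))
  df$pps_flag <- as.integer(!is.na(df$pps_episode_id))

  # Rename to the final output names listed above.
  renameMap <- c(
    pregnancy_end = "hip_end_date",
    pps_outcome_date = "pps_end_date",
    merged_episode_start = "merge_episode_start",
    merged_episode_end = "merge_episode_end",
    merged_episode_length = "merge_episode_length",
    pregnancy_start = "merge_pregnancy_start",
    first_gest_date = "merge_first_gest_date"
  )
  hit <- names(df) %in% names(renameMap)
  names(df)[hit] <- unname(renameMap[names(df)[hit]])

  # Order within person and number the episodes 1, 2, 3, ...
  df <- df[order(df$person_id, df$merge_episode_start), , drop = FALSE]
  df$merge_episode_number <- ave(seq_len(nrow(df)), df$person_id, FUN = seq_along)
  rownames(df) <- NULL
  df
}

deduped <- data.frame(
  person_id = c(1L, 1L, 2L),
  hip_episode_id = c("1_2_hip", "1_1_hip", NA),
  pps_episode_id = c(NA, "1_1_pps", "2_1_pps"),
  pregnancy_end = as.Date(c("2022-01-10", "2020-09-15", NA)),
  pps_episode_max_date = as.Date(c(NA, "2020-09-10", "2021-05-01")),
  pps_outcome_category = c(NA, "LB", NA),
  pps_outcome_date = as.Date(c(NA, "2020-09-12", NA)),
  merged_episode_start = as.Date(c("2021-04-01", "2020-01-01", "2020-08-15")),
  stringsAsFactors = FALSE
)
episodes <- sketchEpisodeDetails(deduped)
```
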
Output file: hipps_episodes.rds
mergeHipps() writes a single file to
outputFolder:
| File | Description |
|---|---|
| hipps_episodes.rds | One row per merged pregnancy episode per person; input to runEsd(). |
If there are no episodes, the same path is written with an empty
tibble that has the correct column schema (see
emptyHippsEpisodes()).
Output columns
The saved data frame includes the following. Columns listed as “always present” are in every run; “present when applicable” may be NA or omitted depending on whether the row came from HIP, PPS, or both.
Identifiers and ordering
| Column | Type | Description |
|---|---|---|
| person_id | integer | Person identifier. |
| merge_episode_number | integer | Within-person episode index (1, 2, 3, …) by merge_episode_start. |
Merged episode interval (always present)
| Column | Type | Description |
|---|---|---|
| merge_episode_start | Date | Start of the merged episode: min of merge_first_gest_date, pps_episode_min_date, merge_pregnancy_start. |
| merge_episode_end | Date | End of the merged episode: max of pps_episode_max_date, hip_end_date. |
| merge_episode_length | numeric | Length in months (days / 30.25). |
Algorithm-specific end dates (always present)
| Column | Type | Description |
|---|---|---|
| hip_end_date | Date | HIP episode end (hip_pregnancy_end). NA if row is PPS-only. |
| pps_end_date | Date | PPS outcome/end date (inferred or pps_episode_max_date). NA if row is HIP-only. |
Outcome categories (always present)
| Column | Type | Description |
|---|---|---|
| hip_outcome_category | character | Outcome category from HIP (e.g. LB, SB, PREG). NA if PPS-only. |
| pps_outcome_category | character | Outcome category from PPS; set to "PREG" when PPS has no inferred outcome. NA if HIP-only. |
Algorithm presence flags (always present)
| Column | Type | Description |
|---|---|---|
| hip_flag | integer | 1 if this row has a HIP episode, 0 otherwise. |
| pps_flag | integer | 1 if this row has a PPS episode, 0 otherwise. |
Optional / when applicable
These columns may be included when present in the merged data (e.g. for ESD or debugging). They can be NA for one-sided episodes.
| Column | Type | Description |
|---|---|---|
| merge_pregnancy_start | Date | HIP hip_pregnancy_start (standardized name). |
| merge_first_gest_date | Date | First gestation date in the HIP episode (from hip_first_gest_date). |
| pps_episode_min_date | Date | PPS episode start. |
| pps_episode_max_date | Date | PPS episode end (before 2-month extension). |
| pps_episode_max_date_plus_two_months | Date | PPS end + 2 months (used for overlap). |
| hip_episode | integer | HIP episode index. |
| pps_episode_number | integer | PPS episode index. |
| hip_episode_id | character | HIP episode id: {person_id}_{hip_episode}_hip. |
| pps_episode_id | character | PPS episode id: {person_id}_{pps_episode_number}_pps. |
Downstream, runEsd() expects at least:
person_id, merge_episode_number,
merge_pregnancy_start, merge_episode_start,
merge_episode_end. See the ESD
vignette and the pipeline overview.
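Before handing the file to runEsd(), a quick column check can catch schema problems early. The sketch below is illustrative: missingEsdColumns() is a hypothetical helper, and a temporary folder with a toy table stands in for a real outputFolder.

```r
# Columns that runEsd() is stated to expect, as listed above.
requiredEsdColumns <- c(
  "person_id", "merge_episode_number", "merge_pregnancy_start",
  "merge_episode_start", "merge_episode_end"
)

# Hypothetical helper: returns the required columns that are absent.
missingEsdColumns <- function(episodes) {
  setdiff(requiredEsdColumns, names(episodes))
}

# Toy stand-in for a real output folder.
outputFolder <- tempfile("hipps_demo_")
dir.create(outputFolder)
toy <- data.frame(
  person_id = 1L,
  merge_episode_number = 1L,
  merge_pregnancy_start = as.Date("2020-01-01"),
  merge_episode_start = as.Date("2020-01-01"),
  merge_episode_end = as.Date("2020-09-15")
)
saveRDS(toy, file.path(outputFolder, "hipps_episodes.rds"))

episodes <- readRDS(file.path(outputFolder, "hipps_episodes.rds"))
missingCols <- missingEsdColumns(episodes)
incompleteCols <- missingEsdColumns(episodes[, c("person_id", "merge_episode_start")])
```

An empty result from missingEsdColumns() means the minimum schema is present; it does not validate column types or values.
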
