Overview
The pipeline writes person-level and episode-level
data (e.g. final_pregnancy_episodes.rds) to
outputFolder. The export step turns those
results and related artifacts into shareable, aggregated CSV
files for QA, cross-site comparison, and input to the
Shiny app. Export runs automatically as part of
runPregnancyIdentifier() and writes to
exportFolder (default
file.path(outputFolder, "export")). Set minCellCount to
suppress small cell counts. To export separately, call
exportPregnancies() yourself and write to a directory of
your choice.
This vignette lists every exported CSV: file name, columns, meaning of each column, and how each file is used in analysis.
How to run export
Export runs automatically in runPregnancyIdentifier().
Shareable CSVs go to exportFolder, which defaults to
file.path(outputFolder, "export"). Override it with the
exportFolder argument:
library(PregnancyIdentifier)
cdm <- mockPregnancyCdm()
# Export runs by default; CSVs go to "out/export":
runPregnancyIdentifier(cdm, outputFolder = "out")
# Or write export CSVs to a custom folder:
runPregnancyIdentifier(cdm, outputFolder = "out", exportFolder = "path/to/my_export")
To run export alone (e.g. after a previous pipeline run), use
exportPregnancies() with the same outputFolder
and your chosen exportFolder. The same CSVs are produced;
exportFolder in runPregnancyIdentifier()
corresponds to exportFolder in
exportPregnancies().
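Once an export folder exists, its CSVs can be loaded for inspection with base R. The sketch below is illustrative only: read_export_folder() is a hypothetical helper (not part of PregnancyIdentifier), and the temporary folder with a single made-up file stands in for a real export.

```r
# Hypothetical helper (not a package function): read every CSV in an
# export folder into a named list, keyed by file name without extension.
read_export_folder <- function(exportFolder) {
  files <- list.files(exportFolder, pattern = "\\.csv$", full.names = TRUE)
  tables <- lapply(files, read.csv, stringsAsFactors = FALSE)
  names(tables) <- tools::file_path_sans_ext(basename(files))
  tables
}

# Temporary folder with one made-up file, standing in for a real export.
demo_folder <- file.path(tempdir(), "export_demo")
dir.create(demo_folder, showWarnings = FALSE)
write.csv(
  data.frame(total_episodes = 120, total_individuals = 100, cdm_name = "mock"),
  file.path(demo_folder, "episode_frequency.csv"),
  row.names = FALSE
)

exported <- read_export_folder(demo_folder)
names(exported)
```

With a real export, point read_export_folder() at the exportFolder used in the run.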
Exported files: metadata and copied artifacts
cdm_source.csv
Source: CDMConnector::snapshot(cdm) —
one row of CDM metadata.
Columns: Depend on the CDMConnector version;
typically include identifiers and timestamps for the CDM
(e.g. cdm_name, snapshot_date, and other
snapshot fields).
Purpose: Documents which CDM and snapshot the export came from. Used when comparing or aggregating exports across sites or runs so you can track provenance and CDM version.
pps_concept_counts.csv
Source: Copied from outputFolder
(written by runPps()).
Content: Per-concept counts of the pregnancy-related
concepts used by PPS (pps_concept_id,
pps_concept_name, record_count,
person_count).
Purpose: Describes the PPS concept usage in this run. Used to compare concept coverage across sites and to check that expected pregnancy concepts are present.
hip_concept_counts.csv
Source: Copied from outputFolder when
present (written by the pipeline when HIP concept counts are
produced).
Content: Counts of HIP pregnancy-related concepts per concept.
Purpose: Describes HIP concept usage in this run; used for concept coverage and cross-site comparison.
esd_concept_counts.csv
Source: Copied from outputFolder when
present (written by the pipeline when ESD concept counts are
produced).
Content: Counts of ESD timing concepts used per concept.
Purpose: Describes ESD concept usage; used for QA and cross-site comparison.
attrition.csv
Source: Copied from outputFolder when
present (written by pipeline steps such as HIP and merge).
Content: Step-by-step record and person counts (prior/post, dropped) through the pipeline.
Purpose: Audit trail of attrition at each step; used for debugging and cohort flow.
log.txt
Source: Copied from outputFolder
(pipeline log from makeLogger()).
Content: Plain-text log of pipeline steps (init, HIP, PPS, merge, ESD).
Purpose: Audit trail for the run. Used for debugging and to verify which steps ran and when.
Exported files: derived from final_pregnancy_episodes.rds
All of the following are computed from
final_pregnancy_episodes.rds and share common metadata
columns when present: cdm_name,
date_run (pipeline run timestamp),
date_export (export/snapshot date),
pkg_version (PregnancyIdentifier version). Counts in
the range (0, minCellCount) may be replaced with
NA for disclosure control.
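The suppression rule described above (counts strictly between 0 and minCellCount become NA) can be sketched in a few lines. This is a minimal illustration, not the package's implementation: suppress_small_counts() is a hypothetical name and the threshold of 5 is an arbitrary example value.

```r
# Hypothetical helper: replace counts in the open interval
# (0, minCellCount) with NA; zeros and larger counts are kept.
suppress_small_counts <- function(n, minCellCount) {
  n[!is.na(n) & n > 0 & n < minCellCount] <- NA
  n
}

counts <- c(0, 3, 5, 10)
suppressed <- suppress_small_counts(counts, minCellCount = 5)
suppressed
```

Under this rule a count equal to minCellCount is reported as-is, and a true zero stays zero.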
age_summary.csv
Columns: Summary of age_pregnancy_start (age in years at pregnancy start): final_outcome_category, n, min, Q25, median, Q75, max, mean, plus metadata.
Purpose: Describes the distribution of maternal age at pregnancy start. Used for cohort description, feasibility checks, and comparing age distributions across sites or time windows.
age_summary_groups.csv
Columns: colName,
age_pregnancy_start (value or group: integer year or
"<12", ">55"), n,
total, pct, plus metadata. Counts may
be suppressed by minCellCount.
Purpose: Counts and percentages of pregnancies by age (by year and by boundary groups <12 and >55). Used for age-stratified summaries and checking extreme-age pregnancy counts.
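A table of this shape can be approximated from a vector of ages as follows. The ages are made up, and the sketch assumes that pct is expressed on a 0 to 100 scale and that total is the number of episodes with a known age; the package may differ on both points.

```r
# Made-up ages at pregnancy start (years).
ages <- c(10, 11, 25, 25, 30, 56, 60)

# Collapse extreme ages into the boundary groups, keep other years as-is.
age_group <- ifelse(ages < 12, "<12",
                    ifelse(ages > 55, ">55", as.character(ages)))

age_counts <- as.data.frame(
  table(age_pregnancy_start = age_group),
  responseName = "n",
  stringsAsFactors = FALSE
)
age_counts$total <- length(ages)
age_counts$pct <- 100 * age_counts$n / age_counts$total  # assumed 0-100 scale
age_counts
```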
age_summary_first_pregnancy.csv
Columns: final_outcome_category, colName, min, Q25, median, Q75, max, mean, sd, plus metadata. One row per outcome group: overall (any outcome), each individual outcome (LB, SB, AB, SA, ECT, DELIV, PREG), and LB or PREG (combined).
Purpose: Age at first pregnancy start, stratified by outcome; used for cohort description and feasibility.
age_summary_first_pregnancy_end.csv
Columns: final_outcome_category,
colName, min, Q25,
median, Q75, max,
mean, sd, plus metadata. Same
structure as age_summary_first_pregnancy.csv but summarises
age_pregnancy_end (age at pregnancy end date) instead
of age at start.
Purpose: Age at first pregnancy end, stratified by outcome; used for cohort description and feasibility.
precision_days.csv
Columns: esd_precision_days (x-axis values), density (kernel density estimate), plus cdm_name, date_run, date_export, pkg_version.
Purpose: Distribution of ESD start-date precision (in days). Used to see how precise inferred start dates are (e.g. mostly week-level vs month-level) and to compare precision across sites.
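For orientation, a two-column density table like this one can be produced with stats::density(). The sketch uses simulated precision values and R's default kernel and bandwidth; which settings the package actually uses is not stated here, so treat those as assumptions.

```r
# Simulated precision values (days): a week-level and a month-level cluster.
set.seed(1)
precision <- c(rpois(50, 7), rpois(20, 30))

# Kernel density estimate with default settings (assumed, not confirmed).
kde <- stats::density(precision, na.rm = TRUE)
precision_days <- data.frame(
  esd_precision_days = kde$x,  # evaluation grid (x-axis)
  density = kde$y              # estimated density at each grid point
)
head(precision_days)
```

Because only the grid and density values are stored, the file carries no episode-level data.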
precision_days_denominators.csv
Columns: total_episodes, episodes_with_precision_days, pct_with_precision_days, episodes_with_gw_timing, pct_with_gw_timing, plus metadata.
Purpose: Denominators for interpreting precision outputs (how many episodes had ESD precision and GW timing evidence).
episode_frequency.csv
Columns: total_episodes, total_individuals (distinct persons with at least one episode), plus metadata. total_individuals may be suppressed if < minCellCount.
Purpose: High-level counts: how many pregnancy episodes and how many people. Used as the main denominator for rates and for site-level summaries.
pregnancy_frequency.csv
Columns: freq (number of pregnancies per person: 1, 2, 3, …), number_individuals (count of persons with that frequency), plus metadata. number_individuals may be suppressed.
Purpose: Distribution of pregnancy count per person (parity-like). Used to describe repeat pregnancies and to check for implausible multiplicity.
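The freq / number_individuals layout is a count of counts. A minimal base R sketch with made-up person identifiers, leaving out the metadata columns and suppression:

```r
# One entry per episode; the value is the (made-up) person identifier.
person_id <- c(1, 1, 2, 3, 3, 3, 4)

# Episodes per person, then the number of persons at each frequency.
episodes_per_person <- as.integer(table(person_id))
pregnancy_frequency <- as.data.frame(
  table(freq = episodes_per_person),
  responseName = "number_individuals",
  stringsAsFactors = FALSE
)
pregnancy_frequency$freq <- as.integer(pregnancy_frequency$freq)
pregnancy_frequency
```

Multiplying freq by number_individuals and summing recovers the total number of episodes, which is a quick consistency check against episode_frequency.csv.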
episode_frequency_summary.csv
Columns: Summary of freq (episodes per person): colName, min, Q25, median, Q75, max, mean, sd, plus metadata.
Purpose: Numeric summary of how many episodes per person (min/median/max, etc.). Used for cohort description and cross-site comparison of pregnancy frequency.
gestational_age_days_summary.csv
Columns: Summary of esd_gestational_age_days_calculated (gestational length in days): colName, min, Q25, median, Q75, max, mean, sd, plus metadata.
Purpose: Distribution of gestational duration. Used to describe pregnancy length, flag implausible durations, and compare sites.
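Several files share this summary layout (colName, min, Q25, median, Q75, max, mean, sd). The sketch below shows one way to build such a row; summarise_numeric() is a hypothetical helper, the values are made up, and the quantiles use R's default method (type 7), which may not match the package.

```r
# Hypothetical helper: one summary row for a numeric vector.
summarise_numeric <- function(x, colName) {
  q <- stats::quantile(x, probs = c(0.25, 0.5, 0.75),
                       na.rm = TRUE, names = FALSE)
  data.frame(
    colName = colName,
    min = min(x, na.rm = TRUE),
    Q25 = q[1],
    median = q[2],
    Q75 = q[3],
    max = max(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE),
    sd = stats::sd(x, na.rm = TRUE)
  )
}

ga_days <- c(70, 140, 266, 273, 280)  # made-up gestational lengths (days)
ga_summary <- summarise_numeric(ga_days, "esd_gestational_age_days_calculated")
ga_summary
```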
gestational_age_days_counts.csv
Columns: less_1day (count of episodes with gestational age < 1 day), over_308days (count > 308 days), plus metadata.
Purpose: Counts of out-of-range gestational ages. Used for data quality and to quantify implausible or missing duration.
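The two thresholds translate directly into counts. A minimal sketch with made-up durations, assuming that episodes with a missing gestational age are excluded from both counts:

```r
# Made-up gestational ages in days, including one missing value.
ga_days <- c(0, 35, 280, 310, 400, NA)

ga_counts <- data.frame(
  less_1day = sum(ga_days < 1, na.rm = TRUE),      # gestational age < 1 day
  over_308days = sum(ga_days > 308, na.rm = TRUE)  # gestational age > 308 days
)
ga_counts
```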
gestational_weeks.csv
Columns: final_outcome_category, gestational_weeks (integer: floor of gestational age in days / 7), n (number of episodes), pct (percentage), plus metadata.
Purpose: Distribution of gestational age by week. Used for gestational-age histograms, preterm/term summaries, and cross-site comparison.
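The week value is the floor of days divided by 7, so 279 days falls in week 39 and 280 days in week 40. The sketch below uses made-up episodes and assumes pct is computed over all episodes; the package may instead compute it within each outcome category.

```r
# Made-up episodes: outcome and gestational age in days.
episodes <- data.frame(
  final_outcome_category = c("LB", "LB", "LB", "SA"),
  ga_days = c(273, 279, 280, 62)
)
episodes$gestational_weeks <- floor(episodes$ga_days / 7)

# Count episodes per outcome and week.
weeks <- aggregate(
  n ~ final_outcome_category + gestational_weeks,
  data = transform(episodes, n = 1L),
  FUN = sum
)
weeks$pct <- 100 * weeks$n / sum(weeks$n)  # assumed denominator: all episodes
weeks
```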
gestational_age_days_per_category_summary.csv
Columns: One row per final_outcome_category (e.g. LB, SB, PREG). For each: summary of esd_gestational_age_days_calculated — colName, min, Q25, median, Q75, max, mean, sd, plus final_outcome_category and metadata.
Purpose: Gestational duration by outcome type. Used to check that outcome-specific durations (e.g. live birth vs miscarriage) are plausible and to compare across sites.
yearly_trend.csv
Columns: column (name of the date field), year (integer), count (number of episodes with that date in that year), plus metadata.
Purpose: Episode counts by year for each date type (e.g. final_episode_start_date, final_episode_end_date). Used for temporal trends and study window checks.
yearly_trend_missing.csv
Columns: Same as yearly_trend.csv, restricted to rows where year is NA (missing date).
Purpose: Count of episodes with missing dates by column. Used for completeness and missing-data reporting.
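The split between yearly_trend.csv and yearly_trend_missing.csv can be pictured as one long table of counts per date column and year, divided by whether year is NA. A base R sketch with made-up dates (metadata columns omitted):

```r
# Made-up episodes; one start date is missing.
episodes <- data.frame(
  final_episode_start_date = as.Date(c("2019-03-01", "2019-11-15", "2020-02-10", NA)),
  final_episode_end_date = as.Date(c("2019-12-01", "2020-08-20", "2020-11-05", "2021-01-02"))
)
date_cols <- c("final_episode_start_date", "final_episode_end_date")

# Count episodes per year for each date column, keeping NA as its own group.
yearly <- do.call(rbind, lapply(date_cols, function(col) {
  year <- as.integer(format(episodes[[col]], "%Y"))
  counts <- table(year, useNA = "ifany")
  data.frame(column = col,
             year = as.integer(names(counts)),
             count = as.integer(counts))
}))

yearly_trend <- yearly[!is.na(yearly$year), ]
yearly_trend_missing <- yearly[is.na(yearly$year), ]
yearly_trend_missing
```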
monthly_trends.csv
Columns: column (date field), month (month name), count, plus metadata.
Purpose: Episode counts by month (seasonality). Used for seasonal patterns and data completeness by month.
monthly_trend_missing.csv
Columns: Same as monthly_trends.csv, restricted to rows where month is NA (missing date).
Purpose: Count of episodes with missing date by column, for completeness.
observation_period_range.csv
Columns: min_obs (minimum
observation period start year), max_obs (maximum
observation period end year), plus metadata. Derived from CDM
observation_period.
Purpose: Range of observation periods in the CDM. Used to interpret temporal coverage and to compare study windows across sites.
pregnancy_overlap_counts.csv
Columns: colName (always
"overlap"), overlap (FALSE or TRUE),
n (number of episode records in that category),
total (total episode records, identical in every row),
pct (percentage of records in that category), plus
cdm_name, date_run,
date_export, pkg_version. One row per
overlap category. FALSE: the episode does not overlap the person's
previous episode, or the person has no previous episode (e.g. a first
episode). TRUE: the episode overlaps the previous one, i.e. episode
start ≤ previous episode end within the same person.
Purpose: Summary counts of overlapping inferred pregnancy intervals. Used for data quality (overlaps may indicate algorithm or data issues).
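The overlap rule (episode start ≤ previous episode end within a person) can be sketched in base R as follows. The episodes and the column names person_id, start, and end are made up for illustration; the package's own column names and ordering logic may differ.

```r
# Made-up episodes; person 1's second episode starts before the first ends.
episodes <- data.frame(
  person_id = c(1, 1, 1, 2),
  start = as.Date(c("2020-01-01", "2020-09-15", "2021-08-01", "2020-05-01")),
  end = as.Date(c("2020-09-30", "2021-06-01", "2022-04-01", "2021-01-15"))
)
episodes <- episodes[order(episodes$person_id, episodes$start), ]

# Compare each episode with the previous row, but only within the same person.
n_rows <- nrow(episodes)
prev_end <- c(as.Date(NA), episodes$end[-n_rows])
same_person <- c(FALSE, episodes$person_id[-1] == episodes$person_id[-n_rows])
episodes$overlap <- same_person & episodes$start <= prev_end

counts <- table(factor(episodes$overlap, levels = c(FALSE, TRUE)))
overlap_counts <- data.frame(
  colName = "overlap",
  overlap = c(FALSE, TRUE),
  n = as.integer(counts),
  total = n_rows,
  pct = 100 * as.integer(counts) / n_rows
)
overlap_counts
```

First episodes are FALSE by construction because they have no previous episode to compare against.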
missing_dates.csv
Columns: Number and percentage of missing start and end dates for HIP episodes, PPS episodes, and ESD (final) episodes: hip_start_n, hip_start_pct, hip_end_n, hip_end_pct, pps_start_n, pps_start_pct, pps_end_n, pps_end_pct, esd_start_n, esd_start_pct, esd_end_n, esd_end_pct, plus cdm_name, date_run, date_export, pkg_version. HIP uses merge_pregnancy_start (start) and hip_end_date (end); PPS uses pps_episode_min_date (start) and pps_episode_max_date (end); ESD uses final_episode_start_date and final_episode_end_date.
Purpose: Counts and percentages of missing start/end dates by episode type. Used to assess completeness of key dates and to compare across sites.
swapped_dates.csv
Columns: source (hip, pps, or esd), n_swapped (count where start > end), total (records with both dates for that source), pct (100 × n_swapped / total), plus cdm_name, date_run, date_export, pkg_version. One row per source: HIP (merge_pregnancy_start > hip_end_date), PPS (merge_pregnancy_start > pps_end_date), ESD (final_episode_start_date > final_episode_end_date).
Purpose: Count of episodes with reversed (swapped) start/end by source. Used for data quality and algorithm validation.
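The arithmetic behind one row of this file, using the definitions given above (total counts records with both dates; pct = 100 × n_swapped / total), looks like this. The dates are made up and the row is labelled esd purely for illustration.

```r
# Made-up start/end dates; the second pair is reversed, the third lacks a start.
start <- as.Date(c("2020-01-01", "2020-06-01", NA, "2021-03-01"))
end <- as.Date(c("2020-09-01", "2020-05-01", "2020-12-01", "2021-11-01"))

both <- !is.na(start) & !is.na(end)        # records with both dates present
n_swapped <- sum(start[both] > end[both])  # start after end
total <- sum(both)

swapped_row <- data.frame(
  source = "esd",
  n_swapped = n_swapped,
  total = total,
  pct = 100 * n_swapped / total
)
swapped_row
```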
outcome_categories_count.csv
Columns: outcome_category (e.g. LB, SB, PREG), algorithm (hip, pps, hipps), n, pct (within algorithm), plus metadata.
Purpose: Outcome distribution by algorithm (HIP, PPS, final harmonized). Used to compare algorithm agreement and to describe outcome mix across sites.
delivery_mode_summary.csv
Columns: final_outcome_category, delivery-mode counts (e.g. cesarean, vaginal), n, plus metadata.
Purpose: Delivery mode (cesarean vs vaginal) by outcome category for live-birth analyses.
delivery_mode_by_year.csv
Columns: year, final_outcome_category, delivery-mode counts (e.g. cesarean, vaginal), n, plus metadata.
Purpose: Delivery mode (cesarean vs vaginal) stratified by year and outcome category. Year is derived from the episode end date.
concept_check.csv
Columns: Concept-timing check results (concept coverage and timing vs episodes), plus metadata.
Purpose: QA of concept usage and timing relative to pregnancy episodes.
ZIP archive
exportPregnancies() writes CSVs only. To create a ZIP
archive, run zipExportFolder(exportFolder) after export
(and after PET comparison outputs if you want one combined archive).
Summary table
| File | Main content | Main use in analysis |
|---|---|---|
| cdm_source.csv | CDM/snapshot metadata | Provenance, site/CDM identification |
| pps_concept_counts.csv | PPS concept counts | Concept coverage, PPS input description |
| hip_concept_counts.csv | HIP concept counts | HIP concept coverage (when present) |
| esd_concept_counts.csv | ESD concept counts | ESD concept coverage (when present) |
| attrition.csv | Pipeline step attrition | Audit, cohort flow (when present) |
| log.txt | Pipeline log | Audit, debugging |
| age_summary.csv | Age-at-start distribution (summary stats) | Cohort description, feasibility |
| age_summary_first_pregnancy.csv | Age at first pregnancy start by outcome | First-pregnancy age description |
| age_summary_first_pregnancy_end.csv | Age at first pregnancy end by outcome | First-pregnancy end age description |
| age_summary_groups.csv | Age counts (by year and <12, >55) | Age stratification, boundary checks |
| precision_days.csv | Density of ESD precision (days) | Start-date precision, site comparison |
| precision_days_denominators.csv | Precision/GW timing denominators | Context for precision completeness |
| episode_frequency.csv | Total episodes, total individuals | Denominators, site summaries |
| pregnancy_frequency.csv | Episodes per person (1, 2, 3, …) | Parity-like distribution |
| episode_frequency_summary.csv | Summary of episodes per person | Cohort description |
| gestational_age_days_summary.csv | Summary of gestational length (days) | Duration distribution |
| gestational_age_days_counts.csv | Counts <1 day, >308 days | Out-of-range duration QA |
| gestational_weeks.csv | Counts by gestational week | GA distribution, preterm/term |
| gestational_age_days_per_category_summary.csv | Gestational length by outcome | Outcome-specific duration |
| yearly_trend.csv | Episode count by year and date column | Temporal trends |
| yearly_trend_missing.csv | Missing-date count by column/year | Completeness |
| monthly_trends.csv | Episode count by month | Seasonality, completeness |
| monthly_trend_missing.csv | Missing-date count by column/month | Completeness |
| observation_period_range.csv | Min/max observation period years | Study window, coverage |
| pregnancy_overlap_counts.csv | Overlap counts (overlap TRUE/FALSE with n, total, pct) | Overlap QA |
| missing_dates.csv | N and % missing start/end (HIP, PPS, ESD) | Completeness by episode type |
| swapped_dates.csv | Count of start > end (HIP/PPS/ESD) | Date logic QA |
| outcome_categories_count.csv | Outcome counts by algorithm | Outcome mix, algorithm agreement |
| delivery_mode_summary.csv | Delivery mode by outcome | Cesarean/vaginal by outcome |
| delivery_mode_by_year.csv | Delivery mode by year and outcome | Cesarean/vaginal trends over time |
| concept_check.csv | Concept timing vs episodes | Concept QA |
| quality_check_cleanup.csv | Cleanup/validation QA | Cleanup QA |
| attrition_if_cleanup.csv | Cleanup-rule flagged counts | Cleanup QA and potential attrition |
