Skip to contents

Introduction

A summarised result is a table that contains aggregated summary statistics (result set with no patient-level data). The summarised result object consist in 2 objects: results table and settings table.

Results table

This table consist in 13 columns:

  • result_id (1), it is used to identify a group of results with a common settings (see settings below).
  • cdm_name (2), it is used to identify the name of the cdm object used to obtain those results.
  • group_name (3) - group_level (4), these columns work together as a name-level pair. A name-level pair are two columns that work together to summarise information of multiple other columns. The name column contains the column names separated by &&& and the level column contains the column values separated by &&&. Elements in the name column must be snake_case. Usually group aggregation is used to show high level aggregations: e.g. cohort name or codelist name.
  • strata_name (5) - strata_level (6), these columns work together as a name-level pair. Usually strata aggregation is used to show stratifications of the results: e.g. age groups or sex.
  • variable_name (7), name of the variable of interest.
  • variable_level (8), level of the variable of interest, it is usually a subclass of the variable_name.
  • estimate_name (9), name of the estimate.
  • estimate_type (10), type of the value displayed, the supported types are: numeric, integer, date, character, proportion, percentage, logical.
  • estimate_value (11), value of interest.
  • additional_name (12) - additional_level (13), these columns work together as a name-level pair. Usually additional aggregation is used to include the aggregations that did not fit in the group/strata definition.

The following table summarises the requirements of each column in the summarised_result format:

Column name Column type is NA allowed? Requirements
result_id integer No NA
cdm_name character No NA
group_name character No name1
group_level character No level1
strata_name character No name2
strata_level character No level2
variable_name character No NA
variable_level character Yes NA
estimate_name character No snake_case
estimate_type character No estimateTypeChoices()
estimate_value character No NA
additional_name character No name3
additional_level character No level3

Settings

The settings table provides one row per result_id with the settings used to generate those results, there is no limit of columns and parameters to be provided per result_id. But there is at least 3 values that should be provided:

  • resut_type (1): it identifies the type of result provided. We would usually use the name of the function that generated that set of result in snake_case. Example if the function that generates the summarised result is named summariseMyCustomData and then the used result_type would be: summarise_my_custom_data.
  • package_name (2): name of the package that generated the result type.
  • package_version (3): version of the package that generated the result type.

All those columns are required to be characters, but this restriction does not apply to other extra columns.

newSummarisedResult

The newSummarisedResult() function can be used to create objects, the inputs of this function are: the summarised_result table that must fulfill the conditions specified above; and the settings argument. The settings argument can be NULL or do not contain all the required columns and they will be populated by default (a warning will appear). Let’s see a very simple example:

library(omopgenerics)
library(dplyr)

x <- tibble(
  result_id = 1L,
  cdm_name = "my_cdm",
  group_name = "cohort_name",
  group_level = "cohort1",
  strata_name = "sex",
  strata_level = "male",
  variable_name = "Age group",
  variable_level = "10 to 50",
  estimate_name = "count",
  estimate_type = "numeric",
  estimate_value = "5",
  additional_name = "overall",
  additional_level = "overall"
)

result <- newSummarisedResult(x)
result |> 
  glimpse()
#> Rows: 1
#> Columns: 13
#> $ result_id        <int> 1
#> $ cdm_name         <chr> "my_cdm"
#> $ group_name       <chr> "cohort_name"
#> $ group_level      <chr> "cohort1"
#> $ strata_name      <chr> "sex"
#> $ strata_level     <chr> "male"
#> $ variable_name    <chr> "Age group"
#> $ variable_level   <chr> "10 to 50"
#> $ estimate_name    <chr> "count"
#> $ estimate_type    <chr> "numeric"
#> $ estimate_value   <chr> "5"
#> $ additional_name  <chr> "overall"
#> $ additional_level <chr> "overall"
settings(result)
#> # A tibble: 1 × 7
#>   result_id result_type package_name package_version group     strata additional
#>       <int> <chr>       <chr>        <chr>           <chr>     <chr>  <chr>     
#> 1         1 ""          ""           ""              cohort_n… sex    ""

We can also associate settings with our results. These will typically be used to explain how the result was created.

result <- newSummarisedResult(
  x = x, 
  settings = tibble(
    result_id = 1L, 
    package_name = "PatientProfiles",
    study = "my_characterisation_study"
  )
)

result |> glimpse()
#> Rows: 1
#> Columns: 13
#> $ result_id        <int> 1
#> $ cdm_name         <chr> "my_cdm"
#> $ group_name       <chr> "cohort_name"
#> $ group_level      <chr> "cohort1"
#> $ strata_name      <chr> "sex"
#> $ strata_level     <chr> "male"
#> $ variable_name    <chr> "Age group"
#> $ variable_level   <chr> "10 to 50"
#> $ estimate_name    <chr> "count"
#> $ estimate_type    <chr> "numeric"
#> $ estimate_value   <chr> "5"
#> $ additional_name  <chr> "overall"
#> $ additional_level <chr> "overall"
settings(result)
#> # A tibble: 1 × 8
#>   result_id result_type package_name    package_version group  strata additional
#>       <int> <chr>       <chr>           <chr>           <chr>  <chr>  <chr>     
#> 1         1 ""          PatientProfiles ""              cohor… sex    ""        
#> # ℹ 1 more variable: study <chr>

Combining summarised results

Multiple summarised results objects can be combined using the bind function. Result id will be assigned for each set of results with the same settings. So if two groups of results have the same settings althought being in different objects they will be merged into a single one.

result1 <- newSummarisedResult(
  x = tibble(
    result_id = 1L,
    cdm_name = "my_cdm",
    group_name = "cohort_name",
    group_level = "cohort1",
    strata_name = "sex",
    strata_level = "male",
    variable_name = "Age group",
    variable_level = "10 to 50",
    estimate_name = "count",
    estimate_type = "numeric",
    estimate_value = "5",
    additional_name = "overall",
    additional_level = "overall"
  ), 
  settings = tibble(
    result_id = 1L,
    package_name = "PatientProfiles",
    package_version = "1.0.0",
    study = "my_characterisation_study",
    result_type = "stratified_by_age_group"
  )
)

result2 <- newSummarisedResult(
  x = tibble(
    result_id = 1L,
    cdm_name = "my_cdm",
    group_name = "overall",
    group_level = "overall",
    strata_name = "overall",
    strata_level = "overall",
    variable_name = "overall",
    variable_level = "overall",
    estimate_name = "count",
    estimate_type = "numeric",
    estimate_value = "55",
    additional_name = "overall",
    additional_level = "overall"
  ),
  settings = tibble(
    result_id = 1L,
    package_name = "PatientProfiles",
    package_version = "1.0.0",
    study = "my_characterisation_study",
    result_type = "overall_analysis"
  )
)

Now we have our results we can combine them using bind. Because the two sets of results contain the same result ID, when the results are combined this will be automatically updated.

result <- bind(result1, result2)
result |> 
  dplyr::glimpse()
#> Rows: 2
#> Columns: 13
#> $ result_id        <int> 1, 2
#> $ cdm_name         <chr> "my_cdm", "my_cdm"
#> $ group_name       <chr> "cohort_name", "overall"
#> $ group_level      <chr> "cohort1", "overall"
#> $ strata_name      <chr> "sex", "overall"
#> $ strata_level     <chr> "male", "overall"
#> $ variable_name    <chr> "Age group", "overall"
#> $ variable_level   <chr> "10 to 50", "overall"
#> $ estimate_name    <chr> "count", "count"
#> $ estimate_type    <chr> "numeric", "numeric"
#> $ estimate_value   <chr> "5", "55"
#> $ additional_name  <chr> "overall", "overall"
#> $ additional_level <chr> "overall", "overall"
settings(result)
#> # A tibble: 2 × 8
#>   result_id result_type     package_name package_version group strata additional
#>       <int> <chr>           <chr>        <chr>           <chr> <chr>  <chr>     
#> 1         1 stratified_by_… PatientProf… 1.0.0           "coh… "sex"  ""        
#> 2         2 overall_analys… PatientProf… 1.0.0           ""    ""     ""        
#> # ℹ 1 more variable: study <chr>

Minimum cell count suppression

Once we have a summarised result, we can suppress the results based on some minimum cell count.

result |>
  suppress(minCellCount = 7) |> 
  glimpse()
#> Rows: 2
#> Columns: 13
#> $ result_id        <int> 1, 2
#> $ cdm_name         <chr> "my_cdm", "my_cdm"
#> $ group_name       <chr> "cohort_name", "overall"
#> $ group_level      <chr> "cohort1", "overall"
#> $ strata_name      <chr> "sex", "overall"
#> $ strata_level     <chr> "male", "overall"
#> $ variable_name    <chr> "Age group", "overall"
#> $ variable_level   <chr> "10 to 50", "overall"
#> $ estimate_name    <chr> "count", "count"
#> $ estimate_type    <chr> "numeric", "numeric"
#> $ estimate_value   <chr> NA, "55"
#> $ additional_name  <chr> "overall", "overall"
#> $ additional_level <chr> "overall", "overall"

The minCellCount suppression is recorded in the settings of the object:

resultSuppressed <- result |> suppress(minCellCount = 10)
settings(result)
#> # A tibble: 2 × 8
#>   result_id result_type     package_name package_version group strata additional
#>       <int> <chr>           <chr>        <chr>           <chr> <chr>  <chr>     
#> 1         1 stratified_by_… PatientProf… 1.0.0           "coh… "sex"  ""        
#> 2         2 overall_analys… PatientProf… 1.0.0           ""    ""     ""        
#> # ℹ 1 more variable: study <chr>
settings(resultSuppressed)
#> # A tibble: 2 × 9
#>   result_id result_type     package_name package_version group strata additional
#>       <int> <chr>           <chr>        <chr>           <chr> <chr>  <chr>     
#> 1         1 stratified_by_… PatientProf… 1.0.0           "coh… "sex"  ""        
#> 2         2 overall_analys… PatientProf… 1.0.0           ""    ""     ""        
#> # ℹ 2 more variables: min_cell_count <chr>, study <chr>

How suppression works

There are three levels of suppression:

  • record level suppression: records where the word ‘count’ is contained in the “estimate_name” value will be suppressed if their numeric value is smaller than minCellCount.

  • linked suppression: if a count of estimate_name: ‘my_precious_count’ is suppressed and it exist an estimate named ‘my_precious_percentage’ with the same: result_id, cdm_name, group_name, group_level, strata_name, strata_level, variable_name, variable_level, additional_name, additional_level; then this estimate will also be suppressed.

  • suppression at variable_name level: all the estimates with the same: result_id, cdm_name, group_name, group_level, strata_name, strata_level, variable_name, additional_name, additional_level; will be suppressed if there exist a there exist an estimate with estimate_name %in% c(“count”, “denominator_count”, “outcome_count”, “record_count”, “subject_count”) that is suppressed.

  • suppression at group level: all the estimates with the same: result_id, cdm_name, group_name, group_level, strata_name, strata_level, additional_name, additional_level; will be suppressed if there exist a variable_name %in% c(“number subjects”, “number records”) that is suppressed.

You can see the source code for cell suppression here: https://github.com/darwin-eu/omopgenerics/blob/main/R/methodSuppress.R.

Export and import summarised results

The summarised_result object can be exported and imported as a .csv file with the following functions:

  • importSummarisedResult()

  • exportSummarisedResult()

Note that exportSummarisedResult also suppresses the results.

x <- tempdir()
files <- list.files(x)

exportSummarisedResult(result, path = x, fileName = "result.csv")
setdiff(list.files(x), files)
#> [1] "result.csv"

Note that the settings are included in the csv file:

#> "result_id","cdm_name","group_name","group_level","strata_name","strata_level","variable_name","variable_level","estimate_name","estimate_type","estimate_value","additional_name","additional_level" 1,"my_cdm","cohort_name","cohort1","sex","male","Age group","10 to 50","count","numeric","5","overall","overall" 2,"my_cdm","overall","overall","overall","overall","overall","overall","count","numeric","55","overall","overall" 1,NA,"overall","overall","overall","overall","settings",NA,"result_type","character","stratified_by_age_group","overall","overall" 1,NA,"overall","overall","overall","overall","settings",NA,"package_name","character","PatientProfiles","overall","overall" 1,NA,"overall","overall","overall","overall","settings",NA,"package_version","character","1.0.0","overall","overall" 1,NA,"overall","overall","overall","overall","settings",NA,"group","character","cohort_name","overall","overall" 1,NA,"overall","overall","overall","overall","settings",NA,"strata","character","sex","overall","overall" 1,NA,"overall","overall","overall","overall","settings",NA,"additional","character","","overall","overall" 1,NA,"overall","overall","overall","overall","settings",NA,"min_cell_count","character","5","overall","overall" 1,NA,"overall","overall","overall","overall","settings",NA,"study","character","my_characterisation_study","overall","overall" 2,NA,"overall","overall","overall","overall","settings",NA,"result_type","character","overall_analysis","overall","overall" 2,NA,"overall","overall","overall","overall","settings",NA,"package_name","character","PatientProfiles","overall","overall" 2,NA,"overall","overall","overall","overall","settings",NA,"package_version","character","1.0.0","overall","overall" 2,NA,"overall","overall","overall","overall","settings",NA,"group","character","","overall","overall" 2,NA,"overall","overall","overall","overall","settings",NA,"strata","character","","overall","overall" 2,NA,"overall","overall","overall","overall","settings",NA,"additional","character","","overall","overall" 2,NA,"overall","overall","overall","overall","settings",NA,"min_cell_count","character","5","overall","overall" 2,NA,"overall","overall","overall","overall","settings",NA,"study","character","my_characterisation_study","overall","overall"

You can later import the results back with importSummarisedResult():

res <- importSummarisedResult(path = file.path(x, "result.csv"))
class(res)
#> [1] "summarised_result" "omop_result"       "tbl_df"           
#> [4] "tbl"               "data.frame"
res |>
  glimpse()
#> Rows: 2
#> Columns: 13
#> $ result_id        <int> 1, 2
#> $ cdm_name         <chr> "my_cdm", "my_cdm"
#> $ group_name       <chr> "cohort_name", "overall"
#> $ group_level      <chr> "cohort1", "overall"
#> $ strata_name      <chr> "sex", "overall"
#> $ strata_level     <chr> "male", "overall"
#> $ variable_name    <chr> "Age group", "overall"
#> $ variable_level   <chr> "10 to 50", "overall"
#> $ estimate_name    <chr> "count", "count"
#> $ estimate_type    <chr> "numeric", "numeric"
#> $ estimate_value   <chr> "5", "55"
#> $ additional_name  <chr> "overall", "overall"
#> $ additional_level <chr> "overall", "overall"
res |>
  settings()
#> # A tibble: 2 × 9
#>   result_id result_type     package_name package_version group strata additional
#>       <int> <chr>           <chr>        <chr>           <chr> <chr>  <chr>     
#> 1         1 stratified_by_… PatientProf… 1.0.0           "coh… "sex"  ""        
#> 2         2 overall_analys… PatientProf… 1.0.0           ""    ""     ""        
#> # ℹ 2 more variables: min_cell_count <chr>, study <chr>