Minimum cell count suppression
Minimum cell count suppression is an essential step in reducing
re-identification risk in studies. The minimum cell count can vary
from source to source, but a value of 5 is commonly used. In this
vignette we explain how the suppression process works for
summarised_result objects.
How suppression works
In general, a record is suppressed if three conditions are met:
- The estimate_name field contains the word ‘count’ (e.g. ‘count’, ‘outcome_count’, ‘count_of_individuals’, …).
- The estimate_type field is either numeric or integer.
- The estimate_value numeric value is less than minCellCount and greater than 0.
This simple rule determines record-level suppression. The suppressed
record is not removed from the results; instead, the
estimate_value field is populated as
‘<{minCellCount}’.
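The record-level rule above can be sketched in a few lines of base R. This is only an illustration of the three conditions, not the package's implementation: suppressRecords() and toyResult are hypothetical names, the match on ‘count’ is assumed to be a plain substring match, and none of the linked suppression described next is applied.

```r
# Hypothetical helper (not part of omopgenerics): applies only the
# record-level rule to a data frame with summarised_result-like columns.
suppressRecords <- function(x, minCellCount = 5) {
  # estimate_value is stored as character, so convert it before comparing
  value <- suppressWarnings(as.numeric(x$estimate_value))
  toSuppress <- grepl("count", x$estimate_name, fixed = TRUE) & # condition 1
    x$estimate_type %in% c("numeric", "integer") &              # condition 2
    !is.na(value) & value > 0 & value < minCellCount            # condition 3
  # the record is kept; only its value is replaced by "<minCellCount"
  x$estimate_value[toSuppress] <- paste0("<", minCellCount)
  x
}

toyResult <- data.frame(
  estimate_name = c("count", "outcome_count", "mean", "count"),
  estimate_type = c("integer", "integer", "numeric", "integer"),
  estimate_value = c("3", "12", "2.5", "0"),
  stringsAsFactors = FALSE
)

suppressRecords(toyResult, minCellCount = 5)$estimate_value
#> [1] "<5"  "12"  "2.5" "0"
```

Only the first row meets all three conditions: the second is not below the threshold, the third is not a count, and the fourth is not greater than 0.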
Once one record is suppressed, this can trigger suppression of other linked estimates. This suppression is done at different levels and affects different rows of the result object:
- Suppression at group level: if, in the suppressed estimate, the field variable_name is populated with “number records” or “number subjects” (case-insensitive), then the whole group of records will be suppressed. A group of records is the set of rows with the same result_id, cdm_name, group_name, group_level, strata_name, strata_level, additional_name and additional_level. This level of suppression can be used to suppress all demographics for a cohort of individuals that has fewer than the minimum number of records required. For developers, creating a row with variable_name = “number records” or “number subjects” can have a big impact on suppression, but it also gives you the ability to link a group of estimates and suppress all of them at the same time.
- Suppression at variable_name level: if, in the suppressed estimate, the field estimate_name is populated with “count”, “denominator_count”, “outcome_count”, “record_count” or “subject_count”, then suppression is done at the variable level. This means that all estimates with the same result_id, cdm_name, group_name, group_level, strata_name, strata_level, additional_name, additional_level and variable_name will be suppressed. This level of suppression can be used to suppress statistics associated with an outcome count without affecting different outcomes. For developers, use one of these keywords to link estimates at the variable level.
- Suppression of percentages: if an estimate is suppressed, any estimate at the same level (same result_id, cdm_name, group_name, group_level, strata_name, strata_level, additional_name, additional_level, variable_name and variable_level) with the same estimate name but with ‘count’ replaced by ‘percentage’ (e.g. ‘event_count’ -> ‘event_percentage’) will also be suppressed.
Note that linked estimate records will be suppressed as ‘-’.
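As a rough illustration of the percentage link, the sketch below derives the linked estimate name by swapping ‘count’ for ‘percentage’ and marks the matching row with ‘-’. It uses base R only; toyRows is a hypothetical data frame whose rows are assumed to share every grouping column, so only estimate_name needs to be compared. It is not the package's implementation.

```r
# Toy rows assumed to share result_id, cdm_name, group, strata, additional,
# variable_name and variable_level, so only estimate_name is compared here.
toyRows <- data.frame(
  estimate_name = c("event_count", "event_percentage", "mean"),
  estimate_value = c("<5", "40", "2.1"),
  stringsAsFactors = FALSE
)

# Count estimates that were already suppressed at record level
suppressedCounts <- toyRows$estimate_name[startsWith(toyRows$estimate_value, "<")]

# Swap 'count' for 'percentage' to get the names of the linked estimates
linkedPercentages <- sub("count", "percentage", suppressedCounts, fixed = TRUE)

# Linked estimates are marked with "-"
toyRows$estimate_value[toyRows$estimate_name %in% linkedPercentages] <- "-"
toyRows$estimate_value
#> [1] "<5"  "-"   "2.1"
```

The ‘mean’ row is left untouched in this sketch because it is linked neither by name nor by one of the group or variable keywords.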
You can view the source code for minimum cell suppression here.
Suppressing a summarised_result object
Once we have a summarised result, we can suppress the object based on
a desired minimum cell count value using the suppress()
function.
library(omopgenerics, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
result <- newSummarisedResult(
  x = tibble(
    result_id = 1L,
    cdm_name = "my_cdm",
    group_name = "cohort_name",
    group_level = "cohort1",
    strata_name = "sex",
    strata_level = "male",
    variable_name = "Age group",
    variable_level = "10 to 50",
    estimate_name = "count",
    estimate_type = "numeric",
    estimate_value = "5",
    additional_name = "overall",
    additional_level = "overall"
  ),
  settings = tibble(
    result_id = 1L,
    package_name = "PatientProfiles",
    package_version = "1.0.0",
    study = "my_characterisation_study",
    result_type = "stratified_by_age_group"
  )
)
suppressedResult <- suppress(result = result, minCellCount = 7)
Is a summarised_result object suppressed?
The minCellCount suppression is recorded in the settings of the object:
glimpse(settings(result))
#> Rows: 1
#> Columns: 9
#> $ result_id <int> 1
#> $ result_type <chr> "stratified_by_age_group"
#> $ package_name <chr> "PatientProfiles"
#> $ package_version <chr> "1.0.0"
#> $ group <chr> "cohort_name"
#> $ strata <chr> "sex"
#> $ additional <chr> ""
#> $ min_cell_count <chr> "0"
#> $ study <chr> "my_characterisation_study"
glimpse(settings(suppressedResult))
#> Rows: 1
#> Columns: 9
#> $ result_id <int> 1
#> $ result_type <chr> "stratified_by_age_group"
#> $ package_name <chr> "PatientProfiles"
#> $ package_version <chr> "1.0.0"
#> $ group <chr> "cohort_name"
#> $ strata <chr> "sex"
#> $ additional <chr> ""
#> $ min_cell_count <chr> "7"
#> $ study <chr> "my_characterisation_study"
Because a result object can be partially suppressed (e.g. when binding an
object that has already been suppressed with another one that has not),
and the settings of result objects can be long, there is also a
utility function to check whether an object has been suppressed: isResultSuppressed():
isResultSuppressed(result = result, minCellCount = 5)
#> Warning: ✖ 1 set (1 row) not suppressed.
#> [1] FALSE
isResultSuppressed(result = suppressedResult, minCellCount = 5)
#> Warning: ! 1 set (1 row) suppressed with minCellCount > 5.
#> [1] FALSE
isResultSuppressed(result = suppressedResult, minCellCount = 7)
#> ✔ The <summarised_result> is suppressed with
#> minCellCount = 7.
#> [1] TRUE
isResultSuppressed(result = suppressedResult, minCellCount = 10)
#> Warning: ✖ 1 set (1 row) suppressed with minCellCount < 10.
#> [1] FALSE