2. Retrieving tables from the database
02_tables.Rmd
We have already seen how to create an interface object for an OMOP
CDM database. Now, let’s see how to use it to retrieve tables from it in
deeper detail. For this purpose, we will use the get
method
of the interface object. This method has several arguments that allow
filtering the data that is retrieved and customizing the output.
2.1 Features of the get
method
2.1.1 Table reshaping
The get
method is an smart method that returns the table
after subjecting it to a series of operations. If it detects that this
table can present various records linked to the same OMOP CDM entity
(for example, to a Person
), it will transform the table
into a wide format automatically using the in-built DataSHIELD reshape
function, ready to be merged with other tables. For instance, when we
take the Person
table, it will return a table without
altering its original format:
o$get("person")
ds.summary("person")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 100
##
## $opal_demo$`number of columns`
## [1] 11
##
## $opal_demo$`variables held`
## [1] "person_id" "gender_concept_id" "year_of_birth"
## [4] "month_of_birth" "day_of_birth" "birth_datetime"
## [7] "race_concept_id" "ethnicity_concept_id" "location_id"
## [10] "provider_id" "care_site_id"
Whereas if we take the Measurement
table, which has
records potentially linkable to the Person
table, it will
return a table in wide format, preserving the column
person_id
so that it can be merged with the
Person
table:
# 3027018 is the concept ID for `Heart rate`
o$get(table = "measurement", conceptFilter = 3027018)
ds.summary("measurement")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 100
##
## $opal_demo$`number of columns`
## [1] 15
##
## $opal_demo$`variables held`
## [1] "person_id"
## [2] "heart_rate.measurement_id"
## [3] "heart_rate.measurement_date"
## [4] "heart_rate.measurement_datetime"
## [5] "heart_rate.measurement_time"
## [6] "heart_rate.measurement_type_concept_id"
## [7] "heart_rate.operator_concept_id"
## [8] "heart_rate.value_as_number"
## [9] "heart_rate.value_as_concept_id"
## [10] "heart_rate.unit_concept_id"
## [11] "heart_rate.range_low"
## [12] "heart_rate.range_high"
## [13] "heart_rate.provider_id"
## [14] "heart_rate.visit_occurrence_id"
## [15] "heart_rate.visit_detail_id"
2.1.2 Concept translation
Another feature you may have noticed in the previous example is that
the get
method automatically translates the concepts of the
table that we are retrieving. For instance, in the case of the
Measurement
table, the concept 3027018
was
translated to Heart rate
.
The same happens with the values of the rows in the tables. For
example, in the Person
table, in the
gender_concept_id
column, the value 8532
was
translated automatically to FEMALE
, and the value
8507
was translated to MALE
:
ds.summary("person$gender_concept_id")
## $opal_demo
## $opal_demo$class
## [1] "factor"
##
## $opal_demo$length
## [1] 100
##
## $opal_demo$categories
## [1] "female" "male"
##
## $opal_demo$`count of 'female'`
## [1] 43
##
## $opal_demo$`count of 'male'`
## [1] 57
This will happen with all concepts in the database as long as they
are recognized and properly named in the Concept
table of
the database. Otherwise, the concept will remain as
concept_id_
+ the numeric value of the concept. For
example, if the concept FEMALE
was not properly named in
the Concept
table, it would remain as
concept_id_8532
so that it can at least be identified.
2.1.3 Longitudinal data sequencing
When you have multiple records for the same entity (e.g., a patient)
across time points, dsOMOP
offers two parameters to
transform and align these data in a wide format:
-
wideLongitudinal = TRUE
reshapes repeated measurements for each concept into extra columns, so each row in your table corresponds to one entity (e.g., one patient). -
completeTimePoints = TRUE
further ensures that all entities share the same set of time points. Rows (and columns) for missing measurements at particular dates are filled withNA
.
o$get(
table = "measurement",
conceptFilter = 3027018,
columnFilter = c("measurement_date", "value_as_number"),
wideLongitudinal = TRUE,
completeTimePoints = TRUE
)
ds.summary("measurement")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 100
##
## $opal_demo$`number of columns`
## [1] 9099
##
## $opal_demo$`variables held`
## [1] "person_id" "heart_rate.1.measurement_date"
## [3] "heart_rate.1.value_as_number" "heart_rate.2.measurement_date"
## [5] "heart_rate.2.value_as_number" "heart_rate.3.measurement_date"
## [7] "heart_rate.3.value_as_number" "heart_rate.4.measurement_date"
## [9] "heart_rate.4.value_as_number" "heart_rate.5.measurement_date"
## [ reached getOption("max.print") -- omitted 9089 entries ]
Because DataSHIELD does not allow row-level inspection on the client
side, you will not see the exact final structure of the data.
Internally, however, dsOMOP
arranges your dataset so that
every individual is aligned on a consistent time axis. Below is a
simplified illustration of how this reshaping would look for a small
example of three patients (A, B, C). It does not reflect real
data—merely a conceptual snapshot.
Case 1: wideLongitudinal = TRUE
but
completeTimePoints = FALSE
In this scenario, dsOMOP
just generates a wide table
indexed by the chronological order of each patient’s events. It
does not force every patient to share the same time points.
Suppose the patients have data as follows:
- Patient A has 3 heart-rate measurements (Dates: 2023-01-01, 2023-01-15, 2023-01-31)
- Patient B has 2 measurements (Dates: 2023-01-01, 2023-01-02)
- Patient C has 1 measurement (Date: 2023-01-10)
The resulting table, conceptually, might look like this:
person_id | heart_rate.measurement_date.1 | heart_rate.measurement_value.1 | heart_rate.measurement_date.2 | heart_rate.measurement_value.2 | heart_rate.measurement_date.3 | heart_rate.measurement_value.3 |
---|---|---|---|---|---|---|
A | 2023-01-01 | 70 | 2023-01-15 | 71 | 2023-01-31 | 72 |
B | 2023-01-01 | 60 | 2023-01-02 | 65 | NA | NA |
C | 2023-01-10 | 67 | NA | NA | NA | NA |
- Columns “.1”, “.2”, “.3” represent the 1st, 2nd, and 3rd measurement recorded for each patient, in that patient’s own chronological order.
- If a patient doesn’t have a second or third measurement, those
columns are
NA
. - Different patients can have entirely different date values in, say,
.2
or.3
.
Case 2: wideLongitudinal = TRUE
and
completeTimePoints = TRUE
Now dsOMOP
additionally ensures that all
patients align on every time point found across all patients in
the dataset. For instance, if the union of dates in your dataset is
{2023-01-01, 2023-01-02, 2023-01-10, 2023-01-15, 2023-01-31}
,
then each row (i.e., each patient) will have columns for five
distinct measurements, even if some are NA
:
person_id | heart_rate.measurement_date.1 | heart_rate.measurement_value.1 | heart_rate.measurement_date.2 | heart_rate.measurement_value.2 | heart_rate.measurement_date.3 | heart_rate.measurement_value.3 | heart_rate.measurement_date.4 | heart_rate.measurement_value.4 | heart_rate.measurement_date.5 | heart_rate.measurement_value.5 |
---|---|---|---|---|---|---|---|---|---|---|
A | 2023-01-01 | 70 | NA | NA | 2023-01-15 | 71 | NA | NA | 2023-01-31 | 72 |
B | 2023-01-01 | 60 | 2023-01-02 | 65 | NA | NA | NA | NA | NA | NA |
C | NA | NA | NA | NA | 2023-01-10 | 67 | NA | NA | NA | NA |
- For privacy reasons, it still uses index-based columns
(e.g.,
.1, .2, .3
), but now behind the scenes, each suffix corresponds to the same date for every patient. - Patients missing a measurement on a particular date have
NA
in both themeasurement_date
andmeasurement_value
columns for that slot. - This approach ensures a consistent date axis across all individuals, which is typically needed for time-series analysis or for creating time-aligned comparisons.
Practical usage notes
-
Date sorting: Under the hood,
dsOMOP first discovers all unique dates in the data,
sorts them in ascending order, and then indexes them as
.1, .2, .3, …
. -
Flexibility: You can still opt not to
align time points by leaving
completeTimePoints
asFALSE
. This can be useful for scenarios where each patient’s internal timeline is more relevant than a global date alignment. -
Merging: Whether you choose to use alignment or
not, your wide-format table remains mergeable with other wide-format
tables in DataSHIELD if they share the same
person_id
(or whichever merge column(s) you specify).
In short, wideLongitudinal
provides a wide “spread” of
repeated measurements within each row, and
completeTimePoints
ensures that these measurements line up
on a shared calendar of dates. You can fine-tune these parameters to
suit your analytic design.
2.2 Using the get
method
The get
method has several arguments that allow for the
customization of the data that is retrieved. These arguments are:
-
table
: the name of the table to be retrieved. -
symbol
: the name of the symbol that will be used to store the table in the server’s R environment. If not specified, the table will be stored in a symbol with the same name as the table. -
conceptFilter
: a numeric vector with the concept IDs to be retrieved. It can also be a single numeric value. -
columnFilter
: a character vector with the names of the columns to be retrieved. It can also be a single character value. -
personFilter
: the name of another symbol in the server’s R environment which contains the person IDs to be retrieved by the currentget
operation. If not specified, all existing person IDs will be retrieved. -
mergeColumn
: the name of the column that will be used to merge the table with another table. If not specified, theperson_id
column will be used as the default merge column. -
dropNA
: a logical value indicating whether empty columns should be automatically filtered out from the table. The default value isTRUE
.
Note that only the table
parameter is mandatory for
retrieving a table. However, it is highly recommended to utilize the
available data filtering parameters, as they can significantly enhance
the performance of operations and ensure that only the relevant data for
your study is selected.
2.2.1 Applying basic filters
This part is quite straightforward. We use the tables()
command to know which tables we can take, columns()
to know
what columns we have available, and concepts()
to
understand what concepts are present as entities in the table and select
those we wish to work with.
Once we have identified the specific data we are interested in
analyzing, we can apply the columnFilter
and
conceptFilter
parameters to retrieve the desired
table
.
Let’s proceed with an example.
We check the available tables:
o$tables()
## $opal_demo
## [1] "attribute_definition" "care_site" "cdm_source"
## [4] "cohort" "cohort_attribute" "cohort_definition"
## [7] "concept" "concept_relationship" "condition_era"
## [10] "condition_occurrence" "cost" "death"
## [13] "device_exposure" "dose_era" "drug_era"
## [16] "drug_exposure" "fact_relationship" "location"
## [19] "measurement" "metadata" "note"
## [22] "note_nlp" "observation" "observation_period"
## [25] "payer_plan_period" "person" "procedure_occurrence"
## [28] "provider" "specimen" "visit_detail"
## [31] "visit_occurrence" "vocabulary"
Let’s say we want to retrieve some measurements from the
Measurement
table. We can use the concepts()
command to see what concepts are available in this table:
o$concepts("measurement")
## $opal_demo
## concept_id concept_name
## 1 0 No matching concept
## 2 1175625 Breath rate spontaneous
## 3 3000067 Parathyrin.intact [Mass/volume] in Serum or Plasma
## 4 3000068 oxyCODONE [Presence] in Urine
## 5 3000099 Nuclear Ab [Units/volume] in Serum by Immunoassay
## [ reached 'max' / getOption("max.print") -- omitted 348 rows ]
We have decided to retrieve the information related to the concept
3010421
(which corresponds to
pH of blood
).
But we don’t want to retrieve all the columns from the table, so we
can use the columns()
command to see what columns are
available in the Measurement
table:
o$columns("measurement")
## $opal_demo
## [1] "measurement_id" "person_id"
## [3] "measurement_concept_id" "measurement_date"
## [5] "measurement_datetime" "measurement_time"
## [7] "measurement_type_concept_id" "operator_concept_id"
## [9] "value_as_number" "value_as_concept_id"
## [11] "unit_concept_id" "range_low"
## [13] "range_high" "provider_id"
## [15] "visit_occurrence_id" "visit_detail_id"
## [17] "measurement_source_value" "measurement_source_concept_id"
## [19] "unit_source_value" "value_source_value"
We are only interested in the measurement_date
and
value_as_number
from the Measurement
table
related to the concept ID 3010421
. We can use the
get
method to retrieve this information:
o$get(table = "measurement", # The table we want to retrieve
symbol = "my_table", # The name of the symbol that will store the table
conceptFilter = 3010421, # The concept we are interested in
columnFilter = c("measurement_date", "value_as_concept_id"), # The only columns we want to retrieve
dropNA = FALSE # Whether we want to drop empty columns or not
)
ds.summary("my_table")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 78
##
## $opal_demo$`number of columns`
## [1] 3
##
## $opal_demo$`variables held`
## [1] "person_id" "ph_of_blood.measurement_date"
## [3] "ph_of_blood.value_as_concept_id"
Now this information is ready to be used within the DataSHIELD
environment for any analysis or data manipulation operations that we
want to perform. In the next chapter, we will see how to use the
built-in functions of DataSHIELD’s dsBaseClient
package to
combine it with other tables.
2.2.2 Selecting a merge column
When retrieving a table, it is possible to specify a column that will
be used to merge the table with another table. This is done using the
mergeColumn
argument. For example, when retrieving the
Visit_detail
table, we can specify that the column
visit_occurrence_id
will be used to merge this table with
the Visit_occurrence
table, and it will perform the
reshaping operation to make this merge possible.
For this example, we will retrieve the Visit_detail
table related to the concept ID 4149943
(which corresponds
to Cardiac intensive care unit
, meaning that we are
interested in retrieving only the visits to this unit):
o$get(table = "visit_detail",
conceptFilter = 4149943, # `Cardiac intensive care unit`
columnFilter = c("visit_detail_start_date", "visit_detail_end_date"),
mergeColumn = "visit_occurrence_id")
ds.summary("visit_detail")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 39
##
## $opal_demo$`number of columns`
## [1] 4
##
## $opal_demo$`variables held`
## [1] "visit_occurrence_id"
## [2] "cardiac_intensive_care_unit.person_id"
## [3] "cardiac_intensive_care_unit.visit_detail_start_date"
## [4] "cardiac_intensive_care_unit.visit_detail_end_date"
If no mergeColumn
is specified, the get
method will assume person_id
as the default merge column,
as this is the most common column that is used to merge tables in an
OMOP CDM database.
2.2.3 Filtering by person IDs
The personFilter
argument allows for the retrieval of
data for a specific subset of person IDs. This can be useful when
working with a large dataset and only needing to retrieve data for a
specific group of individuals.
To use the personFilter
parameter, we have to specify
the symbol of another table that already exists in the R environment.
This referenced table has to contain person_id
values that
may represent a subset of the person_id
values in the table
that we aim to retrieve. By applying this parameter, the retrieval
process filters the target table to include only records corresponding
to the person_id
values present in the referenced table,
thereby excluding any records associated with person_id
values not contained in the referenced table.
Utilizing the subsetting functions provided by the
dsBaseClient
package can significantly enhance the
versatility of filtering database records based on cohorts, conditions,
or any other characteristics defining a subset of individuals.
Furthermore, the dsOMOP
package implements a disclosure
control process that prevents the retrieval of tables containing
information on a number of patients below a predefined threshold set in
the server’s configuration. This ensures the security and
confidentiality of the data at all times.
Let’s perform a simple example to illustrate the potential of this feature.
Suppose that, in this case, we are interested in analyzing data from
patients who have records of the condition Chest pain
(Concept ID 77670
in Condition_occurrence
). We
can save this information to a specific symbol
and then
pass it as the personFilter
argument to the
get
method for the Person
table:
o$get(table = "condition_occurrence",
conceptFilter = 77670, # `Chest pain`
columnFilter = "condition_start_date",
symbol = "chest_pain_condition")
ds.summary("chest_pain_condition")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 4
##
## $opal_demo$`number of columns`
## [1] 2
##
## $opal_demo$`variables held`
## [1] "person_id" "chest_pain.condition_start_date"
We stored the measurements under the
chest_pain_condition
symbol. Now we can use this table
(referenced through the chest_pain_condition
symbol) to
filter the Person
table and retrieve only the data related
to these patients:
o$get(table = "person",
symbol = "person_with_chest_pain", # The name of the symbol that will store the table
personFilter = "chest_pain_condition") # The symbol that contains the only person IDs to be retrieved
Now we have a subset of the Person
table that only
contains the data related to the patients who have records of the
condition Chest pain
, stored in the
person_with_chest_pain
symbol. We can pass it to the
personFilter
argument of the get
method for a
third table to retrieve only the data related to these patients. For
example, we can retrieve the Observation
table using the
person_with_chest_pain
symbol as the
personFilter
, and it will only retrieve the data related to
the the patients who are present in the
person_with_chest_pain
symbol:
o$get(table = "observation",
symbol = "observation_from_person_with_chest_pain",
columnFilter = "observation_date",
personFilter = "person_with_chest_pain")
ds.summary("observation_from_person_with_chest_pain")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 4
##
## $opal_demo$`number of columns`
## [1] 15
##
## $opal_demo$`variables held`
## [1] "person_id"
## [2] "primary_insurance.observation_date"
## [3] "preferred_language.observation_date"
## [4] "cost_containment.observation_date"
## [5] "pain_severity_reported.observation_date"
## [6] "acceptable_pain_level_status.observation_date"
## [7] "indicators_of_pain_or_possible_pain.observation_date"
## [8] "body_temperature_measurement_site.observation_date"
## [9] "pain_management_specialty.observation_date"
## [10] "bowel_sounds.observation_date"
## [ reached getOption("max.print") -- omitted 5 entries ]
The Observation
table we retrieved only contains all the
observation records related to the patients who have records of the
condition Chest pain
.
This way, we can easily filter the data we are interested in and work with a subset of the data that is relevant to our study. This can also be done with cohorts, conditions, and any other characteristics that define a subset of individuals. It is done at query time, which makes data retrieval efficient.
In the next article, we will see how to use the
dsBaseClient
package to combine the data we have retrieved
to create a combined dataset that can be used for analysis.