2. Retrieving tables from the database

We have already seen how to create an interface object for an OMOP CDM database. Now, let’s see how to use it to retrieve tables from it in deeper detail. For this purpose, we will use the get method of the interface object. This method has several arguments that allow filtering the data that is retrieved and customizing the output.

2.1 Features of the `get` method

2.1.1 Table reshaping

The get method is an smart method that returns the table after subjecting it to a series of operations. If it detects that this table can present various records linked to the same OMOP CDM entity (for example, to a Person), it will transform the table into a wide format automatically using the in-built DataSHIELD reshape function, ready to be merged with other tables. For instance, when we take the Person table, it will return a table without altering its original format:

o$get("person")

ds.summary("person")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 100
## 
## $opal_demo$`number of columns`
## [1] 11
## 
## $opal_demo$`variables held`
##  [1] "person_id"            "gender_concept_id"    "year_of_birth"       
##  [4] "month_of_birth"       "day_of_birth"         "birth_datetime"      
##  [7] "race_concept_id"      "ethnicity_concept_id" "location_id"         
## [10] "provider_id"          "care_site_id"

Whereas if we take the Measurement table, which has records potentially linkable to the Person table, it will return a table in wide format, preserving the column person_id so that it can be merged with the Person table:

# 3027018 is the concept ID for `Heart rate`
o$get(table = "measurement", conceptFilter = 3027018)

ds.summary("measurement")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 100
## 
## $opal_demo$`number of columns`
## [1] 15
## 
## $opal_demo$`variables held`
##  [1] "person_id"                             
##  [2] "heart_rate.measurement_id"             
##  [3] "heart_rate.measurement_date"           
##  [4] "heart_rate.measurement_datetime"       
##  [5] "heart_rate.measurement_time"           
##  [6] "heart_rate.measurement_type_concept_id"
##  [7] "heart_rate.operator_concept_id"        
##  [8] "heart_rate.value_as_number"            
##  [9] "heart_rate.value_as_concept_id"        
## [10] "heart_rate.unit_concept_id"            
## [11] "heart_rate.range_low"                  
## [12] "heart_rate.range_high"                 
## [13] "heart_rate.provider_id"                
## [14] "heart_rate.visit_occurrence_id"        
## [15] "heart_rate.visit_detail_id"

2.1.2 Concept translation

Another feature you may have noticed in the previous example is that the get method automatically translates the concepts of the table that we are retrieving. For instance, in the case of the Measurement table, the concept 3027018 was translated to Heart rate.

The same happens with the values of the rows in the tables. For example, in the Person table, in the gender_concept_id column, the value 8532 was translated automatically to FEMALE, and the value 8507 was translated to MALE:

ds.summary("person$gender_concept_id")

## $opal_demo
## $opal_demo$class
## [1] "factor"
## 
## $opal_demo$length
## [1] 100
## 
## $opal_demo$categories
## [1] "female" "male"  
## 
## $opal_demo$`count of 'female'`
## [1] 43
## 
## $opal_demo$`count of 'male'`
## [1] 57

This will happen with all concepts in the database as long as they are recognized and properly named in the Concept table of the database. Otherwise, the concept will remain as concept_id_ + the numeric value of the concept. For example, if the concept FEMALE was not properly named in the Concept table, it would remain as concept_id_8532 so that it can at least be identified.

2.1.3 Longitudinal data sequencing

When you have multiple records for the same entity (e.g., a patient) across time points, dsOMOP offers two parameters to transform and align these data in a wide format:

wideLongitudinal = TRUE reshapes repeated measurements for each concept into extra columns, so each row in your table corresponds to one entity (e.g., one patient).
completeTimePoints = TRUE further ensures that all entities share the same set of time points. Rows (and columns) for missing measurements at particular dates are filled with NA.

o$get(
  table = "measurement",
  conceptFilter = 3027018,
  columnFilter = c("measurement_date", "value_as_number"),
  wideLongitudinal = TRUE,
  completeTimePoints = TRUE
)

ds.summary("measurement")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 100
## 
## $opal_demo$`number of columns`
## [1] 9099
## 
## $opal_demo$`variables held`
##  [1] "person_id"                     "heart_rate.1.measurement_date"
##  [3] "heart_rate.1.value_as_number"  "heart_rate.2.measurement_date"
##  [5] "heart_rate.2.value_as_number"  "heart_rate.3.measurement_date"
##  [7] "heart_rate.3.value_as_number"  "heart_rate.4.measurement_date"
##  [9] "heart_rate.4.value_as_number"  "heart_rate.5.measurement_date"
##  [ reached getOption("max.print") -- omitted 9089 entries ]

Because DataSHIELD does not allow row-level inspection on the client side, you will not see the exact final structure of the data. Internally, however, dsOMOP arranges your dataset so that every individual is aligned on a consistent time axis. Below is a simplified illustration of how this reshaping would look for a small example of three patients (A, B, C). It does not reflect real data—merely a conceptual snapshot.

Case 1: wideLongitudinal = TRUE but completeTimePoints = FALSE

In this scenario, dsOMOP just generates a wide table indexed by the chronological order of each patient’s events. It does not force every patient to share the same time points. Suppose the patients have data as follows:

Patient A has 3 heart-rate measurements (Dates: 2023-01-01, 2023-01-15, 2023-01-31)
Patient B has 2 measurements (Dates: 2023-01-01, 2023-01-02)
Patient C has 1 measurement (Date: 2023-01-10)

The resulting table, conceptually, might look like this:

person_id	heart_rate.measurement_date.1	heart_rate.measurement_value.1	heart_rate.measurement_date.2	heart_rate.measurement_value.2	heart_rate.measurement_date.3	heart_rate.measurement_value.3
A	2023-01-01	70	2023-01-15	71	2023-01-31	72
B	2023-01-01	60	2023-01-02	65	NA	NA
C	2023-01-10	67	NA	NA	NA	NA

Columns “.1”, “.2”, “.3” represent the 1st, 2nd, and 3rd measurement recorded for each patient, in that patient’s own chronological order.
If a patient doesn’t have a second or third measurement, those columns are NA.
Different patients can have entirely different date values in, say, .2 or .3.

Case 2: wideLongitudinal = TRUE and completeTimePoints = TRUE

Now dsOMOP additionally ensures that all patients align on every time point found across all patients in the dataset. For instance, if the union of dates in your dataset is {2023-01-01, 2023-01-02, 2023-01-10, 2023-01-15, 2023-01-31}, then each row (i.e., each patient) will have columns for five distinct measurements, even if some are NA:

person_id	heart_rate.measurement_date.1	heart_rate.measurement_value.1	heart_rate.measurement_date.2	heart_rate.measurement_value.2	heart_rate.measurement_date.3	heart_rate.measurement_value.3	heart_rate.measurement_date.4	heart_rate.measurement_value.4	heart_rate.measurement_date.5	heart_rate.measurement_value.5
A	2023-01-01	70	NA	NA	2023-01-15	71	NA	NA	2023-01-31	72
B	2023-01-01	60	2023-01-02	65	NA	NA	NA	NA	NA	NA
C	NA	NA	NA	NA	2023-01-10	67	NA	NA	NA	NA

For privacy reasons, it still uses index-based columns (e.g., .1, .2, .3), but now behind the scenes, each suffix corresponds to the same date for every patient.
Patients missing a measurement on a particular date have NA in both the measurement_date and measurement_value columns for that slot.
This approach ensures a consistent date axis across all individuals, which is typically needed for time-series analysis or for creating time-aligned comparisons.

Practical usage notes

Date sorting: Under the hood, dsOMOP first discovers all unique dates in the data, sorts them in ascending order, and then indexes them as .1, .2, .3, ….
Flexibility: You can still opt not to align time points by leaving completeTimePoints as FALSE. This can be useful for scenarios where each patient’s internal timeline is more relevant than a global date alignment.
Merging: Whether you choose to use alignment or not, your wide-format table remains mergeable with other wide-format tables in DataSHIELD if they share the same person_id (or whichever merge column(s) you specify).

In short, wideLongitudinal provides a wide “spread” of repeated measurements within each row, and completeTimePoints ensures that these measurements line up on a shared calendar of dates. You can fine-tune these parameters to suit your analytic design.

2.2 Using the `get` method

The get method has several arguments that allow for the customization of the data that is retrieved. These arguments are:

table: the name of the table to be retrieved.
symbol: the name of the symbol that will be used to store the table in the server’s R environment. If not specified, the table will be stored in a symbol with the same name as the table.
conceptFilter: a numeric vector with the concept IDs to be retrieved. It can also be a single numeric value.
columnFilter: a character vector with the names of the columns to be retrieved. It can also be a single character value.
personFilter: the name of another symbol in the server’s R environment which contains the person IDs to be retrieved by the current get operation. If not specified, all existing person IDs will be retrieved.
mergeColumn: the name of the column that will be used to merge the table with another table. If not specified, the person_id column will be used as the default merge column.
dropNA: a logical value indicating whether empty columns should be automatically filtered out from the table. The default value is TRUE.

Note that only the table parameter is mandatory for retrieving a table. However, it is highly recommended to utilize the available data filtering parameters, as they can significantly enhance the performance of operations and ensure that only the relevant data for your study is selected.

2.2.1 Applying basic filters

This part is quite straightforward. We use the tables() command to know which tables we can take, columns() to know what columns we have available, and concepts() to understand what concepts are present as entities in the table and select those we wish to work with.

Once we have identified the specific data we are interested in analyzing, we can apply the columnFilter and conceptFilter parameters to retrieve the desired table.

Let’s proceed with an example.

We check the available tables:

o$tables()

## $opal_demo
##  [1] "attribute_definition" "care_site"            "cdm_source"          
##  [4] "cohort"               "cohort_attribute"     "cohort_definition"   
##  [7] "concept"              "concept_relationship" "condition_era"       
## [10] "condition_occurrence" "cost"                 "death"               
## [13] "device_exposure"      "dose_era"             "drug_era"            
## [16] "drug_exposure"        "fact_relationship"    "location"            
## [19] "measurement"          "metadata"             "note"                
## [22] "note_nlp"             "observation"          "observation_period"  
## [25] "payer_plan_period"    "person"               "procedure_occurrence"
## [28] "provider"             "specimen"             "visit_detail"        
## [31] "visit_occurrence"     "vocabulary"

Let’s say we want to retrieve some measurements from the Measurement table. We can use the concepts() command to see what concepts are available in this table:

o$concepts("measurement")

## $opal_demo
##   concept_id                                       concept_name
## 1          0                                No matching concept
## 2    1175625                            Breath rate spontaneous
## 3    3000067 Parathyrin.intact [Mass/volume] in Serum or Plasma
## 4    3000068                      oxyCODONE [Presence] in Urine
## 5    3000099  Nuclear Ab [Units/volume] in Serum by Immunoassay
##  [ reached 'max' / getOption("max.print") -- omitted 348 rows ]

We have decided to retrieve the information related to the concept 3010421 (which corresponds to pH of blood).

But we don’t want to retrieve all the columns from the table, so we can use the columns() command to see what columns are available in the Measurement table:

o$columns("measurement")

## $opal_demo
##  [1] "measurement_id"                "person_id"                    
##  [3] "measurement_concept_id"        "measurement_date"             
##  [5] "measurement_datetime"          "measurement_time"             
##  [7] "measurement_type_concept_id"   "operator_concept_id"          
##  [9] "value_as_number"               "value_as_concept_id"          
## [11] "unit_concept_id"               "range_low"                    
## [13] "range_high"                    "provider_id"                  
## [15] "visit_occurrence_id"           "visit_detail_id"              
## [17] "measurement_source_value"      "measurement_source_concept_id"
## [19] "unit_source_value"             "value_source_value"

We are only interested in the measurement_date and value_as_number from the Measurement table related to the concept ID 3010421. We can use the get method to retrieve this information:

o$get(table = "measurement",  # The table we want to retrieve
      symbol = "my_table",  # The name of the symbol that will store the table
      conceptFilter = 3010421,  # The concept we are interested in
      columnFilter = c("measurement_date", "value_as_concept_id"),  # The only columns we want to retrieve
      dropNA = FALSE  # Whether we want to drop empty columns or not
)

ds.summary("my_table")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 78
## 
## $opal_demo$`number of columns`
## [1] 3
## 
## $opal_demo$`variables held`
## [1] "person_id"                       "ph_of_blood.measurement_date"   
## [3] "ph_of_blood.value_as_concept_id"

Now this information is ready to be used within the DataSHIELD environment for any analysis or data manipulation operations that we want to perform. In the next chapter, we will see how to use the built-in functions of DataSHIELD’s dsBaseClient package to combine it with other tables.

2.2.2 Selecting a merge column

When retrieving a table, it is possible to specify a column that will be used to merge the table with another table. This is done using the mergeColumn argument. For example, when retrieving the Visit_detail table, we can specify that the column visit_occurrence_id will be used to merge this table with the Visit_occurrence table, and it will perform the reshaping operation to make this merge possible.

For this example, we will retrieve the Visit_detail table related to the concept ID 4149943 (which corresponds to Cardiac intensive care unit, meaning that we are interested in retrieving only the visits to this unit):

o$get(table = "visit_detail",
      conceptFilter = 4149943,  # `Cardiac intensive care unit`
      columnFilter = c("visit_detail_start_date", "visit_detail_end_date"),
      mergeColumn = "visit_occurrence_id")

ds.summary("visit_detail")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 39
## 
## $opal_demo$`number of columns`
## [1] 4
## 
## $opal_demo$`variables held`
## [1] "visit_occurrence_id"                                
## [2] "cardiac_intensive_care_unit.person_id"              
## [3] "cardiac_intensive_care_unit.visit_detail_start_date"
## [4] "cardiac_intensive_care_unit.visit_detail_end_date"

If no mergeColumn is specified, the get method will assume person_id as the default merge column, as this is the most common column that is used to merge tables in an OMOP CDM database.

2.2.3 Filtering by person IDs

The personFilter argument allows for the retrieval of data for a specific subset of person IDs. This can be useful when working with a large dataset and only needing to retrieve data for a specific group of individuals.

To use the personFilter parameter, we have to specify the symbol of another table that already exists in the R environment. This referenced table has to contain person_id values that may represent a subset of the person_id values in the table that we aim to retrieve. By applying this parameter, the retrieval process filters the target table to include only records corresponding to the person_id values present in the referenced table, thereby excluding any records associated with person_id values not contained in the referenced table.

Utilizing the subsetting functions provided by the dsBaseClient package can significantly enhance the versatility of filtering database records based on cohorts, conditions, or any other characteristics defining a subset of individuals. Furthermore, the dsOMOP package implements a disclosure control process that prevents the retrieval of tables containing information on a number of patients below a predefined threshold set in the server’s configuration. This ensures the security and confidentiality of the data at all times.

Let’s perform a simple example to illustrate the potential of this feature.

Suppose that, in this case, we are interested in analyzing data from patients who have records of the condition Chest pain (Concept ID 77670 in Condition_occurrence). We can save this information to a specific symbol and then pass it as the personFilter argument to the get method for the Person table:

o$get(table = "condition_occurrence",
      conceptFilter = 77670,  # `Chest pain`
      columnFilter = "condition_start_date",
      symbol = "chest_pain_condition")

ds.summary("chest_pain_condition")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 4
## 
## $opal_demo$`number of columns`
## [1] 2
## 
## $opal_demo$`variables held`
## [1] "person_id"                       "chest_pain.condition_start_date"

We stored the measurements under the chest_pain_condition symbol. Now we can use this table (referenced through the chest_pain_condition symbol) to filter the Person table and retrieve only the data related to these patients:

o$get(table = "person",
      symbol = "person_with_chest_pain",  # The name of the symbol that will store the table
      personFilter = "chest_pain_condition")  # The symbol that contains the only person IDs to be retrieved

Now we have a subset of the Person table that only contains the data related to the patients who have records of the condition Chest pain, stored in the person_with_chest_pain symbol. We can pass it to the personFilter argument of the get method for a third table to retrieve only the data related to these patients. For example, we can retrieve the Observation table using the person_with_chest_pain symbol as the personFilter, and it will only retrieve the data related to the the patients who are present in the person_with_chest_pain symbol:

o$get(table = "observation",
      symbol = "observation_from_person_with_chest_pain",
      columnFilter = "observation_date",
      personFilter = "person_with_chest_pain")

ds.summary("observation_from_person_with_chest_pain")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 4
## 
## $opal_demo$`number of columns`
## [1] 15
## 
## $opal_demo$`variables held`
##  [1] "person_id"                                           
##  [2] "primary_insurance.observation_date"                  
##  [3] "preferred_language.observation_date"                 
##  [4] "cost_containment.observation_date"                   
##  [5] "pain_severity_reported.observation_date"             
##  [6] "acceptable_pain_level_status.observation_date"       
##  [7] "indicators_of_pain_or_possible_pain.observation_date"
##  [8] "body_temperature_measurement_site.observation_date"  
##  [9] "pain_management_specialty.observation_date"          
## [10] "bowel_sounds.observation_date"                       
##  [ reached getOption("max.print") -- omitted 5 entries ]

The Observation table we retrieved only contains all the observation records related to the patients who have records of the condition Chest pain.

This way, we can easily filter the data we are interested in and work with a subset of the data that is relevant to our study. This can also be done with cohorts, conditions, and any other characteristics that define a subset of individuals. It is done at query time, which makes data retrieval efficient.

In the next article, we will see how to use the dsBaseClient package to combine the data we have retrieved to create a combined dataset that can be used for analysis.

2.1 Features of the get method