3. Integrating data into the workflow • dsOMOPClient

Having explored how to retrieve tables using the get method from dsOMOPClient and its customization parameters, it is now time to integrate these tables into the DataSHIELD workflow properly. To achieve this, we will primarily use two commands from the dsBaseClient package:

ds.merge, to combine tables into a single data frame.
ds.dataFrameSubset, to apply value filters to tables based on desired characteristics.

Make sure that you have the dsBaseClient package installed and loaded in your DataSHIELD environment:

library(dsBaseClient)

3.1 Merging tables

In this section, we will demonstrate how to combine the Person table and the Measurement table. This process involves retrieving the necessary data from both tables and then merging them based on a common identifier.

First, we retrieve the Person table:

o$get("person")

ds.summary("person")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 100
## 
## $opal_demo$`number of columns`
## [1] 11
## 
## $opal_demo$`variables held`
##  [1] "person_id"            "gender_concept_id"    "year_of_birth"       
##  [4] "month_of_birth"       "day_of_birth"         "birth_datetime"      
##  [7] "race_concept_id"      "ethnicity_concept_id" "location_id"         
## [10] "provider_id"          "care_site_id"

Then, we retrieve the Measurement table. For this example, we will only retrieve the numerical value (value_as_number) and the date of the measurement (measurement_date) for the concept Dyastolic blood pressure, which has a concept ID of 3012888:

o$get("measurement",
      columnFilter = c("value_as_number", "measurement_date"),
      conceptFilter = c(3012888))

ds.summary("measurement")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 59
## 
## $opal_demo$`number of columns`
## [1] 3
## 
## $opal_demo$`variables held`
## [1] "person_id"                                
## [2] "diastolic_blood_pressure.measurement_date"
## [3] "diastolic_blood_pressure.value_as_number"

We can now merge the two tables based on the person_id column, using the conventional ds.merge command, which adheres to DataSHIELD’s data disclosure rules:

ds.merge(
  x.name = "person",
  y.name = "measurement",
  by.x.names = "person_id",
  by.y.names = "person_id",
  all.x = TRUE,
  all.y = FALSE,
  sort = FALSE,
  suffixes = c(".x", ".y"),
  no.dups = TRUE,
  newobj = "merged",
  datasources = conns
)

## $is.object.created
## [1] "A data object <merged> has been created in all specified data sources"
## 
## $validity.check
## [1] "<merged> appears valid in all sources"

We can see the merged table includes the columns from both the Person and Measurement tables:

ds.summary("merged")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 100
## 
## $opal_demo$`number of columns`
## [1] 13
## 
## $opal_demo$`variables held`
##  [1] "person_id"                                
##  [2] "gender_concept_id"                        
##  [3] "year_of_birth"                            
##  [4] "month_of_birth"                           
##  [5] "day_of_birth"                             
##  [6] "birth_datetime"                           
##  [7] "race_concept_id"                          
##  [8] "ethnicity_concept_id"                     
##  [9] "location_id"                              
## [10] "provider_id"                              
## [11] "care_site_id"                             
## [12] "diastolic_blood_pressure.measurement_date"
## [13] "diastolic_blood_pressure.value_as_number"

This is how we can merge tables in DataSHIELD. We could also merge more tables by repeating the process with additional tables and the combined table.

3.1.1 Auxiliar functions and packages

Given the complexity that the previous operations could pose when having to work with multiple concepts and tables, it is recommended to use and develop auxiliary packages that encapsulate and automate the interaction with dsOMOPClient, which only acts as an interface with the database data, to meet the research needs of each specific case.

Below we can see an example of how dsOMOPHelper, a package that encapsulates the interaction with dsOMOPClient, can automatically join multiple tables in batch by the desired concepts using the auto function.

For this example, we will retrieve the concepts Tobacco use (identified by the concept ID 4005823 in the Observation table) and Chronic obstructive pulmonary disease (identified by the concept ID 255573 in the Condition_occurrence table):

library(dsOMOPHelper)

h <- ds.omop.helper(connections = conns,
                    resource = "omop_demo.mimiciv",
                    symbol = "automatic_table")

h$auto(tables = c("observation", "condition_occurrence"),
       columns = c("observation_date", "condition_start_date"),
       concepts = c(4005823,  # `Tobacco use`
                    255573))  # `Chronic obstructive pulmonary disease`

ds.summary("automatic_table")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 100
## 
## $opal_demo$`number of columns`
## [1] 7
## 
## $opal_demo$`variables held`
## [1] "person_id"                                            
## [2] "gender_concept_id"                                    
## [3] "year_of_birth"                                        
## [4] "race_concept_id"                                      
## [5] "ethnicity_concept_id"                                 
## [6] "tobacco_user.observation_date"                        
## [7] "chronic_obstructive_lung_disease.condition_start_date"

This is precisely why we encourage the community to develop their own packages on top of dsOMOPClient. Creating custom packages with specific workflows allows researchers and developers to tailor the functionality of dsOMOPClient to easily meet the specific needs of their projects. For more information about dsOMOPHelper and how it can automate the process of working with OMOP CDM databases in DataSHIELD, please visit its GitHub repository.

3.2 Subsetting tables

Another fundamental operation in data selection and dataset construction that may be required for specific research contexts is the ds.dataFrameSubset operation from the dsBaseClient package. It is used to filter tables based on specific criteria, such as gender, age, or other characteristics. This operation can be used to determine a specific cohort of individuals based on their characteristics so that they can be used for further data retrieval and analysis.

Let’s see with a simple example how we can filter, using the subset operation, a dataset to only include female individuals.

First, we get the Person table, which contains gender_concept_id, and verify that this column includes values for female and male.

o$get("person")

ds.summary("person$gender_concept_id")

## $opal_demo
## $opal_demo$class
## [1] "factor"
## 
## $opal_demo$length
## [1] 100
## 
## $opal_demo$categories
## [1] "female" "male"  
## 
## $opal_demo$`count of 'female'`
## [1] 43
## 
## $opal_demo$`count of 'male'`
## [1] 57

From this point on, we will work with the standard DataSHIELD procedures to filter the data. First, we create a gender_filter object that contains the value female:

ds.make(newobj = "gender_filter", toAssign = "c('female')")

## $is.object.created
## [1] "A data object <gender_filter> has been created in all specified data sources"
## 
## $validity.check
## [1] "<gender_filter> appears valid in all sources"

Then, we use the ds.dataFrameSubset function to apply the gender_filter to the gender_concept_id column of the Person table:

ds.dataFrameSubset(
  df.name = "person",
  V1.name = "person$gender_concept_id",
  V2.name = "gender_filter",
  Boolean.operator = "==",
  newobj = "person_female",
  datasources = conns,
  notify.of.progress = FALSE
)

## $is.object.created
## [1] "A data object <person_female> has been created in all specified data sources"
## 
## $validity.check
## [1] "<person_female> appears valid in all sources"

Thus, we can verify that the resulting table person_female only contains individuals with a gender_concept_id value of female:

ds.summary("person_female$gender_concept_id")

## $opal_demo
## $opal_demo$class
## [1] "factor"
## 
## $opal_demo$length
## [1] 43
## 
## $opal_demo$categories
## [1] "female" "male"  
## 
## $opal_demo$`count of 'female'`
## [1] 43
## 
## $opal_demo$`count of 'male'`
## [1] 0

Once the person_female subset is obtained, we can pass it as the personFilter parameter to the get function to retrieve another table, in this case, Measurement. In this case, we are going to retrieve the measurements of Mean blood pressure, which has a concept ID of 3027598. The resulting table will contain only the measurements from individuals with a gender_concept_id value of female:

o$get(table = "measurement",
      symbol = "measurement_female",
      columnFilter = c("value_as_number", "measurement_date"),
      conceptFilter = c(3027598),  # `Mean blood pressure`
      personFilter = "person_female")

ds.summary("measurement_female")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 21
## 
## $opal_demo$`number of columns`
## [1] 3
## 
## $opal_demo$`variables held`
## [1] "person_id"                           
## [2] "mean_blood_pressure.measurement_date"
## [3] "mean_blood_pressure.value_as_number"

We can now merge the person_female and measurement_female tables based on the person_id column:

ds.merge(
  x.name = "person_female",
  y.name = "measurement_female",
  by.x.names = "person_id",
  by.y.names = "person_id",
  all.x = TRUE,
  all.y = FALSE,
  sort = FALSE,
  suffixes = c(".x", ".y"),
  no.dups = TRUE,
  newobj = "merged_female",
  datasources = conns
)

## $is.object.created
## [1] "A data object <merged_female> has been created in all specified data sources"
## 
## $validity.check
## [1] "<merged_female> appears valid in all sources"

We can see the merged table includes the columns from both the Person and Measurement tables, but only for individuals with a gender_concept_id value of female:

# General summary of the resulting table
ds.summary("merged_female")

## $opal_demo
## $opal_demo$class
## [1] "data.frame"
## 
## $opal_demo$`number of rows`
## [1] 43
## 
## $opal_demo$`number of columns`
## [1] 13
## 
## $opal_demo$`variables held`
##  [1] "person_id"                           
##  [2] "gender_concept_id"                   
##  [3] "year_of_birth"                       
##  [4] "month_of_birth"                      
##  [5] "day_of_birth"                        
##  [6] "birth_datetime"                      
##  [7] "race_concept_id"                     
##  [8] "ethnicity_concept_id"                
##  [9] "location_id"                         
## [10] "provider_id"                         
## [11] "care_site_id"                        
## [12] "mean_blood_pressure.measurement_date"
## [13] "mean_blood_pressure.value_as_number"

# Summary of the `gender_concept_id` column to verify that it only includes `female` values
ds.summary("merged_female$gender_concept_id")

## $opal_demo
## $opal_demo$class
## [1] "factor"
## 
## $opal_demo$length
## [1] 43
## 
## $opal_demo$categories
## [1] "female" "male"  
## 
## $opal_demo$`count of 'female'`
## [1] 43
## 
## $opal_demo$`count of 'male'`
## [1] 0

This is how we can use DataSHIELD’s standard methods to subset tables obtained from OMOP CDM databases. We can achieve complex filters by applying more sophisticated ds.dataFrameSubset logics and propagating them to the desired tables through the personFilter parameter of the get method.

Logout

We have arrived at the end of these vignettes. If we have finished working with our DataSHIELD session, it is a good practice to disconnect from the DataSHIELD server to avoid any potential issues:

datashield.logout(conns)