3. Integrating data into the workflow
03_integration.RmdHaving explored how to retrieve tables using the get
method from dsOMOPClient and its customization parameters,
it is now time to integrate these tables into the DataSHIELD workflow
properly. To achieve this, we will primarily use two commands from the
dsBaseClient package:
-
ds.merge, to combine tables into a single data frame. -
ds.dataFrameSubset, to apply value filters to tables based on desired characteristics.
Make sure that you have the dsBaseClient package
installed and loaded in your DataSHIELD environment:
library(dsBaseClient)3.1 Merging tables
In this section, we will demonstrate how to combine the
Person table and the Measurement table. This
process involves retrieving the necessary data from both tables and then
merging them based on a common identifier.
First, we retrieve the Person table:
o$get("person")
ds.summary("person")## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 100
##
## $opal_demo$`number of columns`
## [1] 11
##
## $opal_demo$`variables held`
## [1] "person_id" "gender_concept_id" "year_of_birth"
## [4] "month_of_birth" "day_of_birth" "birth_datetime"
## [7] "race_concept_id" "ethnicity_concept_id" "location_id"
## [10] "provider_id" "care_site_id"
Then, we retrieve the Measurement table. For this
example, we will only retrieve the numerical value
(value_as_number) and the date of the measurement
(measurement_date) for the concept
Dyastolic blood pressure, which has a concept ID of
3012888:
o$get("measurement",
columnFilter = c("value_as_number", "measurement_date"),
conceptFilter = c(3012888))
ds.summary("measurement")## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 59
##
## $opal_demo$`number of columns`
## [1] 3
##
## $opal_demo$`variables held`
## [1] "person_id"
## [2] "diastolic_blood_pressure.measurement_date"
## [3] "diastolic_blood_pressure.value_as_number"
We can now merge the two tables based on the person_id
column, using the conventional ds.merge command, which
adheres to DataSHIELD’s data disclosure rules:
ds.merge(
x.name = "person",
y.name = "measurement",
by.x.names = "person_id",
by.y.names = "person_id",
all.x = TRUE,
all.y = FALSE,
sort = FALSE,
suffixes = c(".x", ".y"),
no.dups = TRUE,
newobj = "merged",
datasources = conns
)## $is.object.created
## [1] "A data object <merged> has been created in all specified data sources"
##
## $validity.check
## [1] "<merged> appears valid in all sources"
We can see the merged table includes the columns from
both the Person and Measurement tables:
ds.summary("merged")## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 100
##
## $opal_demo$`number of columns`
## [1] 13
##
## $opal_demo$`variables held`
## [1] "person_id"
## [2] "gender_concept_id"
## [3] "year_of_birth"
## [4] "month_of_birth"
## [5] "day_of_birth"
## [6] "birth_datetime"
## [7] "race_concept_id"
## [8] "ethnicity_concept_id"
## [9] "location_id"
## [10] "provider_id"
## [11] "care_site_id"
## [12] "diastolic_blood_pressure.measurement_date"
## [13] "diastolic_blood_pressure.value_as_number"
This is how we can merge tables in DataSHIELD. We could also merge more tables by repeating the process with additional tables and the combined table.
3.1.1 Auxiliar functions and packages
Given the complexity that the previous operations could pose when
having to work with multiple concepts and tables, it is recommended to
use and develop auxiliary packages that encapsulate and automate the
interaction with dsOMOPClient, which only acts as an
interface with the database data, to meet the research needs of each
specific case.
Below we can see an example of how dsOMOPHelper, a
package that encapsulates the interaction with
dsOMOPClient, can automatically join multiple tables in
batch by the desired concepts using the auto function.
For this example, we will retrieve the concepts
Tobacco use (identified by the concept ID
4005823 in the Observation table) and
Chronic obstructive pulmonary disease (identified by the
concept ID 255573 in the Condition_occurrence
table):
library(dsOMOPHelper)
h <- ds.omop.helper(connections = conns,
resource = "omop_demo.mimiciv",
symbol = "automatic_table")
h$auto(tables = c("observation", "condition_occurrence"),
columns = c("observation_date", "condition_start_date"),
concepts = c(4005823, # `Tobacco use`
255573)) # `Chronic obstructive pulmonary disease`
ds.summary("automatic_table")## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 100
##
## $opal_demo$`number of columns`
## [1] 7
##
## $opal_demo$`variables held`
## [1] "person_id"
## [2] "gender_concept_id"
## [3] "year_of_birth"
## [4] "race_concept_id"
## [5] "ethnicity_concept_id"
## [6] "tobacco_user.observation_date"
## [7] "chronic_obstructive_lung_disease.condition_start_date"
This is precisely why we encourage the community to develop their own
packages on top of dsOMOPClient. Creating custom packages
with specific workflows allows researchers and developers to tailor the
functionality of dsOMOPClient to easily meet the specific
needs of their projects. For more information about
dsOMOPHelper and how it can automate the process of working
with OMOP CDM databases in DataSHIELD, please visit its GitHub
repository.
3.2 Subsetting tables
Another fundamental operation in data selection and dataset
construction that may be required for specific research contexts is the
ds.dataFrameSubset operation from the
dsBaseClient package. It is used to filter tables based on
specific criteria, such as gender, age, or other characteristics. This
operation can be used to determine a specific cohort of individuals
based on their characteristics so that they can be used for further data
retrieval and analysis.
Let’s see with a simple example how we can filter, using the subset operation, a dataset to only include female individuals.
First, we get the Person table, which contains
gender_concept_id, and verify that this column includes
values for female and male.
o$get("person")
ds.summary("person$gender_concept_id")## $opal_demo
## $opal_demo$class
## [1] "factor"
##
## $opal_demo$length
## [1] 100
##
## $opal_demo$categories
## [1] "female" "male"
##
## $opal_demo$`count of 'female'`
## [1] 43
##
## $opal_demo$`count of 'male'`
## [1] 57
From this point on, we will work with the standard DataSHIELD
procedures to filter the data. First, we create a
gender_filter object that contains the value
female:
ds.make(newobj = "gender_filter", toAssign = "c('female')")## $is.object.created
## [1] "A data object <gender_filter> has been created in all specified data sources"
##
## $validity.check
## [1] "<gender_filter> appears valid in all sources"
Then, we use the ds.dataFrameSubset function to apply
the gender_filter to the gender_concept_id
column of the Person table:
ds.dataFrameSubset(
df.name = "person",
V1.name = "person$gender_concept_id",
V2.name = "gender_filter",
Boolean.operator = "==",
newobj = "person_female",
datasources = conns,
notify.of.progress = FALSE
)## $is.object.created
## [1] "A data object <person_female> has been created in all specified data sources"
##
## $validity.check
## [1] "<person_female> appears valid in all sources"
Thus, we can verify that the resulting table
person_female only contains individuals with a
gender_concept_id value of female:
ds.summary("person_female$gender_concept_id")## $opal_demo
## $opal_demo$class
## [1] "factor"
##
## $opal_demo$length
## [1] 43
##
## $opal_demo$categories
## [1] "female" "male"
##
## $opal_demo$`count of 'female'`
## [1] 43
##
## $opal_demo$`count of 'male'`
## [1] 0
Once the person_female subset is obtained, we can pass
it as the personFilter parameter to the get
function to retrieve another table, in this case,
Measurement. In this case, we are going to retrieve the
measurements of Mean blood pressure, which has a concept ID
of 3027598. The resulting table will contain only the
measurements from individuals with a gender_concept_id
value of female:
o$get(table = "measurement",
symbol = "measurement_female",
columnFilter = c("value_as_number", "measurement_date"),
conceptFilter = c(3027598), # `Mean blood pressure`
personFilter = "person_female")
ds.summary("measurement_female")## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 21
##
## $opal_demo$`number of columns`
## [1] 3
##
## $opal_demo$`variables held`
## [1] "person_id"
## [2] "mean_blood_pressure.measurement_date"
## [3] "mean_blood_pressure.value_as_number"
We can now merge the person_female and
measurement_female tables based on the
person_id column:
ds.merge(
x.name = "person_female",
y.name = "measurement_female",
by.x.names = "person_id",
by.y.names = "person_id",
all.x = TRUE,
all.y = FALSE,
sort = FALSE,
suffixes = c(".x", ".y"),
no.dups = TRUE,
newobj = "merged_female",
datasources = conns
)## $is.object.created
## [1] "A data object <merged_female> has been created in all specified data sources"
##
## $validity.check
## [1] "<merged_female> appears valid in all sources"
We can see the merged table includes the columns from both the
Person and Measurement tables, but only for
individuals with a gender_concept_id value of
female:
# General summary of the resulting table
ds.summary("merged_female")## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 43
##
## $opal_demo$`number of columns`
## [1] 13
##
## $opal_demo$`variables held`
## [1] "person_id"
## [2] "gender_concept_id"
## [3] "year_of_birth"
## [4] "month_of_birth"
## [5] "day_of_birth"
## [6] "birth_datetime"
## [7] "race_concept_id"
## [8] "ethnicity_concept_id"
## [9] "location_id"
## [10] "provider_id"
## [11] "care_site_id"
## [12] "mean_blood_pressure.measurement_date"
## [13] "mean_blood_pressure.value_as_number"
# Summary of the `gender_concept_id` column to verify that it only includes `female` values
ds.summary("merged_female$gender_concept_id")## $opal_demo
## $opal_demo$class
## [1] "factor"
##
## $opal_demo$length
## [1] 43
##
## $opal_demo$categories
## [1] "female" "male"
##
## $opal_demo$`count of 'female'`
## [1] 43
##
## $opal_demo$`count of 'male'`
## [1] 0
This is how we can use DataSHIELD’s standard methods to subset tables
obtained from OMOP CDM databases. We can achieve complex filters by
applying more sophisticated ds.dataFrameSubset logics and
propagating them to the desired tables through the
personFilter parameter of the get method.
Logout
We have arrived at the end of these vignettes. If we have finished working with our DataSHIELD session, it is a good practice to disconnect from the DataSHIELD server to avoid any potential issues:
datashield.logout(conns)