3. Integrating data into the workflow
03_integration.Rmd
Having explored how to retrieve tables using the get
method from dsOMOPClient
and its customization parameters,
it is now time to integrate these tables into the DataSHIELD workflow
properly. To achieve this, we will primarily use two commands from the
dsBaseClient
package:
-
ds.merge
, to combine tables into a single data frame. -
ds.dataFrameSubset
, to apply value filters to tables based on desired characteristics.
Make sure that you have the dsBaseClient
package
installed and loaded in your DataSHIELD environment:
library(dsBaseClient)
3.1 Merging tables
In this section, we will demonstrate how to combine the
Person
table and the Measurement
table. This
process involves retrieving the necessary data from both tables and then
merging them based on a common identifier.
First, we retrieve the Person
table:
o$get("person")
ds.summary("person")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 100
##
## $opal_demo$`number of columns`
## [1] 11
##
## $opal_demo$`variables held`
## [1] "person_id" "gender_concept_id" "year_of_birth"
## [4] "month_of_birth" "day_of_birth" "birth_datetime"
## [7] "race_concept_id" "ethnicity_concept_id" "location_id"
## [10] "provider_id" "care_site_id"
Then, we retrieve the Measurement
table. For this
example, we will only retrieve the numerical value
(value_as_number
) and the date of the measurement
(measurement_date
) for the concept
Dyastolic blood pressure
, which has a concept ID of
3012888
:
o$get("measurement",
columnFilter = c("value_as_number", "measurement_date"),
conceptFilter = c(3012888))
ds.summary("measurement")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 59
##
## $opal_demo$`number of columns`
## [1] 3
##
## $opal_demo$`variables held`
## [1] "person_id"
## [2] "diastolic_blood_pressure.measurement_date"
## [3] "diastolic_blood_pressure.value_as_number"
We can now merge the two tables based on the person_id
column, using the conventional ds.merge
command, which
adheres to DataSHIELD’s data disclosure rules:
ds.merge(
x.name = "person",
y.name = "measurement",
by.x.names = "person_id",
by.y.names = "person_id",
all.x = TRUE,
all.y = FALSE,
sort = FALSE,
suffixes = c(".x", ".y"),
no.dups = TRUE,
newobj = "merged",
datasources = conns
)
## $is.object.created
## [1] "A data object <merged> has been created in all specified data sources"
##
## $validity.check
## [1] "<merged> appears valid in all sources"
We can see the merged
table includes the columns from
both the Person
and Measurement
tables:
ds.summary("merged")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 100
##
## $opal_demo$`number of columns`
## [1] 13
##
## $opal_demo$`variables held`
## [1] "person_id"
## [2] "gender_concept_id"
## [3] "year_of_birth"
## [4] "month_of_birth"
## [5] "day_of_birth"
## [6] "birth_datetime"
## [7] "race_concept_id"
## [8] "ethnicity_concept_id"
## [9] "location_id"
## [10] "provider_id"
## [11] "care_site_id"
## [12] "diastolic_blood_pressure.measurement_date"
## [13] "diastolic_blood_pressure.value_as_number"
This is how we can merge tables in DataSHIELD. We could also merge more tables by repeating the process with additional tables and the combined table.
3.1.1 Auxiliar functions and packages
Given the complexity that the previous operations could pose when
having to work with multiple concepts and tables, it is recommended to
use and develop auxiliary packages that encapsulate and automate the
interaction with dsOMOPClient
, which only acts as an
interface with the database data, to meet the research needs of each
specific case.
Below we can see an example of how dsOMOPHelper
, a
package that encapsulates the interaction with
dsOMOPClient
, can automatically join multiple tables in
batch by the desired concepts using the auto
function.
For this example, we will retrieve the concepts
Tobacco use
(identified by the concept ID
4005823
in the Observation
table) and
Chronic obstructive pulmonary disease
(identified by the
concept ID 255573
in the Condition_occurrence
table):
library(dsOMOPHelper)
h <- ds.omop.helper(connections = conns,
resource = "omop_demo.mimiciv",
symbol = "automatic_table")
h$auto(tables = c("observation", "condition_occurrence"),
columns = c("observation_date", "condition_start_date"),
concepts = c(4005823, # `Tobacco use`
255573)) # `Chronic obstructive pulmonary disease`
ds.summary("automatic_table")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 100
##
## $opal_demo$`number of columns`
## [1] 7
##
## $opal_demo$`variables held`
## [1] "person_id"
## [2] "gender_concept_id"
## [3] "year_of_birth"
## [4] "race_concept_id"
## [5] "ethnicity_concept_id"
## [6] "tobacco_user.observation_date"
## [7] "chronic_obstructive_lung_disease.condition_start_date"
This is precisely why we encourage the community to develop their own
packages on top of dsOMOPClient
. Creating custom packages
with specific workflows allows researchers and developers to tailor the
functionality of dsOMOPClient
to easily meet the specific
needs of their projects. For more information about
dsOMOPHelper
and how it can automate the process of working
with OMOP CDM databases in DataSHIELD, please visit its GitHub
repository.
3.2 Subsetting tables
Another fundamental operation in data selection and dataset
construction that may be required for specific research contexts is the
ds.dataFrameSubset
operation from the
dsBaseClient
package. It is used to filter tables based on
specific criteria, such as gender, age, or other characteristics. This
operation can be used to determine a specific cohort of individuals
based on their characteristics so that they can be used for further data
retrieval and analysis.
Let’s see with a simple example how we can filter, using the subset operation, a dataset to only include female individuals.
First, we get the Person
table, which contains
gender_concept_id
, and verify that this column includes
values for female
and male
.
o$get("person")
ds.summary("person$gender_concept_id")
## $opal_demo
## $opal_demo$class
## [1] "factor"
##
## $opal_demo$length
## [1] 100
##
## $opal_demo$categories
## [1] "female" "male"
##
## $opal_demo$`count of 'female'`
## [1] 43
##
## $opal_demo$`count of 'male'`
## [1] 57
From this point on, we will work with the standard DataSHIELD
procedures to filter the data. First, we create a
gender_filter
object that contains the value
female
:
ds.make(newobj = "gender_filter", toAssign = "c('female')")
## $is.object.created
## [1] "A data object <gender_filter> has been created in all specified data sources"
##
## $validity.check
## [1] "<gender_filter> appears valid in all sources"
Then, we use the ds.dataFrameSubset
function to apply
the gender_filter
to the gender_concept_id
column of the Person
table:
ds.dataFrameSubset(
df.name = "person",
V1.name = "person$gender_concept_id",
V2.name = "gender_filter",
Boolean.operator = "==",
newobj = "person_female",
datasources = conns,
notify.of.progress = FALSE
)
## $is.object.created
## [1] "A data object <person_female> has been created in all specified data sources"
##
## $validity.check
## [1] "<person_female> appears valid in all sources"
Thus, we can verify that the resulting table
person_female
only contains individuals with a
gender_concept_id
value of female
:
ds.summary("person_female$gender_concept_id")
## $opal_demo
## $opal_demo$class
## [1] "factor"
##
## $opal_demo$length
## [1] 43
##
## $opal_demo$categories
## [1] "female" "male"
##
## $opal_demo$`count of 'female'`
## [1] 43
##
## $opal_demo$`count of 'male'`
## [1] 0
Once the person_female
subset is obtained, we can pass
it as the personFilter
parameter to the get
function to retrieve another table, in this case,
Measurement
. In this case, we are going to retrieve the
measurements of Mean blood pressure
, which has a concept ID
of 3027598
. The resulting table will contain only the
measurements from individuals with a gender_concept_id
value of female
:
o$get(table = "measurement",
symbol = "measurement_female",
columnFilter = c("value_as_number", "measurement_date"),
conceptFilter = c(3027598), # `Mean blood pressure`
personFilter = "person_female")
ds.summary("measurement_female")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 21
##
## $opal_demo$`number of columns`
## [1] 3
##
## $opal_demo$`variables held`
## [1] "person_id"
## [2] "mean_blood_pressure.measurement_date"
## [3] "mean_blood_pressure.value_as_number"
We can now merge the person_female
and
measurement_female
tables based on the
person_id
column:
ds.merge(
x.name = "person_female",
y.name = "measurement_female",
by.x.names = "person_id",
by.y.names = "person_id",
all.x = TRUE,
all.y = FALSE,
sort = FALSE,
suffixes = c(".x", ".y"),
no.dups = TRUE,
newobj = "merged_female",
datasources = conns
)
## $is.object.created
## [1] "A data object <merged_female> has been created in all specified data sources"
##
## $validity.check
## [1] "<merged_female> appears valid in all sources"
We can see the merged table includes the columns from both the
Person
and Measurement
tables, but only for
individuals with a gender_concept_id
value of
female
:
# General summary of the resulting table
ds.summary("merged_female")
## $opal_demo
## $opal_demo$class
## [1] "data.frame"
##
## $opal_demo$`number of rows`
## [1] 43
##
## $opal_demo$`number of columns`
## [1] 13
##
## $opal_demo$`variables held`
## [1] "person_id"
## [2] "gender_concept_id"
## [3] "year_of_birth"
## [4] "month_of_birth"
## [5] "day_of_birth"
## [6] "birth_datetime"
## [7] "race_concept_id"
## [8] "ethnicity_concept_id"
## [9] "location_id"
## [10] "provider_id"
## [11] "care_site_id"
## [12] "mean_blood_pressure.measurement_date"
## [13] "mean_blood_pressure.value_as_number"
# Summary of the `gender_concept_id` column to verify that it only includes `female` values
ds.summary("merged_female$gender_concept_id")
## $opal_demo
## $opal_demo$class
## [1] "factor"
##
## $opal_demo$length
## [1] 43
##
## $opal_demo$categories
## [1] "female" "male"
##
## $opal_demo$`count of 'female'`
## [1] 43
##
## $opal_demo$`count of 'male'`
## [1] 0
This is how we can use DataSHIELD’s standard methods to subset tables
obtained from OMOP CDM databases. We can achieve complex filters by
applying more sophisticated ds.dataFrameSubset
logics and
propagating them to the desired tables through the
personFilter
parameter of the get
method.
Logout
We have arrived at the end of these vignettes. If we have finished working with our DataSHIELD session, it is a good practice to disconnect from the DataSHIELD server to avoid any potential issues:
datashield.logout(conns)