
Introduction

In this analysis, we explore the relationship between chronic obstructive pulmonary disease (COPD) and several key predictors using a synthetic dataset. Our goal is to demonstrate how dsOMOP can be used to analyze clinical data while validating known clinical associations.

We use the Tufts Synthetic Dataset, which consists of fully synthetic electronic health record (EHR) data representing 576,750 synthetic patients. The dataset was generated in 2021 through a collaboration between Syntegra, Inc. and Tufts Medical Center, using a deep learning transformer model trained on real-world EHR data from the Tufts Research Data Warehouse (TRDW), which had already been transformed into the OMOP CDM format. The TRDW includes longitudinal clinical data from patients who received care across Tufts Medicine’s three hospitals, its 40-practice physician network, and its home health care organization.

The Tufts Synthetic Dataset contains clinical information such as patient visits, conditions, medications, laboratory measurements, procedures, observations, and device exposures, all structured according to the OMOP CDM version 5.3. The large volume of patient data, along with the realistic nature of the synthetic dataset, makes it ideal for testing the functionality of dsOMOP by exploring significant associations and patterns within the data.

Based on their relevance in the literature and their availability in our dataset, we focus on the following well-established predictors of COPD:

  • Tobacco use
  • Vitamin D deficiency
  • History of asthma
  • History of rheumatoid arthritis

The analysis is conducted using two distinct scenarios to validate the federated approach:

  1. A centralized analysis using the complete Tufts Synthetic Dataset on a single Opal server.
  2. A distributed analysis where the dataset is split across three separate Opal servers, each containing a subset of patients, connected through DataSHIELD.
Figure 1: Centralized (1) and distributed (2) architectures for the COPD analysis

Centralized analysis

Server connection

We start by establishing a connection with ISGlobal’s BRGE development Opal server. This Opal server is configured to allow access to a resource with the full version of the Tufts Synthetic Dataset:

library(DSI)
library(DSOpal)
library(dsBaseClient)
library(dsOMOPClient)
library(dsOMOPHelper)

builder <- newDSLoginBuilder()

builder$append(server = "brge",
               url = "https://opal.isglobal.org/brge",
               user = "dsuser",
               password = "P@ssw0rd",
               driver = "OpalDriver")

logindata <- builder$build()
conns <- datashield.login(logins = logindata)

We create an instance of dsOMOPHelper to interact with the database and build our desired dataset:

o <- ds.omop.helper(
    connections = conns,
    resource = "omop_demo.tufts", 
    symbol = "tufts"
)

Dataset construction

Variable definition

We define the variables that will be present in our dataset. These variables include our outcome variable and the predictor variables we will use in our generalized linear model (GLM) for COPD:

# Outcome variable
outcome_concept_id <- 255573 # Chronic Obstructive Pulmonary Disease (COPD)

# Predictor variables
observation_list <- c(
    4005823 # Tobacco user
)

condition_list <- c(
    317009, # History of asthma
    436070, # Vitamin D deficiency
    80809   # Rheumatoid arthritis
)

Data retrieval

We use dsOMOPHelper’s auto function to retrieve the defined variables from the tables condition_occurrence and observation:

o$auto(
    table = c("condition_occurrence", "observation"), 
    concepts = c(outcome_concept_id, condition_list, observation_list),
    columns = c("condition_occurrence_id", "observation_id") 
    # We only want the IDs to act as a boolean indicating 
    # the presence of the condition occurrence or observation
)

Data type conversions

DataSHIELD cannot convert an ID stored as a string directly into a boolean, so the conversion involves two steps:

  1. Transforming the ID to numeric
  2. Transforming the numeric ID to boolean

We define a function to perform these steps automatically for every variable:

convert_to_boolean <- function(table, variable_name, id_type, conns) {
  # Construct the full variable name in the format "table$variable_name.id_type"
  full_variable_name <- paste0(table, "$", variable_name, ".", id_type)

  # Create a new variable name for the numeric conversion
  new_numeric_name <- paste0(variable_name, "_numeric")
  
  # Convert the original variable to numeric
  ds.asNumeric(
    x.name = full_variable_name, 
    newobj = new_numeric_name, 
    datasources = conns
  )

  # Convert the numeric variable to boolean
  # True (1) if not equal to 0, False (0) otherwise
  # NA values are assigned 0
  ds.Boole(
    V1 = new_numeric_name, 
    V2 = 0, 
    Boolean.operator = "!=", 
    numeric.output = TRUE, 
    na.assign = 0, 
    newobj = variable_name,
    datasources = conns
  )
}

This allows us to transform the selected variables to boolean format:

# Convert tobacco use observation to boolean
convert_to_boolean("tufts", "tobacco_user", 
                   "observation_id", conns)

# Convert condition occurrences to boolean
convert_to_boolean("tufts", "asthma", 
                   "condition_occurrence_id", conns)
convert_to_boolean("tufts", "rheumatoid_arthritis", 
                   "condition_occurrence_id", conns)
convert_to_boolean("tufts", "vitamin_d_deficiency", 
                   "condition_occurrence_id", conns)

# Convert outcome variable (COPD) to boolean
convert_to_boolean("tufts", "chronic_obstructive_pulmonary_disease", 
                   "condition_occurrence_id", conns)
## $is.object.created
## [1] "A data object <tobacco_user> has been created in all specified data sources"
## 
## $validity.check
## [1] "<tobacco_user> appears valid in all sources"
## 
## $is.object.created
## [1] "A data object <asthma> has been created in all specified data sources"
## 
## $validity.check
## [1] "<asthma> appears valid in all sources"
## 
## $is.object.created
## [1] "A data object <rheumatoid_arthritis> has been created in all specified data sources"
## 
## $validity.check
## [1] "<rheumatoid_arthritis> appears valid in all sources"
## 
## $is.object.created
## [1] "A data object <vitamin_d_deficiency> has been created in all specified data sources"
## 
## $validity.check
## [1] "<vitamin_d_deficiency> appears valid in all sources"
## 
## $is.object.created
## [1] "A data object <chronic_obstructive_pulmonary_disease> has been created in all specified data sources"
## 
## $validity.check
## [1] "<chronic_obstructive_pulmonary_disease> appears valid in all sources"

Create table for the GLM

We create a new table with the prepared outcome and predictors for the GLM:

# Define the variable names as they appear in the prepared dataset
variable_names <- c(
  "tobacco_user",
  "asthma",
  "rheumatoid_arthritis",
  "vitamin_d_deficiency",
  "chronic_obstructive_pulmonary_disease"
)

# Create a new table with the prepared outcome and predictors
ds.cbind(
  x = variable_names,
  DataSHIELD.checks = FALSE,
  newobj = "glm_table",
  datasources = conns
)

# Check the structure of the resulting table to ensure it has been created correctly
ds.summary("glm_table")
## $is.object.created
## [1] "A data object <glm_table> has been created in all specified data sources"
## 
## $validity.check
## [1] "<glm_table> appears valid in all sources"
## 
## $brge
## $brge$class
## [1] "data.frame"
## 
## $brge$`number of rows`
## [1] 576750
## 
## $brge$`number of columns`
## [1] 5
## 
## $brge$`variables held`
## [1] "tobacco_user"                         
## [2] "asthma"                               
## [3] "rheumatoid_arthritis"                 
## [4] "vitamin_d_deficiency"                 
## [5] "chronic_obstructive_pulmonary_disease"

Generalized Linear Model

We can now run the final step and fit the GLM on our prepared dataset:

# Define the formula for the GLM
formula <- paste0(
  "chronic_obstructive_pulmonary_disease ~ ", 
  paste(
    c(
      "tobacco_user",
      "asthma",
      "rheumatoid_arthritis",
      "vitamin_d_deficiency"
    ),
    collapse = " + "
  )
)

# Fit the GLM
ds.glm(
  formula = formula, 
  data = "glm_table", 
  family = "binomial"
)
## $Nvalid
## [1] 576750
## 
## $Nmissing
## [1] 0
## 
## $Ntotal
## [1] 576750
## 
## $disclosure.risk
##      RISK OF DISCLOSURE
## brge                  0
## 
## $errorMessage
##      ERROR MESSAGES
## brge "No errors"   
## 
## $nsubs
## [1] 576750
## 
## $iter
## [1] 10
## 
## $family
## 
## Family: binomial 
## Link function: logit 
## 
## 
## $formula
## [1] "chronic_obstructive_pulmonary_disease ~ tobacco_user + asthma +      rheumatoid_arthritis + vitamin_d_deficiency"
## 
## $coefficients
##                       Estimate Std. Error     z-value       p-value
## (Intercept)          -6.572083 0.03534378 -185.947355  0.000000e+00
## tobacco_user          1.860760 0.24413811    7.621753  2.502536e-14
## asthma                2.996876 0.11485691   26.092255 4.463685e-150
## rheumatoid_arthritis  2.106862 0.39986610    5.268918  1.372299e-07
## vitamin_d_deficiency  1.814804 0.30800707    5.892084  3.813549e-09
##                      low0.95CI.LP high0.95CI.LP         P_OR
## (Intercept)             -6.641355     -6.502810  0.001396927
## tobacco_user             1.382258      2.339262  6.428622994
## asthma                   2.771760      3.221991 20.022883190
## rheumatoid_arthritis     1.323139      2.890585  8.222397606
## vitamin_d_deficiency     1.211121      2.418486  6.139869864
##                      low0.95CI.P_OR high0.95CI.P_OR
## (Intercept)             0.001303556     0.001496976
## tobacco_user            3.983888931    10.373580768
## asthma                 15.986751946    25.078005376
## rheumatoid_arthritis    3.755189296    18.003838708
## vitamin_d_deficiency    3.357245312    11.228849381
## 
## $dev
## [1] 13173.98
## 
## $df
## [1] 576745
## 
## $output.information
## [1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"

Logout

After finishing the centralized analysis, we close the connection:
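
datashield.logout(conns)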

Distributed analysis

Server connection

We start by establishing a connection with the three dedicated Opal servers. Each server exposes a resource that points to a database holding a different subset of the Tufts Synthetic Dataset; combined, the three servers contain 100% of the original dataset:

builder_dist <- newDSLoginBuilder()

builder_dist$append(server = "brge1",
                    url = "https://opal.isglobal.org/brge1",
                    user = "dsuser",
                    password = "P@ssw0rd",
                    driver = "OpalDriver")

builder_dist$append(server = "brge2",
                    url = "https://opal.isglobal.org/brge2",
                    user = "dsuser",
                    password = "P@ssw0rd",
                    driver = "OpalDriver")

builder_dist$append(server = "brge3",
                    url = "https://opal.isglobal.org/brge3",
                    user = "dsuser",
                    password = "P@ssw0rd",
                    driver = "OpalDriver")

logindata_dist <- builder_dist$build()
conns_dist <- datashield.login(logins = logindata_dist)

We create an instance of dsOMOPHelper to interact with the three databases. Note that every Opal server will have a different resource name, so we need to specify the resource for each server:

o <- ds.omop.helper(
    connections = conns_dist,
    resource = list("brge1" = "omop_demo.tufts_dist_1",
                    "brge2" = "omop_demo.tufts_dist_2",
                    "brge3" = "omop_demo.tufts_dist_3"),
    symbol = "tufts_dist"
)

We have specified the symbol tufts_dist on all servers, so from this point on we can refer to tufts_dist on each server and DataSHIELD will treat it as the same object, performing the same operations on it even though each server holds a different part of the dataset. The analysis therefore proceeds exactly as in the centralized case, with DataSHIELD handling the distributed nature of the dataset and combining the results.
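
As a quick optional check, we could list the R objects that each server currently holds using dsBaseClient's ds.ls function:

# List the server-side R objects on each of the three servers (optional check)
ds.ls(datasources = conns_dist)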

Dataset construction

Data retrieval

We have already defined the concepts we want to retrieve in the outcome_concept_id, condition_list, and observation_list variables, so we can apply the auto function directly to the servers. Each server will retrieve the data for its corresponding part of the dataset:

o$auto(
    table = c("condition_occurrence", "observation"), 
    concepts = c(outcome_concept_id, condition_list, observation_list),
    columns = c("condition_occurrence_id", "observation_id") 
)

We can observe how the three servers have obtained and stored the data in their corresponding tufts_dist objects:

ds.summary("tufts_dist", conns_dist)
## $brge1
## $brge1$class
## [1] "data.frame"
## 
## $brge1$`number of rows`
## [1] 192250
## 
## $brge1$`number of columns`
## [1] 16
## 
## $brge1$`variables held`
##  [1] "person_id"                                                    
##  [2] "gender_concept_id"                                            
##  [3] "year_of_birth"                                                
##  [4] "month_of_birth"                                               
##  [5] "day_of_birth"                                                 
##  [6] "birth_datetime"                                               
##  [7] "race_concept_id"                                              
##  [8] "ethnicity_concept_id"                                         
##  [9] "location_id"                                                  
## [10] "provider_id"                                                  
## [11] "care_site_id"                                                 
## [12] "vitamin_d_deficiency.condition_occurrence_id"                 
## [13] "asthma.condition_occurrence_id"                               
## [14] "chronic_obstructive_pulmonary_disease.condition_occurrence_id"
## [15] "rheumatoid_arthritis.condition_occurrence_id"                 
## [16] "tobacco_user.observation_id"                                  
## 
## 
## $brge2
## $brge2$class
## [1] "data.frame"
## 
## $brge2$`number of rows`
## [1] 192250
## 
## $brge2$`number of columns`
## [1] 16
## 
## $brge2$`variables held`
##  [1] "person_id"                                                    
##  [2] "gender_concept_id"                                            
##  [3] "year_of_birth"                                                
##  [4] "month_of_birth"                                               
##  [5] "day_of_birth"                                                 
##  [6] "birth_datetime"                                               
##  [7] "race_concept_id"                                              
##  [8] "ethnicity_concept_id"                                         
##  [9] "location_id"                                                  
## [10] "provider_id"                                                  
## [11] "care_site_id"                                                 
## [12] "asthma.condition_occurrence_id"                               
## [13] "vitamin_d_deficiency.condition_occurrence_id"                 
## [14] "rheumatoid_arthritis.condition_occurrence_id"                 
## [15] "chronic_obstructive_pulmonary_disease.condition_occurrence_id"
## [16] "tobacco_user.observation_id"                                  
## 
## 
## $brge3
## $brge3$class
## [1] "data.frame"
## 
## $brge3$`number of rows`
## [1] 192250
## 
## $brge3$`number of columns`
## [1] 16
## 
## $brge3$`variables held`
##  [1] "person_id"                                                    
##  [2] "gender_concept_id"                                            
##  [3] "year_of_birth"                                                
##  [4] "month_of_birth"                                               
##  [5] "day_of_birth"                                                 
##  [6] "birth_datetime"                                               
##  [7] "race_concept_id"                                              
##  [8] "ethnicity_concept_id"                                         
##  [9] "location_id"                                                  
## [10] "provider_id"                                                  
## [11] "care_site_id"                                                 
## [12] "asthma.condition_occurrence_id"                               
## [13] "vitamin_d_deficiency.condition_occurrence_id"                 
## [14] "chronic_obstructive_pulmonary_disease.condition_occurrence_id"
## [15] "rheumatoid_arthritis.condition_occurrence_id"                 
## [16] "tobacco_user.observation_id"

Data type conversions

The convert_to_boolean function defined earlier can be reused unchanged. We apply it to transform the retrieved data into boolean format; each server applies the function to the data stored in its own tufts_dist object:

# Convert tobacco use observation to boolean
convert_to_boolean("tufts_dist", "tobacco_user", 
                   "observation_id", conns_dist)

# Convert condition occurrences to boolean
convert_to_boolean("tufts_dist", "asthma", 
                   "condition_occurrence_id", conns_dist)
convert_to_boolean("tufts_dist", "rheumatoid_arthritis", 
                   "condition_occurrence_id", conns_dist)
convert_to_boolean("tufts_dist", "vitamin_d_deficiency", 
                   "condition_occurrence_id", conns_dist)

# Convert outcome variable (COPD) to boolean
convert_to_boolean("tufts_dist", "chronic_obstructive_pulmonary_disease", 
                   "condition_occurrence_id", conns_dist)
## $is.object.created
## [1] "A data object <tobacco_user> has been created in all specified data sources"
## 
## $validity.check
## [1] "<tobacco_user> appears valid in all sources"
## 
## $is.object.created
## [1] "A data object <asthma> has been created in all specified data sources"
## 
## $validity.check
## [1] "<asthma> appears valid in all sources"
## 
## $is.object.created
## [1] "A data object <rheumatoid_arthritis> has been created in all specified data sources"
## 
## $validity.check
## [1] "<rheumatoid_arthritis> appears valid in all sources"
## 
## $is.object.created
## [1] "A data object <vitamin_d_deficiency> has been created in all specified data sources"
## 
## $validity.check
## [1] "<vitamin_d_deficiency> appears valid in all sources"
## 
## $is.object.created
## [1] "A data object <chronic_obstructive_pulmonary_disease> has been created in all specified data sources"
## 
## $validity.check
## [1] "<chronic_obstructive_pulmonary_disease> appears valid in all sources"

Create table for the GLM

We have prepared the data for the analysis and can now create a new table with the variables of interest for the GLM. These variables were already defined in the variable_names vector, so we can use it directly. This operation constructs a new table on each server:

# Create new tables with the prepared outcome and predictors
ds.cbind(
  x = variable_names,
  DataSHIELD.checks = FALSE,
  newobj = "glm_table_dist",
  datasources = conns_dist
)

# Check the structure of the resulting tables to ensure they have been created correctly
ds.summary("glm_table_dist", conns_dist)
## $is.object.created
## [1] "A data object <glm_table_dist> has been created in all specified data sources"
## 
## $validity.check
## [1] "<glm_table_dist> appears valid in all sources"
## 
## $brge1
## $brge1$class
## [1] "data.frame"
## 
## $brge1$`number of rows`
## [1] 192250
## 
## $brge1$`number of columns`
## [1] 5
## 
## $brge1$`variables held`
## [1] "tobacco_user"                         
## [2] "asthma"                               
## [3] "rheumatoid_arthritis"                 
## [4] "vitamin_d_deficiency"                 
## [5] "chronic_obstructive_pulmonary_disease"
## 
## 
## $brge2
## $brge2$class
## [1] "data.frame"
## 
## $brge2$`number of rows`
## [1] 192250
## 
## $brge2$`number of columns`
## [1] 5
## 
## $brge2$`variables held`
## [1] "tobacco_user"                         
## [2] "asthma"                               
## [3] "rheumatoid_arthritis"                 
## [4] "vitamin_d_deficiency"                 
## [5] "chronic_obstructive_pulmonary_disease"
## 
## 
## $brge3
## $brge3$class
## [1] "data.frame"
## 
## $brge3$`number of rows`
## [1] 192250
## 
## $brge3$`number of columns`
## [1] 5
## 
## $brge3$`variables held`
## [1] "tobacco_user"                         
## [2] "asthma"                               
## [3] "rheumatoid_arthritis"                 
## [4] "vitamin_d_deficiency"                 
## [5] "chronic_obstructive_pulmonary_disease"

Generalized Linear Model

We can now fit the same GLM on each server, reusing the formula defined in the centralized analysis (stored in the formula variable):

formula
## [1] "chronic_obstructive_pulmonary_disease ~ tobacco_user + asthma + rheumatoid_arthritis + vitamin_d_deficiency"
ds.glm(
  formula = formula, 
  data = "glm_table_dist", 
  family = "binomial",
  datasources = conns_dist
)
## $Nvalid
## [1] 576750
## 
## $Nmissing
## [1] 0
## 
## $Ntotal
## [1] 576750
## 
## $disclosure.risk
##       RISK OF DISCLOSURE
## brge1                  0
## brge2                  0
## brge3                  0
## 
## $errorMessage
##       ERROR MESSAGES
## brge1 "No errors"   
## brge2 "No errors"   
## brge3 "No errors"   
## 
## $nsubs
## [1] 576750
## 
## $iter
## [1] 10
## 
## $family
## 
## Family: binomial 
## Link function: logit 
## 
## 
## $formula
## [1] "chronic_obstructive_pulmonary_disease ~ tobacco_user + asthma +      rheumatoid_arthritis + vitamin_d_deficiency"
## 
## $coefficients
##                       Estimate Std. Error     z-value       p-value
## (Intercept)          -6.572083 0.03534378 -185.947355  0.000000e+00
## tobacco_user          1.860760 0.24413811    7.621753  2.502536e-14
## asthma                2.996876 0.11485691   26.092255 4.463685e-150
## rheumatoid_arthritis  2.106862 0.39986610    5.268918  1.372299e-07
## vitamin_d_deficiency  1.814804 0.30800707    5.892084  3.813549e-09
##                      low0.95CI.LP high0.95CI.LP         P_OR
## (Intercept)             -6.641355     -6.502810  0.001396927
## tobacco_user             1.382258      2.339262  6.428622994
## asthma                   2.771760      3.221991 20.022883190
## rheumatoid_arthritis     1.323139      2.890585  8.222397606
## vitamin_d_deficiency     1.211121      2.418486  6.139869864
##                      low0.95CI.P_OR high0.95CI.P_OR
## (Intercept)             0.001303556     0.001496976
## tobacco_user            3.983888931    10.373580768
## asthma                 15.986751946    25.078005376
## rheumatoid_arthritis    3.755189296    18.003838708
## vitamin_d_deficiency    3.357245312    11.228849381
## 
## $dev
## [1] 13173.98
## 
## $df
## [1] 576745
## 
## $output.information
## [1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"

As we can observe, the results are identical to those of the centralized analysis, even though the computation was conducted in a distributed manner.
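
If we wanted to verify this equivalence programmatically, we could compare the two coefficient tables directly. Below is a minimal sketch, assuming the return values of the two ds.glm() calls had been stored in the hypothetical objects fit_central and fit_dist (the centralized fit would need to be saved before logging out of the brge server):

# Hypothetical objects (not created above), holding the ds.glm() results:
# fit_central <- ds.glm(formula = formula, data = "glm_table",
#                       family = "binomial", datasources = conns)
# fit_dist   <- ds.glm(formula = formula, data = "glm_table_dist",
#                      family = "binomial", datasources = conns_dist)

# Compare the pooled coefficient estimates from both approaches
all.equal(fit_central$coefficients[, "Estimate"],
          fit_dist$coefficients[, "Estimate"])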

Logout

After finishing the distributed analysis, we close the connections to the servers:

datashield.logout(conns_dist)

Conclusion

In this study, we demonstrated the robustness and reliability of dsOMOP for conducting federated analyses across distributed datasets. The identical results obtained from both centralized and distributed approaches validate the effectiveness of our federated analysis framework. Our findings revealed statistically significant associations (p < 0.05) between COPD and all examined predictors, aligning with previously reported relationships in the literature.

These results showcase dsOMOP’s capability to integrate and analyze distributed medical data while maintaining statistical integrity. The implementation of this federated approach represents a significant advancement in distributed clinical data analysis, offering a solution for large-scale, multi-center studies where data sharing restrictions may apply.