18 Exposome data analysis

18.1 Loading the Exposome Set

An exposome dataset is made up from three different tables. In order to be studied, this three tables need to be coerced into a R object of class ExposomeSet, for that reason an exposome dataset can be available on a server as three tables or as a resource.

18.1.1 From tables to ExposomeSet

To coerce the three tables that make the exposome dataset into an ExposomeSet R object, the tables have to be loaded inside the same study server. On this example, the tables are on the demo Opal server as resources, which means they have to be loaded on the study server and be coerced as data frames; for more information read the resourcer package documentation.

library(dsBaseClient)
library(dsExposomeClient)
library(DSOpal)

builder <- newDSLoginBuilder()
builder$append(server = "server1", url = "https://opal-demo.obiba.org/",
               user = "administrator", password = "password",
               driver = "OpalDriver")
logindata <- builder$build()
conns <- datashield.login(logins = logindata)

datashield.assign.resource(conns, symbol = 'description', resource = list(server1 = 'EXPOSOME.description'))
ds.class('description')
$server1
[1] "TidyFileResourceClient" "FileResourceClient"     "ResourceClient"         "R6"                    
datashield.assign.expr(conns, symbol = "description", expr = quote(as.resource.data.frame(description)))
ds.class('description')
$server1
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
datashield.assign.resource(conns, symbol = 'exposures', resource = list(server1 = 'EXPOSOME.exposures'))
datashield.assign.expr(conns, symbol = "exposures", expr = quote(as.resource.data.frame(exposures)))

datashield.assign.resource(conns, symbol = 'phenotypes', resource = list(server1 = 'EXPOSOME.phenotypes'))
datashield.assign.expr(conns, symbol = "phenotypes", expr = quote(as.resource.data.frame(phenotypes)))

ds.ls()
$server1
$server1$environment.searched
[1] "R_GlobalEnv"

$server1$objects.found
[1] "description" "exposures"   "phenotypes" 

If the tables are on the Opal server as tables instead of resources, the lines after the datashield.login would look like this

# Don't run this code, the tables are not on the demo Opal server, it's just for demostration purposes
datashield.assign.table(conns, symbol = 'description', table = list(server1 = 'EXPOSOME.description_table'))

# Repeat for the other two tables

When there are tables on the Opal server they can be directly loaded to the study server as data frames, there is no need to perform the as.data.frame assignation.

To coerce the three tables to the ExposomeSet object, the dsExposomeClient library has the ds.loadExposome() function, which takes as input parameters the names of the tables on the study server among some further configuration options, refer to the function documentation for more information.

ds.loadExposome("exposures", "description", "phenotypes", "idnum", "idnum", "Exposure", "Family", 5, FALSE, "exposome_object")
ds.class("exposome_object")
$server1
[1] "ExposomeSet"
attr(,"package")
[1] "rexposome"

18.1.2 From resource to ExposomeSet

When there’s an ExposomeSet resource available, it just needs to be loaded into the study server

datashield.assign.resource(conns, symbol = 'exposome_resource', resource = list(server1 = 'EXPOSOME.exposomeSet'))

ds.ls()
$server1
$server1$environment.searched
[1] "R_GlobalEnv"

$server1$objects.found
[1] "description"       "exposome_object"   "exposome_resource" "exposures"         "phenotypes"       
ds.class("exposome_resource")
$server1
[1] "RDataFileResourceClient" "FileResourceClient"      "ResourceClient"          "R6"                     

Once on the study server, the resource has to be coerced into an R object

datashield.assign.expr(conns, symbol = "exposome_resource", expr = quote(as.resource.object(exposome_resource)))
ds.class("exposome_resource")
$server1
[1] "ExposomeSet"
attr(,"package")
[1] "rexposome"

18.2 Exploring the loaded dataset

Once the Exposome datasets are loaded on the study servers, some simple functions can be used to have some understanding of what the dataset contains. Those correspond to the name of the exposures, the names of the families of the exposures and the name of the phenotypes.

18.2.1 Family names

To get the names of the families of exposures present on the exposome set, there’s the ds.familyNames function, it can take the parameter by.exposure = TRUE to list the family names and the exposures that contain each family.

ds.familyNames("exposome_object")
$server1
 [1] "Air Pollutants"    "Metals"            "PBDEs"             "Organochlorines"   "Bisphenol A"      
 [6] "Water Pollutants"  "Built Environment" "Cotinine"          "Home Environment"  "Phthalates"       
[11] "Noise"             "PFOAs"             "Temperature"      
head(ds.familyNames("exposome_object", TRUE)$server1)
         AbsPM25               As           BDE100           BDE138           BDE153           BDE154 
"Air Pollutants"         "Metals"          "PBDEs"          "PBDEs"          "PBDEs"          "PBDEs" 

18.2.2 Exposures and phenotypes names

To get the names of the exposures or phenotypes (or both combined) there’s the function ds.exposome_variables, it takes as argument exposures, phenotypes or all to retrieve the desired names.

head(ds.exposome_variables("exposome_object" , "exposures", conns)$server1)
[1] "AbsPM25" "As"      "BDE100"  "BDE138"  "BDE153"  "BDE154" 
ds.exposome_variables("exposome_object" , "phenotypes", conns)
$server1
[1] "whistling_chest" "flu"             "rhinitis"        "wheezing"        "birthdate"       "sex"            
[7] "age"             "cbmi"            "blood_pre"      

18.2.3 Summary of variables

Non-disclosive descriptive statistics can be obtained from the exposome dataset, the function ds.exposome_summary is in charge of that. It can obtain descriptive statistics from numeric and factor variables of the exposome dataset (both from exposures and phenotypes).

# Numerical variable
ds.exposome_summary("exposome_object", "PM25", conns)
$server1
$server1$class
[1] "numeric"

$server1$length
[1] 109

$server1$`quantiles & mean`
      5%      10%      25%      50%      75%      90%      95%     Mean 
1.093217 1.132021 1.146654 1.171904 1.181451 1.202192 1.216758 1.165488 
# Factor variable
ds.exposome_summary("exposome_object", "flu", conns)
$server1
$server1$class
[1] "character"

$server1$length
[1] 109

18.2.4 Missing data

The number of missing data on each exposure and on each phenotype can be found by using the function ds.tableMissings. This function returns a vector with the amount of missing data in each exposure or phenotype. The argument set indicates if the number of missing values is counted on exposures or phenotypes. The argument output indicates if it is shown as counts (output="n") or as percentage (output="p").

ds.tableMissings("exposome_object", set = "exposures")
$server1
        Dens         Temp         Conn      AbsPM25           NO          NO2          NOx         PM10 
           0            0            1            2            2            2            2            2 
      PM10Cu       PM10Fe        PM10K       PM10Ni        PM10S       PM10SI       PM10Zn         PM25 
           2            2            2            2            2            2            2            2 
      PM25CU       PM25FE        PM25K       PM25Ni        PM25S       PM25Sl       PM25Zn     PMcoarse 
           2            2            2            2            2            2            2            2 
     Benzene        PM25V          ETS G_pesticides          Gas         BTHM        CHCl3 H_pesticides 
           3            3            5            5            5            6            6            6 
     Noise_d      Noise_n          THM     Cotinine         bHCH          DDE          DDT          HCB 
           6            6            6            7           13           13           13           13 
      PCB118       PCB138       PCB153       PCB180          BPA           As           Cs           Mo 
          13           13           13           13           21           24           24           24 
          Ni           Tl           Zn           Hg           Cd           Sb        Green           Cu 
          24           24           24           27           28           30           31           40 
       PM10V           Se         MBzP        MEHHP         MEHP        MEOHP          MEP         MiBP 
          41           45           46           46           46           46           46           46 
        MnBP     X5cxMEPP           Co        PFHxS         PFNA         PFOA         PFOS    X7OHMMeOP 
          46           46           47           48           48           48           48           49 
          Pb     X2cxMMHP       BDE100       BDE138       BDE153       BDE154        BDE17       BDE183 
          59           64           76           76           76           76           76           76 
      BDE190       BDE209        BDE28        BDE47        BDE66        BDE71        BDE85        BDE99 
          76           76           76           76           76           76           76           76 
ds.tableMissings("exposome_object", set = "phenotypes")
$server1
whistling_chest             flu        rhinitis        wheezing             sex             age 
              0               0               0               0               0               0 
           cbmi       blood_pre       birthdate 
              0               2               3 

Optionally, there’s also the ds.plotMissings function which returns a ggplot object with a barplot of missings for exposures or phenotypes, there’s the option of displaying the percentage of missings as well as total counts, check the function documentation for further information.

ds.plotMissings("exposome_object", "exposures", "p")
$server1

Please note that since this function call returns a ggplot object, it can be plotted properly to avoid squished Y axis when there are lots of exposures, read the official documentation for information on how to do that.

18.2.5 Exposures Normality

Most of the test done in exposome analysis requires that the exposures must follow a normal distribution. The function ds.normalityTest performs a test on each exposure for normality behaviour. The result is a data.frame with the exposures’ names, a flag TRUE/FALSE for normality and the p-value obtained from the Shapiro-Wilk Normality Test (if the p-value is under the threshold, then the exposure is not normal).

nm <- ds.normalityTest("exposome_object")
table(nm$server1$normality)

FALSE  TRUE 
   55    29 

So, the exposures that do not follow a normal distribution are:

nm$server1$exposure[!nm$server1$normality]
 [1] "DDT"      "PM10SI"   "PM25K"    "PM25Sl"   "PCB118"   "Tl"       "PM10V"    "PM25Zn"   "PM25FE"  
[10] "PM10K"    "BDE17"    "PM25"     "PMcoarse" "PM10"     "BPA"      "Green"    "NO2"      "Cs"      
[19] "PFNA"     "PCB153"   "PM25CU"   "MEOHP"    "Cu"       "HCB"      "MEHHP"    "DDE"      "BDE190"  
[28] "bHCH"     "PM10Zn"   "MnBP"     "NO"       "NOx"      "PM10S"    "MEHP"     "PCB138"   "Zn"      
[37] "X2cxMMHP" "PCB180"   "PFOA"     "Cotinine" "PM25S"    "Co"       "Conn"     "PM25Ni"   "PFHxS"   
[46] "PM10Ni"   "Cd"       "Dens"     "Se"       "X5cxMEPP" "BDE183"   "BDE28"    "Sb"       "BDE138"  
[55] "PM25V"   

The ds.normalityTest function has some extra input arguments to tune the normality test, check the function documentation for further information.

The exposures can be visualized using non-disclosive histograms to see their distribution along their Shapiro-Wilk Normality Test p-value.

ds.exposure_histogram("exposome_object", "BDE209")

If the selected exposure is not following a normal distribution, the function ds.exposure_histogram accepts the argument show.trans to visualize the raw data histogram plus three typical transformations (exp, log and sqrt), the Shapiro-Wilk Normality Test p-value is shown for all the transformations.

ds.exposure_histogram("exposome_object", "BDE209", TRUE)

18.2.6 Exposures Behaviour

We can get a snapshot of the behaviour of the full exposome using the ds.plotFamily function. This function draws a mosaic of boxplots with all the numeric families of exposures, it can also plot single families and perform grouping inside of them. This function makes use of the ggplot2 library to generate the plots. Future plans include displaying non-disclosive plots for non-numeric families.

ds.plotFamily("exposome_object", family = "all")

ds.plotFamily("exposome_object", family = "Phthalates", group = "sex")

18.3 Exposures Imputation

The missings of an exposome dataset can be imputed using the ds.imputation function, which calls the rexposome function imputation on the study server. Refer to the documentation of rexposome for details about the imputation procedures.

ds.imputation("exposome_object", "exposome_object_imputed")

18.4 Performing an ExWAS

When the ExposomeSet object is on the study the server, the function ds.exwas is in charge of performing the ExWAS. As an example let’s present a situation where each exposition has to be associated to the blood_pre phenotype using the phenotype sex as a covariate. This study would be done as follows

exwas_results <- ds.exwas("blood_pre ~ sex", "exposome_object", "gaussian")
head(exwas_results$exwas_results)
  exposure         family coefficient      minE      maxE      p.value
1  AbsPM25 Air Pollutants  20.2567443 11.532500 28.980989 5.343657e-06
2       As         Metals   0.5049329 -1.637091  2.646956 6.440695e-01
3   BDE100          PBDEs  -1.4684745 -5.073223  2.136273 4.246181e-01
4   BDE138          PBDEs   1.2819339 -4.872401  7.436269 6.830857e-01
5   BDE153          PBDEs   2.8699183 -1.252769  6.992606 1.724462e-01
6   BDE154          PBDEs   0.1317781 -5.381815  5.645371 9.626373e-01

The model is inputted as a string, where the left side term is the objective phenotype, and the right term are the covariates, in the case of wanting more covariates proceed as: objective ~ cov1 + cov2 + ... + covN. It’s important noting that if the output family (third argument) does not match the objective family, the ExWAS will fail (Example: The objective family is binomial and the user sets it to gaussian).

To visualize the results from the ExWAS, the function ds.plotExwas takes the output of ds.exwas and creates two different visualizations. A Manhattan-like plot with the p-values of the association between each exposure and asthma, coloured by families of exposures:

ds.plotExwas(exwas_results, "manhattan")

Also, a plot for the effects of a given model can be obtained with:

ds.plotExwas(exwas_results, "effect")

18.5 Exposures PCA

A PCA can be performed on the exposures of the exposome dataset. To do so, there’s the ds.exposome_pca. The exposures should be standardized in order to perform the PCA properly, to do so, there’s the arguments standar and method, which standardize the Exposome Set before performing the PCA following the specified method. The available methods are normal (default method), which scales the exposures using the mean as the center and the standard variation as dispersion; the robust method, which uses the median and median absolute deviation respectively; and, interquartile range, which uses the median as the center and the coeficient between the interquartile range of the exposure and the normal range between the percentile 75 and 25 as variance. It is important noting that this function is sensitive to be disclosive, specially for very rectangular data frames (similar number of variables as individuals). To illustrate this problem, let’s try to perform a PCA on the whole exposures test data.

ds.exposome_pca("exposome_object", standar = TRUE)
Error: There are some DataSHIELD errors, list them with datashield.errors()

If that is the case, one option is to reduce the families of exposures of the Exposome Set. The ds.exposome_pca function has the argument fam to select the families to subset the Exposome Set to perform the PCA.

ds.exposome_pca("exposome_object", fam = c("Metals", "Noise"), standar = TRUE)

The PCA function saves the results on the study server to prevent any dislosures, the default variable they take is called "ds.exposome_pca.Results", which has to be passed to the visualization function. To visualize the results of the PCA there is the function ds.exposome_pca_plot, this function relies on the visualization methods already implemented on rexposome for the PCA analysis, it does it however on a non-disclosive way, by passing the scatter plot points through an anonimization process, hence the arguments k, method and noise. The visualization is controlled with the set argument, which takes "all" (mosaic of plots of the PCA), "exposures" (plot of the exposures space on the first two principal components, color coded by family), "samples" (plot of the individuals space on the first two principal components, this plot can take the phenotype argument to color code the individuals by phenotypes), "variance" and "variance_explained", the two variance plots are quite self explanatory, the color code on the "variance" highlights the first two principal components as they are the ones shown on the other drawings.

ds.exposome_pca_plot("ds.exposome_pca.Results", set = "all", method = 1, k = 3, noise = 5)

ds.exposome_pca_plot("ds.exposome_pca.Results", set = "samples", phenotype = "sex", method = 1, k = 3, noise = 5)

ds.exposome_pca_plot("ds.exposome_pca.Results", set = "exposures", method = 1, k = 3, noise = 5)

ds.exposome_pca_plot("ds.exposome_pca.Results", set = "variance", method = 1, k = 3, noise = 5)

ds.exposome_pca_plot("ds.exposome_pca.Results", set = "variance_explained", method = 1, k = 3, noise = 5)

Furthermore, the ds.exposome_pca_plot function can plot the correlations betweeen the principal components and the exposures and the association of the phenotypes with the principal components. This two visualizations can be obtained by setting the set argument to "exposures_correlation" and "phenotypes_correlation" respectively.

ds.exposome_pca_plot("ds.exposome_pca.Results", set = "exposures_correlation")

ds.exposome_pca_plot("ds.exposome_pca.Results", set = "phenotypes_correlation")

18.6 Exposures Correlation

The correlation between exposures, in terms of intra-family and inter-family exposures, is interesting to take into account. The correlation of the exposome can be computed using ds.correlation. The correlation could be disclosive, for that reason there’s the argument fam to select the families to compute the correlation subsetting the exposome dataset by families, typical complete exposome datasets will fail as the correlation matrix could be disclosive.

ds.correlation("exposome_object")[[1]][[1]]$`Correlation Matrix`[1:5,1:5]
     [,1] [,2] [,3] [,4] [,5]
[1,]   NA   NA   NA   NA   NA
[2,]   NA   NA   NA   NA   NA
[3,]   NA   NA   NA   NA   NA
[4,]   NA   NA   NA   NA   NA
[5,]   NA   NA   NA   NA   NA
ds.correlation("exposome_object")[[1]][[1]]$`Error message`
[1] "ERROR: The ratio of the number of variables over the number of individual-level records exceeds the allowed threshold, there is a possible risk of disclosure"

When subsetting the exposome set by families the correlation succeeds.

ds.correlation("exposome_object", c("Metals", "Noise"))[[1]][[1]]$`Correlation Matrix`[1:5,1:5]
            [,1]       [,2]        [,3]       [,4]        [,5]
[1,]  1.00000000 0.08897651 -0.12603679  0.1112793 -0.03127031
[2,]  0.08897651 1.00000000  0.10209881  0.5356812  0.84660044
[3,] -0.12603679 0.10209881  1.00000000 -0.5111215 -0.05851681
[4,]  0.11127929 0.53568121 -0.51112146  1.0000000  0.47229280
[5,] -0.03127031 0.84660044 -0.05851681  0.4722928  1.00000000

The output from the ds.correlation consists of the correlation matrix and a vector that contains the names of exposures and families on the correlation table, this second item is used by the corPlot function to display the results of the correlation study.

corr_data <- ds.correlation("exposome_object", c("Metals", "Noise"))
corPlot(corr_data)

The best option to see the inter-family correlations is the circos of correlations while the matrix of correlations is a better way for studying the intra-family correlations.

corPlot(corr_data, type = "circos")

datashield.logout(conns)