7 Methods

7.1 Missing data imputation

Values that are missing because they fall below the limit of detection (LOD) are identified through an encoding provided by the user. No method is implemented to separate missing at random values from LOD missing values automatically, so all NA values are considered missing at random.

7.1.1 Limit of detection (LOD) missing

LOD missings can be imputed using two methodologies:

  • LOD value / sqrt(2): use a LOD value provided by the user (one value per exposure) divided by the square root of two, Richardson and Ciampi (2003).
  • QRILC: a quantile regression approach for the imputation of left-censored missing data Lazar (2015).
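
The first option can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name `impute_lod`, the encoding `-1` for LOD missings and the LOD value `0.5` are all assumptions made for the example. QRILC is not sketched here.

```python
import math

def impute_lod(values, lod, lod_code=-1):
    """Replace every entry equal to lod_code with lod / sqrt(2).

    lod_code is a hypothetical user-provided encoding for LOD missings.
    """
    fill = lod / math.sqrt(2)
    return [fill if v == lod_code else v for v in values]

# One exposure with two LOD missings (encoded as -1) and a LOD of 0.5
exposure = [0.8, -1, 1.3, -1, 2.1]
imputed = impute_lod(exposure, lod=0.5)
```

Observed values are left untouched; only the encoded entries are replaced.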

7.1.2 Missing at random

Multiple imputation by chained equations (MICE) is used to impute missing at random data, through the mice package. A brief explanation of the algorithm:

  1. Impute the missing values of a variable (exposure) xn with the mean of its observed values.
  2. Perform step 1 for all the variables.
  3. Set the mean-imputed values of one variable back to missing.
  4. Fit a regression model of that variable on the other variables and use it to fill those missings.
  5. Repeat steps 3 and 4 for all the variables.
  6. Repeat steps 3, 4 and 5 until the imputed values stabilize.
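
The steps above can be sketched in a few lines. This is a deliberately simplified illustration and not what the mice package does internally: it handles exactly two variables, uses simple linear regression, is deterministic and returns a single completed dataset. The names `fit_line` and `chained_imputation` and the toy data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def chained_imputation(data, max_iter=50, tol=1e-8):
    """data: dict with two equally long lists; None marks a missing value."""
    names = list(data)
    missing = {k: {i for i, v in enumerate(data[k]) if v is None} for k in names}
    filled = {}
    # Steps 1 and 2: mean imputation of every variable
    for k in names:
        observed = [v for v in data[k] if v is not None]
        mean = sum(observed) / len(observed)
        filled[k] = [mean if v is None else v for v in data[k]]
    # Step 6: iterate until the imputed values stop changing
    for _ in range(max_iter):
        change = 0.0
        for target in names:  # step 5: cycle over the variables
            if not missing[target]:
                continue
            other = names[1 - names.index(target)]
            # Step 3: rows where the target was missing are excluded from the fit
            rows = [i for i in range(len(filled[target])) if i not in missing[target]]
            intercept, slope = fit_line(
                [filled[other][i] for i in rows],
                [filled[target][i] for i in rows],
            )
            # Step 4: predict the missing entries from the other variable
            for i in missing[target]:
                new = intercept + slope * filled[other][i]
                change = max(change, abs(new - filled[target][i]))
                filled[target][i] = new
        if change < tol:
            break
    return filled

# Toy data following y = 2x, with one missing value in each variable
toy = {"x": [1, 2, 3, 4, None, 6], "y": [2, 4, None, 8, 10, 12]}
completed = chained_imputation(toy)
```

On this toy data the imputed values move from the column means towards the values implied by the linear relation between the two variables.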

7.2 Normality

7.2.1 Normality testing

To test the normality of a variable, a Shapiro-Wilk test is used. The Shapiro-Wilk test tests the null hypothesis that a sample (a variable of the dataset) is normally distributed. To perform the test, it calculates the W statistic:

\(W = \frac{\left( \sum^{n}_{i=1} a_i x_{(i)} \right)^2}{\sum^{n}_{i=1} (x_i - \overline{x})^2}\)

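Here \(x_{(i)}\) is the i-th smallest value of the sample, \(\overline{x}\) is the sample mean, and the \(a_i\) are constants derived from the expected values and covariances of the order statistics of a standard normal sample. To perform this test, the shapiro.test function from the stats package of R is used.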

7.2.2 Normalization

A user-selected function can be applied to exposures (also selected by the user) to normalize them. The available functions are log, sqrt and ^(1/3) (cube root).
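
As a minimal sketch of how such a choice could be applied, the three transformations can be kept in a lookup table. The names `TRANSFORMS` and `normalize` are hypothetical, and the example assumes strictly positive exposure values.

```python
import math

# Hypothetical lookup of the three available transformations
TRANSFORMS = {
    "log": math.log,                 # natural logarithm, needs values > 0
    "sqrt": math.sqrt,               # square root, needs values >= 0
    "^1/3": lambda v: v ** (1 / 3),  # cube root, assumes values >= 0
}

def normalize(values, method):
    """Apply the selected transformation to every value of one exposure."""
    transform = TRANSFORMS[method]
    return [transform(v) for v in values]

roots = normalize([1.0, 4.0, 9.0], "sqrt")
```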

7.3 Principal component analysis (PCA)

Rexposome contains two PCA methodologies.

exposomeShiny uses regular PCA from the FactoMineR package. A toggle to select between the two may be added in future releases.

7.4 Exposures correlation

The correlation method takes into account the nature of each pair of exposures:

  • Continuous vs. continuous: the cor function from the stats R package.
  • Categorical vs. categorical: the cramersV function from the lsr R package.
  • Categorical vs. continuous: the square root of the adjusted R-squared obtained from fitting a linear model with the categorical exposure as dependent variable and the continuous exposure as independent variable.
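
For the categorical vs. categorical case, Cramér's V can be computed from the chi-squared statistic of the contingency table. The sketch below is an illustration with a hypothetical function name, `cramers_v`, and toy labels; it applies no continuity correction, so its result may differ from the lsr function on 2x2 tables.

```python
import math
from collections import Counter

def cramers_v(a, b):
    """Cramér's V for two equally long lists of category labels.

    Assumes each variable has at least two observed categories.
    """
    n = len(a)
    row_totals = Counter(a)
    col_totals = Counter(b)
    observed = Counter(zip(a, b))
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals))
    return math.sqrt(chi2 / (n * (k - 1)))

# Perfectly associated labels give a value of 1, independent labels give 0
v_assoc = cramers_v(["x", "y", "z", "x", "y", "z"], ["p", "q", "r", "p", "q", "r"])
v_indep = cramers_v(["x", "x", "y", "y"], ["p", "q", "p", "q"])
```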

7.5 Exposures clustering

Clustering analysis can be performed on samples to group individuals with similar exposure profiles. This is done through hierarchical clustering with the hclust function from the stats R package. The analysis yields the exposure profiles of a selected number of groups.

7.6 Exposome Association Analysis

7.6.1 Single Association Analysis

The Exposome-Wide Association Study (ExWAS) is the exposome equivalent of a Genome-Wide Association Study (GWAS) in genomics or an Epigenome-Wide Association Study (EWAS) in epigenomics. The ExWAS was first described by Patel, Bhattacharya, and Butte (2010). ExWAS are based on generalized linear models and accept any formula describing the covariates the model should be adjusted for (following the standard formula options in R). That is, continuous or factor variables can be incorporated in the design, as well as interactions or splines, using standard R functions and formulas. Multiple comparisons in the ExWAS analysis are addressed by computing the effective number of tests (Meff) as described by Li and Ji (2005). The method estimates Meff from the exposure correlation matrix, which is corrected with the nearPD R function when it is not positive definite. The significance threshold is computed as \(1-(1-0.05)^{1/M_{\text{eff}}}\). This threshold is added to the Manhattan plots. When using imputed data, the analysis is done for each imputed set and the P-values are pooled to obtain a global association score.
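
The multiple-testing correction can be illustrated with a short sketch. It assumes the formula published by Li and Ji (2005), where Meff is the sum over the eigenvalues of the correlation matrix of f(x) = I(x >= 1) + (x - floor(x)), followed by a Šidák-type threshold 1-(1-alpha)^(1/Meff). The package's own computation may differ in detail. The eigenvalues are taken as given, and the function names are hypothetical.

```python
import math

def effective_tests(eigenvalues):
    """Effective number of tests from correlation-matrix eigenvalues.

    Uses f(x) = I(x >= 1) + (x - floor(x)) on the absolute eigenvalues.
    """
    total = 0.0
    for lam in eigenvalues:
        x = abs(lam)
        total += (1.0 if x >= 1 else 0.0) + (x - math.floor(x))
    return total

def sidak_threshold(meff, alpha=0.05):
    """Per-test significance threshold for meff effective tests."""
    return 1 - (1 - alpha) ** (1 / meff)

# Three uncorrelated exposures: all eigenvalues equal 1, so Meff is 3
meff_independent = effective_tests([1.0, 1.0, 1.0])
# Three perfectly correlated exposures: one eigenvalue of 3, so Meff is 1
meff_redundant = effective_tests([3.0, 0.0, 0.0])
threshold = sidak_threshold(meff_independent)
```

The more correlated the exposures, the smaller Meff becomes and the less strict the resulting threshold.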

7.6.2 Stratified Single Association Analysis

The stratified analysis option for the ExWAS applies the same method as the regular ExWAS to subsets of the dataset. For example, a stratified analysis on the sex variable corresponds to performing two ExWAS, one for the male group and one for the female group.
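
The idea of splitting the data and running the same analysis on each subset can be sketched as below. The function names, the record layout and the stand-in analysis (a plain mean, not an ExWAS) are assumptions made for illustration only.

```python
from collections import defaultdict

def stratify(rows, key):
    """Group records by the value of the stratifying variable."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

def stratified_analysis(rows, key, analysis):
    """Run the same analysis function once per stratum."""
    return {level: analysis(subset) for level, subset in stratify(rows, key).items()}

def mean_exposure(subset):
    # Stand-in for the per-stratum model fit
    return sum(r["exposure"] for r in subset) / len(subset)

rows = [
    {"sex": "male", "exposure": 1.0},
    {"sex": "female", "exposure": 2.0},
    {"sex": "male", "exposure": 3.0},
    {"sex": "female", "exposure": 6.0},
]
results = stratified_analysis(rows, "sex", mean_exposure)
```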

7.6.3 Variable selection ExWAS

Some authors have proposed performing the association analysis in a multivariate fashion, to take into account the correlation across exposures, Agier et al. (2016). A Lasso regression is available for this purpose, using the Elastic-Net regularized generalized linear models implemented in the glmnet R package.
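
For reference, and assuming a Gaussian outcome, the Elastic-Net objective documented for glmnet can be written as follows. The symbols \(N\) (number of samples), \(\lambda\) (penalty strength) and \(\alpha\) (mixing parameter) are notation introduced here for illustration.

\(\min_{\beta_0, \beta} \; \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^{T} \beta \right)^2 + \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]\)

Setting \(\alpha = 1\) removes the ridge term and leaves only the \(\lVert \beta \rVert_1\) penalty, which is the Lasso. That penalty can shrink some coefficients exactly to zero, which is what makes the method usable for variable selection.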

7.7 Exposome-Omic Association Analysis

Association analyses between exposures and omic data are performed by fitting linear models as described in the limma R package, Ritchie et al. (2015). The pipeline implemented in association allows performing surrogate variable analysis in order to correct for unwanted variability. This adjustment is provided by the sva R package, Leek et al. (2020).

7.8 Integration analysis

There are three different methodologies to perform the integration analysis:

  • Multiset canonical correlation analysis (MCCA). Implemented using the MultiCCA function of the PMA R package, Witten et al. (2020).
  • Multiple co-inertia analysis (MCIA). Implemented using the mcia function of the omicade4 R package, Meng et al. (2013), Min and Long (2020).
  • Partial least squares (PLS). Implemented using the plsr function of the pls R package, Mevik and Wehrens (2015).

7.9 Enrichment analysis

Functional profiles of selected genes are obtained using the Bioconductor package clusterProfiler, Yu et al. (2012). The available enrichment databases are GO and KEGG.