7 Methods
7.1 Missing data imputation
LOD missings are identified through an encoding provided by the user. No method is implemented to separate missing values into missing at random and missing because of the LOD, so every NA value is treated as missing at random.
7.1.1 Limit of detection (LOD) missing
LOD missings can be imputed using two methodologies:
- LOD value / sqrt(2): use the LOD value provided by the user (one value per exposure) divided by the square root of two. Richardson and Ciampi (2003)
- QRILC: a quantile regression approach for the imputation of left-censored missing data Lazar (2015).
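The first option can be illustrated with a minimal Python sketch. The function name `impute_lod`, the `lod_code` marker and the sample values are hypothetical and only stand in for the user-provided encoding; this is not the package's implementation.

```python
import math


def impute_lod(values, lod, lod_code=-1):
    """Replace entries equal to lod_code with lod / sqrt(2); keep the rest."""
    fill = lod / math.sqrt(2)
    return [fill if v == lod_code else v for v in values]


# Hypothetical exposure with LOD = 0.5; -1 is the assumed code for "below LOD".
measured = [1.2, -1, 0.8, -1, 2.4]
imputed = impute_lod(measured, lod=0.5)
```

Each exposure would be processed with its own LOD value, since one value per exposure is provided.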
7.1.2 Missing at random
Multiple imputation by chained equations (MICE) is used to impute missing-at-random data, using the mice package. A brief explanation of the algorithm:
1. Impute the missing values of a variable (exposure) xn with the mean of its observed values.
2. Perform step 1 for all the variables.
3. Set the mean-imputed values of one variable back to missing.
4. Fit a regression model of that variable on the other variables and fill those missings with its predictions.
5. Repeat steps 3 and 4 for all the variables.
6. Repeat steps 3, 4 and 5 until the imputed values stabilize.
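The loop described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the mice package: it handles only two variables, uses a simple least-squares line as the regression model, and is deterministic, whereas mice adds random variation and produces several imputed datasets. The names `fit_line` and `chained_imputation` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def chained_imputation(x, y, max_iter=50, tol=1e-8):
    """Impute None entries in two paired lists; returns filled copies."""
    data = {"x": list(x), "y": list(y)}
    missing = {k: [i for i, v in enumerate(col) if v is None]
               for k, col in data.items()}
    # Mean imputation of every variable (first two steps of the list).
    for k, col in data.items():
        observed = [v for v in col if v is not None]
        mean = sum(observed) / len(observed)
        for i in missing[k]:
            col[i] = mean
    # Outer loop: repeat until the imputed values stop changing.
    for _ in range(max_iter):
        largest_change = 0.0
        # Cycle over the variables, treating each one as the target in turn.
        for target, predictor in (("x", "y"), ("y", "x")):
            if not missing[target]:
                continue
            # Fit on rows where the target was observed, then predict the rest.
            rows = [i for i in range(len(x)) if i not in missing[target]]
            a, b = fit_line([data[predictor][i] for i in rows],
                            [data[target][i] for i in rows])
            for i in missing[target]:
                new_value = a + b * data[predictor][i]
                largest_change = max(largest_change,
                                     abs(new_value - data[target][i]))
                data[target][i] = new_value
        if largest_change < tol:
            break
    return data["x"], data["y"]


# Toy data following y = 2 * x, with one missing value in each variable.
x_filled, y_filled = chained_imputation([1, 2, 3, 4, None],
                                        [2, 4, None, 8, 10])
```

With more than two variables, each target would be regressed on all the remaining variables instead of a single predictor.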
7.2 Normality
7.2.1 Normality testing
To test the normality of a variable, a Shapiro-Wilk test is used. The Shapiro-Wilk test tests the null hypothesis that a sample (a variable of the dataset) is normally distributed. To perform the test, it calculates the W statistic:
\(W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \overline{x})^2}\)
where \(x_{(i)}\) is the i-th smallest value of the sample, \(\overline{x}\) is the sample mean and the \(a_i\) are constants derived from the order statistics of a standard normal sample of size n.
To perform this test, rexposome uses the shapiro.test function from the stats package of R.
7.2.2 Normalization
A user-selected function can be applied to user-selected exposures to normalize them. The available functions are log, sqrt and ^1/3 (cube root).
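As a rough illustration, the three transformations could be applied as follows. The `TRANSFORMS` table and the `normalize` helper are hypothetical names, and the sketch assumes strictly positive exposure values (log is undefined at zero and below).

```python
import math

# Map each option name to the function it applies to a single value.
TRANSFORMS = {
    "log": math.log,                      # natural logarithm
    "sqrt": math.sqrt,                    # square root
    "^1/3": lambda v: v ** (1.0 / 3.0),   # cube root
}


def normalize(values, method):
    """Apply the chosen transformation to every value of one exposure."""
    transform = TRANSFORMS[method]
    return [transform(v) for v in values]


# Hypothetical right-skewed exposure.
exposure = [1.0, 4.0, 8.0, 64.0]
logged = normalize(exposure, "log")
rooted = normalize(exposure, "sqrt")
cubed = normalize(exposure, "^1/3")
```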
7.3 Principal component analysis (PCA)
Rexposome contains two PCA methodologies:
- Regular PCA Jolliffe and Cadima (2016) (numerical exposures only)
- Factor analysis of mixed data (FAMD) Chavent et al. (2014) (numerical and categorical exposures)
exposomeShiny uses regular PCA from the FactoMineR package. A toggle to select between the two may be added in future releases.
7.4 Exposures correlation
The correlation method takes into account the nature of each pair of exposures. Continuous vs. continuous pairs use the cor function from base R. Categorical vs. categorical pairs use the cramersV function from the lsr R package. For categorical vs. continuous pairs, the correlation is calculated as the square root of the adjusted R-squared obtained from fitting a linear model with the categorical exposure as dependent variable and the continuous exposure as independent variable.
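For the categorical vs. categorical case, Cramér's V can be sketched in plain Python as below. This is an illustration of the statistic, not the lsr implementation: the function name `cramers_v` is hypothetical, and the sketch applies no continuity correction, so its values may differ from the R result on 2x2 tables.

```python
import math
from collections import Counter


def cramers_v(a, b):
    """Cramér's V between two categorical variables of equal length."""
    n = len(a)
    row_totals, col_totals = Counter(a), Counter(b)
    cell_counts = Counter(zip(a, b))
    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for r, n_r in row_totals.items():
        for c, n_c in col_totals.items():
            expected = n_r * n_c / n
            chi2 += (cell_counts[(r, c)] - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical exposures: identical grouping vs. unrelated grouping.
associated = cramers_v(["x", "x", "y", "y"], ["p", "p", "q", "q"])
unrelated = cramers_v(["x", "x", "y", "y"], ["p", "q", "p", "q"])
```

The statistic ranges from 0 (no association) to 1 (one variable fully determines the other), which makes it comparable with the absolute correlations used for the other pair types.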
7.5 Exposures clustering
Clustering analysis can be performed on samples to group individuals with similar exposure profiles. This is done by hierarchical clustering with the hclust function from the stats R package. The analysis yields the exposure profiles of a selected number of groups.
7.6 Exposome Association Analysis
7.6.1 Single Association Analysis
An Exposome-Wide Association Study (ExWAS) is the equivalent of a Genome-Wide Association Study (GWAS) in genomics or an Epigenome-Wide Association Study (EWAS) in epigenomics. The ExWAS was first described by Patel, Bhattacharya, and Butte (2010). ExWAS are based on generalized linear models, using any formula describing the covariates the model should be adjusted for (following the standard formula options in R). That is, continuous or factor variables can be incorporated in the design, as well as interactions or splines, using standard R functions and formulas. Multiple comparisons in the ExWAS analysis are addressed by computing the number of effective tests (Meff) as described by Li and Ji (2005). The method estimates Meff from the exposure correlation matrix, which is corrected with the nearPD R function when it is not positive definite. The significance threshold is computed as \(1-(1-0.05)^{1/M_{eff}}\). This threshold is added to the Manhattan plots. When using imputed data, the analysis is done for each imputed set and the P-values are pooled to obtain a global association score.
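The multiple-testing correction can be sketched as follows. The sketch assumes the eigenvalues of the exposure correlation matrix are already available, uses the Li and Ji (2005) rule for the effective number of tests, and assumes the threshold takes the Šidák form with exponent 1/Meff. The function names are hypothetical.

```python
import math


def effective_tests(eigenvalues):
    """Li and Ji estimate: sum of I(|l| >= 1) + (|l| - floor(|l|))."""
    total = 0.0
    for value in eigenvalues:
        value = abs(value)
        total += (1.0 if value >= 1 else 0.0) + (value - math.floor(value))
    return total


def significance_threshold(meff, alpha=0.05):
    """Sidak-type threshold for meff effective tests."""
    return 1 - (1 - alpha) ** (1 / meff)


# Three uncorrelated exposures: the correlation matrix is the identity.
meff_independent = effective_tests([1.0, 1.0, 1.0])
# Three perfectly correlated exposures: a single non-zero eigenvalue.
meff_redundant = effective_tests([3.0, 0.0, 0.0])

threshold_independent = significance_threshold(meff_independent)
threshold_redundant = significance_threshold(meff_redundant)
```

Correlated exposures yield fewer effective tests and therefore a less stringent threshold than a correction based on the raw number of exposures.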
7.6.2 Stratified Single Association Analysis
The stratified analysis option for the ExWAS applies the same method as the regular ExWAS to subsets of the dataset. For example, a stratified analysis on the sex variable corresponds to performing two ExWAS, one for the male group and one for the female group.
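The stratification logic amounts to splitting the samples by the levels of one variable and running the same analysis on each subset. A minimal Python sketch, in which the `stratified` helper, the record layout and the placeholder analysis `mean_exposure` are all hypothetical (a real run would fit the ExWAS models in each subset):

```python
from collections import defaultdict


def stratified(records, stratum, analyse):
    """Run the same analysis separately on each level of the stratum."""
    groups = defaultdict(list)
    for record in records:
        groups[record[stratum]].append(record)
    return {level: analyse(rows) for level, rows in groups.items()}


def mean_exposure(rows):
    """Placeholder analysis: average exposure within one subset."""
    values = [row["exposure"] for row in rows]
    return sum(values) / len(values)


samples = [
    {"sex": "male", "exposure": 1.0},
    {"sex": "female", "exposure": 3.0},
    {"sex": "male", "exposure": 2.0},
    {"sex": "female", "exposure": 5.0},
]
results = stratified(samples, "sex", mean_exposure)
```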
7.6.3 Variable selection ExWAS
Some authors have proposed performing the association analysis in a multivariate fashion, in order to take into account the correlation across exposures Agier et al. (2016). A Lasso regression is implemented using the Elastic-Net regularized generalized linear models from the glmnet R package.
7.7 Exposome-Omic Association Analysis
Association analyses between exposures and omic data are performed by fitting linear models as described in the limma R package Ritchie et al. (2015). The pipeline implemented in association allows performing surrogate variable analysis in order to correct for unwanted variability. This adjustment is provided by the sva R package Leek et al. (2020).
7.8 Integration analysis
There are three different methodologies to perform the integration analysis:
- Multiset canonical correlation analysis (MCCA). Implemented using the MultiCCA function of the PMA R package Witten et al. (2020).
- Multiple co-inertia analysis (MCIA). Implemented using the mcia function of the omicade4 R package Meng et al. (2013), Min and Long (2020).
- Partial least squares (PLS). Implemented using the plsr function of the pls R package Mevik and Wehrens (2015).