6 Addressing batch effects

Omic data are the product of high-throughput technologies that are affected by laboratory conditions, reagent lots and personnel. Biological systems are highly sensitive to small changes in the surrounding physical and chemical conditions. Omic data therefore depend on the date and place of processing, which act as surrogate variables for uncontrolled conditions. Samples that are processed in the same laboratory on the same date will share such a batch effect in their data. All types of omic data are affected by batch effects. While the effects can be mitigated by a suitable study design, they cannot be completely removed. Clearly, if all cases are processed together and separately from the controls, their differences with controls cannot be teased apart from the batch effect. In addition, while statistical estimates can be adjusted for laboratory and date, measurements of some genes are more sensitive to processing conditions than others and will therefore be more subject to confounding.
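The point about cases being processed together can be made concrete with a small simulation. The following is a minimal sketch in base R, not an analysis from this chapter: the single simulated gene, the effect sizes, and all object names (`status`, `batch_confounded`, `batch_balanced`, `simulate_gene`) are assumptions chosen for illustration. It contrasts a design in which batch coincides with case status against a balanced design.

```r
# Simulated expression of one gene in 20 samples: 10 controls followed by 10 cases.
set.seed(1)
status <- factor(rep(c("control", "case"), each = 10),
                 levels = c("control", "case"))

# Confounded design: all controls are in batch b1 and all cases in batch b2.
batch_confounded <- factor(rep(c("b1", "b2"), each = 10))

# Balanced design: each batch contains five controls and five cases.
batch_balanced <- factor(rep(rep(c("b1", "b2"), each = 5), times = 2))

# Hypothetical true effects (assumed for illustration only):
# being a case adds 1, being processed in batch b2 adds 2.
simulate_gene <- function(batch) {
  5 + 1 * (status == "case") + 2 * (batch == "b2") + rnorm(20, sd = 0.5)
}

y_confounded <- simulate_gene(batch_confounded)
y_balanced <- simulate_gene(batch_balanced)

fit_confounded <- lm(y_confounded ~ status + batch_confounded)
fit_balanced <- lm(y_balanced ~ status + batch_balanced)

# In the confounded design the batch column duplicates the status column,
# so lm() cannot estimate it (its coefficient is NA) and the status
# coefficient absorbs the sum of both effects.
coef(fit_confounded)

# In the balanced design both effects are estimated separately.
coef(fit_balanced)
```

In the confounded fit no adjustment can recover the case effect, because the data contain no information that distinguishes it from the batch effect. This is why design choices made before processing matter more than any correction applied afterwards.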

This chapter deals with data that have already been collected; readers interested in collecting new data should refer to authors who discuss the study design of specific omic studies. Taking transcriptomic data as an example, we illustrate how to detect and correct batch effects. We discuss how to detect unwanted variation in omic data from high-throughput experiments using surrogate variable analysis (SVA). When the batch variables have been recorded, corrected datasets can be obtained using the ComBat algorithm.

The R code to reproduce the results and figures of this chapter can be found here.