1 Introduction

1.1 Materials to read beforehand

Throughout this book, some details about DataSHIELD and “resources” are not explained in depth, because the reader is expected to be familiar with them. If that is not the case, the following free online books and papers cover that background.

  • DataSHIELD paper: Description of what is DataSHIELD.

  • DataSHIELD wiki: Materials about DataSHIELD including:

    • Beginner material
    • Recorded DataSHIELD workshops
    • Information on current release of DataSHIELD
  • resource book: In this book you will find information about:

    • DataSHIELD (Section 5)
    • What are resources (Section 6/7)

We will be interacting with DataSHIELD through a data warehouse called Opal. This is the server that handles the authentication of our credentials and the storage of data and “resources”, and that provides an R server where the non-disclosive analysis is conducted. Information about it can also be found online.

1.2 What are “resources”: A very simple explanation without any technicalities

It is important to have a solid understanding of what “resources” are and how we work with them, since in all the use cases we interact with them to load the omic data into the R sessions. For that reason we include a very brief, non-technical description of them here.

The “resources” can be imagined as a data structure that contains the information about where to find a data set and the access credentials to it; we as DataSHIELD users are not able to look at this information (it is privately stored on the Opal server), but we can load it into our remote R session to make use of it. Following that, the next step comes naturally.

Once the remote R session holds the information needed to access a dataset (an ExpressionSet, for example), we have to actually retrieve that dataset in the remote R session in order to analyze it. This step is called resolving the resource.

These two steps appear in the code we provide as follows:

Loading the information of a “resource”:

DSI::datashield.assign.resource(conns, "resource", "resource.path.in.opal.server")

Resolving the “resource”:

DSI::datashield.assign.expr(conns, "resource.resolved", expr = as.symbol("as.resource.object(resource)"))

This toy code first loads the “resource” into a variable called resource on the remote R session, then resolves it and assigns the retrieved object to a variable called resource.resolved.
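To make the distinction between loading and resolving more concrete, the two steps can be imitated locally with plain base R. This is only a toy model under stated assumptions: server_storage, make_resource and resolve_resource are hypothetical names invented for this illustration, the URL and credentials are made up, and nothing here calls DataSHIELD or Opal.

```r
# Toy model only: plain base R, no DataSHIELD or Opal involved.
# "server_storage" stands in for data that lives on the server side.
server_storage <- list(
  "https://example.org/data/expression.rds" = data.frame(
    sample = c("s1", "s2", "s3"),
    expression = c(2.1, 3.4, 1.8)
  )
)

# A "resource" is only a descriptor: where the data lives and how to access it.
make_resource <- function(url, user, password) {
  structure(list(url = url, user = user, password = password),
            class = "toy_resource")
}

# Resolving follows the descriptor and returns the actual data object.
resolve_resource <- function(resource, storage) {
  stopifnot(inherits(resource, "toy_resource"))
  if (!resource$url %in% names(storage)) {
    stop("No data set found at ", resource$url)
  }
  storage[[resource$url]]
}

# Step 1: load the descriptor (plays the role of datashield.assign.resource).
resource <- make_resource("https://example.org/data/expression.rds",
                          user = "reader", password = "secret")

# Step 2: resolve it (plays the role of as.resource.object).
resource.resolved <- resolve_resource(resource, server_storage)

class(resource)           # "toy_resource": only the descriptor
class(resource.resolved)  # "data.frame": the data it points to
```

In the real setting both objects live on the remote R session, so the analyst never sees the descriptor's contents or the resolved data directly.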

1.3 Capabilities of OmicSHIELD

The functionalities of OmicSHIELD are built on top of the “resources” to work with different types of data objects. More precisely, we have developed capabilities to work with the following R objects:

  • ExpressionSet
  • RangedSummarizedExperiment
  • VCF/GDS (Genotype data containers)

These objects are analyzed using Bioconductor packages as well as custom-made functions. This ensures that researchers familiar with the Bioconductor universe will feel at home when using OmicSHIELD.

Besides the Bioconductor approach, we have also developed functionalities to make use of command line tools that are traditionally used in omics analysis, namely:

  • PLINK
  • SNPTEST

This allows researchers to perform analyses on federated data using their own command line based pipelines. Again, this ensures that people familiar with those tools will be able to perform analyses easily.

1.4 Opal servers

Throughout this bookdown there are reproducible examples that make use of two different Opal servers. Information about the technology, and guidance on setting up Opal servers at your institution, can be found at the following links: 1, 2.

Information about the Opal servers used:

              Opal 1                                    Opal 2
URL           https://opal-demo.obiba.org/              https://datashield.isglobal.org/repo
Host          Obiba                                     ISGlobal
Cores         12                                        72
RAM           18 GB                                     218 GB
Details       For development purposes; daily rebuild   Only accessible with ISGlobal
              with static data and libraries            permissions
Credentials   User: dsuser; Password: P@ssw0rd          Upon request
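As a rough illustration of how the Opal 1 details above might be used, the following sketch opens a DataSHIELD connection and then runs the two “resource” steps from Section 1.2. It assumes that the DSI, DSOpal and dsBaseClient packages are installed and that the server is reachable. The function names connect_opal_demo and load_and_resolve and the server label "opal1" are hypothetical choices made for this example, and the resource path is left as an argument because it depends on the project being used.

```r
# Sketch only: needs the DSI, DSOpal and dsBaseClient packages plus network access.

# Open a DataSHIELD connection to Opal 1 with the public demo credentials.
connect_opal_demo <- function() {
  for (pkg in c("DSI", "DSOpal")) {
    # Loading DSOpal registers the "OpalDriver" class that DSI looks up at login.
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Package '", pkg, "' is required for this example.")
    }
  }
  builder <- DSI::newDSLoginBuilder()
  builder$append(
    server = "opal1",                      # label chosen for this example
    url = "https://opal-demo.obiba.org",
    user = "dsuser",
    password = "P@ssw0rd",
    driver = "OpalDriver"
  )
  DSI::datashield.login(logins = builder$build())
}

# Load a "resource" and resolve it on the remote R session.
# resource_path is whatever path the resource has on the Opal server.
load_and_resolve <- function(conns, resource_path) {
  DSI::datashield.assign.resource(conns, "resource", resource_path)
  DSI::datashield.assign.expr(
    conns, "resource.resolved",
    expr = as.symbol("as.resource.object(resource)")
  )
  # Ask the server for the class of the resolved object (non-disclosive).
  dsBaseClient::ds.class("resource.resolved", datasources = conns)
}
```

A session would then look like conns <- connect_opal_demo(), followed by load_and_resolve(conns, "resource.path.in.opal.server") and finally DSI::datashield.logout(conns) to close the connection.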