1 Introduction
1.1 Materials to read beforehand
Throughout this book, some details about DataSHIELD and “resources” are not explained in depth, since the reader is expected to be familiar with them. If that is not the case, the following free online books and papers cover that material.
DataSHIELD paper: A description of what DataSHIELD is.
DataSHIELD wiki: Materials about DataSHIELD including:
- Beginner material
- Recorded DataSHIELD workshops
- Information on current release of DataSHIELD
resource book: In this book you will find information about:
- DataSHIELD (Section 5)
- What resources are (Section 6/7)
We will be interacting with DataSHIELD through a data warehouse called Opal. This is the server that handles the authentication of our credentials and the storage of data and “resources”, and that provides an R server where the non-disclosive analysis is conducted. Information about it can also be found online:
- Opal papers 1; 2
- Opal documentation
1.2 What are “resources”: A very simple explanation without any technicalities
It is important to have a solid understanding of what “resources” are and how we work with them, since in all the use cases we rely on them to load the omic data into the R sessions. For that reason, we include a very brief, non-technical description of them here.
The “resources” can be imagined as a data structure that contains the information about where to find a data set and the access credentials to it; we as DataSHIELD users are not able to look at this information (it is privately stored on the Opal server), but we can load it into our remote R session to make use of it. Following that, the next step comes naturally.
Once the remote R session holds the information needed to access a dataset (an ExpressionSet, for example), we have to actually retrieve that dataset in the remote R session in order to analyze it. This step is called resolving the resource.
These two steps appear in the code we provide as follows:
Loading the information of a “resource”:
DSI::datashield.assign.resource(conns, "resource", "resource.path.in.opal.server")
Resolving the “resource”:
DSI::datashield.assign.expr(conns, "resource.resolved", expr = as.symbol("as.resource.object(resource)"))
This toy code first loads the “resource” into a variable called resource, then retrieves the data it points to and assigns it to a variable called resource.resolved.
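The two steps can also be wrapped into a single helper. The following is a minimal sketch, not code taken from OmicSHIELD: the function name load_and_resolve is hypothetical, conns is assumed to be the connections object returned by DSI::datashield.login(), and the resource path is the same placeholder used above. It uses quote() to pass the server-side call unevaluated.

```r
# Sketch only: assumes the DSI package is installed and that `conns` was
# returned by DSI::datashield.login(). `load_and_resolve` is a hypothetical
# helper name and the default resource path is a placeholder.
load_and_resolve <- function(conns,
                             resource_path = "resource.path.in.opal.server") {
  # Step 1: load the "resource" description (location and credentials)
  # into the remote R session under the symbol "resource".
  DSI::datashield.assign.resource(conns,
                                  symbol = "resource",
                                  resource = resource_path)

  # Step 2: resolve the "resource", i.e. retrieve the object it points to
  # and assign it to the remote symbol "resource.resolved".
  DSI::datashield.assign.expr(conns,
                              symbol = "resource.resolved",
                              expr = quote(as.resource.object(resource)))

  # List the symbols now defined in each remote session, so the caller
  # can check that both "resource" and "resource.resolved" exist.
  DSI::datashield.symbols(conns)
}
```

Nothing is executed until the function is called with a valid conns object; at no point is the dataset itself, or the credentials stored in the “resource”, sent to the client.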
1.3 Capabilities of OmicSHIELD
The functionalities of OmicSHIELD are built on top of the “resources” to work with different types of data objects. More precisely, we have developed capabilities to work with the following R objects:
- ExpressionSet
- RangedSummarizedExperiment
- VCF/GDS (Genotype data containers)
These objects are analyzed using Bioconductor packages as well as custom-made functions. This ensures that researchers familiar with the Bioconductor universe will feel at home when using OmicSHIELD.
In addition to the Bioconductor approach, we have developed functionalities to make use of command line tools that are traditionally used in omics analysis:
- PLINK
- SNPTEST
This allows researchers to analyze federated data using their own command line based pipelines. Again, this ensures that people familiar with those tools will be able to perform analyses easily.
1.4 Opal servers
Throughout this bookdown there are reproducible examples that make use of two different Opal servers. Information about the technology, and about setting up Opal servers at your institution, can be found at the following links: 1, 2.
Information about the Opal servers used:
 | Opal 1 | Opal 2 |
---|---|---|
URL | https://opal-demo.obiba.org/ | https://datashield.isglobal.org/repo |
Host | Obiba | ISGlobal |
Cores | 12 | 72 |
RAM | 18 GB | 218 GB |
Details | | |
Credentials | | Upon request |
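As an illustration of how the two servers listed above could be addressed from a single client session, the sketch below builds a DSI login table. It is an assumption-laden example rather than the book's own setup code: the function name build_opal_logins and the server labels opal1 and opal2 are hypothetical, and the credentials are left as arguments that you must supply yourself.

```r
# Sketch only: assumes the DSI package is installed. `build_opal_logins`
# and the labels "opal1"/"opal2" are hypothetical names; credentials are
# supplied by the caller and are not provided here.
build_opal_logins <- function(user1, password1, user2, password2) {
  builder <- DSI::newDSLoginBuilder()

  # Opal 1: the demo server hosted by Obiba.
  builder$append(server = "opal1",
                 url = "https://opal-demo.obiba.org/",
                 user = user1, password = password1,
                 driver = "OpalDriver")

  # Opal 2: the server hosted by ISGlobal.
  builder$append(server = "opal2",
                 url = "https://datashield.isglobal.org/repo",
                 user = user2, password = password2,
                 driver = "OpalDriver")

  # Return the login table expected by DSI::datashield.login().
  builder$build()
}

# Usage (not run; requires valid credentials and the DSOpal package):
# library(DSOpal)
# logins <- build_opal_logins("<user1>", "<password1>", "<user2>", "<password2>")
# conns <- DSI::datashield.login(logins = logins)
# DSI::datashield.logout(conns)
```

Keeping the credentials as function arguments avoids hard-coding them in scripts that may be shared.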