22 Advanced/technical questions

We received a number of questions from the reviewers when submitting our paper to PLOS Computational Biology. Most of them were addressed in the manuscript. However, some of the questions raised by one of the reviewers focused on technical aspects or implementation details of the resources. These are reproduced here in case some readers are also interested in such advanced aspects.

Comment 1: Resource credentials are fixed and managed by Opal, which implies you do not have a refresh token or anything similar that will expire over time. The policy decision and enforcement is now located in the DataSHIELD engine. This means that the resource owner can no longer decide if someone has access to the resource. Audit logging on the resource side is hard to do this way.
Answer: It is true that a more elaborate authentication/authorization policy cannot be handled on the DataSHIELD server side. Any token must be valid for programmatic usage, and if it happens to expire or be invalidated, it needs to be renewed and updated in Opal's definition of the resource. Regarding auditing, all DataSHIELD user commands are logged by the Opal server. The data owner has control over both the resource data access credentials and the permissions to use this resource.
Comment 2: If the computational resource needs to do a lot, you need to program that either in a resource extension or as given commands. When you program an extension, you depend on the author's choices regarding the interface they expose, and when you give it as parameters in the resource, you need to parse it in the resourcer package. For example, how do you prevent malicious SQL injection? We would expect that you would not open up such resources to the external source, but only a pre-packaged analysis.
Answer: The DataSHIELD R analysis procedure that makes use of a resource is responsible for interacting in a secure way with the underlying resource system (a database or an SSH server, for instance), like any application that makes use of a database. The best practice is to expose a "business"-oriented API by defining a limited number of parameters that are easy to validate for each specific operation. This is also facilitated by a built-in feature of the DataSHIELD infrastructure, which only allows a limited syntax for expressing the server-side function call parameters.
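As an illustration, a hypothetical server-side function could validate its single business parameter and then bind it as a query parameter, so that user input is never concatenated into the SQL string (a minimal sketch; the function, table and column names are made up, and the `?` placeholder syntax varies with the database driver):

```r
# A minimal sketch of a safe, "business"-oriented server-side function;
# countByGeneDS, the variants table and the gene column are hypothetical.
countByGeneDS <- function(client, gene) {
  # validate the business parameter before touching the database
  stopifnot(is.character(gene), length(gene) == 1,
            grepl("^[A-Za-z0-9_.-]+$", gene))
  conn <- client$getConnection() # assumes a DBI-based resource client
  # the value is bound as a query parameter, not pasted into the SQL
  # string, which prevents SQL injection
  DBI::dbGetQuery(conn,
                  "SELECT COUNT(*) AS n FROM variants WHERE gene = ?",
                  params = list(gene))$n
}
```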
Comment 3: You are still bound to the limitations of R in terms of memory and CPU usage when you want to correlate data in R against the data that is available in the resource. You need to either push the data you want to correlate against into the resource or extract it from the resource into the R environment.
Answer: The resource API offers the possibility to work with the dplyr package, which delegates as much of the work as possible to the underlying database. More generally, R is only required as an entry point, and the real data analysis can happen in any system that is accessible from R (for instance, an ML analysis can be launched from R in an Apache Spark cluster). The DataSHIELD analysis procedures must be programmed that way to support the analysis of large to big datasets. Another improvement that is currently being developed (EUCAN-Connect) is the capability of having multiple R servers for a single DataSHIELD data node, in order to address the case of multiple users concurrently consuming the memory and CPU of a server.
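The following minimal sketch shows this delegation with dplyr/dbplyr; the table and columns are illustrative, and in a DataSHIELD setting the connection would come from a SQL resource client rather than being opened directly:

```r
# Delegating computation to the database with dplyr/dbplyr (illustrative data).
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")   # stand-in connection
DBI::dbWriteTable(con, "samples",
                  data.frame(cohort = c("A", "A", "B"),
                             age = c(45, 61, 38),
                             bmi = c(24.1, 27.5, 22.8)))

samples <- tbl(con, "samples")   # lazy reference: no data pulled into R yet

# the pipeline below is translated to SQL and executed by the database;
# only the small aggregated result crosses into the R session at collect()
res <- samples %>%
  filter(age >= 40) %>%
  group_by(cohort) %>%
  summarise(mean_bmi = mean(bmi, na.rm = TRUE)) %>%
  collect()

DBI::dbDisconnect(con)
```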
Comment 4: What is going to happen when you want to finish the analysis at a later moment? Or, when the analysis takes days, how do you retrieve the result?
Answer: A plain R server is probably not the most suitable system for handling long-running blocking tasks. Usually this kind of situation is addressed by submitting a task execution request to a worker, which runs the task in the background and makes the progress of the task and its result available for later retrieval. This type of architecture is available in many languages and systems and could be implemented in a DataSHIELD analysis procedure.
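A minimal sketch of this submit/poll/retrieve pattern, using the callr package as one possible worker mechanism among many:

```r
# Submit a long-running task to a background worker and poll it (callr).
library(callr)

# submit: run the long computation in a background R process
job <- r_bg(function() {
  Sys.sleep(5)            # stand-in for an analysis taking hours or days
  list(estimate = 42)
})

job$is_alive()            # poll: check progress without blocking the session

job$wait()                # block until done (or keep polling periodically)
result <- job$get_result()  # retrieve the result once the job has finished
```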
Comment 5: When the analysis takes a lot of resources in terms of memory and CPU, how do you limit this per DataSHIELD user? Who is responsible for this, DataSHIELD or the resource owner?
Answer: It is possible in R to limit the memory and CPU usage (and much more) using the RAppArmor package. One possible improvement would be the ability to define an AppArmor profile per user or group of users (this profile would be applied at the start of a DataSHIELD session in each data node). The data owner would have control over the definition of the profiles and which one applies to each user.
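A minimal sketch of what such a confinement could look like (Linux only; the profile name "datashield-user" is hypothetical and would have to be defined in the AppArmor configuration by the data owner):

```r
# Confining an R session with RAppArmor (profile name is hypothetical).
library(RAppArmor)

rlimit_as(2e9)    # cap the process address space (memory) at ~2 GB
rlimit_cpu(3600)  # cap CPU time at one hour

# switch the process into a restricted AppArmor profile; this cannot be
# reverted for the lifetime of the process
aa_change_profile("datashield-user")
```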
Comment 6: The resource owner needs to implement a way to be easily accessible to the resourcer package. Especially when you want to run more complex jobs, it usually takes a complex interface to work with.
Answer: Building access to a resource is one complexity, which is in practice limited. Allowing multiple/complex parameters for the analysis of a resource is another one, which takes place in the definition of the DataSHIELD R server analysis functions, not in the definition of the resource.
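A minimal sketch of the resourcer side illustrates this: the resource definition itself stays a simple name/URL/credentials/format tuple, whatever the complexity of the downstream analysis (the URL and credentials below are illustrative):

```r
# Defining and accessing a resource with resourcer (illustrative values).
library(resourcer)

res <- newResource(
  name = "cnsim1",
  url = "postgres://db.example.org:5432/cohorts/cnsim1",
  identity = "reader",
  secret = "s3cret",
  format = "data.frame"
)

# the factory selects a suitable client implementation from the URL scheme
client <- newResourceClient(res)
df <- as.data.frame(client)   # coerce the resource content to a data.frame
client$close()
```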
Comment 7: For the shell and SSH resources, the Opal administrator manages the actions that may be performed by the resource handler (a researcher in general). When Opal is hosted on a different infrastructure than the resource and is managed by someone who is not in charge of managing the resource, commands can be freely added to the resource definition. This potentially allows actions that the resource owner never intended to expose.
Answer: The example of the PLINK resource (i.e., running PLINK through SSH) is put into practice by using an SSH server in a Docker container: the possible actions are limited to the commands and data available in this container to the user that connects to the SSH server (a user which itself has limited rights). Extra security could be added by setting up an AppArmor-constrained environment. This illustrates that security enforcement is possible, but needs to be thought out ahead of deploying access to such a resource.
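A minimal sketch of such a command-restricted SSH resource, with illustrative host, path and credentials; in resourcer, the `exec` query parameter of the URL whitelists the commands that may be run through the connection:

```r
# An SSH resource limited to a whitelist of commands (illustrative values).
library(resourcer)

res <- newResource(
  name = "plink",
  url = "ssh://plink-server:2222/home/master/brge?exec=ls,plink",
  identity = "master",
  secret = "master"
)

client <- newResourceClient(res)  # an SSH resource client
out <- client$exec("ls")          # allowed: listed in the 'exec' parameter
# client$exec("rm")               # would be rejected: not whitelisted
client$close()
```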
Comment 8: Maybe fewer dependencies in the resourcer package?
Answer: We have tried to find a balance between showing the out-of-the-box capabilities of various types of resources and not overloading the package. Note that most of the dependencies are suggested packages and are not required at installation time. Since the review of the paper, several additional resource extensions have been developed to access less common or more specific resources (Dremio, HL7 FHIR, S3).
Comment 9: When the interface of a dependency changes, the package needs to be changed as well.
Answer: The resourcer package dependencies are well-established packages (DBI, tidyverse, etc.) whose interfaces should not change in the near future. More experimental ones (ODBC, aws.s3) are proposed as additional packages.
Comment 10: Think of a way to delegate the authorization and authentication back to the resource owner. We would propose to change the way of passing credentials in the resource.
Answer: The resource owner also owns the resource definition in the Opal server, and therefore remains in control of the credentials stored in it (see the answer to Comment 1).
Comment 11: Is it possible to restrict the usage of certain resources? For example, to prohibit the use of some resources.
Answer: The usage of a resource requires user/group permissions to be set up in Opal. In addition to that, the credentials used to access the resource can be permanently or temporarily invalidated by the resource owner.
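As an illustration, such permissions can also be managed programmatically through the opalr administration package (a hedged sketch: the project, resource and user names are illustrative, and the available permission values depend on the Opal version):

```r
# Granting a user the permission to use a resource via opalr
# (illustrative names and credentials).
library(opalr)

o <- opal.login(username = "administrator", password = "password",
                url = "https://opal.example.org")

# let user 'researcher1' view/use the 'cnsim1' resource of project 'CNSIM'
opal.resource_perm_add(o, project = "CNSIM", resource = "cnsim1",
                       subject = "researcher1", type = "user",
                       permission = "view")

opal.logout(o)
```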
Comment 12: It would be nice to have some way of enforcing that the metadata tied to the data is correct. This is often a problem with longitudinal data, where the metadata over all columns should be correct. It is likely that this also holds for big data structures that are analysed in a pooled manner. How do you handle multiple versions of VCF, for example?
Answer: This is a data harmonization issue that is usually addressed before the analysis. Data format validation, if needed, can also be performed by the analyst using DataSHIELD functions. Regarding the different versions of VCF, we aim to harmonize the data by first performing imputation and solving issues concerning the genomic strand and file format (PMID: 25495213).