Part 3: Data wranggling

We are now in the position to start working with the data. The first step is to learn how to manipulate it.

Data classes

All the tools to perform data transformations are the following:

  • ds.asCharacter: Useful to convert numerical values that we want as characters.
  • ds.asDataMatrixr: Useful to convert data.frames into matrices, as some functions do not accept data frames as inputs. Maintains the original class for all columns.
  • ds.asMatrixUseful to convert data.frames into matrices, as some functions do not accept data frames as inputs. Converts all columns into character class.
  • ds.asFactor: May cause disclosure issues if we try to convert a continuous variable. Very useful when the loaded categorical variables are represented as character instead of factor.
  • ds.asInteger: Useful when we require integer values, as the class numeric can’t guarantee it.
  • ds.asList: Rarely used.
  • ds.asLogical: To have bool variables. Similar to having factor variables.
  • ds.asNumeric: Useful to convert columns that have been interpreted as character class but we want them as numbers.

When we use all this functions, we will create a new object on the study servers. If we are dealing with a data.frame and want to include this new column we created into it, we can use the following code.

ds.asFactor(input.var.name = "data$CMXCVD", newobj.name = "CMXCVD_factor")
$all.unique.levels
[1] "Yes" "No" 

$return.message
[1] "Data object <CMXCVD_factor> correctly created in all specified data sources"
DSI::datashield.assign.expr(connections, "data", "cbind(data, CMXCVD_factor)")

We will overwrite the input data object with an added column.

tail(ds.colnames("data")[[1]])
[1] "SMXSBA_numeric"   "SMXNAA_numeric"   "DMRGENDR_numeric" "DATAD_year"      
[5] "DATAD_year_day"   "CMXCVD_factor"   

Complex recoding. Number of comorbidities

Given the available variables on our dataset, we might be interested on recoding variables or creating new variables as combinations of the existing ones. This is exactly what we will do here. We will take the comorbities: CMXHT, CMXCVD, CMXCPD, CMXCKD, CMXCLD and RFXONC.

We will create a new variable called CMXCOM that quantifies how many comorbidities an individual has. We will have four categories:

  • 0 comorbidities
  • 1 comorbidities
  • 2 comorbidities
  • 3+ comorbidities

The process to achieve this is not as easy as straightforward as it would be using base R functions on our computer.

To begin, we will take a look at how the variables are encoded. Looking at the codebook they should all be dicotomous variables encoded as "Yes" / "No". We can verify that with the functions we’ve covered at the Part 2 of this workshop.

ds.table("data$CMXHT")$output.list$TABLES.COMBINED_all.sources_counts

 Data in all studies were valid 

Study 1 :  No errors reported from this study
Study 2 :  No errors reported from this study
data$CMXHT
 Yes   No   NA 
 436  101 1977 

Now, we will recode all the variabled so they are coded as 0 / 1.

variables <- c("CMXHT", "CMXCVD", 
               "CMXCPD", "CMXCKD", "CMXCLD", "RFXONC")

for (x in variables){
  ds.recodeValues(var.name = paste0("data$", x), 
                  values2replace.vector = c("Yes", "No"), 
                  new.values.vector = c(1, 0),
                  newobj = paste0(x, "_recoded"))
}

We have created a new object for each comorbiditie. The problem we now have is that the objects are of class character, because the ds.recodeValues function does not change the class.

ds.class("CMXHT_recoded")
$sc_verona
[1] "character"

$umf_cluj
[1] "character"

Since we want to count the numbers of comorbidities for each individual, it is important that we have a numeric variable. As we have prevously seen we can easily obtain this.

for (x in variables){
  ds.asNumeric(x.name = paste0(x, "_recoded"), 
               newobj = paste0(x, "_recoded_num"))
}
ds.class("CMXHT_recoded_num")
$sc_verona
[1] "numeric"

$umf_cluj
[1] "numeric"

Now we have a collection of variables encoded as we want and ready to be combined. First, we will join them on a data.frame.

ds.dataFrame(x = paste0(variables, "_recoded_num"), 
             newobj = "joint_comorbidities")
$is.object.created
[1] "A data object <joint_comorbidities> has been created in all specified data sources"

$validity.check
[1] "<joint_comorbidities> appears valid in all sources"

On the server we have created a table that looks something like this:

id RFXOB_recoded_num CMXDI_recoded_num CMXHT_recoded_num
Individual 1 0 0 0
Individual 2 1 1 1

The variable we want to create is the amount of comorbidities by individual, therefore we have to perform rowSums.

ds.rowColCalc(x = "joint_comorbidities", 
              operation = "rowSums", 
              newobj = "new_variable")

We are almost done, we will convert our new variable to be a factor.

ds.asFactor(input.var.name = "new_variable", 
            newobj.name = "new_variable_factor")
$all.unique.levels
[1] "0" "1" "2" "3" "4" "5"

$return.message
[1] "Data object <new_variable_factor> correctly created in all specified data sources"

We can see that when we call the function ds.asFactor we receive the actual levels of the variable. We have said that the levels we actually want are 0, 1, 2, 3+.

To achieve this we just have to re-code the levels.

ds.recodeValues(var.name = "new_variable_factor", 
                values2replace.vector = c("0", "1", "2", "3", "4", "5", "6"),
                new.values.vector = c("0", "1", "2","3+", "3+", "3+","3+"), 
                newobj = "CMXCOM")
$is.object.created
[1] "A data object <CMXCOM> has been created in all specified data sources"

$validity.check
[1] "<CMXCOM> appears valid in all sources"

Finally, we merge the new variable to our original data, and we are finished.

DSI::datashield.assign.expr(connections, "data", "cbind(data, CMXCOM)")

Complete cases

Real data tends to have missing values, however, some specific tools or functions expect complete data as input. For that reason there is a function specifically for that. Let’s suppose we want to have the complete cases for the variables DATAD and CMXHT. First we join them on a new data.frame.

ds.dataFrame(x = c("data$DATAD", "data$CMXHT"), newobj = "data_subset")
$is.object.created
[1] "A data object <data_subset> has been created in all specified data sources"

$validity.check
[1] "<data_subset> appears valid in all sources"

Now we can get the complete cases.

ds.completeCases(x1 = "data_subset", newobj = "data_subset_ccases")
$is.object.created
[1] "A data object <data_subset_ccases> has been created in all specified data sources"

$validity.check
[1] "<data_subset_ccases> appears valid in all sources"

We can check the dimensions before and after.

ds.dim("data_subset")
$`dimensions of data_subset in sc_verona`
[1] 1515    2

$`dimensions of data_subset in umf_cluj`
[1] 999   2

$`dimensions of data_subset in combined studies`
[1] 2514    2
ds.dim("data_subset_ccases")
$`dimensions of data_subset_ccases in sc_verona`
[1] 269   2

$`dimensions of data_subset_ccases in umf_cluj`
[1] 252   2

$`dimensions of data_subset_ccases in combined studies`
[1] 521   2

Dates

For this section we will use the HM Hospitales server, as it is the only one that has the prototype package installed.

library(DSI)
library(DSOpal)
library(dsBaseClient)
library(dsDatesClient)
builder <- DSI::newDSLoginBuilder()
builder$append(server = "hm_hospitales",
               url = "https://192.168.1.50:9002",
               user = "user_analisis", password = "Ekfl07UUgz")
logindata <- builder$build()
library(httr);set_config(config(ssl_verifypeer = 0L))
connections <- DSI::datashield.login(logins = logindata, failSafe = TRUE)

Logging into the collaborating servers
DSI::datashield.assign.resource(connections, "resource", "FiHM.harmonized_data")
DSI::datashield.assign.expr(conns = connections, symbol = "data", expr = "as.resource.data.frame(resource)")

To begin working with dates, first we have to convert the columns that have the dates to the appropiate class. On the unCoVer codebook, the dates are encoded as strings with the structure yyyy-mm-dd, this does not hold for the HM Hospitales server, which has the dates encoded as POSIXct, for that reason first we will convert them to characters. In future versions this will be solved and we will be able to work with POSIXct data. We convert the columns using the ds.asDate function.

ds.asCharacter("data$DATAD", "DATD_chr")
DSI::datashield.assign.expr(connections, "data", "cbind(data, DATD_chr)")
ds.asCharacter("data$DATDS", "DATDS_chr")
DSI::datashield.assign.expr(connections, "data", "cbind(data, DATDS_chr)")

ds.asDate(x.name = "data$DATD_chr", newobj = "DATD_converted", datasources = connections)
ds.asDate(x.name = "data$DATDS_chr", newobj = "DATDS_converted", datasources = connections)

We can now see that we have a column with the Date class.

ds.class("data$DATD_converted")
$hm_hospitales
[1] "Date"

Now we have two columns that contain dates in the correct class. With that, we can create a new column with the difference in days between the two dates.

ds.dateDiff("data$DATD_converted", "data$DATDS_converted", "difference")
ds.quantileMean("data$difference")
 Quantiles of the pooled data
     5%     10%     25%     50%     75%     90%     95%    Mean 
 1.0000  2.0000  4.0000  7.0000 12.0000 20.0000 28.0000 10.0701 

Also, we can perform subsets of the dataframe using dates.

ds.subsetByDate("data", "DATD_converted", year = 2022, newobj = "data_subset")
$is.object.created
[1] "A data object <data_subset> has been created in all specified data sources"

$validity.check
[1] "<data_subset> appears valid in all sources"
ds.dim("data_subset")
$`dimensions of data_subset in hm_hospitales`
[1] 549 224

$`dimensions of data_subset in combined studies`
[1] 549 224

The subset function can be used for more complex queries. For extra information make sure to check ?ds.subsetByDate.

ds.subsetByDate("data", "DATD_converted", year = 2022, month = 4)
ds.subsetByDate("data", "DATD_converted", range = c("2021-05-23", "2021-09-25"))