Part 3: Data wranggling
We are now in the position to start working with the data. The first step is to learn how to manipulate it.
Data classes
All the tools to perform data transformations are the following:
ds.asCharacter
: Useful to convert numerical values that we want as characters.ds.asDataMatrixr
: Useful to convertdata.frames
into matrices, as some functions do not accept data frames as inputs. Maintains the original class for all columns.ds.asMatrix
Useful to convertdata.frames
into matrices, as some functions do not accept data frames as inputs. Converts all columns into character class.ds.asFactor
: May cause disclosure issues if we try to convert a continuous variable. Very useful when the loaded categorical variables are represented ascharacter
instead offactor
.ds.asInteger
: Useful when we require integer values, as theclass
numeric can’t guarantee it.ds.asList
: Rarely used.ds.asLogical
: To havebool
variables. Similar to havingfactor
variables.ds.asNumeric
: Useful to convert columns that have been interpreted ascharacter
class but we want them as numbers.
When we use all this functions, we will create a new object on the study servers. If we are dealing with a data.frame
and want to include this new column we created into it, we can use the following code.
ds.asFactor(input.var.name = "data$CMXCVD", newobj.name = "CMXCVD_factor")
$all.unique.levels
[1] "Yes" "No"
$return.message
[1] "Data object <CMXCVD_factor> correctly created in all specified data sources"
::datashield.assign.expr(connections, "data", "cbind(data, CMXCVD_factor)") DSI
We will overwrite the input data
object with an added column.
tail(ds.colnames("data")[[1]])
[1] "SMXSBA_numeric" "SMXNAA_numeric" "DMRGENDR_numeric" "DATAD_year"
[5] "DATAD_year_day" "CMXCVD_factor"
Complex recoding. Number of comorbidities
Given the available variables on our dataset, we might be interested on recoding variables or creating new variables as combinations of the existing ones. This is exactly what we will do here. We will take the comorbities: CMXHT
, CMXCVD
, CMXCPD
, CMXCKD
, CMXCLD
and RFXONC
.
We will create a new variable called CMXCOM
that quantifies how many comorbidities an individual has. We will have four categories:
- 0 comorbidities
- 1 comorbidities
- 2 comorbidities
- 3+ comorbidities
The process to achieve this is not as easy as straightforward as it would be using base R functions on our computer.
To begin, we will take a look at how the variables are encoded. Looking at the codebook they should all be dicotomous variables encoded as "Yes"
/ "No"
. We can verify that with the functions we’ve covered at the Part 2 of this workshop.
ds.table("data$CMXHT")$output.list$TABLES.COMBINED_all.sources_counts
Data in all studies were valid
Study 1 : No errors reported from this study
Study 2 : No errors reported from this study
data$CMXHT
Yes No NA
436 101 1977
Now, we will recode all the variabled so they are coded as 0
/ 1
.
<- c("CMXHT", "CMXCVD",
variables "CMXCPD", "CMXCKD", "CMXCLD", "RFXONC")
for (x in variables){
ds.recodeValues(var.name = paste0("data$", x),
values2replace.vector = c("Yes", "No"),
new.values.vector = c(1, 0),
newobj = paste0(x, "_recoded"))
}
We have created a new object for each comorbiditie. The problem we now have is that the objects are of class character
, because the ds.recodeValues
function does not change the class.
ds.class("CMXHT_recoded")
$sc_verona
[1] "character"
$umf_cluj
[1] "character"
Since we want to count the numbers of comorbidities for each individual, it is important that we have a numeric
variable. As we have prevously seen we can easily obtain this.
for (x in variables){
ds.asNumeric(x.name = paste0(x, "_recoded"),
newobj = paste0(x, "_recoded_num"))
}
ds.class("CMXHT_recoded_num")
$sc_verona
[1] "numeric"
$umf_cluj
[1] "numeric"
Now we have a collection of variables encoded as we want and ready to be combined. First, we will join them on a data.frame
.
ds.dataFrame(x = paste0(variables, "_recoded_num"),
newobj = "joint_comorbidities")
$is.object.created
[1] "A data object <joint_comorbidities> has been created in all specified data sources"
$validity.check
[1] "<joint_comorbidities> appears valid in all sources"
On the server we have created a table that looks something like this:
id | RFXOB_recoded_num | CMXDI_recoded_num | CMXHT_recoded_num | … |
---|---|---|---|---|
Individual 1 | 0 | 0 | 0 | … |
Individual 2 | 1 | 1 | 1 | … |
… | … | … | … | … |
The variable we want to create is the amount of comorbidities by individual, therefore we have to perform rowSums
.
ds.rowColCalc(x = "joint_comorbidities",
operation = "rowSums",
newobj = "new_variable")
We are almost done, we will convert our new variable to be a factor
.
ds.asFactor(input.var.name = "new_variable",
newobj.name = "new_variable_factor")
$all.unique.levels
[1] "0" "1" "2" "3" "4" "5"
$return.message
[1] "Data object <new_variable_factor> correctly created in all specified data sources"
We can see that when we call the function ds.asFactor
we receive the actual levels of the variable. We have said that the levels we actually want are 0, 1, 2, 3+
.
To achieve this we just have to re-code the levels.
ds.recodeValues(var.name = "new_variable_factor",
values2replace.vector = c("0", "1", "2", "3", "4", "5", "6"),
new.values.vector = c("0", "1", "2","3+", "3+", "3+","3+"),
newobj = "CMXCOM")
$is.object.created
[1] "A data object <CMXCOM> has been created in all specified data sources"
$validity.check
[1] "<CMXCOM> appears valid in all sources"
Finally, we merge the new variable to our original data, and we are finished.
::datashield.assign.expr(connections, "data", "cbind(data, CMXCOM)") DSI
Complete cases
Real data tends to have missing values, however, some specific tools or functions expect complete data as input. For that reason there is a function specifically for that. Let’s suppose we want to have the complete cases for the variables DATAD
and CMXHT
. First we join them on a new data.frame
.
ds.dataFrame(x = c("data$DATAD", "data$CMXHT"), newobj = "data_subset")
$is.object.created
[1] "A data object <data_subset> has been created in all specified data sources"
$validity.check
[1] "<data_subset> appears valid in all sources"
Now we can get the complete cases.
ds.completeCases(x1 = "data_subset", newobj = "data_subset_ccases")
$is.object.created
[1] "A data object <data_subset_ccases> has been created in all specified data sources"
$validity.check
[1] "<data_subset_ccases> appears valid in all sources"
We can check the dimensions before and after.
ds.dim("data_subset")
$`dimensions of data_subset in sc_verona`
[1] 1515 2
$`dimensions of data_subset in umf_cluj`
[1] 999 2
$`dimensions of data_subset in combined studies`
[1] 2514 2
ds.dim("data_subset_ccases")
$`dimensions of data_subset_ccases in sc_verona`
[1] 269 2
$`dimensions of data_subset_ccases in umf_cluj`
[1] 252 2
$`dimensions of data_subset_ccases in combined studies`
[1] 521 2
Dates
For this section we will use the HM Hospitales server, as it is the only one that has the prototype package installed.
library(DSI)
library(DSOpal)
library(dsBaseClient)
library(dsDatesClient)
<- DSI::newDSLoginBuilder()
builder $append(server = "hm_hospitales",
builderurl = "https://192.168.1.50:9002",
user = "user_analisis", password = "Ekfl07UUgz")
<- builder$build()
logindata library(httr);set_config(config(ssl_verifypeer = 0L))
<- DSI::datashield.login(logins = logindata, failSafe = TRUE) connections
Logging into the collaborating servers
::datashield.assign.resource(connections, "resource", "FiHM.harmonized_data")
DSI::datashield.assign.expr(conns = connections, symbol = "data", expr = "as.resource.data.frame(resource)") DSI
To begin working with dates, first we have to convert the columns that have the dates to the appropiate class. On the unCoVer codebook, the dates are encoded as strings with the structure yyyy-mm-dd
, this does not hold for the HM Hospitales server, which has the dates encoded as POSIXct
, for that reason first we will convert them to characters. In future versions this will be solved and we will be able to work with POSIXct
data. We convert the columns using the ds.asDate
function.
ds.asCharacter("data$DATAD", "DATD_chr")
::datashield.assign.expr(connections, "data", "cbind(data, DATD_chr)")
DSIds.asCharacter("data$DATDS", "DATDS_chr")
::datashield.assign.expr(connections, "data", "cbind(data, DATDS_chr)")
DSI
ds.asDate(x.name = "data$DATD_chr", newobj = "DATD_converted", datasources = connections)
ds.asDate(x.name = "data$DATDS_chr", newobj = "DATDS_converted", datasources = connections)
We can now see that we have a column with the Date
class.
ds.class("data$DATD_converted")
$hm_hospitales
[1] "Date"
Now we have two columns that contain dates in the correct class. With that, we can create a new column with the difference in days between the two dates.
ds.dateDiff("data$DATD_converted", "data$DATDS_converted", "difference")
ds.quantileMean("data$difference")
Quantiles of the pooled data
5% 10% 25% 50% 75% 90% 95% Mean
1.0000 2.0000 4.0000 7.0000 12.0000 20.0000 28.0000 10.0701
Also, we can perform subsets of the dataframe using dates.
ds.subsetByDate("data", "DATD_converted", year = 2022, newobj = "data_subset")
$is.object.created
[1] "A data object <data_subset> has been created in all specified data sources"
$validity.check
[1] "<data_subset> appears valid in all sources"
ds.dim("data_subset")
$`dimensions of data_subset in hm_hospitales`
[1] 549 224
$`dimensions of data_subset in combined studies`
[1] 549 224
The subset function can be used for more complex queries. For extra information make sure to check ?ds.subsetByDate
.
ds.subsetByDate("data", "DATD_converted", year = 2022, month = 4)
ds.subsetByDate("data", "DATD_converted", range = c("2021-05-23", "2021-09-25"))