I need to process data that is being provided in a password protected Excel (xlsx) workbook. For legal reasons, I cannot create an unprotected Excel file or a csv file etc and process from there. None of the Excel import packages can deal with password protected workbooks.
From this answer Import password-protected xlsx workbook into R I have managed to extract the data. However it is imported in a list of character lists format. The dput of my list looks like this:
list(list("ID", "ID1", "ID2"),
list("V2", NULL, "text2"),
list("Name", "John Smith", "Mary Brown"),
list("Score", 1, 2),
list("email", "[email protected]", "[email protected]"))
What I want is a dataframe with columns ID, V2 etc that looks like this:
ID V2 Name Score email
ID1 NULL John Smith 1 [email protected]
ID2 text2 Mary Brown 2 [email protected]
There are empty cells in the original Excel workbook, so solutions with unlist will not work.
Using a combination of answers from R list to data frame and other similar questions, I have the following code (where listform is the name of the list):
matform <- as.matrix(sapply(listform, function(s) s)) # retains empty
df <- data.frame(matform[2:nrow(matform),])
names(df) = matform[1,]
This is close, but the dataframe has lists as columns. So str(df)
yields:
'data.frame': 2 obs. of 5 variables:
$ ID:List of 2
..$ : chr "ID1"
..$ : chr "ID2"
$ V2:List of 2
..$ : NULL
..$ : chr "text2"
and so on
"SetDT" from the "data.table" package seems to be very powerful:
> library(data.table)
> null2na <- function(x){ ifelse(is.null(x),NA,x)}
> f <- function(x){sapply(x,null2na)}
> L <- list(list("ID", "ID1", "ID2"),
+ list("V2", NULL, "text2"),
+ list("Name", "John Smith", "Mary Brown"),
+ list("S ..." ... [TRUNCATED]
> L <- setDT(L)[, lapply(.SD, f)]
> setnames(L,colnames(L),unlist(L[1,]))
> L <- L[-1,]
> L
ID V2 Name Score email
1: ID1 NA John Smith 1 [email protected]
2: ID2 text2 Mary Brown 2 [email protected]
> str(L)
Classes ‘data.table’ and 'data.frame': 2 obs. of 5 variables:
$ ID : chr "ID1" "ID2"
$ V2 : chr NA "text2"
$ Name : chr "John Smith" "Mary Brown"
$ Score: chr "1" "2"
$ email: chr "[email protected]" "[email protected]"
- attr(*, ".internal.selfref")=<externalptr>
>
(A data table is a better data frame.)
The function "f" does two jobs: It "unlist"s and turns NULL into NA.
First step: extract the names:
names = lapply(listform, `[[`, 1)
data = setNames(lapply(listform, `[`, -1), names)
Second step: unlist data and bind columns
result = as.data.frame(sapply(data, unlist))
The magic here happens in sapply
, which internally calls simplify2array
. as.data.frame
at the end is needed to make a data.frame out of the resulting matrix, and to assign proper data types to the individual columns.
The above code has got one problem though: the column types probably aren’t what you want. This can be fixed as follows:
col_classes = sapply(lapply(listform, `[[`, 2), typeof)
result = as.data.frame(sapply(data, unlist), stringsAsFactors = FALSE)
for (col in seq_len(ncol(result)))
class(result[[col]]) = col_classes[col]
Now you’ll get the following result:
> str(result)
'data.frame': 2 obs. of 5 variables:
$ ID : chr "ID1" "ID2"
$ V2 : chr "text1" "text2"
$ Name : chr "John Smith" "Mary Brown"
$ Score: num 1 2
$ email: chr "[email protected]" "[email protected]"
And this, I think, is what you want.
Here's one way using data.table v1.9.5
(for the transpose()
function):
require(data.table) # v1.9.5+
setDT(sapply(ll, function(x) setattr(transpose(x[-1L]), 'names', x[[1L]])))[]
# ID V2 Name Score email
# 1: ID1 NA John Smith 1 [email protected]
# 2: ID2 text2 Mary Brown 2 [email protected]
Use setDF()
instead of setDT()
to return a data.frame
instead.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With