Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Melt using patterns when variable names contain string information - avoid coercion to numeric

Tags:

r

data.table

melt

I am using the patterns() argument in data.table::melt() to melt data that has columns that have several easily-defined patterns. It is working, but I'm not seeing how I can create a character index variable instead of the default numeric breakdown.

For example, in data set 'A', the dog and cat column names have numeric suffixes (e.g. 'dog_1', 'cat_2'), which are handled correctly in melt (see the resulting 'variable' column):

A = data.table(idcol = c(1:5),
            dog_1 = c(1:5),   cat_1 = c(101:105),
            dog_2 = c(6:10),  cat_2 = c(106:110),
            dog_3 = c(11:15), cat_3 = c(111:115))  

head(melt(A, measure = patterns("^dog", "^cat"), value.name = c("dog", "cat")))
   
   idcol variable dog cat
1:     1        1   1 101
2:     2        1   2 102
3:     3        1   3 103
4:     4        1   4 104
5:     5        1   5 105
6:     1        2   6 106

However, in data set 'B', the suffix of dog and cat columns is a string (e.g. 'dog_one', 'cat_two'). Such suffixes are converted to a numeric representation in melt, see the "variable" column.

B = data.table(idcol = c(1:5),
                dog_one = c(1:5),     cat_one = c(101:105),
                dog_two = c(6:10),    cat_two = c(106:110),
                dog_three = c(11:15), cat_three = c(111:115))

head(melt(B, measure = patterns("^dog", "^cat"), value.name = c("dog", "cat")))

   idcol variable dog cat
1:     1        1   1 101
2:     2        1   2 102
3:     3        1   3 103
4:     4        1   4 104
5:     5        1   5 105
6:     1        2   6 106

How can I fill the "variable" column with the correct string suffixes one/two/three instead of 1/2/3?

like image 706
Nancy Avatar asked Jan 26 '17 21:01

Nancy


Video Answer


1 Answers

From data.table 1.14.1 (in development; installation), the new function measure makes it much easier to melt data with concatenated variable names to a desired format (see ?measure.

The separator argument is used to create different groups of measure.vars. In the ... argument, we further specify the fate of the values corresponding to the groups generated by sep.

In OP, the variable names are of the form species_number, e.g. dog_one. Thus, we need two symbols in ... to specify how groups before and after the separator should be treated, one for the species (dog or cat) and one for the numbers (one-three).

If a symbol in ... is set to value.name, then "melt returns multiple value columns (with names defined by the unique values in that group)". Thus, because you want multiple columns for each species, the first group defined by the separator, the first symbol in ... should be value.name.

The second group, after the separator, are the numbers, so this is specified as the second symbol in .... We want in a single value column for the numbers, so here we specify the desired column name of the output variable, e.g. "nr".

melt(B, measure.vars = measure(value.name, nr, sep = "_"))

      idcol    nr dog cat
#  1:     1   one   1 101
#  2:     2   one   2 102
#  3:     3   one   3 103
#  4:     4   one   4 104
#  5:     5   one   5 105
#  6:     1   two   6 106
#  7:     2   two   7 107
#  8:     3   two   8 108
#  9:     4   two   9 109
# 10:     5   two  10 110
# 11:     1 three  11 111
# 12:     2 three  12 112
# 13:     3 three  13 113
# 14:     4 three  14 114
# 15:     5 three  15 115

Pre data.table 1.14.1

There might be easier ways, but this seems to work:

# grab suffixes of 'variable' names
suff <- unique(sub('^.*_', '', names(B[ , -1])))
# suff <- unique(tstrsplit(names(B[, -1]), "_")[[2]])

# melt
B2 <- melt(B, measure = patterns("^dog", "^cat"), value.name = c("dog", "cat"))
   
# replace factor levels in 'variable' with the suffixes
setattr(B2$variable, "levels", suff)

B2
#     idcol variable dog cat
# 1:      1      one   1 101
# 2:      2      one   2 102
# 3:      3      one   3 103
# 4:      4      one   4 104
# 5:      5      one   5 105
# 6:      1      two   6 106
# 7:      2      two   7 107
# 8:      3      two   8 108
# 9:      4      two   9 109
# 10:     5      two  10 110
# 11:     1    three  11 111
# 12:     2    three  12 112
# 13:     3    three  13 113
# 14:     4    three  14 114
# 15:     5    three  15 115

Two related data.table issues:

melt.data.table should offer variable to match on the name, rather than the number

FR: expansion of melt functionality for handling names of output.


This is one of the (rare) instances where I believe good'ol base::reshape is cleaner. Its sep argument comes in handy here — both the names of the 'value' column and the levels of the 'variable' columns are generated in one go:

reshape(data = B,
        varying = names(B[ , -1]),
        sep = "_",
        direction = "long")
like image 74
Henrik Avatar answered Nov 16 '22 04:11

Henrik