I'm using the dcast function from the reshape2 package to cast a simple table of three columns

df = data.table(id = 1:1e6,
                var = c('continent', 'subcontinent', ...),
                val = c('America', 'Caribbean', ...))

by calling

dcast(df, id ~ var, value.var = 'val')
and it automatically converts the values to counts, i.e.
id continent subcontinent
 1         1            1
 2         1            1
However, if I reduce the size to 10000 rows, it correctly outputs
id continent subcontinent
 1   America    Caribbean
 2    Europe  West Europe
Is this a bug, or do I need to change the code somehow? Please help. Thanks!
The problem is not the size of the dataset itself but the occurrence of duplicate entries in the full dataset. When picking a smaller subset of the full dataset, there is a chance that no duplicates are included.
help("dcast", "data.table")
says:
When variable combinations in
formula
doesn't identify a unique value in a cell,fun.aggregate
will have to be specified, which defaults tolength
if unspecified.
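With the sample df from the Data section at the end (which contains one duplicate id/var combination), this default behaviour is easy to reproduce; the lines below are a minimal sketch of the expected result:

# dcast() cannot place two values into the (1, subcontinent) cell,
# so it silently falls back to counting them:
dcast(df, id ~ var, value.var = "val")
Aggregate function missing, defaulting to 'length'
   id continent subcontinent
1:  1         1            2
2:  2         1            1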
All occurrences of duplicates can be identified by
cols <- c("id", "var")
df[duplicated(df, by = cols) | duplicated(df, by = cols, fromLast = TRUE)][order(id)]
   id          var           val
1:  1 subcontinent     Caribbean
2:  1 subcontinent South America
Note that we are looking for duplicates in id and var, as these two form the cells, i.e., the rows and columns, of the reshaped result.
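A quick way to see which cells are affected is to count the rows per id/var combination; the sketch below uses standard data.table grouping, where any N greater than 1 marks a cell that dcast() cannot fill with a single value:

# Count the rows per (id, var) cell; N > 1 means the cell is ambiguous.
df[, .N, by = cols][N > 1L]
   id          var N
1:  1 subcontinent 2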
Why unique() doesn't work

NB: This explains why simply taking unique(df) will not work:
unique(df)
   id          var           val
1:  1    continent       America
2:  1 subcontinent     Caribbean
3:  2    continent        Europe
4:  2 subcontinent   West Europe
5:  1 subcontinent South America
does not remove any rows. Consequently,
dcast(unique(df), id ~ var, value.var = "val")
Aggregate function missing, defaulting to 'length'
   id continent subcontinent
1:  1         1            2
2:  2         1            1
Whereas
unique(df, by = cols)
   id          var         val
1:  1    continent     America
2:  1 subcontinent   Caribbean
3:  2    continent      Europe
4:  2 subcontinent West Europe
has removed the duplicate var for id == 1L. Consequently,
dcast(unique(df, by = cols), id ~ var, value.var = "val")
   id continent subcontinent
1:  1   America    Caribbean
2:  2    Europe  West Europe
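Note that unique(df, by = cols) keeps the first row of each id/var group, so which of the duplicate values survives depends on the row order. If that choice matters, the data can be sorted first; the lines below are only a sketch assuming the alphabetically first val should win:

# Sorting by val before unique() makes the surviving duplicate explicit.
dcast(unique(df[order(id, var, val)], by = cols), id ~ var, value.var = "val")
   id continent subcontinent
1:  1   America    Caribbean
2:  2    Europe  West Europe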
The OP has reported that the issue appears only with the full dataset but not with a subset of the first 1e5 rows.
The row indices of the duplicate entries can be found by
which(duplicated(df, by = cols))
which returns 5 for the sample dataset. For the OP's full dataset, I suspect that
min(which(duplicated(df, by = cols))) > 1e5
is true, i.e., there are no duplicates within the first 1e5 rows.
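This hypothesis can be checked directly on the full dataset, e.g., with anyDuplicated(), which returns 0 when the given subset contains no duplicate id/var pair (a sketch; df stands here for the OP's full 1e6-row table):

# 0 means the first 1e5 rows contain no duplicate (id, var) pair,
# which would explain why the subset casts cleanly.
anyDuplicated(head(df, 1e5), by = cols)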
The OP's own approach of using fun.aggregate = function(x) paste(x[1L]), as well as applying unique() on df, just aims at removing the disturbing duplicates. Either way, the duplicates are silently dropped.
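For completeness, the first-value approach mentioned by the OP would look like the sketch below; it produces the desired wide format but silently keeps only the first value per cell, so "South America" disappears without notice:

# Keeps only the first value of each cell; the duplicate is dropped silently.
dcast(df, id ~ var, fun.aggregate = function(x) paste(x[1L]), value.var = "val")
   id continent subcontinent
1:  1   America    Caribbean
2:  2    Europe  West Europe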
Alternatively, toString() can be used as the aggregation function, which makes the duplicate entries visible:
dcast(df, id ~ var, toString, value.var = "val")
   id continent             subcontinent
1:  1   America Caribbean, South America
2:  2    Europe              West Europe
Data

library(data.table)
df <- data.table(
id = c(1L, 1L, 2L, 2L, 1L),
var = c("continent", "subcontinent", "continent", "subcontinent", "subcontinent"),
val = c("America", "Caribbean", "Europe", "West Europe", "South America")
)
df
   id          var           val
1:  1    continent       America
2:  1 subcontinent     Caribbean
3:  2    continent        Europe
4:  2 subcontinent   West Europe
5:  1 subcontinent South America