Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Index unique values in data.table

Tags:

r

data.table

Not sure how to formulate the question in words, but how can I create an index-column for a data.table that per group increments when a different value appear?

Here is the MWE

library(data.table)
in.data <- data.table(fruits=c(rep("banana", 4), rep("pear", 5)),vendor=c("a", "b", "b", "c", "d", "d", "e", "f", "f"))

Here is the result the R-code should generate

in.data[, wanted.column:=c(1,2,2,3,1,1,2,3,3)]

#    fruits vendor wanted.column
# 1: banana      a             1
# 2: banana      b             2
# 3: banana      b             2
# 4: banana      c             3
# 5:   pear      d             1
# 6:   pear      d             1
# 7:   pear      e             2
# 8:   pear      f             3
# 9:   pear      f             3

So it labels each vendor 1, 2, 3, ... within each fruit. There is probably a very simple solution, but I'm stuck.

like image 581
Chris Avatar asked Feb 12 '16 20:02

Chris


People also ask

Can a table have 2 unique indexes?

If two entities are connected via a many to many relationship, you create a link table, storing their foreign keys, so there is no table with 2 unique indexes. One instance can be used only once, so you must have 2 uniqie indices in a linked table (by each entity separately) to prevent use of instance already used.

Can indexes be unique?

Unique indexes are indexes that help maintain data integrity by ensuring that no rows of data in a table have identical key values. When you create a unique index for an existing table with data, values in the columns or expressions that comprise the index key are checked for uniqueness.

How do I add a unique index to an existing table?

Right-click the table on which you want to create a unique index and select Design. On the Table Designer menu, select Indexes/Keys. In the Indexes/Keys dialog box, click Add. Select the new index in the Selected Primary/Unique Key or Index text box.

Can Unique Key be indexed?

The unique piece is not where the difference lies. The index and key are not the same thing, and are not comparable. A key is a data column, or several columns, that are forced to be unique with a constraint, either primary key or explicitly defined unique constraint.


3 Answers

I have a few ideas. You can use a nested group counter:

in.data[, w := setDT(list(v = vendor))[, g := .GRP, by=v]$g, by=fruits]

Alternately, make a run ID, which depends on sorted data (thanks @eddi) and seems wasteful:

in.data[, w := rleid(vendor), by=fruits]

The base-R approach would probably be:

in.data[, w := match(vendor, unique(vendor)), by=fruits]

# or in base R ...

in.data$w = with(in.data, ave(vendor, fruits, FUN = function(x) match(x, unique(x))))
like image 85
Frank Avatar answered Oct 13 '22 19:10

Frank


Another approach might be two steps :

DT = data.table(fruits=c(rep("banana", 4), rep("pear", 5)),vendor=c("a", "b", "b", "c", "d", "d", "e", "f", "f"))
DT
   fruits vendor
1: banana      a
2: banana      b
3: banana      b
4: banana      c
5:   pear      d
6:   pear      d
7:   pear      e
8:   pear      f
9:   pear      f
DT[, wanted:=.GRP, by="fruits,vendor"]  # step 1
DT
   fruits vendor wanted
1: banana      a      1
2: banana      b      2
3: banana      b      2
4: banana      c      3
5:   pear      d      4
6:   pear      d      4
7:   pear      e      5
8:   pear      f      6
9:   pear      f      6
DT[, wanted:=wanted-wanted[1]+1L, by="fruits"]  # step 2 (adjust)
DT
   fruits vendor wanted
1: banana      a      1
2: banana      b      2
3: banana      b      2
4: banana      c      3
5:   pear      d      1
6:   pear      d      1
7:   pear      e      2
8:   pear      f      3
9:   pear      f      3
> 

The way I would comment this in production code might be :

DT[, wanted:=.GRP, by="fruits,vendor"]          # .GRP is simple group counter
DT[, wanted:=wanted-wanted[1]+1L, by="fruits"]  # reset vendor counter per fruit
like image 26
Matt Dowle Avatar answered Oct 13 '22 18:10

Matt Dowle


If you want the index to be the same for all vendors within a given fruit, then this is another option:

in.data[, wanted := as.integer(factor(vendor, levels = unique(vendor))), by = fruits]

Otherwise, if you want it to tick up every time the vendor changes, then, from the given answers so far, rleid is the only one that works.

like image 37
eddi Avatar answered Oct 13 '22 20:10

eddi