I need to translate the values in a vector according to a mapping of key-value pairs:
vector <- c("dog","ant","eagle","ant","eagle","parrot")
"dog" "ant" "eagle" "ant" "eagle" "parrot"
mapping <- data.frame(key=c("dog","cat","elephant","ant","parrot","eagle"),value=c("mammal","mammal","mammal","insect","bird","bird"))
key value
dog mammal
cat mammal
elephant mammal
ant insect
parrot bird
eagle bird
The desired output would be like this:
output <- c("mammal", "insect", "bird", "insect", "bird", "bird")
In the real dataset I have to translate ~10000 input vectors with an average length of ~15. The mapping data frame has on the order of a million keys, and its values fall into about 100000 unique classes.
The problem itself appears rather basic to me, but the bottleneck is runtime. In other programming languages you would probably use a HashMap for the mapping and then loop through the vector. Any solution in R I could come up with so far is orders of magnitude slower than a simple HashMap-based one in Java or Python (see comments below).
Is there a more efficient data structure to store the mapping than a data frame?
What would be the most runtime-efficient solution to this problem in R?
There is a package called hashmap which is perfect for this:
library(hashmap)
hash_lookup = hashmap(mapping$key, mapping$value)
output = hash_lookup[[vector]]
Result:
> hash_lookup
## (character) => (character)
## [cat] => [mammal]
## [elephant] => [mammal]
## [ant] => [insect]
## [dog] => [mammal]
## [eagle] => [bird]
## [parrot] => [bird]
> output
[1] "mammal" "insect" "bird" "insect" "bird" "bird"
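For comparison, or if the hashmap package is not available in your setup, the same lookup can be sketched in base R. This is only a minimal sketch and not something I have benchmarked against hashmap: `match()` does the lookup in one vectorised step, and an environment created with `hash = TRUE` behaves like a hash table. The names `out_match`, `lookup_env` and `out_env` are made up for this illustration.

```r
vector <- c("dog", "ant", "eagle", "ant", "eagle", "parrot")
mapping <- data.frame(
  key   = c("dog", "cat", "elephant", "ant", "parrot", "eagle"),
  value = c("mammal", "mammal", "mammal", "insect", "bird", "bird"),
  stringsAsFactors = FALSE
)

# Option 1: match() returns, for each element of `vector`, its position in
# mapping$key (NA when the key is absent). Indexing mapping$value with those
# positions translates the whole vector at once.
out_match <- mapping$value[match(vector, mapping$key)]

# Option 2: an environment used as a hash table, filled once from the mapping
# and queried with mget(). Note that mget() raises an error for a key that is
# not in the environment unless `ifnotfound` is supplied.
lookup_env <- list2env(setNames(as.list(mapping$value), mapping$key), hash = TRUE)
out_env <- unlist(mget(vector, envir = lookup_env), use.names = FALSE)

print(out_match)
print(out_env)
```

Both variants should give the same character vector as `output` above. With `match()`, keys missing from the mapping come back as `NA` instead of raising an error, which may or may not be what you want.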
Data:
vector <- c("dog","ant","eagle","ant","eagle","parrot")
mapping <- data.frame(key=c("dog","cat","elephant","ant","parrot","eagle"),
value=c("mammal","mammal","mammal","insect","bird","bird"),
stringsAsFactors = FALSE)
Note:
I have not tested this on a bigger dataset yet, but this method should be very fast since the package is implemented with Rcpp internally.
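To check the runtime at the scale described in the question, something like the following could be used. It is a rough sketch on synthetic data: the key and class names, the Poisson-distributed vector lengths and the variable names are all assumptions made for the test, and it times a base R `match()` lookup rather than hashmap, so the numbers only give a baseline. The idea is to flatten all input vectors, translate them in a single lookup, and then split the result back into the original pieces.

```r
set.seed(1)
n_keys    <- 1e6   # number of keys in the mapping
n_classes <- 1e5   # number of distinct values
n_vectors <- 1e4   # number of input vectors
avg_len   <- 15    # average length of an input vector

# Synthetic mapping: unique keys, values drawn at random from the classes
mapping <- data.frame(
  key   = sprintf("key%07d", seq_len(n_keys)),
  value = sprintf("class%06d", sample.int(n_classes, n_keys, replace = TRUE)),
  stringsAsFactors = FALSE
)

# Synthetic input vectors whose lengths vary around avg_len (at least 1)
lens    <- pmax(1L, rpois(n_vectors, avg_len))
vectors <- lapply(lens, function(n) sample(mapping$key, n, replace = TRUE))

elapsed <- system.time({
  # Flatten, translate everything with one match() call, then regroup
  flat       <- unlist(vectors, use.names = FALSE)
  translated <- mapping$value[match(flat, mapping$key)]
  groups     <- factor(rep(seq_along(vectors), lengths(vectors)),
                       levels = seq_along(vectors))
  result     <- unname(split(translated, groups))
})
print(elapsed)
```

Here `result` is a list with one translated character vector per input vector, in the original order. Doing a single lookup over the flattened input avoids repeating the per-call overhead 10000 times; whether this is fast enough for the real data would have to be measured.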