My challenge is to convert number words such as "ten" and "one" in an input sentence to the digits 10 and 1:
example_input <- "I have ten apple and one orange"
The numbers may change depending on user requirements, and the input sentence can be tokenized. The desired output is:
my_output_toget <- "I have 10 apple and 1 orange"
We can pass a key/value pair as the replacement in gsubfn to replace those words with numbers:
library(english)
library(gsubfn)
gsubfn("\\w+", setNames(as.list(1:10), as.english(1:10)), example_input)
#[1] "I have 10 apple and 1 orange"
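For readers who want to see what the key/value lookup amounts to, here is a minimal base-R sketch of the same idea. It does not use gsubfn or english: the lookup table `number_words` is written by hand and `replace_number_words()` is a hypothetical helper name, so treat it as an illustration under those assumptions rather than as what gsubfn does internally.

```r
# Hand-written lookup table (an assumption for this sketch):
# number words as names, their numeric values as elements.
number_words <- c(one = 1, two = 2, three = 3, four = 4, five = 5,
                  six = 6, seven = 7, eight = 8, nine = 9, ten = 10)

# Hypothetical helper, not part of gsubfn or english.
replace_number_words <- function(x, lookup = number_words) {
  # Locate every run of word characters in each input string.
  m <- gregexpr("\\w+", x)
  # Swap each token that appears in the lookup for its digits;
  # all other tokens are written back unchanged.
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    hit <- tolower(tokens) %in% names(lookup)
    tokens[hit] <- as.character(lookup[tolower(tokens[hit])])
    tokens
  })
  x
}

example_input <- "I have ten apple and one orange"
replace_number_words(example_input)
```

Because whole tokens are compared against the table, a word such as "someone" is left alone even though it contains "one".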
textclean is quite a handy option for this task:
library(textclean)
mgsub(example_input, replace_number(seq_len(10)), seq_len(10))
[1] "I have 10 apple and 1 orange"
You just need to adjust the seq_len() argument according to the maximum number in your data.
Some examples:
example_input <- c("I have one hundred apple and one orange")
mgsub(example_input, replace_number(seq_len(100)), seq_len(100))
[1] "I have 100 apple and 1 orange"
A misspelled number word is not matched. Here "tousand" is left as-is and only "one" is converted:
example_input <- c("I have one tousand apple and one orange")
mgsub(example_input, replace_number(seq_len(1000)), seq_len(1000))
[1] "I have 1 tousand apple and 1 orange"
If you don't know your maximum number beforehand, you can just choose a sufficiently big number.
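One thing to keep in mind with any lookup-based replacement is that multi-word phrases such as "one hundred" have to be handled before their parts such as "one". The base-R sketch below illustrates that ordering with a tiny hand-written lookup; `replace_longest_first()` is a hypothetical helper for illustration only, and it is not meant to describe how textclean works internally.

```r
# Tiny hand-written lookup (an assumption for this sketch); a real one
# would cover every number up to the chosen maximum.
lookup <- c("one" = "1", "one hundred" = "100", "ten" = "10")

# Hypothetical helper: replace the longest phrases first, whole words only.
replace_longest_first <- function(x, lookup) {
  # Sort phrases by length so "one hundred" is tried before "one".
  ordered <- lookup[order(nchar(names(lookup)), decreasing = TRUE)]
  for (phrase in names(ordered)) {
    # \\b keeps "one" from matching inside words like "someone".
    pattern <- paste0("\\b", phrase, "\\b")
    x <- gsub(pattern, ordered[[phrase]], x, perl = TRUE)
  }
  x
}

replace_longest_first("I have one hundred apple and one orange", lookup)
```

If the shorter phrase were replaced first, "one hundred" would already have become "1 hundred" by the time the longer pattern was tried.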
I wrote an R package to do this, https://github.com/fsingletonthorn/words_to_numbers, which should work for more use cases.
devtools::install_github("fsingletonthorn/words_to_numbers")
library(wordstonumbers)
example_input <- "I have ten apple and one orange"
words_to_numbers(example_input)
[1] "I have 10 apple and 1 orange"
It also works for much more complex cases, like:
words_to_numbers("The Library of Babel (by Jorge Luis Borges) describes a library that contains all possible four-hundred and ten page books made with a character set of twenty five characters (twenty two letters, as well as spaces, periods, and commas), with eighty lines per book and forty characters per line.")
#> [1] "The Library of Babel (by Jorge Luis Borges) describes a library that contains all possible 410 page books made with a character set of 25 characters (22 letters, as well as spaces, periods, and commas), with 80 lines per book and 40 characters per line."
Or
words_to_numbers("300 billion, 2 hundred and 79 cats")
#> [1] "300000000279 cats"