Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read values from numbers written as words?

People also ask

How do you write 1000000 in words?

Therefore, the number 1000000 in words is One Million.


I was playing around with a PEG parser to do what you wanted (and may post that as a separate answer later) when I noticed that there's a very simple algorithm that does a remarkably good job with common forms of numbers in English, Spanish, and German, at the very least.

Working with English for example, you need a dictionary that maps words to values in the obvious way:

"one" -> 1, "two" -> 2, ... "twenty" -> 20,
"dozen" -> 12, "score" -> 20, ...
"hundred" -> 100, "thousand" -> 1000, "million" -> 1000000

...and so forth

The algorithm is just:

total = 0
prior = null
for each word w
    v <- value(w) or next if no value defined
    prior <- case
        when prior is null:       v
        when prior > v:     prior+v
        else                prior*v
        else
    if w in {thousand,million,billion,trillion...}
        total <- total + prior
        prior <- null
total = total + prior unless prior is null

For example, this progresses as follows:

total    prior      v     unconsumed string
    0      _              four score and seven 
                    4     score and seven 
    0      4              
                   20     and seven 
    0     80      
                    _     seven 
    0     80      
                    7 
    0     87      
   87

total    prior      v     unconsumed string
    0        _            two million four hundred twelve thousand eight hundred seven
                    2     million four hundred twelve thousand eight hundred seven
    0        2
                  1000000 four hundred twelve thousand eight hundred seven
2000000      _
                    4     hundred twelve thousand eight hundred seven
2000000      4
                    100   twelve thousand eight hundred seven
2000000    400
                    12    thousand eight hundred seven
2000000    412
                    1000  eight hundred seven
2000000  412000
                    1000  eight hundred seven
2412000     _
                      8   hundred seven
2412000     8
                     100  seven
2412000   800
                     7
2412000   807
2412807

And so on. I'm not saying it's perfect, but for a quick and dirty it does quite well.


Addressing your specific list on edit:

  1. cardinal/nominal or ordinal: "one" and "first" -- just put them in the dictionary
  2. english/british: "fourty"/"forty" -- ditto
  3. hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred" -- works as is
  4. separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot -- just define "next word" to be the longest prefix that matches a defined word, or up to the next non-word if none do, for a start
  5. colloqialisms: "thirty-something" -- works
  6. fragments: 'one third', 'two fifths' -- uh, not yet...
  7. common names: 'a dozen', 'half' -- works; you can even do things like "a half dozen"

Number 6 is the only one I don't have a ready answer for, and that's because of the ambiguity between ordinals and fractions (in English at least) added to the fact that my last cup of coffee was many hours ago.


It's not an easy issue, and I know of no library to do it. I might sit down and try to write something like this sometime. I'd do it in either Prolog, Java or Haskell, though. As far as I can see, there are several issues:

  • Tokenization: sometimes, numbers are written eleven hundred fifty two, but I've seen elevenhundred fiftytwo or eleven-hundred-fifty-two and whatnot. One would have to conduct a survey on what forms are actually in use. This might be especially tricky for Hebrew.
  • Spelling mistakes: that's not so hard. You have a limited amount of words, and a bit of Levenshtein-distance magic should do the trick.
  • Alternate forms, like you already mentioned, exist. This includes ordinal/cardinal numbers, as well as forty/fourty and...
  • ... common names or commonly used phrases and NEs (named entities). Would you want to extract 30 from the Thirty Years War or 2 from World War II?
  • Roman numerals, too?
  • Colloqialisms, such as "thirty-something" and "three Euro and shrapnel", which I wouldn't know how to treat.

If you are interested in this, I could give it a shot this weekend. My idea is probably using UIMA and tokenizing with it, then going on to further tokenize/disambiguate and finally translate. There might be more issues, let's see if I can come up with some more interesting things.

Sorry, this is not a real answer yet, just an extension to your question. I'll let you know if I find/write something.

By the way, if you are interested in the semantics of numerals, I just found an interesting paper by Friederike Moltmann, discussing some issues regarding the logic interpretation of numerals.


I have some code I wrote a while ago: text2num. This does some of what you want, except it does not handle ordinal numbers. I haven't actually used this code for anything, so it's largely untested!


Use the Python pattern-en library:

>>> from pattern.en import number
>>> number('two thousand fifty and a half') => 2050.5

You should keep in mind that Europe and America count differently.

European standard:

One Thousand
One Million
One Thousand Millions (British also use Milliard)
One Billion
One Thousand Billions
One Trillion
One Thousand Trillions

Here is a small reference on it.


A simple way to see the difference is the following:

(American counting Trillion) == (European counting Billion)

Ordinal numbers are not applicable because they cant be joined in meaningful ways with other numbers in language (...at least in English)

e.g. one hundred and first, eleven second, etc...

However, there is another English/American caveat with the word 'and'

i.e.

one hundred and one (English) one hundred one (American)

Also, the use of 'a' to mean one in English

a thousand = one thousand

...On a side note Google's calculator does an amazing job of this.

one hundred and three thousand times the speed of light

And even...

two thousand and one hundred plus a dozen

...wtf?!? a score plus a dozen in roman numerals