I am working on a system that recognizes paper documents using OCR engines. The documents are invoices containing amounts such as total, VAT and net amounts. I need to parse these amount strings into numbers, but they come in many formats and flavors, with different symbols for the decimal and thousands separators from one invoice to the next. The normal double.TryParse and double.Parse methods in .NET fail for some of these amounts.
These are some examples of the amounts I receive, with the value I expect for each:
"3.533,65" => 3533.65
"-133.696" => -133696
"-33.017" => -33017
"-166.713" => -166713
"-5088,8" => -5088.8
"0.423" => 0.423
"9,215,200" => 9215200
"1,443,840.00" => 1443840
I need some way to guess what the decimal separator and the thousands separator are in the number, and then present the value to the user so they can decide whether it is correct.
I am wondering how to solve this problem in an elegant way.
I'm not sure you'll be able to find an elegant way of figuring this out, because it's always going to be ambiguous if you can't tell where the data is from.
For example, the numbers 1.234 and 1,234 are both valid numbers, but without establishing what the symbols mean you won't be able to tell which is which.
Personally, I would write a function which attempted to do a "best guess" based on some rules...
- If "," comes BEFORE ".", then "," must be for thousands and "." must be for decimals.
- If "." comes BEFORE ",", then "." must be for thousands and "," must be for decimals.
- If there are several "," symbols, the thousands separator must be ",".
- If there are several "." symbols, the thousands separator must be ".".
- If there is only one ",", how many digits follow it? If it's NOT 3, then it must be the decimal separator (same rule for ".").

Once you've figured out the decimal separator, remove any thousands separators (not needed for parsing the number) and ensure the decimal separator is "." in the string you are parsing. Then you can pass this into Double.TryParse, using CultureInfo.InvariantCulture so that "." is read as the decimal point regardless of the machine's locale.
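
The rules above could be sketched in C# roughly as follows. This is only a minimal sketch, not a definitive implementation: the names AmountGuesser and TryGuess are made up for illustration. The handling of a single separator followed by exactly three digits is my own assumption, since the rules leave that case open: it is treated as a thousands separator unless the integer part is 0, and the result is flagged as ambiguous so the user can confirm it.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AmountGuesser   // hypothetical name, for illustration only
{
    // Returns true if a number could be parsed. 'ambiguous' is set when the
    // rules cannot decide for certain (one separator followed by 3 digits).
    public static bool TryGuess(string text, out double value, out bool ambiguous)
    {
        value = 0;
        ambiguous = false;
        if (string.IsNullOrWhiteSpace(text)) return false;

        string s = text.Trim();
        int lastComma = s.LastIndexOf(',');
        int lastDot = s.LastIndexOf('.');
        int commaCount = s.Length - s.Replace(",", "").Length;
        int dotCount = s.Length - s.Replace(".", "").Length;

        char? decimalSep = null;   // null means "no decimal separator found"

        if (commaCount > 0 && dotCount > 0)
        {
            // Both symbols present: the one that comes last is the decimal separator.
            decimalSep = lastComma > lastDot ? ',' : '.';
        }
        else if (commaCount + dotCount == 1)
        {
            // Exactly one separator: count the digits that follow it.
            int pos = Math.Max(lastComma, lastDot);
            int digitsAfter = s.Length - pos - 1;
            if (digitsAfter != 3)
            {
                decimalSep = s[pos];
            }
            else
            {
                // ASSUMPTION (not part of the rules above): treat it as a
                // thousands separator unless the integer part is 0 or empty.
                ambiguous = true;
                string head = s.Substring(0, pos).TrimStart('-', '+');
                if (head == "" || head == "0") decimalSep = s[pos];
            }
        }
        // Otherwise the same symbol repeats (thousands separators only) or
        // there is no separator at all, so decimalSep stays null.

        // Drop thousands separators and normalise the decimal separator to '.'.
        var normalized = new StringBuilder();
        foreach (char c in s)
        {
            if (c == decimalSep) normalized.Append('.');
            else if (c != ',' && c != '.') normalized.Append(c);
        }

        return double.TryParse(normalized.ToString(), NumberStyles.Float,
                               CultureInfo.InvariantCulture, out value);
    }
}
```

For the examples in the question, "3.533,65" and "1,443,840.00" are resolved by the "both symbols present" rule, "9,215,200" by the repeated-symbol rule, and "-5088,8" by the digit count. "-133.696" and "0.423" fall into the ambiguous case, so those are the values worth showing to the user for confirmation.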