
Parsing amount strings into numbers

Tags: c#, .net, parsing, ocr

I am working on a system that recognizes paper documents using OCR engines. The documents are invoices containing amounts such as total, VAT and net amounts. I need to parse these amount strings into numbers, but they come in many formats, with different symbols for the decimal and thousands separators on each invoice. The normal double.TryParse and double.Parse methods in .NET fail for some of these amounts.

These are some examples of the amounts I receive, with the value each should parse to:

"3.533,65" =>  3533.65 
"-133.696" => -133696
"-33.017" => -33017
"-166.713" => -166713
"-5088,8" => -5088.8 
"0.423" => 0.423
"9,215,200" => 9215200
"1,443,840.00" => 1443840

I need some way to guess which symbol is the decimal separator and which is the thousands separator in each number, and then present the parsed value to the user so they can decide whether it is correct.

I am wondering how to solve this problem in an elegant way.

gyurisc asked Feb 26 '26 00:02


1 Answer

I'm not sure you'll find an elegant way of figuring this out, because the input will always be ambiguous if you can't tell the parser where the data comes from.

For example, the numbers 1.234 and 1,234 are both valid numbers, but without establishing what the symbols mean you won't be able to tell which is which.
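To make the ambiguity concrete, here is a minimal sketch that parses the same two strings under two separator conventions. The class name SeparatorDemo and the two NumberFormatInfo fields are made up for illustration; only double.Parse, NumberStyles and NumberFormatInfo come from .NET.

```csharp
using System.Globalization;

public static class SeparatorDemo
{
    // Hypothetical convention A: '.' is the decimal separator, ',' groups thousands.
    public static readonly NumberFormatInfo DotDecimal = new NumberFormatInfo
    {
        NumberDecimalSeparator = ".",
        NumberGroupSeparator = ","
    };

    // Hypothetical convention B: ',' is the decimal separator, '.' groups thousands.
    public static readonly NumberFormatInfo CommaDecimal = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    // Parses text under the given convention, allowing thousands separators.
    public static double Parse(string text, NumberFormatInfo format)
    {
        return double.Parse(text, NumberStyles.Float | NumberStyles.AllowThousands, format);
    }
}

// SeparatorDemo.Parse("1.234", SeparatorDemo.DotDecimal)   gives 1.234
// SeparatorDemo.Parse("1.234", SeparatorDemo.CommaDecimal) gives 1234
// SeparatorDemo.Parse("1,234", SeparatorDemo.DotDecimal)   gives 1234
// SeparatorDemo.Parse("1,234", SeparatorDemo.CommaDecimal) gives 1.234
```

Both conventions accept both strings without error, so a successful parse on its own says nothing about which reading is the intended one.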

Personally, I would write a function which attempted to do a "best guess" based on some rules...

  • If the number contains a , before a ., then , is the thousands separator and . is the decimal separator.
  • If the number contains a . before a ,, then . is the thousands separator and , is the decimal separator.
  • If there is more than one , symbol, the thousands separator must be ,.
  • If there is more than one . symbol, the thousands separator must be ..
  • If there is only one , symbol, count the digits that follow it. If there are not exactly 3, it must be the decimal separator (the same rule applies to .).
  • If exactly 3 digits follow it (e.g. 1,234 and 1.234), the number is ambiguous. Perhaps you could set it aside, parse the other numbers on the same page to work out which separators they use, and then come back to it.
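The rules above could be sketched roughly as follows. This is only one possible reading of them: the names DecimalGuess and SeparatorGuesser are invented for this example, and the code assumes the string has already been trimmed so that it contains nothing but an optional sign, digits, commas and dots.

```csharp
using System.Linq;

// Hypothetical result type: which symbol (if any) is the decimal separator.
public enum DecimalGuess { NoDecimalPart, Comma, Dot, Ambiguous }

public static class SeparatorGuesser
{
    public static DecimalGuess Guess(string amount)
    {
        int commas = amount.Count(c => c == ',');
        int dots = amount.Count(c => c == '.');

        // Both symbols present: the one that appears last is the decimal separator.
        if (commas > 0 && dots > 0)
        {
            return amount.LastIndexOf(',') > amount.LastIndexOf('.')
                ? DecimalGuess.Comma
                : DecimalGuess.Dot;
        }

        // Neither symbol present: a plain integer.
        if (commas == 0 && dots == 0)
        {
            return DecimalGuess.NoDecimalPart;
        }

        // Exactly one kind of symbol is present from here on.
        char symbol = commas > 0 ? ',' : '.';

        // A repeated symbol can only be grouping thousands.
        if (commas + dots > 1)
        {
            return DecimalGuess.NoDecimalPart;
        }

        // A single symbol followed by exactly 3 characters could be either kind.
        int digitsAfter = amount.Length - amount.IndexOf(symbol) - 1;
        if (digitsAfter == 3)
        {
            return DecimalGuess.Ambiguous;
        }

        return symbol == ',' ? DecimalGuess.Comma : DecimalGuess.Dot;
    }
}
```

With the amounts from the question, "3.533,65" and "-5088,8" come back as Comma, "1,443,840.00" as Dot and "9,215,200" as NoDecimalPart. Both "-133.696" and "0.423" come back as Ambiguous, which is exactly the case the last rule says has to be resolved from other numbers on the page or by the user.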

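Once you've figured out the decimal separator, remove any thousands separators (they are not needed for parsing the number) and make sure the decimal separator in the string you are parsing is a dot. Then you can pass the string to Double.TryParse, using CultureInfo.InvariantCulture so that the separators of the current culture don't interfere.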
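A minimal sketch of that normalization step, assuming the decimal separator has already been decided and is passed in by the caller. AmountNormalizer and TryParseAmount are hypothetical names; a null separator here means the number has no decimal part, so every comma and dot is treated as a thousands separator.

```csharp
using System.Globalization;

public static class AmountNormalizer
{
    // decimalSeparator: ',' or '.', or null when the amount has no decimal part.
    public static bool TryParseAmount(string amount, char? decimalSeparator, out double value)
    {
        string cleaned;
        if (decimalSeparator == ',')
        {
            // Dots group thousands: drop them, then turn the decimal comma into a dot.
            cleaned = amount.Replace(".", "").Replace(',', '.');
        }
        else if (decimalSeparator == '.')
        {
            // Commas group thousands: drop them, the dot is already in place.
            cleaned = amount.Replace(",", "");
        }
        else
        {
            // No decimal part: both symbols can only be grouping thousands.
            cleaned = amount.Replace(",", "").Replace(".", "");
        }

        // The invariant culture always uses '.' as the decimal separator.
        return double.TryParse(cleaned, NumberStyles.Float, CultureInfo.InvariantCulture, out value);
    }
}
```

For example, TryParseAmount("3.533,65", ',', out var v) returns true with v equal to 3533.65, and TryParseAmount("9,215,200", null, out var w) returns true with w equal to 9215200.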

Richard answered Feb 27 '26 14:02


