
Using regex to parse financial statements

I'm writing a regex that splits each line of a financial statement into groups: a description followed by two amounts. Here is the regex that I've used: r"([\w+ \-? \w+]* [\w+ ]+ [\(?\w+ \)?]*) (\(?[\d,-]+\)?) (\(?[\d,-]+\)?)". Here are the sample lines that I'm parsing, the output I'd like, and what I actually get:

1) String: LOSS BEFORE INCOME TAXES (900,000) (900,000)
Desired output: [('LOSS BEFORE INCOME TAXES', '(900,000)', '(900,000)')]
Final result: correct 

2) String: INCOME TAXES (RECOVERED) (90,000) (90,000)
Desired output: [('INCOME TAXES (RECOVERED)', '(90,000)', '(90,000)')]
Final result: correct

3) String: RETAINED EARNINGS - BEGINNING OF YEAR 9,999,999 9,999,999
Desired output: [('RETAINED EARNINGS - BEGINNING OF YEAR', '9,999,999', '9,999,999')]
Final result: correct

4) String: EXPENSES
Desired output: ['EXPENSES']
Final result: correct

5) String: Subcontracts 8,058 2,655
Desired output: [('Subcontracts', '8,058', '2,655')]
Final result: ['Subcontracts 8', '', '058 2', '', '655', '']

6) String: Business taxes 116 -
Desired output: [('Business taxes', '116', '-')]
Final result: ['Business taxes 116 ', '', '']

7) String: 600,000 600,000
Desired output: [(600,000), (600,000)]
Final result: ['600', '', '000 600', '', '000', '']

8) String: Salaries, wages and benefits 400,000 400,000
Desired output: [('Salaries, wages and benefits', '400,000', '400,000')]
Final result: [(' wages and benefits', '400,000', '400,000')]

I'm not sure what I'm doing wrong, but cases 5, 6, 7 and 8 give the wrong result. How can I adjust the regex so that it handles all of the cases above? Thanks in advance!

Sam asked Feb 25 '26 17:02


1 Answer

I think this regex will do what you want:

^([A-Z][A-Za-z0-9 (),%;-]+?[^(\d\s])? ?(?:(\(?[\d,]+\)?|-)\s+(\(?[\d,]+\)?|-))?$

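It looks for a description that starts with an uppercase letter, continues with letters, digits, spaces and any of ( ) , % ; - and does not end with a (, a digit or whitespace. The description is matched lazily, so it stops at the first point where the rest of the line can be read as two amounts. Each amount is either a run of digits and commas, optionally wrapped in parentheses, or a lone -. Both the description and the pair of amounts are optional, so lines with no description or no numbers still match.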
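To make the pieces easier to see, the same pattern can be written with re.VERBOSE and one comment per part. This is only a sketch for reading the regex: the name STATEMENT_LINE is made up for illustration, and it is compiled without re.MULTILINE, so it is meant to be applied to one line at a time with match().

```python
import re

# Same pattern as above, spelled out with re.VERBOSE.
# In verbose mode whitespace outside a character class is ignored,
# so the optional literal space is written as "[ ]".
STATEMENT_LINE = re.compile(r"""
    ^
    (                           # group 1: description (optional)
        [A-Z]                   #   first character: an uppercase letter
        [A-Za-z0-9 (),%;-]+?    #   as few further characters as possible
        [^(\d\s]                #   last character: not "(", a digit or whitespace
    )?
    [ ]?                        # optional space between description and amounts
    (?:
        (\(?[\d,]+\)?|-)        # group 2: first amount, or a lone "-"
        \s+
        (\(?[\d,]+\)?|-)        # group 3: second amount, or a lone "-"
    )?
    $
""", re.VERBOSE)

# One line at a time; groups that did not take part in the match are None.
print(STATEMENT_LINE.match("Business taxes 116 -").groups())
# ('Business taxes', '116', '-')
```

Note that match().groups() reports a missing group as None, whereas findall() reports it as an empty string.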

In Python:

import re
data = """LOSS BEFORE INCOME TAXES (900,000) (900,000)
INCOME TAXES (RECOVERED) (90,000) (90,000)
RETAINED EARNINGS - BEGINNING OF YEAR 9,999,999 9,999,999
EXPENSES
Subcontracts 8,058 2,655
Business taxes 116 -
600,000 600,000
GROSS PROFIT (50%; 2016 - 50%) 500,000 500,000
Bad debts - 50
Salaries, wages and benefits 400,000 400,000"""
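# Raw string so the backslashes reach the regex engine unchanged;
# re.MULTILINE makes ^ and $ match at the start and end of every line.
regex = re.compile(r'^([A-Z][A-Za-z0-9 (),%;-]+?[^(\d\s])? ?(?:(\(?[\d,]+\)?|-)\s+(\(?[\d,]+\)?|-))?$', re.MULTILINE)
print(regex.findall(data))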

Output:

[('LOSS BEFORE INCOME TAXES', '(900,000)', '(900,000)'),
 ('INCOME TAXES (RECOVERED)', '(90,000)', '(90,000)'),
 ('RETAINED EARNINGS - BEGINNING OF YEAR', '9,999,999', '9,999,999'),
 ('EXPENSES', '', ''),
 ('Subcontracts', '8,058', '2,655'),
 ('Business taxes', '116', '-'),
 ('', '600,000', '600,000'),
 ('GROSS PROFIT (50%; 2016 - 50%)', '500,000', '500,000'),
 ('Bad debts', '-', '50'),
 ('Salaries, wages and benefits', '400,000', '400,000')
]
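If the captured amounts are needed as numbers rather than strings, a small helper can convert them. This is only a sketch: parse_amount is a hypothetical name, and it assumes that parentheses mark a negative value (the usual accounting convention) and that a lone - or a missing amount should become None.

```python
def parse_amount(text):
    """Convert a captured amount such as '(900,000)' or '8,058' to an int.

    Assumptions (not stated in the question): parentheses mean negative,
    and '' or '-' means there is no amount, returned as None.
    """
    if text in ("", "-"):
        return None
    negative = text.startswith("(") and text.endswith(")")
    value = int(text.strip("()").replace(",", ""))
    return -value if negative else value


# A few rows copied from the output above.
rows = [
    ("LOSS BEFORE INCOME TAXES", "(900,000)", "(900,000)"),
    ("Business taxes", "116", "-"),
    ("EXPENSES", "", ""),
]
parsed = [(label, parse_amount(a), parse_amount(b)) for label, a, b in rows]
print(parsed)
# [('LOSS BEFORE INCOME TAXES', -900000, -900000), ('Business taxes', 116, None), ('EXPENSES', None, None)]
```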


Nick answered Feb 27 '26 06:02


