I'm working on a regex that captures text matching a particular pattern into groups. Here is the regex I've used: r"([\w+ \-? \w+]* [\w+ ]+ [\(?\w+ \)?]*) (\(?[\d,-]+\)?) (\(?[\d,-]+\)?)". Here are the sample lines I'm parsing and what I'd like the output to be:
1) String: LOSS BEFORE INCOME TAXES (900,000) (900,000)
Desired output: [('LOSS BEFORE INCOME TAXES', '(900,000)', '(900,000)')]
Final result: correct
2) String: INCOME TAXES (RECOVERED) (90,000) (90,000)
Desired output: [('INCOME TAXES (RECOVERED)', '(90,000)', '(90,000)')]
Final result: correct
3) String: RETAINED EARNINGS - BEGINNING OF YEAR 9,999,999 9,999,999
Desired output: [('RETAINED EARNINGS - BEGINNING OF YEAR', '9,999,999', '9,999,999')]
Final result: correct
4) String: EXPENSES
Desired output: ['EXPENSES']
Final result: correct
5) String: Subcontracts 8,058 2,655
Desired output: [('Subcontracts', '8,000,000')]
Final result: ['Subcontracts 8', '', '058 2', '', '655', '']
6) String: Business taxes 116 -
Desired output: [('Business taxes', '116', '-')]
Final result: ['Business taxes 116 ', '', '']
7) String: 600,000 600,000
Desired output: [(600,000), (600,000)]
Final result: ['642', '', '437 629', '', '070', '']
8) String: Salaries, wages and benefits 400,000 400,000
Desired output: [('Salaries, wages and benefits', '400,000', '400,000')]
Final result: [(' wages and benefits', '463,437', '466,742')]
I'm not sure what I'm doing wrong or what I'm missing, but 5, 6, 7 & 8 have problems with them. How can I adjust the above query such that it accounts for all the mentioned cases? Thanks in advance!
I think this regex will do what you want:
^([A-Z][A-Za-z0-9 (),%;-]+?[^(\d\s])? ?(?:(\(?[\d,]+\)?|-)\s+(\(?[\d,]+\)?|-))?$
It looks for a description that starts with an uppercase letter, continues with letters, digits, spaces or any of the characters ( ) , % ; - and does not end with a (, a digit or whitespace. The description is followed by two amounts, each of which is either a run of digits and commas optionally wrapped in parentheses, or a lone -. Both the description and the pair of amounts are optional, so lines with no description or no numbers still match.
In Python:
import re
data = """LOSS BEFORE INCOME TAXES (900,000) (900,000)
INCOME TAXES (RECOVERED) (90,000) (90,000)
RETAINED EARNINGS - BEGINNING OF YEAR 9,999,999 9,999,999
EXPENSES
Subcontracts 8,058 2,655
Business taxes 116 -
600,000 600,000
GROSS PROFIT (50%; 2016 - 50%) 500,000 500,000
Bad debts - 50
Salaries, wages and benefits 400,000 400,000"""
# Raw string so that \d, \s and \( reach the regex engine unchanged
regex = re.compile(r'^([A-Z][A-Za-z0-9 (),%;-]+?[^(\d\s])? ?(?:(\(?[\d,]+\)?|-)\s+(\(?[\d,]+\)?|-))?$', re.MULTILINE)
print(regex.findall(data))
Output:
[('LOSS BEFORE INCOME TAXES', '(900,000)', '(900,000)'),
('INCOME TAXES (RECOVERED)', '(90,000)', '(90,000)'),
('RETAINED EARNINGS - BEGINNING OF YEAR', '9,999,999', '9,999,999'),
('EXPENSES', '', ''),
('Subcontracts', '8,058', '2,655'),
('Business taxes', '116', '-'),
('', '600,000', '600,000'),
('GROSS PROFIT (50%; 2016 - 50%)', '500,000', '500,000'),
('Bad debts', '-', '50'),
('Salaries, wages and benefits', '400,000', '400,000')
]
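To make the pieces described above easier to follow, the same pattern can be written with re.VERBOSE and one comment per part. This is only a sketch: the names PATTERN and parse_line are mine, not part of the original answer, and it matches one line at a time instead of using re.MULTILINE with findall.

```python
import re

# Same pattern as above, spread out with re.VERBOSE.
# PATTERN and parse_line are illustrative names (assumptions).
PATTERN = re.compile(r"""
    ^
    (                           # group 1: optional description
        [A-Z]                   #   starts with an uppercase letter
        [A-Za-z0-9 (),%;-]+?    #   letters, digits, space, ( ) , % ; - (lazy)
        [^(\d\s]                #   last char is not '(', a digit or whitespace
    )?
    [ ]?                        # optional separating space
    (?:                         # optional pair of amounts
        (\(?[\d,]+\)?|-)        #   group 2: digits/commas, maybe in parens, or '-'
        \s+
        (\(?[\d,]+\)?|-)        #   group 3: same shape as group 2
    )?
    $
""", re.VERBOSE)


def parse_line(line):
    """Return (description, amount1, amount2), or None if the line does not match.

    Groups that did not take part in the match come back as ''.
    """
    m = PATTERN.match(line)
    return m.groups(default="") if m else None


print(parse_line("Business taxes 116 -"))
print(parse_line("600,000 600,000"))
print(parse_line("EXPENSES"))
```

Note that the literal optional space from the one-line pattern is written as [ ]? here, because re.VERBOSE ignores whitespace outside character classes.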