I am trying to parse transaction letters from my (German) bank. I'd like to extract all the numbers from the following string which turns out to be harder than I thought. Option 2 does almost what I want. I now want to modify it to capture e.g. 80 as well.
My first try is option 1 which only returns garbage. Why is it returning so many empty strings? It should always have at least a number from the first \d+, no?
Option 3 works (or at least works as expected), so somehow I am answering my own question. I guess I'm mostly banging my head about why option 2 does not work.
# -*- coding: utf-8 -*-
import re
my_str = """
Dividendengutschrift für inländische Wertpapiere
Depotinhaber : ME
Extag : 18.04.2013 Bruttodividende
Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR
Valuta : 18.04.2013
Bruttodividende : 78,40 EUR
*Einbeh. Steuer : 20,67 EUR
Nettodividende : 78,40 EUR
Endbetrag : 57,73 EUR
"""
print re.findall(r'\d+(,\d+)?', my_str)
print re.findall(r'\d+,\d+', my_str)
print re.findall(r'[-+]?\d*,\d+|\d+', my_str)
Output is
['', '', '', '', '', '', ',9800', '', '', '', ',40', ',67', ',40', ',73']
['0,9800', '78,40', '20,67', '78,40', '57,73']
['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']
Option 1 is the most suitable of the three regexes, but it does not work as expected because findall returns what is matched by the capture group (), not the complete match. For example, the first three matches in your example are 18, 04 and 2013, and in each case the capture group is unmatched, so an empty string is added to the results list.
The solution is to make the group non-capturing:
r'\d+(?:,\d+)?'
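To see the difference side by side, here is a minimal sketch. The sample string is made up for illustration (it is not from the bank letter), and the single-argument print() form is used so the snippet should run under both Python 2 and Python 3.

```python
import re

# Hypothetical sample: a date, a decimal amount and a plain integer.
sample = "Extag : 18.04.2013 Dividende : 0,9800 EUR x 80"

# One capture group: findall returns only what the group matched,
# which is '' whenever the optional group did not take part in the match.
capturing = re.findall(r'\d+(,\d+)?', sample)

# Non-capturing group: findall returns the complete match.
non_capturing = re.findall(r'\d+(?:,\d+)?', sample)

print(capturing)      # ['', '', '', ',9800', '']
print(non_capturing)  # ['18', '04', '2013', '0,9800', '80']
```

Note that the date is still split into 18, 04 and 2013, because the pattern does not treat the dot as part of a number.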
Option 2 falls short only in that it won't match sequences that don't contain a comma, such as 80.
Option 3 isn't great because it will also match strings such as +,1.
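The +,1 case can be checked with a short sketch. The sample string is invented for the purpose, and the "stricter" pattern is only one possible variation (the non-capturing regex above with Option 3's optional sign added), not something taken from the question.

```python
import re

# Hypothetical input containing the problematic "+,1" fragment.
sample = "+,1 and 80 and -3,5"

# Option 3: \d* allows zero digits before the comma, so "+,1" matches.
option3 = re.findall(r'[-+]?\d*,\d+|\d+', sample)

# Assumed variation: require at least one digit before an optional ",digits" part.
stricter = re.findall(r'[-+]?\d+(?:,\d+)?', sample)

print(option3)   # ['+,1', '80', '-3,5']
print(stricter)  # ['1', '80', '-3,5']
```

With the stricter pattern the stray sign and comma are skipped and only the digit 1 is picked up.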
I'd like to extract all the numbers from the following string ...
If by "numbers" you mean both the currency amounts AND the dates, I think that this will do what you want:
print re.findall(r'[0-9][0-9,.]+', my_str)
Output:
['18.04.2013', '18.04.2013', '0,9800', '18.04.2013', '78,40', '20,67', '78,40', '57,73']
If by "numbers" you mean only the currency amounts, then use
print re.findall(r'[0-9]+,[0-9]+', my_str)
Or perhaps better yet,
print re.findall(r'[0-9]+,[0-9]+ EUR', my_str)
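If the " EUR" suffix is only wanted as an anchor and not in the result, a capture group around the amount should do it, since findall returns just the group when there is exactly one. The sketch below uses a shortened, made-up sample in the same layout. The conversion to Decimal is an extra step that assumes the comma is the decimal separator and that no thousands separators occur.

```python
import re
from decimal import Decimal

# Hypothetical sample lines in the same layout as the bank letter.
sample = ("Bruttodividende : 78,40 EUR\n"
          "Einbeh. Steuer : 20,67 EUR\n"
          "Valuta : 18.04.2013")

# " EUR" must follow the amount, but only the group is returned.
amounts = re.findall(r'([0-9]+,[0-9]+) EUR', sample)
print(amounts)  # ['78,40', '20,67']

# Assumption: comma is the decimal separator, no thousands separators.
values = [Decimal(a.replace(',', '.')) for a in amounts]
print(sum(values))  # 99.07
```

The date on the last line is ignored because it contains no comma and is not followed by EUR.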
Here is a solution that parses the statement and puts the result in a dictionary called bank_statement:
# -*- coding: utf-8 -*-
my_str = """
Dividendengutschrift für inländische Wertpapiere
Depotinhaber : ME
Extag : 18.04.2013 Bruttodividende
Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR
Valuta : 18.04.2013
Bruttodividende : 78,40 EUR
*Einbeh. Steuer : 20,67 EUR
Nettodividende : 78,40 EUR
Endbetrag : 57,73 EUR
"""
bank_statement = {}
for line in my_str.split('\n'):
    tokens = line.split()
    #print tokens
    it = iter(tokens)
    category = ''
    for token in it:
        if token == ':':
            # Everything collected before the colon is the key; the next token is its value
            category = category.strip(' *')
            bank_statement[category] = next(it)
            category = ''
        else:
            category += ' ' + token

# bank_statement now has all the values
print '\n'.join('{0:.<18} {1}'.format(k, v)
                for k, v in sorted(bank_statement.items()))
The output of this code:
Bruttodividende... 78,40
Depotinhaber...... ME
Einbeh. Steuer.... 20,67
Endbetrag......... 57,73
Extag............. 18.04.2013
Nettodividende.... 78,40
Valuta............ 18.04.2013
Zahlungstag....... 18.04.2013
pro Stück........ 0,9800
For example, bank_statement['Extag'] has the value '18.04.2013'.