I am trying to parse transaction letters from my (German) bank. I'd like to extract all the numbers from the following string which turns out to be harder than I thought. Option 2 does almost what I want. I now want to modify it to capture e.g. 80 as well.
My first try is option 1 which only returns garbage. Why is it returning so many empty strings? It should always have at least a number from the first \d+, no?
Option 3 works (or at least works as expected), so somehow I am answering my own question. I guess I'm mostly banging my head about why option 2 does not work.
# -*- coding: utf-8 -*-
import re
my_str = """
Dividendengutschrift für inländische Wertpapiere
Depotinhaber : ME
Extag : 18.04.2013 Bruttodividende
Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR
Valuta : 18.04.2013
Bruttodividende : 78,40 EUR
*Einbeh. Steuer : 20,67 EUR
Nettodividende : 78,40 EUR
Endbetrag : 57,73 EUR
"""
print re.findall(r'\d+(,\d+)?', my_str)
print re.findall(r'\d+,\d+', my_str)
print re.findall(r'[-+]?\d*,\d+|\d+', my_str)
Output is
['', '', '', '', '', '', ',98', '', '', '', '', ',40', ',67', ',40', ',73']
['0,9800', '78,40', '20,67', '78,40', '57,73']
['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']
Starting with the carat ^ indicates a beginning of line. The 0-9 indicates characters 0 through 9, the comma , indicates comma, and the semicolon indicates a ; . The closing ] indicates the end of the character set. The plus + indicates that one or more of the "previous item" must be present.
The regex [0-9] matches single-digit numbers 0 to 9. [1-9][0-9] matches double-digit numbers 10 to 99. That's the easy part. Matching the three-digit numbers is a little more complicated, since we need to exclude numbers 256 through 999.
In regex, there are basically two types of characters: Regular characters, or literal characters, which means that the character is what it looks like. The letter "a" is simply the letter "a". A comma "," is simply a comma and has no special meaning.
(? i) makes the regex case insensitive. (? c) makes the regex case sensitive.
Option 1 is the most suitable of the regex, but it is not working correctly because findall will return what is matched by the capture group (), not the complete match.
For example, the first three matches in your example will be the 18, 04 and 2013, and in each case the capture group will be unmatched so an empty string will be added to the results list.
The solution is to make the group non-capturing
r'\d+(?:,\d+)?'
Option 2 does not work only so far as it won't match sequences that don't contain a comma.
Option 3 isn't great because it will match e.g. +,1.
I'd like to extract all the numbers from the following string ...
By "numbers", if you mean both the currency amounts AND the dates, I think that this will do what you want:
print re.findall(r'[0-9][0-9,.]+', my_str)
Output:
['18.04.2013', '18.04.2013', '0,9800', '18.04.2013', '78,40', '20,67', '78,40', '57,73']
If by "numbers" you mean only the currency amounts, then use
print re.findall(r'[0-9]+,[0-9]+', my_str)
Or perhaps better yet,
print re.findall(r'[0-9]+,[0-9]+ EUR', my_str)
Here is a solution, which parse the statement and put the result in a dictionary called bank_statement:
# -*- coding: utf-8 -*-
import itertools
my_str = """
Dividendengutschrift für inländische Wertpapiere
Depotinhaber : ME
Extag : 18.04.2013 Bruttodividende
Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR
Valuta : 18.04.2013
Bruttodividende : 78,40 EUR
*Einbeh. Steuer : 20,67 EUR
Nettodividende : 78,40 EUR
Endbetrag : 57,73 EUR
"""
bank_statement = {}
for line in my_str.split('\n'):
tokens = line.split()
#print tokens
it = iter(tokens)
category = ''
for token in it:
if token == ':':
category = category.strip(' *')
bank_statement[category] = next(it)
category = ''
else:
category += ' ' + token
# bank_statement now has all the values
print '\n'.join('{0:.<18} {1}'.format(k, v) \
for k, v in sorted(bank_statement.items()))
The Output of this code:
Bruttodividende... 78,40
Depotinhaber...... ME
Einbeh. Steuer.... 20,67
Endbetrag......... 57,73
Extag............. 18.04.2013
Nettodividende.... 78,40
Valuta............ 18.04.2013
Zahlungstag....... 18.04.2013
pro Stück........ 0,9800
bank_statement['Extag'] for example, has the value of '18.04.2013'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With