Python regular expression (regex) match comma separated number - why does this not work?

Tags: python, regex
I am trying to parse transaction letters from my (German) bank. I'd like to extract all the numbers from the following string, which turns out to be harder than I thought. Option 2 does almost what I want; I now want to modify it so that it also captures e.g. 80.

My first try is option 1, which only returns garbage. Why is it returning so many empty strings? It should always have at least a number from the first \d+, no?

Option 3 works (or at least works as expected), so somehow I am answering my own question. I guess I'm mostly banging my head about why option 2 does not work.

# -*- coding: utf-8 -*-
import re


my_str = """
Dividendengutschrift für inländische Wertpapiere

Depotinhaber    : ME

Extag           :  18.04.2013          Bruttodividende
Zahlungstag     :  18.04.2013          pro Stück       :       0,9800 EUR
Valuta          :  18.04.2013

                                       Bruttodividende :        78,40 EUR
                                      *Einbeh. Steuer  :        20,67 EUR
                                       Nettodividende  :        78,40 EUR

                                       Endbetrag       :        57,73 EUR
"""

# Option 1
print(re.findall(r'\d+(,\d+)?', my_str))
# Option 2
print(re.findall(r'\d+,\d+', my_str))
# Option 3
print(re.findall(r'[-+]?\d*,\d+|\d+', my_str))

Output is

['', '', '', '', '', '', ',9800', '', '', '', ',40', ',67', ',40', ',73']
['0,9800', '78,40', '20,67', '78,40', '57,73']
['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']
Matthias Kauer asked May 01 '13 15:05


3 Answers

Option 1 is the most suitable of the three regexes, but it does not work as written: when a pattern contains a capture group (), findall returns what the group matched, not the complete match.

For example, the first three matches in your example will be 18, 04 and 2013, and in each case the capture group is unmatched, so an empty string is added to the results list.

The solution is to make the group non-capturing:

r'\d+(?:,\d+)?'
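
To make the difference concrete, here is a minimal sketch. It assumes Python 3 (the thread itself was written for Python 2) and uses a shortened sample string, `sample`, which is an illustration and not the full letter from the question:

```python
import re

# Shortened, hypothetical sample; not the full bank letter from the question.
sample = "Zahlungstag : 18.04.2013   pro Stück : 0,9800 EUR"

# With a capturing group, findall returns only the group's text,
# which is '' whenever the optional group did not take part in the match.
captured = re.findall(r'\d+(,\d+)?', sample)
print(captured)  # ['', '', '', ',9800']

# With a non-capturing group, findall returns each complete match.
whole = re.findall(r'\d+(?:,\d+)?', sample)
print(whole)  # ['18', '04', '2013', '0,9800']
```

The date still comes back as three separate numbers, because the pattern treats only the comma as part of a number.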

Option 2 falls short only in that it won't match numbers that don't contain a comma.

Option 3 isn't great because it will match e.g. +,1.
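
Both limitations can be seen on a small input. This is a sketch assuming Python 3; the string `sample` is made up to trigger both cases and does not come from the bank letter:

```python
import re

# Hypothetical input: an integer without a comma, a malformed '+,1',
# and a regular comma-separated amount.
sample = "80 Stück, Kurs +,1 EUR, Summe 78,40 EUR"

# Option 2 requires a comma, so the plain integer 80 is skipped.
option_2 = re.findall(r'\d+,\d+', sample)
print(option_2)  # ['78,40']

# Option 3 finds 80, but \d* may match nothing, so '+,1' is accepted too.
option_3 = re.findall(r'[-+]?\d*,\d+|\d+', sample)
print(option_3)  # ['80', '+,1', '78,40']
```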

MikeM answered Oct 26 '22 16:10


I'd like to extract all the numbers from the following string ...

If by "numbers" you mean both the currency amounts and the dates, I think this will do what you want:

print(re.findall(r'[0-9][0-9,.]+', my_str))

Output:

['18.04.2013', '18.04.2013', '0,9800', '18.04.2013', '78,40', '20,67', '78,40', '57,73']

If by "numbers" you mean only the currency amounts, then use

print(re.findall(r'[0-9]+,[0-9]+', my_str))

Or perhaps better yet,

print(re.findall(r'[0-9]+,[0-9]+ EUR', my_str))
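
If the trailing " EUR" is only wanted as a filter and not in the result, one possible variation is to put the amount in a capturing group, since findall returns the group's text when the pattern has exactly one group. The sketch below assumes Python 3 and uses a two-line excerpt, `sample`, in the layout of the letter:

```python
import re

# Hypothetical two-line excerpt in the layout of the question's letter.
sample = """
Zahlungstag     :  18.04.2013          pro Stück       :       0,9800 EUR
                                       Bruttodividende :        78,40 EUR
"""

# ' EUR' must follow the amount, but only the group's text is returned.
amounts = re.findall(r'([0-9]+,[0-9]+) EUR', sample)
print(amounts)  # ['0,9800', '78,40']
```

The date is skipped because it contains dots rather than a comma.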
Dave answered Oct 26 '22 15:10


Here is a solution that parses the statement and puts the result in a dictionary called bank_statement:

# -*- coding: utf-8 -*-

my_str = """
Dividendengutschrift für inländische Wertpapiere

Depotinhaber    : ME

Extag           :  18.04.2013          Bruttodividende
Zahlungstag     :  18.04.2013          pro Stück       :       0,9800 EUR
Valuta          :  18.04.2013

                                       Bruttodividende :        78,40 EUR
                                      *Einbeh. Steuer  :        20,67 EUR
                                       Nettodividende  :        78,40 EUR

                                       Endbetrag       :        57,73 EUR
"""

bank_statement = {}

for line in my_str.split('\n'):
    tokens = line.split()

    # Walk the tokens: everything before a ':' is the category name,
    # and the single token after the ':' is its value.
    it = iter(tokens)
    category = ''
    for token in it:
        if token == ':':
            category = category.strip(' *')
            bank_statement[category] = next(it)
            category = ''
        else:
            category += ' ' + token

# bank_statement now has all the values
print('\n'.join('{0:.<18} {1}'.format(k, v)
                for k, v in sorted(bank_statement.items())))

The Output of this code:

Bruttodividende... 78,40  
Depotinhaber...... ME  
Einbeh. Steuer.... 20,67  
Endbetrag......... 57,73  
Extag............. 18.04.2013  
Nettodividende.... 78,40  
Valuta............ 18.04.2013  
Zahlungstag....... 18.04.2013  
pro Stück........ 0,9800  

Discussion

  • The code scans the statement string line by line.
  • It then breaks each line into tokens.
  • It scans through the tokens looking for a colon. When one is found, the part before the colon becomes the category and the token after it becomes the value. bank_statement['Extag'], for example, has the value '18.04.2013'.
  • Please note that all the values are strings, not numbers, but it is trivial to convert them.
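
The conversion mentioned in the last point could look roughly like this. It is a sketch assuming Python 3; the helper names `parse_amount` and `parse_date` are hypothetical, and the dictionary is a hand-written subset of what the code above produces:

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical subset of the bank_statement dictionary built above.
bank_statement = {
    'Extag': '18.04.2013',
    'Endbetrag': '57,73',
    'pro Stück': '0,9800',
}

def parse_amount(text):
    """Convert a German-formatted amount such as '1.234,56' to a Decimal."""
    # Assumption: '.' only ever appears as a thousands separator
    # and ',' is the decimal mark.
    return Decimal(text.replace('.', '').replace(',', '.'))

def parse_date(text):
    """Convert a day.month.year string such as '18.04.2013' to a date."""
    return datetime.strptime(text, '%d.%m.%Y').date()

end_amount = parse_amount(bank_statement['Endbetrag'])
ex_date = parse_date(bank_statement['Extag'])
print(end_amount)  # 57.73
print(ex_date)     # 2013-04-18
```

Decimal is used here rather than float so that currency amounts keep their exact decimal digits.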
Hai Vu answered Oct 26 '22 14:10