Python regular expression (regex) match comma separated number - why does this not work?

Tags: python, regex

I am trying to parse transaction letters from my (German) bank. I'd like to extract all the numbers from the following string, which turns out to be harder than I thought. Option 2 does almost what I want. I now want to modify it so that it also captures integers such as 80.

My first try is option 1, which only returns garbage. Why is it returning so many empty strings? It should always have at least a number from the first \d+, no?

Option 3 works (or at least works as expected), so somehow I am answering my own question. I guess I'm mostly banging my head about why option 2 does not work.

# -*- coding: utf-8 -*-
import re


my_str = """
Dividendengutschrift für inländische Wertpapiere

Depotinhaber    : ME

Extag           :  18.04.2013          Bruttodividende
Zahlungstag     :  18.04.2013          pro Stück       :       0,9800 EUR
Valuta          :  18.04.2013

                                       Bruttodividende :        78,40 EUR
                                      *Einbeh. Steuer  :        20,67 EUR
                                       Nettodividende  :        78,40 EUR

                                       Endbetrag       :        57,73 EUR
"""

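# Option 1: optional decimal part in a capturing group
print re.findall(r'\d+(,\d+)?', my_str)
# Option 2: decimal part required
print re.findall(r'\d+,\d+', my_str)
# Option 3: signed decimal, or a plain integer
print re.findall(r'[-+]?\d*,\d+|\d+', my_str)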

Output is

['', '', '', '', '', '', ',9800', '', '', '', ',40', ',67', ',40', ',73']
['0,9800', '78,40', '20,67', '78,40', '57,73']
['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']

asked May 01 '13 15:05

Matthias Kauer

3 Answers

Option 1 is the most suitable of the three regexes, but it does not work as expected because, when the pattern contains a capture group (), findall returns what the group matched rather than the complete match.

For example, the first three matches in your example are 18, 04 and 2013. In each case the capture group does not participate in the match, so an empty string is added to the results list.

The solution is to make the group non-capturing:

r'\d+(?:,\d+)?'

Option 2 falls short only in that it won't match numbers that don't contain a comma.

Option 3 isn't great because it will match e.g. +,1.
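To make the difference concrete, here is a minimal sketch on a shortened, made-up sample line (the variable name `sample` and its contents are assumptions for illustration, not part of the original statement). It is written with Python 3 syntax, where `print` is a function.

```python
import re

# Hypothetical one-line sample with dates, a decimal amount and a plain integer.
sample = "Extag : 18.04.2013   pro Stück : 0,9800 EUR   Stück : 80"

# Capturing group: findall returns only the group's text,
# which is '' whenever the optional group did not take part in the match.
with_capture = re.findall(r'\d+(,\d+)?', sample)
print(with_capture)      # ['', '', '', ',9800', '']

# Non-capturing group: findall returns the whole match.
without_capture = re.findall(r'\d+(?:,\d+)?', sample)
print(without_capture)   # ['18', '04', '2013', '0,9800', '80']

# Option 3 accepts a sign followed directly by a comma.
odd_match = re.findall(r'[-+]?\d*,\d+|\d+', '+,1')
print(odd_match)         # ['+,1']
```

The first call returns one entry per match, but each entry is only the group's content, which is why the question's output is mostly empty strings.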

answered Oct 26 '22 16:10

MikeM

"I'd like to extract all the numbers from the following string ..."

If by "numbers" you mean both the currency amounts and the dates, I think this will do what you want. Note that the pattern needs at least two characters, so a lone digit such as 8 would not be matched:

print re.findall(r'[0-9][0-9,.]+', my_str)

Output:

['18.04.2013', '18.04.2013', '0,9800', '18.04.2013', '78,40', '20,67', '78,40', '57,73']

If by "numbers" you mean only the currency amounts, then use

print re.findall(r'[0-9]+,[0-9]+', my_str)

Or perhaps better yet,

print re.findall(r'[0-9]+,[0-9]+ EUR', my_str)
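If the goal is to get the amounts without the " EUR" suffix, a capture group can be used on purpose: findall then returns only the group. The sketch below assumes Python 3 and a shortened copy of the statement (the names `text`, `raw_amounts` and `amounts` are made up for illustration). It also converts the German decimal comma to a `Decimal`.

```python
import re
from decimal import Decimal

# Shortened, hypothetical excerpt of the statement.
text = """
Bruttodividende :        78,40 EUR
*Einbeh. Steuer  :        20,67 EUR
Endbetrag       :        57,73 EUR
"""

# ' EUR' must follow the number, but only the parenthesised part is returned.
raw_amounts = re.findall(r'([0-9]+,[0-9]+) EUR', text)
print(raw_amounts)   # ['78,40', '20,67', '57,73']

# Replace the decimal comma with a point before converting.
amounts = [Decimal(s.replace(',', '.')) for s in raw_amounts]
print(amounts[0] - amounts[1] == amounts[2])   # True
```

Amounts with thousands separators (for example 1.234,56) are not handled by this pattern and would need extra care.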

answered Oct 26 '22 15:10

Dave

Here is a solution that parses the statement and puts the result in a dictionary called bank_statement:

# -*- coding: utf-8 -*-

my_str = """
Dividendengutschrift für inländische Wertpapiere

Depotinhaber    : ME

Extag           :  18.04.2013          Bruttodividende
Zahlungstag     :  18.04.2013          pro Stück       :       0,9800 EUR
Valuta          :  18.04.2013

                                       Bruttodividende :        78,40 EUR
                                      *Einbeh. Steuer  :        20,67 EUR
                                       Nettodividende  :        78,40 EUR

                                       Endbetrag       :        57,73 EUR
"""

bank_statement = {}

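for line in my_str.split('\n'):
    # Split on whitespace; a standalone ':' separates a label from its value
    tokens = line.split()

    it = iter(tokens)
    category = ''
    for token in it:
        if token == ':':
            # The tokens collected so far form the label; drop the padding
            # space and the leading '*' of "*Einbeh. Steuer"
            category = category.strip(' *')
            # The token right after the colon is the value
            bank_statement[category] = next(it)
            category = ''
        else:
            category += ' ' + token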

# bank_statement now has all the values
print '\n'.join('{0:.<18} {1}'.format(k, v) \
                for k, v in sorted(bank_statement.items()))

The output of this code:

Bruttodividende... 78,40  
Depotinhaber...... ME  
Einbeh. Steuer.... 20,67  
Endbetrag......... 57,73  
Extag............. 18.04.2013  
Nettodividende.... 78,40  
Valuta............ 18.04.2013  
Zahlungstag....... 18.04.2013  
pro Stück........ 0,9800

Discussion

The code scans the statement string line by line.
It breaks each line into whitespace-separated tokens.
It then scans through the tokens looking for a colon. When one is found, the tokens before the colon become the category and the token after it becomes the value. bank_statement['Extag'], for example, has the value '18.04.2013'.
Please note that all the values are strings, not numbers, but it is trivial to convert them.
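One possible way to do that conversion is sketched below, assuming Python 3. The helper `convert` and the small dictionary are hypothetical and only stand in for the values collected above. Dates become `date` objects, amounts become `Decimal`, and anything else stays a string.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical subset of the dictionary built by the parser above.
bank_statement = {
    'Extag': '18.04.2013',
    'Bruttodividende': '78,40',
    'Depotinhaber': 'ME',
}

def convert(value):
    """Return a date, a Decimal, or the unchanged string, whichever fits."""
    if re.fullmatch(r'\d{2}\.\d{2}\.\d{4}', value):
        # German date format: day.month.year
        return datetime.strptime(value, '%d.%m.%Y').date()
    if re.fullmatch(r'\d+,\d+', value):
        # German decimal comma becomes a decimal point
        return Decimal(value.replace(',', '.'))
    return value

converted = {key: convert(value) for key, value in bank_statement.items()}
print(converted['Extag'].isoformat())   # 2013-04-18
print(converted['Bruttodividende'])     # 78.40
print(converted['Depotinhaber'])        # ME
```

`Decimal` is used instead of `float` so that currency values keep their exact two decimal places.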

answered Oct 26 '22 14:10

Hai Vu
