Grouping data with a regex in Python

Question

I have some raw data like this:

Dear   John    Buy   1 of Coke, cost 10 dollars
       Ivan    Buy  20 of Milk
Dear   Tina    Buy  10 of Coke, cost 100 dollars
       Mary    Buy   5 of Milk

The data follows these rules:

Not every line starts with "Dear", but a line that does always ends with the cost.
The item is not always an ordinary word; it can be any string (letters, digits, symbols, etc.).

I want to group the information, so I tried a regex. This is what I have so far:

for line in file.readlines():
    match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>\w+)(?:\D+)(?P<costs>\d*)',line)
    if match is not None:
        print(match.groups())
file.close()

Now the output looks like:

('John', '1', 'Coke', '10')
('Ivan', '20', 'Milk', '')
('Tina', '10', 'Coke', '100')
('Mary', '5', 'Milk', '')

The output above is what I want. However, if the item is replaced by an unusual string such as A1~A10, some of the output is wrong:

('Ivan', '20', 'A1', '10')
('Mary', '5', 'A1', '10')

I think the one constant in the item field is that it always ends with , (when a cost follows). But I don't know how to take advantage of that.

Although the code above works for now, I think (?P<item>\w+) has to be replaced with something like (?P<item>.+). If I do that, the tuple captures the wrong string:

('John', '1', 'Coke, cost 10 dollars', '')

How can I read the data into the format I want using a regex in Python?

saikumarm · Accepted Answer

I tried this regular expression:

^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?

Explanation

^(Dear)? matches an optional Dear at the start of the line
\s*(?P<name>\w*) skips whitespace, then captures the name in a named group
\D* matches any run of non-digit characters (such as Buy)
(?P<num>\d+) is a named group that captures the quantity
\sof\s matches the word of with whitespace on either side
(?P<drink>\w*) captures the drink
(,\D*(?P<cost>\d+)\D*)? is an optional group that captures the cost of the drink, if present
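
The breakdown above can also be written directly into the pattern. The following is a minimal sketch using re.VERBOSE, which ignores whitespace inside the pattern and allows # comments; the name ORDER_RE is only illustrative and is not part of the original answer.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each part carries its comment.
# ORDER_RE is a hypothetical name chosen for this sketch.
ORDER_RE = re.compile(r"""
    ^(Dear)?                  # optional greeting at the start of the line
    \s*(?P<name>\w*)          # skip whitespace, then capture the name
    \D*                       # skip non-digit text such as Buy
    (?P<num>\d+)              # quantity
    \sof\s                    # the word of with whitespace on either side
    (?P<drink>\w*)            # drink (word characters only)
    (,\D*(?P<cost>\d+)\D*)?   # optional tail holding the cost
""", re.VERBOSE)

match = ORDER_RE.search('Dear   John    Buy   1 of Coke, cost 10 dollars')
print(match.groupdict())
```

groupdict() returns only the named groups, so the unnamed (Dear) and cost-wrapper groups do not appear in the result.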

Trying it in the interpreter:

>>> import re
>>> reobject = re.compile(r'^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?')

First data snippet

>>> data1 = 'Dear   John    Buy   1 of Coke, cost 10 dollars'
>>> match_object = reobject.search(data1)
>>> match_object.group('name', 'num', 'drink', 'cost')
('John', '1', 'Coke', '10')

Second data snippet

>>> data2 = '       Ivan    Buy  20 of Milk'
>>> match_object = reobject.search(data2)
>>> match_object.group('name', 'num', 'drink', 'cost')
('Ivan', '20', 'Milk', None)
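
For items such as A1~A10, \w* stops at the first non-word character, so the item would be cut short. One possible variation, sketched here under the question's assumption that the item never contains a comma, is to capture everything up to the comma (or the end of the line) instead. The names ITEM_RE and parse_line are hypothetical and not part of the original answer.

```python
import re

# Variation on the pattern above: the item is any run of non-comma characters,
# matched lazily so trailing whitespace and the optional cost tail stay outside it.
# Assumption: an item never contains a comma.
ITEM_RE = re.compile(
    r'^(Dear)?\s*(?P<name>\w+)\D*(?P<num>\d+)\sof\s'
    r'(?P<item>[^,]+?)\s*(?:,\D*(?P<cost>\d+)\D*)?$'
)

def parse_line(line):
    """Return (name, num, item, cost), or None if the line does not match."""
    match = ITEM_RE.search(line)
    return match.group('name', 'num', 'item', 'cost') if match else None

lines = [
    'Dear   John    Buy   1 of Coke, cost 10 dollars',
    '       Ivan    Buy  20 of A1~A10',
    'Dear   Tina    Buy  10 of A1~A10, cost 100 dollars',
    '       Mary    Buy   5 of Milk',
]
for line in lines:
    print(parse_line(line))
# ('John', '1', 'Coke', '10')
# ('Ivan', '20', 'A1~A10', None)
# ('Tina', '10', 'A1~A10', '100')
# ('Mary', '5', 'Milk', None)
```

The trailing $ is what forces the lazy item group to extend to the comma or the end of the line; without it the group would stop after a single character.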

Tags:

python

regex

WenT

1 Answers

saikumarm
