
Grouping data with a regex in Python

Tags:

python

regex

I have some raw data like this:

Dear   John    Buy   1 of Coke, cost 10 dollars
       Ivan    Buy  20 of Milk
Dear   Tina    Buy  10 of Coke, cost 100 dollars
       Mary    Buy   5 of Milk

The rule of the data is:

  • Not every line starts with "Dear", but a line that does always ends with a cost.

  • The item is not always an ordinary word; it can be any string (letters, digits, symbols, etc.).

I want to group the information, and I tried to use a regex. This is what I tried:

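import re

# 'data.txt' is a placeholder name; the original snippet did not show how the file was opened.
with open('data.txt') as file:
    for line in file:
        match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>\w+)(?:\D+)(?P<costs>\d*)', line)
        if match is not None:
            print(match.groups())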

Now the output looks like:

('John', '1', 'Coke', '10')
('Ivan', '20', 'Milk', '')
('Tina', '10', 'Coke', '100')
('Mary', '5', 'Milk', '')

The output above is what I want. However, if the item is replaced by an unusual string such as A1~A10, some of the output is wrong:

('Ivan', '20', 'A1', '10')
('Mary', '5', 'A1', '10')
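
The wrong values can be reproduced in isolation. This is a minimal sketch that assumes an input line such as `       Ivan    Buy  20 of A1~A10` (the exact line is not shown above, so it is an assumption). It suggests that `\w+` stops at the `~`, `\D+` then consumes `~A`, and `\d*` picks up the trailing `10` as if it were a cost:

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(
    r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>\w+)(?:\D+)(?P<costs>\d*)'
)

# Assumed input: the item is "A1~A10" and the line has no cost.
line = '       Ivan    Buy  20 of A1~A10\n'

match = pattern.search(line)
print(match.groups())
```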

I think the one constant in the item field is that it always ends with a comma (when a cost follows). But I don't know how to take advantage of that.

Although the code above works for the simple data, I thought (?P<item>\w+) should be replaced with something like (?P<item>.+). If I do that, the tuple contains the wrong string:

('John', '1', 'Coke, cost 10 dollars', '')
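
A short sketch of why this happens: `.+` is greedy and matches everything up to the end of the line, so nothing is left for the cost group. The sketch assumes the line still carries its trailing newline, as it would when read from a file:

```python
import re

# The question's pattern with (?P<item>\w+) swapped for (?P<item>.+).
pattern = re.compile(
    r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>.+)(?:\D+)(?P<costs>\d*)'
)

# Trailing newline kept on purpose: "." does not match it, so \D+ can.
line = 'Dear   John    Buy   1 of Coke, cost 10 dollars\n'

greedy = pattern.search(line).groups()
print(greedy)
```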

How can I read the data into the format I want using a regex in Python?

WenT asked Jan 20 '16 09:01


1 Answer

I tried this regular expression:

^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?

Explanation

  1. ^(Dear)? matches an optional Dear at the start of the line
  2. \s*(?P<name>\w*) skips whitespace, then a named capture group captures the name
  3. \D* matches any run of non-digit characters
  4. (?P<num>\d+) is a named capture group for the number
  5. \sof\s matches the word of with whitespace on either side
  6. (?P<drink>\w*) captures the drink
  7. (,\D*(?P<cost>\d+)\D*)? is an optional group that captures the cost of the drink
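
The steps above can be sketched with `re.VERBOSE`, which allows the pattern to be spread over several lines with a comment per part. The sample lines are the four from the question; the names `pattern`, `lines` and `rows` are only illustrative:

```python
import re

# Same pattern as above, laid out so each part lines up with the numbered list.
pattern = re.compile(r'''
    ^(Dear)?                  # 1. optional leading "Dear"
    \s*(?P<name>\w*)          # 2. whitespace, then the name
    \D*                       # 3. non-digits ("Buy" and spacing)
    (?P<num>\d+)              # 4. the quantity
    \sof\s                    # 5. the word "of" between whitespace
    (?P<drink>\w*)            # 6. the item
    (,\D*(?P<cost>\d+)\D*)?   # 7. optional ", cost N dollars" tail
''', re.VERBOSE)

lines = [
    'Dear   John    Buy   1 of Coke, cost 10 dollars',
    '       Ivan    Buy  20 of Milk',
    'Dear   Tina    Buy  10 of Coke, cost 100 dollars',
    '       Mary    Buy   5 of Milk',
]

# group() with several names returns a tuple; a missing cost comes back as None.
rows = [pattern.search(line).group('name', 'num', 'drink', 'cost') for line in lines]
for row in rows:
    print(row)
```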

Compiled in Python:

>>> import re
>>> reobject = re.compile(r'^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?')

First data snippet

>>> data1 = 'Dear   John    Buy   1 of Coke, cost 10 dollars'
>>> match_object = reobject.search(data1)
>>> print(match_object.group('name', 'num', 'drink', 'cost'))
('John', '1', 'Coke', '10')

Second data snippet

>>> data2 = '       Ivan    Buy  20 of Milk'
>>> match_object = reobject.search(data2)
>>> print(match_object.group('name', 'num', 'drink', 'cost'))
('Ivan', '20', 'Milk', None)
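
Note that `\w*` for the item would still cut a value such as `A1~A10` down to `A1`. One possible variation, not part of the pattern tested above, is to capture everything up to the first comma. It assumes, as the question suggests, that an item never contains a comma and that a comma always introduces the cost. The helper name `parse_line` is hypothetical:

```python
import re

# Variation: the item is "anything that is not a comma or a newline".
pattern = re.compile(
    r'^(?:Dear)?\s*(?P<name>\w+)\D*(?P<num>\d+)\sof\s'
    r'(?P<item>[^,\n]+)(?:,\D*(?P<cost>\d+))?'
)

def parse_line(line):
    """Return (name, num, item, cost) or None if the line does not match."""
    match = pattern.search(line)
    if match is None:
        return None
    return match.group('name', 'num', 'item', 'cost')

print(parse_line('Dear   John    Buy   1 of Coke, cost 10 dollars'))
print(parse_line('       Ivan    Buy  20 of A1~A10'))
print(parse_line('Dear   Tina    Buy  10 of A1~A10, cost 100 dollars'))
```

With this sketch a missing cost is reported as `None` rather than an empty string; `cost or ''` would convert it if the empty string is preferred.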
saikumarm answered Sep 18 '22 18:09