I'm writing a sort of parser for a little library.
My string is in the following format:
text = "Louis,Edward,John|85.56!26,Billy,Don!18|78.0,Dean"
To be clear, this is a list of people's names separated by commas. Each name may be followed by two optional separators (| and !): after the | comes the weight, a number with 0-2 decimal places, and after the ! comes an integer that represents the age. The separators and their values can appear in either order, as you can see for John and for Don.
I need to extract with a regex (I know I could do it in many other ways) all the names with a length between 2 and 4 characters, together with the two separators and the values that follow them, if they are present.
This is my expected result:
[('John', '|85.56', '!26'), ('Don', '|78.0', '!18'), ('Dean', '', '')]
I'm trying with this code:
import re
text = "Louis,Edward,John|85.56!26,Billy,Don!18|78.0,Dean"
pattern = re.compile(r'(\b\w{2,4}\b)(\!\d+)?(\|\d+(?:\.\d{1,2})?)?')
search_result = pattern.findall(text)
print(search_result)
But this is the actual result:
[('John', '', '|85.56'), ('26', '', ''), ('Don', '!18', '|78.0'), ('Dean', '', '')]
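A likely explanation for the stray ('26', '', '') entry, sketched here on a shortened sample: \w also matches digits, and the two optional groups are tried in a fixed order (age first, then weight). For John the weight comes first, so the age group is skipped, the weight is captured, and the leftover !26 is then picked up as a "name" by the next match. The variable names below are illustrative only.

```python
import re

sample = "John|85.56!26"

# the pattern from the question: age group first, weight group second
original = re.compile(r'(\b\w{2,4}\b)(\!\d+)?(\|\d+(?:\.\d{1,2})?)?')
from_original = original.findall(sample)
print(from_original)  # [('John', '', '|85.56'), ('26', '', '')]

# \w matches letters, digits and underscore, so digit runs qualify as "names"
as_word = re.findall(r'\b\w{2,4}\b', sample)
print(as_word)  # ['John', '85', '56', '26']

# restricting the class to letters leaves only the real name
as_letters = re.findall(r'\b[A-Za-z]{2,4}\b', sample)
print(as_letters)  # ['John']
```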
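The following regex seems to give what you want. The alternation inside the repeated group lets the two separators appear in either order, and [a-z] keeps digits from being matched as names: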
re.findall(r'(\b[a-z]{2,4}\b)(?:(!\d+)|(\|\d+(?:\.\d{,2})?))*', text, re.I)
#[('John', '!26', '|85.56'), ('Don', '!18', '|78.0'), ('Dean', '', '')]
If you do not want the names that come without any values (such as Dean), you can easily filter them out afterwards.
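Note that the tuples come back as (name, age, weight), while the question lists them as (name, weight, age). A minimal sketch of reordering and filtering the result, assuming that "filtering" means dropping entries with neither value (the names reordered and with_values are made up for illustration):

```python
import re

text = "Louis,Edward,John|85.56!26,Billy,Don!18|78.0,Dean"

# pattern from the answer above; the groups are (name, age, weight)
matches = re.findall(
    r'(\b[a-z]{2,4}\b)(?:(!\d+)|(\|\d+(?:\.\d{,2})?))*', text, re.I
)

# reorder each tuple to (name, weight, age), the order used in the question
reordered = [(name, weight, age) for name, age, weight in matches]
print(reordered)
# [('John', '|85.56', '!26'), ('Don', '|78.0', '!18'), ('Dean', '', '')]

# optional: keep only the names that carry at least one value
with_values = [entry for entry in reordered if entry[1] or entry[2]]
print(with_values)
# [('John', '|85.56', '!26'), ('Don', '|78.0', '!18')]
```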
Pyparsing is good at composing complex expressions from simpler ones, and includes many builtins for optional, unordered, and comma-delimited values. See the comments in the code below:
import pyparsing as pp
real = pp.pyparsing_common.real
integer = pp.pyparsing_common.integer
name = pp.Word(pp.alphas, min=2, max=4)
# a valid person entry starts with a name followed by an optional !integer for age
# and an optional |real for weight; the '&' operator allows these to occur in either
# order, but at most only one of each will be allowed
expr = pp.Group(name("name")
+ (pp.Optional(pp.Suppress('!') + integer("age"), default='')
& pp.Optional(pp.Suppress('|') + real("weight"), default='')))
# other entries that we don't care about
other = pp.Word(pp.alphas, min=5)
# an expression for the complete input line - delimitedList defaults to using
# commas as delimiters; and we don't really care about the other entries, just
# suppress them from the results; whitespace is also skipped implicitly, but that
# is not an issue in your given sample text
input_expr = pp.delimitedList(expr | pp.Suppress(other))
# try it against your test data
text = "Louis,Edward,John|85.56!26,Billy,Don!18|78.0,Dean"
input_expr.runTests(text)
Prints:
Louis,Edward,John|85.56!26,Billy,Don!18|78.0,Dean
[['John', 85.56, 26], ['Don', 18, 78.0], ['Dean', '', '']]
[0]:
['John', 85.56, 26]
- age: 26
- name: 'John'
- weight: 85.56
[1]:
['Don', 18, 78.0]
- age: 18
- name: 'Don'
- weight: 78.0
[2]:
['Dean', '', '']
- name: 'Dean'
In this case, using the pre-defined real and integer expressions not only parses the values, but also converts them to float and int. The named results can be accessed like object attributes:
for person in input_expr.parseString(text):
    print("({!r}, {}, {})".format(person.name, person.age, person.weight))
Gives:
('John', 26, 85.56)
('Don', 18, 78.0)
('Dean', , )
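For comparison, a similar typed result can be approximated with the standard library alone. This is only a sketch and not part of either answer above: the Person tuple and the parse_people helper are hypothetical names, the separators are left out of the captured values, and missing values are represented as None rather than ''.

```python
import re
from typing import NamedTuple, Optional


class Person(NamedTuple):  # hypothetical container, not from the answers above
    name: str
    age: Optional[int]
    weight: Optional[float]


# named groups make the result independent of the order of the separators
ENTRY = re.compile(
    r"\b(?P<name>[A-Za-z]{2,4})\b"
    r"(?:!(?P<age>\d+)|\|(?P<weight>\d+(?:\.\d{1,2})?))*"
)


def parse_people(text: str) -> list:
    """Return a Person for every 2-4 letter name, converting age and weight."""
    people = []
    for m in ENTRY.finditer(text):
        age = m.group("age")
        weight = m.group("weight")
        people.append(Person(
            m.group("name"),
            int(age) if age is not None else None,
            float(weight) if weight is not None else None,
        ))
    return people


text = "Louis,Edward,John|85.56!26,Billy,Don!18|78.0,Dean"
for person in parse_people(text):
    print(person.name, person.age, person.weight)
```

Using None for absent values keeps "no age given" distinct from any real number, at the cost of differing from the empty strings used in the outputs above.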