If I have a few lines which read:
1,000 barrels
5 Megawatts hours (MWh)
80 Megawatt hours (MWh) (5 MW per peak hour).
What is the best way to capture the numeric element (namely, just the first instance) and the contents of the first parentheses, if they exist?
My current approach is to split the string on every ' ' and use str.isalpha
to find the non-alphabetic elements. However, I am not sure how to obtain the first entry in the parentheses.
Here's an approach using regexps:
import re

text = """1,000 barrels
5 Megawatts hours (MWh)
80 Megawatt hours (MWh) (...)"""

r_unit = re.compile(r"\((\w+)\)")   # first parenthesised word, e.g. (MWh)
r_value = re.compile(r"([\d,]+)")   # first run of digits and commas

for line in text.splitlines():
    unit = r_unit.search(line)
    if unit:
        unit = unit.group(1)
    else:
        unit = ""
    value = r_value.search(line)
    if value:
        value = value.group(1)
    else:
        value = ""
    print(value, unit)
Another, simpler approach would be to use a single regexp like this:
# [^(\n]* stops before the first parenthesis; a greedy .* would swallow
# the rest of the line and leave the unit group empty.
r = re.compile(r"(([\d,]+)[^(\n]*(?:\((\w+)\))?)")
for line, value, unit in r.findall(text):
    print(value, unit)
(I thought about that one just after writing the previous one :-p)
Full explanation of the last regexp:
(            <- LINE GROUP
  (          <- VALUE GROUP
    [        <- character class (i.e. the char read is one of the following characters)
      \d     <- any digit
      ,      <- a comma
    ]
    +        <- one or more of the previous expression
  )
  [^(\n]     <- any character except an opening parenthesis or a newline
  *          <- zero or more of the previous expression
  (?:        <- non-capturing group around the whole "(unit)" part
    \(       <- a real opening parenthesis
    (        <- UNIT GROUP
      \w     <- any word character (letter, digit or underscore)
      +      <- one or more of the previous expression
    )
    \)       <- a real closing parenthesis
  )
  ?          <- zero or one of the previous expression (the unit is optional)
)
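A minimal sketch of the same single-regexp idea applied one line at a time, so that only the first number and the first parenthesised text on each line are returned. The names LINE_RE and parse_line are hypothetical, and the named groups and re.VERBOSE layout are only there for readability; the sample uses the full third line from the question.

```python
import re

# Hypothetical pattern name; named groups are used only for readability.
LINE_RE = re.compile(r"""
    (?P<value>\d[\d,]*)        # first run of digits, commas allowed inside
    [^(]*                      # anything up to the first opening parenthesis
    (?:\((?P<unit>[^)]*)\))?   # optional first parenthesised text
""", re.VERBOSE)


def parse_line(line):
    """Return (value, unit) for one line, or None if the line has no number.

    unit is '' when the line has no parentheses after the number.
    """
    match = LINE_RE.search(line)
    if match is None:
        return None
    return match.group("value"), match.group("unit") or ""


sample = """1,000 barrels
5 Megawatts hours (MWh)
80 Megawatt hours (MWh) (5 MW per peak hour)"""

results = [parse_line(line) for line in sample.splitlines()]
print(results)  # [('1,000', ''), ('5', 'MWh'), ('80', 'MWh')]
```

Matching line by line means the 5 inside the second pair of parentheses is never reported as a separate value.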
To extract the numeric values you can use re:
import re
value = """1,000 barrels
5 Megawatts hours (MWh)
80 Megawatt hours (MWh) (5 MW per peak hour)"""
re.findall(r"[0-9]+,?[0-9]*", value)  # ['1,000', '5', '80', '5']
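Because findall returns every number in the text, the 5 inside the last parentheses is included as well. If only the first number on each line is wanted, and as an actual integer, something along these lines might do. It is a sketch that assumes the comma is a thousands separator and that the values are whole numbers; FIRST_NUMBER and first_number are hypothetical names.

```python
import re

# Hypothetical name: a digit followed by any further digits or commas.
FIRST_NUMBER = re.compile(r"\d[\d,]*")


def first_number(line):
    """Return the first number on the line as an int, or None if there is none."""
    match = FIRST_NUMBER.search(line)
    if match is None:
        return None
    # Assumption: commas are thousands separators, so they can simply be dropped.
    return int(match.group().replace(",", ""))


value = """1,000 barrels
5 Megawatts hours (MWh)
80 Megawatt hours (MWh) (5 MW per peak hour)"""

numbers = [first_number(line) for line in value.splitlines()]
print(numbers)  # [1000, 5, 80]
```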