Extracting numeric data in Python

Tags:

python

If I have a few lines that read:

1,000 barrels
5 Megawatts hours (MWh)
80 Megawatt hours (MWh) (5 MW per peak hour).

What is the best way to capture the numeric element (namely just the first instance on each line) and the first parentheses, if it exists?

My current approach is to split the string on every ' ' and use str.isalpha to find the non-alphabetic elements. But I'm not sure how to obtain the first entry in the parentheses.

Max Kim asked Jun 06 '13 14:06


2 Answers

Here's an approach using regexps:

import re

text = """1,000 barrels
5 Megawatts hours (MWh)
80 Megawatt hours (MWh) (...)"""

# Unit: a single word inside parentheses, e.g. "(MWh)"
r_unit = re.compile(r"\((\w+)\)")
# Value: a run of digits and commas, e.g. "1,000"
r_value = re.compile(r"([\d,]+)")

for line in text.splitlines():
    # search() returns the first match on the line, or None
    unit = r_unit.search(line)
    unit = unit.group(1) if unit else ""
    value = r_value.search(line)
    value = value.group(1) if value else ""
    print(value, unit)

Another, simpler approach is to use a single regexp like this:

r = re.compile(r"(([\d,]+)[^(\n]*(?:\((\w+)\))?)")
for line, value, unit in r.findall(text):
    print(value, unit)

(I thought about that one just after writing the previous one :-p)

Full explanation of the last regexp:

(      <- LINE GROUP
 (     <- VALUE GROUP
  [    <- character class (i.e. the character read is one of the following)
   \d  <- any digit
   ,   <- a comma
  ]
  +    <- one or more of the previous expression
 )
 [     <- character class
  ^    <- negation: any character except the following
  (    <- an opening parenthesis
  \n   <- a newline
 ]
 *     <- zero or more of the previous expression
          (a greedy .* here would swallow the parentheses and leave the unit empty)
 (?:   <- non-capturing group, so the parentheses and the unit are optional together
  \(   <- a literal opening parenthesis
  (    <- UNIT GROUP
   \w  <- any alphanumeric/in-word character
   +   <- one or more of the previous expression
  )
  \)   <- a literal closing parenthesis
 )
 ?     <- zero or one of the previous expression
)
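
A breakdown like the one above can also live inside the pattern itself by compiling with re.VERBOSE, which ignores whitespace and allows # comments in the regexp. The following is only a sketch under one reading of the question (first number on each line, plus the first single-word parentheses on that line if there is one). The names LINE_PATTERN, extract and sample are made up for illustration.

```python
import re

# Hypothetical names; the pattern assumes one record per line.
LINE_PATTERN = re.compile(
    r"""
    ^[^\d\n]*              # skip any non-digit text at the start of the line
    (?P<value>\d[\d,]*)    # VALUE: a digit followed by more digits or commas
    [^(\n]*                # anything up to the first "(" or the end of the line
    (?:\( (?P<unit>\w+) \))?   # UNIT: optional single word inside parentheses
    """,
    re.VERBOSE | re.MULTILINE,
)


def extract(text):
    """Return one (value, unit) pair per line that contains a number.

    unit is None when the first parentheses are missing or hold more than one word.
    """
    return [(m.group("value"), m.group("unit")) for m in LINE_PATTERN.finditer(text)]


sample = """1,000 barrels
5 Megawatts hours (MWh)
80 Megawatt hours (MWh) (5 MW per peak hour)."""

for value, unit in extract(sample):
    print(value, unit)
```

Because the pattern is anchored with ^ under re.MULTILINE, it matches at most once per line, so the 5 inside the second parentheses on the last line is not reported as a separate value.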
zmo answered Oct 13 '22 16:10


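To extract numerical values you can use re.findall:

import re
value = """1,000 barrels
           5 Megawatts hours (MWh)
           80 Megawatt hours (MWh) (5 MW per peak hour)"""
# One or more digits, an optional comma, then any further digits.
# This returns every number in the text, including the 5 inside the last parentheses.
print(re.findall(r"[0-9]+,?[0-9]*", value))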
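
If only the first number on each line is wanted, as the question asks, one possible variation is to search line by line and convert the result to an integer. This is a minimal sketch that assumes commas are thousands separators and that the values are whole numbers. FIRST_NUMBER, first_number_per_line and lines are illustrative names, not part of the answer above.

```python
import re

# Hypothetical helper names; assumes "1,000" means one thousand.
FIRST_NUMBER = re.compile(r"\d[\d,]*")  # a digit, then any digits or commas


def first_number_per_line(text):
    """Return the first number on each line as an int, or None if the line has none."""
    results = []
    for line in text.splitlines():
        match = FIRST_NUMBER.search(line)  # search() stops at the first match
        results.append(int(match.group().replace(",", "")) if match else None)
    return results


lines = """1,000 barrels
5 Megawatts hours (MWh)
80 Megawatt hours (MWh) (5 MW per peak hour)"""

print(first_number_per_line(lines))
```

Using search() on each line rather than findall() on the whole text is what keeps later numbers on the same line out of the result.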
Ayaz Ahmad answered Oct 13 '22 16:10