
Extracting Data with Python Regular Expressions

I am having trouble coming up with a Python regular expression to extract specific values.

The page I am trying to parse contains a number of productIds, which appear in the following format:

\"productId\":\"111111\"

I need to extract all the values, 111111 in this case.

greyfox asked Apr 11 '13 20:04





1 Answer

import re

t = "\"productId\":\"111111\""
m = re.match(r"\W*productId[^:]*:\D*(\d+)", t)
if m:
    print(m.group(1))

This means: match any leading non-word characters (\W*), then productId followed by any non-colon characters ([^:]*) and a colon (:). Then skip any non-digits (\D*) and capture the digits that follow ((\d+)). Note that re.match only matches at the start of the string.

Output

111111
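
Since the question asks for all the values on the page, here is a minimal sketch using re.findall with the same pattern (minus the leading \W*, which findall does not need because it scans the whole string). The page string is a made-up fragment, assuming the backslashes in the question are literal characters in the source:

```python
import re

# Hypothetical page fragment (not from the question); raw string keeps the
# backslashes literal, as in \"productId\":\"111111\"
page = r'{\"productId\":\"111111\",\"name\":\"a\"},{\"productId\":\"222222\"}'

# productId, then anything up to the colon, then skip non-digits and
# capture the run of digits that follows
pattern = re.compile(r"productId[^:]*:\D*(\d+)")

# findall returns the captured group for every non-overlapping match
product_ids = pattern.findall(page)
print(product_ids)
```

One caveat: \D* skips any number of non-digits, so if a productId value contained no digits, the pattern could capture digits from a later field instead.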
perreal answered Oct 21 '22 10:10
