I am having trouble coming up with a Python regular expression to extract specific values.
The page I am trying to parse has a number of productIds, which appear in the following format:
\"productId\":\"111111\"
I need to extract all the values, 111111 in this case.
Regex isn't suited to parsing HTML, because HTML isn't a regular language. For the same reason, regex is rarely the right tool for parsing source code, where a tokenizer or parser does a better job. I would also avoid parsing a URL's path and query parameters with regex.
You can try the readlines method, which returns a list of lines. In your case, you are first reading the whole file as one string (inp.read()) and then calling split() on it, which splits the string on whitespace.
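To make the difference between read() followed by split() and readlines() concrete, here is a minimal sketch. The in-memory file and its contents are made up for illustration; a real file object opened with open() should behave the same way.

```python
import io

# Hypothetical in-memory file standing in for the real input file.
inp = io.StringIO("first line here\nsecond line\n")

words = inp.read().split()   # whole file as one string, split on any whitespace
inp.seek(0)                  # rewind so the file can be read a second time
lines = inp.readlines()      # one list entry per line, trailing newline kept

print(words)
print(lines)
```

Note that readlines() keeps the trailing newline on each entry, so a strip() per line is often needed afterwards.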
import re

t = "\"productId\":\"111111\""
# Skip leading non-word characters, match the key, then capture the digits.
m = re.match(r"\W*productId[^:]*:\D*(\d+)", t)
if m:
    print(m.group(1))
This means: match non-word characters (\W*), then productId followed by non-colon characters ([^:]*) and a :. Then match non-digits (\D*) and match and capture the following digits ((\d+)).
Output
111111
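Since re.match only tries to match at the start of the string, extracting every productId on a page needs a scan over the whole text. Below is a minimal sketch using re.findall, assuming the page content is already available as a string. The sample text, the PRODUCT_ID_RE name and the extract_product_ids helper are hypothetical and not part of the original answer.

```python
import re

# Hypothetical page fragment; the backslash-escaped quotes mirror the
# format shown in the question.
page = r'{\"productId\":\"111111\",\"name\":\"a\"},{\"productId\":\"222222\"}'

# Allow an optional backslash before each quote, so both the escaped form
# and plain "productId":"..." are matched. Only the digits are captured.
PRODUCT_ID_RE = re.compile(r'\\?"productId\\?"\s*:\s*\\?"(\d+)')

def extract_product_ids(text):
    """Return every numeric productId value found in text, in order."""
    return PRODUCT_ID_RE.findall(text)

print(extract_product_ids(page))
```

With a single capture group, re.findall returns a list of the captured strings, so the result can be used directly without calling group(1) on each match.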