
Python Regex - find string between html tags [duplicate]

Tags: python, html, regex

I am trying to extract the string between HTML tags. I can see that similar questions have been asked on Stack Overflow before, but I am completely new to Python and I am struggling.

So if I have

<b>Bold Stuff</b>

I want to have a regular expression that leaves me with

Bold Stuff

But all of my solutions so far have left me with things like

>Bold Stuff<

I would really appreciate any help with this.

I had

>.*?<

And I have seen a question on Stack Overflow with this suggested solution

>([^<>]*)<

But neither of these is working for me. Could someone please explain how to write a regex that says "find me the string between characters x and y, not including x and y"?

Thanks for any help

JungleBook asked Dec 25 '22 13:12


1 Answer

>>> import re
>>> a = '<b>Bold Stuff</b>'
>>> re.findall(r'>(.+?)<', a)  # findall returns a list of the captured groups
['Bold Stuff']
>>> re.findall(r'>(.*?)<', a)[0]  # non-greedy, zero or more characters
'Bold Stuff'
>>> re.findall(r'>(.+?)<', a)[0]  # also non-greedy, one or more characters
'Bold Stuff'
>>> re.findall(r'>(.*)<', a)[0]  # greedy
'Bold Stuff'

For this input, both the greedy and the non-greedy patterns work, because the string contains only one piece of text between a > and a <.

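Your own pattern, >.*?<, is already the non-greedy form; it only lacks the parentheses around .*?, which is why the > and < end up in your result. Here is an example of where non-greedy and greedy mode differ: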

>>> a = '<b>Bold <br> Stuff</b>'
>>> re.findall(r'>(.*?)<', a)[0]  # non-greedy: stops at the first '<' it reaches
'Bold '
>>> re.findall(r'>(.*)<', a)[0]  # greedy: runs on to the last '<' in the string
'Bold <br> Stuff'
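To see the difference across the whole string rather than only the first match, here is a small sketch that runs three patterns on the same input. The variable names are illustrative only, and the third pattern is the >([^<>]*)< suggestion mentioned in the question.

```python
import re

text = '<b>Bold <br> Stuff</b>'

# Non-greedy: each match stops at the nearest following '<'.
lazy = re.findall(r'>(.*?)<', text)

# Greedy: a single match runs from the first '>' to the last '<'.
greedy = re.findall(r'>(.*)<', text)

# Negated character class: the captured text cannot contain '<' or '>' at all.
negated = re.findall(r'>([^<>]*)<', text)

print(lazy)     # ['Bold ', ' Stuff']
print(greedy)   # ['Bold <br> Stuff']
print(negated)  # ['Bold ', ' Stuff']
```

On this input the negated character class gives the same result as the non-greedy pattern, since neither can run past a '<'.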

And here is what the Python re documentation says about (...):

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals ( or ), use \( or \), or enclose them inside a character class: [(], [)].
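As a rough illustration of the passage above, the sketch below shows what the parentheses change, how \1 refers back to a group, and one possible way to write the "between x and y, not including x and y" pattern from the question. The helper name between and its parameters are hypothetical, not part of the re module.

```python
import re

text = '<b>Bold Stuff</b>'

# Without a group, findall returns the whole match, delimiters included.
whole = re.findall(r'>.*?<', text)

# With a group, findall returns only what the parentheses captured.
inner = re.findall(r'>(.*?)<', text)

# A match object exposes both: group(0) is the whole match, group(1) the group.
m = re.search(r'>(.*?)<', text)

# \1 refers back to group 1, so the closing tag must repeat the opening name.
pairs = re.findall(r'<(\w+)>(.*?)</\1>', text)


def between(s, start, end):
    """Hypothetical helper: return every shortest run of text found between
    the literal strings start and end, excluding the delimiters themselves."""
    # re.escape makes the delimiters literal, so '(' or '.' are safe to pass.
    # Note that '.' does not match newlines unless re.DOTALL is used.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    return re.findall(pattern, s)


print(whole)                               # ['>Bold Stuff<']
print(inner)                               # ['Bold Stuff']
print(m.group(0), '|', m.group(1))         # >Bold Stuff< | Bold Stuff
print(pairs)                               # [('b', 'Bold Stuff')]
print(between(text, '>', '<'))             # ['Bold Stuff']
print(between('f(x) + g(y)', '(', ')'))    # ['x', 'y']
```

This only suits small, predictable snippets like the one in the question; nested or malformed markup will not be handled correctly by a pattern this simple.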

Remi Crystal answered Dec 27 '22 01:12