I am trying to extract the string between HTML tags. I can see that similar questions have been asked on Stack Overflow before, but I am completely new to Python and I am struggling.
So if I have
<b>Bold Stuff</b>
I want to have a regular expression that leaves me with
Bold Stuff
But all of my solutions so far have left me with things like
>Bold Stuff<
I would really appreciate any help with this.
I had
>.*?<
And I have seen a question on Stack Overflow with the suggested solution
>([^<>]*)<
But neither of these is working for me. Could someone please explain how to write a regex that says "find me the string between characters x and y, not including x and y"?
Thanks for any help
>>> a = '<b>Bold Stuff</b>'
>>>
>>> import re
>>> re.findall(r'>(.+?)<', a)
['Bold Stuff']
>>> re.findall(r'>(.*?)<', a)[0] # non-greedy mode
'Bold Stuff'
>>> re.findall(r'>(.+?)<', a)[0] # + instead of *: also non-greedy, but needs at least one character
'Bold Stuff'
>>> re.findall(r'>(.*)<', a)[0] # greedy mode
'Bold Stuff'
>>>
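For this input, greedy mode and non-greedy mode give the same result, because the string contains only one < after the first >.
Your pattern >.*?< is already non-greedy; what it lacks is the capturing group (...), so the match includes the > and < themselves. Here is an example where non-greedy mode and greedy mode differ: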
>>> a = '<b>Bold <br> Stuff</b>'
>>> re.findall(r'>(.*?)<', a)[0]
'Bold '
>>> re.findall(r'>(.*)<', a)[0]
'Bold <br> Stuff'
>>>
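To make the comparison concrete, here is a small sketch that runs the three patterns from this thread (non-greedy, greedy, and the negated character class from the other answer) against the same input. The variable names and the multi-line test string are made up for illustration; only re.findall comes from the standard library.

```python
import re

nested = '<b>Bold <br> Stuff</b>'
multiline = '<b>Bold\nStuff</b>'   # illustrative input with a newline inside the tag

lazy = r'>(.*?)<'        # non-greedy: stop at the first '<'
greedy = r'>(.*)<'       # greedy: run to the last '<' on the line
negated = r'>([^<>]*)<'  # anything except '<' and '>', newlines included

lazy_nested = re.findall(lazy, nested)
greedy_nested = re.findall(greedy, nested)
negated_nested = re.findall(negated, nested)

# '.' does not match a newline unless re.DOTALL is passed,
# so the non-greedy pattern finds nothing here.
lazy_multiline = re.findall(lazy, multiline)
negated_multiline = re.findall(negated, multiline)

print(lazy_nested)        # ['Bold ', ' Stuff']
print(greedy_nested)      # ['Bold <br> Stuff']
print(negated_nested)     # ['Bold ', ' Stuff']
print(lazy_multiline)     # []
print(negated_multiline)  # ['Bold\nStuff']
```

On single-line input the non-greedy pattern and the negated character class behave the same; they only diverge when the text between the tags spans several lines.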
And here is what the documentation says about (...):
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals ( or ), use \( or \), or enclose them inside a character class: [(] [)].
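As a sketch of how the group answers the original "between x and y, not including x and y" question: without parentheses re.findall returns the whole match, and with them it returns only the group. The between helper below is a hypothetical name, not part of the re module; it assumes the delimiters are literal strings and escapes them with re.escape.

```python
import re

def between(text, start, end):
    """Return every substring found between start and end, excluding both.

    Hypothetical helper for illustration: start and end are treated as
    literal text, so they are escaped before being put into the pattern.
    """
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    return re.findall(pattern, text)

a = '<b>Bold Stuff</b>'

without_group = re.findall(r'>.*?<', a)   # no group: the whole match comes back
with_group = re.findall(r'>(.*?)<', a)    # one group: only the group comes back

match = re.search(r'>(.*?)<', a)
whole = match.group(0)   # the entire match, delimiters included
inner = match.group(1)   # just the first group

print(without_group)                          # ['>Bold Stuff<']
print(with_group)                             # ['Bold Stuff']
print(whole, '|', inner)                      # >Bold Stuff< | Bold Stuff
print(between(a, '>', '<'))                   # ['Bold Stuff']
print(between('x[one] y[two]', '[', ']'))     # ['one', 'two']
```

Escaping matters for delimiters such as [ or (, which would otherwise be read as regex syntax.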