I need to parse an HTML document which contains "code" tags.
I'm getting the code blocks like this:
soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')
The problem is, if I have a code tag like this:
<code class="csharp">
List<Person> persons = new List<Person>();
</code>
BeautifulSoup forces the closing of the nested tags and transforms the code block into:
<code class="csharp">
List<person> persons = new List</person><person>();
</person>
</code>
Is there any way to extract the content of the code tags as text with BeautifulSoup, without letting it fix what IT thinks are HTML markup errors?
Add the code tag to the QUOTE_TAGS dictionary, so that BeautifulSoup treats everything inside it as literal text instead of parsing it as markup.
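from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3
content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"
# Register 'code' as a quote tag before parsing, so its contents are left untouched
BeautifulSoup.QUOTE_TAGS['code'] = None
soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')
print(code_blocks)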
Output:
[<code class="csharp"> List<Person> persons = new List<Person>(); </code>]
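If only the raw text of the code blocks is needed, one alternative is to pull it out with a regular expression before any HTML parser sees it. This is a minimal sketch using only the standard library, not a BeautifulSoup feature. The names CODE_BLOCK_RE and extract_code_blocks are made up for illustration. It assumes code tags are not nested and that no attribute value contains a ">" character.

```python
import re

# Non-greedy match of <code ...>...</code>; DOTALL lets the body span several lines.
CODE_BLOCK_RE = re.compile(r"<code\b[^>]*>(.*?)</code>", re.DOTALL | re.IGNORECASE)


def extract_code_blocks(html):
    """Return the raw, unparsed text inside each <code> element."""
    return [body.strip() for body in CODE_BLOCK_RE.findall(html)]


content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"
print(extract_code_blocks(content))
```

Because the text is never handed to a parser, the generic type arguments such as <Person> come back exactly as written. The trade-off is that a regular expression does not understand HTML in general, so this only suits documents where the code tags are simple and well formed.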