How to tell BeautifulSoup to extract the content of a specific tag as text? (without touching it)

I need to parse an HTML document that contains "code" tags.

I'm getting the code blocks like this:

soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')

The problem is, if I have a code tag like this:

<code class="csharp">
    List<Person> persons = new List<Person>();
</code>

BeautifulSoup forces the closing of nested tags and transforms the code block into:

<code class="csharp">
    List<person> persons = new List</person><person>();
    </person>
</code>

Is there any way to extract the content of the code tags as text with BeautifulSoup, without letting it fix what it thinks are HTML markup errors?

BFil asked Feb 25 '23 04:02


1 Answer

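Add the code tag to the QUOTE_TAGS dictionary. BeautifulSoup 3 treats the contents of any tag listed there as literal text instead of parsing it as nested markup; by default the dictionary lists script and textarea.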

from BeautifulSoup import BeautifulSoup

content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"

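# Treat the contents of <code> as literal text rather than nested markup.
# QUOTE_TAGS is a class-level dict, so this affects every later parse.
BeautifulSoup.QUOTE_TAGS['code'] = None
soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')
print(code_blocks)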

Output:

[<code class="csharp"> List<Person> persons = new List<Person>(); </code>]
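
If BeautifulSoup 3 is not available, the raw text inside each code tag can also be pulled out before any parser sees it. The following is only a rough sketch using the standard library's re module, not the QUOTE_TAGS approach above; the name extract_code_blocks is made up for illustration, and it assumes code tags are not nested and that no attribute value contains a ">" character.

```python
import re

# Hypothetical helper, not part of BeautifulSoup: match an opening <code ...>
# tag, capture everything up to the first closing </code>, and leave the
# captured text untouched. Assumes code tags are not nested.
CODE_BLOCK = re.compile(r"<code\b[^>]*>(.*?)</code\s*>", re.DOTALL | re.IGNORECASE)


def extract_code_blocks(html):
    """Return the raw, unparsed contents of every <code> element in html."""
    return [match.group(1) for match in CODE_BLOCK.finditer(html)]


content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"
blocks = extract_code_blocks(content)
print(blocks[0])  # List<Person> persons = new List<Person>();
```

Because the angle brackets never reach an HTML parser, List<Person> comes back exactly as written. The trade-off is that a regular expression knows nothing about document structure, so this only suits simple, well-delimited code tags.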
Rod answered Apr 06 '23 20:04