I'm writing a blog app with Django. I want to enable comment writers to use some tags (like <strong>, <a>, et cetera) but disable all others.
In addition, I want to let them put code in <code> tags and have Pygments highlight them.
For example, someone might write this comment:
I like this article, but the third code example <em>could have been simpler</em>:
<code lang="c">
#include <stdbool.h>
#include <stdio.h>
int main()
{
printf("Hello World\n");
}
</code>
Problem is, when I parse the comment with BeautifulSoup to strip disallowed HTML tags, it also parses the insides of the <code> blocks, and treats <stdbool.h> and <stdio.h> as if they were HTML tags.
How can I tell BeautifulSoup not to parse the <code> blocks? Or are there other HTML parsers better suited for this job?
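To make the failure mode concrete, here is a minimal reproduction, written against the old BeautifulSoup 3 API (bs4 with the built-in html.parser treats <stdio.h> the same way); the comment string is just an illustrative stand-in:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

comment = '<code lang="c">\n#include <stdio.h>\n</code>'
soup = BeautifulSoup(comment)
print([tag.name for tag in soup.findAll(True)])
# prints something like ['code', 'stdio.h'] -- the include is parsed as a tag

So any pass that strips disallowed tags will also strip or mangle the includes.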
From the Python wiki:
>>> import cgi
>>> from BeautifulSoup import BeautifulSoup
>>> cgi.escape("<string.h>")
'&lt;string.h&gt;'
>>> BeautifulSoup('&lt;string.h&gt;',
... convertEntities=BeautifulSoup.HTML_ENTITIES)
The problem is that <code> is treated according to the normal rules for HTML markup, and the content inside <code> tags is still HTML (the tags exist mainly to drive CSS formatting, not to change the parsing rules).
What you are trying to do is create a different markup language that is very similar, but not identical, to HTML. The simple solution would be to assume certain rules, such as "<code> and </code> must appear on a line by themselves," and do some pre-processing yourself.
One way to do that pre-processing is to replace ^<code>$ with <code><![CDATA[ and ^</code>$ with ]]></code>. It isn't completely reliable, because if the code block itself contains ]]>, things will go horribly wrong.
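A minimal sketch of that substitution, assuming the whole comment lives in one string (the function name is just for illustration, and the question's <code lang="c"> form would need the opening pattern widened to allow attributes):

import re

def wrap_code_in_cdata(comment):
    # Relies on the rule above: <code> and </code> each sit on a line of
    # their own.  To accept <code lang="c">, widen the first pattern to
    # r'(?m)^<code[^>]*>$'.
    comment = re.sub(r'(?m)^<code>$', '<code><![CDATA[', comment)
    comment = re.sub(r'(?m)^</code>$', ']]></code>', comment)
    return comment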
A more robust approach is to replace the special characters inside each code block (<, > and & probably suffice) with their equivalent character entity references (&lt;, &gt; and &amp;). You can do this by passing each block of code you identify to cgi.escape(code_block). Once you've completed the preprocessing, submit the result to BeautifulSoup as usual.
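A sketch of that escaping route, again with the comment as one string; the regex, the function name, and the use of re.DOTALL are my own assumptions, and cgi.escape is the Python 2-era helper named above (html.escape replaces it on modern Python):

import cgi  # removed in Python 3.8; use html.escape there instead
import re

CODE_BLOCK = re.compile(r'(<code[^>]*>)(.*?)(</code>)', re.DOTALL)

def escape_code_blocks(comment):
    # Turn <, > and & inside each <code>...</code> block into character
    # entity references, so BeautifulSoup sees text rather than markup.
    def escape(match):
        opening, body, closing = match.groups()
        return opening + cgi.escape(body) + closing
    return CODE_BLOCK.sub(escape, comment)

Run over the example comment above, this turns #include <stdio.h> into #include &lt;stdio.h&gt; before BeautifulSoup ever sees it, so the tag-stripping pass leaves the code block alone.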