
How can I remove "&nbsp" from html contents?

I have an HTML page like this:

<div class="theater">
    <div class="desc" id="theater_16109207495969942346">
        <h2 class="name"><a href="/movies?near=pune&amp;tid=df8f66de0a592b4a" id="link_1_theater_16109207495969942346">Esquare Victory Camp</a></h2>
        <div class="info">site no 2429,general thimayya road, camp contonment,oppositekayani bakery, Pune - 020 2613 2975
            <a class="fl" href="" target="_top"></a>
        </div>
    </div>
    <div class="showtimes">
        <div class="show_left">
            <div class="movie">
                <div class="name"><a href="/movies?near=pune&amp;mid=1cdcf90092189400">Hawaa Hawaai</a>
                </div><span class="info">Drama - Hindi</span>
                <div class="times"><span style="color:#666"><span style="padding:0 "></span>
                    <!-- -->10:30am</span><span style="color:#666"><span style="padding:0 "> &amp;nbsp</span>
                    <!-- -->3:45</span><span style="color:#666"><span style="padding:0 "> &amp;nbsp</span>
                    <!-- -->6:00</span><span style="color:"><span style="padding:0 "> &amp;nbsp</span>
                    <!-- -->8:30pm</span>
                </div>
            </div>
        </div>
        <div class="show_right">
            <div class="movie">
                <div class="name"><a href="/movies?near=pune&amp;mid=6b59ad39004d895b">The Amazing Spider Man 2</a>
                </div><span class="info">Action/Adventure/Thriller - English - <a class="fl" href="/url?q=http://www.youtube.com/watch%3Fv%3DSCjCk59PIzw&amp;sa=X&amp;oi=movies&amp;ii=0&amp;usg=AFQjCNGpVM5U04h0acABA7eApb6EIO4Ejw">Trailer</a></span>
                <div class="times"><span style="color:#666"><span style="padding:0 "></span>
                    <!-- -->1:00</span><span style="color:"><span style="padding:0 "> &amp;nbsp</span>
                    <!-- -->10:45pm</span>
                </div>
            </div>
        </div>
        <p class="clear"></p>
    </div>
</div>

As you can see, &amp;nbsp appears in many places, and there are many other Unicode characters as well. I want to extract the text content of this page. This is what I am doing:

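def removeNonAscii(s):
    # Keep only characters whose code point is below 128 (plain ASCII)
    return "".join(i for i in s if ord(i) < 128)

# soup is the BeautifulSoup object built from the page above
myName = soup.findAll("div", {"class": "theater"})
for x in myName:
    xt = str(x)
    print(removeNonAscii(xt))
    print("<br>")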

The Result:

Esquare Victory Camp
site no 2429,general thimayya road, camp contonment,oppositekayani bakery, Pune - 020 2613 2975
Hawaa Hawaai
Drama - Hindi
10:30am &nbsp3:45 &nbsp6:00 &nbsp8:30pm
The Amazing Spider Man 2
Action/Adventure/Thriller - English - Trailer
1:00 &nbsp10:45pm

Everything looks good except for the &nbsp. I tried replacing &nbsp and searched for other solutions, but I still have not found one. I think the missing ; after &nbsp is causing the problem. How can I remove &nbsp?

asked May 12 '14 15:05 by impossible


2 Answers

lxml.html may be a more suitable library for you: it converts "&nbsp" and other HTML entities into the corresponding characters.

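import lxml.html
import lxml.html.clean  # recent lxml releases may need the separate lxml_html_clean package

html = """your HTML"""
doc = lxml.html.fromstring(html)
# style=True removes <style> elements and inline style attributes
cleaner = lxml.html.clean.Cleaner(style=True)
doc = cleaner.clean_html(doc)
# text_content() returns the text of the whole tree without any markup
text = doc.text_content()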
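
As an illustration of what decoding an entity involves, here is a minimal sketch that uses only Python 3's standard library rather than lxml. It assumes the page source really contains the double-escaped form &amp;nbsp shown in the question, so one decoding pass leaves the literal text &nbsp and a second pass is needed. The helper name decode_entities and its passes parameter are hypothetical, not part of any library.

```python
import html

def decode_entities(text, passes=2):
    """Decode HTML entities, repeating to undo double escaping (hypothetical helper)."""
    for _ in range(passes):
        decoded = html.unescape(text)
        if decoded == text:
            break  # nothing left to decode
        text = decoded
    return text

raw = "10:30am &amp;nbsp3:45"
once = html.unescape(raw)               # "&amp;" becomes "&", leaving the literal text "&nbsp"
twice = decode_entities(raw)            # the second pass turns "&nbsp" into U+00A0
cleaned = twice.replace("\u00a0", " ")  # swap the non-breaking space for a plain space
```

Decoding twice is only reasonable when the input is known to be double-escaped; on ordinary text it could turn a deliberately escaped "&amp;lt;" into "<".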
answered Oct 04 '22 03:10 by Azure


Depending on the stage of processing at which you want to remove the non-breaking space, this can be quite easy. For instance, when you process the HTML fragment you provided, you can simply remove the string "&nbsp" from the text nodes:

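from bs4 import BeautifulSoup

s = """your HTML"""
soup = BeautifulSoup(s, "html.parser")
# Collect every text node, then strip the literal "&nbsp" from each one
texts = soup.find_all(text=True)
for t in texts:
    newtext = t.replace("&nbsp", "")
    t.replace_with(newtext)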
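
If the text has already been extracted, the same idea can be applied to a plain string. The following is a rough sketch in Python 3, not part of the original answer: the helper strip_nbsp is a hypothetical name, and it assumes you want each "&nbsp" (with or without the trailing ;) and each real non-breaking space character replaced by a single ordinary space.

```python
import re

# Matches the literal entity text, with or without ";", and the U+00A0 character
NBSP_PATTERN = re.compile(r"&nbsp;?|\u00a0")

def strip_nbsp(text):
    """Replace non-breaking spaces with plain spaces and collapse repeated blanks."""
    without_nbsp = NBSP_PATTERN.sub(" ", text)
    return re.sub(r"[ \t]+", " ", without_nbsp).strip()

line = "10:30am &nbsp3:45 &nbsp6:00 &nbsp8:30pm"
print(strip_nbsp(line))
```

Replacing with a space rather than an empty string keeps the showtimes separated even when no ordinary space precedes the entity.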
answered Oct 04 '22 04:10 by ofrommel