python conflicts in two external packages

Tags: python, packages, conflict

I am writing code that combines functions from the Python rawdog RSS reader library and the BeautifulSoup web-scraping library. There is a conflict somewhere in the internals that I am trying to overcome.

I can replicate the problem with this simplified code:

    import sys, gzip
    def scrape(filename):
        contents = gzip.open(filename,'rb').read()
        contents = contents.decode('utf-8','replace')
        import BeautifulSoup as BS
        print 'before rawdog: ', len(BS.BeautifulSoup(contents)) # prints 4, correct answer
        from rawdoglib import rawdog as rd
        print 'after rawdog: ', len(BS.BeautifulSoup(contents)) # prints 3, incorrect answer

It does not matter in what order or where I do the imports: importing rawdog always causes BS.BeautifulSoup() to return the wrong result. I no longer need rawdog by the time I need BeautifulSoup, so I have tried removing the package at that point, but BS is still broken. Fixes I have tried that have not worked:

- I noticed that the rawdog code does its own import of BeautifulSoup, so I tried removing import BeautifulSoup from the rawdog code and re-installing rawdog.
- Removing the rawdog modules before importing BeautifulSoup:
  - for x in filter(lambda y: y.startswith('rawdog'), sys.modules.keys()): del sys.modules[x]
- Importing more specific classes/methods from rawdog, e.g. from rawdoglib.rawdog import FeedState
- Giving the problem method a new name, before and after importing rawdog: from BeautifulSoup import BeautifulSoup as BS
- Adding from __future__ import absolute_import

No luck: I always get len(BeautifulSoup(contents)) == 3 if rawdog was ever imported into the namespace. Both packages are complex enough that I haven't been able to figure out exactly where they overlap, and I'm not sure what tools would help, other than searching through dir(BeautifulSoup) and dir(rawdog), where I haven't found good clues.
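One generic way to look for this kind of interference is to snapshot a module's namespace before the suspect import and compare it afterwards. The following is only a sketch: `snapshot`, `changed_attributes`, and the stand-in module `victim` are hypothetical names made up for illustration, and the "import" is simulated by rebinding an attribute so the example runs without either library installed.

```python
import types


def snapshot(module):
    """Return a shallow copy of the module's namespace."""
    return dict(vars(module))


def changed_attributes(module, before):
    """List names that were added, removed, or rebound since `before`."""
    after = vars(module)
    missing = object()  # sentinel for "name not present"
    names = set(before) | set(after)
    return sorted(
        name for name in names
        if before.get(name, missing) is not after.get(name, missing)
    )


# Hypothetical stand-in for a shared library module.
victim = types.ModuleType("victim")
victim.endbracket = "[<>]"
victim.untouched = 42

before = snapshot(victim)

# Stand-in for importing a package that monkey-patches `victim`.
victim.endbracket = "something else"

print(changed_attributes(victim, before))  # ['endbracket']
```

In the real situation one would presumably take snapshots of the modules already present in sys.modules before running from rawdoglib import rawdog, then compare them after the import. This also suggests why deleting the rawdog entries from sys.modules does not help: a rebinding made inside another module stays in place after the patching module is removed.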

Updates, responding to answers: I omitted that the problem does not occur with every input file, which is crucial, sorry. The offending files are quite large so I don't think I can post them here. I will try to figure out the crucial difference between the good and bad files and post it. Thanks for the debugging help so far.

Further debugging! I have identified this block in the input text as problematic:

    function SwitchMenu(obj){
      if(document.getElementById){
        var el = document.getElementById(obj);
        var ar = document.getElementById("masterdiv").getElementsByTagName("span"); //DynamicDrive.com change
        if(el.style.display != "block"){ //DynamicDrive.com change
          for (var i=0; i<ar.length; i++){
            if (ar[i].className=="submenu") //DynamicDrive.com change
              ar[i].style.display = "none";
          }
          el.style.display = "block";
        }else{
          el.style.display = "none";
        }
      }
    }

If I comment out this block, then I get the correct parse through BeautifulSoup with or without the rawdog import. With the block, rawdog + BeautifulSoup is faulty. So should I just search my input for a block like this, or is there a better workaround?

asked Aug 13 '12 15:08

rodin

1 Answer

It's a bug in rawdoglib/feedparser.py. rawdog monkey-patches sgmllib; on line 198 it reads:

    if sgmllib.endbracket.search(' <').start(0):
        class EndBracketMatch:
            endbracket = re.compile('''([^'"<>]|"[^"]*"(?=>|/|\s|\w+=)|'[^']*'(?=>|/|\s|\w+=))*(?=[<>])|.*?(?=[<>])''')
            def search(self,string,index=0):
                self.match = self.endbracket.match(string,index)
                if self.match: return self
            def start(self,n):
                return self.match.end(n)
        sgmllib.endbracket = EndBracketMatch()

This is a script to reproduce the error:

    contents = '''<a><ar "none";
    </a> '''
    import BeautifulSoup as BS
    print 'before rawdog: ', len(BS.BeautifulSoup(contents)) # prints 4, correct answer
    from rawdoglib import rawdog as rd
    print 'after rawdog: ', len(BS.BeautifulSoup(contents)) # prints 3, incorrect

It breaks on the stray "<" inside the "a" element. In the OP's snippet, the trigger is the line for (var i=0; i<ar.length; i++){ (note the "<" character).
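The difference can be seen with the re module alone, without either library installed. This is a minimal sketch: the `[<>]` pattern is an assumption about what stock Python 2 sgmllib uses for `endbracket`, and the second pattern is the one from the feedparser snippet, written here as a raw string. The names `stock_endbracket` and `patched_endbracket` are made up for illustration.

```python
import re

# Assumption: the stock Python 2 sgmllib pattern for the end of a start tag.
stock_endbracket = re.compile(r'[<>]')

# The pattern from the feedparser snippet, written as a raw string.
patched_endbracket = re.compile(
    r'''([^'"<>]|"[^"]*"(?=>|/|\s|\w+=)|'[^']*'(?=>|/|\s|\w+=))*(?=[<>])|.*?(?=[<>])'''
)

contents = '<a><ar "none";\n</a> '
start = contents.index('<ar') + 1  # scan from just after the "<" of "<ar"

# The stock pattern searches forward and finds the "<" of "</a>".
stock = stock_endbracket.search(contents, start)

# The replacement is anchored with match(): the quoted string is followed by
# ";" (not ">", "/", whitespace, or an attribute name), and "." cannot cross
# the newline, so neither alternative reaches a bracket.
patched = patched_endbracket.match(contents, start)

print(stock.start(0))  # 15
print(patched)         # None
```

With no match, the replacement `search` returns None, which sgmllib presumably treats as an incomplete start tag. That would account for the different parse result.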

Issue submitted on rawdog's ML: http://lists.us-lot.org/pipermail/rawdog-users/2012-August/000327.html
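As for a workaround, and given that the question says rawdog is no longer needed by the time BeautifulSoup runs, one option might be to put the original `sgmllib.endbracket` back once rawdog has done its work. The sketch below is untested against the real libraries: `preserve_attribute` is a hypothetical helper, and `fake_sgmllib` is a stand-in module so the example runs anywhere.

```python
import contextlib
import types


@contextlib.contextmanager
def preserve_attribute(module, name):
    """Restore module.<name> to its original value when the block exits."""
    original = getattr(module, name)
    try:
        yield
    finally:
        setattr(module, name, original)


# Hypothetical stand-in for sgmllib, so the sketch has no dependencies.
fake_sgmllib = types.ModuleType("fake_sgmllib")
fake_sgmllib.endbracket = "[<>]"

with preserve_attribute(fake_sgmllib, "endbracket"):
    # Stand-in for "from rawdoglib import rawdog" plus any rawdog work,
    # which rebinds the attribute as a side effect.
    fake_sgmllib.endbracket = "patched"

print(fake_sgmllib.endbracket)  # [<>]
```

In the real case the with block would wrap the rawdog import and the rawdog calls, with sgmllib passed as the module. Any rawdog code that runs after the block would then see the stock pattern again, so this only fits when rawdog's feed parsing is finished by that point.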

answered Oct 16 '22 09:10

lbolla
