I am writing code to combine functions from the Python rawdog RSS reader library and the BeautifulSoup web-scraping library. There is a conflict somewhere in the innards that I am trying to overcome.
I can replicate the problem with this simplified code:
import sys, gzip
def scrape(filename):
    contents = gzip.open(filename,'rb').read()
    contents = contents.decode('utf-8','replace')
    import BeautifulSoup as BS
    print 'before rawdog: ', len(BS.BeautifulSoup(contents)) # prints 4, correct answer
    from rawdoglib import rawdog as rd
    print 'after rawdog: ', len(BS.BeautifulSoup(contents)) # prints 3, incorrect answer
It does not matter what order or where I do the imports, the import of rawdog always causes the BS.BeautifulSoup() method to return the wrong response. I don't actually need rawdog anymore by the time I get to needing BeautifulSoup, so I've tried removing the package at that point, but BS is still broken. Fixes I have tried that have not worked:
Removing import BeautifulSoup from the rawdog code and re-installing rawdog
Deleting the rawdog modules from sys.modules: for x in filter(lambda y: y.startswith('rawdog'), sys.modules.keys()): del sys.modules[x]
Importing only a specific class: from rawdoglib.rawdog import FeedState
Importing the class under another name: from BeautifulSoup import BeautifulSoup as BS
from __future__ import absolute_import
No luck, I always get len(BeautifulSoup(contents)) == 3 if rawdog was ever imported into the namespace. Both packages are complex enough that I haven't been able to figure out exactly what the problem overlap is, and I'm not sure what tools to use to try to figure that out, other than searching through dir(BeautifulSoup) and dir(rawdog), where I haven't found good clues.
Updates, responding to answers: I omitted that the problem does not occur with every input file, which is crucial, sorry. The offending files are quite large so I don't think I can post them here. I will try to figure out the crucial difference between the good and bad files and post it. Thanks for the debugging help so far.
Further debugging! I have identified this block in the input text as problematic:
function SwitchMenu(obj){
    if(document.getElementById){
        var el = document.getElementById(obj);
        var ar = document.getElementById("masterdiv").getElementsByTagName("span"); //DynamicDrive.com change
        if(el.style.display != "block"){ //DynamicDrive.com change
            for (var i=0; i<ar.length; i++){
                if (ar[i].className=="submenu") //DynamicDrive.com change
                    ar[i].style.display = "none";
            }
            el.style.display = "block";
        }else{
            el.style.display = "none";
        }
    }
}
If I comment out this block, then I get the correct parse through BeautifulSoup with or without the rawdog import. With the block, rawdog + BeautifulSoup is faulty. So should I just search my input for a block like this, or is there a better workaround?
It's a bug in rawdoglib.feedparser.py. rawdog is monkey patching sgmllib: on line 198 it reads:
if sgmllib.endbracket.search(' <').start(0):
    class EndBracketMatch:
        endbracket = re.compile('''([^'"<>]|"[^"]*"(?=>|/|\s|\w+=)|'[^']*'(?=>|/|\s|\w+=))*(?=[<>])|.*?(?=[<>])''')
        def search(self,string,index=0):
            self.match = self.endbracket.match(string,index)
            if self.match: return self
        def start(self,n):
            return self.match.end(n)
    sgmllib.endbracket = EndBracketMatch()
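To see why the replacement matters, the two matchers can be compared directly with nothing but the re module. This is only a sketch: it assumes the stock sgmllib pattern is '[<>]' and that the parser looks for the end of a start tag by calling endbracket.search(rawdata, i+1), both from memory of the Python 2 standard library. STOCK_ENDBRACKET and tag_start are names made up for the illustration, and the pattern is written as a raw string so the sketch also runs on current Python.

```python
import re

# Assumption: this is the pattern Python 2's sgmllib ships with.
STOCK_ENDBRACKET = re.compile('[<>]')


class EndBracketMatch:
    """Copy of the replacement matcher quoted above from feedparser."""
    endbracket = re.compile(r'''([^'"<>]|"[^"]*"(?=>|/|\s|\w+=)|'[^']*'(?=>|/|\s|\w+=))*(?=[<>])|.*?(?=[<>])''')

    def search(self, string, index=0):
        self.match = self.endbracket.match(string, index)
        if self.match:
            return self

    def start(self, n):
        return self.match.end(n)


# A stray "<" followed by a quoted string and a newline, as in the JavaScript.
contents = '<a><ar "none";\n</a> '
tag_start = contents.index('<ar')

# Assumed call shape: search begins one character after the "<".
stock = STOCK_ENDBRACKET.search(contents, tag_start + 1)
patched = EndBracketMatch().search(contents, tag_start + 1)

stock_end = stock.start(0) if stock else None
patched_end = patched.start(0) if patched else None

print('stock matcher found a bracket at:', stock_end)
print('patched matcher found a bracket at:', patched_end)
```

The stock pattern finds the next bracket (the "<" of the closing tag), while the replacement finds nothing: its first branch stops at the quoted string because "none" is followed by ";", and its fallback branch cannot cross the newline. A parser that gets no match here treats the tag as incomplete, which would explain the different parse.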
This is a script to reproduce the error:
contents = '''<a><ar "none";
</a> '''
import BeautifulSoup as BS
print 'before rawdog: ', len(BS.BeautifulSoup(contents)) # prints 4, correct answer
from rawdoglib import rawdog as rd
print 'after rawdog: ', len(BS.BeautifulSoup(contents)) # prints 3, incorrect
It breaks on the "<" inside the "a" tag. In the OP's snippet, it is triggered by this line (note the "<" char):
for (var i=0; i<ar.length; i++){
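As for a workaround other than scanning the input for such blocks: since rawdog is no longer needed by the time BeautifulSoup runs, one possibility is to save the original attribute before importing rawdog and put it back afterwards. The sketch below is an assumption-laden illustration, not something rawdog or BeautifulSoup provides: preserve_attrs is a hypothetical helper, and a stand-in module replaces sgmllib so the code runs anywhere.

```python
import contextlib
import types


@contextlib.contextmanager
def preserve_attrs(module, *names):
    """Put the named module attributes back on exit, undoing any monkey patch."""
    missing = object()
    saved = {name: getattr(module, name, missing) for name in names}
    try:
        yield
    finally:
        for name, value in saved.items():
            if value is missing:
                # The attribute did not exist before; remove it if it was added.
                if hasattr(module, name):
                    delattr(module, name)
            else:
                setattr(module, name, value)


# Stand-in for sgmllib (hypothetical), so the sketch does not need Python 2.
fake_sgmllib = types.ModuleType('fake_sgmllib')
fake_sgmllib.endbracket = 'stock'

with preserve_attrs(fake_sgmllib, 'endbracket'):
    # This assignment plays the role of "from rawdoglib import rawdog".
    fake_sgmllib.endbracket = 'patched'
    inside = fake_sgmllib.endbracket

after = fake_sgmllib.endbracket
print(inside, after)
```

In the real setup the with block would wrap the rawdog import and all rawdog work, with sgmllib and 'endbracket' as the arguments. Note that rawdog's own feedparser presumably expects its patched matcher, so the restore should only happen once rawdog is done.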
Issue submitted on rawdog's ML: http://lists.us-lot.org/pipermail/rawdog-users/2012-August/000327.html
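On the question of which tools help find this kind of overlap: dir() on the two packages will not show it, because the change lands in a third module. One rough approach is to snapshot a suspect module's attributes before the import and compare afterwards. The helpers snapshot and changed_attrs below are hypothetical names, and the victim module is a stand-in so the sketch is self-contained.

```python
import types


def snapshot(module):
    """Shallow copy of the module namespace, keeping references to the objects."""
    return dict(vars(module))


def changed_attrs(before, module):
    """Names whose object was added, removed, or replaced since the snapshot."""
    missing = object()
    after = vars(module)
    names = set(before) | set(after)
    return sorted(
        name for name in names
        if before.get(name, missing) is not after.get(name, missing)
    )


# Stand-in for a module that some other import silently patches.
victim = types.ModuleType('victim')
victim.endbracket = 'stock pattern'
victim.untouched = 42

before = snapshot(victim)

# These two lines simulate the side effects of importing the other package.
victim.endbracket = 'replacement'
victim.added = True

changes = changed_attrs(before, victim)
print(changes)
```

Applied to the case here, the snapshot would be taken of sgmllib (or of every entry in sys.modules) before importing rawdog, and the comparison run right after the import.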