I have code that uses the BeautifulSoup
library for parsing, but it is very slow. The code is written in such a way that threads cannot be used.
Can anyone help me with this?
I am using BeautifulSoup for parsing and then saving the results into a DB. If I comment out the save statement, it still takes a long time, so the database is not the problem.
def parse(self, text):
    soup = BeautifulSoup(text)
    arr = soup.findAll('tbody')
    for i in range(0, len(arr) - 1):
        data = Data()
        soup2 = BeautifulSoup(str(arr[i]))
        arr2 = soup2.findAll('td')
        c = 0
        for j in arr2:
            if str(j).find("<a href=") > 0:
                data.sourceURL = self.getAttributeValue(str(j), '<a href="')
            else:
                if c == 2:
                    data.Hits = j.renderContents()
                #and few others...
            c = c + 1
        data.save()
Any suggestions?
Note: I already asked this question here, but it was closed due to incomplete information.
soup2 = BeautifulSoup(str(arr[i]))
arr2 = soup2.findAll('td')
Don't do this: converting each result back into a string and re-parsing it with a new BeautifulSoup object throws away the work the first parse already did. Just call arr2 = arr[i].findAll('td') instead.
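For example, the outer loop could search each tbody element directly (a minimal sketch, keeping the names from the question):

for tbody in soup.findAll('tbody'):
    arr2 = tbody.findAll('td')  # search the already-parsed element; no str() and no second parse
    # ... process the cells as before ...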
This will also be slow:
if str(j).find("<a href=") > 0:
    data.sourceURL = self.getAttributeValue(str(j), '<a href="')
Assuming that getAttributeValue gives you the href
attribute, use this instead:
a = j.find('a', href=True)  # find first <a> with href attribute
if a:
    data.sourceURL = a['href']
else:
    #....
In general, you shouldn't need to convert the BeautifulSoup object back into a string if all you want to do is parse it and extract values. Since the find and findAll methods give you back searchable objects, you can keep searching by invoking the find/findAll/etc. methods on the results.
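Putting those suggestions together, the method could look something like this rough sketch (Data, the c == 2 column, and the other columns come from your code and are only hinted at here; note that this visits every tbody, whereas your range(0, len(arr)-1) skipped the last one):

def parse(self, text):
    soup = BeautifulSoup(text)
    for tbody in soup.findAll('tbody'):
        data = Data()
        for c, cell in enumerate(tbody.findAll('td')):
            a = cell.find('a', href=True)  # first <a> carrying an href, if any
            if a:
                data.sourceURL = a['href']
            elif c == 2:
                data.Hits = cell.renderContents()
            # ...and the few other columns...
        data.save()

Everything here works on the objects returned by the first parse, so the document is only parsed once.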