I have a simple code like:
p = soup.find_all("p")
paragraphs = []
for x in p:
    paragraphs.append(str(x))
I am trying to convert a list of tags that I obtained from an XML document into strings. I want to keep each element with its original tag so I can reuse some of the text, which is why I am appending it like this. But the list contains over 6000 elements, and a recursion error occurs because of the str call:
"RuntimeError: maximum recursion depth exceeded while calling a Python object"
I read that you can raise the maximum recursion depth, but that it's not wise to do so. My next idea was to split the conversion to strings into batches of 500, but I am sure that there has to be a better way to do this. Does anyone have any advice?
To convert a Tag object to a string in Beautiful Soup, simply use str(tag).
To convert a list to a string, use a list comprehension and the join() method. The list comprehension converts the elements one by one, and join() concatenates the resulting strings into a single new string.
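As a minimal sketch of the list comprehension plus join() approach described above: the FakeTag class is a hypothetical stand-in for a Beautiful Soup Tag (it is not part of any library), used only so the snippet runs without a parsed document. Any object that defines __str__ behaves the same way.

```python
class FakeTag:
    """Hypothetical stand-in for a Beautiful Soup Tag; only __str__ matters here."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


tags = [FakeTag("<p>first</p>"), FakeTag("<p>second</p>")]

# One string per element, with the surrounding tags preserved
paragraphs = [str(t) for t in tags]

# A single string, one element per line
joined = "\n".join(paragraphs)
print(joined)
```

Note that this only changes how the loop is written; if str() on one particular element exceeds the recursion limit, the comprehension will fail on that element too.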
The problem here is probably that some of the binary graphic data at the bottom of the document contains the sequence of characters <P, which Beautiful Soup is trying to repair into an actual HTML tag. I haven't managed to pinpoint which text is causing the "recursion depth exceeded" error, but it's somewhere in there. It's p[6053] for me, but since you seem to have modified the file a bit (or maybe you're using a different parser for Beautiful Soup), it'll be different for you, I imagine.
Assuming you don't need the binary data at the bottom of the document to extract whatever you need from the actual <p> tags, try this:
# boot out the last `<document>`, which contains the binary data
soup.find_all('document')[-1].extract()
p = soup.find_all('p')
paragraphs = []
for x in p:
    paragraphs.append(str(x))
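If it is unclear which element triggers the error, one possible way to pinpoint it is to guard each str() call and record the indices that fail. The sketch below is an assumption about how that could be done, not something taken from the Beautiful Soup documentation. The names stringify_all and Bottomless are hypothetical; Bottomless merely simulates an element whose string conversion recurses too deeply. On Python 3.5+ the exception is RecursionError (a subclass of RuntimeError); on older versions you would catch RuntimeError instead.

```python
def stringify_all(items):
    """Convert each item with str(), recording the indices that hit the recursion limit."""
    converted, failed = [], []
    for i, item in enumerate(items):
        try:
            converted.append(str(item))
        except RecursionError:
            failed.append(i)
    return converted, failed


class Bottomless:
    """Hypothetical stand-in for an element whose str() recurses too deeply."""

    def __str__(self):
        # Calls itself until Python raises RecursionError
        return self.__str__()


converted, failed = stringify_all(["<p>a</p>", Bottomless(), "<p>b</p>"])
print(failed)  # indices of the elements that could not be converted
```

With the real document you would pass the result of soup.find_all('p') as items and then inspect the elements at the reported indices.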
I believe the issue is that the BeautifulSoup object p is not built iteratively, therefore the recursion limit is reached before you can finish constructing p = soup.find_all('p'). Note that a RecursionError is similarly thrown when calling soup.prettify().
For my solution I used the re module to gather all <p>...</p> tags (see code below). My final result was len(p) = 5571. This count is lower than yours because the regex conditions did not match any text within the binary graphic data.
import re
from urllib.request import urlopen
url = 'https://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt'
# decode the bytes; str() on a bytes object would give its "b'...'" repr instead
response = urlopen(url).read().decode('utf-8', errors='replace')
# no capture groups, so findall returns each whole <P ...>...</P> match as a string
p = re.findall(r'<P.+?</P>', response, flags=re.DOTALL) #(pattern, string)
paragraphs = []
for x in p:
    paragraphs.append(str(x))
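To see what a pattern of this kind returns without downloading the filing, here is a small self-contained sketch on a made-up sample string. The sample text is invented for illustration, and the \b word boundary and re.IGNORECASE flag are my own additions (so that <PRE> is skipped and lowercase <p> would also match); they are not part of the pattern used above.

```python
import re

# Invented sample that mimics the shape of the filing's markup
sample = (
    "<DOCUMENT>\n"
    "<PRE>not a paragraph</PRE>\n"
    '<P STYLE="margin-top:0px">First\nparagraph</P>\n'
    "<P>Second</P>\n"
    "</DOCUMENT>"
)

# DOTALL lets "." span newlines; "\b" stops "<P" from matching "<PRE"
pattern = re.compile(r"<P\b.*?</P>", flags=re.DOTALL | re.IGNORECASE)

# With no capture groups, findall returns the full matched text of each tag
paragraphs = pattern.findall(sample)
print(len(paragraphs))
```

Each entry keeps its opening and closing tag, so the strings can be reused the same way as the output of str() on a Tag. A regex does not understand nesting, so this is only a reasonable fit when the <P> elements are not nested inside one another.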