How to convert a bs4.element.ResultSet to strings in Python?

Tags: python, beautifulsoup, runtime-error

I have some simple code like this:

    p = soup.find_all("p")
    paragraphs = []

    for x in p:
        paragraphs.append(str(x))

I am trying to convert a list of tags that I obtained from XML into strings. I want to keep each element with its original tag so I can reuse some of the text, which is why I am appending it like this. But the list contains over 6000 elements, and a recursion error occurs because of the str call:

"RuntimeError: maximum recursion depth exceeded while calling a Python object"

I read that you can raise the maximum recursion depth, but that it's not wise to do so. My next idea was to split the conversion to strings into batches of 500, but I am sure there has to be a better way to do this. Does anyone have any advice?

asked Jan 07 '14 09:01

samuraiexe

2 Answers

The problem here is probably that some of the binary graphic data at the bottom of the document contains the sequence of characters <P, which Beautiful Soup is trying to repair into an actual HTML tag. I haven't managed to pinpoint which text is causing the "recursion depth exceeded" error, but it's somewhere in there. It's p[6053] for me, but since you seem to have modified the file a bit (or maybe you're using a different parser for Beautiful Soup), it'll be different for you, I imagine.

Assuming you don't need the binary data at the bottom of the document to extract whatever you need from the actual <p> tags, try this:

    # boot out the last `<document>`, which contains the binary data
    soup.find_all('document')[-1].extract()

    p = soup.find_all('p')
    paragraphs = []
    for x in p:
        paragraphs.append(str(x))
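
To pinpoint which elements trigger the error without losing the whole loop, one option is to catch the exception per element and record its index. The following is a minimal sketch, not part of the original answer: `stringify_all` is a hypothetical helper name, and it assumes Python 3.5+, where the error is raised as `RecursionError` (a subclass of `RuntimeError`).

```python
def stringify_all(items):
    """Convert each item with str(), recording indices that hit the recursion limit.

    `stringify_all` is a hypothetical helper, not a Beautiful Soup API.
    """
    strings = []
    failed = []
    for i, item in enumerate(items):
        try:
            strings.append(str(item))
        except RecursionError:  # Python 3.5+; a subclass of RuntimeError
            failed.append(i)
    return strings, failed
```

With the question's data this could be called as `paragraphs, failed = stringify_all(soup.find_all('p'))`; `failed` then lists the positions worth inspecting by hand.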

answered Oct 26 '22 00:10

senshin

I believe the issue is that the BeautifulSoup object p is not built iteratively, so the call-depth (recursion) limit is reached before you can finish constructing p = soup.find_all('p'). Note that a RecursionError is similarly thrown when calling soup.prettify().

For my solution I used the re module to gather all <p>...</p> tags (see code below). My final result was len(p) = 5571. This count is lower than yours because the regex conditions did not match any text within the binary graphic data.

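    import re
    from urllib.request import urlopen

    url = 'https://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt'

    # Decode the bytes; str(response) would produce the "b'...'" repr with escaped newlines.
    response = urlopen(url).read().decode('utf-8', errors='replace')

    # No capturing groups, so findall returns each whole <P>...</P> match as a string.
    p = re.findall(r'<P.+?</P>', response, flags=re.DOTALL)  # (pattern, string)

    paragraphs = []
    for x in p:
        paragraphs.append(x)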
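
One detail of `re.findall` matters for this approach: when the pattern contains capturing groups, it returns tuples of the group contents rather than the full match. The sketch below illustrates this on a small made-up string (`sample` is an assumption for illustration, not data from the filing).

```python
import re

# Made-up input for illustration only.
sample = '<P>first</P>\n<P ALIGN="left">second\nline</P>'

# With capturing groups, findall returns one tuple of groups per match.
grouped = re.findall(r'<P((.|\s)+?)</P>', sample)

# Without groups (and with re.DOTALL so "." spans newlines),
# findall returns each whole match as a string, tags included.
whole = re.findall(r'<P.+?</P>', sample, flags=re.DOTALL)
```

Here `grouped` holds tuples, so calling `str()` on its items yields tuple representations, while `whole` holds the tagged paragraphs themselves.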

answered Oct 25 '22 23:10

mattcan
