 

How to convert a bs4.element.ResultSet to strings in Python?

I have a simple code like:

    p = soup.find_all("p")
    paragraphs = []

    for x in p:
        paragraphs.append(str(x))

I am trying to convert a list I obtained from XML into strings. I want to keep each element with its original tag so I can reuse some of the text, which is why I am appending it like this. But the list contains over 6000 observations, so a recursion error occurs because of the str() call:

"RuntimeError: maximum recursion depth exceeded while calling a Python object"

I read that you can raise the maximum recursion depth, but it's not wise to do so. My next idea was to split the conversion to strings into batches of 500, but I am sure there has to be a better way to do this. Does anyone have any advice?

asked Jan 07 '14 by samuraiexe

People also ask

How do you convert a class bs4 element tag to string?

To convert a Tag object to a string in Beautiful Soup, simply use str(Tag) .
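
For example, a minimal sketch (the HTML snippet and parser choice here are just for illustration):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
    tag = soup.find("p")      # a bs4.element.Tag
    str(tag)                  # '<p>Hello <b>world</b></p>' -- markup kept
    tag.get_text()            # 'Hello world' -- markup stripped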

How do you convert a list to a string in Python?

To convert a list to a string, use a list comprehension together with the join() method. The comprehension converts the elements one by one, and join() concatenates them into a single string and returns it.
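
A minimal sketch of that pattern (the example list is made up):

    paragraphs = ["<p>one</p>", "<p>two</p>", "<p>three</p>"]
    combined = "\n".join([str(p) for p in paragraphs])
    # '<p>one</p>\n<p>two</p>\n<p>three</p>'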


2 Answers

The problem here is probably that some of the binary graphic data at the bottom of the document contains the sequence of characters <P, which Beautiful Soup is trying to repair into an actual HTML tag. I haven't managed to pinpoint which text is causing the "recursion depth exceeded" error, but it's somewhere in there. It's p[6053] for me, but since you seem to have modified the file a bit (or maybe you're using a different parser for Beautiful Soup), it'll be different for you, I imagine.

Assuming you don't need the binary data at the bottom of the document to extract whatever you need from the actual <p> tags, try this:

    # boot out the last `<document>`, which contains the binary data
    soup.find_all('document')[-1].extract()

    p = soup.find_all('p')
    paragraphs = []
    for x in p:
        paragraphs.append(str(x))
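
Here, extract() removes that last <document> tag from the parse tree (and returns it), so the subsequent find_all('p') never descends into the binary blob.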

answered Oct 26 '22 by senshin

I believe the issue is that the BeautifulSoup object p is not built iteratively, so Python's recursion limit is reached before you can finish constructing p = soup.find_all('p'). Note that a RecursionError is similarly thrown when building soup.prettify().
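
For reference, the recursion limit the question mentions lives in the sys module; raising it is the workaround the question already dismissed as unwise, shown here only to illustrate the mechanism (a minimal sketch):

    import sys

    print(sys.getrecursionlimit())   # commonly 1000 by default
    sys.setrecursionlimit(5000)      # risky: if set too high, Python can crash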

For my solution I used the re module to gather all <p>...</p> tags (see code below). My final result was len(p) = 5571. This count is lower than yours because the regex conditions did not match any text within the binary graphic data.

    import re
    from urllib.request import urlopen

    url = 'https://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt'

    response = urlopen(url).read()

    # Non-greedy match of each <P ...>...</P> block. The non-capturing group
    # keeps findall() returning the full tags as strings instead of tuples.
    p = re.findall(r'<P(?:.|\s)+?</P>', str(response))

    paragraphs = []
    for x in p:
        paragraphs.append(str(x))

answered Oct 25 '22 by mattcan