Currently, I'm trying to scrape 10-K submission text files on sec.gov.
Here's an example text file:
https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt
The text document contains things like HTML tags, CSS styles, and JavaScript. Ideally, I'd like to scrape only the content after removing all the tags and styling.
First, I tried the obvious get_text() method from BeautifulSoup. That didn't work out.
Then I tried using a regex to remove everything between < and >. Unfortunately, this didn't work entirely either: some tags, styles, and scripts remain.
Does anyone have a clean solution for me to accomplish my goal?
Here is my code so far:
import requests
import re
url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt'
response = requests.get(url)
text = re.sub('<.*?>', '', response.text)
print(text)
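A note on why the regex above leaves things behind: by default `.` does not match a newline, so a tag that is split across two lines is never matched, and the pattern only deletes the tags themselves, not the script or style text between them. The snippet below is a minimal sketch of that behaviour on a made-up string (`sample` is an illustration, not taken from the filing):

```python
import re

# Hypothetical sample: a script block plus a tag broken across two lines.
sample = '<script>console.log("test");</script>\n<FONT\nSTYLE="font-family:Arial">Address</FONT>'

# Without re.DOTALL, '.' stops at the newline, so the split <FONT ...> tag survives.
naive = re.sub(r'<.*?>', '', sample)

# With re.DOTALL the split tag is removed, but the script body is still there.
dotall = re.sub(r'<.*?>', '', sample, flags=re.DOTALL)

print(repr(naive))
print(repr(dotall))
```

Even with re.DOTALL the JavaScript source remains in the output, which is why a parser-based approach is the more reliable route.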
Let's set a dummy string based on the example:
original_content = """
<script>console.log("test");</script>
<TD VALIGN="bottom" ALIGN="center"><FONT STYLE="font-family:Arial; ">(Address of principal executive offices)</FONT></TD>
"""
Now let's remove all the JavaScript:
from lxml.html.clean import Cleaner  # with lxml 5.2+, this needs the separate lxml_html_clean package

# Remove JavaScript (the other options are left in for the sake of example).
cleaner = Cleaner(
    comments=True,  # True = remove comments
    meta=True,      # True = remove meta tags
    scripts=True,   # True = remove script tags
    embedded=True,  # True = remove embedded content
)
clean_dom = cleaner.clean_html(original_content)
(From https://stackoverflow.com/a/46371211/1204332)
And then we can either remove the HTML tags (extract the text) with the HTMLParser class from the standard library (html.parser in Python 3):
from html.parser import HTMLParser

# Strip HTML.
class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # initializes the parser state in Python 3
        self.fed = []

    def handle_data(self, d):
        # Called for every text node; the tags themselves are ignored.
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

text_content = strip_tags(clean_dom)
print(text_content)
(From: https://stackoverflow.com/a/925630/1204332)
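As a variation on the HTMLParser approach above, the stripper can also skip the contents of script and style elements itself, so the text extraction does not depend on a separate cleaning step. This is only a sketch using the standard library; the names `TextOnlyParser`, `html_to_text`, and `sample` are made up for illustration:

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}  # HTMLParser reports tag names in lowercase

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample = """
<script>console.log("test");</script>
<TD VALIGN="bottom" ALIGN="center"><FONT STYLE="font-family:Arial; ">(Address of principal executive offices)</FONT></TD>
"""
print(html_to_text(sample).strip())
```

On the dummy string this prints only the address line; how well it copes with a full filing is untested here.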
Or we could get the text with the lxml library:
from lxml.html import fromstring

# Parse the cleaned markup; calling text_content() on the raw string would also return the script body.
print(fromstring(clean_dom).text_content())
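Whichever extraction you use, the result will probably still contain runs of blank lines, indentation, and non-breaking spaces left over from the table layout. An optional post-processing step could look like the sketch below; `normalize_whitespace` is a hypothetical helper name, and the exact rules are an assumption about what "clean" should mean for your use case:

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs and drop empty lines."""
    # Treat non-breaking spaces (from &nbsp;) like ordinary spaces.
    text = text.replace("\xa0", " ")
    # Squeeze spaces and tabs within each line, then trim the line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Keep only lines that still contain something.
    return "\n".join(line for line in lines if line)


print(normalize_whitespace("  (Address of\xa0 principal \n\n\n\texecutive offices)  \n"))
```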