Web scraping SEC Edgar 10-K and 10-Q filings

Is there anyone experienced with scraping SEC 10-K and 10-Q filings? I got stuck while trying to scrape monthly realised share repurchases from these filings. Specifically, I would like to get the following information for each month from 2004 to 2014: 1. Period; 2. Total Number of Shares Purchased; 3. Average Price Paid per Share; 4. Total Number of Shares Purchased as Part of Publicly Announced Plans or Programs; 5. Maximum Number (or Approximate Dollar Value) of Shares that May Yet Be Purchased Under the Plans or Programs. I have 90,000+ forms to parse in total, so doing it manually is not feasible.

This information is usually reported under "Part 2 Item 5 Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities" in 10-Ks and under "Part 2 Item 2 Unregistered Sales of Equity Securities and Use of Proceeds" in 10-Qs.
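As a rough illustration of how those two headings could be located in a filing's plain text, here is a minimal sketch. The pattern `ITEM_HEADING` and the helper `find_item_headings` are hypothetical names, and the regex is only an assumption about how the headings are typically punctuated; real filings vary.

```python
import re

# Hypothetical pattern: "Item 5" or "Item 2" followed by the start of the
# section titles quoted above. Punctuation and spacing differ between filings,
# so this is deliberately loose.
ITEM_HEADING = re.compile(
    r"item\s*(?P<item>[25])\s*[.:\-]?\s*"
    r"(?P<title>market\s+for\s+(?:the\s+)?registrant"
    r"|unregistered\s+sales\s+of\s+equity)",
    re.IGNORECASE,
)

def find_item_headings(text):
    """Return (item number, character offset) for each matching heading."""
    return [(m.group("item"), m.start()) for m in ITEM_HEADING.finditer(text)]
```

A table of contents will usually match as well, so the last match in a document is likely a better starting point than the first.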

Here is one example of the 10-Q filings that I need to parse: https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm

If a firm has no share repurchases, this table can be missing from the quarterly report.

I have tried to parse the html files with Python BeautifulSoup, but the results are not satisfactory, mainly because these files are not written in a consistent format.

For example, the only way I can think of to parse these forms is:

from bs4 import BeautifulSoup
import requests
import unicodedata
import re

url = 'https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm'

# Matches the "Total Number of Shares Purchased" column header, even when the
# words are split across lines or separated by other characters.
identifier = re.compile(r'Total.*Number.*of.*Shares.*\w*Purchased.*',
                        re.UNICODE | re.IGNORECASE | re.DOTALL)

def remove_invalid_tags(soup, invalid_tags=('sup', 'br')):
    # Replace footnote markers and line breaks with a space so words do not run together.
    for tag in invalid_tags:
        for x in soup.find_all(tag):
            x.replace_with(' ')

def parse_html(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    rep_tables = []

    # Walk the tables from last to first and keep those whose text matches.
    for table in reversed(soup.find_all('table')):
        remove_invalid_tags(table)
        # Reduce to plain ASCII, then decode back to str so the regex also works on Python 3.
        table_text = unicodedata.normalize('NFKD', table.text).encode('ascii', 'ignore').decode('ascii')
        if identifier.search(table_text):
            rep_tables.append(table)

    return rep_tables

The above code only returns the messy tables that may contain the repurchase information. However, 1) it is not reliable; 2) it is very slow; 3) the subsequent steps to scrape the date/month, share price, number of shares, etc. are much more painful. Are there more suitable languages, approaches, applications, or databases for getting this information? Thanks a million!
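For the follow-up step of turning one candidate table into the five fields, a minimal standard-library sketch might look like the one below. The names `TableRows`, `to_number` and `parse_repurchase_rows` are hypothetical. It assumes the first non-empty cell of a data row is the period, that exactly four numeric values follow it, and that there are no nested tables. None of these assumptions holds for every filing.

```python
import re
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace (including non-breaking spaces).
            text = " ".join("".join(self._cell).split())
            if text:  # drop empty spacer cells
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row, self._cell = None, None

NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")

def to_number(text):
    """Parse '1,234,567' or '$ 45.67' into a float; None if there are no digits."""
    m = NUMBER.search(text)
    return float(m.group().replace(",", "")) if m else None

def parse_repurchase_rows(table_html):
    """Return (period, shares, avg_price, plan_shares, remaining) tuples."""
    parser = TableRows()
    parser.feed(table_html)
    records = []
    for cells in parser.rows:
        period, rest = cells[0], cells[1:]
        # Cells holding only '$' or header text yield None and are skipped.
        values = [v for v in map(to_number, rest) if v is not None]
        if len(values) == 4:  # assumed layout: four numeric columns
            records.append((period, *values))
    return records
```

Header rows drop out because their cells contain no digits. Rows with footnote markers in their own cells, or a "Total" line, would still need filtering by hand or by further rules.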

asked Jul 20 '15 22:07

Jiayuan Chen

1 Answer

I'm not sure about Python, but in R there is a beautiful solution using the 'finstr' package (https://github.com/bergant/finstr). 'finstr' automatically extracts the financial statements (income statement, balance sheet, cash flow, etc.) from EDGAR using the XBRL format.

answered Nov 21 '22 22:11

Lamothy
