
Web scraping SEC Edgar 10-K and 10-Q filings


Is anyone experienced with scraping SEC 10-K and 10-Q filings? I got stuck while trying to scrape monthly realised share repurchases from these filings. Specifically, I would like to get the following information for each month from 2004 to 2014:

1. Period
2. Total Number of Shares Purchased
3. Average Price Paid per Share
4. Total Number of Shares Purchased as Part of Publicly Announced Plans or Programs
5. Maximum Number (or Approximate Dollar Value) of Shares that May Yet Be Purchased Under the Plans or Programs

I have 90,000+ forms to parse in total, so it isn't feasible to do this manually.

This information is usually reported under "Part 2 Item 5 Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities" in 10-Ks and under "Part 2 Item 2 Unregistered Sales of Equity Securities and Use of Proceeds" in 10-Qs.

Here is one example of the 10-Q filings that I need to parse: https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm

If a firm has no share repurchases, this table can be missing from the quarterly report.

I have tried to parse the HTML files with Python's BeautifulSoup, but the results are not satisfactory, mainly because these files are not written in a consistent format.

For example, the only way I can think of to parse these forms is:

from bs4 import BeautifulSoup
import requests
import unicodedata
import re

url = 'https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm'

def parse_html(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    tables = soup.find_all('table')

    # Header text that identifies the repurchase table
    identifier = re.compile(r'Total.*Number.*of.*Shares.*Purchased',
                            re.UNICODE | re.IGNORECASE | re.DOTALL)

    rep_tables = []
    for table in tables:
        remove_invalid_tags(table)
        # Normalize Unicode (non-breaking spaces etc.), drop non-ASCII, and
        # decode back to str so the regex search works under Python 3
        table_text = (unicodedata.normalize('NFKD', table.text)
                      .encode('ascii', 'ignore').decode('ascii'))
        if identifier.search(table_text):
            rep_tables.append(table)

    return rep_tables

def remove_invalid_tags(soup, invalid_tags=('sup', 'br')):
    # Replace tags that split up header text (e.g. <sup>, <br>) with a space
    for tag in invalid_tags:
        for element in soup.find_all(tag):
            element.replace_with(' ')

The above code only returns a messy collection of tables that may contain the repurchase information. However, 1) it is not reliable; 2) it is very slow; and 3) the following steps of scraping the date/month, share price, number of shares, etc. are much more painful. I am wondering whether there are more feasible languages/approaches/applications/databases for getting such information? Thanks a million!
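For step 3), one option might be to hand each matched table to pandas rather than walking the cells by hand. A rough sketch, assuming pandas is installed (with lxml or html5lib available for its HTML parser) and reusing parse_html from above; the column labels still vary between filers and would need cleaning afterwards:

import io
import pandas as pd

def tables_to_frames(rep_tables):
    # Let pandas.read_html turn each candidate <table> into a DataFrame
    frames = []
    for table in rep_tables:
        for df in pd.read_html(io.StringIO(str(table))):
            # Drop rows/columns that are entirely empty (spacers, footnote gaps)
            df = df.dropna(how='all').dropna(axis=1, how='all')
            frames.append(df)
    return frames

frames = tables_to_frames(parse_html(url))
for df in frames:
    print(df.head())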

asked Jul 20 '15 by Jiayuan Chen



1 Answer

I'm not sure about Python, but in R there is a beautiful solution using the 'finstr' package (https://github.com/bergant/finstr). 'finstr' automatically extracts the financial statements (income statement, balance sheet, cash flow, etc.) from EDGAR using the XBRL format.
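finstr itself is R-only, so for completeness: if you would rather stay in Python, a roughly analogous route (a swapped-in alternative, not part of finstr) is to query the SEC's own XBRL data for the tagged repurchase figures instead of parsing the HTML. A minimal sketch, assuming the data.sec.gov companyconcept endpoint, the us-gaap tag PaymentsForRepurchaseOfCommonStock, and a placeholder contact in the User-Agent header; note this returns tagged period totals, while the month-by-month table in Item 2/Item 5 is generally not XBRL-tagged:

import requests

def repurchase_facts(cik, tag="PaymentsForRepurchaseOfCommonStock"):
    # Sketch only: the companyconcept endpoint returns every reported value
    # for one us-gaap concept; CIK must be zero-padded to 10 digits
    url = (f"https://data.sec.gov/api/xbrl/companyconcept/"
           f"CIK{int(cik):010d}/us-gaap/{tag}.json")
    # The SEC asks for a descriptive User-Agent with contact info (placeholder here)
    headers = {"User-Agent": "your-name your-email@example.com"}
    data = requests.get(url, headers=headers, timeout=30).json()
    # Each fact carries the reporting period, value, and the form it came from
    return [(f["start"], f["end"], f["val"], f["form"])
            for f in data["units"]["USD"]]

# Example: CIK 320193 is Apple
for start, end, val, form in repurchase_facts(320193)[:5]:
    print(start, end, val, form)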

answered Nov 21 '22 by Lamothy