
Parsing EDGAR filings

I would like to use Python 2.7 to strip everything except the documents' text from EDGAR filings (which are available online as .txt files). An example of what the files look like is here:

Example

EDGAR provides its Document Type Definitions starting on page 48 of this file:

DTD

The first part of my program downloads the .txt file from the EDGAR online database into a local file that I've named "parseme.txt". What I would like to know is how to use the DTD to parse that .txt file. I would use an off-the-shelf parsing module like BeautifulSoup for the job, but EDGAR's format appears to be unique, and I hope to avoid writing a large regex to get the job done.

filename = 'parseme.txt'
# Read the downloaded filing into a list of lines.
with open(filename) as f:
    lines = f.readlines()

My question is related to "Parse SGML with Open Arbitrary Tags in Python 3" and "Use lxml to parse text file with bad header in Python", but I believe it is distinct: my question concerns Python 2.7, and I'm not concerned with the header, only with the text of the file.

philq asked Nov 22 '12 00:11




2 Answers

Look at the OpenSP toolkit, which has programs to process SGML files. Your simplest option is probably to use the osx program to get an XML version of the input file, after which you can use XML processing tools.
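As a rough sketch of that route from Python: the snippet below calls the converter through `subprocess` and then pulls the character data out of the resulting XML with the standard library's `ElementTree`. The function names are hypothetical, and the executable name `osx` is an assumption that may differ depending on how OpenSP was installed.

```python
import subprocess
import xml.etree.ElementTree as ET


def sgml_to_xml(paths, executable="osx"):
    """Run the OpenSP converter on the given files and return its XML output.

    `executable` is an assumption; adjust it to match your installation.
    """
    # check_output raises CalledProcessError if the converter exits non-zero.
    return subprocess.check_output([executable] + list(paths))


def document_text(xml_data):
    """Return all character data in the XML document, with the markup dropped."""
    root = ET.fromstring(xml_data)
    return "".join(root.itertext())
```

With OpenSP set up, something like `document_text(sgml_to_xml(["parseme.txt"]))` would give the text of the filing; to keep only part of a filing, you could instead walk `root` and select the elements you care about.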

There may be some setup to do first, as the OpenSP package doesn't come with the EDGAR DTD or its SGML declaration (the first part of the material on page 48 of your reference, starting with <!SGML "ISO 8879-1986"). You will have to get these as text files and add them to the catalogs where the SP parser can find them.

UPDATE: This document seems to be a more up-to-date version. A casual Google search doesn't turn up any immediately machine-processable versions, though, so you may have to copy-paste from the PDF.

However, if you do so, there will be some extraneous formatting you'll have to remove: there seem to be page break indicators, labelled "C-1", "C-2", and so on. They are not part of SGML and need to be deleted.
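A minimal sketch of that cleanup, assuming each page-break indicator sits alone on its own line in the pasted text (check this against your copy; the function name is hypothetical):

```python
import re

# Assumption: an indicator such as "C-12" is the only thing on its line.
PAGE_MARK = re.compile(r"C-\d+$")


def strip_page_marks(text):
    """Drop lines that consist solely of a page-break indicator like C-1."""
    kept = [
        line
        for line in text.splitlines(True)  # True keeps the line endings
        if not PAGE_MARK.match(line.strip())
    ]
    return "".join(kept)
```

Lines that merely contain something like "C-1" alongside other text are left alone, so it is still worth reading through the result by eye.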

You can either add the SGML declaration and the EDGAR DTD to the catalog (in which case the DTD file should contain only the part between the [ after <!DOCTYPE submission and the matching ] at the end), or you can create a "prolog" file consisting of both parts together as is (i.e. including the <!DOCTYPE submission [ and ]>) and run any program in the toolkit on the prolog and your SGML file. That is, put both names on the command line, with the prolog file first, so that the parser reads both files in the correct order.

To understand what's happening, you need to know that an SGML parser needs three pieces of information for a parse: an SGML declaration to set some environmental and processing parameters, then a DTD to describe the structural constraints on a document, and finally the document itself.
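The prolog variant could be sketched as follows. The file names, the function names, and the `osx` executable name are all assumptions for illustration; the only points taken from the description above are that the prolog holds the declaration followed by the complete DOCTYPE block, and that it comes first on the command line.

```python
def write_prolog(declaration_path, dtd_path, prolog_path):
    """Concatenate the SGML declaration and the full DOCTYPE block into one file."""
    with open(prolog_path, "w") as out:
        for path in (declaration_path, dtd_path):
            with open(path) as src:
                text = src.read()
            out.write(text)
            # Make sure the two parts do not run together on one line.
            if not text.endswith("\n"):
                out.write("\n")
    return prolog_path


def parser_arguments(prolog_path, sgml_path, executable="osx"):
    """Build the command line with the prolog first, then the document."""
    # Order matters: declaration and DTD (the prolog) must be read before the filing.
    return [executable, prolog_path, sgml_path]
```

The returned list can be passed to `subprocess.check_output` or a similar call to run the conversion.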

arayq2 answered Oct 07 '22 08:10


The pysec project looks promising. It's a basic Django app that downloads the EDGAR index and then lets you download specific filings and extract financial parameters from the XBRL.

Cerin answered Oct 07 '22 08:10