Parsing EDGAR filings

I would like to use Python 2.7 to strip everything except the documents' text from EDGAR filings (which are available online as .txt files). Here is an example of what the files look like:

Example

EDGAR provides its Document Type Definitions starting on page 48 of this file:

DTD

The first part of my program downloads the .txt file from the EDGAR online database into a local file that I've named "parseme.txt". What I would like to know is how to use the DTD to parse that file. I would use an off-the-shelf parsing module like BeautifulSoup for the job, but EDGAR's format appears to be unique, and I hope to avoid writing a large regex.

import os
filename = 'parseme.txt'
# Read the downloaded filing into a list of lines.
with open(filename) as f:
    lines = f.readlines()

My question is related to "Parse SGML with Open Arbitrary Tags in Python 3" and "Use lxml to parse text file with bad header in Python", but I believe it is distinct: it concerns Python 2.7, and I'm not concerned with the header, only with the text of the file.

asked Nov 22 '12 00:11

philq

2 Answers

Look at the OpenSP toolkit, which has programs to process SGML files. Your simplest option is probably to use the osx program to get an XML version of the input file, after which you can use XML processing tools.
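As a rough illustration of that workflow, the sketch below calls osx from Python and then pulls text out of the resulting XML with the standard library. It assumes osx is installed and on the PATH; the helper names (sgml_to_xml, element_text) and the "TEXT" tag in the usage note are illustrative assumptions, not something the toolkit or this answer prescribes.

```python
import subprocess
import xml.etree.ElementTree as ET


def sgml_to_xml(paths, osx="osx"):
    """Run OpenSP's osx on the given files; return (xml_output, parser_messages).

    `paths` is a list of file names handed to osx in order.
    The executable name "osx" is an assumption about the local install.
    """
    proc = subprocess.Popen([osx] + list(paths),
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    # The XML arrives on stdout; parser diagnostics arrive on stderr.
    return out, err


def element_text(xml_data, tag):
    """Return the concatenated text of every element named `tag`."""
    root = ET.fromstring(xml_data)
    # itertext() walks nested children, so inline markup is flattened.
    return ["".join(el.itertext()) for el in root.iter(tag)]
```

For example, once the conversion succeeds, something like element_text(xml_output, "TEXT") would collect the body text, assuming the converted filing uses an element with that name.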

Some setup may be needed first, as the OpenSP package doesn't come with the EDGAR DTD or its SGML declaration (the first part of the material at page 48 of your reference, starting with <!SGML "ISO 8879-1986"). You will have to get these as text files and add them to the catalogs where the SP parser can find them.

UPDATE: This document seems to be a more up-to-date version. A casual Google search doesn't turn up any readily machine-processable versions, though, so you may have to copy and paste from the PDF.

However, if you do so, you'll have to remove some extraneous formatting: there seem to be page break indicators, labelled "C-1", "C-2", and so on. They are not part of the SGML and need to be deleted.
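A minimal sketch of that cleanup, assuming each page break indicator sits on a line of its own in the pasted text (the function name and the pattern are illustrative, so check them against what your copy of the PDF actually produces):

```python
import re

# Matches a line that contains only a marker such as "C-1" or "C-12",
# optionally surrounded by whitespace. This layout is an assumption.
PAGE_MARKER = re.compile(r"^\s*C-\d+\s*$")


def strip_page_markers(text):
    """Drop lines that consist solely of a page break indicator."""
    kept = [line for line in text.splitlines()
            if not PAGE_MARKER.match(line)]
    return "\n".join(kept)
```

Matching whole lines only keeps the pattern from touching a "C-1" that happens to appear inside a declaration or comment.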

To understand what's happening, you need to know that an SGML parser needs three pieces of information for a parse: an SGML declaration to set some environmental and processing parameters, then a DTD to describe the structural constraints on a document, and finally the document itself.

You can supply the first two in either of two ways. One is to add the SGML declaration and the EDGAR DTD to the catalog, in which case the DTD file should contain only the part between the [ after <!DOCTYPE submission and the matching ] at the end. The other is to create a "prolog" file consisting of both parts together as is (i.e. including the <!DOCTYPE submission [ and ]>) and run any program in the toolkit on the prolog and your SGML file. That is, put both names on the command line, with the prolog file first, so that the parser reads both files in the correct order.
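The prolog approach could be sketched as follows. This is only an illustration under stated assumptions: the two input strings are whatever you copied from the EDGAR document, the function names are hypothetical, and the file names in the usage note are placeholders.

```python
def build_prolog(sgml_declaration, doctype_block):
    """Join the SGML declaration and the full <!DOCTYPE ... [ ... ]> block.

    Both arguments are plain strings taken from the EDGAR specification;
    the checks below are only sanity checks on the pasted text.
    """
    decl = sgml_declaration.strip()
    dtd = doctype_block.strip()
    if not decl.upper().startswith("<!SGML"):
        raise ValueError("declaration should start with <!SGML")
    if not dtd.upper().startswith("<!DOCTYPE") or not dtd.endswith("]>"):
        raise ValueError("DTD should run from <!DOCTYPE ... [ to ]>")
    # Declaration first, then the DTD, matching the order the parser expects.
    return decl + "\n" + dtd + "\n"


def parser_command(prolog_path, document_path, program="osx"):
    """Build the argument list with the prolog file ahead of the document."""
    return [program, prolog_path, document_path]
```

You would write the result of build_prolog to a file (say prolog.sgml, a placeholder name) and then run the list from parser_command("prolog.sgml", "parseme.txt"), for instance through the subprocess module.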

answered Oct 07 '22 08:10

arayq2

The pysec project looks promising. It's a basic Django app that downloads the EDGAR index and then lets you download specific filings and extract financial parameters from the XBRL.

answered Oct 07 '22 08:10

Cerin


Tags: python, parsing, python-2.7, sgml