
Use lxml to parse text file with bad header in Python

Tags: python, lxml

I would like to parse text files (stored locally) with lxml's etree. But all of my files (thousands) have headers, such as:

-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: [email protected]
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 AHxm/u6lqdt8X6gebNqy9afC2kLXg+GVIOlG/Vrrw/dTCPGwM15+hT6AZMfDSvFZ
 YVPEaPjyiqB4rV/GS2lj6A==

<SEC-DOCUMENT>0001193125-07-200376.txt : 20070913
<SEC-HEADER>0001193125-07-200376.hdr.sgml : 20070913
<ACCEPTANCE-DATETIME>20070913115715
ACCESSION NUMBER:       0001193125-07-200376
CONFORMED SUBMISSION TYPE:  10-K
PUBLIC DOCUMENT COUNT:      7
CONFORMED PERIOD OF REPORT: 20070630
FILED AS OF DATE:       20070913
DATE AS OF CHANGE:      20070913

and the first < isn't until line 51 in this case (and it isn't always line 51). The XML portion starts as follows:

</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d10k.htm
<DESCRIPTION>FORM 10-K
<TEXT>
<HTML><HEAD>
<TITLE>Form 10-K</TITLE>
</HEAD>
 <BODY BGCOLOR="WHITE">
<h5 align="left"><a href="#toc">Table of Contents</a></h5>

Can I handle this on-the-fly with lxml? Or should I use a stream editor to omit each file's header? Thanks!

Here is my current code and error.

from lxml import etree
f = etree.parse('temp.txt')

XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Edit:

FWIW, here is a link to the file.

Richard Herron asked Sep 13 '12 18:09



2 Answers

Given that there's a standard for these files, it's possible to write a proper parser rather than guessing at things, or hoping BeautifulSoup gets things right. That doesn't mean it's the best answer for you, but it's certainly worth looking at.

According to the standard at http://www.sec.gov/info/edgar/pdsdissemspec910.pdf what you've got (inside the PEM enclosure) is an SGML document defined by the provided DTD. So, first go to pages 48-55, extract the text there, and save it as, say, "edgar.dtd".

The first thing I'd do is install SP and use its tools to make sure that the documents really are valid and parseable by that DTD, to make sure you don't waste a bunch of time on something that isn't going to pan out.

Python 2 comes with an SGML parser, sgmllib. Unfortunately, it was never quite finished (it handles only the subset of SGML that HTML uses and does not validate against a DTD), and it's deprecated as of 2.6 (and removed in 3.x). But that doesn't mean it won't work. So, try it and see if it works.

If not, I don't know of any good alternatives in Python; most of the SGML code out there is in C, C++, or Perl. But you can wrap up any C or C++ library (I'd start with SP) pretty easily, as long as you're comfortable writing your own wrapper in C/Cython/boost-python/whatever or using ctypes. You only need to wrap up the top-level functions, not build a complete set of bindings. But if you've never done anything like this before, it's probably not the best time to learn.

Alternatively, you can wrap up a command-line tool. SP comes with nsgmls. There's another good tool written in Perl with the same name (I think it's part of http://savannah.nongnu.org/projects/perlsgml/ but I'm not positive), and there are dozens of other tools.
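
Wrapping a command-line tool from Python can be sketched with the standard subprocess module. This is only a minimal illustration: the helper name run_external_parser is hypothetical, and the exact nsgmls invocation (the usage comment assumes "nsgmls -s" validates without printing parsed output) should be checked against the documentation of whatever SP build you install.

```python
import subprocess
import sys


def run_external_parser(command, path):
    """Run an external command on `path`.

    `command` is a list such as ["nsgmls", "-s"] (an assumption, see above);
    the file path is appended as the last argument.
    Returns a (returncode, stdout, stderr) tuple of (int, str, str).
    """
    result = subprocess.run(
        list(command) + [path],
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode the output as text
        check=False,          # let the caller inspect the return code
    )
    return result.returncode, result.stdout, result.stderr


# Hypothetical usage, assuming nsgmls is on PATH and filing.sgml exists:
#   code, out, err = run_external_parser(["nsgmls", "-s"], "filing.sgml")
#   if code != 0:
#       print(err)
```

Keeping the command as a parameter means the same helper works for any of the other tools mentioned above.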

Or, of course, you could write the whole thing, or just the parsing layer, in Perl (or C++) instead of Python.

abarnert answered Oct 08 '22 18:10


You can easily get to the encapsulated text of the PEM (Privacy-Enhanced Message, specified in RFC 1421) by stripping the encapsulation boundaries and splitting everything in between into header and encapsulated text at the first blank line.

The SGML parsing is much more difficult. Here's an attempt that seems to work with a document from EDGAR:

from lxml import html

PRE_EB = "-----BEGIN PRIVACY-ENHANCED MESSAGE-----"
POST_EB = "-----END PRIVACY-ENHANCED MESSAGE-----"

def unpack_pem(pem_string):
    """Takes a PEM encapsulated message and returns a tuple
    consisting of the header and encapsulated text.  
    """

    if not pem_string.startswith(PRE_EB):
        raise ValueError("Invalid PEM encoding; must start with %s"
                         % PRE_EB)
    if not pem_string.strip().endswith(POST_EB):
        raise ValueError("Invalid PEM encoding; must end with %s"
                         % POST_EB)
    msg = pem_string.strip()[len(PRE_EB):-len(POST_EB)]
    header, encapsulated_text = msg.split('\n\n', 1)
    return (header, encapsulated_text)


filename = 'secdoc_htm.txt'
# Use a context manager so the file is closed after reading
with open(filename, 'r') as f:
    data = f.read()

header, encapsulated_text = unpack_pem(data)

# Now parse the SGML
root = html.fromstring(encapsulated_text)
document = root.xpath('//document')[0]

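metadata = {}
# Relative paths ('.//') keep each lookup inside this <DOCUMENT> element;
# an absolute '//' would search the whole tree regardless of the context node
metadata['type'] = document.xpath('.//type')[0].text.strip()
metadata['sequence'] = document.xpath('.//sequence')[0].text.strip()
metadata['filename'] = document.xpath('.//filename')[0].text.strip()

inner_html = document.xpath('.//text')[0]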

print(metadata)
print(inner_html)

Result:

{'filename': 'd371464d10q.htm', 'type': '10-Q', 'sequence': '1'}

<Element text at 80d250c>
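
The header in the question reports "PUBLIC DOCUMENT COUNT: 7", so a filing may hold several <DOCUMENT> blocks, and the header length varies from file to file. A rough standard-library sketch for pulling out every block without counting header lines follows. The names split_documents, DOCUMENT_RE and FIELD_RE are hypothetical, and the code assumes that <DOCUMENT> blocks are not nested and that the <TEXT> tags are upper case, as in the sample above.

```python
import re

# One <DOCUMENT> ... </DOCUMENT> block; does not match <SEC-DOCUMENT>.
DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
# Unclosed one-line metadata tags such as "<TYPE>10-K".
FIELD_RE = re.compile(
    r"^<(TYPE|SEQUENCE|FILENAME|DESCRIPTION)>(.*)$",
    re.MULTILINE | re.IGNORECASE,
)


def split_documents(filing_text):
    """Return a list of (metadata, body) pairs, one per <DOCUMENT> block."""
    # Normalize line endings so CRLF files behave like LF files.
    text = filing_text.replace("\r\n", "\n")
    results = []
    for match in DOCUMENT_RE.finditer(text):
        block = match.group(1)
        # Metadata comes before <TEXT>; the payload sits inside <TEXT>...</TEXT>.
        head, sep, rest = block.partition("<TEXT>")
        metadata = {
            name.lower(): value.strip()
            for name, value in FIELD_RE.findall(head)
        }
        body = rest.rsplit("</TEXT>", 1)[0] if sep else ""
        results.append((metadata, body.strip()))
    return results
```

Each body string can then be handed to lxml's html.fromstring() as above, one document at a time.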

Lukas Graf answered Oct 08 '22 17:10