Use lxml to parse text file with bad header in Python

Tags:

python

lxml

I would like to parse text files (stored locally) with lxml's etree. But all of my files (thousands) have headers, such as:

-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: [email protected]
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 AHxm/u6lqdt8X6gebNqy9afC2kLXg+GVIOlG/Vrrw/dTCPGwM15+hT6AZMfDSvFZ
 YVPEaPjyiqB4rV/GS2lj6A==

<SEC-DOCUMENT>0001193125-07-200376.txt : 20070913
<SEC-HEADER>0001193125-07-200376.hdr.sgml : 20070913
<ACCEPTANCE-DATETIME>20070913115715
ACCESSION NUMBER:       0001193125-07-200376
CONFORMED SUBMISSION TYPE:  10-K
PUBLIC DOCUMENT COUNT:      7
CONFORMED PERIOD OF REPORT: 20070630
FILED AS OF DATE:       20070913
DATE AS OF CHANGE:      20070913

The first < doesn't appear until line 51 in this case (and it isn't always on line 51). The XML portion starts as follows:

</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d10k.htm
<DESCRIPTION>FORM 10-K
<TEXT>
<HTML><HEAD>
<TITLE>Form 10-K</TITLE>
</HEAD>
 <BODY BGCOLOR="WHITE">
<h5 align="left"><a href="#toc">Table of Contents</a></h5>

Can I handle this on-the-fly with lxml? Or should I use a stream editor to omit each file's header? Thanks!

Here is my current code and error.

from lxml import etree
f = etree.parse('temp.txt')

XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Edit:

FWIW, here is a link to the file.


asked Sep 13 '12 18:09

Richard Herron

2 Answers

Given that there's a standard for these files, it's possible to write a proper parser rather than guessing at things, or hoping BeautifulSoup gets things right. That doesn't mean it's the best answer for you, but it's certainly worth looking at.

According to the standard at http://www.sec.gov/info/edgar/pdsdissemspec910.pdf what you've got (inside the PEM enclosure) is an SGML document defined by the provided DTD. So, first go to pages 48-55, extract the text there, and save it as, say, "edgar.dtd".

The first thing I'd do is install SP and use its tools to check that the documents really are valid and parseable by that DTD, so you don't waste a bunch of time on something that isn't going to pan out.

Python 2 comes with an SGML parser, sgmllib. Unfortunately, it is not a validating parser (it only handles SGML as far as HTML uses it), and it's deprecated in 2.6-2.7 (and removed in 3.x). But that doesn't mean it won't work on 2.x. So, try it and see if it works.

If not, I don't know of any good alternatives in Python; most of the SGML code out there is in C, C++, or Perl. But you can wrap up any C or C++ library (I'd start with SP) pretty easily, as long as you're comfortable writing your own wrapper in C/Cython/boost-python/whatever or using ctypes. You only need to wrap up the top-level functions, not build a complete set of bindings. But if you've never done anything like this before, it's probably not the best time to learn.

Alternatively, you can wrap up a command-line tool. SP comes with nsgmls. There's another good tool written in Perl with the same name (I think it's part of http://savannah.nongnu.org/projects/perlsgml/ but I'm not positive). And there are dozens of other tools.
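As a rough illustration of the command-line route, here is a minimal sketch that runs nsgmls through subprocess and turns its line-oriented ESIS output into a nested dict tree. It is not a tested EDGAR pipeline. The function names run_nsgmls and esis_to_tree are hypothetical, and the sketch assumes that nsgmls is on PATH, that the input file already begins with a DOCTYPE declaration pointing at the DTD, and that only the basic ESIS line types need handling: "(" for an element start, ")" for an element end, "A" for an attribute and "-" for character data.

```python
import subprocess


def run_nsgmls(sgml_path):
    """Run nsgmls on one file and return (ESIS output, error messages).

    Assumption: nsgmls is installed and on PATH, and the file starts with
    a DOCTYPE declaration that references the DTD.
    """
    result = subprocess.run(
        ["nsgmls", sgml_path],
        capture_output=True,
        text=True,
    )
    # Markup errors are reported on stderr; the caller decides how to react.
    return result.stdout, result.stderr


def esis_to_tree(esis_text):
    """Build a nested dict tree from the element and data lines of ESIS text."""
    root = {"tag": None, "attrs": {}, "text": "", "children": []}
    stack = [root]
    pending_attrs = {}
    for line in esis_text.splitlines():
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            # Attribute lines come before the start line of their element.
            name, _, value = rest.partition(" ")
            pending_attrs[name] = value
        elif code == "(":
            node = {"tag": rest, "attrs": pending_attrs,
                    "text": "", "children": []}
            pending_attrs = {}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif code == ")":
            stack.pop()
        elif code == "-":
            # Simplification: only the record-end escape (backslash + n)
            # is decoded; other ESIS escapes are left untouched.
            stack[-1]["text"] += rest.replace("\\n", "\n")
        # Any other line type (for example the final "C") is ignored.
    return root
```

The parsing half can be tried without nsgmls installed by feeding esis_to_tree a hand-written ESIS string. Anything beyond these four line types would need extra handling.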

Or, of course, you could write the whole thing, or just the parsing layer, in perl (or C++) instead of Python.


answered Oct 08 '22 18:10

abarnert

You can easily get to the encapsulated text of the PEM (Privacy-Enhanced Message, specified in RFC 1421) by stripping the encapsulation boundaries and splitting everything in between into header and encapsulated text at the first blank line.

The SGML parsing is much more difficult. Here's an attempt that seems to work with a document from EDGAR:

from lxml import html

PRE_EB = "-----BEGIN PRIVACY-ENHANCED MESSAGE-----"
POST_EB = "-----END PRIVACY-ENHANCED MESSAGE-----"


def unpack_pem(pem_string):
    """Take a PEM-encapsulated message and return a tuple
    consisting of the header and the encapsulated text.
    """
    # Strip once so both boundary checks look at the same string.
    pem_string = pem_string.strip()
    if not pem_string.startswith(PRE_EB):
        raise ValueError("Invalid PEM encoding; must start with %s"
                         % PRE_EB)
    if not pem_string.endswith(POST_EB):
        raise ValueError("Invalid PEM encoding; must end with %s"
                         % POST_EB)
    # Drop the boundaries; the first blank line separates header and text.
    msg = pem_string[len(PRE_EB):-len(POST_EB)]
    header, encapsulated_text = msg.split('\n\n', 1)
    return (header, encapsulated_text)


filename = 'secdoc_htm.txt'
with open(filename, 'r') as f:
    data = f.read()

header, encapsulated_text = unpack_pem(data)

# Now parse the SGML with lxml's lenient HTML parser
root = html.fromstring(encapsulated_text)
document = root.xpath('//document')[0]

# './/' keeps each lookup inside this <DOCUMENT> element;
# a bare '//' would search the whole tree again.
metadata = {}
metadata['type'] = document.xpath('.//type')[0].text.strip()
metadata['sequence'] = document.xpath('.//sequence')[0].text.strip()
metadata['filename'] = document.xpath('.//filename')[0].text.strip()

inner_html = document.xpath('.//text')[0]

print(metadata)
print(inner_html)

Result:

{'filename': 'd371464d10q.htm', 'type': '10-Q', 'sequence': '1'}

<Element text at 80d250c>
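If some of the thousands of files turn out to have irregular PEM framing, a looser fallback is to discard everything before the first line that begins with "<" and hand the rest to the parser. The sketch below is not part of the approach above. The helper name strip_leading_header is hypothetical, and it assumes the markup really does begin on the first line that starts with "<".

```python
def strip_leading_header(text):
    """Return text starting at the first line whose first character is '<'.

    Assumption: everything before that line is header material that the
    parser should never see.
    """
    offset = 0
    for line in text.splitlines(keepends=True):
        if line.startswith("<"):
            return text[offset:]
        offset += len(line)
    raise ValueError("no line starting with '<' found")
```

The returned string could then be passed to html.fromstring() in place of encapsulated_text. Note that any trailing PEM end boundary is left in place.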

answered Oct 08 '22 17:10

Lukas Graf
