
Parsing a structured text file in Python (pyparsing)

For reasons I really do not understand, a REST API I'm using outputs a peculiar structured text format instead of JSON or XML. In its simplest form:

SECTION_NAME    entry  other qualifying bits of the entry
                entry2 other qualifying bits
                ...

The fields are not tab-delimited, as the layout may suggest, but space-delimited, and the qualifying bits may themselves contain spaces. The gap between SECTION_NAME and the entries is also variable, ranging from 1 to several (6 or more) spaces.

Also, one part of the format contains entries in the form:

SECTION_NAME entry
  SUB_SECTION more information
  SUB_SECTION2 more information

For reference, an extract of real data (some sections omitted), which shows the use of the structure:

ENTRY       hsa04064                    Pathway
NAME        NF-kappa B signaling pathway - Homo sapiens (human)
DRUG        D09347  Fostamatinib (USAN)
            D09348  Fostamatinib disodium (USAN)
            D09692  Veliparib (USAN/INN)
            D09730  Olaparib (JAN/INN)
            D09913  Iniparib (USAN/INN)
REFERENCE   PMID:21772278
  AUTHORS   Oeckinghaus A, Hayden MS, Ghosh S
  TITLE     Crosstalk in NF-kappaB signaling pathways.
  JOURNAL   Nat Immunol 12:695-708 (2011)

I'm trying to parse this weird format into something saner (a dictionary which can then be converted to JSON), but I'm unsure what to do: splitting blindly on spaces causes a mess (it also breaks up values that contain spaces), and I'm not sure how to tell where a section starts. Is text manipulation enough for the job, or should I use more sophisticated methods?

EDIT:

I started using pyparsing for the job, but multi-line records baffle me. Here's an example with DRUG:

 from pyparsing import *
 punctuation = ",.'`&-"
 special_chars = "\\()[]"

 drug = Keyword("DRUG")
 drug_content = Word(alphanums) + originalTextFor(OneOrMore(Word(
      alphanums + special_chars))) + ZeroOrMore(LineEnd())
 drug_lines = OneOrMore(drug_content)
 drug_parser = drug + drug_lines

When applied to the first 3 lines of DRUG in the example, I get a wrong result (\n converted to actual line breaks for readability):

 ['DRUG', ['D09347', 'Fostamatinib (USAN)
        D09348  Fostamatinib disodium      (USAN)
        D09692  Veliparib (USAN']]

As you can see, the subsequent entries get lumped all together, while I'd expect:

 ['DRUG', [['D09347', 'Fostamatinib (USAN)'], ['D09348', 'Fostamatinib disodium (USAN)'],
           ['D09692', 'Veliparib (USAN)']]]

Einar asked Jul 04 '12 08:07


1 Answer

I'd recommend you use a parser-based approach. For example, Python PLY can be used for the task at hand.
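
If a full grammar feels like overkill, plain line-by-line handling may also be enough for a layout this regular. The following is a minimal sketch of that alternative, not PLY and not pyparsing. It rests on assumptions read off the sample in the question: section keywords are upper-case and start in column 0, subsection keywords are indented by at most four spaces, deeper-indented lines continue the current section, and fields within an entry are separated by two or more spaces. The names `parse_flat_file` and `split_fields` are made up for illustration.

```python
import json
import re

# Extract copied from the question.
SAMPLE = """\
ENTRY       hsa04064                    Pathway
NAME        NF-kappa B signaling pathway - Homo sapiens (human)
DRUG        D09347  Fostamatinib (USAN)
            D09348  Fostamatinib disodium (USAN)
            D09692  Veliparib (USAN/INN)
            D09730  Olaparib (JAN/INN)
            D09913  Iniparib (USAN/INN)
REFERENCE   PMID:21772278
  AUTHORS   Oeckinghaus A, Hayden MS, Ghosh S
  TITLE     Crosstalk in NF-kappaB signaling pathways.
  JOURNAL   Nat Immunol 12:695-708 (2011)
"""

# Assumption: a section keyword is upper-case and starts in column 0.
SECTION = re.compile(r"^([A-Z][A-Z0-9_]*)\s+(.*)$")
# Assumption: a subsection keyword is indented by at most four spaces.
SUBSECTION = re.compile(r"^ {1,4}([A-Z][A-Z0-9_]*)\s+(.*)$")


def split_fields(text):
    """Split one entry on runs of two or more spaces (an assumption)."""
    return re.split(r"\s{2,}", text.strip())


def parse_flat_file(text):
    """Return {section: [record, ...]}, each record a dict with a "fields" list."""
    sections = {}
    records = None  # records of the section currently being read
    for line in text.splitlines():
        if not line.strip():
            continue
        match = SECTION.match(line)
        if match:
            # New section line (a repeated keyword adds another record).
            records = sections.setdefault(match.group(1), [])
            records.append({"fields": split_fields(match.group(2))})
            continue
        if records is None:
            continue  # indented line before any section: ignore
        match = SUBSECTION.match(line)
        if match:
            # Attach the subsection to the most recent record.
            records[-1][match.group(1).lower()] = match.group(2).strip()
        elif line[0].isspace():
            # Deeply indented line: another entry of the same section.
            records.append({"fields": split_fields(line)})
    return sections


parsed = parse_flat_file(SAMPLE)
print(json.dumps(parsed["DRUG"][:2], indent=2))
```

The result is a plain dictionary, so `json.dumps(parsed)` works directly. Subsection values that wrap onto several lines are not handled here; a real grammar is probably the better choice once the format gets less regular than this extract.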

Mihai Maruseac answered Nov 15 '22 09:11