For reasons I really do not understand, a REST API I'm using outputs a peculiar structured text format instead of JSON or XML. In its simplest form:
SECTION_NAME entry other qualifying bits of the entry
entry2 other qualifying bits
...
The fields are not tab-delimited, as the layout might suggest, but space-delimited, and the qualifying bits may themselves contain spaces. The gap between SECTION_NAME and the entries is also variable, ranging from one to several (6 or more) spaces.
Also, one part of the format contains entries in the form:
SECTION_NAME entry
SUB_SECTION more information
SUB_SECTION2 more information
For reference, an extract of real data (some sections omitted), which shows the use of the structure:
ENTRY hsa04064 Pathway
NAME NF-kappa B signaling pathway - Homo sapiens (human)
DRUG D09347 Fostamatinib (USAN)
D09348 Fostamatinib disodium (USAN)
D09692 Veliparib (USAN/INN)
D09730 Olaparib (JAN/INN)
D09913 Iniparib (USAN/INN)
REFERENCE PMID:21772278
AUTHORS Oeckinghaus A, Hayden MS, Ghosh S
TITLE Crosstalk in NF-kappaB signaling pathways.
JOURNAL Nat Immunol 12:695-708 (2011)
I'm trying to parse this weird format into something saner (a dictionary which can then be converted to JSON), but I'm unsure how to proceed: splitting blindly on spaces makes a mess (it also breaks up values that contain spaces), and I don't know how to tell where a section starts. Is plain text manipulation enough for the job, or should I use more sophisticated methods?
EDIT:
I started using pyparsing for the job, but multi-line records baffle me. Here's an example with DRUG:
from pyparsing import *
punctuation = ",.'`&-"
special_chars = "\()[]"
drug = Keyword("DRUG")
drug_content = Word(alphanums) + originalTextFor(OneOrMore(Word(
alphanums + special_chars))) + ZeroOrMore(LineEnd())
drug_lines = OneOrMore(drug_content)
drug_parser = drug + drug_lines
When applied to the first 3 lines of DRUG in the example, I get the wrong result (\n converted to actual line breaks for readability):
['DRUG', ['D09347', 'Fostamatinib (USAN)
D09348 Fostamatinib disodium (USAN)
D09692 Veliparib (USAN']]
As you can see, the subsequent entries get lumped all together, while I'd expect:
['DRUG', [['D09347', 'Fostamatinib (USAN)'], ["D09348", "Fostamatinib disodium (USAN)"],
['D09692', ' Veliparib (USAN)']]]
I'd recommend a parser-based approach rather than ad hoc string splitting. For example, PLY (Python Lex-Yacc) can be used for the task at hand.
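For comparison with a full parser generator, here is a minimal standard-library sketch (plain line handling, not PLY). It rests on assumptions the question does not confirm: that the raw response pads keywords into a fixed-width column (12 characters here), that sub-section keywords are indented inside that column, and that continuation lines leave the column blank. The sample text is re-indented by hand to match those assumptions, and `parse_flat_text`, `key_width`, and the output layout are illustrative names, not part of any library.

```python
import json

# Hand-indented sample; the 12-character key column is an assumption.
SAMPLE = """\
ENTRY       hsa04064                    Pathway
NAME        NF-kappa B signaling pathway - Homo sapiens (human)
DRUG        D09347  Fostamatinib (USAN)
            D09348  Fostamatinib disodium (USAN)
            D09692  Veliparib (USAN/INN)
REFERENCE   PMID:21772278
  AUTHORS   Oeckinghaus A, Hayden MS, Ghosh S
  TITLE     Crosstalk in NF-kappaB signaling pathways.
  JOURNAL   Nat Immunol 12:695-708 (2011)
"""


def parse_flat_text(text, key_width=12):
    """Map each section name to a list of records.

    Each record is {"lines": [...], "subsections": {name: [...]}}.
    """
    sections = {}
    record = None   # record of the most recent top-level section
    current = None  # list that continuation lines are appended to
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue  # skip blank lines
        key_field = line[:key_width]
        key = key_field.strip()
        value = line[key_width:].strip()
        if key and not key_field[0].isspace():
            # Keyword starting in column 0: open a new top-level record.
            record = {"lines": [value] if value else [], "subsections": {}}
            sections.setdefault(key, []).append(record)
            current = record["lines"]
        elif key and record is not None:
            # Indented keyword: a sub-section of the current record.
            current = record["subsections"].setdefault(key, [])
            if value:
                current.append(value)
        elif current is not None:
            # Blank key column: continuation of the previous entry list.
            current.append(value)
    return sections


parsed = parse_flat_text(SAMPLE)
# Split each DRUG line once, so descriptions keep their internal spaces.
drugs = [entry.split(None, 1) for entry in parsed["DRUG"][0]["lines"]]
print(json.dumps(parsed, indent=2))
```

Under these assumptions `drugs` holds one `[id, description]` pair per DRUG line, which is the shape asked for in the question. Splitting on the column position first and only then splitting each value once keeps spaces inside the descriptions intact. If the real response does not use a fixed-width key column, the keyword test would need a different rule.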