Simply using parsec in python

I'm looking at this library, which has little documentation: https://pythonhosted.org/parsec/#examples

I understand there are alternatives, but I'd like to use this library.

I have the following string I'd like to parse:

mystr = """
<kv>
  key1: "string"
  key2: 1.00005
  key3: [1,2,3]
</kv>
<csv>
date,windspeed,direction
20190805,22,NNW
20190805,23,NW
20190805,20,NE
</csv>"""

While I'd like to parse the whole thing, I'd settle for just grabbing the <tags>. I have:

>>> import parsec
>>> tag_start = parsec.Parser(lambda x: x == "<")
>>> tag_end = parsec.Parser(lambda x: x == ">")
>>> tag_name = parsec.Parser(parsec.Parser.compose(parsec.many1, parsec.letter))
>>> tag_open = parsec.Parser(parsec.Parser.joint(tag_start, tag_name, tag_end))

OK, looks good. Now to use it:

>>> tag_open.parse(mystr)
Traceback (most recent call last):
...
TypeError: <lambda>() takes 1 positional argument but 2 were given

This fails. I don't understand why the error says my lambda expression was given two arguments when it clearly takes one. How can I proceed?

My optimal desired output for all the bonus points is:

[
{"type": "tag", 
 "name" : "kv",
 "values"  : [
    {"key1" : "string"},
    {"key2" : 1.00005},
    {"key3" : [1,2,3]}
  ]
},
{"type" : "tag",
"name" : "csv", 
"values" : [
    {"date" : 20190805, "windspeed" : 22, "direction": "NNW"},
    {"date" : 20190805, "windspeed" : 23, "direction": "NW"},
    {"date" : 20190805, "windspeed" : 20, "direction": "NE"}
  ]
}
]

For this question, I'd settle for understanding how to use functions like the start- and end-tag parsers above to generate:

[
  {"tag": "kv"},
  {"tag" : "csv"}
]

That is, I simply want to be able to parse arbitrary XML-like tags out of the messy mixed text entry.

Mittenchops asked Aug 06 '19 04:08


1 Answer

Judging by the library's tests, a proper way to extract the tags from your string would be the following:

from parsec import *

possible_chars = letter() | space() |  one_of('/.,:"[]') | digit()
parser =  many(many(possible_chars) + string("<") >> mark(many(possible_chars)) << string(">"))

parser.parse(mystr)
# [((1, 1), ['k', 'v'], (1, 3)), ((5, 1), ['/', 'k', 'v'], (5, 4)), ((6, 1), ['c', 's', 'v'], (6, 4)), ((11, 1), ['/', 'c', 's', 'v'], (11, 5))]

The construction of the parser:


For the sake of convenience, we first define the characters we wish to match. parsec provides several primitive parsers for this:

  • letter(): matches any alphabetic character,

  • string(str): matches any specified string str,

  • space(): matches any whitespace character,

  • spaces(): matches multiple whitespace characters,

  • digit(): matches any digit,

  • eof(): matches the end of the input string,

  • regex(pattern): matches a provided regex pattern,

  • one_of(str): matches any character from the provided string,

  • none_of(str): matches any character which is not in the provided string.


We can combine them with operators, according to the docs:

  • |: This combinator implements choice. The parser p | q first applies p. If it succeeds, the value of p is returned. If p fails without consuming any input, parser q is tried. Note that this choice does not backtrack,

  • +: Joins two or more parsers into one and returns the aggregate of their results,

  • ^: Choice with backtracking. This combinator is used whenever arbitrary lookahead is needed. The parser p ^ q first applies p. If it succeeds, the value of p is returned. If p fails, it pretends that it hasn't consumed any input, and then parser q is tried,

  • <<: Ends with a specified parser, and that end parser consumes its input,

  • <: Ends with a specified parser, but that end parser does not consume any input,

  • >>: Sequentially composes two parsers, discarding any value produced by the first,

  • mark(p): Marks the line and column information of the result of the parser p.


Then there are multiple "combinators":

  • times(p, mint, maxt=None): Repeats parser p from mint to maxt times,

  • count(p, n): Repeats parser p n times. If n is less than or equal to zero, the parser returns an empty list,

  • optional(p, default_value=None): Makes a parser optional. On success, returns the result, otherwise silently returns default_value without raising any exception. If default_value is not provided, None is returned instead,

  • many(p): Repeats parser p zero or more times,

  • many1(p): Repeats parser p at least once,

  • separated(p, sep, mint, maxt=None, end=None): Repeats parser p between mint and maxt times, separated by sep. As far as I can tell from the docstring, end controls the trailing separator: optional for None, required for True, not allowed for False,

  • sepBy(p, sep): parses zero or more occurrences of parser p, separated by delimiter sep,

  • sepBy1(p, sep): parses at least one occurrence of parser p, separated by delimiter sep,

  • endBy(p, sep): parses zero or more occurrences of p, separated and ended by sep,

  • endBy1(p, sep): parses at least one occurrence of p, separated and ended by sep,

  • sepEndBy(p, sep): parses zero or more occurrences of p, separated and optionally ended by sep,

  • sepEndBy1(p, sep): parses at least one occurrence of p, separated and optionally ended by sep.


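Using all of that, we have a parser which repeatedly matches any run of possible_chars followed by a <, discards that prefix, then marks the run of possible_chars up to the closing >, which is discarded as well. Because < and > are not in possible_chars, each run stops at a tag boundary.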

Ardweaden answered Sep 20 '22 20:09