
Pyparsing - where order of tokens is unpredictable

I want to be able to pull out the type and count of letters from a piece of text where the letters could be in any order. There is some other parsing going on which I have working, but this bit has me stumped!

input -> result
"abc" -> [['a',1], ['b',1],['c',1]]
"bbbc" -> [['b',3],['c',1]]
"cccaa" -> [['a',2],['c',3]]

I could use search or scan and repeat for each possible letter, but is there a clean way of doing it?

This is as far as I got:

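from pyparsing import Word, ZeroOrMore


def handleStuff(string, location, tokens):
    # tokens[0] is the matched text; return its first letter and its length
    return [tokens[0][0], len(tokens[0])]


# Word("abc") matches any run of the characters a, b and c, in any mix
stype = Word("abc").setParseAction(handleStuff)
section = ZeroOrMore(stype("stype"))

print(section.parseString("abc").dump())
print(section.parseString("aabcc").dump())
print(section.parseString("bbaaa").dump())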

PhoebeB asked Jan 25 '10 18:01


3 Answers

It wasn't clear to me from your description whether the input characters could be mixed, as in "ababc", since in all your test cases the letters are grouped together. If the letters are always grouped together, you could use this pyparsing code:

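from pyparsing import Each, Group, Optional, Word

tests = ["abc", "bbbc", "cccaa"]  # sample inputs from the question

def makeExpr(ch):
    # Word(ch) matches a run of one or more ch; the parse action
    # replaces the matched run with [ch, run length].
    expr = Word(ch).setParseAction(lambda tokens: [ch, len(tokens[0])])
    # Group keeps each [ch, count] pair as its own sublist.
    return Group(expr)

expr = Each([Optional(makeExpr(ch)) for ch in "abc"])

for t in tests:
    print(t, expr.parseString(t).asList())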

The Each construct takes care of matching out of order, and Word(ch) handles the 1-to-n repetition. The parse action converts each matched run into a [character, count] pair.
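
To see why the grouped-versus-mixed distinction matters, here is a small sketch that does not use pyparsing at all. It uses itertools.groupby from the standard library to count runs of identical adjacent characters, which is roughly what Word(ch) matches one run at a time. The function name run_lengths is made up for illustration.

```python
from itertools import groupby


def run_lengths(text):
    """Return [char, run_length] for each run of identical adjacent characters."""
    # groupby yields one group per run of equal adjacent items, in input order.
    return [[ch, len(list(run))] for ch, run in groupby(text)]


print(run_lengths("cccaa"))  # grouped input: one run per letter
print(run_lengths("ababc"))  # mixed input: several runs of the same letter
```

With grouped input each letter appears in exactly one run, so the run lengths are also the total counts. With mixed input such as "ababc" the same letter appears in several runs, and since Each matches each Optional expression at most once, that case would likely need a different approach.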

PaulMcG answered Nov 19 '22 22:11


One solution:

text = 'sufja srfjhvlasfjkhv lasjfvhslfjkv hlskjfvh slfkjvhslk'
print([(x,text.count(x)) for x in set(text)])

No pyparsing involved, but pyparsing seems like overkill for this.
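
Note that the one-liner returns tuples in arbitrary set order and counts every character, including spaces. A minimal sketch that restricts the count to the letters of interest and produces the sorted list-of-lists shape shown in the question might look like this. The helper name letter_counts and its letters parameter are assumptions for illustration, not part of the original answer.

```python
def letter_counts(text, letters="abc"):
    """Return [[letter, count], ...] sorted by letter, for letters present in text."""
    # Only consider characters that occur in the text and are in the allowed set.
    present = sorted(set(text) & set(letters))
    return [[ch, text.count(ch)] for ch in present]


for sample in ("abc", "bbbc", "cccaa"):
    print(sample, "->", letter_counts(sample))
```

Each call to text.count scans the whole string, which is fine for short inputs and a small alphabet like this one.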

Lennart Regebro answered Nov 19 '22 21:11


I like Lennart's one-line solution.

Alex mentions another great option if you're using Python 3.1.

Yet another option is collections.defaultdict:

>>> from collections import defaultdict
>>> mydict = defaultdict(int)
>>> for c in 'bbbc':
...   mydict[c] += 1
...
>>> mydict
defaultdict(<type 'int'>, {'c': 1, 'b': 3})

mechanical_meat answered Nov 19 '22 21:11