
Ruby parslet: parsing multiple lines

I'm looking for a way to match multiple lines with Parslet. The code looks like this:

rule(:line) { (match('$').absent? >> any).repeat >> match('$') }
rule(:lines) { line.repeat }

However, lines always ends up in an infinite loop, because match('$') keeps matching the end of the string over and over.

Is it possible to match multiple lines that can be empty?

irb(main)> lines.parse($stdin.read)
This
is

a
multiline

string^D

should match successfully. Am I missing something? I also tried (match('$').absent? >> any.maybe).repeat(1) >> match('$') but that doesn't match empty lines.

Regards,
Danyel.

Danyel, asked Jul 18 '13 17:07


2 Answers

I usually define a rule for end_of_line. This is based on the trick in http://kschiess.github.io/parslet/tricks.html for matching end_of_file.

require 'parslet'

class MyParser < Parslet::Parser
  rule(:cr)         { str("\n") }
  rule(:eol?)       { any.absent? | cr }
  rule(:line_body)  { (eol?.absent? >> any).repeat(1) }
  rule(:line)       { cr | line_body >> eol? }
  rule(:lines?)     { line.repeat(0) }
  root(:lines?)
end

puts MyParser.new.parse(" this is a line
so is this

that was too
This ends").inspect

Obviously, if you want to do more with the parser than you can achieve with String#split("\n"), you will replace line_body with something useful :)


I had a quick go at answering this question and mucked it up. I just thought I would explain the mistake I made, and show you how to avoid mistakes of that kind.

Here is my first answer.

rule(:eol)   { str("\n") | any.absent? }
rule(:line)  { (eol.absent? >> any).repeat >> eol }
rule(:lines) { line.as(:line).repeat }

I didn't follow my usual rules:

  • Always make repeat count explicit
  • Any rule that can match a zero-length string should have a name ending in '?'

So let's apply these...

rule(:eol?)   { str("\n") | any.absent? }
# as the second option consumes nothing

rule(:line?)  { (eol?.absent? >> any).repeat(0) >> eol? }
# repeat(0) can consume nothing

rule(:lines?) { line?.as(:line).repeat(0) }
# We have a problem! We have a rule that can consume nothing inside a `repeat`!

Here we see why we get an infinite loop. As the input is consumed, you end up with just the end of file, which matches eol? and hence line? (as the line body can be empty). Because line? sits inside the repeat in lines?, it keeps matching without consuming anything and loops forever.
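The same effect can be seen outside Parslet. Below is a minimal sketch that uses Ruby's StringScanner as a stand-in for the parser. The two regexes, the repeat_scan helper and the max_steps cap are illustrative assumptions of mine, not part of Parslet; the cap exists only so the demo terminates.

```ruby
require 'strscan'

# Applies `pattern` over and over, roughly like a `repeat`.
# `max_steps` is a safety cap (hypothetical, for this demo only) so a
# pattern that matches without consuming cannot hang the script.
def repeat_scan(input, pattern, max_steps: 20)
  scanner = StringScanner.new(input)
  matches = []
  max_steps.times do
    piece = scanner.scan(pattern)
    break if piece.nil?          # pattern failed: the repeat stops normally
    matches << piece
  end
  matches
end

# Like line?: the body may be empty and end of input counts as end of line.
can_be_empty = /[^\n]*(?:\n|\z)/
# Like a line that must consume: a bare newline, or at least one character.
always_eats  = /\n|[^\n]+(?:\n|\z)/

looping = repeat_scan("a\n\nb", can_be_empty)
fixed   = repeat_scan("a\n\nb", always_eats)

puts looping.size    # hits the cap: empty matches pile up at end of input
puts fixed.inspect   # stops by itself once the input is used up
```

With the first pattern the scanner keeps succeeding at the end of the input without moving, so only the cap stops it. With the second pattern every match consumes at least one character, so the loop ends when the input does.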

We need to change the line rule so it always consumes something.

rule(:cr)         { str("\n") }
rule(:eol?)       { cr | any.absent? }
rule(:line_body)  { (eol?.absent? >> any).repeat(1) }
rule(:line)       { cr | line_body >> eol? }
rule(:lines?)     { line.as(:line).repeat(0) }

Now line has to match something, either a cr (for empty lines), or at least one character followed by the optional eol?. All repeats have bodies that consume something. We are now golden.

Nigel Thorne, answered Oct 21 '22 06:10


I think you have two related problems with your matching:

  • The pseudo-character match $ does not consume any real characters. You still need to consume the newlines somehow.

  • Parslet is munging the input in some way, making $ match in places you might not expect. The best result I could get using $ ended up matching each individual character.
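The first point can be checked with plain Ruby regular expressions, independent of Parslet. This is only a sketch of how the $ anchor behaves in Ruby's Regexp. Whether the last line also explains the per-character matching described above is a guess on my part, since it depends on how Parslet applies the pattern internally.

```ruby
text = "ab\ncd"

anchor  = text.match(/$/)    # end-of-line anchor
newline = text.match(/\n/)   # the actual newline character

anchor_pos    = anchor.begin(0)     # where the anchor matched
anchor_width  = anchor[0].length    # how many characters the anchor covered
newline_width = newline[0].length   # how many characters "\n" covered

puts anchor_pos      # position just before the newline
puts anchor_width    # zero: the anchor consumes nothing
puts newline_width   # one: the newline itself is consumed

# A one-character string also satisfies /$/, because every string has an end.
single_char_matches = "a".match?(/$/)
puts single_char_matches
```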

Much safer to use \n as the end-of-line character. I got the following to work (I am somewhat of a beginner with Parslet myself, so apologies if it could be clearer):

require 'parslet'

class Lines < Parslet::Parser
  rule(:text) { match("[^\n]") }
  rule(:line) { ( text.repeat(0) >> match("\n") ) | text.repeat(1) }
  rule(:lines) { line.as(:line).repeat }
  root :lines
end

s = "This
is

a
multiline
string"

p Lines.new.parse( s )

The rule for the line is complex because of the need to match empty lines and a possible final line without a \n.

You don't have to use the .as(:line) syntax - I just added it to show clearly that the :line rule is matching each line individually, and not simply consuming the whole input.

Neil Slater, answered Oct 21 '22 06:10