Split a multi line string according to some regex, then "glue" it together again (depending on the match)

Question

I generate a Reveal.js presentation from Pandoc markdown. It works something like this:

Each heading on level 1 (# heading 1) and 2 (## heading 2) starts a new slide
One can also start a new slide using a horizontal ruler (---)

One can create a two-column layout using the following syntax (which creates <div>s with the respective classes):

# My cool slide

:::columns
::::::column
This is column 1
::::::

::::::column
This is column 2
::::::
:::

But this is quite tedious. I'd rather use three plus signs (+++) to define a two-column layout, like this:

# My cool slide

This is column 1

+++

This is column 2

# Another slide

But with no columns

I think it should be easy to convert this (+++) into the result expected by Pandoc (:::columns...).

I tried using the split method first:

markdown.split(/^(##? .*)|(---)$/) do |slide|
  # Do another regex for the slide content that looks for a `+++`:
  # - If it finds one, replace it with the `:::columns` (etc.) syntax
  # - If it finds none, just leave it be
end.join # Glue everything together again

But I'm quite confused about how this works.

1st iteration:

$1 is "# My cool slide"
slide is ""

2nd iteration:

$1 is "# My cool slide"
slide is "# My cool slide"

3rd iteration:

$1 is "# Another slide"
slide is " This is column 1 +++ This is column 2 "

4th iteration:

$1 is "# Another slide"
slide is "# Another slide"

5th iteration:

$1 is nil
slide is " But with no columns"

What is happening here?

Alex · Accepted Answer

Not sure if regexing your way out is a good solution. I'd break it up into logical chunks where you have a lot more freedom to do what is necessary (write a parser), then convert it into a different format:

# slides.rb

md = <<~MD
# My cool slide

This is column 1

+++

This is column 2

# Another slide

But with no columns
MD

enum = md.split("\n").each
slides = {}

# break it up into slides
loop do
  slide = []
  line = enum.next
  loop do
    slide << line                   # collect lines
    break if enum.peek.match?(/^#/) # until the next heading line (starts with #)
    line = enum.next
  end
  comment, *body = slide            # first line is the heading (assumed to be a single line)

  # break slides into columns
  # join("\n") if you want to keep newlines
  columns = body.join.split("+++")
  slides[comment] = columns
end

# p slides
# => {"# My cool slide" => ["This is column 1", "This is column 2"], "# Another slide" => ["But with no columns"]}

# join it together
slides.each do |comment, columns|
  puts comment
  puts
  if columns.size > 1
    puts ":::columns"
    columns.each do |col|
      puts "::::::column"
      puts col
      puts "::::::"
    end
    puts ":::"
  else
    puts columns
  end
  puts
end

Test:

$ ruby slides.rb

# My cool slide

:::columns
::::::column
This is column 1
::::::
::::::column
This is column 2
::::::
:::

# Another slide

But with no columns
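
On the original question of what split was doing: when the pattern contains a capture group, String#split keeps the captured delimiter as its own element, so headings and slide bodies alternate in the result. That matches the sequence of `slide` values listed in the question. The sketch below uses a shortened sample string and a single capture group (both are illustrative assumptions, not the asker's exact input) to make the alternation visible:

```ruby
markdown = "# A\n\none\n\n# B\n\ntwo\n"

# With a capture group in the pattern, split keeps the captured heading
# as its own element instead of discarding it.
pieces = markdown.split(/^(##? .*)$/)
p pieces
# => ["", "# A", "\n\none\n\n", "# B", "\n\ntwo\n"]

# Drop the leading "" (text before the first heading) and pair each
# heading with the body that follows it.
slides = pieces.drop(1).each_slice(2).to_a
p slides
# => [["# A", "\n\none\n\n"], ["# B", "\n\ntwo\n"]]
```

Note also that when split is given a block (Ruby 2.6 and later), it yields each piece and returns the original string rather than an array, so chaining .join onto it will not glue the block results together. Collecting the pieces first and then using map and join avoids that.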

Patrick Janser · Answer

I came up with a rather complicated regular expression to match slides. It captures the heading or separator in one group and the content of the slide in another group.

The commented pattern (using the `x` and `m` flags)

( # Capturing group n°0: begin, heading or slide separator.
  (?:
    \A                       # Begin of text (for the first slide).
  |
    ^\#{1,2}?[^\#\r\n]+\r?\n # A heading of level 1 or 2.
  |                          # or
    ^-{3,}\r?\n              # A horizontal ruler.
  )
)
( # Capturing group n°1: The content of the slide.
  (?:                        # A line of content.
    ^                        # Match begin of line.
    (?!\#{1,2}[^\#]|-{3,})   # Not followed by a heading or horizontal line.
    [^\r\n]*(?:\r?\n|\z)     # The line content, new line or end of text.
  )+
)

See it in action here: https://regex101.com/r/MkTwXs/2
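
To see the idea in isolation, here is a much smaller sketch of the same technique: a pattern with two capture groups passed to String#scan, which returns one [heading, content] pair per match. The pattern and sample text are simplified assumptions (level 1 headings only, no horizontal rulers, no leading slide without a heading), not the full pattern above:

```ruby
text = "# One\nalpha\nbeta\n# Two\ngamma\n"

# Group 1: a heading line. Group 2: every following line that does not
# itself start a new heading (checked with a negative lookahead).
pairs = text.scan(/^(# .*)\n((?:(?!# ).*\n)*)/)
p pairs
# => [["# One", "alpha\nbeta\n"], ["# Two", "gamma\n"]]
```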

The Ruby code

markdown = <<~END_OF_MARKDOWN
    The first slide could start without a heading ;-)
    
    +++
    
    ![Welcome](/image/welcome.svg)
    
    # My cool slide
    
    This is column 1 with a link:
    
    [Go to the last slide](#the-end)
    
    +++
    
    This is column 2 followed by
    
    ### A title of level 3
    
    Some text, with list items:
    
    - Item 1
    - Item 2
      - Sub-item
    - Last item
    
    # Another slide
    
    But with no columns
    
    ## Another slide of level 2 because this is what you wanted
    
    And here comes the content of slide 3, in the first column
    
    +++
    
    Then the content in the second column.
    And `+++` or `---` should not break anything.
    
    ---
    
    A slide without a header but with some CSS:
    
    ```css
    body {
        font-family: Arial, sans-serif;
    }
    ```
    
    ---
    
    <a id="the-end"></a>
    
    ![The end](/images/the-end.png)
    
    +++
    
    Thanks for your attention!
    
    Any questions?
    
    END_OF_MARKDOWN

# The regular expression to match slides.
slidePattern = %r{
( # Capturing group n°0: begin, heading or slide separator.
  (?:
    \A                       # Begin of text (for the first slide).
  |
    ^\#{1,2}?[^\#\r\n]+\r?\n # A heading of level 1 or 2.
  |                          # or
    ^-{3,}\r?\n              # A horizontal ruler.
  )
)
( # Capturing group n°1: The content of the slide.
  (?:                        # A line of content.
    ^                        # Match begin of line.
    (?!\#{1,2}[^\#]|-{3,})   # Not followed by a heading or horizontal line.
    [^\r\n]*(?:\r?\n|\z)     # The line content, new line or end of text.
  )+
)
}mx

# Get all the slide matches.
slides = markdown.scan(slidePattern)

# Convert each slide match (heading/separator + content) into a string.
slides.map! { |slideMatch|
    # Take the content and split it with the column separator.
    columns = slideMatch[1].split(/^\+{3,}$/m)
    if columns.length() > 1
        # Wrap each column into a child div with the `column` class.
        columns.map! { |column|
            # Trim the column content before wrapping it.
            column.gsub!(/\A(?:\r?\n)+|(?:\r?\n)+\z/, '')
            "\n::::::column\n#{column}\n::::::\n"
        }
        # Return the heading or separator and all the columns in the parent div.
        slideMatch[0] + "\n:::columns\n#{columns.join()}\n:::\n\n"
    else
        # No columns found, so return the heading or separator and the content.
        slideMatch[0] + slideMatch[1]
    end
}

puts slides.join()

The output:

:::columns

::::::column
The first slide could start without a heading ;-)
::::::

::::::column
![Welcome](/image/welcome.svg)
::::::

:::

# My cool slide

:::columns

::::::column
This is column 1 with a link:

[Go to the last slide](#the-end)
::::::

::::::column
This is column 2 followed by

### A title of level 3

Some text, with list items:

- Item 1
- Item 2
  - Sub-item
- Last item
::::::

:::

# Another slide

But with no columns

## Another slide of level 2 because this is what you wanted

:::columns

::::::column
And here comes the content of slide 3, in the first column
::::::

::::::column
Then the content in the second column.
And `+++` or `---` should not break anything.
::::::

:::

---

A slide without a header but with some CSS:

```css
body {
    font-family: Arial, sans-serif;
}
```

---

:::columns

::::::column
<a id="the-end"></a>

![The end](/images/the-end.png)
::::::

::::::column
Thanks for your attention!

Any questions?
::::::

:::

Caution: Regex is not the ideal tool for that

I would avoid using a regular expression to handle what you want to do. Instead, try to implement a Markdown extension on a proper parser. Why? Because of cases like this one:

A slide with some plain text:

```
What will happen if we have `+++` or `---` below?
---
Probably break everything!
+++
No?
```

The column separator +++ is inside a block of plain text and it should not be detected as a column separator :-/
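
If you stay with plain string processing anyway, one partial mitigation is to walk the slide body line by line and remember whether you are inside a fenced block. The sketch below is only that, a sketch: `split_columns` is a hypothetical helper name, and it assumes fences are lines starting with three backticks (it does not handle tilde fences, indented code blocks, or nested fences).

```ruby
# Split a slide body into columns at lines consisting only of "+++",
# ignoring any such line that sits inside a ``` fenced block.
def split_columns(slide_body)
  columns = [[]]
  in_fence = false
  slide_body.each_line do |line|
    # Toggle the fence state on every line that opens or closes a fence.
    in_fence = !in_fence if line.start_with?("```")
    if !in_fence && line.strip == "+++"
      columns << []          # start a new column
    else
      columns.last << line   # keep the line in the current column
    end
  end
  columns.map(&:join)
end

body = "left\n+++\n```\n+++\n```\nright\n"
p split_columns(body)
# => ["left\n", "```\n+++\n```\nright\n"]
```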

As you pointed out in your final comment, creating a Pandoc filter will be safe and easier to implement. The AST (abstract syntax tree) is the best way to manipulate the document and change it before the final output.
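
As a rough illustration of the filter idea, here is a sketch of the transformation written against Pandoc's JSON AST in Ruby. The helper names (`column_break?`, `slide_start?`, `div`, `wrap_columns`, `transform`) are made up for this example, and it assumes that a line containing only +++ reaches the filter as a Para holding a single Str "+++". Because fenced code arrives as a CodeBlock element, a +++ inside it is never mistaken for a separator.

```ruby
require "json"

# True for a paragraph whose only content is the literal text "+++".
def column_break?(block)
  block["t"] == "Para" && block["c"] == [{ "t" => "Str", "c" => "+++" }]
end

# True for blocks that start a new slide: level 1-2 headers and horizontal rules.
def slide_start?(block)
  (block["t"] == "Header" && block["c"][0] <= 2) || block["t"] == "HorizontalRule"
end

# Build a Div block with a single class and no id or attributes.
def div(css_class, blocks)
  { "t" => "Div", "c" => [["", [css_class], []], blocks] }
end

# Wrap a slide body in columns/column Divs if it contains a "+++" paragraph.
def wrap_columns(body)
  return body unless body.any? { |b| column_break?(b) }
  columns = [[]]
  body.each { |b| column_break?(b) ? columns << [] : columns.last << b }
  [div("columns", columns.map { |col| div("column", col) })]
end

# Group top-level blocks into slides, then rewrite each slide body.
def transform(blocks)
  blocks.slice_before { |b| slide_start?(b) }.flat_map do |slide|
    if slide_start?(slide.first)
      [slide.first] + wrap_columns(slide.drop(1))
    else
      wrap_columns(slide)
    end
  end
end
```

To use this as an actual filter, the script would additionally read the document with JSON.parse(STDIN.read), replace its "blocks" entry with the result of transform, print it with JSON.generate, and be passed to pandoc via the --filter option.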

Tags: regex, ruby

Joshua Muheim
