
Split a multi line string according to some regex, then "glue" it together again (depending on the match)

Tags: regex, ruby

I generate a Reveal.js presentation from Pandoc markdown. It works something like this:

  • Each heading on level 1 (# heading 1) and 2 (## heading 2) starts a new slide
  • One can also start a new slide using a horizontal ruler (---)

One can create a two-column layout using the following syntax (which creates <div>s with the respective classes):

# My cool slide

:::columns
::::::column
This is column 1
::::::

::::::column
This is column 2
::::::
:::

But this is quite tedious. I'd rather use three plus signs (+++) to define a two-column layout, like this:

# My cool slide

This is column 1

+++

This is column 2

# Another slide

But with no columns

I think it should be easy to convert this (+++) into the result expected by Pandoc (:::columns...).

I tried using the split method first:

markdown.split(/^(##? .*)|(---)$/) do |slide|
  # Do another regex for the slide content that looks for a `+++`:
  # - If it finds one, replace it with the `:::columns` (etc.) syntax
  # - If it finds none, just leave it be
end.join # Glue everything together again

But I'm quite confused about how this works.

1st iteration:

  • $1 is "# My cool slide"
  • slide is ""

2nd iteration:

  • $1 is "# My cool slide"
  • slide is "# My cool slide"

3rd iteration:

  • $1 is "# Another slide"
  • slide is "\n\nThis is column 1\n\n+++\n\nThis is column 2\n\n"

4th iteration:

  • $1 is "# Another slide"
  • slide is "# Another slide"

5th iteration:

  • $1 is nil
  • slide is "\n\nBut with no columns"

What is happening here?
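
The `slide` values listed above are what `split` produces when the pattern contains capture groups: the captured text (here the heading) is returned as an element of its own, between the pieces around it. As far as I know, in Ruby 2.6 and later `split` with a block also returns the original string rather than an array, so the trailing `.join` would not glue anything. Below is a minimal sketch of the array form and of one way to put the pieces back together. It uses the sample text from above, and `to_columns` is a hypothetical helper name.

```ruby
markdown = <<~MD.chomp
  # My cool slide

  This is column 1

  +++

  This is column 2

  # Another slide

  But with no columns
MD

# Without a block, split returns an array. Text matched by a capture group
# is kept as its own element; groups that did not take part are skipped.
parts = markdown.split(/^(##? .*)|(---)$/)
p parts
# => ["", "# My cool slide", "\n\nThis is column 1\n\n+++\n\nThis is column 2\n\n",
#     "# Another slide", "\n\nBut with no columns"]

# Hypothetical helper: wrap the body in the Pandoc column syntax if it
# contains a `+++` line, otherwise return it unchanged.
def to_columns(body)
  columns = body.split(/^\+\+\+$/).map(&:strip)
  return body if columns.size < 2

  inner = columns.map { |c| "::::::column\n#{c}\n::::::\n" }.join("\n")
  "\n\n:::columns\n#{inner}:::\n\n"
end

# Headings and separators contain no `+++`, so they pass through untouched.
converted = parts.map { |part| to_columns(part) }.join
puts converted
```

Using `map` on the array (instead of passing a block to `split`) keeps every piece, including headings, so `join` can rebuild the whole document.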

Joshua Muheim asked Nov 16 '25 06:11


2 Answers

I'm not sure a regex-only approach is a good solution. I'd break the text up into logical chunks first, which gives you a lot more freedom to do what is necessary (in effect, writing a small parser), and then convert the chunks into the target format:

# slides.rb

md = <<~MD
# My cool slide

This is column 1

+++

This is column 2

# Another slide

But with no columns
MD

enum = md.split("\n").each
slides = {}

# break it up into slides
loop do
  slide = []
  line = enum.next                     # StopIteration at end of input ends this loop
  loop do
    slide << line                      # collect lines
    break if enum.peek.match?(/^##? /) # until the next level 1 or 2 heading
    line = enum.next                   # (peek raises StopIteration on the last slide,
  end                                  #  which `loop` rescues)
  heading, *body = slide               # assuming each slide starts with a one-line heading

  # break slides into columns
  # join("\n") if you want to keep newlines
  columns = body.join.split("+++")
  slides[heading] = columns
end

# p slides
# => {"# My cool slide" => ["This is column 1", "This is column 2"], "# Another slide" => ["But with no columns"]}

# join it together
slides.each do |heading, columns|
  puts heading
  puts
  if columns.size > 1
    puts ":::columns"
    columns.each do |col|
      puts "::::::column"
      puts col
      puts "::::::"
    end
    puts ":::"
  else
    puts columns
  end
  puts
end

Test:

$ ruby slides.rb

# My cool slide

:::columns
::::::column
This is column 1
::::::
::::::column
This is column 2
::::::
:::

# Another slide

But with no columns

Alex answered Nov 18 '25 20:11


I came up with a rather complicated regular expression that matches slides, capturing the heading or separator in one group and the content of the slide in another.

The commented pattern (using the x and m flags)

( # Capturing group 1: begin, heading or slide separator.
  (?:
    \A                       # Begin of text (for the first slide).
  |
    ^\#{1,2}?[^\#\r\n]+\r?\n # A heading of level 1 or 2.
  |                          # or
    ^-{3,}\r?\n              # A horizontal ruler.
  )
)
( # Capturing group 2: the content of the slide.
  (?:                        # A line of content.
    ^                        # Match begin of line.
    (?!\#{1,2}[^\#]|-{3,})   # Not followed by a heading or horizontal line.
    [^\r\n]*(?:\r?\n|\z)     # The line content, new line or end of text.
  )+
)

See it in action here: https://regex101.com/r/MkTwXs/2

The Ruby code

markdown = <<~END_OF_MARKDOWN
    The first slide could start without a heading ;-)
    
    +++
    
    ![Welcome](/image/welcome.svg)
    
    # My cool slide
    
    This is column 1 with a link:
    
    [Go to the last slide](#the-end)
    
    +++
    
    This is column 2 followed by
    
    ### A title of level 3
    
    Some text, with list items:
    
    - Item 1
    - Item 2
      - Sub-item
    - Last item
    
    # Another slide
    
    But with no columns
    
    ## Another slide of level 2 because this is what you wanted
    
    And here comes the content of slide 3, in the first column
    
    +++
    
    Then the content in the second column.
    And `+++` or `---` should not break anything.
    
    ---
    
    A slide without a header but with some CSS:
    
    ```css
    body {
        font-family: Arial, sans-serif;
    }
    ```
    
    ---
    
    <a id="the-end"></a>
    
    ![The end](/images/the-end.png)
    
    +++
    
    Thanks for your attention!
    
    Any questions?
    
    END_OF_MARKDOWN

# The regular expression to match slides.
slidePattern = %r{
( # Capturing group 1, index 0 in each scan result: begin, heading or slide separator.
  (?:
    \A                       # Begin of text (for the first slide).
  |
    ^\#{1,2}?[^\#\r\n]+\r?\n # A heading of level 1 or 2.
  |                          # or
    ^-{3,}\r?\n              # A horizontal ruler.
  )
)
( # Capturing group 2, index 1 in each scan result: the content of the slide.
  (?:                        # A line of content.
    ^                        # Match begin of line.
    (?!\#{1,2}[^\#]|-{3,})   # Not followed by a heading or horizontal line.
    [^\r\n]*(?:\r?\n|\z)     # The line content, new line or end of text.
  )+
)
}mx

# Get all the slide matches.
slides = markdown.scan(slidePattern)

# Convert each slide match (heading/separator + content) into a string.
slides.map! { |slideMatch|
    # Take the content and split it with the column separator.
    columns = slideMatch[1].split(/^\+{3,}$/m)
    if columns.length() > 1
        # Wrap each column into a child div with the `column` class.
        columns.map! { |column|
            # Trim the column content before wrapping it.
            column.gsub!(/\A(?:\r?\n)+|(?:\r?\n)+\z/, '')
            "\n::::::column\n#{column}\n::::::\n"
        }
        # Return the heading or separator and all the columns in the parent div.
        slideMatch[0] + "\n:::columns\n#{columns.join()}\n:::\n\n"
    else
        # No columns found, so return the heading or separator and the content.
        slideMatch[0] + slideMatch[1]
    end
}

puts slides.join()

The output:

:::columns

::::::column
The first slide could start without a heading ;-)
::::::

::::::column
![Welcome](/image/welcome.svg)
::::::

:::

# My cool slide

:::columns

::::::column
This is column 1 with a link:

[Go to the last slide](#the-end)
::::::

::::::column
This is column 2 followed by

### A title of level 3

Some text, with list items:

- Item 1
- Item 2
  - Sub-item
- Last item
::::::

:::

# Another slide

But with no columns

## Another slide of level 2 because this is what you wanted

:::columns

::::::column
And here comes the content of slide 3, in the first column
::::::

::::::column
Then the content in the second column.
And `+++` or `---` should not break anything.
::::::

:::

---

A slide without a header but with some CSS:

```css
body {
    font-family: Arial, sans-serif;
}
```

---

:::columns

::::::column
<a id="the-end"></a>

![The end](/images/the-end.png)
::::::

::::::column
Thanks for your attention!

Any questions?
::::::

:::

Caution: regex is not the ideal tool for this

I would avoid using a regular expression for what you want to do. Instead, try to implement a Markdown extension on top of a proper parser. Why? Because of cases like this one:

A slide with some plain text:

```
What will happen if we have `+++` or `---` below?
---
Probably break everything!
+++
No?
```

The column separator +++ is inside a block of plain text and it should not be detected as a column separator :-/
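
If you stay with plain Ruby anyway, one way to reduce this particular risk is to track whether the current line sits inside a fenced code block. The following is only a minimal sketch, not a full Markdown parser. It assumes that fences are opened and closed by lines starting with three backticks, and both `split_columns` and `FENCE` are hypothetical names.

```ruby
# Built at runtime so that this example contains no literal fence line.
FENCE = "`" * 3

# Hypothetical helper: split slide content at `+++` lines, ignoring any
# `+++` that appears inside a fenced code block.
def split_columns(content)
  columns = [[]]
  in_fence = false
  content.each_line do |line|
    # A fence line toggles the "inside a code block" state.
    in_fence = !in_fence if line.start_with?(FENCE)
    if !in_fence && line.match?(/\A\+{3,}\s*\z/)
      columns << []          # a separator outside a fence starts a new column
    else
      columns.last << line   # everything else belongs to the current column
    end
  end
  columns.map { |lines| lines.join.strip }
end

sample = [
  "Left column",
  "",
  "+++",
  "",
  "Right column:",
  "",
  FENCE,
  "+++",
  FENCE
].join("\n")

p split_columns(sample)
```

Here only the first `+++` is treated as a separator; the second one stays inside the right column because it sits between two fence lines. Indented code blocks, tilde fences and nested fences are not handled, which is why a real parser remains the safer option.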

As you pointed out in your final comment, creating a Pandoc filter will be safer and easier to implement. Working on the AST (abstract syntax tree) is the best way to manipulate the document and change it before the final output.

Patrick Janser answered Nov 18 '25 19:11


