
Pandoc markdown page break


It looks like pandoc markdown uses standard LaTeX tags for this purpose:

\newpage and \pagebreak


TL;DR: use \newpage or \pagebreak together with the Lua filter below to get page breaks in many formats. R Markdown users don't have to do anything extra; the filter is already included by default.


Pandoc parses all inputs into an internal document format. The internal format has no dedicated way to represent page breaks, but it is still possible to encode the information in other ways. One way is to use raw LaTeX \newpage. This works perfectly when outputting LaTeX (or PDF created through LaTeX). However, one will run into problems when targeting different formats like HTML or docx.

A simple solution when targeting other formats is to use a pandoc filter, which can transform the internal document representation so that it suits our needs. Pandoc 2.0 and later even allows using the included Lua interpreter to perform this transformation.

Let's assume we are indicating page breaks by putting \newpage on a line surrounded by blank lines, like so:

lorem ipsum

\newpage

more text

The \newpage will be parsed as a RawBlock containing raw TeX. The block will only be included in the output if the target format can contain raw TeX (e.g., LaTeX, Markdown, Org).

We can use a simple Lua filter to translate this when targeting a different format. The following works for docx, LaTeX, epub, and light-weight markup.

--- Return a block element causing a page break in the given format.
local function newpage(format)
  if format == 'docx' then
    local pagebreak = '<w:p><w:r><w:br w:type="page"/></w:r></w:p>'
    return pandoc.RawBlock('openxml', pagebreak)
  elseif format:match 'html.*' then
    return pandoc.RawBlock('html', '<div style="page-break-after: always;"></div>')
  elseif format:match 'tex$' then
    return pandoc.RawBlock('tex', '\\newpage{}')
  elseif format:match 'epub' then
    local pagebreak = '<p style="page-break-after: always;"> </p>'
    return pandoc.RawBlock('html', pagebreak)
  else
    -- fall back to insert a form feed character
    return pandoc.Para{pandoc.Str '\f'}
  end
end

-- Filter function called on each RawBlock element.
function RawBlock (el)
  -- check that the block is raw TeX or LaTeX and mentions \newpage or
  -- \pagebreak.
  local is_tex = el.format:match 'tex$'
  if is_tex and (el.text:match '\\newpage' or el.text:match '\\pagebreak') then
    -- use format-specific pagebreak marker. FORMAT is set by pandoc to
    -- the targeted output format.
    return newpage(FORMAT)
  end
  -- otherwise, leave the block unchanged
  return nil
end
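
To apply the filter, save it to a file and pass it to pandoc with --lua-filter. Below is a minimal Python sketch that only assembles the command lines; the file names pagebreak.lua, input.md and output.* are placeholders chosen for illustration, not names from the answer.

```python
def filter_command(source, target, lua_filter="pagebreak.lua"):
    """Build the pandoc argv that converts `source` to `target` through the filter.

    pandoc infers the output format from the extension of `target`.
    """
    return ["pandoc", source, "--lua-filter", lua_filter, "-o", target]


if __name__ == "__main__":
    # Hypothetical file names; adjust them to your project.
    for target in ("output.docx", "output.html", "output.epub"):
        print(" ".join(filter_command("input.md", target)))
```

Each printed line can be pasted into a shell, or the list can be handed to subprocess.run(cmd, check=True) if pandoc is installed.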

We published an updated, more featureful version. It's available from the official pandoc lua-filters repository. The R Markdown project maintains a fork; it ships with the R package, so the feature can be used right away.

Note: when converting LaTeX to docx, you have to set the input format to latex+raw_tex so that pandoc keeps the raw TeX in its AST and passes it along to the filter (this was discussed in a GitHub issue).


I observed that this does not work for the .doc and .odt formats. A workaround I found was to insert a horizontal line (-----------------) and then, in the word processor (LibreOffice in my case), edit the "horizontal line" style so that it breaks the page and is invisible.


I can't edit LucasSeveryn's answer (I was told the edit queue is full), so I'm adding some information here.

way 1: +raw_tex

\newpage and \pagebreak need the raw_tex extension to be enabled.

With pandoc 2.9.2.1 this does not work for docx or HTML output; --verbose reports:

[INFO] Not rendering RawBlock (Format "tex") "\\pagebreak"
[INFO] Not rendering RawBlock (Format "tex") "\\newpage"
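
If you want to check a conversion for silently dropped page breaks, the messages above can be scanned for mechanically. This is a rough sketch that assumes the message wording is exactly as shown here for pandoc 2.9.2.1; other versions may phrase it differently, and the function name is made up for this example.

```python
import re

# Matches lines such as:
# [INFO] Not rendering RawBlock (Format "tex") "\\newpage"
DROPPED = re.compile(r'^\[INFO\] Not rendering RawBlock \(Format "([^"]+)"\) "(.*)"$')


def dropped_raw_blocks(log_text):
    """Return (format, text) pairs for every raw block pandoc reported as skipped."""
    found = []
    for line in log_text.splitlines():
        match = DROPPED.match(line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found
```

The text is returned as it appears in the log, so the backslash is still doubled.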

way 2: +raw_attribute

https://pandoc.org/MANUAL.html#extension-raw_attribute

```{=openxml}
<w:p>
  <w:r>
    <w:br w:type="page"/>
  </w:r>
</w:p>
```

The raw_attribute extension is also not supported by the gfm input format.
This worked for docx output, but not for HTML output.
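
Since a raw attribute block only reaches one output format, one option is to keep \newpage in the source and rewrite it per target before calling pandoc. The following is a sketch of such a preprocessing step, not part of pandoc: the function name and the RAW_BREAKS table are invented for this example, the HTML payload is an assumption based on the CSS page-break-after property, and only backtick fences are recognized.

```python
FENCE = "`" * 3  # a Markdown code fence, built here to avoid nesting literal fences

# Raw page-break markup per raw-attribute format (illustrative table).
RAW_BREAKS = {
    "openxml": '<w:p><w:r><w:br w:type="page"/></w:r></w:p>',
    "html": '<div style="page-break-after: always;"></div>',
}


def replace_page_breaks(markdown, raw_format):
    """Replace standalone \\newpage or \\pagebreak lines with a raw attribute block."""
    payload = RAW_BREAKS[raw_format]
    out = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # leave the contents of code blocks untouched
        if not in_fence and line.strip() in ("\\newpage", "\\pagebreak"):
            out.extend([FENCE + "{=" + raw_format + "}", payload, FENCE])
        else:
            out.append(line)
    return "\n".join(out)
```

Calling replace_page_breaks(text, "openxml") before a docx conversion and replace_page_breaks(text, "html") before an HTML conversion gives each writer a raw block it will render.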

extension NOTICE

This needs the +raw_tex format extension, which is not supported by all Markdown variants in pandoc.

https://pandoc.org/MANUAL.html#markdown-variants

Note, however, that commonmark and gfm have limited support for extensions.  

Only those listed below (and smart, raw_tex, and hard_line_breaks) will work.  

The extensions can, however, all be individually disabled.

Also, raw_tex only affects gfm output, not input.

So -f markdown works, but -f gfm does not.

format extension

https://pandoc.org/MANUAL.html#option--from

Extensions can be individually enabled or disabled by appending 
+EXTENSION or -EXTENSION to the format name.

For example:

-t html+raw_tex: enables raw_tex for the output format

-f markdown-raw_tex-raw_attribute: disables raw_tex and raw_attribute for the input format
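
The +EXTENSION / -EXTENSION syntax is easy to assemble programmatically. Here is a small sketch; both helper names are hypothetical, and the script only builds the argument list without calling pandoc.

```python
def format_spec(base, enable=(), disable=()):
    """Return a pandoc format string such as 'markdown+raw_tex-raw_attribute'."""
    parts = [base]
    parts += ["+" + name for name in enable]
    parts += ["-" + name for name in disable]
    return "".join(parts)


def convert_command(source, target, reader, writer):
    """Build a pandoc argv with explicit input (-f) and output (-t) formats."""
    return ["pandoc", "-f", reader, "-t", writer, source, "-o", target]


if __name__ == "__main__":
    reader = format_spec("markdown", enable=["raw_tex"])
    writer = format_spec("html", enable=["raw_tex"])
    # input.md and output.html are placeholder file names.
    print(" ".join(convert_command("input.md", "output.html", reader, writer)))
```

Whether a given extension is accepted still depends on the format, as the gfm limitation quoted above shows.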