
Figure sizes with pandoc conversion from markdown to docx

I am writing a report with R Markdown in RStudio. When I convert it to HTML with knitr, knitr also produces a markdown file. I convert this file with pandoc as follows:

pandoc -f markdown -t docx input.md -o output.docx

The output.docx file is fine except for one problem: the figure sizes are altered, so I have to resize the figures manually in Word. Is there a way, perhaps a pandoc option, to get the correct figure sizes?

Stéphane Laurent asked Feb 12 '13 09:02




3 Answers

Here is a solution that resizes the figures with ImageMagick from an R script. A 70% ratio seems to be a good choice.

# the path containing the Rmd file:
wd <- "..."
setwd(wd)

# the folder containing the figures:
fig.path <- file.path(wd, "figure")
# all png figures (pattern is a regular expression, so the dot is escaped):
figures <- list.files(fig.path, pattern = "\\.png$", all.files = TRUE)

# (safety) create copies of the original files
copy.path <- paste0(fig.path, "_copy")
dir.create(copy.path)
for (f in figures) {
  file.copy(file.path(fig.path, f), copy.path)
}

# resize all figures in place
for (f in figures) {
  fig <- file.path(fig.path, f)
  # ImageMagick expects the input file first, then the operation, then the output
  comm <- paste("convert", shQuote(fig), "-resize 70%", shQuote(fig))
  shell(comm)  # shell() exists only on Windows; use system(comm) elsewhere
}

# then run pandoc from a command line
# or from the pandoc() function:
library(knitr)
pandoc("MyReport.md", "docx")

More information about ImageMagick's resize option: www.perturb.org
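Since shell() is only available on Windows, a more portable variant could call ImageMagick through system2(). The following is a minimal sketch rather than the code from the answer: the helper names resize_args and resize_figures are hypothetical, and it assumes that an ImageMagick convert executable is on the PATH (recent ImageMagick releases also ship a magick command, which may be needed instead).

```r
# Hypothetical helper: build the argument vector for ImageMagick's convert.
# The input file comes first, then the operation, then the output file.
resize_args <- function(fig, percent = 70) {
  c(shQuote(fig), "-resize", paste0(percent, "%"), shQuote(fig))
}

# Hypothetical helper: resize every PNG in fig.path in place.
# Assumes the ImageMagick "convert" executable is on the PATH.
resize_figures <- function(fig.path, percent = 70) {
  figures <- list.files(fig.path, pattern = "\\.png$", full.names = TRUE)
  for (fig in figures) {
    system2("convert", resize_args(fig, percent))
  }
  invisible(figures)
}

# Example call (not run here; back up the figures first):
# resize_figures(file.path(wd, "figure"), percent = 70)
```

Quoting each path with shQuote() keeps the call working when the figure folder contains spaces.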

Stéphane Laurent answered Oct 05 '22 01:10


I also want to convert an R Markdown document into both HTML and .docx/.odt, with figures at the right size and resolution. So far, the best way I have found is to define the resolution and size of the graphs explicitly with the dpi, fig.width and fig.height options. If you do this, you get good graphs usable for publication and the odt/docx is fine. The problem is that, if you use a dpi much higher than the default 72 dpi, the graphs look too big in the HTML file. Here are 3 approaches I have used to handle this (NB: I use R scripts with spin() syntax):

1) Use out.extra ='WIDTH="75%"' in the knitr options. This forces every graph in the HTML to occupy 75% of the window width. It is a quick solution but not optimal if your plots have very different sizes. (NB: I prefer working in centimetres rather than inches, hence the /2.54 everywhere.)

library(knitr)
opts_chunk$set(echo = FALSE, dev = c("png", "pdf"), dpi = 400,
               fig.width = 8/2.54, fig.height = 8/2.54,
               out.extra ='WIDTH="75%"'
)

data(iris)

#' # Iris dataset
summary(iris)
boxplot(iris[,1:4])

#+ fig.width=14/2.54, fig.height=10/2.54
par(mar = c(2,2,2,2))
pairs(iris[,-5])

2) Use out.width and out.height to specify the size of the graphs in pixels in the HTML file. I use a constant "sc" to scale down the size of the plots in the HTML output. This is the most precise approach, but the problem is that for each graph you have to define both fig.width/fig.height and out.width/out.height, which is really boring! Ideally you should be able to specify in the global options that e.g. out.width = 150*fig.width (where fig.width changes from chunk to chunk). Maybe something like that is possible but I don't know how.

#+ echo = FALSE
library(knitr)
sc <- 150
opts_chunk$set(echo = FALSE, dev = c("png", "pdf"), dpi = 400,
                fig.width = 8/2.54, fig.height = 8/2.54,
                out.width = sc*8/2.54, out.height = sc*8/2.54
)

data(iris)

#' # Iris dataset
summary(iris)
boxplot(iris[,1:4])

#+ fig.width=14/2.54, fig.height=10/2.54, out.width= sc * 14/2.54, out.height= sc * 10/2.54
par(mar = c(2,2,2,2))
pairs(iris[,-5])
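On the wish to have out.width follow fig.width automatically: knitr provides option hooks (knitr::opts_hooks), which can rewrite a chunk's options before the chunk is evaluated. The following is a minimal sketch under that assumption, not something tested against the scripts in this answer; scale_out_size is a hypothetical helper name, and the hook overrides any out.width or out.height set by hand in a chunk.

```r
sc <- 150  # pixels of HTML width per inch of figure width, as in the script above

# Hypothetical helper: derive out.width/out.height from the chunk's own
# fig.width/fig.height instead of repeating them by hand.
scale_out_size <- function(options, scale = sc) {
  options$out.width  <- scale * options$fig.width
  options$out.height <- scale * options$fig.height
  options
}

# Register it as an option hook; knitr runs the hook for each chunk whose
# fig.width option is not NULL. Guarded so the sketch also loads without knitr.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_hooks$set(fig.width = scale_out_size)
}
```

With the hook registered in the first chunk, later chunks would only need fig.width and fig.height, for example `#+ fig.width=14/2.54, fig.height=10/2.54`.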

Note that for these two solutions, I think you cannot convert your md file directly into odt with pandoc (the figures are not included). I convert the md into html and then the html into odt (I have not tried docx). Something like this (if the previous R script is named "figsize1.R"):

library(knitr)
setwd("/home/gilles/")
spin("figsize1.R")

system("pandoc figsize1.md -o figsize1.html")
system("pandoc figsize1.html -o figsize1.odt")

3) Simply compile your document twice, once with a low dpi value (~96) for the HTML output and once with a high resolution (~300) for the odt/docx output. This is now my preferred way. The main disadvantage is that you must compile twice, but this is not really a problem for me since I generally need the odt file only at the very end of the job, to give to end users. During the work I regularly compile the HTML with the HTML notebook button in RStudio.

#+ echo = FALSE
library(knitr)

opts_chunk$set(echo = FALSE, dev = c("png", "pdf"), 
               fig.width = 8/2.54, fig.height = 8/2.54
)

data(iris)

#' # Iris dataset
summary(iris)
boxplot(iris[,1:4])

#+ fig.width=14/2.54, fig.height=10/2.54
par(mar = c(2,2,2,2))
pairs(iris[,-5])

Then compile the 2 outputs with the following script (NB: here you can convert the md file directly into html):

library(knitr)
setwd("/home/gilles")

opts_chunk$set(dpi=96)
spin("figsize3.R", knit=FALSE)
knit2html("figsize3.Rmd")

opts_chunk$set(dpi=400)
spin("figsize3.R")
system("pandoc figsize3.md -o figsize3.odt")
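To see why the dpi value matters so much for the HTML output, it helps to work out the pixel size of each PNG, which is the figure size in inches multiplied by dpi. The short sketch below only illustrates that arithmetic; fig_pixels is a hypothetical helper name and is not part of knitr.

```r
# Pixel dimensions of a raster figure: size in inches times dpi.
# fig_pixels is a hypothetical helper name; sizes are given in centimetres,
# following the /2.54 convention used above.
fig_pixels <- function(width_cm, height_cm, dpi) {
  round(c(width = width_cm, height = height_cm) / 2.54 * dpi)
}

fig_pixels(8, 8, dpi = 96)   # low resolution for the HTML output
fig_pixels(8, 8, dpi = 400)  # high resolution for the odt/docx output
```

An 8 cm figure is about 302 pixels wide at 96 dpi but 1260 pixels wide at 400 dpi, which is why the high-resolution images look oversized in a browser unless their display size is constrained.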
Gilles answered Oct 05 '22 03:10


Here is my solution: hack the docx produced by pandoc. A docx is simply a bundle of XML files, so adjusting the figure sizes is pretty straightforward.

The following is what a figure looks like in the word/document.xml extracted from a converted docx:

<w:p>
  <w:r>
    <w:drawing>
      <wp:inline>
        <wp:extent cx="1524000" cy="1524000" />
        ...
        <a:graphic>
          <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
            <pic:pic>
              ...
              <pic:blipFill>
                <a:blip r:embed="rId23" />
                ...
              </pic:blipFill>
              <pic:spPr bwMode="auto">
                <a:xfrm>
                  <a:off x="0" y="0" />
                  <a:ext cx="1524000" cy="1524000" />
                </a:xfrm>
                ...
              </pic:spPr>
            </pic:pic>
          </a:graphicData>
        </a:graphic>
      </wp:inline>
    </w:drawing>
  </w:r>
</w:p>

So replacing the cx and cy attributes of the wp:extent and a:ext nodes with the desired values does the resizing. The following R code works for me. The widest figure takes up a whole line's width, specified by the variable out.width, and the rest are resized proportionally.

library(XML)

## default linewidth (inch) for Word 2003
out.width <- 5.77
docx.file <- "report.docx"

## unzip the docx converted by Pandoc
system(paste("unzip", docx.file, "-d temp_dir"))
document.xml <- "temp_dir/word/document.xml"
doc <- xmlParse(document.xml)
wp.extent <- getNodeSet(xmlRoot(doc), "//wp:extent")
a.blip <- getNodeSet(xmlRoot(doc), "//a:blip")
a.ext <- getNodeSet(xmlRoot(doc), "//a:ext")

figid <- sapply(a.blip, xmlGetAttr, "r:embed")
figname <- dir("temp_dir/word/media/")
stopifnot(length(figid) == length(figname))
pdffig <- paste("temp_dir/word/media/",
                ## in case figure ids in docx are not in dir'ed order
                sort(figname)[match(figid, substr(figname, 1, nchar(figname) - 4))], sep="")

## get dimension info of included pdf figures
pdfsize <- do.call(rbind, lapply(pdffig, function (x) {
    fig.ext <- substr(x, nchar(x) - 2, nchar(x))
    pp <- pipe(paste(ifelse(fig.ext == 'pdf', "pdfinfo", "file"), x, sep=" "))
    pdfinfo <- readLines(pp); close(pp)
    sizestr <- unlist(regmatches(pdfinfo, gregexpr("[[:digit:].]+ X [[:digit:].]+", pdfinfo, ignore.case=TRUE)))
    as.numeric(strsplit(sizestr, split=" x ")[[1]])
}))

## resizing pdf figures in xml DOM, with the widest figure taking up a line's width
wp.cx <- round(out.width*914400*pdfsize[,1]/max(pdfsize[,1]))
wp.cy <- round(wp.cx*pdfsize[, 2]/pdfsize[, 1])
wp.cx <- as.character(wp.cx)
wp.cy <- as.character(wp.cy)
sapply(1:length(wp.extent), function (i)
       xmlAttrs(wp.extent[[i]]) <- c(cx = wp.cx[i], cy = wp.cy[i]));
sapply(1:length(a.ext), function (i)
       xmlAttrs(a.ext[[i]]) <- c(cx = wp.cx[i], cy = wp.cy[i]));

## save hacked xml back to docx
saveXML(doc, document.xml, indent = FALSE)
setwd("temp_dir")
system(paste("zip -r ../", docx.file, " *", sep=""))
setwd("..")
## remove the temporary directory without relying on an external rm command
unlink("temp_dir", recursive = TRUE)
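The cx and cy values are expressed in English Metric Units (EMU), with 914400 EMU per inch, so the 1524000 in the XML excerpt corresponds to about 1.67 inches. The scaling step of the script can be isolated as in the sketch below; scale_to_emu and EMU_PER_INCH are hypothetical names introduced here for illustration, and the two figure sizes are made-up inputs.

```r
EMU_PER_INCH <- 914400  # English Metric Units per inch, as used in the script above

# Hypothetical helper: given figure sizes (one row per figure, columns
# width and height in any common unit), return cx/cy in EMU so that the
# widest figure spans out.width inches and aspect ratios are preserved.
scale_to_emu <- function(sizes, out.width = 5.77) {
  cx <- round(out.width * EMU_PER_INCH * sizes[, 1] / max(sizes[, 1]))
  cy <- round(cx * sizes[, 2] / sizes[, 1])
  cbind(cx = cx, cy = cy)
}

# Made-up example: a 640 x 480 figure and a 320 x 320 figure
sizes <- rbind(c(640, 480), c(320, 320))
scale_to_emu(sizes)
```

Here the first figure gets the full 5.77 inch line width and the second gets half of it, each with its original aspect ratio.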
lcn answered Oct 05 '22 03:10