Given a powerpoint file with a chart containing chart data, how can I extract the chart data as a data frame? That is, given the the tempf.pptx
file, how can I retrieve the iris
dataset?
library(magrittr)
library(mschart)
library(officer)
linec <- ms_linechart(data = iris, x = "Sepal.Length",
y = "Sepal.Width", group = "Species")
linec <- chart_ax_y(linec, num_fmt = "0.00", rotation = -90)
doc <- read_pptx()
doc <- add_slide(doc, layout = "Title and Content", master = "Office Theme")
doc <- ph_with_chart(doc, chart = linec)
print(doc, target = tempf.pptx <- tempfile(fileext = ".pptx"))
Another approach would be to directly import the xls file associated with the chart :
tempdir <- tempfile()
officer::unpack_folder(tempf.pptx, tempdir)
xl_file <- list.files(tempdir, recursive = TRUE, full.names = TRUE, pattern = "\\.xlsx$")
readxl::read_excel(xl_file)
Note: this code only works because there is only one dataset in the pptx file. If there were more than a file, the relationships file *.xml.rels
should be read to be sure we import the correct xlsx file (the xl reference is stored in ppt/charts/_rels/chart_file_title.xml.rels
)
"Cut and paste" is a seriously flawed anti-pattern for reproducible code & analyses or automation (all things we strive for in data science workflows).
This is starter code that gets you to the data elements (but you still have some "roll up your sleeves" work to do
library(xml2)
library(magrittr)
# temp holding space for the unzipped PPTX
td <- tempfile("dir")
# unzip it and keep file names
fils <- unzip(tempf.pptx, exdir = td)
# look for chart XML files
charts <- fils[grepl("chart.*\\.xml$", fils)]
# read in the first one
chart <- read_xml(charts[1])
Now that we found and read in a chart XML file, let's see if we figure out which kind of chart it is:
# find charts in the XML (i don't know if there can be more than one per-XML file)
(embedded_charts <- xml_find_all(chart, ".//c:chart/c:plotArea"))
## {xml_nodeset (1)}
## [1] <c:plotArea xmlns:c="http://schemas.openxmlformats.org/drawingml/200 ...
# get the node root of the first one (again, i'm not sure if there can be more than one)
(first_embed <- embedded_charts[1])
## {xml_nodeset (1)}
## [1] <c:plotArea xmlns:c="http://schemas.openxmlformats.org/drawingml/200 ...
# use it to get the kind of chart so we can target the values with it
(xml_children(first_embed) %>%
xml_name() %>%
grep("Chart", ., value=TRUE) -> embed_kind)
## [1] "lineChart"
Now we can try to find the data series for that chart.
(target <- xml_find_first(first_embed, sprintf(".//c:%s", embed_kind)))
## {xml_nodeset (1)}
## [1] <c:lineChart>\n <c:grouping val="standard"/>\n <c:varyColors val=" ...
# extract "column" metadata
col_refs <- xml_find_all(target, ".//c:ser/c:tx/c:strRef")
(xml_find_all(col_refs, ".//c:f") %>%
sapply(xml_text) -> col_specs)
## [1] "sheet1!$B$1" "sheet1!$C$1" "sheet1!$D$1"
(xml_find_all(col_refs, ".//c:v") %>%
sapply(xml_text))
## [1] "setosa" "versicolor" "virginica"
Extract "X" metadata & data:
x_val_refs <- xml_find_all(target, ".//c:cat")
(lapply(x_val_refs, xml_find_all, ".//c:f") %>%
sapply(xml_text) -> x_val_specs)
## [1] "sheet1!$A$2:$A$36" "sheet1!$A$2:$A$36" "sheet1!$A$2:$A$36"
(lapply(x_val_refs, xml_find_all, ".//c:v") %>%
sapply(xml_double) -> x_vals)
## [,1] [,2] [,3]
## [1,] 4.3 4.3 4.3
## [2,] 4.4 4.4 4.4
## [3,] 4.5 4.5 4.5
## [4,] 4.6 4.6 4.6
## [5,] 4.7 4.7 4.7
## [6,] 4.8 4.8 4.8
## [7,] 4.9 4.9 4.9
## [8,] 5.0 5.0 5.0
## [9,] 5.1 5.1 5.1
## [10,] 5.2 5.2 5.2
## [11,] 5.3 5.3 5.3
## [12,] 5.4 5.4 5.4
## [13,] 5.5 5.5 5.5
## [14,] 5.6 5.6 5.6
## [15,] 5.7 5.7 5.7
## [16,] 5.8 5.8 5.8
## [17,] 5.9 5.9 5.9
## [18,] 6.0 6.0 6.0
## [19,] 6.1 6.1 6.1
## [20,] 6.2 6.2 6.2
## [21,] 6.3 6.3 6.3
## [22,] 6.4 6.4 6.4
## [23,] 6.5 6.5 6.5
## [24,] 6.6 6.6 6.6
## [25,] 6.7 6.7 6.7
## [26,] 6.8 6.8 6.8
## [27,] 6.9 6.9 6.9
## [28,] 7.0 7.0 7.0
## [29,] 7.1 7.1 7.1
## [30,] 7.2 7.2 7.2
## [31,] 7.3 7.3 7.3
## [32,] 7.4 7.4 7.4
## [33,] 7.6 7.6 7.6
## [34,] 7.7 7.7 7.7
## [35,] 7.9 7.9 7.9
Extract "Y" metadata and data:
y_val_refs <- xml_find_all(target, ".//c:val")
(lapply(y_val_refs, xml_find_all, ".//c:f") %>%
sapply(xml_text) -> y_val_specs)
## [1] "sheet1!$B$2:$B$36" "sheet1!$C$2:$C$36" "sheet1!$D$2:$D$36"
(lapply(y_val_refs, xml_find_all, ".//c:v") %>%
sapply(xml_double) -> y_vals)
## [[1]]
## [1] 3.0 3.2 2.3 3.2 3.2 3.0 3.6 3.3 3.8 4.1 3.7 3.4 3.5 3.8 4.0
##
## [[2]]
## [1] 2.4 2.3 2.5 2.7 3.0 2.6 2.7 2.8 2.6 3.2 3.4 3.0 2.9 2.3 2.9 2.8 3.0
## [18] 3.1 2.8 3.1 3.2
##
## [[3]]
## [1] 2.5 2.8 2.5 2.7 3.0 3.0 2.6 3.4 2.5 3.1 3.0 3.0 3.2 3.1 3.0 3.0 2.9
## [18] 2.8 3.0 3.0 3.8
# see if there are X & Y titles
title_nodes <- xml_find_all(first_embed, ".//c:title")
(lapply(title_nodes, xml_find_all, ".//a:t") %>%
sapply(xml_text) -> titles)
## [1] "Sepal.Length" "Sepal.Width"
Unlike the impetus behind my docxtractr
package (for getting tables out of Word docs) I haven't seen much call for this particular need much so I'm not sure there will be a package for the above idiom in the near future.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With