Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract chart data from powerpoint slides

Tags:

r

Given a powerpoint file with a chart containing chart data, how can I extract the chart data as a data frame? That is, given the the tempf.pptx file, how can I retrieve the iris dataset?

library(magrittr)
library(mschart)
library(officer)

linec <- ms_linechart(data = iris, x = "Sepal.Length",
                      y = "Sepal.Width", group = "Species")
linec <- chart_ax_y(linec, num_fmt = "0.00", rotation = -90)

doc <- read_pptx()
doc <- add_slide(doc, layout = "Title and Content", master = "Office Theme")
doc <- ph_with_chart(doc, chart = linec)

print(doc, target = tempf.pptx <- tempfile(fileext = ".pptx"))
like image 379
Hugh Avatar asked Oct 20 '25 10:10

Hugh


2 Answers

Another approach would be to directly import the xls file associated with the chart :

tempdir <- tempfile()
officer::unpack_folder(tempf.pptx, tempdir)
xl_file <- list.files(tempdir, recursive = TRUE, full.names = TRUE, pattern = "\\.xlsx$")
readxl::read_excel(xl_file)

Note: this code only works because there is only one dataset in the pptx file. If there were more than a file, the relationships file *.xml.rels should be read to be sure we import the correct xlsx file (the xl reference is stored in ppt/charts/_rels/chart_file_title.xml.rels)

like image 140
David Gohel Avatar answered Oct 22 '25 00:10

David Gohel


"Cut and paste" is a seriously flawed anti-pattern for reproducible code & analyses or automation (all things we strive for in data science workflows).

This is starter code that gets you to the data elements (but you still have some "roll up your sleeves" work to do

library(xml2)
library(magrittr)

# temp holding space for the unzipped PPTX
td <- tempfile("dir")

# unzip it and keep file names
fils <- unzip(tempf.pptx, exdir = td)

# look for chart XML files
charts <- fils[grepl("chart.*\\.xml$", fils)]

# read in the first one
chart <- read_xml(charts[1])

Now that we found and read in a chart XML file, let's see if we figure out which kind of chart it is:

# find charts in the XML (i don't know if there can be more than one per-XML file)
(embedded_charts <- xml_find_all(chart, ".//c:chart/c:plotArea"))
## {xml_nodeset (1)}
## [1] <c:plotArea xmlns:c="http://schemas.openxmlformats.org/drawingml/200 ...

# get the node root of the first one (again, i'm not sure if there can be more than one)
(first_embed <- embedded_charts[1])
## {xml_nodeset (1)}
## [1] <c:plotArea xmlns:c="http://schemas.openxmlformats.org/drawingml/200 ...

# use it to get the kind of chart so we can target the values with it
(xml_children(first_embed) %>%
  xml_name() %>%
  grep("Chart", ., value=TRUE) -> embed_kind)
## [1] "lineChart"

Now we can try to find the data series for that chart.

(target <- xml_find_first(first_embed, sprintf(".//c:%s", embed_kind)))
## {xml_nodeset (1)}
## [1] <c:lineChart>\n  <c:grouping val="standard"/>\n  <c:varyColors val=" ...

# extract "column" metadata
col_refs <- xml_find_all(target, ".//c:ser/c:tx/c:strRef")
(xml_find_all(col_refs, ".//c:f") %>%
    sapply(xml_text) -> col_specs)
## [1] "sheet1!$B$1" "sheet1!$C$1" "sheet1!$D$1"

(xml_find_all(col_refs, ".//c:v") %>%
    sapply(xml_text))
## [1] "setosa"     "versicolor" "virginica"

Extract "X" metadata & data:

x_val_refs <- xml_find_all(target, ".//c:cat")
(lapply(x_val_refs, xml_find_all, ".//c:f") %>%
    sapply(xml_text) -> x_val_specs)
## [1] "sheet1!$A$2:$A$36" "sheet1!$A$2:$A$36" "sheet1!$A$2:$A$36"

(lapply(x_val_refs, xml_find_all, ".//c:v") %>%
    sapply(xml_double) -> x_vals)
##       [,1] [,2] [,3]
##  [1,]  4.3  4.3  4.3
##  [2,]  4.4  4.4  4.4
##  [3,]  4.5  4.5  4.5
##  [4,]  4.6  4.6  4.6
##  [5,]  4.7  4.7  4.7
##  [6,]  4.8  4.8  4.8
##  [7,]  4.9  4.9  4.9
##  [8,]  5.0  5.0  5.0
##  [9,]  5.1  5.1  5.1
## [10,]  5.2  5.2  5.2
## [11,]  5.3  5.3  5.3
## [12,]  5.4  5.4  5.4
## [13,]  5.5  5.5  5.5
## [14,]  5.6  5.6  5.6
## [15,]  5.7  5.7  5.7
## [16,]  5.8  5.8  5.8
## [17,]  5.9  5.9  5.9
## [18,]  6.0  6.0  6.0
## [19,]  6.1  6.1  6.1
## [20,]  6.2  6.2  6.2
## [21,]  6.3  6.3  6.3
## [22,]  6.4  6.4  6.4
## [23,]  6.5  6.5  6.5
## [24,]  6.6  6.6  6.6
## [25,]  6.7  6.7  6.7
## [26,]  6.8  6.8  6.8
## [27,]  6.9  6.9  6.9
## [28,]  7.0  7.0  7.0
## [29,]  7.1  7.1  7.1
## [30,]  7.2  7.2  7.2
## [31,]  7.3  7.3  7.3
## [32,]  7.4  7.4  7.4
## [33,]  7.6  7.6  7.6
## [34,]  7.7  7.7  7.7
## [35,]  7.9  7.9  7.9

Extract "Y" metadata and data:

y_val_refs <- xml_find_all(target, ".//c:val")
(lapply(y_val_refs, xml_find_all, ".//c:f") %>%
    sapply(xml_text) -> y_val_specs)
## [1] "sheet1!$B$2:$B$36" "sheet1!$C$2:$C$36" "sheet1!$D$2:$D$36"

(lapply(y_val_refs, xml_find_all, ".//c:v") %>%
    sapply(xml_double) -> y_vals)
## [[1]]
##  [1] 3.0 3.2 2.3 3.2 3.2 3.0 3.6 3.3 3.8 4.1 3.7 3.4 3.5 3.8 4.0
## 
## [[2]]
##  [1] 2.4 2.3 2.5 2.7 3.0 2.6 2.7 2.8 2.6 3.2 3.4 3.0 2.9 2.3 2.9 2.8 3.0
## [18] 3.1 2.8 3.1 3.2
## 
## [[3]]
##  [1] 2.5 2.8 2.5 2.7 3.0 3.0 2.6 3.4 2.5 3.1 3.0 3.0 3.2 3.1 3.0 3.0 2.9
## [18] 2.8 3.0 3.0 3.8

# see if there are X & Y titles
title_nodes <- xml_find_all(first_embed, ".//c:title")
(lapply(title_nodes, xml_find_all, ".//a:t") %>%
    sapply(xml_text) -> titles)
## [1] "Sepal.Length" "Sepal.Width" 

Unlike the impetus behind my docxtractr package (for getting tables out of Word docs) I haven't seen much call for this particular need much so I'm not sure there will be a package for the above idiom in the near future.

like image 31
hrbrmstr Avatar answered Oct 21 '25 23:10

hrbrmstr



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!