I have a PDF, and I am using the PDE package. It is working, but not exactly the way I want.
library(PDE)
myTables <- PDE_pdfs2table(pdf = 'GPI-2023-Web.pdf')
Following file is processing: 'GPI-2023-Web.pdf'
No filter words chosen for analysis.
The following table was detected but not processable for extraction: Table 3.2 shows a breakdown of the change in the e
27 table(s) found in 'GPI-2023-Web.pdf'.
Analysis of 'GPI-2023-Web.pdf' complete.
This extracts ALL tables and writes each one as an individual CSV into a subfolder called tables.
cd tables/
[tables]$ ls
GPI-2023-Web_#010_table1.csv GPI-2023-Web_#024_table3.csv
GPI-2023-Web_#011_table1.csv GPI-2023-Web_#025_table1.csv
GPI-2023-Web_#012_table1.csv GPI-2023-Web_#026_table1.csv
GPI-2023-Web_#013_table3.csv GPI-2023-Web_#027_table1.csv
GPI-2023-Web_#014_table3.csv GPI-2023-Web_#02_table1.csv
GPI-2023-Web_#015_table3.csv GPI-2023-Web_#03_table1.csv
GPI-2023-Web_#017_table3.csv GPI-2023-Web_#04_table1.csv
GPI-2023-Web_#018_table3.csv GPI-2023-Web_#05_table1.csv
GPI-2023-Web_#019_table3.csv GPI-2023-Web_#06_table1.csv
GPI-2023-Web_#01_table1.csv GPI-2023-Web_#07_table1.csv
GPI-2023-Web_#020_table3.csv GPI-2023-Web_#08_table1.csv
GPI-2023-Web_#021_table3.csv GPI-2023-Web_#09_table1.csv
GPI-2023-Web_#022_table1.csv GPI-2023-Web_page39_w.table-000039.png
GPI-2023-Web_#023_table2.csv
[tables]$ grep -l 'Safety and Security domain' *.csv
GPI-2023-Web_#011_table1.csv
GPI-2023-Web_#01_table1.csv
GPI-2023-Web_#023_table2.csv
GPI-2023-Web_#03_table1.csv
[tables]$ vi GPI-2023-Web_#01_table1.csv
I can then pick the specific table I want and post-process it, but what I really want is to extract one VERY specific table, titled Table 1.1: Safety and Security domain, and NOTHING else.
Is this possible?
Using PDE_pdfs2table_searchandfilter sounded promising until none of the search.words and filter.words options I tried actually worked. It still extracted many tables.
PS: The above PDF file can be downloaded from here: GPI-2023-Web.pdf
PDE_pdfs2table_searchandfilter is quite good, especially with regular expressions as search.words (using regex is the default behaviour).
For the specific example you can use
search.words = 'TABLE 1\\.1\\b'
The first escape sequence \. matches a literal dot character (in an R string the double backslash \\ evaluates to a single backslash before it is passed to the regex engine). In regex an unescaped dot . is a special character that matches any single character, so the regex 1.1 (without escape) matches "1.1" but also "101".
The second escape sequence \b stands for a word boundary. Without it, the regex 1\\.1 matches 1.1, but also the beginning of 1.11 (a partial match).
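The effect of the two escapes can be checked in base R with grepl, independently of PDE. This is only a small sketch: the candidate strings are made up for illustration and do not come from the PDF.

```r
# Made-up caption strings, for illustration only.
candidates <- c("TABLE 1.1: Safety and Security domain", "TABLE 1.11", "TABLE 101")

# Unescaped dot: "." matches any character, so all three strings match.
unescaped <- grepl("TABLE 1.1", candidates)

# Escaped dot: "TABLE 101" no longer matches, but "TABLE 1.11" still does.
escaped <- grepl("TABLE 1\\.1", candidates)

# Escaped dot plus word boundary: only the exact "1.1" caption matches.
bounded <- grepl("TABLE 1\\.1\\b", candidates)

# Matching is case sensitive unless ignore.case = TRUE is set.
mixed_case <- "Table 1.1: Safety and Security domain"
case_sensitive   <- grepl("TABLE 1\\.1\\b", mixed_case)
case_insensitive <- grepl("TABLE 1\\.1\\b", mixed_case, ignore.case = TRUE)

print(rbind(unescaped, escaped, bounded))
```

The last two lines are a reminder that a caption printed in mixed case in the PDF would need the case-insensitive setting.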
The full call to PDE_pdfs2table_searchandfilter could be as follows (arguments that matter here but already have the required default value are commented out):
PDE_pdfs2table_searchandfilter(
  pdf = 'GPI-2023-Web.pdf',
  search.words = 'TABLE 1\\.1\\b', # short for c('TABLE 1\\.1\\b')
  #ignore.case.sw = FALSE,         # search words are case sensitive (default)
  #regex.sw = TRUE,                # use regex rules for search words
  eval.abbrevs = FALSE,            # don't detect abbreviations, use search words as they are
  exp.nondetc.tabs = FALSE,        # don't save images for tables that failed to be read
  write.tab.doc.file = FALSE       # don't write info about tables that failed to be read
)
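To confirm afterwards which CSV files actually contain the title, something like the grep -l call from the question can be done in base R. This is a minimal sketch, assuming the CSVs end up in a folder such as tables as described in the question; files_matching is a hypothetical helper, not part of PDE.

```r
# Hypothetical helper (not part of PDE): return the names of the CSV files
# in `dir` that contain at least one line matching the regex `pattern`.
files_matching <- function(dir, pattern) {
  csv_files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  hits <- vapply(
    csv_files,
    function(f) any(grepl(pattern, readLines(f, warn = FALSE))),
    logical(1)
  )
  basename(csv_files[hits])
}

# Example usage, assuming the output folder is "tables" as in the question:
# files_matching("tables", "Safety and Security domain")
```

If the search words did their job, only the file holding Table 1.1 should be listed.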