Extract a single, numbered table from PDF using PDE package

Question

I have a PDF, and I am using the PDE package. It is working, but not exactly the way I want.

library(PDE)

myTables <- PDE_pdfs2table(pdf = 'GPI-2023-Web.pdf')
Following file is processing: 'GPI-2023-Web.pdf'
No filter words chosen for analysis.
The following table was detected but not processable for extraction: Table 3.2 shows a breakdown of the change in the e
27 table(s) found in 'GPI-2023-Web.pdf'.
Analysis of 'GPI-2023-Web.pdf' complete.

This extracts ALL tables and dumps them as individual CSVs into a subfolder called tables.

cd tables/
[tables]$ ls
GPI-2023-Web_#010_table1.csv        GPI-2023-Web_#024_table3.csv
GPI-2023-Web_#011_table1.csv        GPI-2023-Web_#025_table1.csv
GPI-2023-Web_#012_table1.csv        GPI-2023-Web_#026_table1.csv
GPI-2023-Web_#013_table3.csv        GPI-2023-Web_#027_table1.csv
GPI-2023-Web_#014_table3.csv        GPI-2023-Web_#02_table1.csv
GPI-2023-Web_#015_table3.csv        GPI-2023-Web_#03_table1.csv
GPI-2023-Web_#017_table3.csv        GPI-2023-Web_#04_table1.csv
GPI-2023-Web_#018_table3.csv        GPI-2023-Web_#05_table1.csv
GPI-2023-Web_#019_table3.csv        GPI-2023-Web_#06_table1.csv
GPI-2023-Web_#01_table1.csv     GPI-2023-Web_#07_table1.csv
GPI-2023-Web_#020_table3.csv        GPI-2023-Web_#08_table1.csv
GPI-2023-Web_#021_table3.csv        GPI-2023-Web_#09_table1.csv
GPI-2023-Web_#022_table1.csv        GPI-2023-Web_page39_w.table-000039.png
GPI-2023-Web_#023_table2.csv
[tables]$ grep -l 'Safety and Security domain' *.csv
GPI-2023-Web_#011_table1.csv
GPI-2023-Web_#01_table1.csv
GPI-2023-Web_#023_table2.csv
GPI-2023-Web_#03_table1.csv
[tables]$ vi GPI-2023-Web_#01_table1.csv

While I can then pick the specific table I want and post process, I want to extract a VERY specific table titled Table 1.1: Safety and Security domain, and NOTHING else.

Is this possible?

Using PDE_pdfs2table_searchandfilter looked promising, but none of the search.words and filter.words values I tried actually worked. It still extracted many tables.

PS: The above PDF file can be downloaded from here: GPI-2023-Web.pdf

kikon · Accepted Answer

PDE_pdfs2table_searchandfilter is quite good, especially with regular expressions as search.words (using regex is the default behaviour).

For the specific example you can use

search.words = 'TABLE 1\\.1\\b'

The first escape sequence, \\. (the double backslash evaluates to a single backslash in the R string before it is passed to the regex engine), matches a literal dot character. In regex an unescaped dot . is a special character that matches any single character, so the regex 1.1 (without the escape) matches "1.1" but also "101".

The second escape sequence, \\b (again written with a doubled backslash in the R string), stands for a word boundary. Without it, the regex 1\.1 matches 1.1, but also the beginning of 1.11 (a partial match).
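To see the effect of the two escapes, here is a small sketch using base R's grepl. The caption strings are made up for illustration, and this only demonstrates the pattern itself; it makes no claim about how PDE applies search.words internally.

```r
# Made-up caption strings, only to illustrate the pattern
captions <- c("TABLE 1.1 Safety and Security domain",
              "TABLE 1.11 Something else",
              "TABLE 101 Something else",
              "Table 1.1 mixed-case heading")

# Escaped dot plus word boundary: only the exact "TABLE 1.1" caption matches
pattern <- "TABLE 1\\.1\\b"
matches <- grepl(pattern, captions)

# Unescaped dot, no boundary: "1.11" and "101" slip through as well
loose_matches <- grepl("TABLE 1.1", captions)

print(data.frame(captions, matches, loose_matches))
```

The last caption is rejected by both patterns because the match is case sensitive by default.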

The full call to PDE_pdfs2table_searchandfilter could be as follows (arguments that are essential here but already have the required default value are shown commented out):

PDE_pdfs2table_searchandfilter(
    pdf = 'GPI-2023-Web.pdf',
    search.words = 'TABLE 1\\.1\\b', # short for c('TABLE 1\\.1\\b')
    #ignore.case.sw = FALSE, # search words are case sensitive (default)
    #regex.sw = TRUE, # use regex rules for search words
    eval.abbrevs = FALSE, # don't detect abbreviations, use search words as they are
    exp.nondetc.tabs = FALSE, # don't save images of tables that could not be read
    write.tab.doc.file = FALSE # don't write info about tables that could not be read
)
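If all the tables have already been extracted into the tables subfolder, as in the question, the manual grep step can be approximated in base R. This is only a sketch: find_tables is a hypothetical helper, not part of PDE, and it assumes the text you search for actually appears inside the CSV files.

```r
# Hypothetical helper (not part of PDE): list the CSV files in `dir`
# that contain at least one line matching the regex `pattern`
find_tables <- function(dir, pattern) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  hits <- vapply(files, function(f) {
    any(grepl(pattern, readLines(f, warn = FALSE)))
  }, logical(1))
  unname(files[hits])
}

# Usage, assuming the 'tables' folder from the question exists:
# find_tables("tables", "Safety and Security domain")
```

In the question the same search phrase matched four files, so a stricter pattern or a manual check of the hits may still be needed.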

Tags:

r

Gopala
