I wonder if anyone has any clues to the missing column? I have been using pdfplumber to extract table data with good results apart from one particular set of PDFs. The problem is that while page.search finds the rightmost column in the table, extract_table omits the rightmost column. This is on Windows 11.
Here is an image of the PDF:
Link to PDF file on Dropbox:-
https://www.dropbox.com/scl/fi/d3cg802h7cawl6vw9i7cm/testdoc.pdf?rlkey=tmz390ly5fbug0xi0kx06b2kt&dl=0
Here is the page image with vertical lines superimposed:
Here is the minimal code:
# pdftesting.py
import sys

import pdfplumber

print('pdfplumber version:', pdfplumber.__version__)
print('Python version:', sys.version)
filepath = 'C:/ProgramData/PythonProgs/testing/testdoc.pdf'
fn = pdfplumber.open(filepath)
page = fn.pages[0]
# x positions of the column boundaries, left to right
vlines = [26.0, 106.25, 152.25, 251.25, 395.5, 467.25, 539.5, 624.65, 692.5,
          760.15, 818.9811199999999]
# Save a page image with the vertical lines superimposed
imagefile = 'C:/ProgramData/PythonProgs/testing/testdoc.png'
im = page.to_image(resolution=300)
im.draw_vlines(vlines, stroke_width=3)
im.save(imagefile)
lines = page.extract_table(table_settings={
    "vertical_strategy": "explicit",
    "explicit_vertical_lines": vlines,
    "horizontal_strategy": "text",
    "snap_tolerance": 5})
for item in lines:
    print('line:', item)
print('page width:', page.width)
# Locate text from the rightmost column to confirm it is on the page
target = 'inc'
match = page.search(target)[0]
X0 = match['x0']
X1 = match['x1']
size = match['chars'][0]['size']
print('Found:', target, X0, X1, size)
Here is the output from the code:
pdfplumber version: 0.11.0
Python: 3.12.0 (tags/v3.12.0:0fb18b0, Oct 2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)]
line: ['', '', '', 'tne minute, rou', 'naea up to tn', 'e nearest mi', 'nute', '', '']
line: ['UK calls', '', '', '', '', '', '', '', '']
line: ['', '', '', '', '', '', '', '', '']
line: ['Date', 'Time', 'Phone number', 'Destination', 'Duration', 'Charged', 'Included?', 'VAT', 'VAT']
line: ['', '', '', '', 'hh:mm:ss', 'hh:mm:ss', '', 'ex', 'rate']
line: ['', '', '', '', '', '', '', '', '']
line: ['Sun 17 May', '15:55', '07755221961', 'UK mobile', '00:05:26', '00:05:26', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Thu 21 May', '11:15', '07818818242', 'Vodafone mobile', '00:00:07', '00:01:00', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Fri 22 May', '15:44', '05706000459', 'Landline', '00:00:04', '00:01:00', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Mon 25 May', '20:48', '02085462206', 'Landline', '00:15:12', '00:15:12', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Sat 50 May', '10:58', '02056549856', 'Landline', '00:00:08', '00:01:00', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Fri 5 Jun', '09:58', '07818818242', 'Vodafone mobile', '00:00:11', '00:01:00', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Sat 6 Jun', '07:17', '07716065665', 'Vodafone mobile', '00:01:14', '00:01:14', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['', '', '', 'Tot', 'al of 7 calls', '23 mins 52 s', '', '£0.000', '']
page width: 856.800048828
Found: inc 761.15 773.9811199999999 9.961000000000013
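Here's what happens when you choose the horizontal strategy "text":
1. The words on the page are extracted.
2. The words are clustered by their "top" parameter.
3. Horizontal edges are built on the clusters, each running from the leftmost "x0" to the rightmost "x1" for all.
4. The intersections of the horizontal edges with the vertical lines are found and used as cell vertices.
The issue you experienced occurred at step 4 because the horizontal edges built on words are shorter than the maximum width between the vertical lines. Let's visualize this: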
import pdfplumber
filepath = '/home/jakito/Desktop/testdoc.pdf'
pdf = pdfplumber.open(filepath)
page = pdf.pages[0]
v_lines = [26.0, 106.25, 152.25, 251.25, 395.5, 467.25, 539.5, 624.65, 692.5, 760.15, 818.9811199999999]
table_settings={
"vertical_strategy": "explicit",
"explicit_vertical_lines": v_lines,
"horizontal_strategy": "text",
"snap_tolerance": 5}
page.to_image(resolution=400).debug_tablefinder(table_settings).show()
Note that the horizontal edges do not touch the rightmost vertical line. As a result, there are not enough vertices to construct cells in the last column.
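The mechanism can also be reproduced without a PDF. The following is a minimal sketch, not pdfplumber's actual code: `sketch_text_h_edges` and `reaches_vline` are hypothetical helpers, the word coordinates are made up, the tolerance values are assumptions, and only one edge per row is built for brevity. It shows that text-based horizontal edges end at the rightmost word, so a vertical line placed further right is never reached.

```python
def sketch_text_h_edges(words, word_threshold=1, tolerance=1):
    """Simplified stand-in for the "text" horizontal strategy (hypothetical helper)."""
    # Cluster words whose "top" values lie within `tolerance` of the previous word.
    clusters = []
    for word in sorted(words, key=lambda w: w["top"]):
        if clusters and word["top"] - clusters[-1][-1]["top"] <= tolerance:
            clusters[-1].append(word)
        else:
            clusters.append([word])

    # Keep only the clusters that contain enough words.
    large = [c for c in clusters if len(c) >= word_threshold]
    if not large:
        return []

    # Every edge spans from the leftmost x0 to the rightmost x1 of all kept words.
    min_x0 = min(w["x0"] for c in large for w in c)
    max_x1 = max(w["x1"] for c in large for w in c)
    return [
        {"x0": min_x0, "x1": max_x1, "top": min(w["top"] for w in c)}
        for c in large
    ]


def reaches_vline(edge, x, tol=3):
    """True if the vertical line at `x` lies within the edge's horizontal span (assumed tolerance)."""
    return edge["x0"] - tol <= x <= edge["x1"] + tol


# Made-up words for two table rows; the last column ends near x = 775.
words = [
    {"text": "Date", "x0": 26.0, "x1": 50.0, "top": 100.0},
    {"text": "VAT", "x0": 761.15, "x1": 773.98, "top": 100.0},
    {"text": "Sun", "x0": 26.0, "x1": 45.0, "top": 120.0},
    {"text": "20%", "x0": 761.15, "x1": 775.0, "top": 120.0},
]
edges = sketch_text_h_edges(words, word_threshold=2)
last_vline = 818.98  # rightmost explicit vertical line, well beyond the text

touching = [reaches_vline(e, last_vline) for e in edges]
print(touching)
```

Both values come out `False` for these sample words, which corresponds to the missing vertices in the last column of the debug image.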
Here are several ways to resolve the issue:
1. Move the rightmost vertical line to the right edge of the rightmost character, so that the horizontal edges reach it:
v_lines[-1] = max(char['x1'] for char in page.chars)
2. Apply the "text" strategy to both directions with adjusted word limits:
table_settings={
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "min_words_vertical": 3,
    "min_words_horizontal": 11
}
page.to_image(resolution=300).debug_tablefinder(table_settings).show()
3. Use the "explicit" horizontal strategy:
from pdfplumber.table import words_to_edges_h
words = page.extract_words()
h_edges = words_to_edges_h(words, word_threshold=6)
h_lines = [x['top'] for x in h_edges[::2] if 0 <= x['top'] <= page.height]
v_lines = [26.0, 106.25, 152.25, 251.25, 395.5, 467.25, 539.5, 624.65, 692.5, 760.15, 818.9811199999999]
table_settings={
"vertical_strategy":"explicit",
"explicit_vertical_lines":v_lines,
"horizontal_strategy": "explicit",
"explicit_horizontal_lines": h_lines
}
page.to_image(resolution=300).debug_tablefinder(table_settings).show()
Note: Currently, words_to_edges_h returns two edges for each cluster, which is excessive. To address this, I filtered them using h_edges[::2]. The lowest line can be added manually if needed, but in this case, it can be omitted. Additionally, I applied filtering based on 0 and the page height due to the specifics of the sample document, which appears to be a cropped version of a larger one. word_threshold=6 was added to avoid splitting "hh:mm:ss ..." into a separate line.
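To make the filtering in the third option concrete, here is a small sketch on made-up edge data. It assumes, as the note describes, that the edges arrive in pairs (one edge at the top of a row followed by one at its bottom); `tops_from_edge_pairs` is a hypothetical helper name, not part of pdfplumber.

```python
def tops_from_edge_pairs(h_edges, page_height):
    """Keep the first edge of each pair and drop lines outside the page."""
    return [e["top"] for e in h_edges[::2] if 0 <= e["top"] <= page_height]


# Made-up edges: each row contributes a top edge followed by a bottom edge.
h_edges = [
    {"top": -12.0},  # row above the visible area (cropped document)
    {"top": -2.0},
    {"top": 100.0},
    {"top": 110.0},
    {"top": 120.0},
    {"top": 130.0},
]
h_lines = tops_from_edge_pairs(h_edges, page_height=125.0)
print(h_lines)  # [100.0, 120.0]
```

The slice keeps one line per row, and the range check removes the row that lies outside the cropped page.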