I am using tabula_py to read tables in a PDF. Some are big. I have a lot of cases where a table spans more than one page. The issue is that tabula_py treats the part on each page as a new table instead of reading it as one large table. I have the same issue with Camelot.
You're right. Both Camelot and Tabula work page by page.
However, you can write a custom function that checks whether two tables on consecutive pages are parts of the same table. If they are, you can merge their content and treat them as one table.
For example, I created this function to process Camelot output:
    from numpy import allclose

    def are_tables_united(table1_dict, table2_dict):
        # the second table must be on the page right after the first one
        if table2_dict['page'] != table1_dict['page'] + 1:
            return False
        # both tables must have the same number of columns
        if len(table2_dict['cols']) != len(table1_dict['cols']):
            return False
        # extract the vertical coordinates of the tables
        # (PDF coordinates: y grows upward from the bottom of the page)
        _, y_bottom_table1, _, _ = table1_dict['_bbox']
        _, _, _, y_top_table2 = table2_dict['_bbox']
        page_height = 792  # US Letter height in points; adjust for other page sizes
        # If the first table ends in the last 15% of the page
        # and the second table starts in the first 15% of the page
        if y_bottom_table1 < .15 * page_height and y_top_table2 > .85 * page_height:
            table1_cols_width = [col[1] - col[0] for col in table1_dict['cols']]
            table2_cols_width = [col[1] - col[0] for col in table2_dict['cols']]
            # evaluate if the column widths of the two tables are similar
            return allclose(table1_cols_width, table2_cols_width, atol=3, rtol=0)
        return False
The function arguments table1_dict and table2_dict are the __dict__ attributes of the tables that Camelot returns.
For example:
    import camelot

    tables = camelot.read_pdf(...)
    table1_dict = tables[0].__dict__
    table2_dict = tables[1].__dict__
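To actually merge the content, one option is to first group consecutive tables that the check reports as belonging together. The following is only a minimal sketch of that idea, not part of Camelot: the helper name group_continued_tables is made up for illustration, and the check is passed in as the are_united parameter so that any function taking two table __dict__ values (such as are_tables_united) can be used.

```python
def group_continued_tables(tables, are_united):
    """Group consecutive tables that continue each other.

    tables: tables in document order (for example a Camelot result);
            each item only needs a __dict__ attribute.
    are_united: hypothetical predicate taking two table __dict__ values
                and returning True if the second continues the first.
    Returns a list of groups, each group being a list of tables.
    """
    groups = []
    for table in tables:
        # Compare the current table with the last table of the last group
        if groups and are_united(groups[-1][-1].__dict__, table.__dict__):
            groups[-1].append(table)  # continuation of the previous table
        else:
            groups.append([table])  # start of a new logical table
    return groups
```

Each group can then be turned into one DataFrame, for example with pandas.concat([t.df for t in group], ignore_index=True). If the PDF repeats the header row on every page, the repeated rows will probably show up in the merged data and may need to be dropped.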