
Tabula extract tables by area coordinates

Tags: python, pdf, tabula

Tabula lets you extract tables from a PDF document by specifying the coordinates of the table area. On Windows, getting the coordinates means uploading the PDF file to the Tabula web page, exporting the script that contains the coordinates, and then copying them into your code. On a Mac, you just use the Preview app and its crop inspector. Are there any third-party programs or plug-ins that offer this to Windows users? I think this would be handy in the following situations:

  1. When you do not have internet access.
  2. When you need better accuracy. I have gotten inaccurate coordinates from the Tabula web page, and I think the Preview app is more accurate.

I would be grateful if anyone could point me to such a tool. Many thanks.

asked Aug 02 '17 09:08 by Eric Choi




2 Answers

Tabula needs areas to be specified in PDF units (points), which are defined as 1/72 of an inch. If you are using Acrobat Reader DC, you can take readings in inches with the Measure tool and multiply them by 72.

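Tabula needs the area to be specified as the top, left, bottom and right distances, in that order. Top and bottom are measured from the top edge of the page, and left and right from the left edge. To obtain them, measure the distance from the top of the page to the beginning of the table, from the left of the page to the table's left edge, and so on.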
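
As a rough illustration of the conversion described above, here is a minimal Python sketch. It assumes the measurements were taken in inches and that the tabula-py wrapper is used to call Tabula. The helper names `inches_to_area` and `extract_table` and the sample measurements are made up for this example.

```python
POINTS_PER_INCH = 72  # PDF units: 1 point = 1/72 of an inch


def inches_to_area(top_in, left_in, bottom_in, right_in):
    """Convert edge distances in inches to [top, left, bottom, right] in points.

    top/bottom are distances from the top edge of the page,
    left/right are distances from the left edge of the page.
    """
    return [round(v * POINTS_PER_INCH, 2)
            for v in (top_in, left_in, bottom_in, right_in)]


def extract_table(pdf_path, area, page=1):
    """Read the tables inside `area` with tabula-py.

    Needs Java and `pip install tabula-py`; imported lazily so the
    conversion helper above can be used without it.
    """
    import tabula
    return tabula.read_pdf(pdf_path, pages=page, area=area)


# Hypothetical Measure-tool readings, in inches.
area = inches_to_area(1.0, 0.5, 5.0, 8.0)
print(area)
```

The resulting list can then be passed on, for example as `extract_table("report.pdf", area)`, where `report.pdf` is a placeholder file name.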


answered Sep 22 '22 16:09 by Manuel Aristarán


Acrobat Reader only allows measurements if the creator of the PDF has enabled them. I found this alternative instead: https://graphicdesign.stackexchange.com/a/81666

Brief steps:

  1. Download SumatraPDF. It is also available as a zip, so no install is needed.
  2. Open the PDF in SumatraPDF.
  3. Press 'm'. This shows the cursor position in the top-left corner of the window.
  4. Run tabula with the options -p for the page and -a for the area, given as top,left,bottom,right.
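
The steps above can be sketched in Python as follows. This is only an illustration: it assumes the cursor readings are x,y pairs in points measured from the top-left corner of the page, and the jar name `tabula.jar`, the file name and the helper names are placeholders.

```python
def area_from_corners(top_left, bottom_right):
    """Turn two cursor readings (x, y) into a 'top,left,bottom,right' string.

    Assumption: both readings are in points, measured from the
    top-left corner of the page.
    """
    x1, y1 = top_left
    x2, y2 = bottom_right
    if x2 <= x1 or y2 <= y1:
        raise ValueError("bottom_right must lie below and to the right of top_left")
    # Tabula expects y before x: top, left, bottom, right.
    return f"{y1:g},{x1:g},{y2:g},{x2:g}"


def tabula_command(pdf_path, page, area, jar="tabula.jar"):
    """Build the argument list for the tabula command line (-p page, -a area)."""
    return ["java", "-jar", jar, "-p", str(page), "-a", area, pdf_path]


# Hypothetical readings for the table's top-left and bottom-right corners.
area = area_from_corners((54, 108), (561.6, 450))
cmd = tabula_command("report.pdf", 1, area)
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd)` once the jar path points at the tabula release you downloaded.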

answered Sep 19 '22 16:09 by Deepak Garud