
How to extract table data from PDF as CSV from the command line?

Tags: grep, pdf, pdftotext

I want to extract all rows from here while ignoring the column headers as well as all page headers, i.e. Supported Devices.

pdftotext -layout DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
 | sed '$d'                                                  \
 | sed -r 's/ +/,/g; s/ //g'                                 \
 > output.csv

The resulting file should be in CSV spreadsheet format (comma separated value fields).

In other words, I want to improve the above command so that the output doesn't break at all. Any ideas?


asked May 18 '15 18:05

user706838

3 Answers

I'll offer you another solution as well.

While in this case the pdftotext method works with reasonable effort, there may be cases where the column widths differ from page to page (your PDF is rather benign in that respect).
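For reference, here is a rough sketch of what the pdftotext route could look like. The function name `layout_to_csv` is made up for illustration, and the whole thing rests on assumptions that may not hold for every page: columns are separated by at least two spaces, no cell contains two consecutive spaces, and no cell is empty (a row with an empty cell would come out with too few fields).

```shell
# Sketch: convert `pdftotext -layout` output (stdin) to CSV (stdout).
# Assumptions: columns are separated by two or more spaces, no cell
# contains two consecutive spaces, and no cell is empty.
#
# Steps:
#   1. tr   removes the form feeds pdftotext emits at page breaks
#   2. grep drops page headers, column-header rows and blank lines
#   3. sed  trims each line, doubles embedded quotes, turns every run
#           of 2+ spaces into a field separator and quotes all fields
layout_to_csv() {
    tr -d '\f' |
    grep -v -e 'Supported Devices' -e 'Retail Branding' -e '^[[:space:]]*$' |
    sed -e 's/^ *//' -e 's/ *$//' \
        -e 's/"/""/g' \
        -e 's/ \{2,\}/","/g' \
        -e 's/^/"/' -e 's/$/"/'
}
```

It would be used as `pdftotext -layout DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - | layout_to_csv > output.csv`. Quoting every field keeps names that contain commas or double quotes intact, which the single `s/ +/,/g` substitution cannot do.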

When column widths do vary between pages, the not-so-well-known but pretty cool free and open-source tool Tabula-Extractor is the best choice.

I myself am using the direct GitHub checkout:

$ cd $HOME ; mkdir svn-stuff ; cd svn-stuff
$ git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor

I wrote myself a pretty simple wrapper script like this:

$ cat ~/bin/tabulaextr
#!/bin/bash
cd ${HOME}/svn-stuff/git.tabula-extractor/bin
# Quote "$@" so that arguments containing spaces are passed through intact.
./tabula "$@"

Since ~/bin/ is in my $PATH, I just run

$ tabulaextr --pages all                                 \
         $(pwd)/DAC06E7D1302B790429AF6E84696FCFAB20B.pdf \
        | tee my.csv

to extract all the tables from all pages and convert them to a single CSV file.

The first ten (out of a total of 8727) lines of the CSV look like this:

$ head DAC06E7D1302B790429AF6E84696FCFAB20B.csv
Retail Branding,Marketing Name,Device,Model
"","",AD681H,Smartfren Andromax AD681H
"","",FJL21,FJL21
"","",Luno,Luno
"","",T31,Panasonic T31
"","",hws7721g,MediaPad 7 Youth 2
3Q,OC1020A,OC1020A,OC1020A
7Eleven,IN265,IN265,IN265
A.O.I. ELECTRONICS FACTORY,A.O.I.,TR10CS1_11,TR10CS1
AG Mobile,Status,Status,Status

which in the original PDF look like this:

Screenshot from top of first page of sample PDF

It even got these lines on the last page, 293, right:

nabi,"nabi Big Tab HD\xe2\x84\xa2 20""",DMTAB-NV20A,DMTAB-NV20A
nabi,"nabi Big Tab HD\xe2\x84\xa2 24""",DMTAB-NV24A,DMTAB-NV24A

which look on the PDF page like this:

last page of sample PDF

TabulaPDF and Tabula-Extractor are really, really cool for jobs like this!
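Since the question asks for the column headers to be ignored, the header row still has to be filtered out of the CSV. A minimal sketch, assuming the header row reads exactly as in the sample output above; `drop_header_rows` is a made-up name, and stripping carriage returns is only a precaution in case the CSV uses CRLF line endings:

```shell
# Sketch: remove every line that is exactly the column-header row.
# The header text is copied from the sample output; adjust it if your
# CSV differs. Reads CSV on stdin, writes the filtered CSV to stdout.
drop_header_rows() {
    # tr:   strip carriage returns (precaution for CRLF line endings)
    # grep: -F fixed string, -x whole-line match, -v invert the match
    tr -d '\r' |
    grep -v -x -F 'Retail Branding,Marketing Name,Device,Model'
}
```

For example: `drop_header_rows < my.csv > my-noheader.csv`. Because the match is on the whole line, device rows that merely contain one of the header words are left alone.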

Update

Here is an asciinema screencast (which you can also download and replay locally in your Linux/Mac OS X/Unix terminal with the help of the asciinema command-line tool), starring tabula-extractor:

asciicast


answered Nov 02 '22 17:11

Kurt Pfeifle

As Martin R commented, tabula-java is the new version of tabula-extractor and is actively maintained. Version 1.0.0 was released on July 21st, 2017.

Download the jar file and run it with a recent Java:

java -jar ./tabula-1.0.0-jar-with-dependencies.jar \
    --pages=all \
    ./DAC06E7D1302B790429AF6E84696FCFAB20B.pdf > support_devices.csv
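If you run this often, the command can be wrapped in a small shell function. This is only a sketch: `TABULA_JAR` and `pdf_tables_to_csv` are hypothetical names, and the jar path defaults to the file name used above.

```shell
# Sketch of a wrapper around the tabula-java command shown above.
# TABULA_JAR is a hypothetical variable; point it at your downloaded jar.
TABULA_JAR=${TABULA_JAR:-./tabula-1.0.0-jar-with-dependencies.jar}

pdf_tables_to_csv() {
    # $1: input PDF, $2: output CSV
    if [ "$#" -ne 2 ]; then
        echo "usage: pdf_tables_to_csv INPUT.pdf OUTPUT.csv" >&2
        return 2
    fi
    # Same invocation as above, with quoted arguments so that paths
    # containing spaces survive.
    java -jar "$TABULA_JAR" --pages=all "$1" > "$2"
}
```

Usage would then be `pdf_tables_to_csv DAC06E7D1302B790429AF6E84696FCFAB20B.pdf support_devices.csv`.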

answered Nov 02 '22 17:11

Nobu

This can be done easily with an IntelliGet (http://akribiatech.com/intelliget) script such as the one below:

userVariables = brand, name, device, model;
{ start = Not(Or(Or(IsSubstring("Supported Devices",Line(0)),
                  IsSubstring("Retail Branding",Line(0))),
                IsEqual(Length(Trim(Line(0))),0))); 
  brand = Trim(Substring(Line(0),10,44));
  name = Trim(Substring(Line(0),45,79));
  device = Trim(Substring(Line(0),80,114));
  model = Trim(Substring(Line(0),115,200));
  output = Concat(brand, ",", name, ",", device, ",", model);
}
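The same fixed-width idea can be sketched with standard tools on top of `pdftotext -layout` output. This is not IntelliGet itself, just an awk approximation: it assumes the offsets in the script above are 0-based, inclusive start and end character positions (10 to 44, 45 to 79, 80 to 114, 115 to 200), which I have not verified against the IntelliGet documentation. The function name `fixed_width_to_csv` is made up.

```shell
# Sketch: slice fixed-width lines (stdin) into four CSV fields (stdout).
# Assumption: the column boundaries are the 0-based, inclusive ranges
# 10-44, 45-79, 80-114 and 115-200; awk substr() is 1-based and takes
# a length, hence (11,35), (46,35), (81,35) and (116,86).
fixed_width_to_csv() {
    awk '
        # Remove leading and trailing blanks from a field.
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

        # Skip page headers, column-header rows and blank lines.
        /Supported Devices/ || /Retail Branding/ || /^[ \t]*$/ { next }

        {
            brand  = trim(substr($0, 11, 35))
            name   = trim(substr($0, 46, 35))
            device = trim(substr($0, 81, 35))
            model  = trim(substr($0, 116, 86))
            print brand "," name "," device "," model
        }
    '
}
```

It would be fed like this: `pdftotext -layout DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - | fixed_width_to_csv > output.csv`. Unlike splitting on runs of spaces, slicing by position keeps empty cells in place, but like the script above it does not quote fields, so a name containing a comma would still need extra handling.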

answered Nov 02 '22 16:11

user3354850
