I want to extract all rows from here while ignoring the column headers as well as all page headers, i.e. Supported Devices
.
pdftotext -layout DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
| sed '$d' \
| sed -r 's/ +/,/g; s/ //g' \
> output.csv
The resulting file should be in CSV spreadsheet format (comma separated value fields).
In other words, I want to improve the above command so that the output doesn't brake at all. Any ideas?
The most basic method of extracting data from a PDF file to Excel is to simply copy and paste. This consists of opening the file, selecting the relevant text, and copying and pasting it into an Excel sheet. This method may be the best option if you only have a few PDF files.
I'll offer you another solution as well.
While in this case the pdftotext
method works with reasonable effort, there may be cases where not each page has the same column widths (as your rather benign PDF shows).
Here the not-so-well-known, but pretty cool Free and OpenSource Software Tabula-Extractor
is the best choice.
I myself am using the direct GitHub checkout:
$ cd $HOME ; mkdir svn-stuff ; cd svn-stuff $ git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
I wrote myself a pretty simple wrapper script like this:
$ cat ~/bin/tabulaextr #!/bin/bash cd ${HOME}/svn-stuff/git.tabula-extractor/bin ./tabula $@
Since ~/bin/
is in my $PATH
, I just run
$ tabulaextr --pages all \ $(pwd)/DAC06E7D1302B790429AF6E84696FCFAB20B.pdf \ | tee my.csv
to extract all the tables from all pages and convert them to a single CSV file.
The first ten (out of a total of 8727) lines of the CVS look like this:
$ head DAC06E7D1302B790429AF6E84696FCFAB20B.csv Retail Branding,Marketing Name,Device,Model "","",AD681H,Smartfren Andromax AD681H "","",FJL21,FJL21 "","",Luno,Luno "","",T31,Panasonic T31 "","",hws7721g,MediaPad 7 Youth 2 3Q,OC1020A,OC1020A,OC1020A 7Eleven,IN265,IN265,IN265 A.O.I. ELECTRONICS FACTORY,A.O.I.,TR10CS1_11,TR10CS1 AG Mobile,Status,Status,Status
which in the original PDF look like this:
It even got these lines on the last page, 293, right:
nabi,"nabi Big Tab HD\xe2\x84\xa2 20""",DMTAB-NV20A,DMTAB-NV20A nabi,"nabi Big Tab HD\xe2\x84\xa2 24""",DMTAB-NV24A,DMTAB-NV24A
which look on the PDF page like this:
TabulaPDF and Tabula-Extractor are really, really cool for jobs like this!
Here is an ASCiinema screencast (which you also can download and re-play locally in your Linux/MacOSX/Unix terminal with the help of the asciinema
command line tool), starring tabula-extractor
:
As Martin R commented, tabula-java
is the new version of tabula-extractor
and active. 1.0.0 was released on July 21st, 2017.
Download the jar file and with the latest java:
java -jar ./tabula-1.0.0-jar-with-dependencies.jar \ --pages=all \ ./DAC06E7D1302B790429AF6E84696FCFAB20B.pdf > support_devices.csv
This can be done easily with an IntelliGet (http://akribiatech.com/intelliget) script as below
userVariables = brand, name, device, model;
{ start = Not(Or(Or(IsSubstring("Supported Devices",Line(0)),
IsSubstring("Retail Branding",Line(0))),
IsEqual(Length(Trim(Line(0))),0)));
brand = Trim(Substring(Line(0),10,44));
name = Trim(Substring(Line(0),45,79));
device = Trim(Substring(Line(0),80,114));
model = Trim(Substring(Line(0),115,200));
output = Concat(brand, ",", name, ",", device, ",", model);
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With