
Convert PDF with columns to text

Tags:

python

pdf

On Unix or Windows, I want to convert this PDF dictionary into a Python dictionary. I copied the contents of the PDF and pasted them into an .rtf file, intending to read it with Python. However, the result looks something like this:

A /e/ noun a human blood type of the ABO system, containing the A antigen (NOTE: Some- one with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.)
AA
abdominal distension /􏰄b􏰁dɒmn(ə)l ds 􏰂tenʃ(ə)n/ noun a condition in which the abdo-
men is stretched because of gas or fluid
A
abdominal distension
AA abbr Alcoholics Anonymous

It has essentially squashed the columns from the PDF into a strange mishmash. How do I convert a PDF to text so that the columns are respected? In other words, the desired output is:

A /e/ noun a human blood type of the ABO system, containing the A antigen (NOTE: Some- one with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.)
AA abbr Alcoholics Anonymous

...and so on

user0 asked Mar 28 '15 16:03


1 Answer

You have basically two options to get to the text:

  1. Direct text extraction from each page as-is.
  2. Split each page into two along the column space and extract the text from each half separately.

For the first option, I suggest you first try pdftotext with the -layout parameter. (There are other tools, such as TET, the Text Extraction Toolkit from the PDFlib folks, which you can try if pdftotext isn't good enough.)

To follow the second option using Ghostscript and other tools, you may want to check out my answers to the following questions:

  • Linux-based tool to chop PDFs into multiple pages (Superuser)
  • Convert PDF 2 sides per page to 1 side per page (Superuser)
  • How can I split a PDF's pages down the middle? (Superuser)
  • Cropping a PDF using Ghostscript 9.01 (Stackoverflow)
  • Split one PDF page into two (Stackoverflow)
  • PDF - Remove White Margins (Stackoverflow)
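
Since the goal is a Python dictionary, here is a rough Python sketch of the second option. Note that it does not use the Ghostscript approach from the linked answers; it swaps in pdftotext's own crop options (-x, -y, -W, -H) to read each half of the page separately. The function names, the page size of 595 x 842 points, and the assumption that the gutter sits exactly in the middle of the page are all illustrative guesses, not values taken from the PDF in question.

```python
import subprocess


def half_page_commands(pdf_path, page, width_pt, height_pt):
    """Build one pdftotext command per half of a two-column page.

    Assumption: the gutter is at the horizontal middle of the page and
    pdftotext runs at its default resolution of 72 DPI, so crop values
    given in pixels match PDF points.
    """
    half = width_pt // 2
    commands = []
    for x in (0, half):  # left half first, then right half
        commands.append([
            "pdftotext",
            "-f", str(page), "-l", str(page),        # restrict to a single page
            "-x", str(x), "-y", "0",                 # top-left corner of the crop area
            "-W", str(half), "-H", str(height_pt),   # width and height of the crop area
            pdf_path, "-",                           # "-" sends the text to stdout
        ])
    return commands


def extract_columns(pdf_path, page, width_pt=595, height_pt=842):
    """Return the left column's text followed by the right column's text."""
    texts = []
    for cmd in half_page_commands(pdf_path, page, width_pt, height_pt):
        result = subprocess.run(cmd, capture_output=True, encoding="utf-8", check=True)
        texts.append(result.stdout)
    return "\n".join(texts)
```

Calling extract_columns("dictionary.pdf", 8) (with a hypothetical file name) would need pdftotext on the PATH. The real page dimensions and gutter position should be checked first, for example with pdfinfo.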

pdftotext -layout

You can try it with the command line tool pdftotext. You'll have to decide if it is "good enough" for your purpose.

The following command extracts the text from page 8 only (first page with dual column layout) and prints it to <stdout>:

$ pdftotext -f 8 -l 8 -layout                                         \
           Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf - \
 | head -n 30

results in:

Medicine.fm Page 1 Thursday, November 20, 2003 4:26 PM

                                                          A
 A /e/ noun a human blood type of the ABO                abdominal distension /bdɒmn(ə)l ds
 A                                                        abdominal distension
 system, containing the A antigen (NOTE: Some-              tenʃ(ə)n/ noun a condition in which the abdo-
 one with type A can donate to people of the              men is stretched because of gas or fluid
 same group or of the AB group, and can receive           abdominal pain /b dɒmn(ə)l pen/ noun
                                                          abdominal pain
 blood from people with type A or type O.)                pain in the abdomen caused by indigestion or
 AA
 AA abbr Alcoholics Anonymous                             more serious disorders
 A & E /e ənd  i
                     /, A & E department /e ənd           abdominal viscera /bdɒmn(ə)l    vsərə/
 A & E                                                    abdominal viscera
    i
      d pɑ
           tmənt/ noun same as accident and
                                                          plural noun the organs which are contained in
 emergency department                                     the abdomen, e.g. the stomach, liver and intes-
 A & E medicine /e ənd     i
                              med(ə)sn/
 A & E medicine
                                                          tines
                                                          abdominal wall /b dɒmn(ə)l wɔ
                                                                                        l/ noun
                                                          abdominal wall
 noun the medical procedures used in A & E de-                                                            
 partments                                                muscular tissue which surrounds the abdomen
                                                          abdomino- /bdɒmnəυ/ prefix referring to
                                                          abdomino-

Note the use of -layout! Without it, the extracted text would look like this:

Medicine.fm Page 1 Thursday, November 20, 2003 4:26 PM A A /e/ noun a human blood type of the ABO system, containing the A antigen (NOTE: SomeA

one with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.) AA abbr Alcoholics Anonymous A & E /e ənd i /, A & E department /e ənd i d pɑ tmənt/ noun same as accident and emergency department A & E medicine /e ənd i med(ə)sn/ noun the medical procedures used in A & E deAA

A & E A & E medicine partments AB /e bi / noun a human blood type of the ABO system, containing the A and B antigens AB
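
If the -layout output shown earlier is good enough, the two columns can be pulled apart afterwards in Python. The following is only a minimal sketch: it guesses the gutter as the most common column where text resumes after a run of two or more spaces, then emits all left-hand pieces followed by all right-hand pieces. The function names are made up for illustration, and the heuristic will misplace lines whose right-hand text starts a little before the guessed gutter.

```python
import re
from collections import Counter


def guess_gutter(lines):
    """Return the most common column at which text resumes after 2+ spaces."""
    starts = Counter()
    for line in lines:
        for match in re.finditer(r" {2,}(?=\S)", line):
            starts[match.end()] += 1
    if not starts:
        raise ValueError("no gap of two or more spaces found")
    return starts.most_common(1)[0][0]


def split_columns(layout_text):
    """Reorder 'pdftotext -layout' text: left column first, then right column."""
    lines = layout_text.splitlines()
    gutter = guess_gutter(lines)
    left, right = [], []
    for line in lines:
        if len(line) > gutter and line[gutter - 1] != " ":
            # Text runs straight across the guessed gutter: keep the line whole.
            left.append(line.strip())
            continue
        left_part = line[:gutter].rstrip()
        right_part = line[gutter:].strip()
        if left_part:
            left.append(left_part)
        if right_part:
            right.append(right_part)
    return "\n".join(left + right)
```

The text passed to split_columns would be whatever pdftotext -layout printed for one page. Processing page by page matters, because the gutter position may differ between pages.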

I noticed that page 8 uses the fonts Courier, Helvetica, Helvetica-Bold, Times-Roman and Times-Italic, but the file does not embed them.

This does not pose a problem for text extraction, since all these fonts use /WinAnsiEncoding.

However, there are other fonts, which are embedded as a subset. These fonts do use a /Custom encoding, but they do not provide a /ToUnicode table. This table is required for reliable text extraction (back-translating the glyph names to character names).

The output of pdffonts for page 8 shows this:

$ pdffonts -f 8 -l 8 Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf 
 name                           type        encoding      emb sub uni object ID
 ------------------------------ ----------- ------------- --- --- --- ---------
 Helvetica-Bold                 Type 1      WinAnsi       no  no  no    1505  0
 Courier                        Type 1      WinAnsi       no  no  no    1507  0
 Helvetica                      Type 1      WinAnsi       no  no  no    1497  0
 MOEKLA+Times-PhoneticIPA       Type 1C     Custom        yes yes yes   1509  0
 Times-Roman                    Type 1      WinAnsi       no  no  no    1506  0
 Times-Italic                   Type 1      WinAnsi       no  no  no    1499  0
 IGFBAL+EuropeanPi-Three        Type 1C     Custom        yes yes no    1502  0
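
To run the same check over many pages or files, the pdffonts table can be scanned from Python. This is a small sketch that assumes the column layout shown above (name, type, encoding, emb, sub, uni, object ID) and flags embedded fonts whose uni column reads "no". The function name is hypothetical.

```python
def fonts_without_tounicode(pdffonts_output):
    """Return names of embedded fonts whose 'uni' column is 'no'."""
    lines = pdffonts_output.splitlines()
    # Data rows start after the dashed separator line, if there is one.
    start = next(
        (i for i, line in enumerate(lines) if line.strip().startswith("---")), -1
    ) + 1
    flagged = []
    for line in lines[start:]:
        # Split from the right, because the 'type' column may contain spaces.
        fields = line.rsplit(None, 6)
        if len(fields) != 7:
            continue
        head, encoding, emb, sub, uni, obj_num, obj_gen = fields
        name = head.split()[0]
        if emb == "yes" and uni == "no":
            flagged.append(name)
    return flagged
```

The input is the plain text printed by pdffonts, for example captured with subprocess.run. Fonts it reports are the ones where extracted characters are most likely to come out wrong or missing.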

As it happens, I recently hand-coded 5 different PDF files, with commented source code, for a new GitHub project. These 5 files demonstrate the importance of a correct /ToUnicode table for each font that is embedded as a subset. They can be found here, along with a README that explains this in more detail:

  • https://github.com/angea/PDF101/tree/master/handcoded/textextract

Kurt Pfeifle answered Sep 23 '22 02:09