In Unix or Windows, I want to convert this dictionary to a Python dictionary. I copied the contents of the PDF dictionary and put them in a .rtf file, intending to read them with Python. However, it gives something like:
A /e/ noun a human blood type of the ABO system, containing the A antigen (NOTE: Some- one with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.)
AA
abdominal distension /bdɒmn(ə)l ds tenʃ(ə)n/ noun a condition in which the abdo-
men is stretched because of gas or fluid
A
abdominal distension
AA abbr Alcoholics Anonymous
It has essentially squashed the columns from the PDF into a strange mishmash. How do I convert a PDF to text so that the columns are respected? In other words, the desired output is:
A /e/ noun a human blood type of the ABO system, containing the A antigen (NOTE: Some- one with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.)
AA abbr Alcoholics Anonymous
...and so on
Open a PDF file containing a scanned image in Acrobat for Mac or PC. Click on the “Edit PDF” tool in the right pane. Acrobat automatically applies optical character recognition (OCR) to your document and converts it to a fully editable copy of your PDF. Click the text element you wish to edit and start typing.
Open the file with MS Word File → Open. Confirm conversion. Select document or paragraph. Change Columns Layout → Columns → 1 Column.
You have basically two options to get to the text:
For the first option, I suggest you first try pdftotext, but with the parameter -layout. (There are other tools, such as TET, the Text Extraction Toolkit from the PDFlib folks, which you can try if pdftotext isn't good enough.)
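Since the goal is to process the text in Python, here is a minimal sketch of calling pdftotext with -layout from a script. The helper names build_pdftotext_cmd and extract_layout_text are made up for illustration, and the sketch assumes pdftotext is installed and on the PATH. It uses only the -layout, -enc, -f and -l options and "-" as the output file (meaning stdout).

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=None, last_page=None):
    """Build a pdftotext command line that keeps the physical page layout."""
    cmd = ["pdftotext", "-layout", "-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # "-" as the output file name sends the extracted text to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_layout_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    cmd = build_pdftotext_cmd(pdf_path, first_page, last_page)
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain characters such as parentheses or plus signs.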
To follow the road of the second option, using Ghostscript and other tools, you may want to check out my answers to the following questions:
pdftotext -layout
You can try it with the command line tool pdftotext. You'll have to decide if it is "good enough" for your purpose.
The following command extracts the text from page 8 only (first page with dual column layout) and prints it to <stdout>:
$ pdftotext -f 8 -l 8 -layout \
Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf - \
| head -n 30
results in:
Medicine.fm Page 1 Thursday, November 20, 2003 4:26 PM
A
A /e/ noun a human blood type of the ABO abdominal distension /bdɒmn(ə)l ds
A abdominal distension
system, containing the A antigen (NOTE: Some- tenʃ(ə)n/ noun a condition in which the abdo-
one with type A can donate to people of the men is stretched because of gas or fluid
same group or of the AB group, and can receive abdominal pain /b dɒmn(ə)l pen/ noun
abdominal pain
blood from people with type A or type O.) pain in the abdomen caused by indigestion or
AA
AA abbr Alcoholics Anonymous more serious disorders
A & E /e ənd i
/, A & E department /e ənd abdominal viscera /bdɒmn(ə)l vsərə/
A & E abdominal viscera
i
d pɑ
tmənt/ noun same as accident and
plural noun the organs which are contained in
emergency department the abdomen, e.g. the stomach, liver and intes-
A & E medicine /e ənd i
med(ə)sn/
A & E medicine
tines
abdominal wall /b dɒmn(ə)l wɔ
l/ noun
abdominal wall
noun the medical procedures used in A & E de-
partments muscular tissue which surrounds the abdomen
abdomino- /bdɒmnəυ/ prefix referring to
abdomino-
Note the use of -layout! Without it, the extracted text would look like this:
Medicine.fm Page 1 Thursday, November 20, 2003 4:26 PM A A /e/ noun a human blood type of the ABO system, containing the A antigen (NOTE: SomeA
one with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.) AA abbr Alcoholics Anonymous A & E /e ənd i /, A & E department /e ənd i d pɑ tmənt/ noun same as accident and emergency department A & E medicine /e ənd i med(ə)sn/ noun the medical procedures used in A & E deAA
A & E A & E medicine partments AB /e bi / noun a human blood type of the ABO system, containing the A and B antigens AB
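The -layout output keeps the two columns side by side, so getting the entries in reading order still needs a post-processing step. The following is only a rough sketch of one possible heuristic, not part of pdftotext: it looks for the character column near the middle of the page that is blank in the most lines, treats it as the gutter, and emits the left column followed by the right column. The function names are hypothetical, and the approach assumes a single page with exactly two columns. Running headers and lines that straddle the gutter (as in the sample above) will still need manual cleanup.

```python
def find_gutter(lines, lo_frac=0.3, hi_frac=0.7):
    """Return the character column, near the page middle, that is blank most often."""
    rows = [ln for ln in lines if ln.strip()]
    if not rows:
        raise ValueError("no non-blank lines to analyse")
    width = max(len(ln) for ln in rows)
    padded = [ln.ljust(width) for ln in rows]
    lo, hi = int(width * lo_frac), int(width * hi_frac)
    # For every candidate column, count the rows that have a space there.
    blank_counts = {
        col: sum(1 for ln in padded if ln[col] == " ")
        for col in range(lo, hi + 1)
    }
    middle = width // 2
    # Prefer the most frequently blank column; break ties by closeness to the middle.
    return max(blank_counts, key=lambda c: (blank_counts[c], -abs(c - middle)))


def split_columns(text):
    """Split two-column layout text into left-column lines followed by right-column lines."""
    lines = text.splitlines()
    gutter = find_gutter(lines)
    left = [ln[:gutter].strip() for ln in lines]
    right = [ln[gutter:].strip() for ln in lines]
    return [ln for ln in left if ln] + [ln for ln in right if ln]
```

If the gutter position varies from page to page, it is probably safer to run this per page (using -f and -l) than on the whole document at once.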
I noticed that on page 8 the file uses, but does not embed, the fonts Courier, Helvetica, Helvetica-Bold, Times-Roman and Times-Italic.
This does not pose a problem for text extraction, since all these fonts use /WinAnsiEncoding.
However, there are other fonts, which are embedded as a subset. These fonts do use a /Custom encoding, but they do not provide a /ToUnicode table. This table is required for reliable text extraction (back-translating the glyph names to character names).
What I said can be seen in this table:
$ pdffonts -f 8 -l 8 Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf
name                           type        encoding      emb sub uni object ID
------------------------------ ----------- ------------- --- --- --- ---------
Helvetica-Bold                 Type 1      WinAnsi       no  no  no    1505  0
Courier                        Type 1      WinAnsi       no  no  no    1507  0
Helvetica                      Type 1      WinAnsi       no  no  no    1497  0
MOEKLA+Times-PhoneticIPA       Type 1C     Custom        yes yes yes   1509  0
Times-Roman                    Type 1      WinAnsi       no  no  no    1506  0
Times-Italic                   Type 1      WinAnsi       no  no  no    1499  0
IGFBAL+EuropeanPi-Three        Type 1C     Custom        yes yes no    1502  0
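To run the same check from Python, a small parser for the pdffonts output can flag the fonts that are likely to cause trouble. This is a sketch under assumptions: the function names are hypothetical, the column layout is assumed to match the table above (including the encoding column), and font names and encodings are assumed not to contain spaces. Rows are parsed from the right because the type column ("Type 1", "Type 1C") contains a space.

```python
def parse_pdffonts(output):
    """Parse pdffonts output into a list of dicts, one per font."""
    fonts = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the dashed separator, and the header row.
        if len(fields) < 8 or fields[0] == "name":
            continue
        fonts.append({
            "name": fields[0],
            # Everything between the name and the encoding is the font type.
            "type": " ".join(fields[1:-6]),
            "encoding": fields[-6],
            "emb": fields[-5] == "yes",
            "sub": fields[-4] == "yes",
            "uni": fields[-3] == "yes",
            "object_id": (int(fields[-2]), int(fields[-1])),
        })
    return fonts


def fonts_without_tounicode(fonts):
    """Return names of fonts with a Custom encoding and no /ToUnicode table."""
    return [
        f["name"]
        for f in fonts
        if f["encoding"] == "Custom" and not f["uni"]
    ]
```

Applied to the table above, this singles out the font whose "uni" column says "no" while its encoding is Custom; text set in such a font is the part most likely to come out garbled.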
It so happened that I recently hand-coded 5 different PDF files, with commented source code, for a new GitHub project. These 5 files demonstrate the importance of a correct /ToUnicode table for each font that is embedded as a subset. They can be found here, along with a README that explains some more detail.