Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Looking for recommendation on how to convert PDF into structured format

I would like to do some analysis on some properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured format but instead provides a 700+ page PDF of the properties going up for auction.

I'm wondering if the community has any thoughts as to how I can approach parsing said PDF into a structured format for insertion into a db or to create a spreadsheet of the properties.

Here's an image of what each page represents: Property Guide

And here's a page that lists some properties: Sample List of Properties

I'm comfortable with python and ruby so I don't have any issues scripting up a solution, but because the "columns" and the data in those said columns aren't necessary tied together, it seems like this would be a dubious proposition.

Any ideas would be greatly appreciated.

like image 975
doremi Avatar asked Aug 19 '13 18:08

doremi


People also ask

Which PDF format should I use?

PDF/X – This standard best suits print professionals, graphic designers, and creative professionals who need to create high-quality, professional-grade documents. This PDF standard will ensure print-ready documents by correctly embedding fonts, images, color profiles, and more.

How a PDF is structured?

File format. A PDF file is organized using ASCII characters, except for certain elements that may have binary content. The file starts with a header containing a magic number (as a readable string) and the version of the format, for example %PDF-1.7 .


1 Answers

After mucking around with this for 3 hours, I was able to create a parseable XML document from the data. Unfortunately, I was unsuccessful with putting together a completely reusable set of steps that I can use for future auctions publications.

As an aside, I did attempt to call and ask Los Angeles County if they could provide an alternative format of the properties up for auction (excel, etc) and the answer was no. That's government for you.

Here's a high-level view of my approach:

  • Convert the PDF into a text file using Poppler
  • Use RegEx foo to clean up and create XML nodes from the data
  • Use an XML beautifier / validator to find errors and do cleanup
  • Use Python/ruby to add Google Maps Link node, and link to LA County Assessors Map (http://assessormap.co.la.ca.us/mapping/rolldata.asp?ain=APN-GOES_HERE) and
  • Convert XML to CSV with Ruby

I used http://xmlbeautifier.com/ as my XML beautifier / validator because it was fast and it gave accurate error reporting, including line numbers.

Use Homebrew to install Poppler for Mac:

brew install poppler

After Poppler is installed, you should have access to the pdftotext utility to convert the PDF:

pdftotext -layout -f 24 -l 687 AuctionBook2013.pdf auction_book.txt

Here's a preview of the XML (Click here for full XML):

<?xml version="1.0" encoding="UTF-8"?>
<listings>
   <item id="1">
      <nsb>536</nsb>
      <minbid>3,422</minbid>
      <apn>2006 003 001</apn>
      <delinquent_year>03</delinquent_year>
      <apn_old>2006 003 001</apn_old>
      <description>LICENSED SURVEYOR'S MAP
          AS PER BK 25 PG 28 OF L S LOT 1              
          BLK 1 ASSESSED TO    J   AND   S
          LIMITED LLC C/O DUNA CSARDAS -
          JULIUS JANCSO LOCATION COUNTY OF
          LOS ANGELES</description>
      <address>VACANT LOT</address>
   </item>

Edit: Adding the Ruby I wrote to convert the XML to a CSV.

require 'rexml/document'
require 'CSV'

class Auction

  def initialize

    f = File.new('AuctionBook2013.xml', 'r')
    doc = REXML::Document.new(f)

    CSV.open("auction.csv", "w+b") do |csv|
      csv << ['id', 'minbid', 'apn', 'delinquent_year', 'apn_old', 'description', 'address']

      doc.elements.each('/listings/item') do |item|
        csv << [item.attributes['id'],
                item.elements['minbid'].text,
                item.elements['apn'].text,
                item.elements['delinquent_year'].text,
                item.elements['apn_old'].text,
                item.elements['description'].text,
                item.elements['address'].text]
      end
    end
  end
end

a = Auction.new()

Link to Final CSV

like image 150
doremi Avatar answered Oct 19 '22 05:10

doremi