I would like to do some analysis on some properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured format but instead provides a 700+ page PDF of the properties going up for auction. I'm wondering if the community has any thoughts as to how I can approach parsing said PDF into a structured format for insertion into a db or to create a spreadsheet of the properties. Here's an image of what each page represents: <img src="https://i.stack.imgur.com/9c6Oe.png" alt="Property Guide"> And here's a page that lists some properties: <img src="https://i.stack.imgur.com/lk98z.png" alt="Sample List of Properties"> I'm comfortable with python and ruby so I don't have any issues scripting up a solution, but because the "columns" and the data in those said columns aren't necessary tied together, it seems like this would be a dubious proposition. Any ideas would be greatly appreciated.

After mucking around with this for 3 hours, I was able to create a parseable XML document from the data. Unfortunately, I was unsuccessful with putting together a completely reusable set of steps that I can use for future auctions publications. As an aside, I did attempt to call and ask Los Angeles County if they could provide an alternative format of the properties up for auction (excel, etc) and the answer was no. That's government for you. Here's a high-level view of my approach: <ul> <li>Convert the PDF into a text file using Poppler </li> <li>Use RegEx foo to clean up and create XML nodes from the data</li> <li>Use an XML beautifier / validator to find errors and do cleanup</li> <li>Use Python/ruby to add Google Maps Link node, and link to LA County Assessors Map (http://assessormap.co.la.ca.us/mapping/rolldata.asp?ain=APN-GOES_HERE) and</li> <li>Convert XML to CSV with Ruby</li> </ul> I used http://xmlbeautifier.com/ as my XML beautifier / validator because it was fast and it gave accurate error reporting, including line numbers. Use Homebrew to install Poppler for Mac: <pre class="prettyprint"><code>brew install poppler </code></pre> After Poppler is installed, you should have access to the pdftotext utility to convert the PDF: <pre class="prettyprint"><code>pdftotext -layout -f 24 -l 687 AuctionBook2013.pdf auction_book.txt </code></pre> Here's a preview of the XML (Click here for full XML): <pre class="prettyprint"><code><?xml version="1.0" encoding="UTF-8"?> <listings> <item id="1"> <nsb>536</nsb> <minbid>3,422</minbid> <apn>2006 003 001</apn> <delinquent_year>03</delinquent_year> <apn_old>2006 003 001</apn_old> <description>LICENSED SURVEYOR'S MAP AS PER BK 25 PG 28 OF L S LOT 1 BLK 1 ASSESSED TO J AND S LIMITED LLC C/O DUNA CSARDAS - JULIUS JANCSO LOCATION COUNTY OF LOS ANGELES</description> <address>VACANT LOT</address> </item> </code></pre> Edit: Adding the Ruby I wrote to convert the XML to a CSV. <pre class="prettyprint"><code>require 'rexml/document' require 'CSV' class Auction def initialize f = File.new('AuctionBook2013.xml', 'r') doc = REXML::Document.new(f) CSV.open("auction.csv", "w+b") do |csv| csv << ['id', 'minbid', 'apn', 'delinquent_year', 'apn_old', 'description', 'address'] doc.elements.each('/listings/item') do |item| csv << [item.attributes['id'], item.elements['minbid'].text, item.elements['apn'].text, item.elements['delinquent_year'].text, item.elements['apn_old'].text, item.elements['description'].text, item.elements['address'].text] end end end end a = Auction.new() </code></pre> Link to Final CSV

Looking for recommendation on how to convert PDF into structured format

Tags:

python

parsing

ruby

pdf

pdf-parsing

I would like to do some analysis on some properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured format but instead provides a 700+ page PDF of the properties going up for auction.

I'm wondering if the community has any thoughts as to how I can approach parsing said PDF into a structured format for insertion into a db or to create a spreadsheet of the properties.

Here's an image of what each page represents: Property Guide

And here's a page that lists some properties: Sample List of Properties

I'm comfortable with python and ruby so I don't have any issues scripting up a solution, but because the "columns" and the data in those said columns aren't necessary tied together, it seems like this would be a dubious proposition.

Any ideas would be greatly appreciated.

975

asked Aug 19 '13 18:08

doremi

1 Answers

After mucking around with this for 3 hours, I was able to create a parseable XML document from the data. Unfortunately, I was unsuccessful with putting together a completely reusable set of steps that I can use for future auctions publications.

As an aside, I did attempt to call and ask Los Angeles County if they could provide an alternative format of the properties up for auction (excel, etc) and the answer was no. That's government for you.

Here's a high-level view of my approach:

Convert the PDF into a text file using Poppler
Use RegEx foo to clean up and create XML nodes from the data
Use an XML beautifier / validator to find errors and do cleanup
Use Python/ruby to add Google Maps Link node, and link to LA County Assessors Map (http://assessormap.co.la.ca.us/mapping/rolldata.asp?ain=APN-GOES_HERE) and
Convert XML to CSV with Ruby

I used http://xmlbeautifier.com/ as my XML beautifier / validator because it was fast and it gave accurate error reporting, including line numbers.

Use Homebrew to install Poppler for Mac:

brew install poppler

After Poppler is installed, you should have access to the pdftotext utility to convert the PDF:

pdftotext -layout -f 24 -l 687 AuctionBook2013.pdf auction_book.txt

Here's a preview of the XML (Click here for full XML):

<?xml version="1.0" encoding="UTF-8"?>
<listings>
   <item id="1">
      <nsb>536</nsb>
      <minbid>3,422</minbid>
      <apn>2006 003 001</apn>
      <delinquent_year>03</delinquent_year>
      <apn_old>2006 003 001</apn_old>
      <description>LICENSED SURVEYOR'S MAP
          AS PER BK 25 PG 28 OF L S LOT 1              
          BLK 1 ASSESSED TO    J   AND   S
          LIMITED LLC C/O DUNA CSARDAS -
          JULIUS JANCSO LOCATION COUNTY OF
          LOS ANGELES</description>
      <address>VACANT LOT</address>
   </item>

Edit: Adding the Ruby I wrote to convert the XML to a CSV.

require 'rexml/document'
require 'CSV'

class Auction

  def initialize

    f = File.new('AuctionBook2013.xml', 'r')
    doc = REXML::Document.new(f)

    CSV.open("auction.csv", "w+b") do |csv|
      csv << ['id', 'minbid', 'apn', 'delinquent_year', 'apn_old', 'description', 'address']

      doc.elements.each('/listings/item') do |item|
        csv << [item.attributes['id'],
                item.elements['minbid'].text,
                item.elements['apn'].text,
                item.elements['delinquent_year'].text,
                item.elements['apn_old'].text,
                item.elements['description'].text,
                item.elements['address'].text]
      end
    end
  end
end

a = Auction.new()

Link to Final CSV

150

answered Oct 19 '22 05:10

doremi

Related questions
                            
                                RSA signing and verification with C#,BouncyCastle and imported RSA key - Working Python example and non-working C# code sample inside
                            
                                Recursion depth error when using BeautifulSoup with multiprocessing pool map
                            
                                How to run localtunnel v2 properly
                            
                                Is the Visitor pattern useful for dynamically typed languages?
                            
                                How to make CMake execute some script after it generates visual studio solution
                            
                                Disabling nose coverage report to STDOUT when HTML report is enabled?
                            
                                Using Parent Directories in Manifest.in
                            
                                How to indent attributes in when prettyprinting xml in python?
                            
                                Is there a plugin for sublime text 2 for debugging python?
                            
                                Documenting a non-existing member with Doxygen
                            
                                OSError: [Errno 11] Resource temporarily unavailable. What causes this?
                            
                                How to properly package flask application with static files
                            
                                Issue with querying Teradata in Python/Pyodbc
                            
                                Bad status line exception while opening a page on django development server
                            
                                PyTables dealing with data with size many times larger than size of memory
                            
                                Numpy Matrix Subtraction Confusion
                            
                                Select rows in a Numpy 2D array with a boolean vector
                            
                                Python Flask cross site HTTP POST - doesn't work for specific allowed origins
                            
                                Is PyBluez alive?
                            
                                What happens if you mix old and new style classes in a hierarchy?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Looking for recommendation on how to convert PDF into structured format

Tags:

python

parsing

ruby

pdf

pdf-parsing

doremi

People also ask

1 Answers

doremi

Recent Activity

Donate For Us