Parsing huge (~100mb) kml (xml) file taking *hours* without any sign of actual parsing

Question

I'm currently trying to parse a very large kml (xml) file with ruby (Nokogiri) and am having a little bit of trouble.

The parsing code is good; in fact, I'll share it just for the heck of it, even though it doesn't have much to do with my problem:

geofactory = RGeo::Geographic.projected_factory(:projection_proj4 => "+proj=lcc +lat_1=34.83333333333334 +lat_2=32.5 +lat_0=31.83333333333333 +lon_0=-81 +x_0=609600 +y_0=0 +ellps=GRS80 +to_meter=0.3048 +no_defs", :projection_srid => 3361)
f = File.open("horry_parcels.kml")
kmldoc = Nokogiri::XML(f)

kmldoc.css("//Placemark").each_with_index do |placemark, i|
  puts i
  tds = Nokogiri::HTML(placemark.search("//description").children[0].to_html).search("tr > td")
  h = HorryParcel.new
  h.owner_name = tds.shift.text
  tds.shift
  tds.each_slice(2) do |k, v|
    col = k.text.downcase
    eval("h.#{col} = v.text")
  end
  coords = kmldoc.search("//MultiGeometry")[i].text.gsub("\n", "").gsub("\t", "").split(",0 ").map {|x| x.split(",")}
  points = coords.map { |lon, lat| geofactory.parse_wkt("POINT (#{lon} #{lat})") }
  geo_shape = geofactory.polygon(geofactory.linear_ring(points))
  proj_shape = geo_shape.projection
  h.geo_shape = geo_shape
  h.proj_shape = proj_shape
  h.save
end

Anyway, I've tested this code with a much, much smaller sample of kml and it works.

However, when I load the real thing, ruby simply waits, as if it is processing something. This "processing", however, has now spanned several hours while I've been doing other things. As you might have noticed, I have a counter (each_with_index) on the array of Placemarks and during this multi-hour period, not a single i value has been put to the command line. Oddly enough it hasn't timed out yet, but even if this works there has got to be a better way to do this thing.

I know I could open up the KML file in Google Earth (Google Earth Pro here) and save the data in smaller, more manageable kml files, but the way things appear to be set up, this would be a very manual, unprofessional process.

Here's a sample of the kml (w/ just one placemark) if that helps.

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document>
    <name>justone.kml</name>
    <Style id="PolyStyle00">
        <LabelStyle>
        <color>00000000</color>
        <scale>0</scale>
    </LabelStyle>
    <LineStyle>
        <color>ff0000ff</color>
    </LineStyle>
    <PolyStyle>
        <color>00f0f0f0</color>
    </PolyStyle>
</Style>
<Folder>
    <name>justone</name>
    <open>1</open>
    <Placemark id="ID_010161">
        <name>STUART CHARLES A JR</name>
        <Snippet maxLines="0"></Snippet>
        <description>""</description>
        <styleUrl>#PolyStyle00</styleUrl>
        <MultiGeometry>
            <Polygon>
                <outerBoundaryIs>
                    <LinearRing>
                        <coordinates>
                            -78.941896,33.867893,0     -78.942514,33.868632,0 -78.94342899999999,33.869705,0 -78.943708,33.870083,0 -78.94466799999999,33.871142,0 -78.94511900000001,33.871639,0 -78.94541099999999,33.871776,0 -78.94635,33.872216,0 -78.94637899999999,33.872229,0 -78.94691400000001,33.87248,0 -78.94708300000001,33.87256,0 -78.94783700000001,33.872918,0 -78.947889,33.872942,0 -78.948655,33.873309,0 -78.949589,33.873756,0 -78.950164,33.87403,0 -78.9507,33.873432,0 -78.95077000000001,33.873384,0 -78.950867,33.873354,0 -78.95093199999999,33.873334,0 -78.952518,33.871631,0 -78.95400600000001,33.869583,0 -78.955254,33.867865,0 -78.954606,33.867499,0 -78.953833,33.867172,0 -78.952994,33.866809,0 -78.95272799999999,33.867129,0 -78.952139,33.866803,0 -78.95152299999999,33.86645,0 -78.95134299999999,33.866649,0 -78.95116400000001,33.866847,0 -78.949281,33.867363,0 -78.948936,33.866599,0 -78.94721699999999,33.866927,0 -78.941896,33.867893,0 
                        </coordinates>
                    </LinearRing>
                </outerBoundaryIs>
            </Polygon>
        </MultiGeometry>
    </Placemark>
      </Folder>
  </Document>
</kml>
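
For reference, the text inside <coordinates> in this sample is a whitespace-separated list of lon,lat,alt tuples with uneven runs of spaces between them. A minimal sketch of splitting it on any whitespace, so the result does not depend on the exact spacing (parse_coordinates is a hypothetical helper name, not part of the code above, and the input string is just the first three tuples of the sample):

```ruby
# Turns the text of a KML <coordinates> element into [lon, lat] float pairs.
# The altitude (third value of each tuple) is read and discarded.
def parse_coordinates(text)
  text.split.map do |tuple|          # split with no argument splits on any whitespace
    lon, lat, _alt = tuple.split(",")
    [Float(lon), Float(lat)]         # Float() raises on malformed numbers
  end
end

sample = "\n    -78.941896,33.867893,0     -78.942514,33.868632,0 -78.94342899999999,33.869705,0 \n"
pairs = parse_coordinates(sample)
p pairs
```

Using Float() rather than to_f is a deliberate choice here: a damaged tuple raises an error instead of silently becoming 0.0.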

EDIT: 99.9% of the data I work with is in *.shp format, so I've just ignored this problem for the past week. But I'm going to get this process running on my desktop computer (off of my laptop) and run it until it either times out or finishes.

class ClassName
  attr_reader :before, :after

  def go
    @before = Time.now
    run_actual_code
    @after = Time.now
    puts "process took #{@after - @before} seconds to complete"
  end

  def run_actual_code
    ...
  end
end

The above code should tell me how long it took. From that (if it does actually finish) we should be able to compute a rough rule of thumb for how long you should expect your (otherwise PERFECT) code to run without SAX parsing or "atomization" of the document's text components.

fotanus · Accepted Answer

For a huge XML file, you should not use Nokogiri's default XML parser, because it parses the whole file into a DOM. A much better parsing strategy for large XML files is SAX. Luckily, Nokogiri supports SAX.

The downside is that with a SAX parser all the logic has to be done with callbacks. The idea is simple: the SAX parser starts reading the file and lets you know whenever it finds something interesting, for example an opening tag, a closing tag, or some text. You bind callbacks to these events and extract whatever you need.

Of course, you don't want to use a SAX parser to load the whole file into memory and work with it there; that is exactly what SAX is meant to avoid. You will need to process the file piece by piece.

So this basically means rewriting your parsing logic with callbacks. To learn more about XML DOM vs. SAX parsers, you might want to check this FAQ from cs.nmsu.edu.

boulder_ruby · Answer

I actually ended up getting a copy of the data from a more accessible source, but I'm back here because I wanted to present a possible solution to the general problem. Less. Less was built a long time ago and is part of Unix by default in most cases.

http://en.wikipedia.org/wiki/Less_%28Unix%29

Not related to the stylesheet language ("LESS"), less is a text viewer (cannot edit files, only read them) that does not load the entire document it is reading until you have scanned through the entire thing yourself. I.e., it loads the first "page", so to speak, and waits for you to call for the next one.

If a ruby script could somehow pipe "pages" of text into... oh wait... the XML structure wouldn't allow it, because a page wouldn't contain the closing delimiters that sit at the end of the undigested text file. So you would have to do some custom work on the front end: cut out the first couple of parent brackets so that you can pluck out the XML children one by one. The last closing parent brackets would then break the script, I guess, because the parser will think it is finished and then come across another closing bracket.

I haven't tried this and don't have anything to try it on. But if I did, I'd probably try piping n-lot blocks of text into ruby (or python, etc.) via less or something similar to it. Perhaps something more primitive than less; I'm not sure.
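
A rough, untested-on-real-data sketch of that idea in plain ruby, without less: read the file in fixed-size chunks and pluck out each <Placemark>...</Placemark> block as text, ignoring the parent brackets entirely. This is string scanning, not XML parsing, so it assumes Placemarks are never nested and that the literal text </Placemark> never shows up inside a comment or CDATA section. each_placemark_text and the chunk size are made up for this example.

```ruby
require "stringio"

# Yields each <Placemark>...</Placemark> block as a (binary) string, reading
# the input chunk by chunk so the whole file is never held in memory.
def each_placemark_text(io, chunk_size: 64 * 1024)
  return enum_for(:each_placemark_text, io, chunk_size: chunk_size) unless block_given?

  open_re   = /<Placemark[\s>]/
  close_tag = "</Placemark>"
  tail      = "<Placemark".length
  buffer    = String.new

  while (chunk = io.read(chunk_size))
    buffer << chunk
    loop do
      start = buffer.index(open_re)
      if start.nil?
        # Keep a short tail in case an opening tag straddles two chunks.
        buffer = buffer[-tail..] || buffer
        break
      end
      finish = buffer.index(close_tag, start)
      if finish.nil?
        buffer = buffer[start..] # drop everything before the open tag
        break
      end
      finish += close_tag.length
      yield buffer[start...finish]
      buffer = buffer[finish..]
    end
  end
end

xml = '<kml><Document><name>x</name><Placemark id="a"><name>A</name></Placemark>' \
      "\n" \
      '<Placemark id="b"><name>B</name></Placemark></Document></kml>'

# A tiny chunk size forces tags to be split across reads.
blocks = each_placemark_text(StringIO.new(xml), chunk_size: 16).to_a
puts blocks
```

Each yielded block is small enough to hand to an ordinary XML parser on its own, which sidesteps the unmatched closing brackets described above.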

Tags: xml, ruby, large-data, timeout, kml
