 

Find repeated patterns in web pages in Ruby

I am trying to find repeated patterns in web pages so that I can extract the content into my database.

EDIT: I don't know what the repeated pattern is beforehand, so I can't just search for a given pattern with a regex or something similar.

For example, say you have 10 sites selling cars, and the sites are all different. On each site, the cars are listed in the HTML in a repeated way down the page.

The other sites list their cars in a different way, but each uses its own repeated pattern.

Does anyone know how to do this, or have any experience with this sort of thing?

I love Ruby, so I was hoping to do it in Ruby. Does anyone know of any libs / gems that may help me out?

rick asked Nov 13 '22 22:11



1 Answer

Rick, machine pattern matching is a complicated topic, and not something you'll find a good out-of-the-box library for in Ruby.

Kyle's answer was a start. Once you have fetched the page with Ruby, the typical technology for this is XPath, "the XML Path Language".

Using XPath you can write simple selectors that extract every item matching a pattern. For instance, every link in an HTML document is //a, every h1 is //h1, and every image directly inside a div, where the image has the class "car", is something like //div/img[@class="car"]. Note that the HTML image element is img, and that attributes are tested with a leading @.

The result of an XPath query is an enumerable list of the matching items. You can then query each item for sub-elements, read the text content of the elements, and build relationships between them to extract the data you need.

The go-to library for Ruby is called Nokogiri, and it is available as a gem. The direct documentation is a little weak, but it's all covered there if you know what to look for.

Some Ruby libraries combine crawling with an easy way to access the underlying HTML/XML as a Nokogiri document. One such example is Anemone, a "framework for building web spiders in Ruby", and I can recommend it very highly.

Lee Hambley answered Dec 06 '22 08:12
