Using Nokogiri to Split Content on BR tags

Question

I have a snippet of code I'm trying to parse with Nokogiri that looks like this:

<td class="j">
    <a title="title text1" href="http://link1.com">Link 1</a> (info1), Blah 1,<br>
    <a title="title text2" href="http://link2.com">Link 2</a> (info1), Blah 1,<br>
    <a title="title text2" href="http://link3.com">Link 3</a> (info2), Blah 1 Foo 2,<br>
</td>

I have access to the source of the td.j using something like this: data_items = doc.css("td.j")

My goal is to split each of those lines up into an array of hashes. The only logical splitting point I can see is to split on the BRs and then use some regex on the string.

I was wondering if there's a better way to do this, maybe using Nokogiri only? Even if I could use Nokogiri to pull out the 3 line items, it would make things easier for me, as I could just do some regex parsing on the .content result.

I'm not sure how to use Nokogiri to grab lines ending with br, though. Should I be using XPath? Any direction is appreciated. Thank you!

the Tin Man · Accepted Answer

I'm not sure about the point of using an array of hashes, and without an example I can't suggest something. However, for splitting the text on <br> tags, I'd go about it this way:

require 'nokogiri'

doc = Nokogiri::HTML('<td class="j">
    <a title="title text1" href="http://link1.com">Link 1</a> (info1), Blah 1,<br>
    <a title="title text2" href="http://link2.com">Link 2</a> (info1), Blah 1,<br>
    <a title="title text2" href="http://link3.com">Link 3</a> (info2), Blah 1 Foo 2,<br>
</td>')

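# Swap every <br> for a newline so the text can be split on it.
doc.search('br').each do |n|
  n.replace("\n")
end
# The cell is a <td>, so select td.j (not tr.j).
doc.at('td.j').text.split("\n") # => ["", "    Link 1 (info1), Blah 1,", "Link 2 (info1), Blah 1,", "Link 3 (info2), Blah 1 Foo 2,"]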

This will get you closer to a hash:

Hash[*doc.at('td.j').text.split("\n")[1 .. -1].map{ |t| t.strip.split(',')[0 .. 1] }.flatten] # => {"Link 1 (info1)"=>" Blah 1", "Link 2 (info1)"=>" Blah 1", "Link 3 (info2)"=>" Blah 1 Foo 2"}

mu is too short · Answer

If your data really is that regular and you don't need the attributes from the <a> elements, then you could parse the text form of each table cell without having to worry about the <br> elements at all.

Given some HTML like this in a variable named html:

<table>
    <tbody>
        <tr>
            <td class="j">
                <a title="title text1" href="http://link1.com">Link 1</a> (info1), Blah 1,<br>
                <a title="title text2" href="http://link2.com">Link 2</a> (info1), Blah 1,<br>
                <a title="title text2" href="http://link3.com">Link 3</a> (info2), Blah 1 Foo 2,<br>
            </td>
            <td class="j">
                <a title="title text1" href="http://link4.com">Link 4</a> (info1), Blah 2,<br>
                <a title="title text2" href="http://link5.com">Link 5</a> (info1), Blah 2,<br>
                <a title="title text2" href="http://link6.com">Link 6</a> (info2), Blah 2 Foo 2,<br>
            </td>
        </tr>
        <tr>
            <td class="j">
                <a title="title text1" href="http://link7.com">Link 7</a> (info1), Blah 3,<br>
                <a title="title text2" href="http://link8.com">Link 8</a> (info1), Blah 3,<br>
                <a title="title text2" href="http://link9.com">Link 9</a> (info2), Blah 3 Foo 2,<br>
            </td>
            <td class="j">
                <a title="title text1" href="http://linkA.com">Link A</a> (info1), Blah 4,<br>
                <a title="title text2" href="http://linkB.com">Link B</a> (info1), Blah 4,<br>
                <a title="title text2" href="http://linkC.com">Link C</a> (info2), Blah 4 Foo 2,<br>
            </td>
        </tr>
    </tbody>
</table>

You could do this:

doc = Nokogiri::HTML(html)
chunks = doc.search('.j').map { |td| td.text.strip.scan(/[^,]+,[^,]+/) }

and have this:

[
    [ "Link 1 (info1), Blah 1", "Link 2 (info1), Blah 1", "Link 3 (info2), Blah 1 Foo 2" ],
    [ "Link 4 (info1), Blah 2", "Link 5 (info1), Blah 2", "Link 6 (info2), Blah 2 Foo 2" ],
    [ "Link 7 (info1), Blah 3", "Link 8 (info1), Blah 3", "Link 9 (info2), Blah 3 Foo 2" ],
    [ "Link A (info1), Blah 4", "Link B (info1), Blah 4", "Link C (info2), Blah 4 Foo 2" ]
]

in chunks. Then you could convert that to whatever hash form you needed.
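
One possible conversion, sketched under the assumption that every entry has the shape "link text (info), rest" as in the sample output. The key names :link, :info and :rest are made up for illustration, and the chunks literal just repeats part of the output above so the snippet runs on its own.

```ruby
# Sample data copied from the output shown above (first two cells only).
chunks = [
  ["Link 1 (info1), Blah 1", "Link 2 (info1), Blah 1", "Link 3 (info2), Blah 1 Foo 2"],
  ["Link 4 (info1), Blah 2", "Link 5 (info1), Blah 2", "Link 6 (info2), Blah 2 Foo 2"]
]

# Assumed shape of every entry: "<link text> (<info>), <rest>".
# Leading and trailing whitespace is tolerated, since scan may leave some behind.
line_pattern = /\A\s*(?<link>.+?)\s*\((?<info>[^)]*)\),\s*(?<rest>.*?)\s*\z/

rows = chunks.map do |cell|
  cell.map do |line|
    m = line_pattern.match(line)
    # Keep unparseable lines visible instead of silently dropping them.
    m ? { link: m[:link], info: m[:info], rest: m[:rest] } : { raw: line }
  end
end
```

This yields one array per table cell, each holding one hash per line. Falling back to { raw: line } is a design choice so that an entry that does not fit the assumed pattern still shows up in the result.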

Tags: parsing, ruby, xpath, screen-scraping, nokogiri

Mario Zigliotto
