
Jsoup: select() returns empty when it shouldn't

I am trying to select the infobox on Wikipedia's Google entry page: http://en.m.wikipedia.org/wiki/Google

So, I call:

contentDiv = document.select("div[id=content]").first();

Which works as expected, then I do:

Elements infoboxes = contentDiv.select("table[class=infobox]");

Then I check infoboxes.isEmpty() and I am stunned to discover that it is empty!

I checked and verified that the element contentDiv contains the following:

<table class="infobox vcard" style="width: 22em;" cellspacing="5">

So, why does contentDiv.select("table[class=infobox]") return empty???

UPDATE: I tested the above with contentDiv.select("table[class=infobox vcard]") and it works fine! This is weird since I know that unlike the table.infobox.vcard notation which only selects the exact multiclass element, table[class=infobox] should select all tables that have at least infobox in their listed classes.

BTW, I tested the code, with a different Wikipedia entry, containing:

<table class="infobox biota" style="text-align: left; width: 200px; font-size: 100%;">

And that contentDiv.select("table[class=infobox]") behaves exactly as expected, returning that table element as the first item in infoboxes.

Any idea why the inconsistency? What could explain this odd behavior?

Is it possible that I just stumbled on a Jsoup bug?

(I'm using jsoup-1.5.2, not the latest but I don't need HTML5 support and for various reasons I can't upgrade immediately to the latest 1.6.1).

Regex Rookie asked Feb 23 '23 18:02


1 Answer

The [attributename=attributevalue] selector is an exact match. This is specified in the CSS selector spec (emphasis mine):

[att=val]
        Match when the element's "att" attribute value is exactly "val".

You want to use the [attributename~=attributevalue] instead:

Elements infoboxes = contentDiv.select("table[class~=infobox]");
// ...

or, better actually, the .classname selector:

Elements infoboxes = contentDiv.select("table.infobox");
// ...
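
To make the difference concrete, here is a minimal plain-Java sketch of the two matching rules, applied to the class attribute values quoted in the question. It does not call Jsoup at all: the class SelectorSemantics and its methods exactMatch and hasClassToken are hypothetical names made up for illustration, and they only approximate what the selectors do. The regex-flavoured [attr~=...] form is left out on purpose, since Jsoup's selector documentation describes it as a regular-expression match rather than a whole-word match.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Jsoup: models how the two selector
// forms compare a selector value against an element's class attribute.
class SelectorSemantics {

    // Models table[class=infobox]: the entire attribute string must
    // equal the given value.
    static boolean exactMatch(String classAttr, String value) {
        return classAttr.equals(value);
    }

    // Models table.infobox: the value must be one of the
    // whitespace-separated class names in the attribute.
    static boolean hasClassToken(String classAttr, String value) {
        return Arrays.asList(classAttr.trim().split("\\s+")).contains(value);
    }

    public static void main(String[] args) {
        // Class attribute values taken from the question and the answer.
        String[] classAttrs = {"infobox vcard", "infobox biota", "infobox"};
        for (String attr : classAttrs) {
            System.out.println("class=\"" + attr + "\""
                    + "  [class=infobox]: " + exactMatch(attr, "infobox")
                    + "  .infobox: " + hasClassToken(attr, "infobox"));
        }
    }
}
```

Under these assumptions, only the bare "infobox" value passes the exact comparison, while all three values pass the class-name check. That is consistent with table[class=infobox] skipping the "infobox vcard" table but still finding a table whose class attribute is exactly "infobox".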

See also:

  • CSS selector spec - attribute selectors - class selectors
  • Jsoup selector cookbook
  • Jsoup Selector API

As for your test with a different Wikipedia entry, I can't reproduce it. But I can tell that the page contains another <table class="infobox">, which must be the one you're actually retrieving.

BalusC answered Mar 02 '23 17:03