I'm making a small Android application for a class that finds cancer-related events on the American Cancer Society's website. I've been using jsoup to get basic information about the events, and I've tried to use the select() method to pull specific details from each event page. However, my current selector grabs far more HTML nodes than I want, and I can't figure out why. The table that I'm trying to grab looks like this:
EDIT: I realized that the div with id="pnlResults" does not end at that table; it ends about three tables later, and all of them contain information that I would like to grab. Here is the table again:
<div id="pnlResults"> <h2><span id="lblEventName">American Cancer Society 44th Annual Walter Hagen Golf Tournament</span></h2> <!-- General Information Box --> <div class="text-box boxed wide"> <h3 class="head" style="width:97%;"> General Information </h3> <div class="content"> <p> <label>Event Times:</label><span id="lblStartDate">Monday, July 30, 2012</span><span id="lblEndDate"></span><br /> <label> </label><span id="lblStartTime">10:00 AM</span> - <span id="lblEndTime">9:00 PM</span> </p> <p> <label>Time Zone:</label><span id="lblTimeZone">Eastern</span> </p> <p> <label>Description:</label><span id="lblDesc" class="fieldData long">The American Cancer Society Walter Hagen Golf Tournament highlights the Society’s role in supporting research and patient care here in Rochester. Funds raised through this event help us make a difference in patents’ lives every day though programs including Road to Recovery and Patient Navigation as well as support grants to our research institutions. 144 golfers will play a round of golf and then enjoy cocktails, dinner, and silent auction following the tournament. 
</span> </p> <p> <label>Agenda:</label><span id="lblAgenda" class="fieldData long">10:00am - Check-in, 11:00am - Lunch, 12:15pm - Shot gun start, 6:00 - Cocktails and silent auction, 7:00pm Dinner and program</span> </p> </div> </div> <div id="pnlStandardDisplay"> <!-- Event Location Box --> <div class="text-box boxed wide line"> <h3 class="head" style="width:97%;"> Event Location </h3> <div class="content" style="display:inline-block; width:97%;"> <div > <div id="mapOutsideContainer" class="resource-map"> <div id="map_canvas" class="resource-map" ></div> </div> <script type="text/javascript"> var mapDataPoints = [{ "lat":43.1075545,"lng":-77.5164518, "title":"Golf Event","content":"<b>American Cancer Society 44th Annual Walter Hagen Golf Tournament<\/b><br/><\/br>4045 East Avenue<br /><br/>Rochester, New York 14618<br /><br />Phone: <br />Fax: "} ]; buildMap(mapDataPoints, -5); </script> </div> <h4><span id="lblLocationName">Irondequoit Country Club</span></h4> <p> <label>Address:</label><span id="lblAddress" class="fieldData" style="width:150px;">4045 East Avenue<br />Rochester, New York 14618</span> </p> <p> <label nowrap="nowrap">Handicap Accessible:</label><span id="lblHandicapAccesible">Yes</span> </p> </div> </div> <!-- Primary Contact Box --> <div class ="line" > <div id="eventPrimaryContact_divContact" class="text-box boxed wide"> <h3 class="head" style="width:97%;"> Primary Contact </h3> <div class="content"> <p> <label>Contact:</label><span id="eventPrimaryContact_lblContact">Katerina Kormas (<a href="mailto:[email protected]?subject=American Cancer Society 44th Annual Walter Hagen Golf Tournament">Contact ACS for Details</a>)</span> </p> <p> <label>Contact Type:</label><span id="eventPrimaryContact_lblContactType">ACS Staff</span> </p> <p> <label>Phone:</label><span id="eventPrimaryContact_lblContactPhone">(585) 288-1950</span> </p> <p> <label>Additional Information:</label><span id="eventPrimaryContact_lblContactAddlInfo" class="fieldData long">Direct 
line is 585-224-4919 or cell 585-645-8912</span> </p> </div> </div> </div> <!-- Registration Information Box --> <div class="text-box boxed wide line"> <h3 class="head" style="width:97%;"> Registration Information </h3> <div class="content"> <p> <label nowrap="nowrap">Registration Required?: </label><span id="lblRegRequired">Yes</span> </p> </div> </div> <!-- Event Cost Box --> <div class ="line" > <div id="eventCost_divCost" class="text-box boxed wide"> <h3 class="head" style="width:97%;"> Event Cost </h3> <div class="content"> <p> <label>Cost/Registration Fee: </label><span id="eventCost_lblCostRegFee" class="fieldData long">$350 per golfer</span> </p> <p> <label>Payment Type: </label><span id="eventCost_lblPaymentTypes" class="fieldData">Cash, Check, American Express, Mastercard, Visa, Discover</span> </p> <p> <label>Check Payable To: </label><span id="eventCost_lblCheckPayable" class="fieldData">American Cancer Society</span> </p> <p> <label>Memo Line: </label><span id="eventCost_lblCheckMemo" class="fieldData">American Cancer Society 44th Annual Walter Hagen Golf Tourna</span> </p> <p> <label>Mail Check To:</label><span id="eventCost_lblCheckMailTo" class="fieldData">American Cancer Society<br />1120 South Goodman St<br />Rochester, New York 14620</span> </p> </div> </div> </div> <!-- Tax Deduction Information Box --> <div class="line"> <div class="text-box boxed wide"> <h3 class="head" style="width:97%;"> Tax Deduction Information </h3> <div class="content"> <p> $210 per golfer is tax deductible </p> </div> </div> </div> </div> <!-- end standard display --> <!-- end daffodil display -->
EDIT: Given these new tables, I would like to extract the General Information and Event Location sections. How would I go about doing that? Maybe by calling select() again on the subset I just got, matching only the headers I want?
The code where I'm using select() is shown below. As I said before, I tried to use
select("div[id=pnlResults]");
but the returned data is much more than just the div whose id is pnlResults.
public ArrayList<Event> results() {
    ArrayList<Event> results = new ArrayList<Event>();
    Document doc = Jsoup.parse(page);
    // Every link whose href contains "event-details" points at one event page.
    Elements links = doc.select("a[href*=event-details]");
    for (Element e : links) {
        String title = e.text();
        String link = "http://www.cancer.org/involved/participate/app/" + e.attr("href");
        try {
            Document eventInfo = Jsoup.connect(link).get();
            // Closing bracket added; the original selector was missing it.
            Elements info = eventInfo.select("div[id*=pnlResults]");
        } catch (MalformedURLException exception) {
            exception.printStackTrace();
        } catch (IOException exception) {
            exception.printStackTrace();
        }
    }
    return results;
}
Any help would be greatly appreciated.
Try:
Elements info = eventInfo.select("div#pnlResults");
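As a rough sketch of what that selector gives you: "div#pnlResults" matches the single wrapper div, and everything nested inside it comes along with it, so the usual next step is to select again within that element. The class name PanelSelectDemo, the method sectionHeadings, and the trimmed sample HTML are made up for illustration; only the id and the h3.head headings come from the markup in the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup or of the asker's project.
final class PanelSelectDemo {

    // Returns the text of every <h3 class="head"> nested inside div#pnlResults.
    static List<String> sectionHeadings(String html) {
        Document doc = Jsoup.parse(html);
        // "#pnlResults" is an exact id match; first() is null when nothing matches.
        Element panel = doc.select("div#pnlResults").first();
        List<String> headings = new ArrayList<>();
        if (panel == null) {
            return headings;
        }
        // Selecting on the element searches only its descendants.
        for (Element h3 : panel.select("h3.head")) {
            headings.add(h3.text());
        }
        return headings;
    }

    public static void main(String[] args) {
        // Trimmed-down stand-in for the event page (an assumption, not the real page).
        String html = "<div id=\"pnlResults\">"
                + "<div class=\"text-box\"><h3 class=\"head\"> General Information </h3></div>"
                + "<div id=\"pnlStandardDisplay\">"
                + "<div class=\"text-box\"><h3 class=\"head\"> Event Location </h3></div>"
                + "</div></div>";
        for (String heading : sectionHeadings(html)) {
            System.out.println(heading);
        }
    }
}
```

Selecting on the returned element rather than on the whole document keeps the second query scoped to that panel.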
Update for your update:
Since you now have more data, and since the HTML itself isn't that great, you'll just have to work through it to pick out your data. If the elements you need all have id values, then use the id attribute of those elements to get the text.
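A minimal sketch of that id-based approach, assuming the page keeps the span ids shown in the question (lblStartDate, lblTimeZone, lblLocationName and so on). The class EventFieldsDemo and its methods are hypothetical names, and the sample HTML in main is a shortened copy of the question's markup rather than the live page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup or of the asker's project.
final class EventFieldsDemo {

    // Ids copied from the question's HTML; adjust if the page changes.
    private static final String[] FIELD_IDS = {
        "lblEventName", "lblStartDate", "lblStartTime", "lblEndTime",
        "lblTimeZone", "lblLocationName", "lblAddress"
    };

    // Returns the text of the element with the given id, or "" if it is absent.
    static String textById(Document doc, String id) {
        Element el = doc.getElementById(id);
        return el == null ? "" : el.text();
    }

    // Maps each known id to its text, in the order listed in FIELD_IDS.
    static Map<String, String> extract(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> fields = new LinkedHashMap<>();
        for (String id : FIELD_IDS) {
            fields.put(id, textById(doc, id));
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<div id=\"pnlResults\">"
                + "<h2><span id=\"lblEventName\">Golf Tournament</span></h2>"
                + "<p><span id=\"lblStartDate\">Monday, July 30, 2012</span></p>"
                + "<p><span id=\"lblTimeZone\">Eastern</span></p>"
                + "<h4><span id=\"lblLocationName\">Irondequoit Country Club</span></h4>"
                + "</div>";
        for (Map.Entry<String, String> field : extract(html).entrySet()) {
            System.out.println(field.getKey() + " = " + field.getValue());
        }
    }
}
```

With a document fetched through Jsoup.connect(link).get(), the same textById call could be used directly on eventInfo; missing ids simply come back as empty strings instead of throwing.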