
Is there anything faster than Jsoup for HTML scraping? [closed]

Tags:

android

jsoup

So I'm building an app that displays an imageboard from a website I go to in a more user-friendly interface. There are a lot of problems with it at the moment, but the biggest one right now is fetching the images to display.

The way I have it right now, the images are displayed in a GridView of size 12, mirroring the number of images on each page of the imageboard. I'm using Jsoup to scrape the page for the thumbnail image URLs to display in the GridView, as well as getting the URLs for the full size images to display when a user clicks on the thumbnail.

The problem right now is that Jsoup takes 8-12 seconds on average to get the HTML page to scrape. I find this unacceptable, and I'm wondering whether there is any way to make it faster or whether this is an inherent bottleneck that I can't do anything about.

Here's the code I'm using to fetch the page to scrape:

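// Needs java.io.IOException plus Jsoup, Document, Element and Elements from org.jsoup.
try {
    // get() both downloads and parses the page, so its time covers network and parsing.
    Document doc = Jsoup.connect(url).get();
    // Thumbnails are the <img> elements whose src contains "/alt2/".
    Elements links = doc.select("img[src*=/alt2/]");
    for (Element link : links) {
        thumbURL = link.attr("src");
        // Derive the full-size image URL from the thumbnail URL.
        linkURL = thumbURL.replace("/alt2/", "/").replace("s.jpg", ".jpg");
        imgSrc.add(new Pair<String, String>(thumbURL, linkURL));
    }
} catch (IOException e) {
    e.printStackTrace();
}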
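
One detail in the loop above: String.replace rewrites every occurrence of its target, so replace("s.jpg", ".jpg") would also alter an "s.jpg" that happens to appear earlier in the URL. Below is a minimal sketch of a stricter mapping, assuming thumbnails always sit under /alt2/ and end in s.jpg as the snippet implies. The class name ThumbUrls, the method toFullSize, and the example.com URLs are hypothetical and not from the original app.

```java
// Hypothetical helper; names and URLs are illustrative only.
final class ThumbUrls {
    private static final String THUMB_DIR = "/alt2/";
    private static final String THUMB_SUFFIX = "s.jpg";

    /** Maps a thumbnail URL to its full-size URL, or returns null if it does not look like a thumbnail. */
    static String toFullSize(String thumbUrl) {
        int dir = thumbUrl.indexOf(THUMB_DIR);
        if (dir < 0 || !thumbUrl.endsWith(THUMB_SUFFIX)) {
            return null;
        }
        // Drop the "/alt2" directory segment (first occurrence only).
        String withoutDir = thumbUrl.substring(0, dir) + "/"
                + thumbUrl.substring(dir + THUMB_DIR.length());
        // Swap only the trailing "s.jpg" for ".jpg".
        return withoutDir.substring(0, withoutDir.length() - THUMB_SUFFIX.length()) + ".jpg";
    }

    public static void main(String[] args) {
        // prints http://example.com/img/123.jpg
        System.out.println(toFullSize("http://example.com/img/alt2/123s.jpg"));
    }
}
```

This only tightens the URL rewrite; it is not expected to change how long the page takes to load.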
seraphzero asked Apr 24 '12 04:04


3 Answers

I used Jsoup for a TLFN scraper and had no issues with speed. You should narrow down the bottleneck first. I presume it's your scraping that is causing the speed issue, but time your selector and your network traffic separately and see which is to blame. If your selector is to blame, consider another approach for querying and benchmark the results.

For quicker, rough testing you can always run Jsoup from a normal Java project. Once you feel you have improved it there, put it back on a device and see whether the performance improves similarly.
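
To make "time each stage separately" concrete, here is a minimal sketch of a timing wrapper in plain Java. The StageTimer class and its Timed result type are hypothetical names, and the two lambdas in main are stand-ins for the real download and parse steps, not Jsoup calls.

```java
import java.util.concurrent.Callable;

// Hypothetical timing helper; names are illustrative only.
final class StageTimer {
    /** Value produced by a step plus the wall-clock milliseconds it took. */
    static final class Timed<T> {
        final T value;
        final long millis;

        Timed(T value, long millis) {
            this.value = value;
            this.millis = millis;
        }
    }

    /** Runs one step and measures it; checked exceptions are rethrown unchecked. */
    static <T> Timed<T> time(Callable<T> step) {
        long start = System.nanoTime();
        try {
            T value = step.call();
            return new Timed<>(value, (System.nanoTime() - start) / 1_000_000);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the network step: sleeps briefly, then returns some HTML.
        Timed<String> fetch = time(() -> {
            Thread.sleep(50);
            return "<html><body></body></html>";
        });
        // Stand-in for the parse/select step.
        Timed<Integer> parse = time(() -> fetch.value.length());
        System.out.println("fetch: " + fetch.millis + " ms, parse: " + parse.millis + " ms");
    }
}
```

In the app, the download, the parse, and the select call would each go in their own time(...) call, so the three numbers can be compared directly on the desktop and on the device.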

EDIT

Not that this is your issue, but be aware that using iterators can trigger quite a bit of garbage collection. Typically this is not a concern, but if you use them in many places with much repetition, some devices can take a noticeable performance hit.

not great

for (Element link : links)

better

int i;
Element tempLink;
for (i=0;i<links.size();i++) {
   tempLink = links.get(i);
}

EDIT 2

If the image URLs start with /alt2/, you may be able to use ^= instead of *=, which could potentially make the search faster. Additionally, depending on the amount of HTML, you may be wasting a lot of time looking in completely the wrong place for these images. Check whether the images are wrapped inside an identifiable container such as <div class="posts">. If you can narrow down the amount of HTML to sift through, it may improve performance.
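
A caveat on ^= versus *=: the first matches only when the attribute value begins with the given text, while the second matches it anywhere in the value. The sketch below models that difference with plain String methods rather than Jsoup itself, so it only illustrates which src values each form would accept. The class name SelectorSemantics and the sample src values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Models prefix vs. substring attribute matching with plain String methods.
// This is not Jsoup code; the sample src values are made up.
final class SelectorSemantics {
    /** Roughly what [src^=needle] accepts: values that begin with needle. */
    static boolean matchesPrefix(String src, String needle) {
        return src.startsWith(needle);
    }

    /** Roughly what [src*=needle] accepts: values that contain needle anywhere. */
    static boolean matchesSubstring(String src, String needle) {
        return src.contains(needle);
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
                "/alt2/123s.jpg",                   // relative, starts with /alt2/
                "http://example.com/alt2/123s.jpg", // absolute URL
                "/img/alt2/123s.jpg",               // /alt2/ deeper in the path
                "/full/123.jpg");                   // not a thumbnail
        for (String src : samples) {
            System.out.println(src
                    + "  ^= " + matchesPrefix(src, "/alt2/")
                    + "  *= " + matchesSubstring(src, "/alt2/"));
        }
    }
}
```

If the page uses absolute URLs or a deeper path, the ^= form would match nothing, so it is worth checking the actual src values before switching.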

ian.shaun.thomas answered Nov 11 '22 08:11


Though slightly different, this question has the same answer as "Scraping dynamically generated html inside Android app".

In short, you should offload the "download & parse" part to a remote web service. See Web Scraping from Android for a discussion.

Yevgeniy answered Nov 11 '22 08:11


I ran into the very same issue:

The Logcat output from my HTC One S clearly shows that the connection and response take only the first 4 seconds (3 connections in parallel). The parsing takes almost 30-40 seconds, which is a huge amount of time. Note that the HTC One S has a very fast dual-core CPU at 1.4 GHz, so the problem is clearly not specific to the emulator.

02-27 14:11:55.278: DEBUG/MyActivity(10735): =c>
02-27 14:11:55.278: DEBUG/MyActivity(10735): =c>
02-27 14:11:55.278: DEBUG/MyActivity(10735): =c>
02-27 14:11:59.002: DEBUG/MyActivity(10735): <r=
02-27 14:11:59.012: DEBUG/MyActivity(10735): <r=
02-27 14:11:59.422: DEBUG/MyActivity(10735): <r=
02-27 14:12:33.949: DEBUG/MyActivity(10735): <d=
02-27 14:12:37.463: DEBUG/MyActivity(10735): <d=
02-27 14:12:38.294: DEBUG/MyActivity(10735): <d=
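
The gaps in this log can be checked with a few lines of desktop Java. This is only a sketch: the LogGaps class is a hypothetical name, and which `<r=` line belongs to which `<d=` line is an assumption, since the three connections ran in parallel.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper for checking the gaps between Logcat timestamps.
final class LogGaps {
    /** Milliseconds between two HH:mm:ss.SSS timestamps taken on the same day. */
    static long millisBetween(String from, String to) {
        return Duration.between(LocalTime.parse(from), LocalTime.parse(to)).toMillis();
    }

    public static void main(String[] args) {
        // First "=c>" line to first "<r=" line: connection plus response.
        System.out.println(millisBetween("14:11:55.278", "14:11:59.002")); // 3724
        // First "<r=" line to first "<d=" line: parsing (pairing assumed).
        System.out.println(millisBetween("14:11:59.002", "14:12:33.949")); // 34947
        // Last "<r=" line to last "<d=" line (pairing assumed).
        System.out.println(millisBetween("14:11:59.422", "14:12:38.294")); // 38872
    }
}
```

That works out to roughly 3.7 seconds until the first response and about 35 to 39 seconds between a response and its parsed document.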

This is my code:

// Jsoup-Connection
Connection c = Jsoup.connect(urls[0]);
// Request timeout in ms
c.timeout(5000);
Connection.Response r = c.execute();
Log.d("MyActivity","<r= doInBackground ("+urls[0]+")");

// Get the actual Document
Document doc = r.parse();
Log.d("MyActivity","<d= doInBackground ("+urls[0]+")");

Update:

02-27 20:38:25.649: INFO/MyActivity(18253): !=c> 
02-27 20:38:27.511: INFO/MyActivity(18253): !<r= 
02-27 20:38:28.873: INFO/MyActivity(18253): !#d=

I got some new results. The previous ones were from running my app on Android in debugging mode; the results posted now are from running without debugging (launched from the IntelliJ IDE). Is there any explanation for why debugging makes Jsoup so slow?

Running in debug mode on my i5 desktop machine, I saw no performance penalty.

The culprit for my code being so slow on Android is definitely debug mode: it slows Jsoup down by a factor of 100.

cimba007 answered Nov 11 '22 07:11