How to limit download size with jsoup?

Tags:

jsoup

I'm trying to limit the size of a downloaded page/link with JSoup, given something like the following (Scala code):

val document = Jsoup.connect(theURL).get();

I'd like to get only the first few KB of a given page and stop downloading beyond that. If the page is really large (or theURL points to a large file that isn't HTML), I don't want to spend time downloading the rest.

My use case is a page title snarfer for an IRC bot.

Bonus question:

Is there any reason why Jsoup.connect(theURL).timeout(3000).get(); isn't timing out on large files? The bot ends up pinging out if someone pastes a link to something like a never-ending audio stream or a large ISO. I could solve that by fetching URL titles in a different thread (or by using Scala actors and timing out there), but that seems like overkill for a very simple bot when I think timeout() is supposed to accomplish the same end result.

Ricky Elrod asked Jul 16 '12 19:07


2 Answers

As of version 1.7.2 you can limit the maximum body size with the maxBodySize() method: http://jsoup.org/apidocs/org/jsoup/Connection.Request.html#maxBodySize() By default the body is limited to 1MB, which guards against excessive memory use on very large responses.
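As a minimal sketch of how this could look for the title-snarfer use case, assuming jsoup 1.7.2 or later is on the classpath: the names TitleFetcher, MaxBytes, limitedConnection and fetchTitle are made up for illustration, and the 16 KB cap is an arbitrary guess at how much of a page is needed to reach its title.

```scala
import org.jsoup.{Connection, Jsoup}

object TitleFetcher {
  // Hypothetical cap chosen for illustration; tune it for your pages.
  val MaxBytes: Int = 16 * 1024

  // Builds the connection without sending anything yet.
  def limitedConnection(url: String): Connection =
    Jsoup.connect(url)
      .timeout(3000)          // connect/read timeout in milliseconds
      .maxBodySize(MaxBytes)  // stop reading the body after MaxBytes bytes

  // Fetches the (possibly truncated) page and returns its title, if non-empty.
  def fetchTitle(url: String): Option[String] = {
    val document = limitedConnection(url).get()
    Option(document.title()).map(_.trim).filter(_.nonEmpty)
  }
}
```

A truncated body is still parsed, so the title should be found as long as it appears within the first MaxBytes bytes of the response.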

Alex Moleiro answered Sep 21 '22 07:09


Bonus answer to your bonus question: the timeout applies to the connect and to socket transfers (reads), not to the request as a whole. So if the connection is established within the timeout, and you keep receiving packets from the server at intervals shorter than the timeout, the timeout will never trigger.

I understand that's not fantastically intuitive and would like to move it to a total elapsed wallclock timeout. But for backwards compatibility I probably need to make it a different method (opinions solicited).

The never-ending audio stream should now be prevented in 1.7.2+ by the max body size. But without a wall-clock timeout, a request can still get stuck on deliberately slow servers that eke out a response bit by bit with 3-second delays.
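Until such a wall-clock timeout exists in the library, the caller can approximate one, along the lines of the "different thread" idea from the question. The following is only a sketch using the Scala standard library; WallClock and withDeadline are hypothetical names, not part of jsoup.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object WallClock {
  // Runs `task` on another thread and waits at most `limit` for its result.
  // Returns None if the task fails or the deadline passes first.
  // Note: the task itself is NOT cancelled on timeout; it keeps running
  // in the background until it finishes on its own.
  def withDeadline[A](limit: FiniteDuration)(task: => A): Option[A] =
    Try(Await.result(Future(task), limit)).toOption
}
```

For the bot this could be called as WallClock.withDeadline(5.seconds)(Jsoup.connect(theURL).timeout(3000).get().title()). Because the abandoned download is not cancelled, it is worth keeping the max body size limit in place so the background request ends once the cap is reached.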

Jonathan Hedley answered Sep 21 '22 07:09