I'm trying to limit the size of a downloaded page/link with JSoup, given something like the following (Scala code):
val document = Jsoup.connect(theURL).get();
I'd like to get only the first few KB of a given page and stop downloading beyond that. If the page is really large (or theURL points to something that isn't HTML, such as a large file), I'd like to avoid spending time downloading the rest.
My use case is a page title snarfer for an IRC bot.
Bonus question:
Is there any reason why Jsoup.connect(theURL).timeout(3000).get();
isn't timing out on large files? It ends up causing the bot to ping out if someone pastes something like a never-ending audio stream or a large ISO. That can be solved by fetching URL titles in a different thread (or using Scala actors and timing out there), but that seems like overkill for a very simple bot when I thought timeout()
was supposed to accomplish the same end result.
As of version 1.7.2 you can limit the maximum body size with the maxBodySize() method: http://jsoup.org/apidocs/org/jsoup/Connection.Request.html#maxBodySize() By default it is limited to 1 MB, which prevents excessive memory use on huge responses.
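A minimal sketch of a capped title fetch for the bot (the 64 KB cap, the object name, and the helper name are my own illustrative choices, not from the jsoup docs; maxBodySize() and timeout() are real Connection methods):

```scala
import org.jsoup.Jsoup

object TitleSnarfer {
  // Assumed cap: 64 KB is usually plenty to reach <title> in the <head>.
  val MaxBytes: Int = 64 * 1024

  def fetchTitle(url: String): String =
    Jsoup.connect(url)
      .maxBodySize(MaxBytes) // jsoup 1.7.2+: stop reading the body after MaxBytes
      .timeout(3000)         // connect/read timeout in milliseconds
      .get()
      .title()
}
```

Note that maxBodySize caps how much of the body jsoup reads, so a huge ISO paste no longer forces a full download; it does not, by itself, bound total wall-clock time.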
Bonus answer to your bonus question: the timeout is defined as the connect and socket transfer timeouts. So if the connection takes less time than the timeout, and you're receiving packets from the server more frequently than the timeout, the timeout will never trigger.
I understand that's not fantastically intuitive and would like to move it to a total elapsed wallclock timeout. But for backwards compatibility I probably need to make it a different method (opinions solicited).
The never-ending audio stream should be prevented now in 1.7.2+ with the max body size. But without a wallclock timeout, the fetch can still get stuck on deliberately slow servers that eke out a response bit by bit with 3-second delays.
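Until jsoup grows a total-elapsed timeout, one workaround is to impose the wall-clock budget yourself. A sketch using only the Scala standard library (the helper name and budget are my own; wrap the jsoup call in `fetch`):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Run `fetch` on another thread and give up after `budget` of total
// elapsed time, even if the server is still trickling bytes.
def withWallclockTimeout[T](budget: FiniteDuration)(fetch: => T): Option[T] =
  try Some(Await.result(Future(fetch), budget))
  catch { case _: java.util.concurrent.TimeoutException => None }
```

For example, `withWallclockTimeout(5.seconds)(TitleSnarfer.fetchTitle(url))` (assuming a fetchTitle helper) returns None instead of hanging the bot. The abandoned thread may keep downloading until jsoup's own socket timeout fires, which is the usual trade-off with this approach.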