I am trying to write a program that collects certain information from different articles and combines it. The step I am having trouble with is extracting the article text from the web page.
Could you suggest any Java libraries or methods for extracting text from a web page?
I have also found this product: http://www.diffbot.com/products/automatic/article/ and was wondering whether you think it is the way to go. If so, can someone point me to a Java implementation? I cannot seem to find one, although apparently one exists.
Many thanks
Clarification: I am looking for an algorithm/library/method for detecting where in an HTML DOM tree a block of text that could be an article is located, like Safari's Reader function. P.S. If you think this is much easier to do in something like Python, just say so. My program has to run in Java, as it should eventually run on a server (using a Java framework), but I could have it call Python scripts. I would only do this if you advise that Python is the way to go.
Have a look at Apache Tika. It's meant to be used together with a crawler and can extract both text and metadata for you. You can also select various output types.
I have found an open source solution that is very highly rated: boilerpipe, https://code.google.com/p/boilerpipe/
A review of different text extraction algorithms: http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/
It appears that Diffbot performs very well but is not open source. So in terms of open source, boilerpipe could be the way to go.
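To give a feel for the kind of heuristic such extractors rely on, here is a rough, standard-library-only sketch. It is not boilerpipe's or Diffbot's actual code, just a simplified illustration of one common idea: split the page into blocks, then keep blocks that have enough words and a low share of link text (navigation and footers tend to be mostly links). The class name `ArticleBlockFinder`, its method names, and the two thresholds are all made up for this example. It uses regular expressions instead of a real HTML parser and does not decode entities, so treat it as a toy.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy main-content finder. Hypothetical class, NOT boilerpipe's algorithm. */
class ArticleBlockFinder {
    // Split the page wherever a common block-level tag opens or closes.
    private static final Pattern BLOCK_BOUNDARY = Pattern.compile(
            "(?i)</?(p|div|td|li|ul|ol|h[1-6]|article|section|nav|header|footer|br)\\b[^>]*>");
    // script/style elements are dropped together with their contents.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Captures the inner HTML of each <a> element.
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Counts whitespace-separated words; blank input counts as zero. */
    static int countWords(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    /** Removes all tags and collapses runs of whitespace. Entities are left as-is. */
    static String stripTags(String html) {
        return ANY_TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    /** Fraction of a block's words that sit inside <a> elements. */
    static double linkDensity(String blockHtml) {
        int total = countWords(stripTags(blockHtml));
        if (total == 0) {
            return 0.0;
        }
        int linked = 0;
        Matcher m = ANCHOR.matcher(blockHtml);
        while (m.find()) {
            linked += countWords(stripTags(m.group(1)));
        }
        return (double) linked / total;
    }

    /**
     * Returns the text of every block with at least minWords words and a
     * link density of at most maxLinkDensity, joined by newlines.
     * Both thresholds are assumptions to tune per site.
     */
    static String extract(String html, int minWords, double maxLinkDensity) {
        String cleaned = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        StringBuilder out = new StringBuilder();
        for (String block : BLOCK_BOUNDARY.split(cleaned)) {
            String text = stripTags(block);
            if (countWords(text) >= minWords && linkDensity(block) <= maxLinkDensity) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(text);
            }
        }
        return out.toString();
    }
}
```

You would call it as something like `ArticleBlockFinder.extract(html, 5, 0.3)` on the page source. For real pages you would want a proper HTML parser and better-tuned features, which is what a library such as boilerpipe gives you.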