Best way to extract text (e.g. articles) from web page [closed]

Question

I am trying to write a program that collects certain information from different articles and combines it. The step I am having trouble with is extracting the article from the web page.

Could you suggest any Java libraries or methods for extracting text from a web page?

I have also found this product: http://www.diffbot.com/products/automatic/article/ and was wondering whether you think it is the way to go. If so, can someone point me to a Java implementation? I cannot seem to find one, although apparently it exists.

Many thanks

Clarification: I am looking for an algorithm, library, or method for detecting where in an HTML DOM tree a block of text that could be an article is located, like Safari's Reader function. P.S. If you think this is much easier in something like Python, just say so. My program has to run in Java, as it will eventually run on a server (using a Java framework), but I could have it call Python scripts. I would only do this if you advise that Python is the way to go.

Jakub Kotowski · Accepted Answer

Have a look at Apache Tika. It is often used together with a crawler and can extract both text and metadata for you. You can also choose among several output formats, such as plain text or XHTML.

Saad Attieh · Answer

I have found an open-source solution that is very highly rated, boilerpipe: https://code.google.com/p/boilerpipe/

A review of different text extraction algorithms: http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/

It appears that Diffbot performs very well but is not open source. So among open-source options, boilerpipe could be the way to go.
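
For a rough idea of the kind of heuristic involved, here is a minimal sketch in Java using only the standard library. It is not boilerpipe's or Diffbot's actual algorithm. It is a simplified illustration of the shallow-text-feature idea: split the HTML at block-level tags, then keep blocks that contain many words and few link words. The class name `ArticleExtractorSketch`, the thresholds `MIN_WORDS` and `MAX_LINK_DENSITY`, and the regex-based tag handling are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of boilerpipe, Tika, or Diffbot.
final class ArticleExtractorSketch {
    // Assumed thresholds, chosen arbitrarily for this sketch.
    static final int MIN_WORDS = 15;
    static final double MAX_LINK_DENSITY = 0.3;

    // Script/style elements and comments never contain article text.
    private static final Pattern DROP = Pattern.compile(
            "(?is)<(script|style)\\b.*?</\\1\\s*>|<!--.*?-->");
    // Tags treated as boundaries between text blocks.
    private static final Pattern BLOCK_BOUNDARY = Pattern.compile(
            "(?i)</?(p|div|td|li|h[1-6]|ul|ol|table|tr|article|section|nav"
            + "|header|footer|br|blockquote)\\b[^>]*>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static int countWords(String s) {
        String t = s.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Remove any remaining tags and collapse whitespace.
    static String stripTags(String s) {
        return TAG.matcher(s).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Return the blocks that look like article text, in document order.
    static List<String> extractBlocks(String html) {
        String cleaned = DROP.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String raw : BLOCK_BOUNDARY.split(cleaned)) {
            String text = stripTags(raw);
            int words = countWords(text);
            if (words < MIN_WORDS) {
                continue; // too short: likely a menu item, caption, or footer
            }
            // Count the words that sit inside <a> elements in this block.
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(raw);
            while (m.find()) {
                linkWords += countWords(stripTags(m.group(1)));
            }
            if ((double) linkWords / words <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return kept;
    }

    static String extract(String html) {
        return String.join("\n\n", extractBlocks(html));
    }
}
```

Calling `ArticleExtractorSketch.extract(html)` on a page's HTML string returns the kept blocks separated by blank lines. Regexes are fragile on real-world markup, so for anything beyond experimentation a proper HTML parser or one of the libraries above is the safer choice.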

Tags: java, web, diffbot