
Best way to extract text (e.g. articles) from web page [closed]

Tags: java, web, diffbot

I am trying to write a program that collects certain information from different articles and combines it. The step I am having trouble with is extracting the article from the web page.

Could you suggest any Java libraries/methods for extracting text from a web page?

I have also found this product: http://www.diffbot.com/products/automatic/article/ and was wondering whether you think this is the way to go. If so, can someone point me to a Java implementation? I cannot seem to find one, although apparently it exists.

Many thanks

Clarification: I am looking for an algorithm/library/method for detecting where in an HTML DOM tree a block of text that could be an article is located, like Safari's Reader function. PS: if you think this is much easier in something like Python, just say so. My program has to run in Java because it should eventually run on a server (using a Java framework), but I could have it call Python scripts. I would only do this if you advise that Python is the way to go.

Saad Attieh asked Dec 24 '13 23:12



2 Answers

Have a look at Apache Tika. It's meant to be used together with a crawler and can extract both text and metadata for you. You can also select various output types.

Jakub Kotowski answered Nov 19 '22 00:11



I have found an open source solution that is very highly rated: https://code.google.com/p/boilerpipe/

A review of different text extraction algorithms: http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/

It appears that Diffbot performs very well but is not open source. So among open source options, boilerpipe could be the way to go.
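For a rough sense of how this kind of extractor can separate article text from navigation and footers, here is a minimal sketch of a text-density heuristic using only the Java standard library. It is not boilerpipe's actual algorithm or API. The class name `DensityExtractor`, the thresholds, and the list of block-level tags are all assumptions made for illustration, and regex-based HTML handling like this is fragile on real pages (no entity decoding, no handling of malformed markup).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a crude text-density filter, NOT boilerpipe's algorithm. */
class DensityExtractor {
    // Hypothetical thresholds picked for illustration; tune them on real pages.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    // Script and style bodies never hold article text, so drop them first.
    private static final Pattern NON_CONTENT =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Block-level tags act as boundaries between candidate text blocks.
    private static final Pattern BLOCK_TAG = Pattern.compile(
        "(?i)</?(p|div|td|li|ul|ol|h[1-6]|br|tr|table|section|article|nav|header|footer)\\b[^>]*>");
    // Text inside anchors counts as "link words".
    private static final Pattern ANCHOR =
        Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Counts whitespace-separated words after stripping any remaining tags. */
    static int countWords(String s) {
        String text = ANY_TAG.matcher(s).replaceAll(" ").trim();
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    /** Returns blocks that are long enough and not dominated by link text. */
    static List<String> extractBlocks(String html) {
        String cleaned = NON_CONTENT.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_TAG.split(cleaned)) {
            int words = countWords(block);
            if (words < MIN_WORDS) {
                continue; // too short to be article prose
            }
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkWords += countWords(m.group(1));
            }
            double linkDensity = (double) linkWords / words;
            if (linkDensity <= MAX_LINK_DENSITY) {
                String text = ANY_TAG.matcher(block).replaceAll(" ");
                kept.add(text.replaceAll("\\s+", " ").trim());
            }
        }
        return kept;
    }

    /** Joins the kept blocks into one string, one block per line. */
    static String extract(String html) {
        return String.join("\n", extractBlocks(html));
    }
}
```

Calling `DensityExtractor.extract(html)` on a page's HTML string returns the blocks that pass both checks. A menu made mostly of links is dropped because of its high link density, and short fragments such as footers are dropped by the word-count threshold.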

Saad Attieh answered Nov 19 '22 01:11
