
How to extract web page textual content in Java? [closed]

Tags:

java

I am looking for a way to extract the text from a web page (initially HTML) using the JDK or another library. Please help.

Thanks.

Radi asked Jun 14 '10 10:06


2 Answers

Use jsoup. This is currently the most elegant library for screen scraping.

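import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3*1000); // fetch and parse, 3-second timeout
String title = doc.title();              // contents of the <title> element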

I just love its CSS selector syntax.

Pascal Thivent answered Oct 07 '22 01:10


Use an HTML parser if at all possible; there are many available for Java.

Or you can use regex like many people do. This is generally not advisable, however, unless you're doing very simplistic processing.
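
As a rough sketch of the parser route using only the JDK, the Swing HTML parser (javax.swing.text.html) can collect the text nodes of a page without any third-party dependency. The class name HtmlTextExtractor and the sample markup are made up for illustration, and this parser predates HTML5, so treat it as a starting point rather than a robust solution:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named for illustration only.
class HtmlTextExtractor {

    /** Collects the text nodes of the given HTML and joins them with single spaces. */
    static String extractText(String html) throws IOException {
        StringBuilder out = new StringBuilder();

        // The parser invokes handleText once for each run of character data.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(' ').append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(reader, callback, true);
        }

        // Collapse runs of whitespace so the result is easier to work with.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Hello</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

To read from a live page instead of a string, the StringReader could be swapped for an InputStreamReader over the stream returned by URL.openStream(), with an explicit charset. Note that joining text runs with a space will also split a word whose markup changes mid-word, such as <b>bo</b>ld.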

Related questions

  • Java HTML Parsing
  • Which Html Parser is best?
  • Any good Java HTML parsers?
  • recommendations for a java HTML parser/editor
  • What HTML parsing libraries do you recommend in Java

Text extraction:

  • Text Extraction from HTML Java
  • Text extraction with java html parsers

Tag stripping:

  • Stripping HTML tags in Java
  • How to strip HTML attributes except “src” and “alt” in JAVA
  • Removing HTML from a Java String

polygenelubricants answered Oct 06 '22 23:10