
How do I count the number of words (text) in an HTML source?

Tags:

java

html

count

I have some HTML documents for which I need to return the number of words in each document. The count should include only the actual text, not the HTML tags themselves (e.g. html, br, etc.).

Any ideas how to do this? Naturally, I would prefer to re-use some code.

Thanks,

Assaf

Assafn asked May 17 '11 10:05



1 Answer

  • Strip out the HTML tags to get the text content; you can reuse Jsoup for this.

  • Read the text line by line, keep a Map<String, Integer> wordToCountMap, and update it for each word you read.
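The second step could look roughly like the sketch below. It assumes the tags have already been stripped and the plain text is in a String (with Jsoup on the classpath, something like Jsoup.parse(html).text() should give you that). The class name WordCounter and the helper totalWords are made up for illustration, and splitting on whitespace is only one possible definition of a "word": punctuation stays attached, so "text," and "text" count as different words.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class; the name is not from the original answer.
class WordCounter {

    // Builds the per-word tally (the wordToCountMap from the answer).
    // Assumption: `text` is already plain text with the HTML tags removed.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> wordToCountMap = new LinkedHashMap<>();
        // Split on runs of whitespace (spaces, tabs, newlines).
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input produces a single empty token
            }
            // Lower-case so "Hello" and "hello" share one entry.
            wordToCountMap.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return wordToCountMap;
    }

    // Total number of words is the sum of the per-word counts.
    static int totalWords(Map<String, Integer> wordToCountMap) {
        int total = 0;
        for (int count : wordToCountMap.values()) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from an HTML document.
        String text = "Hello world hello";
        Map<String, Integer> counts = countWords(text);
        System.out.println(counts);
        System.out.println(totalWords(counts));
    }
}
```

If you only need the total and not the per-word breakdown, you can skip the map and just count the non-empty tokens.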

jmj answered Oct 06 '22 01:10
