
Parsing HTML into formatted plaintext using jsoup

Tags: java, maven, jsoup

I am working on a Maven project that parses HTML data from a website. I was able to parse it using the code below:

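    import java.io.IOException;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public void parseData() {
        String url = "http://stackoverflow.com/help/on-topic";
        try {
            // Fetch and parse the page
            Document doc = Jsoup.connect(url).get();
            // First <div> whose class is "col-section"
            Element essay = doc.select("div.col-section").first();
            // text() flattens all nested content into a single string
            String essayText = essay.text();
            jTextAreaAdem.setText(essayText);
        } catch (IOException ex) {
            Logger.getLogger(formAdem.class.getName()).log(Level.SEVERE, null, ex);
        }
    }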

So far I have no problems: I can parse the HTML data. I am using jsoup's select method with "div.col-section", which means I'm looking for a div element whose class is col-section. I want to display the data in a text area. The result I get is one huge paragraph, even though the real data on the website spans several paragraphs. How can I parse the data so that it keeps the same formatting as on the website?

GoGo asked Oct 13 '14 18:10


1 Answer

The text is not formatted because the formatting lives in the HTML itself, in tags such as <p> and <ol>. Calling .text() on a block element discards those tags and returns the content as a single run of text.

Jsoup has an example HTML to plain text converter (HtmlToPlainText, in the org.jsoup.examples package) which you can adapt to your needs by providing the div element as the focus.
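
As a rough illustration of that approach, here is a minimal sketch of a visitor-based converter. It is a simplified stand-in, not the code of the jsoup example itself: the class name PlainTextSketch, the method toPlainText, and the choice of which tags produce line breaks are all assumptions made for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class; the names are illustrative only.
public class PlainTextSketch {

    /** Walks the subtree under root and emits text, adding line breaks around block tags. */
    static String toPlainText(Element root) {
        final StringBuilder out = new StringBuilder();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // Append the text of each text node as it is encountered
                    out.append(((TextNode) node).text());
                } else if (node.nodeName().equals("li")) {
                    // Start each list item on its own line (assumed bullet style)
                    out.append("\n * ");
                }
            }

            @Override
            public void tail(Node node, int depth) {
                String name = node.nodeName();
                // Assumed set of tags that should end with a line break
                if (name.equals("p") || name.equals("br") || name.equals("ol")
                        || name.equals("ul") || name.matches("h[1-6]")) {
                    out.append("\n");
                }
            }
        });
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Inline HTML so the sketch runs without a network connection
        String html = "<div class=\"col-section\"><p>First.</p>"
                + "<ol><li>One</li><li>Two</li></ol><p>Last.</p></div>";
        Element focus = Jsoup.parse(html).select("div.col-section").first();
        System.out.println(toPlainText(focus));
    }
}
```

In the question's code, the same helper could be called as toPlainText(essay) in place of essay.text(). The tag list would probably need adjusting to the markup of the actual page.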

Alternatively, you could just select "div.col-section > *", iterate through each Element, and print out its text followed by a newline.
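
A minimal sketch of this second approach might look like the following. The class name ChildTextSketch, the method childrenAsLines, and the use of a blank line as the separator are assumptions for illustration; the selector is the one given above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; the names are illustrative only.
public class ChildTextSketch {

    /** Joins the text of each direct child of the container, one block per child. */
    static String childrenAsLines(Document doc, String containerSelector) {
        StringBuilder out = new StringBuilder();
        // "> *" matches every direct child element of the container
        for (Element child : doc.select(containerSelector + " > *")) {
            String text = child.text();
            if (!text.isEmpty()) {
                // Assumed separator: a blank line between blocks
                out.append(text).append("\n\n");
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Inline HTML so the sketch runs without a network connection
        String html = "<div class=\"col-section\">"
                + "<p>First paragraph.</p><p>Second paragraph.</p></div>";
        Document doc = Jsoup.parse(html);
        System.out.println(childrenAsLines(doc, "div.col-section"));
    }
}
```

With the question's code, the returned string could be passed to jTextAreaAdem.setText(...). Note that a child such as an <ol> is still flattened to one line by text(), so nested lists would need extra handling.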

Jonathan Hedley answered Sep 28 '22 05:09
