
Extract text between HTML <br> tags with jsoup

Tags: java, html, jsoup

I am writing a Java program to extract HTML data for a project. This is the HTML:

 <td align="left" valign="top" class="style3">
        PC / Van<br>$14 (Mon-Fri, excl PH)
        <br>
        $18 (Sat, Sun & PH)<br><br>$70/Day(Mon-Fri, excl PH: Entry - 24:00)   
        <br>
        $100/day (Sat, Sun & PH: Entry - 24:00)
 </td></tr>

The following is my Java code for the extraction.

String connect1 = url1.toString();
Document doc1 = Jsoup.connect(connect1).get();

// select every element whose class attribute is exactly "style3"
Elements type1 = doc1.select("[class=\"style3\"]");

int size = type1.size();

try {
    // text() returns the combined text of the element and drops the <br> tags
    String text = type1.first().text();
    System.out.println(text);
} catch (Exception e) {
    e.printStackTrace();
}

The output I get is

PC / Van$14 (Mon-Fri, excl PH)$18 (Sat, Sun & PH)$70/Day(Mon-Fri, excl PH: Entry - 24:00)$100/day (Sat, Sun & PH: Entry - 24:00)

How can I split the text at the <br> tags?

Kennedy Kan asked Aug 07 '15 05:08




1 Answer

You can replace every <br> tag with a newline character (\n), as in the example below:

// s holds the HTML of the page as a String
Document doc1 = Jsoup.parse(s);
Elements type1 = doc1.select("[class=\"style3\"]");
try {
    // html() keeps the inner markup, so the <br> tags are still present
    String text = type1.first().html();
    text = text.replace("<br>", "\n");
    System.out.println(text);
} catch (Exception e) {
    e.printStackTrace();
}

Or split the HTML into a string array at each <br> tag:

Document doc1 = Jsoup.parse(s);
Elements type1 = doc1.select("[class=\"style3\"]");
try {
    String text = type1.first().html();
    // split never returns null, so no null check is needed
    String[] textSplitResult = text.split("<br>");
    for (String t : textSplitResult) {
        System.out.println(t);
    }
} catch (Exception e) {
    e.printStackTrace();
}
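
The split above matches only the literal string <br>. If the markup might also contain <br/>, <br /> or <BR>, a regular expression is more forgiving. The following is a minimal sketch using only the Java standard library; the class name BrSplitter and the method splitOnBr are made up for illustration, and the sample string is an assumption, not the asker's page.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class BrSplitter {
    // Matches <br>, <br/>, <br /> and <BR>, plus any whitespace around the tag.
    private static final Pattern BR =
            Pattern.compile("\\s*<br\\s*/?>\\s*", Pattern.CASE_INSENSITIVE);

    /** Splits an HTML fragment at <br> tags and drops blank segments. */
    static List<String> splitOnBr(String html) {
        return Arrays.stream(BR.split(html))
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample mixing several spellings of the tag.
        String html = "PC / Van<br>$14 (Mon-Fri, excl PH)\n<br />\n"
                + "$18 (Sat, Sun &amp; PH)<BR><br>$70/Day";
        splitOnBr(html).forEach(System.out::println);
    }
}
```

Two consecutive <br> tags produce an empty segment, which the filter removes, so the blank line seen in the plain split does not appear here.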

Or use a Java 8 stream to print the result:

String text = type1.first().html();
String[] textSplitResult = text.split("<br>");
// forEach is a terminal operation that always runs; peek(...).count() may skip
// the peek action on Java 9 and later, where count() can be computed from the source
Arrays.stream(textSplitResult).forEach(System.out::println);

The output (html() returns markup, so & is printed as the entity &amp;):

PC / Van
$14 (Mon-Fri, excl PH)
$18 (Sat, Sun &amp; PH)

$70/Day(Mon-Fri, excl PH: Entry - 24:00)
$100/day (Sat, Sun &amp; PH: Entry - 24:00)
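
If you would rather get plain text with & unescaped, one alternative is to read the element's child text nodes instead of splitting its HTML. This is only a sketch and assumes jsoup is on the classpath; the class name BrTextNodes, the method linesOf and the sample markup are illustrative and not taken from the question. The sample is wrapped in a <table> because the HTML parser discards a <td> that appears outside a table.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class BrTextNodes {
    /** Returns the non-blank text runs found directly inside the first matching element. */
    static List<String> linesOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element cell = doc.select(cssQuery).first();
        List<String> lines = new ArrayList<>();
        if (cell == null) {
            return lines; // nothing matched the selector
        }
        // Each run of text between two <br> tags is a separate child TextNode.
        for (TextNode node : cell.textNodes()) {
            String line = node.text().trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the cell from the question.
        String html = "<table><tr><td class=\"style3\">PC / Van<br>$14 (Mon-Fri, excl PH)<br>"
                + "$18 (Sat, Sun & PH)<br><br>$70/Day(Mon-Fri, excl PH: Entry - 24:00)"
                + "</td></tr></table>";
        linesOf(html, "td.style3").forEach(System.out::println);
    }
}
```

This only collects text that is a direct child of the cell; text nested inside other tags such as <b> or <span> would need extra handling.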
Javy answered Oct 04 '22 17:10