I am trying to get all the HTML between two h1 tags. The actual task is to break the HTML into frames (chapters) based on the h1 (heading 1) tags.
Appreciate any help.
Thanks Sunil
If you want to get and process all elements between two consecutive h1 tags you can work on siblings. Here's some example code:
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Both methods belong inside a class of your choice.
public static void h1s() {
    String html = "<html>" +
        "<head></head>" +
        "<body>" +
        " <h1>title 1</h1>" +
        " <p>hello 1</p>" +
        " <table>" +
        " <tr>" +
        " <td>hello</td>" +
        " <td>world</td>" +
        " <td>1</td>" +
        " </tr>" +
        " </table>" +
        " <h1>title 2</h1>" +
        " <p>hello 2</p>" +
        " <table>" +
        " <tr>" +
        " <td>hello</td>" +
        " <td>world</td>" +
        " <td>2</td>" +
        " </tr>" +
        " </table>" +
        " <h1>title 3</h1>" +
        " <p>hello 3</p>" +
        " <table>" +
        " <tr>" +
        " <td>hello</td>" +
        " <td>world</td>" +
        " <td>3</td>" +
        " </tr>" +
        " </table>" +
        "</body>" +
        "</html>";
    Document doc = Jsoup.parse(html);
    Elements headings = doc.select("h1");
    for (Element h1 : headings) {
        List<Element> elementsBetween = new ArrayList<Element>();
        // Collect the siblings that follow this h1, stopping at the next h1
        // or at the end of the parent (nextElementSibling() returns null there).
        Element sibling = h1.nextElementSibling();
        while (sibling != null && !"h1".equals(sibling.tagName())) {
            elementsBetween.add(sibling);
            sibling = sibling.nextElementSibling();
        }
        processElementsBetween(elementsBetween);
    }
}

private static void processElementsBetween(List<Element> elementsBetween) {
    // Print a separator, then every element that belongs to the current chapter.
    System.out.println("---");
    for (Element element : elementsBetween) {
        System.out.println(element);
    }
}
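If it helps, the grouping step itself does not depend on jsoup at all. Here is a minimal sketch of just that step, using only the standard library. The names `ChapterSplitter` and `splitAt` are made up for this example, and plain strings stand in for parsed elements, so treat it as an illustration rather than a finished solution:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class ChapterSplitter {

    // Splits items into groups; every item matching isHeading starts a new group.
    // Items that appear before the first heading are collected in a leading group.
    public static <T> List<List<T>> splitAt(List<T> items, Predicate<? super T> isHeading) {
        List<List<T>> chapters = new ArrayList<>();
        List<T> current = null;
        for (T item : items) {
            if (current == null || isHeading.test(item)) {
                current = new ArrayList<>();
                chapters.add(current);
            }
            current.add(item);
        }
        return chapters;
    }

    public static void main(String[] args) {
        // Strings stand in for elements here (an assumption made for the demo).
        List<String> tags = Arrays.asList(
            "h1:title 1", "p:hello 1", "table", "h1:title 2", "p:hello 2");
        for (List<String> chapter : splitAt(tags, t -> t.startsWith("h1"))) {
            System.out.println("--- " + chapter);
        }
    }
}
```

With jsoup on the classpath, a call along the lines of `ChapterSplitter.splitAt(doc.body().children(), e -> "h1".equals(e.tagName()))` should give one list of elements per chapter, with the h1 itself as the first entry of each list.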
I don't know Jsoup that well, but a straightforward approach could look like this:
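import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class Test {
    public static void main(String[] args) {
        Document document = Jsoup.parse("<html><body>" +
            "<h1>First</h1><p>text text text</p>" +
            "<h1>Second</h1>more text" +
            "</body></html>");
        List<List<Node>> articles = new ArrayList<List<Node>>();
        List<Node> currentArticle = null;
        for (Node node : document.body().childNodes()) {
            // Compare the node name rather than the serialized HTML,
            // so an h1 with attributes (e.g. <h1 class="x">) also matches.
            if ("h1".equals(node.nodeName())) {
                currentArticle = new ArrayList<Node>();
                articles.add(currentArticle);
            }
            // Nodes before the first h1 (e.g. whitespace text) have no article yet; skip them.
            if (currentArticle != null) {
                currentArticle.add(node);
            }
        }
        for (List<Node> article : articles) {
            for (Node node : article) {
                System.out.println(node);
            }
            System.out.println("------- new page ---------");
        }
    }
}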
Do you know the structure of the articles, and is it always the same? What do you want to do with the articles? Have you considered splitting them on the client side? That would be an easy job for jQuery.