
Meaning of position or offset in HTMLDocument text

Tags: java, html, swing

I'm trying to understand how positions/offsets work in HTMLDocument. Position/offset semantics are described here. My interpretation is that these are indices in the sequence of on-screen characters represented by the HTMLDocument.

Consider the example HTML from the HTMLDocument documentation:

 <html>
   <head>
     <title>An example HTMLDocument</title>
     <style type="text/css">
       div { background-color: silver; }
       ul { color: red; }
     </style>
   </head>
   <body>
     <div id="BOX">
       <p>Paragraph 1</p>
       <p>Paragraph 2</p>
     </div>
   </body>
 </html>

When I open this HTML in a browser, I only see "Paragraph 1" and "Paragraph 2" (and no leading spaces or newlines). So I would think that "Paragraph 1" starts at offset 0.

But consider the following code, where I print the text in the example HTML and the offset of the body:

import java.io.StringReader;
import javax.swing.text.Element;
import javax.swing.text.html.*;

public class Test {
    public static void main(String[] args) throws Exception {
        String html = " <html>\n"
                    + "   <head>\n"
                    + "     <title>An example HTMLDocument</title>\n"
                    + "     <style type=\"text/css\">\n"
                    + "       div { background-color: silver; }\n"
                    + "       ul { color: red; }\n"
                    + "     </style>\n"
                    + "   </head>\n"
                    + "   <body>\n"
                    + "     <div id=\"BOX\">\n"
                    + "       <p>Paragraph 1</p>\n"
                    + "       <p>Paragraph 2</p>\n"
                    + "     </div>\n"
                    + "   </body>\n"
                    + " </html>\n";

        HTMLEditorKit htmlKit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) htmlKit.createDefaultDocument();
        htmlKit.read(new StringReader(html), doc, 0);

        System.out.println("doc length: " + doc.getLength());
        String text = doc.getText(0, doc.getLength());
        System.out.println("doc text, surrounded by quotes, with newlines replaced with /: \""
                + text.replace('\n', '/') + "\"");

        Element element = doc.getDefaultRootElement().getElement(1);
        System.out.println("element name: " + element.getName());
        int offset = element.getStartOffset();
        System.out.println("offset of body: " + offset);
    }
}

Output:

doc length: 26
doc text, surrounded by quotes, with newlines replaced with /: "  /Paragraph 1/Paragraph 2"
element name: body
offset of body: 3

Basic questions: Why is "Paragraph 1" (the start of the body) at index 3? Where do the first three characters (two spaces and a newline) of the text come from? Am I misinterpreting what "offset" means?

Challenge question: Given some HTML (simple enough to completely understand by inspection), how can I rigorously figure out the offsets of all DOM elements by hand?


More info:

If I remove the style tag from the HTML, I get the same result (body offset of 3). If I also remove the title, I get a body offset of 1. If I finally remove head entirely, I get a body offset of 0 (as expected). So apparently style contributes 0, title contributes 2, and head contributes 1 to the body's offset? What is the reasoning behind this?

This also doesn't appear to be affected by whitespace in the HTML string.

k_ssb asked May 30 '18 10:05


1 Answer

Good question. You can work out the offsets (and therefore the caret positions needed in a JEditorPane) from a few rules; you've already found the most important ones.

A few key tags and what each adds to the offset:

  • <head>: +1
  • <title>: +2
  • <meta>: +1
  • <p>: text length +1 (for a CR)
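
As a worked example, the rules above can be applied by hand to the HTML from the question. This is only a sketch of the arithmetic: the class and method names are made up for illustration, and the per-tag constants are simply the ones listed above, not values read from Swing.

```java
// Hand calculation of offsets for the example HTML, using the per-tag
// contributions listed above. OffsetsByHand and its methods are
// illustrative names, not part of Swing.
class OffsetsByHand {
    static final int HEAD = 1;   // <head> adds 1
    static final int TITLE = 2;  // <title> adds 2 (start tag and end tag)

    // A <p> occupies its text plus one trailing CR.
    static int paragraphSpan(String text) {
        return text.length() + 1;
    }

    // <body> starts after everything contributed by <head>.
    static int bodyStart() {
        return HEAD + TITLE;
    }

    // Start offset of the n-th paragraph (0-based) inside <body>.
    static int paragraphStart(String[] paragraphs, int n) {
        int offset = bodyStart();
        for (int i = 0; i < n; i++) {
            offset += paragraphSpan(paragraphs[i]);
        }
        return offset;
    }

    public static void main(String[] args) {
        String[] paragraphs = {"Paragraph 1", "Paragraph 2"};
        System.out.println("body starts at " + bodyStart());
        for (int i = 0; i < paragraphs.length; i++) {
            int start = paragraphStart(paragraphs, i);
            int end = start + paragraphSpan(paragraphs[i]);
            System.out.println("<p> " + i + ": [" + start + "," + end + "]");
        }
    }
}
```

The body start of 3 agrees with the offset reported in the question, and the two paragraph ranges come out as [3,15] and [15,27].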

If you haven't found it already, the simplest way to see the list of offsets and how they break down is to call dump on the document instance, e.g. doc.dump(System.out);. For the example HTML above:

<html
  name=html
>
  <head
    name=head
  >
    <p-implied
      name=p-implied
    >
      <title
        name=title
      >
        [0,1][ ]
      <title
        endtag=true
        name=title
      >
        [1,2][ ]
      <content
        CR=true
        name=content
      >
        [2,3][
]
  <body
    name=body
  >
    <div
      id=BOX
      name=div
    >
      <p
        name=p
      >
        <content
          name=content
        >
          [3,14][Paragraph 1]
        <content
          CR=true
          name=content
        >
          [14,15][
]
      <p
        name=p
      >
        <content
          name=content
        >
          [15,26][Paragraph 2]
        <content
          CR=true
          name=content
        >
          [26,27][
]
<bidi root>
  <bidi level
    bidiLevel=0
  >
    [0,27][  
Paragraph 1
Paragraph 2
]
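
If the full dump is more detail than you need, the same offsets can be listed more compactly by walking the element tree. The following is a minimal sketch, assuming the same HTMLEditorKit setup as in the question; OffsetWalk, parse and describe are names I've made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Illustrative helper: prints one line per element with its offsets.
class OffsetWalk {
    static final String EXAMPLE =
          "<html>\n"
        + "  <head>\n"
        + "    <title>An example HTMLDocument</title>\n"
        + "  </head>\n"
        + "  <body>\n"
        + "    <div id=\"BOX\">\n"
        + "      <p>Paragraph 1</p>\n"
        + "      <p>Paragraph 2</p>\n"
        + "    </div>\n"
        + "  </body>\n"
        + "</html>\n";

    // Parses HTML the same way as the code in the question.
    static HTMLDocument parse(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return doc;
    }

    // Depth-first walk over all elements: "name [start,end]" per line.
    static String describe(HTMLDocument doc) {
        StringBuilder sb = new StringBuilder();
        ElementIterator it = new ElementIterator(doc);
        for (Element e = it.first(); e != null; e = it.next()) {
            sb.append(e.getName())
              .append(" [").append(e.getStartOffset())
              .append(',').append(e.getEndOffset()).append("]\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        HTMLDocument doc = parse(EXAMPLE);
        doc.dump(System.out);            // full dump, as shown above
        System.out.print(describe(doc)); // compact list of offsets
    }
}
```

Each line of the compact listing should correspond to one element in the dump, so the two can be compared side by side.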

If you want to dig deeper, you'll need to explore the rules in Swing's HTML parsing logic. There are a lot of rules for different tag types; you can see the list in the source.

Each tag uses an 'Action' class in this hierarchy:

[Image: swing-html-actions]

For example <p> is a ParagraphAction, and <head> is a HeadAction, and both of these are types of BlockAction. A <div> is also directly a BlockAction.

A BlockAction can add that extra <content CR...> element to finish the block, hence the extra +1 on the offset. It normally does so only if there was direct text content in the tag. For <head>, though, the HeadAction subclass adds the <p-implied> you can see in the dump above, which accounts for one of the extra offsets. (You can't see it in this example, but it's worth noting that a <div> with text content also inserts an extra <p-implied> to hold the block text.)
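
To see the <div> case for yourself, you could parse a tiny document whose <div> holds text directly and look for a p-implied element. This is a sketch only; DivTextDemo and hasElementNamed are hypothetical names, and the input HTML is my own minimal example rather than anything from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Illustrative check for the implied paragraph inside a <div> with text.
class DivTextDemo {
    static HTMLDocument parse(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return doc;
    }

    // True if any element in the document has the given name.
    static boolean hasElementNamed(HTMLDocument doc, String name) {
        ElementIterator it = new ElementIterator(doc);
        for (Element e = it.first(); e != null; e = it.next()) {
            if (name.equals(e.getName())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Text sits directly in the <div>, with no <p> around it.
        HTMLDocument doc = parse("<html><body><div>Hello</div></body></html>");
        System.out.println("p-implied present: "
                + hasElementNamed(doc, "p-implied"));
        doc.dump(System.out); // shows where the implied paragraph sits
    }
}
```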

Things get steadily more specific from there. E.g. <title> (along with <applet> and <object>) seems to be a 'non-empty' HiddenAction. This means an element is inserted for both the start and end tags. <meta>, by contrast, is an empty HiddenAction, so it gets just one element for the start tag.

Hopefully that's enough of an explanation of how to figure out the offset for any given tag. If you browse the source for the XxxAction classes, look for lines like new ElementSpec(..., 0, 1); that last parameter is the length.

You also mentioned whitespace being ignored. This, at least, is normal in HTML parsing, in browsers too. Whitespace between tags, or before and after text, is routinely ignored; only the whitespace between words is kept, and sequences of whitespace there are collapsed to a single space.
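
The whitespace observation from the question can be checked by parsing an indented and a compact version of the same markup and comparing where <body> starts. A minimal sketch, assuming the same parsing setup as in the question; WhitespaceDemo, bodyStart and the two constants are illustrative names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Compares the body offset for indented and compact markup.
class WhitespaceDemo {
    static final String INDENTED =
          " <html>\n"
        + "   <head>\n"
        + "     <title>An example HTMLDocument</title>\n"
        + "   </head>\n"
        + "   <body>\n"
        + "     <div id=\"BOX\">\n"
        + "       <p>Paragraph 1</p>\n"
        + "       <p>Paragraph 2</p>\n"
        + "     </div>\n"
        + "   </body>\n"
        + " </html>\n";

    // Same markup with all whitespace between tags removed.
    static final String COMPACT =
          "<html><head><title>An example HTMLDocument</title></head>"
        + "<body><div id=\"BOX\"><p>Paragraph 1</p><p>Paragraph 2</p></div>"
        + "</body></html>";

    static HTMLDocument parse(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return doc;
    }

    // Start offset of the first element named "body", or -1 if none.
    static int bodyStart(HTMLDocument doc) {
        ElementIterator it = new ElementIterator(doc);
        for (Element e = it.first(); e != null; e = it.next()) {
            if ("body".equals(e.getName())) {
                return e.getStartOffset();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("indented: " + bodyStart(parse(INDENTED)));
        System.out.println("compact:  " + bodyStart(parse(COMPACT)));
    }
}
```

Looking the body up by name rather than by child index keeps the check independent of how many children the root element ends up with.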


That all said, I'm still not clear why the extra offsets are needed for the <head> and <title>. E.g. if you use setCaretPosition(x) against a JEditorPane based on the doc and htmlKit above, you only see the caret if x is 3 or more. Perhaps someone else can shed some light on this...

df778899 answered Oct 18 '22 05:10