I'm trying to understand how positions/offsets work in HTMLDocument. Position/offset semantics are described here. My interpretation is that these are indices in the sequence of on-screen characters represented by the HTMLDocument.
Consider the example HTML from the HTMLDocument documentation:
<html>
<head>
<title>An example HTMLDocument</title>
<style type="text/css">
div { background-color: silver; }
ul { color: red; }
</style>
</head>
<body>
<div id="BOX">
<p>Paragraph 1</p>
<p>Paragraph 2</p>
</div>
</body>
</html>
When I open this HTML in a browser, I only see "Paragraph 1" and "Paragraph 2" (and no leading spaces or newlines). So I would think that "Paragraph 1" starts at offset 0.
But consider the following code, where I print the text in the example HTML and the offset of the body:
import java.io.StringReader;
import javax.swing.text.Element;
import javax.swing.text.html.*;

public class Test {
    public static void main(String[] args) throws Exception {
        String html = " <html>\n"
                + " <head>\n"
                + " <title>An example HTMLDocument</title>\n"
                + " <style type=\"text/css\">\n"
                + " div { background-color: silver; }\n"
                + " ul { color: red; }\n"
                + " </style>\n"
                + " </head>\n"
                + " <body>\n"
                + " <div id=\"BOX\">\n"
                + " <p>Paragraph 1</p>\n"
                + " <p>Paragraph 2</p>\n"
                + " </div>\n"
                + " </body>\n"
                + " </html>\n";

        // Parse the HTML string into a fresh HTMLDocument.
        HTMLEditorKit htmlKit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) htmlKit.createDefaultDocument();
        htmlKit.read(new StringReader(html), doc, 0);

        // Print the document's character content.
        System.out.println("doc length: " + doc.getLength());
        String text = doc.getText(0, doc.getLength());
        System.out.println("doc text, surrounded by quotes, with newlines replaced with /: \""
                + text.replace('\n', '/') + "\"");

        // The second child of the root element is the body.
        Element element = doc.getDefaultRootElement().getElement(1);
        System.out.println("element name: " + element.getName());
        int offset = element.getStartOffset();
        System.out.println("offset of body: " + offset);
    }
}
Output:
doc length: 26
doc text, surrounded by quotes, with newlines replaced with /: "  /Paragraph 1/Paragraph 2"
element name: body
offset of body: 3
Basic questions: Why is "Paragraph 1" (the start of the body) at index 3? Where do the first three characters (two spaces and a newline) of the text come from? Am I misinterpreting what "offset" means?
Challenge question: Given some HTML (simple enough to completely understand by inspection), how can I rigorously figure out the offsets of all DOM elements by hand?
More info:
If I remove the style tag from the HTML, I get the same result (body offset of 3). If I also remove the title, I get a body offset of 1. If I finally remove head entirely, I get a body offset of 0 (as expected). So apparently style contributes 0, title contributes 2, and head contributes 1 to the body's offset? What is the reasoning behind this?
This also doesn't appear to be affected by whitespace in the HTML string.
Good question. You can figure out the offsets (and therefore the necessary caret positions in a JEditorPane) according to a few rules; you've mentioned the most important ones already.
The offsets contributed by a few key tags are:
<head>: +1
<title>: +2
<meta>: +1
<p>: text length +1 (for a CR)
If you've not found it already, the simplest way to see that list of offsets, and how they break down, is to call doc.dump(System.out) on the HTMLDocument (dump(PrintStream) is inherited from AbstractDocument). E.g. for the example HTML above:
<html
name=html
>
<head
name=head
>
<p-implied
name=p-implied
>
<title
name=title
>
[0,1][ ]
<title
endtag=true
name=title
>
[1,2][ ]
<content
CR=true
name=content
>
[2,3][
]
<body
name=body
>
<div
id=BOX
name=div
>
<p
name=p
>
<content
name=content
>
[3,14][Paragraph 1]
<content
CR=true
name=content
>
[14,15][
]
<p
name=p
>
<content
name=content
>
[15,26][Paragraph 2]
<content
CR=true
name=content
>
[26,27][
]
<bidi root>
<bidi level
bidiLevel=0
>
[0,27][
Paragraph 1
Paragraph 2
]
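The same offsets can also be collected programmatically by walking the element tree. The following is only a minimal sketch: the class OffsetWalk, the constant EXAMPLE, and the helpers parse, walk, describe, and bodyOffset are names made up for this illustration, while getDefaultRootElement, getElementCount, getElement, getName, getStartOffset, and getEndOffset are the standard javax.swing.text calls. The HTML is the question's example with the indentation dropped, on the assumption (stated in the question) that whitespace between tags does not change the offsets.

```java
import java.io.StringReader;
import javax.swing.text.Element;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class for this illustration; not part of Swing.
class OffsetWalk {
    // The question's example HTML, without indentation (an assumption:
    // whitespace between tags is taken not to affect the offsets).
    static final String EXAMPLE = "<html>\n"
            + "<head>\n"
            + "<title>An example HTMLDocument</title>\n"
            + "<style type=\"text/css\">\n"
            + "div { background-color: silver; }\n"
            + "ul { color: red; }\n"
            + "</style>\n"
            + "</head>\n"
            + "<body>\n"
            + "<div id=\"BOX\">\n"
            + "<p>Paragraph 1</p>\n"
            + "<p>Paragraph 2</p>\n"
            + "</div>\n"
            + "</body>\n"
            + "</html>\n";

    // Parse an HTML string into an HTMLDocument, as in the question's code.
    static HTMLDocument parse(String html) {
        try {
            HTMLEditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            kit.read(new StringReader(html), doc, 0);
            return doc;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Append one line per element: indentation, name, then [start,end].
    static void walk(Element e, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(e.getName())
           .append(" [").append(e.getStartOffset())
           .append(',').append(e.getEndOffset()).append("]\n");
        for (int i = 0; i < e.getElementCount(); i++) {
            walk(e.getElement(i), depth + 1, out);
        }
    }

    // Return the whole element tree of the parsed HTML as indented text.
    static String describe(String html) {
        StringBuilder out = new StringBuilder();
        walk(parse(html).getDefaultRootElement(), 0, out);
        return out.toString();
    }

    // Start offset of the <body> child of the root, or -1 if there is none.
    static int bodyOffset(String html) {
        Element root = parse(html).getDefaultRootElement();
        for (int i = 0; i < root.getElementCount(); i++) {
            Element child = root.getElement(i);
            if ("body".equals(child.getName())) {
                return child.getStartOffset();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.print(describe(EXAMPLE));
        System.out.println("offset of body: " + bodyOffset(EXAMPLE));
    }
}
```

Each printed line carries the same [start,end] pair that the dump shows for leaf elements, and adds the range covered by every enclosing element, which makes it easier to check a hand calculation tag by tag.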
If you're interested in drilling deeper, it will mean exploring the rules in the Swing parsing logic for HTML. There are a lot of rules for different tag types; you can see the list in the source.
Each tag is handled by an 'Action' class from a small class hierarchy (the classes are nested in HTMLDocument.HTMLReader). For example, <p> is a ParagraphAction and <head> is a HeadAction, and both of these are types of BlockAction. A <div> is also directly a BlockAction.
A BlockAction can add that extra <content CR...> element, to finish the block, hence the extra +1 on the offset. It normally only does so if there was direct text content in the tag. For <head> though, the HeadAction subclass adds the <p-implied> you can see in the dump above, which is causing one of the extra offsets. (You can't see it in this example, but it's worth noting that a <div> with text content also inserts that extra <p-implied>, to hold the block text.)
Things get steadily more specific from there. E.g. <title> (along with <applet> and <object>) seems to be a 'non-empty' HiddenAction. This means an element is inserted for both the start and end tags. <meta> though, for example, is an empty HiddenAction, so it just gets one element for the start tag.
Hopefully that's enough of an explanation as to how to figure out the offset for any given tag. If browsing the source for the XxxAction classes, look for lines like new ElementSpec(..., 0, 1); that last parameter is the length.
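To check the per-tag contributions empirically, one option is to parse a few variants of the head and compare where the body starts. This is a rough sketch under stated assumptions: HeadVariants, page, and bodyOffset are hypothetical names invented here, the page body is reduced to a single paragraph, and the meta tag uses an arbitrary name/content pair. For the question's own HTML, the reported body offsets were 3 with a title, 1 with an empty head, and 0 with no head.

```java
import java.io.StringReader;
import javax.swing.text.Element;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class for this illustration; not part of Swing.
class HeadVariants {
    // Wrap the given head markup (possibly empty) in a minimal page.
    static String page(String head) {
        return "<html>\n" + head + "<body>\n<p>Paragraph 1</p>\n</body>\n</html>\n";
    }

    // Parse the HTML and return the start offset of <body>, or -1 if absent.
    static int bodyOffset(String html) {
        try {
            HTMLEditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            kit.read(new StringReader(html), doc, 0);
            Element root = doc.getDefaultRootElement();
            // Look the body up by name rather than by index, in case the
            // number of children of the root differs between variants.
            for (int i = 0; i < root.getElementCount(); i++) {
                Element child = root.getElement(i);
                if ("body".equals(child.getName())) {
                    return child.getStartOffset();
                }
            }
            return -1;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String[][] variants = {
            {"no head", ""},
            {"empty head", "<head>\n</head>\n"},
            {"head + title", "<head>\n<title>T</title>\n</head>\n"},
            {"head + meta", "<head>\n<meta name=\"author\" content=\"x\">\n</head>\n"},
        };
        for (String[] v : variants) {
            System.out.println(v[0] + ": body offset " + bodyOffset(page(v[1])));
        }
    }
}
```

Comparing the printed offsets between neighbouring variants gives the contribution of each tag, which can then be matched against the lengths passed to ElementSpec in the corresponding Action class.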
You also mentioned whitespace being ignored. This at least is normal in HTML parsing, in browsers too. Whitespace between tags, or before and after text is routinely ignored - only the whitespace between words is kept. And then, sequences of whitespace are collapsed to a single whitespace.
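The whitespace behaviour can be checked in the same way, by comparing the document text for an indented and a compact version of the same markup. The sketch below is illustrative only: WhitespaceCheck, text, INDENTED, and COMPACT are made-up names, and the program prints whether the two texts match rather than assuming that they do.

```java
import java.io.StringReader;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class for this illustration; not part of Swing.
class WhitespaceCheck {
    // Indented markup, similar in shape to the question's example.
    static final String INDENTED = "<html>\n"
            + "  <head>\n"
            + "    <title>An example HTMLDocument</title>\n"
            + "  </head>\n"
            + "  <body>\n"
            + "    <div id=\"BOX\">\n"
            + "      <p>Paragraph 1</p>\n"
            + "      <p>Paragraph 2</p>\n"
            + "    </div>\n"
            + "  </body>\n"
            + "</html>\n";

    // The same markup with all whitespace between tags removed.
    static final String COMPACT = "<html><head><title>An example HTMLDocument</title></head>"
            + "<body><div id=\"BOX\"><p>Paragraph 1</p><p>Paragraph 2</p></div></body></html>";

    // Parse the HTML and return the document's full character content.
    static String text(String html) {
        try {
            HTMLEditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            kit.read(new StringReader(html), doc, 0);
            return doc.getText(0, doc.getLength());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String a = text(INDENTED);
        String b = text(COMPACT);
        // Show newlines as '/' so that they are visible, as in the question.
        System.out.println("indented: \"" + a.replace('\n', '/') + "\"");
        System.out.println("compact:  \"" + b.replace('\n', '/') + "\"");
        System.out.println("same text: " + a.equals(b));
    }
}
```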
That all said, I'm still not clear why the extra offsets are needed for the <head> and <title>. E.g. if you use setCaretPosition(x) against a JEditorPane based on the doc and htmlKit above, you only see the caret if x is 3 or more. Perhaps someone else can shed some light on this...