Some compilers failed on non-ASCII characters in JavaDoc and source code comments. What are the current (Java 7) and future (Java 8 and beyond) practices with respect to Unicode in Java source files? Are there differences between IcedTea, OpenJDK, and other Java environments, and what is dictated by the language specification? Should all non-ASCII characters be escaped in JavaDoc with HTML &escape;-like codes? But what would be the Java // comment equivalent?
Update: comments indicate that one can use any character set, and that when compiling one needs to indicate which character set is used in the source file. I will look into this, and will be looking for details on how to configure this via Ant, Eclipse, and Maven.
Some compilers failed on non-ASCII characters in JavaDoc and source code comments.
This is likely because the compiler assumes that the input is UTF-8, and there are invalid UTF-8 sequences in the source file. That these appear to be in comments in your source code editor is irrelevant because the lexer (which distinguishes comments from other tokens) never gets to run. The failure occurs while the tool is trying to convert bytes into chars before the lexer runs.
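To see that this failure is purely a byte-to-char problem, independent of comments or tokens, here is a minimal sketch using java.nio (the class name DecodeCheck and reading the file path from args[0] are just for illustration):

import java.nio.ByteBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DecodeCheck {
    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // Throws MalformedInputException on invalid UTF-8 byte sequences,
        // before any notion of "comment" or "token" exists.
        decoder.decode(ByteBuffer.wrap(bytes));
        System.out.println("Decoded fine as UTF-8");
    }
}

Run it against a source file that makes javac or javadoc choke; if it throws, the problem is in the bytes, not in your code.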
The man pages for javac and javadoc say:

-encoding name
    Specifies the source file encoding name, such as EUCJIS/SJIS.
    If this option is not specified, the platform default converter is used.
so running javadoc with the encoding flag
javadoc -encoding <encoding-name> ...
after replacing <encoding-name> with the encoding you've used for your source files should cause it to use the right encoding.
If you've got more than one encoding used within a group of source files that you need to compile together, you need to fix that first and settle on a single uniform encoding for all source files. You should really just use UTF-8 or stick to ASCII.
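If you need to convert stragglers to UTF-8, a typical approach on Unix-like systems is iconv. This sketch assumes the old file is Latin-1 (adjust -f to whatever the file actually uses), and Old.java is a placeholder name:

iconv -f ISO-8859-1 -t UTF-8 Old.java > Old.java.utf8
mv Old.java.utf8 Old.java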
What are the current (Java 7) and future (Java 8 and beyond) practices with respect to Unicode in Java source files?
The algorithm for dealing with a source file in Java is:

1. Collect the bytes of the compilation unit (source file).
2. Convert bytes to chars (UTF-16 code units) using some encoding.
3. Replace each sequence of '\' 'u' followed by four hex digits with the code unit corresponding to those hex digits. Error out if there is a "\u" not followed by four hex digits.
4. Lex the chars into tokens.
5. Parse the tokens.

The current and former practice is that step 2, converting bytes to UTF-16 code units, is up to the tool that is loading the compilation unit (source file), but the de facto standard for command line interfaces is to use the -encoding flag.

After that conversion happens, the language mandates that \uABCD style sequences are converted to UTF-16 code units (step 3) before lexing and parsing.
For example:
int a;
\u0061 = 42;
is a valid pair of Java statements. Any Java source code tool must, after converting bytes to chars but before parsing, look for \uABCD sequences and convert them, so this code is converted to
int a;
a = 42;
before parsing. This happens regardless of where the \uABCD sequence occurs.
This process looks something like:

1. Get bytes: [105, 110, 116, 32, 97, 59, 10, 92, 117, 48, 48, 54, 49, 32, 61, 32, 52, 50, 59]
2. Convert bytes to chars: ['i', 'n', 't', ' ', 'a', ';', '\n', '\\', 'u', '0', '0', '6', '1', ' ', '=', ' ', '4', '2', ';']
3. Replace Unicode escapes: ['i', 'n', 't', ' ', 'a', ';', '\n', 'a', ' ', '=', ' ', '4', '2', ';']
4. Lex: ["int", "a", ";", "a", "=", "42", ";"]
5. Parse: (Block (Variable (Type int) (Identifier "a")) (Assign (Reference "a") (Int 42)))
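For completeness, a compilable variant of the example above (the class name UnicodeEscapeDemo is illustrative) that you can run to verify the escape processing yourself:

public class UnicodeEscapeDemo {
    public static void main(String[] args) {
        // The escape below is replaced with the letter a before lexing,
        // so the lexer sees "int a = 42;".
        int \u0061 = 42;
        System.out.println(a); // prints 42
    }
}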
Should all non-ASCII characters be escaped in JavaDoc with HTML &escape;-like codes?
No need, except for HTML special characters like '<' that you want to appear literally in the documentation. You can use \uABCD sequences inside javadoc comments.
Java processes \u.... sequences before parsing the source file, so they can appear inside strings, comments, anywhere really. That's why
System.out.println("Hello, world!\u0022);
is a valid Java statement: \u0022 is the double quote character, so the escape closes the string literal.
/** @return \u03b8 in radians */
is equivalent to
/** @return θ in radians */
as far as javadoc is concerned.
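Note that -encoding only controls how javadoc reads your sources. If you also want the generated HTML to declare its encoding, the standard doclet additionally accepts -docencoding and -charset. For example, assuming UTF-8 throughout (paths are placeholders):

javadoc -encoding UTF-8 -docencoding UTF-8 -charset UTF-8 -d apidocs src/*.java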
But what would be the Java // comment equivalent?
You can use // comments in Java, but Javadoc only looks inside /**...*/ comments for documentation. // comments do not carry metadata.
One ramification of Java's handling of \uABCD sequences is that although
// Comment text.\u000A System.out.println("Not really comment text");
looks like a single line comment, and many IDEs will highlight it as such, it is not: \u000A is the line feed character, so the escape ends the comment and the println is live code.
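Wrapped in a runnable class (the name SneakyComment is illustrative), compiling and running this actually prints the "comment" text:

public class SneakyComment {
    public static void main(String[] args) {
        // The escape in the next line turns into a real line feed before
        // lexing, so the comment ends there and the call is live code.
        // Comment text.\u000A System.out.println("Not really comment text");
    }
}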
As commenters indicated, the encoding of the source files can be passed to (at least some) compilers. In this answer, I will summarize how to pass this information.
Eclipse
Eclipse (3.7 checked) does not require any special configuration; it takes the encoding from the workspace or project settings (Preferences > General > Workspace > Text file encoding), and you can happily use Java source code like:
double π = Math.PI;
Ant
<javac encoding="UTF-8" ... >
</javac>
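For reference, a fuller sketch of the task, with placeholder paths:

<target name="compile">
    <javac srcdir="src" destdir="build/classes"
           encoding="UTF-8" includeantruntime="false"/>
</target>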
Java
javac -encoding UTF-8 src/main/Foo.java
Gradle
javadoc {
options.encoding = 'UTF-8'
}
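The javadoc block above only covers documentation generation; for compilation the analogous Gradle setting (the same property, on the JavaCompile task) is:

compileJava {
    options.encoding = 'UTF-8'
}

Maven

The question also asks about Maven; there the conventional approach is the project.build.sourceEncoding property, which the compiler and resources plugins pick up automatically:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>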