Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to internationalize java source code?

EDIT: I completely re-wrote the question since it seems like I was not clear enough in my first two versions. Thanks for the suggestions so far.

I would like to internationalize the source code for a tutorial project (please notice, not the runtime application). Here is an example (in Java):

/** A comment */
public String doSomething() {
  System.out.println("Something was done successfully");
}

in English , and then have the French version be something like:

/** Un commentaire */
public String faitQuelqueChose() {
  System.out.println("Quelque chose a été fait avec succès.");
}

and so on. And then have something like a properties file somewhere to edit these translations with usual tools, such as:

com.foo.class.comment1=A comment
com.foo.class.method1=doSomething
com.foo.class.string1=Something was done successfully

and for other languages:

com.foo.class.comment1=Un commentaire
com.foo.class.method1=faitQuelqueChose
com.foo.class.string1=Quelque chose a été fait avec succès.

I am trying to find the easiest, most efficient and unobtrusive way to do this with the least amount of manual grunt work (other than obviously translating the actual text). Preferably working under Eclipse. For example, the original code would be written in English, then externalized (to properties, preferably leaving the original source untouched), translated (humanly) and then re-generated (as a separate source file / project).

Some trails I have found (other than what AlexS suggested):

  • AntLR, a language parser / generator. There seems to be a supporting Eclipse plugin
  • Using Eclipse's AST (Abstract Syntax Tree) and I guess building some kind of plugin.

I am just surprised there isn't a tool out there that does this already.

like image 669
deryb Avatar asked Jun 19 '12 15:06

deryb


2 Answers

I'd use unique strings as methodnames (or anything you want to be replaced by localized versions.

public String m37hod_1() {
  System.out.println(m355a6e_1);
}

then I'd define a propertyfile for each language like this:

m37hod_1=doSomething
m355a6e_1="Something was done successfully"

And then I'd write a small program parsing the sourcefiles and replacing the strings. So everything just outside eclipse.

Or I'd use the ant task Replace and propertyfiles as well, instead of a standalone translation program. Something like that:

<replace 
    file="${src}/*.*"
    value="defaultvalue"
    propertyFile="${language}.properties">
  <replacefilter 
    token="m37hod_1" 
    property="m37hod_1"/>
  <replacefilter 
    token="m355a6e_1" 
    property="m355a6e_1"/>
</replace>

Using one of these methods you won't have to explain anything about localization in your tutorials (except you want to), but can concentrate on your real topic.

like image 79
AlexS Avatar answered Sep 24 '22 18:09

AlexS


What you want is a massive code change engine.

ANTLR won't do the trick; ASTs are necessary but not sufficient. See my essay on Life After Parsing. Eclipse's "AST" may be better, if the Eclipse package provides some support for name and type resolution; otherwise you'll never be able to figure out how to replace each "doSomething" (might be overloaded or local), unless you are willing to replace them all identically (and you likely can't do that, because some symbols refer to Java library elements).

Our DMS Software Reengineering Toolkit could be used to accomplish your task. DMS can parse Java to ASTs (including comment capture), traverse the ASTs in arbitrary ways, analyze/change ASTs, and the export modified ASTs as valid source code (including the comments).

Basically you want to enumerate all comments, strings, and declarations of identifiers, export them to an external "database" to be mapped (manually? by Google Translate?) to an equivalent. In each case you want to note not only the item of interest, but its precise location (source file, line, even column) because items that are spelled identically in the original text may need different spellings in the modified text.

Enumeration of strings is pretty easy if you have the AST; simply crawl the tree and look for tree nodes containing string literals. (ANTLR and Eclipse can surely do this, too).

Enumeration of comments is also straightforward if the parser you have captures comments. DMS does. I'm not quite sure if ANTLR's Java grammar does, or the Eclipse AST engine; I suspect they are both capable.

Enumeration of declarations (classes, methods, fields, locals) is relatively straightforward; there's rather more cases to worry about (e.g., anonymous classes containing extensions to base classes). You can code a procedure to walk the AST and match the tree structures, but here's the place that DMS starts to make a difference: you can write surface-syntax patterns that look like the source code you want to match. For instance:

   pattern local_for_loop_index(i: IDENTIFIER, t: type, e: expression, e2: expression, e3:expression): for_loop_header
         = "for (\t \i = \e,\e2,\e3)"

will match declarations of local for loop variables, and return subtrees for the IDENTIFIER, the type, and the various expressions; you'd want to capture just the identifier (and its location, easily done by taking if from the source position information that DMS stamps on every tree node). You'd probably need 10-20 such patterns to cover the cases of all the different kinds of identifiers.

Capture step completed, something needs to translate all the captured entities to your target language. I'll leave that to you; what's left is to put the translated entities back.

The key to this is the precise source location. A line number isn't good enough in practice; you may have several translated entities in the same line, in the worst case, some with different scopes (imagine nested for loops for example). The replacement process for comments, strings and the declarations are straightforward; rescan the tree for nodes that match any of the identified locations, and replace the entity found there with its translation. (You can do this with DMS and ANTLR. I think Eclipse ADT requires you generate a "patch" but I guess that would work.).

The fun part comes in replacing the identifier uses. For this, you need to know two things:

  • for any use of an identifier, what is the declaration is uses; if you know this, you can replace it with the new name for the declaration; DMS provides full name and type resolution as well as a usage list, making this pretty easy, and
  • Do renamed identifiers shadow one another in scopes differently than the originals? This is harder to do in general. However, for the Java language, we have a "shadowing" check, so you can at least decide after renaming that you have an issues. (There's even a renaming procedure that can be used to resolve such shadowing conflicts

After patching the trees, you simply rewrite the patched tree back out as a source file using DMS's built-in prettyprinter. I think Eclipse AST can write out its tree plus patches. I'm not sure ANTLR provides any facilities for regenerating source code from ASTs, although somebody may have coded one for the Java grammar. This is harder to do than it sounds, because of all the picky detail. YMMV.

Given your goal, I'm a little surprised that you don't want a sourcefile "foo.java" containing "class foo { ... }" to get renamed to .java. This would require not only writing the transformed tree to the translated file name (pretty easy) but perhaps even reconstructing the directory tree (DMS provides facilities for doing directory construction and file copies, too).

If you want to do this for many languages, you'd need to run the process once per language. If you wanted to do this just for strings (the classic internationalization case), you'd replace each string (that needs changing, not all of them do) by a call on a resource access with a unique resource id; a runtime table would hold the various strings.

like image 33
Ira Baxter Avatar answered Sep 25 '22 18:09

Ira Baxter