
Is it appropriate to use non-ASCII (natural-language) XML tags?

Is it appropriate to use XML tags (element names) written in non-ASCII natural languages? The XML spec allows it (see Names and Exceptions), but I couldn't find any best practices about this at W3C and related pages.

What I'm looking for is practical advice regarding which tools support this, whether important XML-related technologies such as XSLT and XForms may have problems with it, etc.

I think Andrey and Tomalak are missing the point. XML is not necessarily read only by programmers; it is read by many different professionals. So the arguments comparing it to source code don't necessarily apply.

Let me clarify: I mean a Bulgarian legal domain, where many terms are specific to the Bulgarian legal process, and may not even have exact English translations. Translating them would be laborious, imprecise and impractical. Transliterating to ASCII is suboptimal.

So back to the question: what tool limitations would I face? (Eclipse supports UTF-8, so writing XPath expressions wouldn't be a problem.)

To get people started in the technical direction that I'd like: in several systems we've used generation techniques to ensure perfect correspondence between XML schemas, Java beans and database schemas.

  • Java: this article says that Unicode is ok
  • Oracle: identifiers can contain only alphanumeric characters from your database character set
  • I'd have to check for the tooling we use (JiBX, Dozer, Hibernate, JXPath...)
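To make the Oracle point concrete, here is a minimal sketch in Python that checks whether a generated identifier can be represented in a given character set at all. The helper `fits_charset`, the sample element names, and the choice of Python codec names are assumptions for illustration; Oracle uses its own character set names and has further identifier rules that this check does not cover.

```python
def fits_charset(name: str, charset: str) -> bool:
    """Return True if every character of `name` can be encoded in `charset`."""
    try:
        name.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical element names that a generator would map to column names.
element_names = ["ищец", "ответник", "plaintiff"]

# Report, per candidate character set, which names could not be stored.
for charset in ("cp1251", "iso-8859-1", "utf-8"):
    unsupported = [n for n in element_names if not fits_charset(n, charset)]
    print(charset, "cannot represent:", unsupported)
```

A check like this could run as part of the generation step, so that a mismatch between the XML names and the database character set shows up before any schema is created.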
Vladimir Alexiev asked May 20 '10 11:05



4 Answers

If the content of the documents will be in Bulgarian, then the markup should be able to be in Bulgarian as well.

If your tool chain can't parse the tags in that language then how can you be sure that it is handling the content correctly?

Programmers will always have to learn the language of the target domain, whether it be finance, genetics, engineering or the Bulgarian legal system. Compromising usability for the convenience of the programmer is almost always a 'Bad Thing'. Whatever effort is saved up front ends up getting lost as impeded end user productivity and in support effort/cost over the lifetime of the product.

Matthew S answered Oct 16 '22 10:10



Short answer: You can name your XML elements any way you please.

Slightly longer answer: If you want to use the most portable/maintainable XML, you should probably use ASCII-only element names. I can think of no good reason to use other characters in the element name, and it certainly helps dealing with the XML in all kinds of places.

Think of handling XML nodes with some programming language that does not necessarily have its source code files UTF-8 encoded. You would have a hard time writing working XPath expressions, for example, in such a language. Or think of maintainers/programmers who do not speak the language that your element names are in, but are in charge of the source code. You are kind of locking yourself in when your element names are in Cyrillic script, for example. Element names should carry structure and meaning, and there is no apparent reason that would rule out ASCII for that purpose.
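As a rough illustration of the XPath concern, here is a minimal sketch using Python's `xml.etree.ElementTree`, which supports only a subset of XPath. The document and its element names are hypothetical. It shows the same path written once directly in Cyrillic and once with ASCII-only escape sequences, which is the kind of workaround a non-UTF-8 source file would force.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are illustrative only.
doc = "<дело><ищец>Иван Петров</ищец><ответник>Мария Иванова</ответник></дело>"
root = ET.fromstring(doc)

# Path written directly in Cyrillic (needs a source file encoding that allows it).
direct = root.find("./ищец")

# The same path spelled with ASCII-only escapes, for source files that
# cannot contain Cyrillic characters. Correct, but hard to read and maintain.
escaped = root.find("./\u0438\u0449\u0435\u0446")

print(direct.text)
print(escaped is direct)
```

Both lookups find the same element; the difference is purely in how readable the expression is to whoever maintains the code.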

Tomalak answered Oct 16 '22 11:10



I'm sorry to say this, but if your non-technical users need to read raw XML, your application is broken. And the data you store will not usually have a 1-1 correspondence with user messages, either: many things are stored redundantly in XML, and other things are implicit in the data.

My view is that you should indeed store all your XML data in Bulgarian, using the UTF-8 encoding, but in attributes, not in the XML tag structure.

What I have in mind is this: you could design your program so that any part of the legal structure can be modified freely from the user interface (maybe on a special "admin" panel, but still far from the code), and is in no way hard-coded into the file format. The reason is that laws change, jurisprudence changes, and legal terms may change as well. (Well, some don't.)

This may enable you to create a fairly general file format (think of one that could be used in the US or Japan, too; even if you don't plan to actually do so, thinking that way improves your chances of designing a flexible file format).
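The attribute-based layout described above could look roughly like the following sketch in Python. The generic element names `case` and `party` and the `role` attribute are assumptions made up for illustration, not part of any real schema. The Bulgarian terms live in attribute values, so they are data that can change without touching the tag structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical generic format: ASCII structure, Bulgarian terms as data.
doc = """
<case>
  <party role="ищец">Иван Петров</party>
  <party role="ответник">Мария Иванова</party>
</case>
"""
root = ET.fromstring(doc)

# Select by attribute value instead of by element name.
plaintiff = root.find("./party[@role='ищец']")

# The set of roles is discovered from the data, not hard-coded in the format.
roles = [p.get("role") for p in root.iter("party")]

print(plaintiff.text)
print(roles)
```

The trade-off is that a schema can no longer validate the terms by element name, so that checking moves into the application or into an enumerated attribute type.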

This may be harder. You need to be prepared to handle inconsistent, incomplete or otherwise poor data. But you should be doing this anyway. And you may be rewarded, too: the file format could be cleaner and future-proof, making your software more flexible. Or maybe not. Notice the mays and coulds here. It actually depends on your specific design trade-offs.

And, of course, you need some balance here. At the end of the day, the burden of designing a reliable, flexible system is on you. You may take the approach of writing the tags in Bulgarian. I'm from Brazil, and I find it odd to think about tags written in Portuguese, but it could work.

About your actual concerns on tool limitations: I have no idea. You should first check the documentation of your favorite XML library and see whether it explicitly claims to support non-ASCII names. Even the most widely used programs may not fully support a feature that is rarely used.

user348635 answered Oct 16 '22 09:10



Write your XML in whatever language you like. Make sure that the encoding supports the character set you are using, and that you state the correct encoding in the XML declaration.

That will help to separate tools that support XML from tools that claim to but actually don't.
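A minimal sketch of what this means in practice, using Python's `xml.etree.ElementTree`. The element names are hypothetical, and windows-1251 is chosen here only as an example of a legacy Cyrillic encoding. The document round-trips when the declaration matches the bytes, and a conforming parser rejects it when the declaration is wrong.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with Cyrillic element names.
root = ET.Element("дело")
ET.SubElement(root, "ищец").text = "Иван Петров"

# Serialize in a non-UTF-8 encoding (an assumption for this example);
# ElementTree then writes an XML declaration naming that encoding.
data = ET.tostring(root, encoding="windows-1251")
declaration = data.split(b"\n", 1)[0]
print(declaration)

# A parser that honors the declaration recovers names and content.
parsed = ET.fromstring(data)
print(parsed.tag, parsed[0].text)

# The same bytes labelled with the wrong encoding are not well-formed.
mislabelled = data.replace(b"windows-1251", b"utf-8", 1)
try:
    ET.fromstring(mislabelled)
    rejected = False
except ET.ParseError:
    rejected = True
print("mislabelled document rejected:", rejected)
```

Feeding both byte strings to each tool in a pipeline is a quick way to see which ones actually honor the declared encoding.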

John Saunders answered Oct 16 '22 10:10
