I am creating UTF16 text files with Matlab, which I am later reading in using Java. In Matlab, I open a file called fileName and write to it as follows:
fid = fopen(fileName, 'w','n','UTF16-LE');
fprintf(fid,'Some stuff.');
In Java, I can read the text file using the following code:
FileInputStream fileInputStream = new FileInputStream(fileName);
Scanner scanner = new Scanner(fileInputStream, "UTF-16LE"); 
String s = scanner.nextLine();
Here is the hex output:
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13
00000000  73 00 6F 00 6D 00 65 00 20 00 73 00 74 00 75 00 66 00 66 00  s.o.m.e. .s.t.u.f.f.
The above approach works fine. However, I want to write the file as UTF-16 with a BOM, so that I don't have to worry about big or little endian when reading it. In Matlab, I've coded:
fid = fopen(fileName, 'w','n','UTF16');
fprintf(fid,'Some stuff.');
In Java, I change the code to:
FileInputStream fileInputStream = new FileInputStream(fileName);
Scanner scanner = new Scanner(fileInputStream, "UTF-16");
String s = scanner.nextLine();
In this case, the string s is garbled, because Matlab is not writing the BOM. I can get the Java code to work just fine if I add the BOM manually. With the added BOM, the following file works fine.
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15
00000000  FF FE 73 00 6F 00 6D 00 65 00 20 00 73 00 74 00 75 00 66 00 66 00  ÿþs.o.m.e. .s.t.u.f.f.
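To illustrate why the BOM matters on the Java side, here is a minimal sketch. Java's "UTF-16" charset uses a leading BOM to pick the byte order and falls back to big-endian when there is none, which would explain the garbled string. The class name Utf16BomDemo and the shortened byte arrays (just "some") are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf16BomDemo {
    // "some" in little-endian UTF-16 without a BOM (hypothetical sample data).
    static final byte[] NO_BOM = {0x73, 0x00, 0x6F, 0x00, 0x6D, 0x00, 0x65, 0x00};

    // The same bytes preceded by the little-endian BOM FF FE.
    static final byte[] WITH_BOM = {
        (byte) 0xFF, (byte) 0xFE, 0x73, 0x00, 0x6F, 0x00, 0x6D, 0x00, 0x65, 0x00
    };

    // Decode with the generic "UTF-16" charset, which relies on the BOM
    // and assumes big-endian when no BOM is present.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // With the BOM, the little-endian bytes decode to the intended text.
        System.out.println(decodeUtf16(WITH_BOM).equals("some"));
        // Without it, each byte pair is read in the wrong order.
        System.out.println(decodeUtf16(NO_BOM).equals("some"));
    }
}
```

Without the BOM, the byte pair 73 00 is read as the single code unit U+7300 rather than U+0073 ('s'), so the whole line turns into unrelated characters.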
How can I get Matlab to write out the BOM? I know I could write the BOM out separately, but I'd rather have Matlab do it automatically.
Addendum
I selected the answer below from Amro because it exactly solves the question I posed.
One key discovery for me was the difference between the Unicode Standard and a UTF (Unicode transformation format); see http://unicode.org/faq/utf_bom.html. The Unicode Standard provides unique identifiers (code points) for characters. UTFs provide mappings of every code point "to a unique byte sequence." Since all but a handful of the characters I am using are in the first 128 code points, I'm going to switch to UTF-8, as Romeo suggests. UTF-8 is supported by both Matlab and Java, so the warning shown below won't need to be suppressed, and for my application it will generate smaller text files.
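The size difference for mostly-ASCII text can be checked with a small Java sketch. The class and method names here are hypothetical; UTF-16LE is used for the comparison so that no BOM is counted.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {
    // Number of bytes needed to store the text as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to store the text as UTF-16LE (no BOM added).
    static int utf16LeLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String text = "Some stuff.";
        System.out.println("UTF-8:    " + utf8Length(text) + " bytes");
        System.out.println("UTF-16LE: " + utf16LeLength(text) + " bytes");
    }
}
```

For the 11-character sample string, UTF-8 needs one byte per character, while UTF-16LE needs two.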
I suppress the Matlab warning
Warning: The encoding 'UTF-16LE' is not supported.
with
warning off MATLAB:iofun:UnsupportedEncoding;
In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a ZWNBSP.
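Java's decoders appear to follow the same rule, which the sketch below tries to show: when the stream is explicitly decoded as UTF-16LE, a leading FF FE is kept as the character U+FEFF, whereas the generic UTF-16 charset consumes it as a BOM. The class name and sample bytes are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitEndianDemo {
    // FF FE followed by "some" in little-endian UTF-16 (hypothetical sample data).
    static final byte[] BYTES = {
        (byte) 0xFF, (byte) 0xFE, 0x73, 0x00, 0x6F, 0x00, 0x6D, 0x00, 0x65, 0x00
    };

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String generic = decode(BYTES, StandardCharsets.UTF_16);
        String explicit = decode(BYTES, StandardCharsets.UTF_16LE);

        // Generic UTF-16: the BOM selects the byte order and is not part of the text.
        System.out.println(generic.length());
        // Explicit UTF-16LE: U+FEFF stays in the string as its first character.
        System.out.println(explicit.length() + " " + (explicit.charAt(0) == '\uFEFF'));
    }
}
```

So a file that carries a BOM should be read with "UTF-16", and a file without one with "UTF-16LE" or "UTF-16BE"; mixing the two leaves a stray U+FEFF at the start of the first line.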
To add a BOM to a UTF-8 file, write the Unicode character \ufeff, or equivalently the three bytes 0xEF, 0xBB, 0xBF, at the beginning of the file. Encoded as UTF-8, \ufeff is exactly those three bytes.
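A minimal Java sketch of that idea follows. The class Utf8BomWriter, its methods, and the temporary file are hypothetical names chosen for illustration, not code from any of the posts above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8BomWriter {
    // Prepend U+FEFF and encode as UTF-8; the result starts with EF BB BF.
    static byte[] encodeWithBom(String text) {
        return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_8);
    }

    // Write the BOM-prefixed bytes to the given file.
    static void writeWithBom(Path path, String text) throws IOException {
        Files.write(path, encodeWithBom(text));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        writeWithBom(tmp, "Some stuff.");

        // Print the first three bytes of the file in hex.
        byte[] bytes = Files.readAllBytes(tmp);
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF);
        Files.delete(tmp);
    }
}
```

Note that many UTF-8 readers, Java's included, do not strip this signature automatically, so the consumer may need to skip a leading U+FEFF itself.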
UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.
UTF-16 text can be stored in either big-endian or little-endian byte order. The byte-order mark indicates which order is used, so that applications can immediately decode the content. In UTF-8, the BOM is not essential because, unlike in the UTF-16 encodings, there is only one possible byte sequence for each character.
On my system MATLAB reports that UTF-16 is not supported. I think it will be safer to use UTF-8. Besides, UTF-8 will solve your problem with Little Endian/Big Endian.
Try the following code (I am using UNICODE2NATIVE and NATIVE2UNICODE functions to do the conversions):
%# convert string and write as bytes
str = 'Some stuff.';
b = unicode2native(str,'UTF-16');
fid = fopen('utf16.txt','wb');
fwrite(fid, b, 'uint8');
fclose(fid);
We can even check the hex values of the bytes written (first two being the BOM):
>> cellstr(dec2hex(b))'
ans = 
  Columns 1 through 10
    'FF'    'FE'    '53'    '00'    '6F'    '00'    '6D'    '00'    '65'    '00'
  Columns 11 through 20
    '20'    '00'    '73'    '00'    '74'    '00'    '75'    '00'    '66'    '00'
  Columns 21 through 24
    '66'    '00'    '2E'    '00'
>> char(b)
ans =
ÿþS o m e   s t u f f . 
Now we can read the created file using MATLAB's own methods:
%# read bytes and convert back to Unicode string
fid = fopen('utf16.txt', 'rb');
b = fread(fid, '*uint8')';          %'
fclose(fid);
str = native2unicode(b,'UTF-16')
Or use Java methods directly if you prefer:
scanner = java.util.Scanner(java.io.FileInputStream('utf16.txt'), 'UTF-16');
str = scanner.nextLine()
scanner.close()
Both should read the string correctly.