Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do you get Matlab to write the BOM (byte order markers) for UTF-16 text files?

I am creating UTF16 text files with Matlab, which I am later reading in using Java. In Matlab, I open a file called fileName and write to it as follows:

fid = fopen(fileName, 'w','n','UTF16-LE');
fprintf(fid,"Some stuff.");

In Java, I can read the text file using the following code:

FileInputStream fileInputStream = new FileInputStream(fileName);
Scanner scanner = new Scanner(fileInputStream, "UTF-16LE"); 
String s = scanner.nextLine();

Here is the hex output:

Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13
00000000  73 00 6F 00 6D 00 65 00 20 00 73 00 74 00 75 00 66 00 66 00  s.o.m.e. .s.t.u.f.f.

The above approach works fine. But, I want to be able to write out the file using UTF16 with a BOM to give me more flexibility so that I don't have to worry about big or little endian. In Matlab, I've coded:

fid = fopen(fileName, 'w','n','UTF16');
fprintf(fid,"Some stuff.");

In Java, I change the code to:

FileInputStream fileInputStream = new FileInputStream(fileName);
Scanner scanner = new Scanner(fileInputStream, "UTF-16");
String s = scanner.nextLine();

In this case, the string s is garbled, because Matlab is not writing the BOM. I can get the Java code to work just fine if I add the BOM manually. With the added BOM, the following file works fine.

Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15
00000000  FF FE 73 00 6F 00 6D 00 65 00 20 00 73 00 74 00 75 00 66 00 66 00  ÿþs.o.m.e. .s.t.u.f.f.

How can I get Matlab to write out the BOM? I know I could write the BOM out separately, but I'd rather have Matlab do it automatically.

Addendum

I selected the answer below from Amro because it exactly solves the question I posed.

One key discovery for me was the difference between the Unicode Standard and a UTF (Unicode transformation format) (see http://unicode.org/faq/utf_bom.html). The Unicode Standard provides unique identifiers (code points) for characters. UTFs provide mappings of every code point "to a unique byte sequence." Since all but a handful of the characters I am using are in the first 128 code points, I'm going to switch to using UTF-8 as Romeo suggests. UTF-8 is supported by Matlab (The warning shown below won't need to be suppressed.) and Java, and for my application will generate smaller text files.

I suppress the Matlab warning

Warning: The encoding 'UTF-16LE' is not supported.

with

warning off MATLAB:iofun:UnsupportedEncoding;
like image 962
Richard Povinelli Avatar asked Nov 08 '11 15:11

Richard Povinelli


People also ask

Does UTF-16 require BOM?

In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a ZWNBSP.

How do I add UTF-8 to BOM?

To Add BOM to a UTF-8 file, we can directly write Unicode \ufeff or three bytes 0xEF , 0xBB , 0xBF at the beginning of the UTF-8 file. The Unicode \ufeff represents 0xEF , 0xBB , 0xBF , read this. 1.1 The below example, write a BOM to a UTF-8 file /home/mkyong/file. txt .

Does UTF-8 byte have an order mark?

UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB FF ) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.

Does UTF-8 require a byte order mark to indicate Endianness?

little endian storage. The byte-order mark indicates which order is used, so that applications can immediately decode the content. In the UTF-8 encoding, the presence of the BOM is not essential because, unlike the UTF-16 encodings, there is no alternative sequence of bytes in a character.


2 Answers

On my system MATLAB reports that UTF-16 is not supported. I think it will be safer to use UTF-8. Besides, UTF-8 will solve your problem with Little Endian/Big Endian.

like image 146
vharavy Avatar answered Oct 07 '22 21:10

vharavy


Try the following code (I am using UNICODE2NATIVE and NATIVE2UNICODE functions to do the conversions):

%# convert string and write as bytes
str = 'Some stuff.';
b = unicode2native(str,'UTF-16');
fid = fopen('utf16.txt','wb');
fwrite(fid, b, '*uint8');
fclose(fid);

We can even check the hex values of the bytes written (first two being the BOM):

>> cellstr(dec2hex(b))'
ans = 
  Columns 1 through 10
    'FF'    'FE'    '53'    '00'    '6F'    '00'    '6D'    '00'    '65'    '00'
  Columns 11 through 20
    '20'    '00'    '73'    '00'    '74'    '00'    '75'    '00'    '66'    '00'
  Columns 21 through 24
    '66'    '00'    '2E'    '00'

>> char(b)
ans =
ÿþS o m e   s t u f f . 

Now we can read the created file using MATLAB's own methods:

%# read bytes and convert back to Unicode string
fid = fopen('utf16.txt', 'rb');
b = fread(fid, '*uint8')';          %'
fclose(fid);
str = native2unicode(b,'UTF-16')

Or use Java methods directly if you prefer:

scanner = java.util.Scanner(java.io.FileInputStream('utf16.txt'), 'UTF-16');
str = scanner.nextLine()
scanner.close()

both should read the string correctly...

like image 24
Amro Avatar answered Oct 07 '22 21:10

Amro