
Text encoding in ID3v2.3 tags

Thanks to this site and a few others, I've created some simple code to read ID3v2.3 tags from MP3 files. Doing so has been a great learning experience as I previously had no knowledge of hex / byte / binary etc.

I can successfully read data, but I have come across an issue that I believe has to do with the encoding used. I've realized that text frames have a byte at the beginning of the 'text' that describes the encoding used, and potentially more information in the next 2 bytes...

Example: the data from frame TIT2 starts with the byte $03 (hex) before the actual text. This text displays correctly using Encoding.ASCII.GetString, albeit with an additional character at the beginning.

In another MP3, the data from TIT2 starts with $01 followed by $FF $FE, which I believe has to do with Unicode? The text itself is broken up, though: there is a $00 between every text character, and this stops the data from being displayed in Windows Forms (as soon as a $00 is encountered, the text just stops, so I get the first character and that's it). I've tried using Encoding.Unicode.GetString, but that just seems to return gibberish.

Printing this data to a console seems to work, with spaces between each char, so the reading of the data is working properly.

I've been reading the official documentation for ID3v2.3 but I guess I'm just not clued-up enough to understand the text encoding section.

Any replies or links to articles that may be of help would be much appreciated!

Regards Ross

phanteh asked Mar 25 '12 03:03



2 Answers

Just to add one more comment, these are the values of the text encoding byte:

00 – ISO-8859-1 (Latin-1, a superset of ASCII).

01 – UCS-2 (UTF-16 encoded Unicode with BOM), in ID3v2.2 and ID3v2.3.

02 – UTF-16BE encoded Unicode without BOM, in ID3v2.4.

03 – UTF-8 encoded Unicode, in ID3v2.4.

from: http://en.wikipedia.org/wiki/ID3
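The table above can be turned into a small lookup. This is only a minimal sketch in Python rather than .NET (the rough .NET counterparts would be Encoding.Unicode for UTF-16LE, Encoding.BigEndianUnicode for UTF-16BE and Encoding.UTF8). The names `ID3_TEXT_CODECS` and `decode_text_frame` are made up for illustration, and the sketch assumes `body` holds only the frame content (encoding byte plus text), with any trailing $00 terminators trimmed.

```python
# Map the ID3v2 text-encoding byte to a Python codec name.
ID3_TEXT_CODECS = {
    0x00: "latin-1",    # ISO-8859-1
    0x01: "utf-16",     # UTF-16 with BOM; this codec reads and drops the BOM
    0x02: "utf-16-be",  # UTF-16BE without BOM (ID3v2.4)
    0x03: "utf-8",      # UTF-8 (ID3v2.4)
}


def decode_text_frame(body: bytes) -> str:
    """Decode a text frame body: one encoding byte, then the encoded text.

    Hypothetical helper; `body` must not include the frame header.
    """
    if not body:
        return ""
    codec = ID3_TEXT_CODECS.get(body[0])
    if codec is None:
        raise ValueError(f"unknown ID3 text encoding byte: {body[0]:#04x}")
    # Skip the encoding byte before decoding, then drop trailing terminators
    # (an assumption of this sketch, not a full treatment of the spec).
    return body[1:].decode(codec).rstrip("\x00")


if __name__ == "__main__":
    print(decode_text_frame(b"\x01\xff\xfeH\x00i\x00"))
    print(decode_text_frame(b"\x03Hi"))
```

The important detail is `body[1:]`: the encoding byte is metadata and should never reach the decoder.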

houqp answered Nov 04 '22 13:11



> Data from frame TIT2 starts with the byte $03 (hex) before the actual text. This text displays correctly, albeit with an additional character at the beginning, using Encoding.ASCII.GetString

Encoding 0x03 is UTF-8, so you should use Encoding.UTF8.GetString. The character at the beginning may be U+FEFF Byte Order Mark, which is used to distinguish between UTF-16LE and UTF-16BE... it's no use for UTF-8, but Windows tools love to put it there anyway.

UTF-8 is an ID3v2.4 feature not present in 2.3, which may be why you can't find it in the spec. In the real world you will find all sorts of total nonsense in ID3 tags regardless of version.
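As a rough illustration of the $03 case, here is a sketch in Python (not .NET) that skips the encoding byte and then lets the `utf-8-sig` codec drop a leading BOM if one is present. The helper name `decode_utf8_frame` and the sample bytes are invented for the example. Note that the encoding byte itself is another possible source of a stray leading character, if the whole frame body is handed to an ASCII decoder.

```python
def decode_utf8_frame(body: bytes) -> str:
    """Decode a text frame body whose encoding byte is $03 (UTF-8).

    Hypothetical helper for illustration only.
    """
    if body[:1] != b"\x03":
        raise ValueError("frame body does not start with encoding byte $03")
    # "utf-8-sig" decodes UTF-8 and removes a leading BOM (EF BB BF) if present.
    return body[1:].decode("utf-8-sig")


# Made-up sample: encoding byte, UTF-8 BOM, then plain ASCII text.
sample = b"\x03" + b"\xef\xbb\xbf" + b"Song"

# Decoding the whole body as ASCII fails on the BOM bytes; with replacement
# it "works" but keeps the encoding byte as a control character up front.
naive = sample.decode("ascii", errors="replace")

if __name__ == "__main__":
    print(repr(naive))
    print(repr(decode_utf8_frame(sample)))
```

Plain ASCII text decodes identically under ASCII and UTF-8, which is why the ASCII attempt looks almost right until a non-ASCII character turns up.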

> data from TIT2 starts $01 and is followed by $FF $FE, which I believe is to do with Unicode? The text itself is broken up though, there are $00 between every text character,

That's UTF-16LE, the text-to-byte encoding that Windows misleadingly calls “Unicode”. It is made up of two-byte code units, so the characters in the range U+0000–U+00FF come out as the low byte of the same number, followed by a zero high byte. The 0xFF 0xFE prefix is a correctly used Byte Order Mark. Encoding.Unicode.GetString should return a correct string from this. Could you post some code?
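To make the byte layout concrete, here is a small Python sketch (again not .NET) that builds a frame body like the one in the question and decodes it by inspecting the BOM. The helper `decode_utf16_with_bom` is hypothetical. The last lines show one plausible cause of the "gibberish", assuming the encoding byte was passed to the decoder along with the text: every two-byte unit is then shifted by one byte.

```python
import codecs


def decode_utf16_with_bom(data: bytes) -> str:
    """Decode UTF-16 text that starts with a BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE: little-endian
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF: big-endian
        return data[2:].decode("utf-16-be")
    raise ValueError("no UTF-16 byte order mark found")


# Frame body as described in the question: $01, then $FF $FE, then the text.
body = b"\x01\xff\xfe" + "Hi".encode("utf-16-le")

# Correct: skip the encoding byte, then decode from the BOM onwards.
good = decode_utf16_with_bom(body[1:])

# Misaligned: decoding from offset 0 pairs up the wrong bytes.
shifted = body.decode("utf-16-le", errors="replace")

if __name__ == "__main__":
    print(body.hex(" "))  # 01 ff fe 48 00 69 00
    print(repr(good))
    print(repr(shifted))
```

The hex dump shows the pattern from the question: each ASCII character followed by a $00 byte.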

> Printing this data to a console seems to work

Getting non-ASCII characters to print on the Windows console can be a trial, so if you hit problems bear in mind they may be caused by the print operation itself.

For completeness, encoding 0x02 is UTF-16BE without a BOM (there is little reason for this to exist and I have never met this in the wild at all), and encoding 0x00 is supposed to be ISO-8859-1, but in reality could be pretty much any ASCII-superset encoding, more likely a Windows ‘ANSI’ code page like Encoding.GetEncoding(1252) than a standard like 8859-1.
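One way to act on that advice for encoding $00, sketched in Python under the assumption that Windows-1252 is the most likely real encoding: try `cp1252` first and fall back to strict ISO-8859-1 only when the data contains one of the few bytes that Windows-1252 leaves undefined. The function name `decode_legacy_text` is invented for this example, and the choice of code page is a guess that will be wrong for tags written on non-Western systems.

```python
def decode_legacy_text(data: bytes) -> str:
    """Decode text tagged as encoding $00 (hypothetical helper).

    Prefers Windows-1252, which differs from ISO-8859-1 only in the
    0x80-0x9F range (curly quotes, dashes, the euro sign, and so on).
    """
    try:
        return data.decode("cp1252")
    except UnicodeDecodeError:
        # Bytes such as 0x81 are undefined in Python's cp1252 codec;
        # latin-1 maps every byte to a code point, so this cannot fail.
        return data.decode("latin-1")


if __name__ == "__main__":
    # 0x92 is a right single quotation mark in Windows-1252,
    # but an invisible control character in ISO-8859-1.
    print(decode_legacy_text(b"Don\x92t Stop"))
    print(decode_legacy_text(b"caf\xe9"))
```

For bytes outside 0x80-0x9F the two encodings agree, so the fallback only changes the result for the handful of undefined Windows-1252 bytes.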

bobince answered Nov 04 '22 15:11
