Is there any one-size-fits-all (more or less) way to read a text file in D?
The requirement is that the function would auto-detect the encoding and give me the entire contents of the file in a consistent format, such as a string or a dstring. It should auto-detect BOMs and interpret them appropriately.
I tried std.file.readText(), but it doesn't handle different encodings well.
(Of course, this will have a non-zero failure rate, and that's acceptable for my application.)
I believe that the only real options for file I/O in Phobos at this point (aside from calling C functions) are std.file.readText and std.stdio.File. readText will read in a file as an array of chars, wchars, or dchars (defaulting to immutable(char)[], i.e. string). I believe that the encoding must be UTF-8, UTF-16, or UTF-32 for chars, wchars, and dchars respectively, though I'd have to go digging in the source code to be sure. Any encoding which is compatible with one of those (e.g. ASCII is compatible with UTF-8) should work just fine.
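As a quick sketch of that (the file names here are placeholders), readText's template parameter selects the character type, and hence the expected UTF encoding:

```d
import std.file : readText;

void main()
{
    // Defaults to string, i.e. the file must be valid UTF-8.
    string  s = readText("utf8.txt");

    // Pass wstring or dstring to read UTF-16 or UTF-32 content.
    wstring w = readText!wstring("utf16.txt");
    dstring d = readText!dstring("utf32.txt");
}
```

readText validates the decoded text and throws a UTFException if the file isn't valid in the requested encoding.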
If you use File, then you have several options for reading the file, including readln and rawRead. However, you essentially either read the file in a UTF-8-, UTF-16-, or UTF-32-compatible encoding just like with readText, or you read it in as binary data and manipulate it yourself.
Since the character types in D are char, wchar, and dchar, which are UTF-8, UTF-16, and UTF-32 code units respectively, unless you want to read the data in binary format, the file is going to have to be in an encoding compatible with one of those three Unicode encodings. Given a string in a particular encoding, you can convert it to another encoding using the functions in std.utf. However, I'm not aware of any way to query a file for its encoding other than using readText to try to read it in a given encoding and seeing whether it succeeds.
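For the conversion part, a minimal sketch using std.utf's toUTF8/toUTF16/toUTF32 helpers:

```d
import std.utf : toUTF8, toUTF16, toUTF32;

void main()
{
    string  s = "héllo";       // UTF-8 source text
    wstring w = s.toUTF16();   // re-encode as UTF-16
    dstring d = s.toUTF32();   // re-encode as UTF-32
    string  t = d.toUTF8();    // and back to UTF-8
    assert(t == s);            // round-trip preserves the text
}
```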
So, unless you want to process the file yourself and determine its encoding on the fly, your best bet is probably to try readText with each string type in turn, using the first one that succeeds. However, since text files are normally in UTF-8 or a UTF-8-compatible encoding, I would expect readText with a plain string to work just fine almost all the time.
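That fallback strategy could be sketched like this (readAnyText is a hypothetical helper name; it relies on readText throwing a UTFException when the file isn't valid in the requested encoding):

```d
import std.file : readText;
import std.utf : UTFException, toUTF32;

/// Try UTF-8, then UTF-16, then UTF-32, returning the
/// contents as a dstring for a uniform result type.
dstring readAnyText(string path)
{
    try return readText!string(path).toUTF32;
    catch (UTFException) {}

    try return readText!wstring(path).toUTF32;
    catch (UTFException) {}

    // Last attempt; let any exception propagate to the caller.
    return readText!dstring(path);
}
```

Note that this cannot distinguish encodings reliably: many byte sequences are valid in more than one encoding, which is the "non-zero failure rate" the question already accepts.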
As for checking the BOM:
char[] ConvertViaBOM(ubyte[] data)
{
    // Stubs: each decodes the buffer (minus its BOM, if any)
    // from the named encoding.
    char[] UTF8()    { /*...*/ return null; }
    char[] UTF16LE() { /*...*/ return null; }
    char[] UTF16BE() { /*...*/ return null; }
    char[] UTF32LE() { /*...*/ return null; }
    char[] UTF32BE() { /*...*/ return null; }

    if (data.length == 0) return null; // nothing to decode

    switch (data.length)
    {
    default:
    case 4:
        if (data[0 .. 4] == [cast(ubyte)0x00, 0x00, 0xFE, 0xFF]) return UTF32BE();
        if (data[0 .. 4] == [cast(ubyte)0xFF, 0xFE, 0x00, 0x00]) return UTF32LE();
        goto case 3;
    case 3:
        if (data[0 .. 3] == [cast(ubyte)0xEF, 0xBB, 0xBF]) return UTF8();
        goto case 2;
    case 2:
        if (data[0 .. 2] == [cast(ubyte)0xFE, 0xFF]) return UTF16BE();
        if (data[0 .. 2] == [cast(ubyte)0xFF, 0xFE]) return UTF16LE();
        goto case 1;
    case 1:
        return UTF8(); // no BOM: assume UTF-8
    }
}
Adding more obscure BOMs is left as an exercise for the reader.
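To feed the function above, you would read the file as raw bytes first (input.txt is a placeholder; this assumes the decoder stubs are filled in):

```d
import std.file : read;

void main()
{
    // std.file.read returns void[]; view it as raw bytes.
    auto bytes = cast(ubyte[]) read("input.txt");

    // Dispatch on the BOM to get decoded text back.
    char[] text = ConvertViaBOM(bytes);
}
```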