Read Text File in D

Tags:

phobos

Is there any one-size-fits-all (more or less) way to read a text file in D?

The requirement is that the function would auto-detect the encoding and give me the entire data of the file in a consistent format, like a string or a dstring. It should auto-detect BOMs and interpret them as appropriate.

I tried std.file.readText() but it doesn't handle different encodings well.

(Of course, this will have a non-zero failure rate, and that's acceptable for my application.)

asked Jan 17 '11 21:01

user541686

2 Answers

I believe the only real options for file I/O in Phobos at this point (aside from calling C functions) are std.file.readText and std.stdio.File. readText reads a file in as an array of chars, wchars, or dchars (defaulting to immutable(char)[], i.e. string). I believe the encoding must be UTF-8, UTF-16, or UTF-32 for chars, wchars, and dchars respectively, though I'd have to go digging in the source code to be sure. Any encoding that is compatible with one of those (e.g. ASCII is compatible with UTF-8) should work just fine.
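As a rough illustration of the readText behaviour described above, here is a minimal sketch. The file names are made up for the example, and it assumes the UTF-32 file is stored in the machine's native byte order without a BOM.

```d
import std.file : readText, remove, write;

void main()
{
    // Hypothetical file names, used only for this illustration.
    enum path8  = "sample_utf8.txt";
    enum path32 = "sample_utf32.txt";

    write(path8, "h\u00E9llo");     // string literals are UTF-8 code units
    scope (exit) remove(path8);
    write(path32, "h\u00E9llo"d);   // raw UTF-32 code units, native byte order, no BOM
    scope (exit) remove(path32);

    string  s = readText(path8);            // element type defaults to char
    dstring d = readText!dstring(path32);   // same call, asking for dchars

    assert(s == "h\u00E9llo");
    assert(d == "h\u00E9llo"d);
}
```

The template argument picks the code unit type; the file itself has to already be in the matching encoding.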

If you use File, then you have several functions to read the file with, including readln and rawRead. Either way, you essentially read the file in using a UTF-8, UTF-16, or UTF-32 compatible encoding just like with readText, or you read it in as binary data and manipulate it yourself.
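The two File routes mentioned above might look roughly like this. It is only a sketch: the file name is invented, and the file is written in binary mode so the byte count is the same on every platform.

```d
import std.file : remove;
import std.stdio : File;

void main()
{
    enum path = "sample_lines.txt";   // hypothetical file name
    {
        auto f = File(path, "wb");
        f.write("first\nsecond\n");
    }   // the File is closed when it goes out of scope
    scope (exit) remove(path);

    // Text route: readln fills the buffer with one line, terminator included,
    // and returns 0 at end of file.
    auto text = File(path, "r");
    char[] line;
    size_t lines;
    while (text.readln(line) > 0)
        ++lines;
    text.close();
    assert(lines == 2);

    // Binary route: rawRead fills the buffer and returns the slice it actually read.
    auto bin = File(path, "rb");
    auto buf = new ubyte[](cast(size_t) bin.size);
    auto got = bin.rawRead(buf);
    bin.close();
    assert(got.length == buf.length);
}
```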

Since the character types in D are char, wchar, and dchar, which are UTF-8, UTF-16, and UTF-32 code units respectively, the file has to be encoded in something compatible with one of those three Unicode encodings unless you want to read the data in binary format. Given a string in a particular encoding, you can convert it to another encoding using the functions in std.utf. However, I'm not aware of any way to query a file for its encoding type other than using readText to try to read the file in a given encoding and see if it succeeds.
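For the std.utf conversions mentioned above, a small sketch follows. The sample strings are arbitrary; validate is shown because it is a convenient way to check whether a buffer is well-formed in a given encoding.

```d
import std.utf : toUTF8, toUTF16, toUTF32, UTFException, validate;

void main()
{
    string  s8  = "na\u00EFve \u2603";   // UTF-8 code units
    wstring s16 = toUTF16(s8);           // UTF-16 code units
    dstring s32 = toUTF32(s8);           // UTF-32 code units, one per code point

    assert(s8.length == 10);             // two multi-byte characters
    assert(s16.length == 7);
    assert(s32.length == 7);
    assert(toUTF8(s16) == s8);           // round trips back to the same text
    assert(toUTF8(s32) == s8);

    // validate throws UTFException when the code units are not well-formed.
    char[] bad = [cast(char) 0xC3, cast(char) 0x28];   // lead byte without a continuation
    bool threw = false;
    try
    {
        validate(bad);
    }
    catch (UTFException)
    {
        threw = true;
    }
    assert(threw);
}
```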

So, unless you want to process a file yourself and determine on the fly what encoding it's in, your best bet is probably to just use readText with each consecutive string type, using the first one which succeeds. However, since text files are normally in UTF-8 or a UTF-8 compatible encoding, I would expect that readText used with a normal string would almost always work just fine.
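A minimal sketch of that try-each-type approach is below. readTextAnyUtf and the file name are hypothetical, and it assumes a Phobos version in which readText reports a mismatched encoding as a UTFException. Keep in mind that this is a heuristic: BOM-less UTF-16 containing only ASCII characters is also valid UTF-8 (with embedded NULs), so the first attempt can succeed on the wrong encoding.

```d
import std.algorithm.searching : startsWith;
import std.file : readText, remove, write;
import std.utf : toUTF8, UTFException;

/// Hypothetical helper: try UTF-8, then UTF-16, then UTF-32, and return UTF-8.
string readTextAnyUtf(string path)
{
    string result;
    try
    {
        result = readText!string(path);
    }
    catch (UTFException)
    {
        try
        {
            result = toUTF8(readText!wstring(path));
        }
        catch (UTFException)
        {
            result = toUTF8(readText!dstring(path));   // last attempt, may still throw
        }
    }
    // Drop a leading byte order mark if one survived the read.
    if (result.startsWith("\uFEFF"))
        result = result["\uFEFF".length .. $];
    return result;
}

void main()
{
    enum path = "sample_utf16.txt";   // hypothetical file name
    write(path, "\uFEFFhi"w);         // BOM plus UTF-16 in native byte order
    scope (exit) remove(path);
    assert(readTextAnyUtf(path) == "hi");
}
```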

answered Sep 19 '22 13:09

Jonathan M Davis

As for checking the BOM:

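char[] ConvertViaBOM(ubyte[] data) {
  // Decoder bodies are elided; each one should return data converted to UTF-8.
  char[] UTF8()   { /*...*/ }
  char[] UTF16LE(){ /*...*/ }
  char[] UTF16BE(){ /*...*/ }
  char[] UTF32LE(){ /*...*/ }
  char[] UTF32BE(){ /*...*/ }

  switch (data.length) {
    case 0:
      // Empty input: slicing below would be out of bounds, so treat it as UTF-8.
      return UTF8();

    default:
    case 4: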
      if (data[0..4] == [cast(ubyte)0x00, 0x00, 0xFE, 0xFF]) return UTF32BE();
      if (data[0..4] == [cast(ubyte)0xFF, 0xFE, 0x00, 0x00]) return UTF32LE();
      goto case 3;

    case 3:
      if (data[0..3] == [cast(ubyte)0xEF, 0xBB, 0xBF]) return UTF8();
      goto case 2;

    case 2:
      if (data[0..2] == [cast(ubyte)0xFE, 0xFF]) return UTF16BE();
      if (data[0..2] == [cast(ubyte)0xFF, 0xFE]) return UTF16LE();
      goto case 1;

    case 1:
      return UTF8();
  }
}

Adding more obscure BOMs is left as an exercise for the reader.
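For reference, here is one way the elided decoders could be filled in. This is a sketch rather than the original author's code: decodeViaBOM, hasPrefix, and toUnits are names invented for the example, input without a BOM is assumed to be UTF-8, and a trailing partial code unit is silently ignored.

```d
import std.utf : toUTF8, validate;

static immutable ubyte[] bomUtf32be = [0x00, 0x00, 0xFE, 0xFF];
static immutable ubyte[] bomUtf32le = [0xFF, 0xFE, 0x00, 0x00];
static immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
static immutable ubyte[] bomUtf16be = [0xFE, 0xFF];
static immutable ubyte[] bomUtf16le = [0xFF, 0xFE];

/// Hypothetical helper: true if data begins with the given byte sequence.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Hypothetical helper: assemble code units of type T from bytes in the given byte order.
T[] toUnits(T)(const(ubyte)[] bytes, bool bigEndian)
{
    auto units = new T[](bytes.length / T.sizeof);   // a trailing partial unit is ignored
    foreach (i; 0 .. units.length)
    {
        uint value = 0;
        foreach (j; 0 .. T.sizeof)
        {
            // The most significant byte comes first in big-endian order.
            immutable shift = 8 * (bigEndian ? T.sizeof - 1 - j : j);
            value |= cast(uint) bytes[i * T.sizeof + j] << shift;
        }
        units[i] = cast(T) value;
    }
    return units;
}

/// Decode data to UTF-8 according to its BOM, with the BOM itself removed.
string decodeViaBOM(const(ubyte)[] data)
{
    string text;
    // The UTF-32 checks come first because the UTF-32LE BOM starts with the UTF-16LE BOM.
    if (data.hasPrefix(bomUtf32be))
        text = toUTF8(toUnits!dchar(data[4 .. $], true));
    else if (data.hasPrefix(bomUtf32le))
        text = toUTF8(toUnits!dchar(data[4 .. $], false));
    else if (data.hasPrefix(bomUtf8))
        text = cast(string) data[3 .. $].idup;
    else if (data.hasPrefix(bomUtf16be))
        text = toUTF8(toUnits!wchar(data[2 .. $], true));
    else if (data.hasPrefix(bomUtf16le))
        text = toUTF8(toUnits!wchar(data[2 .. $], false));
    else
        text = cast(string) data.idup;   // assumption: no BOM means UTF-8
    validate(text);                      // throws UTFException on malformed UTF-8
    return text;
}

void main()
{
    immutable ubyte[] utf16be = [0xFE, 0xFF, 0x00, 0x68, 0x00, 0x69];
    immutable ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00, 0x68, 0x00, 0x00, 0x00];
    assert(decodeViaBOM(utf16be) == "hi");
    assert(decodeViaBOM(utf32le) == "h");
    assert(decodeViaBOM(cast(const(ubyte)[]) "plain") == "plain");
    assert(decodeViaBOM(null) == "");
}
```

Feeding it the result of std.file.read (cast to ubyte[]) gives a single UTF-8 string regardless of which of the five BOMs the file started with.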

answered Sep 17 '22 13:09

BCS
