
Read Text File in D

Tags: d, phobos

Is there any one-size-fits-all (more or less) way to read a text file in D?

The requirement is that the function would auto-detect the encoding and give me the entire data of the file in a consistent format, like a string or a dstring. It should auto-detect BOMs and interpret them as appropriate.

I tried std.file.readText() but it doesn't handle different encodings well.

(Of course, this will have a non-zero failure rate, and that's acceptable for my application.)

user541686 asked Jan 17 '11 21:01



2 Answers

I believe that the only real options for file I/O in Phobos at this point (aside from calling C functions) are std.file.readText and std.stdio.File. readText reads a file in as an array of chars, wchars, or dchars (defaulting to immutable(char)[], i.e. string). I believe that the encoding must be UTF-8, UTF-16, or UTF-32 for chars, wchars, and dchars respectively, though I'd have to go digging in the source code to be sure. Any encoding which is compatible with one of those (e.g. ASCII is compatible with UTF-8) should work just fine.

If you use File, then you have several options for functions to read the file with - including readln and rawRead. However, you essentially read the file in using a UTF-8, UTF-16, or UTF-32 compatible encoding just like with readText, or you read it in as binary data and manipulate it yourself.
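As a rough sketch of those two File-based routes (the helper names readAllBytes and countLines are made up for illustration and are not part of Phobos), reading raw bytes with rawRead and iterating lines with byLine might look like this:

```d
import std.stdio : File;

/// Hypothetical helper: read a whole file as raw bytes via std.stdio.File.
ubyte[] readAllBytes(string path)
{
    auto f = File(path, "rb");
    immutable size = cast(size_t) f.size;
    if (size == 0)
        return null;            // avoid handing rawRead an empty buffer
    auto buf = new ubyte[size];
    return f.rawRead(buf);      // the slice of buf that was actually filled
}

/// Hypothetical helper: count lines using File's line-based reading.
/// The content is treated as chars, so a UTF-8 compatible file is assumed.
size_t countLines(string path)
{
    size_t n;
    foreach (line; File(path).byLine)
        ++n;
    return n;
}
```

The byte-based version makes no assumption about the encoding, which is what you would want if you plan to inspect the data yourself before decoding it.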

Since the character types in D are char, wchar, and dchar, which are UTF-8, UTF-16, and UTF-32 code units respectively, the file has to be in an encoding compatible with one of those three Unicode encodings unless you want to read the data in as binary. Given a string in a particular encoding, you can convert it to another encoding using the functions in std.utf. However, I'm not aware of any way to query a file for its encoding type other than using readText to try to read the file in a given encoding and see if it succeeds.
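To illustrate the std.utf conversions mentioned above, here is a small sketch using toUTF8, toUTF16, and toUTF32. The helper name codeUnitCounts is hypothetical and exists only to show how the number of code units changes with the encoding:

```d
import std.utf : toUTF8, toUTF16, toUTF32;

/// Hypothetical helper: re-encode UTF-8 text and report how many code
/// units it occupies as UTF-8, UTF-16, and UTF-32 respectively.
size_t[3] codeUnitCounts(string s)
{
    wstring w = s.toUTF16;   // UTF-16 code units
    dstring d = s.toUTF32;   // UTF-32 code units (one per code point)
    assert(d.toUTF8 == s);   // converting back gives the original text
    return [s.length, w.length, d.length];
}
```

For example, a five-character string containing one accented Latin letter takes six chars but only five wchars or dchars, since the accented letter needs two UTF-8 code units.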

So, unless you want to process a file yourself and determine on the fly what encoding it's in, your best bet is probably to just use readText with each consecutive string type, using the first one which succeeds. However, since text files are normally in UTF-8 or a UTF-8 compatible encoding, I would expect that readText used with a normal string would almost always work just fine.
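A minimal sketch of that "try each string type in turn" approach follows. The function name readTextAnyWidth is made up, and the sketch assumes a Phobos version in which readText reports a failed read by throwing std.utf.UTFException:

```d
import std.conv : to;
import std.file : readText;
import std.utf : UTFException;

/// Hypothetical helper (not part of Phobos): try string, wstring, and
/// dstring in that order and return the first successful read as UTF-8.
string readTextAnyWidth(string path)
{
    try
    {
        return readText!string(path);             // UTF-8 or ASCII
    }
    catch (UTFException) {}

    try
    {
        return readText!wstring(path).to!string;  // UTF-16, native byte order
    }
    catch (UTFException) {}

    // UTF-32, native byte order; a failure here propagates to the caller.
    return readText!dstring(path).to!string;
}
```

Keep in mind that this is only a heuristic: readText performs no byte-order conversion, and a UTF-16 file without a BOM that contains only ASCII characters is also valid UTF-8 (with embedded NULs), so it would be accepted by the first attempt.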

Jonathan M Davis answered Sep 19 '22 13:09


As for dealing with checking the BOM:

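// Picks a decoder based on the byte order mark (BOM) at the start of data.
// The decoder bodies are elided; each one would convert data to UTF-8.
char[] ConvertViaBOM(ubyte[] data) {
  char[] UTF8()   { /*...*/ }
  char[] UTF16LE(){ /*...*/ }
  char[] UTF16BE(){ /*...*/ }
  char[] UTF32LE(){ /*...*/ }
  char[] UTF32BE(){ /*...*/ }

  switch (data.length) {
    case 0:
      // Nothing to sniff. Without this case, empty input would reach
      // default and the data[0..4] slice would be out of range.
      return UTF8();

    default:  // four or more bytes
    case 4:
      // Check UTF-32 before UTF-16: the UTF-32LE BOM also starts with FF FE.
      if (data[0..4] == [cast(ubyte)0x00, 0x00, 0xFE, 0xFF]) return UTF32BE();
      if (data[0..4] == [cast(ubyte)0xFF, 0xFE, 0x00, 0x00]) return UTF32LE();
      goto case 3;

    case 3:
      if (data[0..3] == [cast(ubyte)0xEF, 0xBB, 0xBF]) return UTF8();
      goto case 2;

    case 2:
      if (data[0..2] == [cast(ubyte)0xFE, 0xFF]) return UTF16BE();
      if (data[0..2] == [cast(ubyte)0xFF, 0xFE]) return UTF16LE();
      goto case 1;

    case 1:
      // No BOM recognized: assume UTF-8.
      return UTF8();
  }
}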

Adding more obscure BOMs is left as an exercise for the reader.
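For readers who want something self-contained to start from, here is one possible sketch of BOM-based decoding for the five BOMs handled above. The names hasPrefix, units, and decodeByBOM are hypothetical, and falling back to UTF-8 when no BOM is found is an assumption carried over from the code above, not a guarantee about the input:

```d
import std.conv : to;
import std.utf : validate;

/// Hypothetical helper: does data begin with the given bytes?
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Hypothetical helper: assemble code units of type C from raw bytes in
/// the given byte order. Trailing bytes that do not fill a unit are dropped.
C[] units(C)(const(ubyte)[] bytes, bool bigEndian)
{
    auto result = new C[bytes.length / C.sizeof];
    foreach (i, ref unit; result)
    {
        uint v = 0;
        foreach (k; 0 .. C.sizeof)
        {
            // Accumulate from the most significant byte downwards.
            immutable idx = i * C.sizeof + (bigEndian ? k : C.sizeof - 1 - k);
            v = (v << 8) | bytes[idx];
        }
        unit = cast(C) v;
    }
    return result;
}

/// Hypothetical helper: decode data to UTF-8 according to its BOM,
/// assuming UTF-8 when no BOM is present.
string decodeByBOM(const(ubyte)[] data)
{
    // UTF-32 first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    if (hasPrefix(data, [0x00, 0x00, 0xFE, 0xFF]))
        return units!dchar(data[4 .. $], true).to!string;
    if (hasPrefix(data, [0xFF, 0xFE, 0x00, 0x00]))
        return units!dchar(data[4 .. $], false).to!string;
    if (hasPrefix(data, [0xFE, 0xFF]))
        return units!wchar(data[2 .. $], true).to!string;
    if (hasPrefix(data, [0xFF, 0xFE]))
        return units!wchar(data[2 .. $], false).to!string;
    if (hasPrefix(data, [0xEF, 0xBB, 0xBF]))
        data = data[3 .. $];    // strip the UTF-8 BOM

    auto text = cast(string) data.idup;
    validate(text);             // throws std.utf.UTFException if malformed
    return text;
}
```

The bytes would typically come from something like std.file.read, cast to ubyte[]. Assembling the code units by hand keeps the sketch independent of the machine's native byte order.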

BCS answered Sep 17 '22 13:09