
Length of the first line in a UTF-8 file with BOM

Tags: clojure

Good afternoon. Suppose I have a UTF-8 file (saved with a BOM) that contains a single letter, say "f" (no newline, no spaces), and I try to get a sequence of line lengths.

(require '[clojure.java.io :refer [reader]])

(with-open [rdr (reader "test.txt")]
  (doall (map #(.length %) (line-seq rdr))))

And I get

=> (2)

Why? Is there an elegant way to get the correct length of the first line?

Oleg Leonov asked Dec 09 '12 16:12




1 Answer

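Java's UTF-8 decoder does not strip the byte order mark: the bytes EF BB BF at the start of the file are decoded as the character U+FEFF, so the first line is that character followed by "f", which gives a length of 2. The problem of the BOM in Java is covered in "Reading UTF-8 - BOM marker". It can either be abstracted away using BOMInputStream from Apache Commons IO, or the BOM has to be removed manually, e.g.: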

(defn debomify
  [^String line]
  (let [bom "\uFEFF"]
    (if (.startsWith line bom)
      (.substring line 1)
      line)))

(doall (map #(.length %) (.split (debomify (slurp "test.txt")) "\n")))
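To see where the extra character comes from, the situation can be reproduced with a temporary file. This is only a minimal sketch: the helper `write-with-bom!` and the names `tmp`, `raw` and `line-lengths` are made up for illustration and are not part of the question.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: writes the UTF-8 BOM (EF BB BF) followed by the text.
(defn write-with-bom!
  [file ^String text]
  (with-open [out (io/output-stream file)]
    (.write out (byte-array (map unchecked-byte [0xEF 0xBB 0xBF])))
    (.write out (.getBytes text "UTF-8"))))

(def tmp (java.io.File/createTempFile "bom-demo" ".txt"))
(write-with-bom! tmp "f")

;; The decoder keeps the BOM as an ordinary character, U+FEFF.
(def raw (slurp tmp :encoding "UTF-8"))
(count raw)        ;; => 2
(int (first raw))  ;; => 65279, i.e. 0xFEFF

;; Same result as in the question when going through line-seq.
(def line-lengths
  (with-open [rdr (io/reader tmp)]
    (doall (map count (line-seq rdr)))))
line-lengths       ;; => (2)
```

The first character of the decoded text is the BOM itself, which is why checking for a leading "\uFEFF" and dropping one character is enough.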

If you want to read the file lazily using line-seq, for instance because it is huge, only the first line needs to go through debomify; the remaining lines can be read normally. Hence:

(defn debommed-line-seq
  [^java.io.BufferedReader rdr]
  (when-let [line (.readLine rdr)]
    (cons (debomify line) (lazy-seq (line-seq rdr)))))
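As an alternative to patching the first line, the BOM can be skipped on the reader itself, so that plain line-seq works unchanged. The sketch below is an assumption-laden variant rather than the approach above: `skip-bom!`, `tmp` and `lengths` are hypothetical names, and it relies on the reader being a java.io.BufferedReader (which clojure.java.io/reader returns), since it uses mark and reset.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: consumes a leading U+FEFF if present,
;; otherwise rewinds so that no character is lost.
(defn skip-bom!
  [^java.io.BufferedReader rdr]
  (.mark rdr 1)
  (when (not= 0xFEFF (.read rdr))
    (.reset rdr))
  rdr)

;; Test file: BOM bytes followed by the single letter "f".
(def tmp (java.io.File/createTempFile "bom-demo" ".txt"))
(with-open [out (io/output-stream tmp)]
  (.write out (byte-array (map unchecked-byte [0xEF 0xBB 0xBF 0x66]))))

;; Realize the lazy seq inside with-open, before the reader is closed.
(def lengths
  (with-open [rdr (skip-bom! (io/reader tmp))]
    (doall (map count (line-seq rdr)))))
lengths ;; => (1)
```

Whichever variant is used, the lazy sequence has to be consumed (here with doall) while the reader is still open; returning it unrealized from with-open would fail once the stream is closed.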
Jan answered Oct 17 '22 09:10