Good afternoon. Suppose I have a UTF-8 file containing a single letter, say "f" (no \n and no spaces), and I try to get a sequence of line lengths.
(require '[clojure.java.io :refer [reader]])

(with-open [rdr (reader "test.txt")]
  (doall (map #(.length %) (line-seq rdr))))
And I get
=> (2)
Why? Is there any elegant way to get the right length of the first string?
The UTF-8 file signature (commonly also called a "BOM") identifies the encoding rather than the byte order of the document. UTF-8 is a linear sequence of bytes, not a sequence of 2-byte or 4-byte units, so byte order does not matter.
To check whether a BOM exists, open the file in Notepad++ and look at the bottom right corner. If it says UTF-8-BOM, the file starts with a BOM.
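The same check can be done from Clojure by looking at the first three bytes of the file. This is only a minimal sketch: has-utf8-bom? is a hypothetical helper name, and it assumes that a single read call returns all three bytes, which is the usual case for a local file but is not guaranteed by the InputStream contract.

```clojure
(require '[clojure.java.io :as io])

(defn has-utf8-bom?
  "Hypothetical helper: true if the file at `path` starts with the
  UTF-8 signature bytes EF BB BF."
  [path]
  (with-open [^java.io.InputStream in (io/input-stream path)]
    (let [buf (byte-array 3)
          n   (.read in buf)]                ; number of bytes actually read, or -1
      (and (= n 3)
           ;; Java bytes are signed, so mask them to 0..255 before comparing.
           (= [0xEF 0xBB 0xBF] (map #(bit-and % 0xFF) buf))))))
```

Calling (has-utf8-bom? "test.txt") on the file from the question should return true if the leading BOM is what causes the extra character.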
To add a BOM to a UTF-8 file, we can write the Unicode character \ufeff, or equivalently the three bytes 0xEF, 0xBB, 0xBF, at the beginning of the file. In UTF-8, \ufeff is encoded as exactly those three bytes.
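To make this concrete, here is a small sketch that shows the UTF-8 bytes of \ufeff and reproduces the length of 2 from the question. The names utf8-hex, line-lengths and bom-demo are made up for illustration, and the demo writes to a temporary file rather than test.txt.

```clojure
(require '[clojure.java.io :as io])

(defn utf8-hex
  "UTF-8 bytes of `s` as unsigned two-digit hex strings."
  [^String s]
  (mapv #(format "%02X" (bit-and % 0xFF)) (.getBytes s "UTF-8")))

(utf8-hex "\uFEFF") ; => ["EF" "BB" "BF"]

(defn line-lengths
  "Length of every line in the file, read as UTF-8 (the io/reader default)."
  [path]
  (with-open [rdr (io/reader path)]
    (mapv count (line-seq rdr))))

(defn bom-demo
  "Writes a BOM followed by the single letter f to a temp file
  and returns the line lengths seen when reading it back."
  []
  (let [f (java.io.File/createTempFile "bom-demo" ".txt")]
    (spit f "\uFEFFf")                       ; spit writes UTF-8 by default
    (try
      (line-lengths f)
      (finally (.delete f)))))

(bom-demo) ; => [2]
```

The reader decodes the three signature bytes into the single character \ufeff and leaves it in the text, so the first line has two characters rather than one.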
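The problem of the BOM in Java is covered in "Reading UTF-8 - BOM marker". Java's UTF-8 decoder does not strip the BOM, so it shows up as the character \uFEFF at the start of the first line, which is why you get a length of 2 instead of 1. It seems that it can be abstracted away using BOMInputStream from Apache Commons IO, or it has to be removed manually, i.e.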
(defn debomify
  [^String line]
  (let [bom "\uFEFF"]
    (if (.startsWith line bom)
      (.substring line 1)
      line)))

(doall (map #(.length %) (.split (debomify (slurp "test.txt")) "\n")))
If you want to read a file lazily using line-seq, for instance because it's huge, you have to treat the first line using debomify. The remaining lines can be read normally. Hence:
(defn debommed-line-seq
  [^java.io.BufferedReader rdr]
  (when-let [line (.readLine rdr)]
    (cons (debomify line) (lazy-seq (line-seq rdr)))))
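An alternative that needs no extra library is to skip the BOM at the reader level, so that the ordinary line-seq can be used unchanged. This is only a sketch under a few assumptions: bom-skipping-reader and bom-free-line-lengths are hypothetical names, the file is assumed to be UTF-8, and the approach relies on BufferedReader supporting mark and reset.

```clojure
(require '[clojure.java.io :as io])

(defn bom-skipping-reader
  "Hypothetical helper: opens `path` as UTF-8 and consumes a leading
  \\uFEFF if there is one. The caller is responsible for closing the reader."
  ^java.io.BufferedReader [path]
  (let [^java.io.BufferedReader rdr (io/reader path :encoding "UTF-8")]
    (.mark rdr 1)                            ; remember the start of the stream
    (when-not (= 0xFEFF (.read rdr))         ; first char is not a BOM (or the file is empty)
      (.reset rdr))                          ; so rewind and keep it
    rdr))

(defn bom-free-line-lengths
  "Line lengths of the file, ignoring a leading BOM."
  [path]
  (with-open [rdr (bom-skipping-reader path)]
    (mapv count (line-seq rdr))))
```

With this, (bom-free-line-lengths "test.txt") should give [1] for the one-letter file from the question, whether or not it starts with a BOM.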