Length of the first line in a UTF-8 file with BOM

Tags:

clojure

Good afternoon. Suppose I have a UTF-8 file containing a single letter, say "f" (no newline and no spaces), and I try to get a sequence of line lengths.

(require '[clojure.java.io :refer [reader]]) ; reader lives in clojure.java.io
(with-open [rdr (reader "test.txt")]
  (doall (map #(.length %) (line-seq rdr))))

And I get

=> (2)

Why? Is there any elegant way to get the right length of the first string?

asked Dec 09 '12 16:12

Oleg Leonov

1 Answer

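The BOM problem in Java is covered in "Reading UTF-8 - BOM marker". Java's UTF-8 decoder does not strip the byte order mark, so it arrives as the character U+FEFF at the start of the first line, which is why the reported length is 2 rather than 1. You can either abstract it away with BOMInputStream from Apache Commons IO or remove it manually, for example: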

(defn debomify
  [^String line]
  (let [bom "\uFEFF"]
    (if (.startsWith line bom)
      (.substring line 1)
      line)))

(doall (map #(.length %) (.split (debomify (slurp "test.txt")) "\n")))
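
To check the behaviour without touching a real file, here is a minimal sketch that writes the three UTF-8 BOM bytes followed by "f" to a temporary file and compares the lengths before and after debomify. The helper name bom-temp-file and the vars raw, raw-length and clean-length are made up for this illustration; debomify is repeated so the snippet runs on its own.

```clojure
(require '[clojure.java.io :as io])

;; Same function as above, repeated so this snippet is self-contained.
(defn debomify
  [^String line]
  (let [bom "\uFEFF"]
    (if (.startsWith line bom)
      (.substring line 1)
      line)))

;; Hypothetical helper (not part of the answer): writes the UTF-8 BOM
;; bytes EF BB BF followed by `text` to a temporary file and returns it.
(defn bom-temp-file
  [^String text]
  (let [f (java.io.File/createTempFile "bom-test" ".txt")]
    (.deleteOnExit f)
    (with-open [^java.io.OutputStream out (io/output-stream f)]
      (.write out (byte-array [(unchecked-byte 0xEF)
                               (unchecked-byte 0xBB)
                               (unchecked-byte 0xBF)]))
      (.write out (.getBytes text "UTF-8")))
    f))

;; The decoder keeps the BOM as the single character U+FEFF.
(def raw (slurp (bom-temp-file "f") :encoding "UTF-8"))

(def raw-length (.length ^String raw))              ; BOM + "f"
(def clean-length (.length ^String (debomify raw))) ; "f" only
```

Under these assumptions raw-length is 2 and clean-length is 1, which matches the (2) seen in the question.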

If you want to read the file lazily with line-seq, for instance because it is huge, only the first line needs to go through debomify; the remaining lines can be read normally. Hence:

(defn debommed-line-seq
  [^java.io.BufferedReader rdr]
  (when-let [line (.readLine rdr)]
    (cons (debomify line) (lazy-seq (line-seq rdr)))))
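
A possible way to use it, sketched under the assumption that the reader comes from clojure.java.io/reader (which returns a BufferedReader). The names sample-file and line-lengths and the sample content are invented for the example; both functions are repeated so the snippet stands alone. As with line-seq, the sequence has to be realized (here with doall) before with-open closes the reader.

```clojure
(require '[clojure.java.io :as io])

(defn debomify
  [^String line]
  (let [bom "\uFEFF"]
    (if (.startsWith line bom)
      (.substring line 1)
      line)))

(defn debommed-line-seq
  [^java.io.BufferedReader rdr]
  (when-let [line (.readLine rdr)]
    (cons (debomify line) (lazy-seq (line-seq rdr)))))

;; Illustrative input: a temp file starting with a BOM, then two lines.
;; Encoding U+FEFF as UTF-8 produces the bytes EF BB BF.
(def sample-file
  (doto (java.io.File/createTempFile "bom-lines" ".txt")
    (.deleteOnExit)
    (spit "\uFEFFf\nsecond line\n" :encoding "UTF-8")))

;; Realize the lazy seq inside with-open, before the reader is closed.
(def line-lengths
  (with-open [rdr (io/reader sample-file)]
    (doall (map count (debommed-line-seq rdr)))))
;; line-lengths => (1 11)
```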

answered Oct 17 '22 09:10

Jan
