
Text or Bytestring

Tags:

string

text

haskell

Good day.

The one thing I now hate about Haskell is the number of packages for working with strings.

First I used native Haskell [Char] strings, but when I started using Hackage libraries I got completely lost in endless conversions. Every package seems to use a different string implementation, and some adopt their own handmade type.

Next I rewrote my code with Data.Text strings and the OverloadedStrings extension. I chose Text because it has a wider set of functions, but it seems many projects prefer ByteString.
Could someone give a short explanation of when to use one or the other?

PS: By the way, how do I convert from Text to ByteString?

    Couldn't match expected type Data.ByteString.Lazy.Internal.ByteString
           against inferred type Text
      Expected type: IO Data.ByteString.Lazy.Internal.ByteString
      Inferred type: IO Text

I tried encodeUtf8 from Data.Text.Encoding, but no luck:

    Couldn't match expected type Data.ByteString.Lazy.Internal.ByteString
           against inferred type Data.ByteString.Internal.ByteString

UPD:

Thanks for the responses. That *Chunks goodness looks like the way to go, but I am somewhat shocked by the result. My original function looked like this:

htmlToItems :: Text -> [Item]
htmlToItems =
    getItems . parseTags . convertFuzzy Discard "CP1251" "UTF8"

And now it has become:

htmlToItems :: Text -> [Item]
htmlToItems =
    getItems . parseTags . fromLazyBS . convertFuzzy Discard "CP1251" "UTF8" . toLazyBS
    where
      toLazyBS t = fromChunks [encodeUtf8 t]
      fromLazyBS t = decodeUtf8 $ intercalate "" $ toChunks t

And yes, this function does not work because it is wrong: if we supply Text to it, we are already confident that the text is properly encoded and ready to use, so converting it again is pointless. Still, such a verbose conversion has to take place somewhere outside htmlToItems.

658

asked Sep 09 '11 06:09

Dfr

2 Answers

ByteStrings are mainly useful for binary data, but they are also an efficient way to process text if all you need is the ASCII character set. If you need to handle Unicode strings, you need to use Text. However, I must emphasize that neither is a replacement for the other, and they are generally used for different things: while Text represents pure Unicode, you still need to encode to and from a binary ByteString representation whenever you, for example, transport text via a socket or a file.

Here is a good article about the basics of Unicode, which does a decent job of explaining the relation between Unicode code points (Text) and the encoded binary bytes (ByteString): "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".
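
To make the code point versus byte distinction concrete, here is a minimal sketch. The names sample and sampleBytes are only for illustration, and the text and bytestring packages are assumed to be installed. It compares the length of a short Text value with the length of its UTF-8 encoding:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, encodeUtf8)

-- "café", with the accented letter written as a numeric escape so the
-- example does not depend on the encoding of the source file.
sample :: T.Text
sample = T.pack "caf\233"

-- The same text as UTF-8 encoded bytes.
sampleBytes :: B.ByteString
sampleBytes = encodeUtf8 sample

main :: IO ()
main = do
  print (T.length sample)                   -- number of characters
  print (B.length sampleBytes)              -- number of bytes in UTF-8
  print (decodeUtf8 sampleBytes == sample)  -- decoding round-trips
```

The Text value has 4 characters, while its UTF-8 encoding takes 5 bytes, because the accented letter needs two bytes. For pure ASCII input the two lengths would be equal, which is why ByteString can be enough in that case.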

You can use the Data.Text.Encoding module to convert between the two datatypes, or Data.Text.Lazy.Encoding if you are using the lazy variants (as you seem to be doing based on your error messages).
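
As a rough sketch of the lazy route, assuming the input really is UTF-8: the helper names textToLazyBS and lazyBSToText below are made up for this example, and they go through lazy Text so that Data.Text.Lazy.Encoding produces and consumes lazy bytestrings directly.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Strict Text to a lazy UTF-8 ByteString, going through lazy Text.
textToLazyBS :: T.Text -> BL.ByteString
textToLazyBS = TLE.encodeUtf8 . TL.fromStrict

-- Lazy UTF-8 ByteString back to strict Text.
-- Note: decodeUtf8 raises an exception if the bytes are not valid UTF-8.
lazyBSToText :: BL.ByteString -> T.Text
lazyBSToText = TL.toStrict . TLE.decodeUtf8

main :: IO ()
main = do
  let original = T.pack "hello, text"
      bytes    = textToLazyBS original
  print (BL.length bytes)                 -- byte count of the encoding
  print (lazyBSToText bytes == original)  -- decoding round-trips
```

If the rest of the program already works with lazy Text, the TL.fromStrict and TL.toStrict steps can be dropped and the encoding functions used on their own.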

131

answered Oct 09 '22 03:10

shang

You definitely want to be using Data.Text for textual data.

encodeUtf8 is the way to go. This error:

Couldn't match expected type Data.ByteString.Lazy.Internal.ByteString against inferred type Data.ByteString.Internal.ByteString

means that you're supplying a strict bytestring to code which expects a lazy bytestring. Conversion is easy with the fromChunks function:

Data.ByteString.Lazy.fromChunks :: [Data.ByteString.Internal.ByteString] -> ByteString

so all you need to do is write fromChunks [myStrictByteString] wherever the lazy bytestring is expected.

Conversion the other way can be accomplished with the dual function toChunks, which takes a lazy bytestring and gives a list of strict chunks.
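
Both directions can be sketched as follows. The helper names toLazyBS and fromLazyBS are only illustrative (they echo the ones used in the question), and the bytestring package is assumed:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- Strict to lazy: wrap the strict ByteString as a single chunk.
toLazyBS :: B.ByteString -> BL.ByteString
toLazyBS s = BL.fromChunks [s]

-- Lazy to strict: concatenate all chunks into one strict ByteString.
-- This copies the data and forces the whole lazy string into memory.
fromLazyBS :: BL.ByteString -> B.ByteString
fromLazyBS = B.concat . BL.toChunks

main :: IO ()
main = do
  let strict = BC.pack "some bytes"
  -- Converting to lazy and back gives the original bytes.
  print (fromLazyBS (toLazyBS strict) == strict)
  -- A lazy ByteString built from two non-empty chunks keeps both.
  print (length (BL.toChunks (BL.fromChunks [strict, strict])))
```

B.concat does the same job as intercalating the chunks with an empty separator, just more directly. Newer versions of bytestring also export fromStrict and toStrict from Data.ByteString.Lazy, which cover the same two conversions.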

You may want to ask the maintainers of some packages if they'd be able to provide a text interface instead of, or in addition to, a bytestring interface.

answered Oct 09 '22 02:10

John L
