
How to ensure that Strings are in UTF-8?

How do I convert the string "the surveyÂ’s rules" to UTF-8 in Scala?

I tried these approaches, but neither works:

scala> val text = "the surveyÂ’s rules"
text: String = the surveyÂ’s rules

scala> scala.io.Source.fromBytes(text.getBytes(), "UTF-8").mkString
res17: String = the surveyÂ’s rules

scala> new String(text.getBytes(),"UTF8")
res21: String = the surveyÂ’s rules

OK, I resolved it this way: not by converting the string, but by reading the file with an explicit codec.

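import java.io.File
import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

// Decode as US-ASCII and silently drop every byte that is not valid ASCII.
implicit val codec: Codec = Codec("US-ASCII").onMalformedInput(CodingErrorAction.IGNORE).onUnmappableCharacter(CodingErrorAction.IGNORE)

val src = Source.fromFile(new File(folderDestination + name + ".csv"))
val src2 = Source.fromFile(new File(folderDestination + name + ".csv"))

val reader = CSVReader.open(src.reader())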
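
Note that a US-ASCII codec with IGNORE drops every non-ASCII byte, so accented or typographic characters disappear from the result. If the goal is to check whether some bytes really are valid UTF-8, one option is a strict decoder that reports errors instead of ignoring them. This is only a minimal sketch; the object name StrictUtf8 and the decode helper are made up for illustration and are not part of the original post.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

object StrictUtf8 {
  // Returns Some(text) if the bytes are well-formed UTF-8, None otherwise.
  def decode(bytes: Array[Byte]): Option[String] = {
    // A fresh decoder per call: CharsetDecoder instances are not thread-safe.
    val decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try Some(decoder.decode(ByteBuffer.wrap(bytes)).toString)
    catch { case _: CharacterCodingException => None }
  }

  def main(args: Array[String]): Unit = {
    // Valid UTF-8 input decodes normally.
    println(decode("the survey\u2019s rules".getBytes(StandardCharsets.UTF_8)))
    // A lone 0x92 byte (a typographic apostrophe in Windows-1252) is not valid UTF-8.
    println(decode(Array(0x92.toByte)))
  }
}
```

A caller could use the None case to fall back to another charset or to reject the file, rather than losing characters silently.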
YoBre asked May 29 '14 11:05



2 Answers

Note that when you call text.getBytes() without arguments, you're in fact getting an array of bytes representing the string in your platform's default encoding. On Windows, for example, it could be some single-byte encoding; on Linux it can be UTF-8 already.
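
To make the difference concrete, here is a small sketch that encodes the same string with two explicit charsets and with the platform default. The object name GetBytesDemo and the sample string are assumptions chosen for illustration.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object GetBytesDemo {
  // "café": the last character is outside ASCII, so encodings disagree on it.
  val text = "caf\u00e9"

  // Explicit charsets give the same bytes on every platform.
  val utf8Bytes: Array[Byte] = text.getBytes(StandardCharsets.UTF_8)        // é becomes c3 a9
  val latin1Bytes: Array[Byte] = text.getBytes(StandardCharsets.ISO_8859_1) // é becomes e9

  // No argument: the result depends on the JVM's default charset.
  val defaultBytes: Array[Byte] = text.getBytes()

  def main(args: Array[String]): Unit = {
    println(s"platform default charset: ${Charset.defaultCharset()}")
    println(s"default: ${defaultBytes.length} bytes")
    println(s"UTF-8: ${utf8Bytes.length} bytes, ISO-8859-1: ${latin1Bytes.length} bytes")
  }
}
```

The UTF-8 array has 5 bytes and the ISO-8859-1 array has 4; the length of the default array depends on the machine, which is exactly the problem.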

To be correct, you need to specify the exact encoding in the getBytes() call. For Java 7 and later, do this:

import java.nio.charset.StandardCharsets

val bytes = text.getBytes(StandardCharsets.UTF_8)

For Java 6 do this:

import java.nio.charset.Charset

val bytes = text.getBytes(Charset.forName("UTF-8"))

Then bytes will contain UTF-8-encoded text.
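
As a rough illustration, the sketch below prints the UTF-8 bytes in hex and decodes them again. It also hints at why the attempts in the question return the input unchanged: encoding and then decoding with the same charset just gives back the original String, so it cannot repair text that was already decoded wrongly. The names RoundTripDemo, hex and roundTrip are invented for this example.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object RoundTripDemo {
  // Render bytes as two-digit lowercase hex, separated by spaces.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  // Encode to UTF-8 and decode again with the same charset.
  def roundTrip(s: String): String = new String(s.getBytes(UTF_8), UTF_8)

  def main(args: Array[String]): Unit = {
    val text = "the survey\u2019s rules" // with a proper right single quotation mark
    println(hex(text.getBytes(UTF_8)))
    println(roundTrip(text) == text)
  }
}
```

In UTF-8 the apostrophe U+2019 is the three bytes e2 80 99, and the round trip compares equal to the input.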

Vladimir Matveev answered Sep 16 '22 19:09


Just set the JVM's file.encoding parameter to UTF-8 as follows:

-Dfile.encoding=UTF-8

It makes sure that UTF-8 is the default encoding.

With the scala launcher, that would be scala -Dfile.encoding=UTF-8.
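
To confirm that the flag took effect, you can print what the running JVM reports. This is a small sketch; the object name EncodingCheck is hypothetical. Also note that on JDK 18 and later the default charset is already UTF-8 unless it is overridden.

```scala
import java.nio.charset.Charset

object EncodingCheck {
  // The charset that String.getBytes() and similar no-argument APIs use.
  def defaultCharsetName: String = Charset.defaultCharset().name()

  // The raw system property; wrapped in Option in case it is unset.
  def fileEncodingProperty: Option[String] = Option(System.getProperty("file.encoding"))

  def main(args: Array[String]): Unit = {
    println(s"default charset: $defaultCharsetName")
    println(s"file.encoding:   ${fileEncodingProperty.getOrElse("(not set)")}")
  }
}
```

Even with the flag set, passing the charset explicitly (as in the other answer) keeps the code independent of how the JVM was started.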

Nitul answered Sep 17 '22 19:09