Recently we moved our mysql db from Latin1 to UTF8. After trying a few different approaches to convert it we weren't able to find any that didn't also introduce some pretty nasty dataloss (and many simply did nothing).
This left me wondering if we have lots of different encodings going on since there doesn't seem to be a single approach that covers our test cases (various posts in our database). To test this theory I wrote a small scala app (my first, feel free to make fun of just how cobbled and non-idiomatic it is! :D) that used chardet to look at the posts and tell me the encoding.
Only one problem, everything is always UTF8.
Here's the code:
package main.scala
import org.mozilla.universalchardet.UniversalDetector
import java.sql.DriverManager
object DBConvert {
def main(args: Array[String]) {
val detector = new UniversalDetector(null)
val db_conn_str = "jdbc:mysql://localhost:3306/mt_pre?user=root"
val connection = DriverManager.getConnection(db_conn_str)
try {
val statement = connection.createStatement()
val rs = statement.executeQuery("SELECT * FROM mt_entry where entry_id = 3886")
while (rs.next) {
val buffer = rs.getBytes("entry_text_more")
detector.handleData(buffer, 0, buffer.length)
detector.dataEnd()
val encoding:String = detector.getDetectedCharset;
if (encoding != null) println("Detected encoding = " + encoding) else println("No encoding detected.");
detector.reset();
// Just so we can see the output
println(rs.getString("entry_text_more"))
}
} catch {
case _ => e: Exception => println(e.getMessage)
}
finally {
connection.close()
}
}
}
I tried passing useUnicode the JDBC query string, also characterEncoding. Neither of them budged the UTF-8 always coming out. Also tried using getBinaryStream and others, still UTF-8.
Fully admit that Character encoding makes my head bend a bit and playing with a new language may not be the best way of fixing this problem. :) That said I'm curious - is there a way to grab the data from the db and detect what encoding it was put in there as, or is it one of those things that simply since it's encoded as UTF-8 in the DB, no matter how you retrieve it that's just what it is (funny characters and all)?
Thanks!
Once I had a similar problem. See this answer. Setting the encoding inside the connection string may help.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With