Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using chardet to detect bad encoding in a MySQL db with JDBC

Recently we moved our mysql db from Latin1 to UTF8. After trying a few different approaches to convert it we weren't able to find any that didn't also introduce some pretty nasty dataloss (and many simply did nothing).

This left me wondering if we have lots of different encodings going on since there doesn't seem to be a single approach that covers our test cases (various posts in our database). To test this theory I wrote a small scala app (my first, feel free to make fun of just how cobbled and non-idiomatic it is! :D) that used chardet to look at the posts and tell me the encoding.

Only one problem, everything is always UTF8.

Here's the code:

package main.scala

import org.mozilla.universalchardet.UniversalDetector
import java.sql.DriverManager

object DBConvert {
  def main(args: Array[String]) {
    val detector = new UniversalDetector(null)
    val db_conn_str = "jdbc:mysql://localhost:3306/mt_pre?user=root"
    val connection = DriverManager.getConnection(db_conn_str)

    try {
        val statement = connection.createStatement()
        val rs = statement.executeQuery("SELECT * FROM mt_entry where entry_id = 3886")
        while (rs.next) {
           val buffer = rs.getBytes("entry_text_more")
           detector.handleData(buffer, 0, buffer.length)
           detector.dataEnd()

           val encoding:String = detector.getDetectedCharset;

           if (encoding != null) println("Detected encoding = " + encoding) else println("No encoding detected.");

           detector.reset();

           // Just so we can see the output
           println(rs.getString("entry_text_more"))
        }
    } catch {
      case _ => e: Exception => println(e.getMessage)
    }
    finally {
        connection.close()
    }
  }
}

I tried passing useUnicode the JDBC query string, also characterEncoding. Neither of them budged the UTF-8 always coming out. Also tried using getBinaryStream and others, still UTF-8.

Fully admit that Character encoding makes my head bend a bit and playing with a new language may not be the best way of fixing this problem. :) That said I'm curious - is there a way to grab the data from the db and detect what encoding it was put in there as, or is it one of those things that simply since it's encoded as UTF-8 in the DB, no matter how you retrieve it that's just what it is (funny characters and all)?

Thanks!

like image 360
bnferguson Avatar asked Sep 26 '11 22:09

bnferguson


1 Answers

Once I had a similar problem. See this answer. Setting the encoding inside the connection string may help.

like image 84
Marco Hemminger Avatar answered Oct 07 '22 22:10

Marco Hemminger