
What can cause git to mess with character encoding?


Edit: git does not mess with character encoding. This question is still here to share knowledge and help others avoid the same mistake.


The context: My company uses an SVN repository. I use git-svn as a client to interact with it. All text files in the project are (and must be) encoded with the Windows default encoding (cp-....). I drive git through Git Extensions, and sometimes from the command line.

What I did: Over the last 3 days I worked on a new feature and made a number of local commits. Finally I squashed all these commits into a single one using an interactive rebase, then used git svn dcommit to push everything to the SVN repository as a single commit.

What happened then: A colleague told me that all accents were messed up in the files I had modified and in the new files added by my commit. I had already committed text files with accents to the same repository with my git + svn installation before, and this is the first time I have faced this issue.

My investigation: I opened the files with Notepad++ and tried the most common encodings (including the Windows default and UTF-8) to view them. None of them displayed the accents properly, and different accents are always rendered as the same sequence of strange glyphs.

The temporary workaround: I quickly created a revert commit with Git Extensions and "dcommitted" it.

The question: My company's SVN repository is OK again, but I now have the following two problems to solve:

  1. Understand what happened with the characters with accents
  2. Retrieve my work from the SVN history and commit it properly (if possible without manually reviewing every accented character)

Can anybody provide some clues? (I'm rather new to git.)

Samuel Rossille asked May 16 '12 17:05




1 Answer

And now let's reveal the painful truth (painful for my ego, not for git users): I did mess with the accents, not git.

I could have just deleted the question, which wrongly suggests that git can mess up accents. But considering the number of upvotes, I think a lot of people make the same mistake I did, so I have chosen to answer my own question to set the record straight and maybe help people in the same situation:

  1. Git does not touch any characters other than line breaks.
  2. I broke the accents myself before committing, and I did not notice because I was not paying enough attention. I had edited some of the files with Eclipse, which did not recognize their encoding and replaced every accent with a weird byte sequence on save. That's all.
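To illustrate how an editor can produce exactly this symptom (every accent turning into the same strange glyphs), here is a minimal Python sketch. It assumes, without confirmation from my actual Eclipse setup, that the editor read the Windows-encoded file as UTF-8, substituted the Unicode replacement character U+FFFD for each byte it could not decode, and wrote the result back as UTF-8. The function name `resave_with_wrong_encoding` and the choice of cp1252 as a stand-in for the Windows default encoding are illustrative assumptions.

```python
def resave_with_wrong_encoding(original: bytes) -> bytes:
    """Simulate an editor that misreads a legacy-encoded file as UTF-8.

    Assumption: undecodable bytes become U+FFFD, and the file is then
    written back as UTF-8. This is a guess at the mechanism, not a
    description of what any particular editor is known to do.
    """
    text = original.decode("utf-8", errors="replace")
    return text.encode("utf-8")


# Three different accented letters (e-acute, a-grave, c-cedilla) in cp1252.
original = b"\xe9 \xe0 \xe7"

damaged = resave_with_wrong_encoding(original)

# Each accent is now the same three bytes: EF BF BD (U+FFFD in UTF-8).
print(damaged)

# Viewed again as cp1252, every accent shows up as the same three glyphs.
print(damaged.decode("cp1252"))
```

Under these assumptions the original bytes are gone once the file is saved, which is why no choice of encoding in a viewer can bring the accents back.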

Thanks again to Dmitry Pavlenko for giving me pointers on how to investigate this problem.
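For anyone investigating the same symptom, a rough check that can be run before committing is sketched below. It assumes the damage takes the form of the UTF-8 bytes of the replacement character U+FFFD (EF BF BD), which is only one way an editor can mangle accents. The helper names `looks_damaged` and `find_damaged_files`, and the `*.txt` default pattern, are hypothetical.

```python
from pathlib import Path

# UTF-8 encoding of U+FFFD, the Unicode replacement character.
REPLACEMENT_BYTES = "\ufffd".encode("utf-8")  # b"\xef\xbf\xbd"


def looks_damaged(data: bytes) -> bool:
    """Return True if the raw bytes contain an encoded U+FFFD."""
    return REPLACEMENT_BYTES in data


def find_damaged_files(root: str, pattern: str = "*.txt") -> list[Path]:
    """List files under root matching pattern that contain U+FFFD bytes."""
    return sorted(
        path
        for path in Path(root).rglob(pattern)
        if path.is_file() and looks_damaged(path.read_bytes())
    )


if __name__ == "__main__":
    # Scan the current working tree and report suspicious files.
    for path in find_damaged_files("."):
        print(path)
```

The check works on raw bytes, so it does not depend on guessing the file's encoding. It can give false positives for a legacy-encoded file that legitimately contains that exact three-byte sequence, which should be rare in ordinary text.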

+1 to "git reflog"

Happy accent fixing ;=)

Samuel Rossille answered Oct 31 '22 16:10