
Visual Studio 2008 project file does not load because of an unexpected encoding change

Our team has a database project in Visual Studio 2008 that is under source control in Team Foundation Server. Every two weeks or so, after one co-worker checks in, the project file won't load on the other developers' machines. The error message is:

The project file could not be loaded. Data at the root level is invalid. Line 1, position 1.

When I look at the project file in Notepad++, the file looks like this:

��<NUL?NULxNULmNULlNUL NULvNULeNULrNULsNULiNULoNULnNUL ...

and so on (you can see <?xml version in this), whereas a normal project file looks like:

<?xml version="1.0" encoding="utf-16"?> ...

So something is probably wrong with the encoding of the file. This is a problem for us because it turns out to be impossible to get the file encoding correct again. The 'solution' is to throw away the project file and get the last known working version from source control.

According to the file, the encoding should be UTF-16. According to Notepad++, the corrupted file is actually UTF-8.

My questions are:

  • Why is Visual Studio messing up the encoding of the project file, apparently at random times and on random machines?
  • What should we do to prevent this?
  • When it has happened, is there a way to restore the current file in the correct encoding instead of pulling an older version from source control?

As a last note: the problem occurs with one single project file; none of the other project files exhibit it.

UPDATE: Thanks to Jon Skeet's suggestion I have the answer to question number three. When I replace the first nine bytes EF BB BF EF BF BD EF BF BD with the two bytes FF FE, the project file will load again.

This still leaves the question of why Visual Studio corrupts the file.

Xenan asked Mar 23 '10 10:03



1 Answer

I think I can provide some insight into what's happening, if not why.

FF FE is a BOM; its presence at the beginning of the file indicates that the file's encoding is UTF-16, little-endian. And it sounds like the original file really is UTF-16, but something is ignoring the BOM and reading it as if it were UTF-8.

When that happens, each of the bytes FF and FE is treated as invalid and converted to U+FFFD, the Unicode replacement character. Then, when the text is written to a file again, each replacement character gets converted to its UTF-8 encoding (EF BF BD) and the UTF-8 BOM (EF BB BF) is added in front of them, resulting in the nine-byte sequence you reported:

EF BB BF  # UTF-8 BOM
EF BF BD  # U+FFFD in UTF-8
EF BF BD  # ditto
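
The round trip described above can be reproduced in a few lines of Python. This is only a sketch of the suspected mechanism, not a claim about what Visual Studio or TFS actually does internally; the XML declaration is taken from the question, and the variable names are made up for illustration.

```python
# Sketch: a UTF-16 LE file is misread as UTF-8, then saved as UTF-8 with a BOM.
# Assumption: the misreading tool substitutes U+FFFD for invalid bytes.
xml_decl = '<?xml version="1.0" encoding="utf-16"?>'

# What the healthy file starts with: the FF FE BOM followed by UTF-16 LE text.
original = b"\xff\xfe" + xml_decl.encode("utf-16-le")

# Misread as UTF-8: FF and FE are invalid, so each becomes U+FFFD.
# The ASCII letters and the interleaved NUL bytes are valid UTF-8 and pass through.
misread = original.decode("utf-8", errors="replace")

# Saved again as UTF-8 with a BOM ("utf-8-sig" prepends EF BB BF).
resaved = misread.encode("utf-8-sig")

print(" ".join(f"{b:02X}" for b in resaved[:9]))
# EF BB BF EF BF BD EF BF BD
```

The bytes after that prefix are `<`, NUL, `?`, NUL, `x`, NUL and so on, which matches what Notepad++ displays for the corrupted file.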

If this is the case, simply replacing those nine bytes with FF FE is not safe. There's no guarantee those are the only bytes in the file that would be invalid when interpreted as UTF-8. As long as the file contains only ASCII characters you're okay, but most other characters, such as accented letters (é), will be irretrievably mangled, because their UTF-16 bytes are not valid UTF-8.
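
To see why the nine-byte fix is only safe for ASCII content, here is a small Python sketch that applies the same assumed corruption to two strings and then performs the prefix replacement from the question's update. The helper names `corrupt` and `naive_repair` are hypothetical, and the sample strings are invented.

```python
BOM_UTF16_LE = b"\xff\xfe"
CORRUPT_PREFIX = bytes.fromhex("EFBBBFEFBFBDEFBFBD")  # UTF-8 BOM + two U+FFFD


def corrupt(text):
    """Return (original, damaged) bytes for text stored as UTF-16 LE with a BOM.

    Assumes the damage is: read as UTF-8 with replacement, save as UTF-8 with BOM.
    """
    original = BOM_UTF16_LE + text.encode("utf-16-le")
    damaged = original.decode("utf-8", errors="replace").encode("utf-8-sig")
    return original, damaged


def naive_repair(data):
    """Swap the nine-byte prefix back to FF FE and leave the rest untouched."""
    if data.startswith(CORRUPT_PREFIX):
        return BOM_UTF16_LE + data[len(CORRUPT_PREFIX):]
    return data


ascii_original, ascii_damaged = corrupt("<Project>")
accent_original, accent_damaged = corrupt("café")

print(naive_repair(ascii_damaged) == ascii_original)    # True
print(naive_repair(accent_damaged) == accent_original)  # False
```

In the second case the byte E9 (the low byte of é in UTF-16 LE) has already been replaced by EF BF BD, so the original character cannot be recovered from the damaged file alone.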

Are the project files really supposed to be UTF-16? If not, maybe that one developer's system is generating UTF-16 when the version-control system is expecting UTF-8. I notice in my Visual C# Express install there's an option under Environment->Documents called "Save documents as Unicode when data cannot be saved in codepage". That sounds like something that could cause the encoding to change at apparently random times.

Alan Moore answered Nov 11 '22 22:11
