Our team has a database project in Visual Studio 2008 that is under source control in Team Foundation Server. Every two weeks or so, after one co-worker checks in, the project file won't load on the other developers' machines. The error message is:
The project file could not be loaded. Data at the root level is invalid. Line 1, position 1.
When I look at the project file in Notepad++, the file looks like this:
��<NUL?NULxNULmNULlNUL NULvNULeNULrNULsNULiNULoNULnNUL
...
and so on (you can see <?xml version in this), whereas a normal project file looks like:
<?xml version="1.0" encoding="utf-16"?>
...
So something is probably wrong with the encoding of the file. This is a problem for us because it turns out to be impossible to get the file encoding correct again. The 'solution' is to throw away the project file and get the last known working version from source control.
According to the file, the encoding should be UTF-16. According to Notepad++, the corrupted file is actually UTF-8.
My questions are:
As a last note: the problem is with one single project file; none of the other project files exhibit this problem.
UPDATE: Thanks to Jon Skeet's suggestion I have the answer to question number three. When I replace the first nine bytes EF BB BF EF BF BD EF BF BD with the two bytes FF FE, the project file loads again.
This still leaves the question of why Visual Studio corrupts the file.
I think I can provide some insight into what's happening, if not why.
FF FE is a BOM; its presence at the beginning of the file indicates that the file's encoding is UTF-16, little-endian. And it sounds like the original file really is UTF-16, but something is ignoring the BOM and reading it as if it were UTF-8.
When that happens, each of the bytes FF and FE is treated as invalid and converted to U+FFFD (REPLACEMENT CHARACTER), the official Unicode garbage character. Then, when the text is written to a file again, each of the garbage characters gets converted to its UTF-8 encoding (EF BF BD) and the UTF-8 BOM (EF BB BF) is added in front of them, resulting in the nine-byte sequence you reported:
EF BB BF # UTF-8 BOM
EF BF BD # U+FFFD in UTF-8
EF BF BD # ditto
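To make that sequence of events concrete, here is a minimal Python sketch of the suspected round trip. It is only an illustration of the mechanism described above, not a claim about what Visual Studio or TFS actually runs; the function name simulate_corruption and the sample XML declaration are my own choices.

```python
# Sketch of the suspected corruption path: UTF-16 bytes are misread as UTF-8,
# then written back out as UTF-8 with a BOM.
UTF16_LE_BOM = b"\xff\xfe"


def simulate_corruption(text: str) -> bytes:
    """Return the bytes produced when a UTF-16 LE file is misread as UTF-8."""
    original = UTF16_LE_BOM + text.encode("utf-16-le")
    # FF and FE are never valid in UTF-8, so each one becomes U+FFFD.
    misread = original.decode("utf-8", errors="replace")
    # "utf-8-sig" prepends the UTF-8 BOM (EF BB BF) when encoding.
    return misread.encode("utf-8-sig")


corrupted = simulate_corruption('<?xml version="1.0" encoding="utf-16"?>')
print(corrupted[:9].hex(" ").upper())  # EF BB BF EF BF BD EF BF BD
```

The bytes after the first nine are the original UTF-16 payload (each ASCII character followed by a zero byte), which matches the NUL-separated text Notepad++ shows.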
If this is the case, simply replacing those nine bytes with FF FE is not safe. There's no guarantee those are the only bytes in the file that would be invalid when interpreted as UTF-8. As long as the file contains only ASCII characters you're okay, but any character whose UTF-16 encoding contains a byte of 0x80 or above, like an accented character (é, bytes E9 00), is likely to be irretrievably mangled.
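A small Python sketch of why the nine-byte swap only works for ASCII content, under the same assumption about how the file got corrupted. The helper names (corrupt, naive_repair, survives) and the sample strings are hypothetical, chosen just for this illustration.

```python
# Sketch: does swapping the nine-byte prefix for FF FE restore the original file?
UTF16_LE_BOM = b"\xff\xfe"
BAD_PREFIX = b"\xef\xbb\xbf\xef\xbf\xbd\xef\xbf\xbd"


def corrupt(text: str) -> bytes:
    """Misread UTF-16 LE bytes as UTF-8, then save as UTF-8 with a BOM."""
    original = UTF16_LE_BOM + text.encode("utf-16-le")
    return original.decode("utf-8", errors="replace").encode("utf-8-sig")


def naive_repair(data: bytes) -> bytes:
    """Replace the nine-byte prefix with the UTF-16 LE BOM, nothing else."""
    if not data.startswith(BAD_PREFIX):
        raise ValueError("file does not start with the expected nine bytes")
    return UTF16_LE_BOM + data[len(BAD_PREFIX):]


def survives(text: str) -> bool:
    """True if the naive repair reproduces the original bytes exactly."""
    original = UTF16_LE_BOM + text.encode("utf-16-le")
    return naive_repair(corrupt(text)) == original


print(survives("<Name>cafe</Name>"))  # True: ASCII only
print(survives("<Name>café</Name>"))  # False: the E9 byte was replaced by U+FFFD
```

In the second case the byte E9 has already been turned into EF BF BD inside the file, so nothing in the corrupted data says which character used to be there.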
Are the project files really supposed to be UTF-16? If not, maybe that one developer's system is generating UTF-16 when the version-control system is expecting UTF-8. I notice in my Visual C# Express install there's an option under Environment->Documents called "Save documents as Unicode when data cannot be saved in codepage". That sounds like something that could cause the encoding to change at apparently random times.