I'm trying to store some Windows PowerShell scripts in a Mercurial repository. It seems the PowerShell editor likes to save files as UTF-16 Unicode. This means that there are lots of \0 bytes, which is what Mercurial uses to distinguish between "text" and "binary" files. I understand that this makes no difference to how Mercurial stores the data, but it does mean that it displays binary diffs, which are kind of hard to read. Is there a way to tell Mercurial that these really are text files? Presumably I would need to convince Mercurial to use an external Unicode-aware diff program for particular file types.
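To make the \0 point concrete, here is a rough sketch of why a UTF-16 script trips a NUL-byte check while an ANSI copy of the same text does not. The looks_binary helper is hypothetical and only mirrors the heuristic described above; it is not Mercurial's actual code.

```python
import codecs

# Sample one-line script (illustrative only).
script = u'Write-Host "hello"\r\n'

# The same text as UTF-16LE with a BOM, and as an ANSI (cp1252) file.
as_utf16 = codecs.BOM_UTF16_LE + script.encode('utf-16le')
as_ansi = script.encode('cp1252')


def looks_binary(data):
    # Hypothetical stand-in for the "contains a NUL byte" heuristic
    # described in the question; not taken from Mercurial's source.
    return b'\0' in data


nul_count = as_utf16.count(b'\0')
print(looks_binary(as_utf16), looks_binary(as_ansi), nul_count)
```

For ASCII-only text, every character contributes one \0 byte in UTF-16LE, so even a tiny script is full of them.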
If my other answer does not do what you want, I think this one may; although I haven't tested it on Windows at all yet, it's working well in Linux. It does what is potentially a nasty thing, in wrapping mercurial.mdiff.unidiff with a new function which converts utf-16le to utf-8. This will not affect hg st, but will affect hg diff. One potential pitfall is that the BOM will also be changed from UTF-16LE BOM to the UTF-8 BOM.
Anyway, I think it may be useful to you, so here it is.
Extension file utf16decodediff.py:
import codecs

from mercurial import mdiff

unidiff = mdiff.unidiff


def new_unidiff(a, ad, b, bd, fn1, fn2, r=None, opts=mdiff.defaultopts):
    """
    A simple wrapper around mercurial.mdiff.unidiff which first decodes
    UTF-16LE text.
    """
    if a.startswith(codecs.BOM_UTF16_LE):
        try:
            # Gets reencoded as utf-8 to be a str rather than a unicode; some
            # extensions may expect a str and may break if it's wrong.
            a = a.decode('utf-16le').encode('utf-8')
        except UnicodeDecodeError:
            pass
    if b.startswith(codecs.BOM_UTF16_LE):
        try:
            b = b.decode('utf-16le').encode('utf-8')
        except UnicodeDecodeError:
            pass
    return unidiff(a, ad, b, bd, fn1, fn2, r, opts)


mdiff.unidiff = new_unidiff
In .hgrc:
[extensions]
utf16decodediff = ~/.hgexts/utf16decodediff.py
(Or equivalent paths.)
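If you want to check the conversion step and the BOM pitfall without loading the extension, the following is a minimal sketch of the same decode/re-encode logic on its own. The helper name utf16le_to_utf8 and the sample data are made up for illustration; only the codecs calls come from the extension above.

```python
import codecs


def utf16le_to_utf8(data):
    """Re-encode BOM-prefixed UTF-16LE bytes as UTF-8; pass anything else through."""
    if not data.startswith(codecs.BOM_UTF16_LE):
        return data
    try:
        return data.decode('utf-16le').encode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-16LE after all (e.g. truncated); leave it untouched.
        return data


# Hypothetical sample: a one-line script saved as UTF-16LE with a BOM.
sample = codecs.BOM_UTF16_LE + u'Write-Host "hi"\r\n'.encode('utf-16le')
converted = utf16le_to_utf8(sample)

# The BOM survives the round trip, but as the UTF-8 BOM.
print(converted.startswith(codecs.BOM_UTF8))
# The NUL bytes that made the file look binary are gone.
print(b'\0' in converted)
```

Because the BOM is decoded as an ordinary character and then re-encoded, the diff output starts with the UTF-8 BOM rather than the UTF-16LE one, which is the pitfall mentioned above.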
I have worked around this by creating a new file with Notepad++ and saving it as a PowerShell file (.ps1 extension). Notepad++ will create the file as a plain text ANSI file. Once it is created, I can open the file in the PowerShell editor and make any changes as necessary without the editor modifying the file encoding.
Disclaimer: I encountered this just moments ago, so I am not sure if there are any repercussions, but so far my scripts appear to work as normal and my diffs are showing up nicely.
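For scripts that already exist as UTF-16, the same idea can be applied by re-encoding the file once instead of recreating it by hand. This is only a sketch under assumptions: reencode_utf16_file is a hypothetical helper, it writes BOM-less UTF-8 (byte-for-byte the same as ANSI only when the script is pure ASCII), and I have not checked how the PowerShell editor treats non-ASCII characters afterwards.

```python
import codecs
import os
import tempfile


def reencode_utf16_file(path):
    """Rewrite a BOM-prefixed UTF-16LE file as BOM-less UTF-8.

    Returns True if the file was rewritten, False if it was left alone.
    """
    with open(path, 'rb') as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF16_LE):
        return False
    # Drop the BOM before decoding so it does not end up in the output.
    text = data[len(codecs.BOM_UTF16_LE):].decode('utf-16le')
    with open(path, 'wb') as f:
        f.write(text.encode('utf-8'))
    return True


# Demo on a throwaway file (hypothetical content).
fd, demo_path = tempfile.mkstemp(suffix='.ps1')
os.close(fd)
with open(demo_path, 'wb') as f:
    f.write(codecs.BOM_UTF16_LE + u'Get-Date\r\n'.encode('utf-16le'))

changed = reencode_utf16_file(demo_path)
with open(demo_path, 'rb') as f:
    result = f.read()
os.remove(demo_path)
print(changed, result)
```

Try it on a copy first, since the file is overwritten in place.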