I have a huge folder filled with XML documents, some of which may break because they contain curly quotes (a.k.a. Microsoft Word quotes or smart quotes). I just want to run a quick check to see what I'm up against. Anybody know how to grep for them so I can easily find the offenders?
Edit
Here's a simplified example.
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>Pretend this is a curly quote: '</item>
</items>
Curly quotes have the following Unicode code points and UTF-8 sequences:
Name                                   CodePoint  UTF-8 sequence
----                                   ---------  --------------
LEFT SINGLE QUOTATION MARK             U+2018     0xE2 0x80 0x98
RIGHT SINGLE QUOTATION MARK            U+2019     0xE2 0x80 0x99
SINGLE LOW-9 QUOTATION MARK            U+201A     0xE2 0x80 0x9A
SINGLE HIGH-REVERSED-9 QUOTATION MARK  U+201B     0xE2 0x80 0x9B
LEFT DOUBLE QUOTATION MARK             U+201C     0xE2 0x80 0x9C
RIGHT DOUBLE QUOTATION MARK            U+201D     0xE2 0x80 0x9D
DOUBLE LOW-9 QUOTATION MARK            U+201E     0xE2 0x80 0x9E
DOUBLE HIGH-REVERSED-9 QUOTATION MARK  U+201F     0xE2 0x80 0x9F
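XML is usually stored in UTF-8, so you can search the files directly for these byte sequences. All eight share the prefix 0xE2 0x80 and differ only in the last byte, which runs from 0x98 to 0x9F.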
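If grep's handling of raw bytes gets awkward in your locale, here is a minimal Python sketch of the same byte-level search. It assumes the files really are UTF-8 and end in `.xml`; the names `find_curly_quotes` and `scan_folder` are made up for this example, not part of any library.

```python
from __future__ import annotations

import re
from pathlib import Path

# U+2018..U+201F all encode in UTF-8 as 0xE2 0x80 followed by a byte
# in the range 0x98..0x9F, so one byte-level pattern covers all eight.
CURLY_QUOTE_BYTES = re.compile(rb"\xe2\x80[\x98-\x9f]")


def find_curly_quotes(data: bytes) -> list[tuple[int, str]]:
    """Return (line_number, quote_character) for every curly quote in data."""
    hits = []
    for line_number, line in enumerate(data.split(b"\n"), start=1):
        for match in CURLY_QUOTE_BYTES.finditer(line):
            hits.append((line_number, match.group().decode("utf-8")))
    return hits


def scan_folder(root: str) -> dict[str, list[tuple[int, str]]]:
    """Map each *.xml file under root that contains curly quotes to its hits.

    Assumption: files are UTF-8 and use the .xml extension.
    """
    report = {}
    for path in sorted(Path(root).rglob("*.xml")):
        hits = find_curly_quotes(path.read_bytes())
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_folder(".")` returns only the offending files, each with the line numbers and the exact quote characters found, which should be enough to see what you're up against. Files read as raw bytes are never decoded as a whole, so a file with other encoding problems won't make the scan fail.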