
grep for (curly|microsoft|smart) quotes

Tags:

regex

grep

I have a huge folder filled with xml documents, some of which may break because they contain those curly quotes, i.e. Microsoft Word quotes, i.e. smart quotes. I just want to run a quick check to see what I'm up against. Anybody know how to grep for them so I can easily find the offenders?

Edit

Here's a simplified example.

<?xml version="1.0" encoding="UTF-8"?>
<items>
  <item>Pretend this is a curly quote: '</item>
</items>
Dave Aaron Smith asked Dec 29 '22 00:12



1 Answer

Curly quotes have the following Unicode code points and UTF-8 byte sequences:

Name                                     CodePoint     UTF-8 sequence
----                                     ---------     --------------
LEFT SINGLE QUOTATION MARK               U+2018        0xE2 0x80 0x98
RIGHT SINGLE QUOTATION MARK              U+2019        0xE2 0x80 0x99
SINGLE LOW-9 QUOTATION MARK              U+201A        0xE2 0x80 0x9A
SINGLE HIGH-REVERSED-9 QUOTATION MARK    U+201B        0xE2 0x80 0x9B 
LEFT DOUBLE QUOTATION MARK               U+201C        0xE2 0x80 0x9C
RIGHT DOUBLE QUOTATION MARK              U+201D        0xE2 0x80 0x9D
DOUBLE LOW-9 QUOTATION MARK              U+201E        0xE2 0x80 0x9E
DOUBLE HIGH-REVERSED-9 QUOTATION MARK    U+201F        0xE2 0x80 0x9F

XML is usually stored in UTF-8, so you can search for these byte sequences directly. All eight share the prefix 0xE2 0x80 and differ only in the last byte, which runs from 0x98 to 0x9F.
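As a rough sketch of that byte-level search, the function below lists the files that contain any of the eight sequences. It assumes GNU grep (for -r and --include) and files encoded in UTF-8; the function name find_curly_quotes is made up for illustration and is not part of any tool.

```shell
# List XML files under a directory that contain any of the eight
# curly-quote byte sequences (0xE2 0x80 followed by 0x98 through 0x9F).
# find_curly_quotes is a hypothetical helper name.
find_curly_quotes() {
  dir=${1:-.}
  # Build the pattern from octal escapes: \342\200 is 0xE2 0x80,
  # and [\230-\237] is the byte range 0x98-0x9F.
  pattern=$(printf '\342\200[\230-\237]')
  # LC_ALL=C makes grep compare raw bytes instead of decoded characters.
  # -r recurses, -l prints only the names of matching files.
  LC_ALL=C grep -rl --include='*.xml' -- "$pattern" "$dir"
}
```

Calling find_curly_quotes with the folder as its argument prints one offending file per line. Swapping -l for -c should give a per-file count of matching lines instead, which may help gauge how widespread the problem is.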

dalle answered Jan 13 '23 14:01
