
Issue with find and replace apostrophe( ' ) in a Word Docx using OpenXML and Regex

Word seems to use a different apostrophe character than Visual Studio and it is causing problems with using Regex.

I am trying to edit some Word documents in C# using OpenXML. I am basically replacing [[COMPANY]] with a company name. This worked smoothly until I reached a corner case: companies whose names end in s. In those cases I sometimes end up with s's.

Example:
Company name: Simmons
Text in doc: The [[COMPANY]]'s business is cars.
Result: The Simmons's business is cars.

This is improper English.

I should be able to just use a basic find and replace like I did for [[COMPANY]], but it is not working.

            Regex apostropheReplace = new Regex("s\\'s");
            docText = apostropheReplace.Replace(docText, "s\'"); 

This does not work. It seems that Word uses a different character for an apostrophe (') than the standard one I get when I press the key on my keyboard in Visual Studio. If I type the apostrophe in my find and replace, it does not work, but if I copy and paste the apostrophe from Word, it does.

            Regex apostrophyReplace = new Regex("s\\’s");
            docText = apostrophyReplace.Replace(docText, "s\'"); 

Notice the different character in the Regex for the second one. I'm confused as to why this is, and I also want to know if there is a proper way of doing this. I tried "'" but that does not work. I just want to know whether using the character copied from Word is the proper way of doing this, and whether there is a way to make both characters work so I don't have an issue with docs that may be created with a different program.

mfontaine asked Oct 29 '19 20:10


2 Answers

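This happens because they are different characters.

Word replaces some punctuation characters as you type them (its "smart quotes" autoformatting) to give them the right slant or to improve presentation. The straight apostrophe from your keyboard is U+0027, while the one Word typically substitutes is the right single quotation mark, U+2019.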
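If you want to see the difference for yourself, here is a minimal sketch that prints the code point of a character. The class and method names (ApostropheInspector, CodePoint) are hypothetical, just for illustration.

```csharp
using System;

public static class ApostropheInspector
{
    // Formats a character's UTF-16 code unit as "U+XXXX".
    // For the quote characters discussed here, that is also the code point.
    public static string CodePoint(char c)
    {
        return "U+" + ((int)c).ToString("X4");
    }
}
```

Calling ApostropheInspector.CodePoint('\'') gives "U+0027" for the keyboard apostrophe, and ApostropheInspector.CodePoint('\u2019') gives "U+2019" for the curly one, which is why a pattern written with one does not match the other.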

I ran into the very same issue before, and I used this character class as the regular expression: [\u2018\u2019\u201A\u201b\u2032']

So essentially modify your code to:

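// No backslash before '[': "\\[" would match a literal bracket
// instead of opening the character class.
Regex apostropheReplace = new Regex("s[\u2018\u2019\u201A\u201B\u2032']s");
docText = apostropheReplace.Replace(docText, "s'");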

In my experience, these five characters, together with the plain ASCII apostrophe, are the most common single quotes and apostrophes.

And in case you come across the same issue with double quotes, here is what you can use: [\u201C\u201D\u201E\u201F\u2033\u2036\"]
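A different way to use these two character classes is to normalize the typographic quotes to their ASCII equivalents first, so any later find and replace only has to deal with ' and ". This is a sketch of that alternative, not what the code above does. QuoteNormalizer and ToAscii are hypothetical names, and I am assuming it runs on plain text strings rather than raw XML markup, where inserting an ASCII " could break an attribute value.

```csharp
using System.Text.RegularExpressions;

public static class QuoteNormalizer
{
    // Typographic single quotes, apostrophes and the prime symbol.
    private static readonly Regex SingleQuotes =
        new Regex("[\u2018\u2019\u201A\u201B\u2032]");

    // Typographic double quotes and the double-prime variants.
    private static readonly Regex DoubleQuotes =
        new Regex("[\u201C\u201D\u201E\u201F\u2033\u2036]");

    // Replaces every typographic quote with its plain ASCII counterpart.
    public static string ToAscii(string text)
    {
        string result = SingleQuotes.Replace(text, "'");
        return DoubleQuotes.Replace(result, "\"");
    }
}
```

Keep in mind this changes the typography of the text, so it only makes sense if you do not need to preserve the curly quotes.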

Leo answered Nov 04 '22 13:11


Answering the question:

Is there a way to do it so that both characters work?

If you want one Regex to be able to handle both scenarios, this is perhaps a simple and readable solution:

 // The character class must not be escaped: "\\[" would match a literal '['.
 Regex apostropheReplace = new Regex("s['’]s");
 docText = apostropheReplace.Replace(docText, "s'");

This has the added benefit of making it clear to other developers that you are covering both apostrophe cases. That benefit bears on the other part of your question:

Is using the copied character from Word the proper way of doing this?

That depends on what you mean by "proper". If you mean "most understandable to other developers," I'd say yes, because there would be the least amount of look-up needed to know exactly what your Regex is looking for. If you mean "most performant", that should not be an issue with this straightforward Regex search (some nice Regex performance tips can be found here).

If you mean "most versatile/robust single quote Regex", then as @Leonardo-Seccia points out, there are other quote-like characters that might cause trouble. (Some of the common Microsoft Word ones are listed here.) Such a solution might look like this:

// Unescaped '[' so the brackets form a character class.
Regex apostropheReplace =
    new Regex("s['\u2018\u2019\u201A\u201B]s");
docText = apostropheReplace.Replace(docText, "s'");

But you can certainly add other characters as needed. A more complete list can be found here. To add one to the above Regex, replace the "U+" prefix of its code point with "\u" and put it inside the square brackets. For example, to add the "prime" symbol (′ or U+2032) to the list above, change the Regex string from

Regex("s['\u2018\u2019\u201A\u201B]s")

to

Regex("s['\u2018\u2019\u201A\u201B\u2032]s")

Ultimately, you are the judge of which characters are the most "proper" to include in your Regex, based on your use cases.
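One variation you might consider: the replacements above always write a plain ASCII apostrophe, even when the document used a curly one. Here is a minimal sketch, assuming you would rather keep whichever apostrophe was matched, that captures it in a group and reuses it in the replacement. PossessiveFixer and Fix are hypothetical names, and the trailing \b is my own addition so the pattern only fires at the end of a word.

```csharp
using System.Text.RegularExpressions;

public static class PossessiveFixer
{
    // Group 1 captures the apostrophe (ASCII or typographic) found
    // between the two s characters; \b requires the word to end there.
    private static readonly Regex DoubledPossessive =
        new Regex("s(['\u2018\u2019\u201A\u201B\u2032])s\\b");

    // Turns "Simmons's" into "Simmons'" while keeping the apostrophe style.
    public static string Fix(string text)
    {
        return DoubledPossessive.Replace(text, "s$1");
    }
}
```

For example, PossessiveFixer.Fix("The Simmons’s business is cars.") returns "The Simmons’ business is cars." with the curly apostrophe intact. Note that, like the original pattern, this also rewrites ordinary words such as "boss's", so it may be safer to apply it only where a company name was substituted.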

LHM answered Nov 04 '22 15:11