
Replace single quotes with double with exclusion of some elements

I want to replace all single quotes in a string with double quotes, except for apostrophes in contractions such as "n't", "'ll", "'m", etc.

input="the stackoverflow don\'t said, \'hey what\'"
output="the stackoverflow don\'t said, \"hey what\""

Code 1:(@https://stackoverflow.com/users/918959/antti-haapala)

import re

def convert_regex(text):
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

There are 3 cases: the ' is neither preceded nor followed by an alphanumeric character; or it is not preceded but is followed by one; or it is preceded but not followed by one.

Issue: That doesn't work on words that end in an apostrophe, i.e. most possessive plurals, and it also doesn't work on informal abbreviations that start with an apostrophe.
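
A quick check of Code 1 against the sample input and the two failure cases named above. This is only a sketch; the possessive and abbreviation strings are made-up examples, not taken from the original post.

```python
import re

def convert_regex(text):
    # Code 1: replace every ' that is not between two word characters.
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

# Sample input from the question: the contraction survives.
sample = convert_regex("the stackoverflow don't said, 'hey what'")
print(sample)

# Made-up examples: a possessive plural and a leading-apostrophe abbreviation.
possessive = convert_regex("the students' books")
abbreviation = convert_regex("'tis the season")
print(possessive)    # the trailing apostrophe is turned into a double quote
print(abbreviation)  # the leading apostrophe is turned into a double quote
```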

Code 2:(@https://stackoverflow.com/users/953482/kevin)

def convert_text_func(s):
    c = "_"  # placeholder character; must NOT appear in the string
    assert c not in s
    protected = {word: word.replace("'", c) for word in ["don't", "it'll", "I'm"]}
    # Hide the apostrophes of protected words behind the placeholder.
    for k, v in protected.items():  # .items() replaces Python 2's .iteritems()
        s = s.replace(k, v)
    s = s.replace("'", '"')
    # Restore the protected words.
    for k, v in protected.items():
        s = s.replace(v, k)
    return s

The set of words to protect is too large to specify; for example, how would one list every plural possessive such as persons'? Please help.

Edit 1: I am using @anubhava's brilliant answer, but I am facing an issue: the approach sometimes fails on words translated from other languages. Code:

text=re.sub(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)", '"', text)

Problem:

In the text 'Kumbh melas', melas is a word translated from Hindi into English, not a plural possessive noun.

Input="Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
Output=Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,
Expected Output=Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,
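
The behaviour above can be reproduced with a short script. A sketch, using the pattern and input exactly as quoted in this edit; the helper name convert_quotes is made up for illustration.

```python
import re

# Pattern from Edit 1: skip a ' that follows "s" and a ' that starts a
# contraction suffix such as 't, 'll, 'm, 's, 'd, 've, 're or 'clock.
PATTERN = r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)"

def convert_quotes(text):  # hypothetical helper name
    return re.sub(PATTERN, '"', text)

text = "Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
result = convert_quotes(text)
print(result)
# The closing quote after "melas" is preceded by "s", so (?<!s) blocks the
# replacement and the single quote is left in place.
```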

I am looking for a condition I could add that fixes this. Human-level intervention is the last option.

Edit 2: Naive and long approach to fix:

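import string
import enchant  # PyEnchant

def replace_translations(text):
    d = enchant.Dict("en_US")
    words = tokenize_words(text)  # tokenizer helper, not shown in the question
    punctuations = set(string.punctuation)
    for i, word in enumerate(words):
        print(i, word)  # debug output
        # Guard i + 1 so the last word does not raise IndexError.
        if i + 1 < len(words) and word not in punctuations and not d.check(word) and words[i + 1] == "'":
            # A non-dictionary word followed by ' is assumed to end a quotation.
            text = text.replace(word + "'", word + '"')
    return text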

Are there any corner cases I am missing or are there any better approaches?

Abhishek Bhatia asked Aug 16 '15 02:08




1 Answer

First attempt

You can also use this regex:

(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))

DEMO IN REGEX101

This regex matches the whole quoted sentence/word, including both quote marks at the beginning and the end, and also captures the content of the quotation in group 1, so you can replace the matched part with "\1".

  • (?<!\w) - negative lookbehind for a word character, to exclude words like "you'll", etc., while still allowing the regex to match quotations that follow characters like \n, :, ;, . or -. Assuming that a quotation is always preceded by whitespace would be risky.
  • ' - single quote mark,
  • ((?:.|\n)+?'?) - capturing group 1: one or more of any character or newline (to match multiline sentences) with a lazy quantifier (to avoid matching from the first to the last single quote mark), followed by an optional single quote, in case there are two in a row,
  • '(?!\w) - single quote not followed by a word character, to exclude text like "i'm", "you're", etc., where the quote mark is between letters.
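
In Python, the first-attempt pattern could be applied with re.sub, using \1 to put the captured content back between double quotes. A minimal sketch; the function name is made up, and the test strings are the ones from the question.

```python
import re

# First-attempt pattern: opening ', lazily captured content (group 1), closing '.
QUOTED = re.compile(r"(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))")

def requote(text):  # hypothetical helper name
    # Wrap the captured content in double quotes.
    return QUOTED.sub(r'"\1"', text)

first = requote("the stackoverflow don't said, 'hey what'")
second = requote("Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,")
print(first)
print(second)
```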

The s' case

However, it still has a problem with quotations that contain an apostrophe after a word ending in s, like: 'the classes' hours'. I think it is impossible to distinguish with a regex whether an s followed by ' should be treated as the end of a quotation or as a possessive s with an apostrophe. But I figured out a limited workaround for this problem, with this regex:

(?:(?<!\w)'((?:.|\n)+?'?)(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))

DEMO IN REGEX101

PYTHON IMPLEMENTATION

The regex adds an alternative for the s' cases: (?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w)), where:

  • (?<!s)'(?!\w) - if there is no s before the ', match as in the regex above (first attempt),
  • (?<=s)'(?!([^']|\w'\w)+'(?!\w)) - if there is an s before the ', end the match on this ' only if no closing candidate follows it, i.e. no other ' followed by a non-word character that can be reached through text whose only apostrophes sit between letters. The \w'\w alternative lets the lookahead skip over a ' which is between letters, as in i'm, etc.

This regex should mismatch only if there are a couple of s' cases in a row. Still, it is far from a perfect solution.
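
A possible Python version of the s' workaround, as a sketch only. The function name is made up; the first test string is the 'the classes' hours' example from this answer and the second is the input from the question's Edit 1.

```python
import re

# Second pattern: the closing ' is either not preceded by "s", or it is
# preceded by "s" and no further closing-quote candidate follows it.
QUOTED_S = re.compile(
    r"(?:(?<!\w)'((?:.|\n)+?'?)"
    r"(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))"
)

def requote_s(text):  # hypothetical helper name
    # Only group 1 (the quoted content) is reused in the replacement.
    return QUOTED_S.sub(r'"\1"', text)

possessive = requote_s("'the classes' hours'")
translated = requote_s("Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,")
print(possessive)
print(translated)
```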

Flaws of \w

Also, with \w there is always a chance that a ' occurs after a symbol or after a character that is a letter but not in [a-zA-Z_0-9], like some local-language character, and it will then be treated as the beginning of a quotation. This could be avoided by replacing (?<!\w) and (?!\w) with (?<!\p{L}) and (?!\p{L}), or with something like (?<=^|[,.?!)\s]), i.e. a positive lookbehind for the characters which can occur in a sentence before a quotation. However, such a list could be quite long.
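
For Python specifically: the built-in re module does not support \p{L} (the third-party regex module does), but in Python 3 \w is Unicode-aware for str patterns by default, so accented letters already count as word characters. The sketch below uses the first-attempt pattern together with the re.ASCII flag to reproduce the flaw described above; the sample sentence is made up.

```python
import re

PATTERN = r"(?<!\w)'((?:.|\n)+?'?)'(?!\w)"
text = "café's menu says 'hello'"  # made-up sample with a non-ASCII letter

# re.ASCII limits \w to [a-zA-Z0-9_], so the ' after "é" looks like an
# opening quote and the match runs to the last ' in the string.
ascii_result = re.sub(PATTERN, r'"\1"', text, flags=re.ASCII)

# Default str behaviour in Python 3: "é" is a word character, so the
# apostrophe in "café's" is left alone and only 'hello' is requoted.
unicode_result = re.sub(PATTERN, r'"\1"', text)

print(ascii_result)
print(unicode_result)
```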

m.cekiera answered Nov 15 '22 15:11
