Replace single quotes with double with exclusion of some elements

I want to replace all single quotes in a string with double quotes, except for apostrophes in contractions such as "n't", "'ll", "'m", etc.

input="the stackoverflow don\'t said, \'hey what\'"
output="the stackoverflow don\'t said, \"hey what\""

Code 1:(@https://stackoverflow.com/users/918959/antti-haapala)

import re

def convert_regex(text):
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

There are 3 cases: the ' is neither preceded nor followed by an alphanumeric character; or it is not preceded but is followed by one; or it is preceded but not followed by one.

Issue: That doesn't work on words that end in an apostrophe, i.e. most possessive plurals, and it also doesn't work on informal abbreviations that start with an apostrophe.
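
To make the issue concrete, here is a small sketch that runs Code 1 on the sample input and on two of the problem cases. The extra sample strings are my own illustrations, not from the original data.

```python
import re

def convert_regex(text):
    # Code 1 from above: replace every ' that is not between two word characters.
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

samples = [
    "the stackoverflow don't said, 'hey what'",  # works as intended
    "the classes' hours",                         # possessive plural (illustrative)
    "'tis the season",                            # leading-apostrophe abbreviation (illustrative)
]
for sample in samples:
    print(convert_regex(sample))
# the stackoverflow don't said, "hey what"
# the classes" hours
# "tis the season
```

The last two lines show the apostrophe being turned into a double quote even though it is not a quotation mark.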

Code 2:(@https://stackoverflow.com/users/953482/kevin)

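def convert_text_func(s):
    c = "_"  # placeholder character; must NOT appear in the string
    assert c not in s
    # Map each protected word to a copy with its apostrophe masked.
    protected = {word: word.replace("'", c) for word in ["don't", "it'll", "I'm"]}
    for k, v in protected.items():  # .items() on Python 3 (was .iteritems() on Python 2)
        s = s.replace(k, v)
    s = s.replace("'", '"')  # every remaining ' becomes "
    for k, v in protected.items():
        s = s.replace(v, k)  # restore the protected words
    return s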

The set of words to protect is too large to list by hand; for example, how would one cover possessives such as persons'? Please help.

Edit 1: I am using @anubhava's brilliant answer, but I am facing an issue: the approach sometimes fails on words translated from other languages. Code:

text=re.sub(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)", '"', text)

Problem:

In the text below, 'Kumbh melas' is a quoted phrase; melas is a Hindi word rendered in English, not a plural possessive noun.

Input="Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
Output=Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,
Expected Output=Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,

I am hoping to add a condition that somehow fixes this. Manual intervention is the last resort.
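
For reference, a minimal sketch that reproduces the problem with the regex from Edit 1. The function name convert_edit1 is only a label I chose for this example.

```python
import re

# Regex from Edit 1: skip a ' that follows an s, or that starts a known contraction suffix.
PATTERN = re.compile(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)")

def convert_edit1(text):
    return PATTERN.sub('"', text)

text = "Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
print(convert_edit1(text))
# Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,
```

The closing quote is left alone because (?<!s) rejects any ' that directly follows an s, which is exactly what protects possessive plurals such as classes'.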

Edit 2: Naive and long approach to fix:

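import string
import enchant  # PyEnchant

def replace_translations(text):
    d = enchant.Dict("en_US")
    words = tokenize_words(text)  # tokenizer helper, not shown in this question
    punctuation = set(string.punctuation)
    # Stop one token early so that words[i + 1] always exists.
    for i, word in enumerate(words[:-1]):
        # A non-dictionary word followed by ' is treated as the end of a quotation.
        if word not in punctuation and not d.check(word) and words[i + 1] == "'":
            text = text.replace(word + "'", word + '"')
    return text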

Are there any corner cases I am missing or are there any better approaches?

asked Aug 16 '15 02:08

Abhishek Bhatia

1 Answer

First attempt

You can also use this regex:

(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))

This regex matches the whole sentence/word including both quotation marks, at the beginning and the end, and also captures the content of the quotation in group 1, so you can replace the matched part with "\1".

(?<!\w) - negative lookbehind: the quote must not be preceded by a word character. This excludes words like "you'll", etc., but still allows the regex to match quotations after characters like \n, :, ;, . or -, etc. Assuming that there will always be whitespace before a quotation is risky.
' - single quotation mark,
((?:.|\n)+?'?) - capturing group 1: one or more of any character or newline (to match multiline sentences) with a lazy quantifier (to avoid matching from the first to the last single quotation mark), followed by an optional single quotation mark, in case there are two in a row,
'(?!\w) - single quotation mark not followed by a word character, to exclude text like "i'm", "you're", etc., where the quotation mark is between letters.
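
A minimal Python sketch of this first attempt, using re.sub with "\1" as the replacement. The function name and the test strings other than those from the question are my own.

```python
import re

# First-attempt regex: an opening ' not preceded by a word character,
# lazily captured content, and a closing ' not followed by a word character.
FIRST_ATTEMPT = re.compile(r"(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))")

def convert_first_attempt(text):
    # Re-wrap the captured content (group 1) in double quotes.
    return FIRST_ATTEMPT.sub(r'"\1"', text)

print(convert_first_attempt("the stackoverflow don't said, 'hey what'"))
# the stackoverflow don't said, "hey what"
print(convert_first_attempt("Similar to the 'Kumbh melas', celebrated"))
# Similar to the "Kumbh melas", celebrated
print(convert_first_attempt("'the classes' hours'"))
# "the classes" hours'
```

The last call shows the limitation: the apostrophe after classes is taken as the closing quote.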

The s' case

However, it still has a problem with sentences in which an apostrophe occurs after a word ending in s, like: 'the classes' hours'. I think it is impossible to distinguish with a regex whether an s followed by ' should be treated as the end of a quotation or as a possessive s with an apostrophe. But I figured out a limited workaround for this problem, with this regex:

(?:(?<!\w)'((?:.|\n)+?'?)(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))

It adds an alternative for the s' cases: (?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w)) where:

(?<!s)'(?!\w) - if there is no s before the ', match as in the regex above (first attempt),
(?<=s)'(?!([^']|\w'\w)+'(?!\w)) - if there is an s before the ', end the match on this ' only if no other ' followed by a non-word character occurs in the following text, before the end of the text or before another ' (but only a ' preceded by a letter other than s, or the opening of the next quotation). The \w'\w is there so that a ' which is between letters, as in i'm, is included in such a match.

This regex should only mismatch when there are several s' cases in a row. Still, it is far from a perfect solution.
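
A minimal Python sketch of the workaround regex, again with "\1" as the replacement. The function name is my own, and the inputs are the examples already discussed here.

```python
import re

# Workaround regex: after an s, a ' only closes the quotation if no later
# candidate closing ' follows (see the explanation above).
S_CASE = re.compile(
    r"(?:(?<!\w)'((?:.|\n)+?'?)(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))"
)

def convert_s_case(text):
    # Group 1 holds the quoted content; group 2 lives inside a lookahead and is unused.
    return S_CASE.sub(r'"\1"', text)

print(convert_s_case("'the classes' hours'"))
# "the classes' hours"
print(convert_s_case("Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"))
# Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,
print(convert_s_case("the stackoverflow don't said, 'hey what'"))
# the stackoverflow don't said, "hey what"
```

Because of the nested quantifiers in the lookahead, it is probably worth trying this on long inputs before relying on it.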

Flaws of \w

Also, with \w there is always a chance that a ' occurs after a symbol, or after a character that is a letter but is outside [a-zA-Z_0-9] (such as a local-language character), and it will then be treated as the beginning of a quotation. This could be avoided by replacing (?<!\w) and (?!\w) with (?<!\p{L}) and (?!\p{L}), or with something like (?<=^|[,.?!)\s]), i.e. a positive lookbehind for the characters that can occur in a sentence before a quotation. However, such a list could be quite long. Note that Python's built-in re module does not support \p{L}; the third-party regex module does.
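
A small sketch of the \w issue using only Python's standard re module. Two assumptions of mine: re.ASCII is used to reproduce the [a-zA-Z_0-9] behaviour described above (Python 3 str patterns treat \w as Unicode-aware by default), and the class [^\W\d_] is used as an approximate stand-in for \p{L}, which re does not support.

```python
import re

# Candidate "opening quote" detectors: a ' not preceded by ...
OPENING_ASCII = re.compile(r"(?<!\w)'", re.ASCII)   # ... an ASCII word character
OPENING_UNICODE = re.compile(r"(?<!\w)'")           # ... a Unicode word character
OPENING_LETTER = re.compile(r"(?<![^\W\d_])'")      # ... a letter (approximation of \p{L})

word = "caf\u00e9's"  # café's, with a non-ASCII letter before the apostrophe

print(OPENING_ASCII.search(word) is not None)    # True: wrongly seen as an opening quote
print(OPENING_UNICODE.search(word) is not None)  # False
print(OPENING_LETTER.search(word) is not None)   # False
```

One trade-off of the letter-only class: a ' after a digit, as in 5'10, is no longer excluded, so it would be treated as a candidate opening quote.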

answered Nov 15 '22 15:11

m.cekiera

Tags: python, regex, replace, nlp