Problem:
I have thousands of documents that contain a specific character I don't want, e.g. the character a. These documents contain a variety of characters, but the a's I want to replace are inside double quotes or single quotes.
I would like to find and replace them, and I thought using Regex would be needed. I am using VSCode, but I'm open to any suggestions.
My attempt:
I was able to find the following regex, which matches a quoted string containing the character inside the ().
".*?(r).*?"
However, this only highlights the entire quote. I want to highlight the character only.
Any solution, perhaps outside of regex, is welcome.
Example outcomes:
Given the character is a, find and replace it with b:
Somebody once told me "apples" are good for you
=> Somebody once told me "bpples" are good for you
"Aardvarks" make good kebabs
=> "Abrdvbrks" make good kebabs
The boy said "aaah!" when his mom told him he was eating aardvark
=> The boy said "bbbh!" when his mom told him he was eating aardvark
The easiest way to do this is to highlight one of the quotes, then select Search, then Replace.
To remove double quotes from a string: call the replace() method on the string. The replace method will replace each occurrence of a double quote with an empty string and return a new string with all double quotes removed.
Firstly, double quote character is nothing special in regex - it's just another character, so it doesn't need escaping from the perspective of regex. However, because Java uses double quotes to delimit String constants, if you want to create a string in Java with a double quote in it, you must escape them.
VS Code uses the JavaScript regex engine for its find / replace functionality. This means you are quite limited in what you can do with regex compared to other flavors like .NET or PCRE.
Luckily, this flavor supports lookaheads, and a lookahead lets you look for characters without consuming them. So one way to ensure that we are inside a quoted string is to require that the number of quotes between a matched a and the end of the file / subject string is odd:
a(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)
This looks for a's in a double-quoted string. To use it for single-quoted strings, substitute every " with '. You can't have both at the same time.
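As a quick check, the pattern above can be run against the three example lines from the question. This is only a minimal JavaScript sketch; the helper name replaceInsideDoubleQuotes and the one-line-at-a-time usage are my own assumptions for illustration.

```javascript
// The lookahead pattern from above: match an `a` only when an odd number
// of double quotes remains between it and the end of the input.
const aInsideDoubleQuotes = /a(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)/g;

// Hypothetical helper: replace every `a` that sits inside double quotes.
function replaceInsideDoubleQuotes(text, replacement) {
  return text.replace(aInsideDoubleQuotes, replacement);
}

const examples = [
  'Somebody once told me "apples" are good for you',
  '"Aardvarks" make good kebabs',
  'The boy said "aaah!" when his mom told him he was eating aardvark',
];

for (const line of examples) {
  console.log(replaceInsideDoubleQuotes(line, "b"));
}
// Somebody once told me "bpples" are good for you
// "Abrdvbrks" make good kebabs
// The boy said "bbbh!" when his mom told him he was eating aardvark
```

Only the lowercase a's inside the quotes change, which matches the expected outcomes listed in the question.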
There is a problem with the regex above, however: it is thrown off by escaped double quotes inside double-quoted strings. Handling those too, if it matters, takes a much longer pattern:
a(?=[^"\\]*(?:\\.[^"\\]*)*"[^"\\]*(?:\\.[^"\\]*)*(?:"[^"\\]*(?:\\.[^"\\]*)*"[^"\\]*(?:\\.[^"\\]*)*)*$)
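To see what the longer pattern buys you, here is a small JavaScript comparison of the two lookaheads on a string that contains escaped quotes. It is a sketch only; the sample sentence and the variable names are made up for illustration.

```javascript
// Simple version: treats every double quote as a delimiter.
const simple = /a(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)/g;

// Escape-aware version from above: a backslash plus the following
// character is skipped as a unit, so \" does not count as a delimiter.
const escapeAware =
  /a(?=[^"\\]*(?:\\.[^"\\]*)*"[^"\\]*(?:\\.[^"\\]*)*(?:"[^"\\]*(?:\\.[^"\\]*)*"[^"\\]*(?:\\.[^"\\]*)*)*$)/g;

// String.raw keeps the backslashes in the sample text literal.
const sample = String.raw`He wrote "a \"fair\" game" and left`;

const simpleResult = sample.replace(simple, "b");
const escapeAwareResult = sample.replace(escapeAware, "b");

console.log(simpleResult);      // He wrote "b \"fair\" gbme" and left
console.log(escapeAwareResult); // He wrote "b \"fbir\" gbme" and left
```

The simple pattern misses the a in fair because it counts the escaped quotes as delimiters; the escape-aware one replaces all three quoted a's and still leaves the a in "and" alone.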
Applying these approaches to large files will probably result in a stack overflow, so let's look at a better approach.
I am using VSCode, but I'm open to any suggestions.
That's great. Then I'd suggest using awk or sed or something more programmatic to achieve what you are after. Or, if you are able to use Sublime Text, there is a chance to work around this problem in a more elegant way.
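For the "something more programmatic" route, one option is a small script that finds each quoted string and rewrites only its contents, which avoids the long lookahead entirely. Below is a minimal JavaScript sketch (runnable under Node.js). The helper name replaceCharInQuotes is hypothetical, and the sketch assumes the text contains no stray apostrophes (such as in don't), since those would be taken for an opening single quote.

```javascript
// One double-quoted or single-quoted string, allowing backslash escapes.
const quotedString = /"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'/g;

// Hypothetical helper: replace `from` with `to`, but only inside quotes.
function replaceCharInQuotes(text, from, to) {
  return text.replace(quotedString, (quoted) =>
    // split/join replaces every literal occurrence within this one string.
    quoted.split(from).join(to)
  );
}

console.log(replaceCharInQuotes('"Aardvarks" make good kebabs', "a", "b"));
// "Abrdvbrks" make good kebabs
console.log(replaceCharInQuotes("'I can haz cheezburger'", "a", "b"));
// 'I cbn hbz cheezburger'
```

Because each quoted string is matched once and handled in a callback, this covers both quote styles in one pass and also works when `from` is a longer substring. For thousands of documents you would wrap it in a loop that reads and writes each file.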
This is supposed to work on large files with hundreds of thousands of lines, but note that it works for a single character (here a); with some modifications it may work for a word or substring too:
Search for:
(?:"|\G(?<!")(?!\A))(?<r>[^a"\\]*+(?>\\.[^a"\\]*)*+)\K(a|"(*SKIP)(*F))(?(?=((?&r)"))\3)
(The character a appears three times in this pattern; change all three to target a different character.)
Replace it with: WHATEVER\3
RegEx Breakdown:
(?:                      # Beginning of non-capturing group #1
    "                    # Match a `"`
    |                    # Or
    \G(?<!")(?!\A)       # Continue matching from last successful match
                         # It shouldn't start right after a `"`
)                        # End of NCG #1
(?<r>                    # Start of capturing group `r`
    [^a"\\]*+            # Match anything except `a`, `"` or a backslash (possessively)
    (?>\\.[^a"\\]*)*+    # Match an escaped character or
                         # repeat last pattern as much as possible
)\K                      # End of CG `r`, reset all consumed characters
(                        # Start of CG #2
    a                    # Match literal `a`
    |                    # Or
    "(*SKIP)(*F)         # Match a `"` and skip over current match
)
(?(?=                    # Start a conditional cluster, assuming a positive lookahead
    ((?&r)")             # Start of CG #3, recurse into CG `r` and match `"`
  )                      # End of condition
  \3                     # If conditional passed match CG #3
)                        # End of conditional
Last but not least...
Matching a character inside quotation marks is tricky because the opening and closing delimiters are identical, so they cannot be told apart without looking at the adjacent text. What you can do is mark the closing delimiter with something else so that you can look for it later.
Search for: "[^"\\]*(?:\\.[^"\\]*)*"
Replace with: $0Я
Search for: a(?=[^"\\]*(?:\\.[^"\\]*)*"Я)
Replace with whatever you expect.
Search for: "Я
Replace with: " (a plain double quote, so only the marker is removed) to revert everything.
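The three search-and-replace steps above can also be chained in code. This is a rough JavaScript sketch, assuming Я does not already occur anywhere in the input; the function name replaceViaMarker is made up for illustration. Note that a JavaScript replacement string spells "the whole match" as $& rather than the $0 used in the editor.

```javascript
const MARKER = "Я"; // assumption: this character is absent from the input

// Hypothetical helper chaining the three steps described above.
function replaceViaMarker(text, to) {
  // Step 1: tag the closing quote of every quoted string with the marker.
  const tagged = text.replace(/"[^"\\]*(?:\\.[^"\\]*)*"/g, "$&" + MARKER);
  // Step 2: an `a` is inside a quoted string when the next unescaped
  // double quote after it is a tagged (closing) one.
  const replaced = tagged.replace(/a(?=[^"\\]*(?:\\.[^"\\]*)*"Я)/g, to);
  // Step 3: drop the marker again, keeping the quote itself.
  return replaced.replace(/"Я/g, '"');
}

console.log(replaceViaMarker('x "foo" a "bar"', "b"));
// x "foo" a "bbr"
```

Unlike the odd-number-of-quotes lookahead, step 2 only ever scans to the next quote, so it does not have to look ahead to the end of the file for every candidate a.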
/(["'])(.*?)(a)(.*?\1)/g
With the replace pattern:
$1$2$4
As far as I'm aware, VS Code uses the same regex engine as JavaScript, which is why I've written my example in JS.
The problem with this is that if you have multiple a's in one set of quotes, a single pass removes only one of them per quoted string. So you need either some code behind it, or to hammer the replace button until no more matches are found, re-running the pattern until all the a's between quotes are gone.
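// Match a quoted string containing an "a"; \1 requires the same closing quote.
const regex = /(["'])(.*?)(a)(.*?\1)/g;
// Re-emit everything except the captured "a" (group 3).
const subst = `$1$2$4`;
let str = `"a"
"helapke"
Not matched - aaaaaaa
"This is the way the world ends"
"Not with fire"
"ABBA"
"abba",
'I can haz cheezburger'
"This is not a match'
`;

// Each pass removes one "a" per quoted string, so loop until none remain.
while (str.match(regex)) {
  str = str.replace(regex, subst);
}

const result = str;
console.log(result);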