
What is the difference between `(\S.*\S)` and `^\s*(.*)\s*$` in regex?



I'm doing the RegexOne regex tutorial and it has a question about writing a regular expression to remove unnecessary whitespace.

The solution provided in the tutorial is

We can just skip all the starting and ending whitespace by not capturing it in a line. For example, the expression ^\s*(.*)\s*$ will catch only the content.

The setup for the question does indicate the use of the hat at the beginning and the dollar sign at the end, so it makes sense that this is the expression that they want:

We have previously seen how to match a full line of text using the hat ^ and the dollar sign $ respectively. When used in conjunction with the whitespace \s, you can easily skip all preceding and trailing spaces.

That said, using \S instead, I was able to come up with what seems like a simpler solution - (\S.*\S).

I've found a Stack Overflow solution that matches the one in the tutorial ("Regex Email - Ignore leading and trailing spaces?"), and I've seen other guides that recommend the same format, but I'm struggling to find an explanation for why the \S is bad.

Additionally, this validates as correct in their tool... so, are there cases where this would not work as well as the provided solution? Or is the recommended version just a standard format?

Catija asked Jul 25 '20 03:07


1 Answer

The tutorial's solution of ^\s*(.*)\s*$ is wrong. The capture group .* is greedy, so it expands as far as it can, all the way to the end of the line, and captures the trailing spaces too. Because the \s* that follows is allowed to match zero characters, the .* never has to backtrack, and that \s* never consumes anything.
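A minimal sketch of that behaviour, assuming JavaScript's regex engine and a made-up sample line (the variable names are only for illustration). It compares the tutorial's pattern with a variant whose group is forced to end on a non-space character:

```javascript
// Hypothetical sample line with leading and trailing spaces.
const line = '   foo bar   ';

// Tutorial pattern: the greedy (.*) runs to the end of the line,
// so the trailing \s* is left with nothing to match.
const tutorial = line.match(/^\s*(.*)\s*$/);

// Variant: ending the group with \S forces .* to give back the trailing spaces.
const fixed = line.match(/^\s*(.*\S)\s*$/);

console.log(JSON.stringify(tutorial[1])); // "foo bar   "
console.log(JSON.stringify(fixed[1]));    // "foo bar"
```

The leading spaces are skipped in both cases because the first \s* is greedy and runs before the group gets a chance to match.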


Your solution is much better at actually matching only the non-whitespace content in the line, but there are a couple of odd cases in which it won't match at all. (\S.*\S) needs at least two non-whitespace characters, so it fails on a line whose content is a single character, and it fails on a line made up entirely of whitespace. The tutorial's (.*) always matches: it may capture only a single character, and it captures nothing at all if the input is all whitespace.
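Those edge cases can be checked directly. This is a small sketch with made-up inputs (the names `single` and `blank` are illustrative, not from the tutorial):

```javascript
const yours = /(\S.*\S)/;
const tutorial = /^\s*(.*)\s*$/;

// Hypothetical inputs: one non-space character, and whitespace only.
const single = '   a   ';
const blank = '      ';

// (\S.*\S) needs two non-space characters, so neither input matches.
const yoursOnSingle = single.match(yours);
const yoursOnBlank = blank.match(yours);

// (.*) always matches, but keeps trailing spaces or captures nothing.
const tutorialOnSingle = single.match(tutorial)[1];
const tutorialOnBlank = blank.match(tutorial)[1];

console.log(yoursOnSingle, yoursOnBlank);       // null null
console.log(JSON.stringify(tutorialOnSingle));  // "a   "
console.log(JSON.stringify(tutorialOnBlank));   // ""
```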

But, given the problem description at your link:

Occasionally, you'll find yourself with a log file that has ill-formatted whitespace where lines are indented too much or not enough. One way to fix this is to use an editor's search and replace and a regular expression to extract the content of the lines without the extra whitespace.

From this, matching only the non-whitespace content (like you're doing) probably wouldn't, by itself, remove the undesirable leading and trailing spaces. The tutorial is probably trying to guide you towards a technique that matches a whole line with a particular pattern and then replaces that line with only the captured group, like:

Match ^\s*(.*\S)\s*$, replace with $1: https://regex101.com/r/584uVG/2/
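Outside an editor, the same match-and-replace can be sketched in JavaScript with String.prototype.replace. The sample text and the names `log` and `cleaned` are assumptions for illustration; the pattern is the one above, with the g and m flags so it applies to every line:

```javascript
// Hypothetical log excerpt with inconsistent indentation and trailing spaces.
const log = [
  '      indented too much   ',
  'not indented',
  '\tmixed tab  ',
].join('\n');

// Match each whole line, then keep only the captured content ($1).
const cleaned = log.replace(/^\s*(.*\S)\s*$/gm, '$1');

console.log(JSON.stringify(cleaned));
// "indented too much\nnot indented\nmixed tab"
```

One caveat with this sketch: \s also matches newlines, so a blank line in the input would be swallowed by the leading \s*. If blank lines need to survive, a class such as [^\S\r\n] could stand in for \s to match horizontal whitespace only.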

Your technique would work for the problem if you had a way to make a new text file containing only the captured groups (or all the full matches), e.g.:

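const input = `   foo   
qux  `;
// \S(?:.*\S)? spans the first to the last non-space character of each line,
// and still matches when a line's content is a single character.
const newText = (input.match(/\S(?:.*\S)?/gm) || []).join('\n');
console.log(newText); // prints "foo" and "qux" on separate lines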

Using \S instead of . is not bad. If you know a particular position must be matched by a non-space character rather than by a space, using \S is more precise, makes the intent of the pattern clearer, and can make a bad match fail faster. In some cases it also avoids problems with catastrophic backtracking. These patterns don't have backtracking issues, but it's still a good habit to get into.

CertainPerformance answered Oct 06 '22 01:10
