RegEx to remove repeated start of line using TextWrangler

Question

Trying to turn

a: 1, 2, 3
a: a, b, v
b: 5, 6, 7
b: 10, 1543, 1345
b: e, fe, sdf
cd: asdf, asdfas dfasdfa,asdfasdfa,afdsfa sdf
e1: asdfas, dafasd, adsf, asdfasd
e1: 1, 3, 2
e1: 9, 8, 7, 6

into

a: 1, 2, 3
   a, b, v
b: 5, 6, 7
   10, 1543, 1345
   e, fe, sdf
cd: asdf, asdfas dfasdfa,asdfasdfa,afdsfa sdf
e1: asdfas, dafasd, adsf, asdfasd
    1, 3, 2
    9, 8, 7, 6

So, the lines are sorted. If consecutive lines start with the same sequence of characters up to and including some separator (here the colon and the blank following it), only the first instance should be preserved, along with the remainder of every line. There could be up to about a dozen and a half lines starting with the identical sequence of characters. The input holds about 4,500 lines…
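For reference, the transformation described above can be sketched outside the editor as a plain loop. This is only an illustration in Python, not a TextWrangler solution; the function name collapse_repeated_prefixes and the default separator ": " are assumptions made for the example.

```python
def collapse_repeated_prefixes(lines, sep=": "):
    """Blank out a prefix that repeats the prefix of the previous line.

    Hypothetical helper: the name and the default separator are
    assumptions for this sketch, not part of the original question.
    """
    result = []
    previous = None
    for line in lines:
        prefix, found, rest = line.partition(sep)
        if found and prefix == previous:
            # Same prefix as the line before: pad with spaces so the
            # remainder stays aligned with the first line of the group.
            result.append(" " * (len(prefix) + len(sep)) + rest)
        else:
            result.append(line)
            previous = prefix if found else None
    return result


sample = [
    "a: 1, 2, 3",
    "a: a, b, v",
    "b: 5, 6, 7",
    "e1: x",
    "e1: 1, 3, 2",
]
collapsed = collapse_repeated_prefixes(sample)
print("\n".join(collapsed))
```

Because the input is sorted, comparing each prefix with the one on the previous line is enough; no lookahead over the whole group is needed.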

Tried in TextWrangler.

Whilst the search pattern

^([[:alnum:]]+): (.+)\r((\1:) (.+)\r)*

matches correctly, neither the replacement

\1:\t\2\r\t\3

nor

\1:\t\2\r\t\4

gets me anywhere close to what I'm looking for.

The search pattern

^(.+): (.+)\r((?<=\1:) (.+)\r)*

is rejected because the lookbehind is not fixed length. I am not sure it is heading in the right direction anyway, though.

Looking at "How to merge lines that start with the same items in a text file", I wonder whether there is an elegant (say: one search pattern, one replacement, run once) solution at all.

On the other hand, I might just not be able to come up with the right question to search the net for. If you know better, please point me in the right direction.

Keeping the remainder of the rows aligned is, of course, icing on the cake…

Thank you for your time.

Jonny 5 · Accepted Answer

As a workaround for variable length lookbehind: PCRE allows alternatives of variable length

PCRE is not fully Perl-compatible when it comes to lookbehind. While Perl requires alternatives inside lookbehind to have the same length, PCRE allows alternatives of variable length.

One idea, which needs one alternative (pipe) per possible prefix length:

(?<=(\w\w:)|(\w:)) (.*\n?)\1?\2?

Replace with \t\3. See the test at regex101. Capturing inside the lookbehind is important so that the prefix is neither consumed nor skipped as a match. The same pattern with a variable-length lookbehind, e.g. in .NET: (?<=(\w+:)) (.*\n?)\1?

(?<=(\w\w:)|(\w:)) holds the first two capture groups, inside the lookbehind, for capturing the prefix: two or one word characters followed by a colon. \w is shorthand for [A-Za-z0-9_].
(.*\n?) is the third capture group, for the text between prefixes. The newline is optional so that the last line also matches.
\1?\2? optionally matches (and thereby removes) the same prefix if it starts the following line. Only one of the two can be set: \1 xor \2. Note that a space after a colon is always matched, regardless of the prefix.

Summary: The space after each prefix is converted to a tab. The prefix of the following line is removed only if it matches the current one.
To match and replace multiple spaces and tabs: (?<=(\w\w:)|(\w:))[ \t]+(.*\n?)\1?\2?
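The same idea can be tried outside PCRE. The sketch below adapts it to Python's built-in re module, which requires every lookbehind to be fixed width; as an assumption of this example, the alternation is therefore moved outside, giving two separate fixed-width lookbehinds. The names PREFIX_RUN and strip_repeated_prefixes are made up for illustration.

```python
import re

# Adaptation for Python's re module (an assumption of this sketch):
# one fixed-width lookbehind per prefix length, combined by alternation.
# Group 1: two-character prefix, group 2: one-character prefix,
# group 3: rest of the line including the optional newline.
PREFIX_RUN = re.compile(r"(?:(?<=(\w\w:))|(?<=(\w:))) (.*\n?)\1?\2?")


def strip_repeated_prefixes(text):
    # The space after the prefix becomes a tab; the trailing \1?\2?
    # swallows the same prefix at the start of the next line, if present.
    return PREFIX_RUN.sub(r"\t\3", text)


sample = "a: 1, 2, 3\na: a, b, v\nb: 5, 6, 7\ne1: x\ne1: 1, 3, 2\n"
stripped = strip_repeated_prefixes(sample)
print(stripped)
```

As with the original pattern, prefixes longer than two characters would need one more lookbehind alternative each.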

Carsten Hagemann · Answer

The problem with the substitution is the uncertain number of matches. When you limit that number e.g. to 12, you could use a regex like this:

^([^:]+): ([^\r\n]+[\r\n]*)(\1: ([^\r\n]+[\r\n]*))?(\1: ([^\r\n]+[\r\n]*))?(\1: ([^\r\n]+[\r\n]*))?(\1: ([^\r\n]+[\r\n]*))?(\1: ([^\r\n]+[\r\n]*))?(\1: ([^\r\n]+[\r\n]*))?(\1: ([^\r\n]+[\r\n]*))?(\1: ([^\r\n]+[\r\n]*))?(\1: ([^\r\n]+[\r\n]*))?(\1: ([^\r\n]+[\r\n]*))?(\1: ([^\r\n]+[\r\n]*))?(\1: ([^\r\n]+[\r\n]*))?

with this replacement:

\r\n\1: \2 \4 \6 \8 \10 \12 \14 \16 \18 \20 \22 \24 \26

Explanation: it basically contains just two sub-regexes

^([^:]+): ([^\r\n]+[\r\n]*) matches the first line of a group
(\1: ([^\r\n]+[\r\n]*))? optionally matches a consecutive line belonging to the same group. You have to repeat this sub-regex as often as needed to match all lines (in this case 12 times). Because each copy is optional (?), you won't get an error if there are fewer lines than substitutions.
the \r\n at the beginning of the substitution is needed for a formatting issue
the result will contain a few empty lines, but I'm sure you can solve that... ;-)

DEMO 1
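Rather than pasting the optional sub-regex by hand, the bounded pattern can be generated. The following Python sketch assumes an upper bound of 12 follow-up lines, uses a non-capturing outer group so the data groups are numbered consecutively, and builds the replacement in a callback that pads with spaces; all names are hypothetical and the callback is a swapped-in alternative to the numbered replacement string.

```python
import re

MAX_FOLLOW_UPS = 12  # assumed upper bound on follow-up lines per group

# First line of a group: prefix (group 1) and its data (group 2).
# Unlike the pattern in the answer, the prefix class also excludes
# line breaks so a line without a colon cannot span several lines.
FIRST_LINE = r"^([^:\r\n]+): ([^\r\n]+[\r\n]*)"
# Optional follow-up line with the same prefix; only its data is captured.
FOLLOW_UP = r"(?:\1: ([^\r\n]+[\r\n]*))?"
BOUNDED = re.compile(FIRST_LINE + FOLLOW_UP * MAX_FOLLOW_UPS, re.MULTILINE)


def indent_follow_ups(match):
    prefix = match.group(1)
    padding = " " * (len(prefix) + 2)  # width of the prefix plus ": "
    # Optional groups that did not participate are None; skip them.
    bodies = [g for g in match.groups()[1:] if g is not None]
    return prefix + ": " + padding.join(bodies)


def collapse_bounded(text):
    return BOUNDED.sub(indent_follow_ups, text)


bounded_result = collapse_bounded("a: 1\na: 2\nb: 3\n")
print(bounded_result)
```

The limitation stays the same as with the hand-written regex: a group with more follow-up lines than MAX_FOLLOW_UPS starts a new match, so its prefix is printed again.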

However, since I'm not a fan of oversized regexes, and in case you have a larger number of potential matches, I would prefer a solution like this:

First, combine all lines belonging to the same group (as you already mentioned: How to merge lines that start with the same items in a text file). Within these steps, you can replace the group item with something unique (e.g. :@:).
Then replace this unique item with a line break followed by a tab.

DEMO 2
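A minimal Python sketch of this two-step idea follows. It assumes the merge step is a search and replace repeated until nothing changes, and that the marker :@: never occurs in the data; the regex and the function name merge_then_split are illustrative guesses, not the exact steps from the linked question.

```python
import re

MARKER = ":@:"  # unique placeholder; assumed not to occur in the data

# Joins a line with the next one when both start with the same prefix,
# replacing the second prefix by the marker.
MERGE = re.compile(r"^([^:\n]+): (.*)\n\1: ", re.MULTILINE)


def merge_then_split(text):
    # Step 1: merge repeatedly until no adjacent pair shares a prefix.
    while True:
        text, count = MERGE.subn(r"\1: \2" + MARKER, text)
        if count == 0:
            break
    # Step 2: turn every marker into a line break plus a tab.
    return text.replace(MARKER, "\n\t")


two_step_result = merge_then_split("a: 1\na: 2\na: 3\nb: 4\n")
print(two_step_result)
```

Each pass merges at least one pair per group that still has repeats, so the loop finishes after at most as many passes as the longest group has lines.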

John Smith · Answer

The awk one-liner below will do what you want

awk -F: 'NR==1 {print $0} NR != 1 {if ($1 != prev) print $0; else {for (i=0; i<=length($1); ++i) printf " "; print $2;}} {prev=$1}' < input_file.txt

(put the original text into input_file.txt)

I believe nicer code is possible, but it is time to go to bed :)

Tags: regex, replace, textwrangler

Abecee · Question
