Extract domains from one column while keeping other columns

Question

I have a file with three columns that looks like this:

0       1612291061      http://www.staropolska.pl/
0       1612450417      http://m.kerygma.pl/
6831926761338023936     1612171787      http://www.kerygma.pl/hermeneutyka-biblijna/377-ksiegi-starego-testamentu-mini-streszczenie
6867871457052077056     1612534199      http://www.kerygma.pl/katechizm-kkk/kkk-iv-modlitwa/538-kkk-2558-2565

I want to reduce each URL in the third column to its scheme and domain while keeping the first two columns, so the output should look like this:

0       1612291061      http://www.staropolska.pl
0       1612450417      http://m.kerygma.pl
6831926761338023936     1612171787      http://www.kerygma.pl
6867871457052077056     1612534199      http://www.kerygma.pl

So far I can extract the domains using grep:

cat file.txt | grep -Eo '(http|https)://[^/"]+'

but this gives me only the domains from the third column:

http://www.staropolska.pl
http://m.kerygma.pl
http://www.kerygma.pl
http://www.kerygma.pl

without printing the first two.

Christian Fritz · Accepted Answer

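Another option is cut, using / as the delimiter and keeping only the first three fields, which end at the host name (this assumes the first two columns never contain a /):

$ cut -d '/' -f 1-3 file.txt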
0       1612291061      http://www.staropolska.pl
0       1612450417      http://m.kerygma.pl
6831926761338023936     1612171787      http://www.kerygma.pl
6867871457052077056     1612534199      http://www.kerygma.pl
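
To see why fields 1-3 are the right choice, here is a minimal sketch on a single row. The row itself is a made-up sample, and the tab separators are an assumption about how the real file is delimited; cut only cares about the / characters.

```shell
# Hypothetical sample row; the tab separators are an assumption about the input file.
row=$(printf '0\t1612291061\thttp://www.kerygma.pl/katechizm-kkk/kkk-iv-modlitwa')

# Splitting on '/' yields:
#   field 1: everything up to and including 'http:'
#   field 2: the empty string between the two slashes
#   field 3: the host name
# cut rejoins the selected fields with the same delimiter,
# so fields 1-3 reproduce the row up to the end of the host.
trimmed=$(printf '%s\n' "$row" | cut -d '/' -f 1-3)

printf '%s\n' "$trimmed"
```

Because the split is purely positional, a / anywhere in the first two columns would shift the field numbers, which is the main limitation of this approach.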

anubhava · Answer

You just need to let the grep regex also match everything before https?://:

grep -Eo '.*[[:blank:]]https?://[^/"]+' file

0       1612291061      http://www.staropolska.pl
0       1612450417      http://m.kerygma.pl
6831926761338023936     1612171787      http://www.kerygma.pl
6867871457052077056     1612534199      http://www.kerygma.pl

RegEx Explained:

.*: Match 0 or more of any character
[[:blank:]]: Match one space or tab character
https?: Match https or http
://: Match ://
[^/"]+: Match 1+ of any character that is not a / and not a "
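
One behaviour worth knowing about, shown in the sketch below: because -o prints only the matched part, a line that contains no URL produces no output at all. The three sample rows (including the row with not-a-url) are invented for illustration, and the tab separators are an assumption.

```shell
# Hypothetical sample: two rows with URLs and one row without (tab-separated by assumption).
sample=$(printf '0\t1612450417\thttp://m.kerygma.pl/\n7\t1612000000\tnot-a-url\n8\t1612534199\thttps://www.kerygma.pl/katechizm-kkk/538')

# -o prints only the matched text: the leading columns (.*), one blank,
# then the scheme and host. The row without a URL does not match and is dropped.
matched=$(printf '%s\n' "$sample" | grep -Eo '.*[[:blank:]]https?://[^/"]+')

printf '%s\n' "$matched"
```

If rows without a URL must survive in the output, a substitution-based tool is a better fit than grep -o.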

Alternatively, you can use sed:

sed -E 's~([[:blank:]]https?://[^/"]+).*~\1~' file
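
A short sketch of how the substitution behaves, using invented sample rows (the not-a-url row and the tab separators are assumptions): the capture group keeps the blank, scheme and host, the trailing .* consumes the path, and a line where the pattern does not match is printed unchanged.

```shell
# Hypothetical sample: two rows with URLs and one row without (tab-separated by assumption).
sample=$(printf '0\t1612450417\thttp://m.kerygma.pl/\n7\t1612000000\tnot-a-url\n8\t1612534199\thttps://www.kerygma.pl/katechizm-kkk/538')

# ~ is used as the s command delimiter so the slashes in the pattern need no escaping.
# \1 restores the captured blank + scheme + host; everything after it is discarded.
trimmed=$(printf '%s\n' "$sample" | sed -E 's~([[:blank:]]https?://[^/"]+).*~\1~')

printf '%s\n' "$trimmed"
```

The text before the blank is not part of the match, so the first two columns are left exactly as they were in the input.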

Tags: regex, grep

MKorona
