
Extract domains from one column while keeping other columns

Tags: regex, grep

I have a file with three columns that looks like this:

0       1612291061      http://www.staropolska.pl/
0       1612450417      http://m.kerygma.pl/
6831926761338023936     1612171787      http://www.kerygma.pl/hermeneutyka-biblijna/377-ksiegi-starego-testamentu-mini-streszczenie
6867871457052077056     1612534199      http://www.kerygma.pl/katechizm-kkk/kkk-iv-modlitwa/538-kkk-2558-2565

I want to reduce each URL in the third column to its scheme and domain while keeping the first two columns, so the output should look like this:

0       1612291061      http://www.staropolska.pl
0       1612450417      http://m.kerygma.pl
6831926761338023936     1612171787      http://www.kerygma.pl
6867871457052077056     1612534199      http://www.kerygma.pl

So far I am able to extract domains using grep:

cat file.txt | grep -Eo '(http|https)://[^/"]+'

but this gives me only the domains from the third column:

http://www.staropolska.pl
http://m.kerygma.pl
http://www.kerygma.pl
http://www.kerygma.pl

without printing the first two.

MKorona asked Mar 23 '21 14:03


2 Answers

Another option is cut with / as the delimiter, keeping fields 1-3 (everything before the third slash on each line):

$ cut -d '/' -f 1-3 file.txt
0       1612291061      http://www.staropolska.pl
0       1612450417      http://m.kerygma.pl
6831926761338023936     1612171787      http://www.kerygma.pl
6867871457052077056     1612534199      http://www.kerygma.pl
Christian Fritz answered Sep 28 '22 06:09


You just need to let the grep regex also match everything that comes before https?://:

grep -Eo '.*[[:blank:]]https?://[^/"]+' file

0       1612291061      http://www.staropolska.pl
0       1612450417      http://m.kerygma.pl
6831926761338023936     1612171787      http://www.kerygma.pl
6867871457052077056     1612534199      http://www.kerygma.pl

RegEx Explained:

  • .*: Match 0 or more of any character (this is what keeps the first two columns in the match)
  • [[:blank:]]: Match one space or tab character
  • https?: Match https or http
  • ://: Match ://
  • [^/"]+: Match 1+ of any character that is not a / and not a "
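
The pattern described above can be tried on inline data without a file. This is only a sketch: the function name keep_prefix_and_domain and the sample rows are illustrative. It assumes a grep that supports -o (GNU and BSD grep do), and it shows one side effect of -o: lines with no URL produce no output at all.

```shell
# Hypothetical wrapper around the grep command from the answer.
# -o prints only the matched part, so the URL path is cut off
# and lines without a match are dropped entirely.
keep_prefix_and_domain() {
  grep -Eo '.*[[:blank:]]https?://[^/"]+'
}

# One row with a URL and one row without; only the first is printed.
printf '0\t1612450417\thttp://m.kerygma.pl/\nno url on this line\n' |
  keep_prefix_and_domain
```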

Alternatively, you can do the same with sed:

sed -E 's~([[:blank:]]https?://[^/"]+).*~\1~' file
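
One practical difference from grep -o is that sed prints lines it cannot match unchanged instead of dropping them. A small sketch to illustrate this, using a hypothetical function name strip_url_path and made-up sample rows, and assuming a sed that accepts -E (GNU and BSD sed do):

```shell
# Hypothetical wrapper around the sed command from the answer.
# The capture group holds the blank plus scheme and domain; the trailing
# .* swallows the path, and \1 puts back only the captured part.
strip_url_path() {
  sed -E 's~([[:blank:]]https?://[^/"]+).*~\1~'
}

# The second line has no URL and passes through untouched.
printf '0\t1612291061\thttp://www.staropolska.pl/\nno url on this line\n' |
  strip_url_path
```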
anubhava answered Sep 28 '22 07:09