I have a file with three columns that looks like this:
0 1612291061 http://www.staropolska.pl/
0 1612450417 http://m.kerygma.pl/
6831926761338023936 1612171787 http://www.kerygma.pl/hermeneutyka-biblijna/377-ksiegi-starego-testamentu-mini-streszczenie
6867871457052077056 1612534199 http://www.kerygma.pl/katechizm-kkk/kkk-iv-modlitwa/538-kkk-2558-2565
I want to extract the domains from the third column while keeping the first two columns, so the output file should look like this:
0 1612291061 http://www.staropolska.pl
0 1612450417 http://m.kerygma.pl
6831926761338023936 1612171787 http://www.kerygma.pl
6867871457052077056 1612534199 http://www.kerygma.pl
So far I am able to extract domains using grep:
cat file.txt | grep -Eo '(http|https)://[^/"]+'
but this prints only the domains from the third column:
http://www.staropolska.pl
http://m.kerygma.pl
http://www.kerygma.pl
http://www.kerygma.pl
without printing the first two.
Another option is cut, using / as the delimiter:
$ cat file.txt | cut -d '/' -f 1-3
0 1612291061 http://www.staropolska.pl
0 1612450417 http://m.kerygma.pl
6831926761338023936 1612171787 http://www.kerygma.pl
6867871457052077056 1612534199 http://www.kerygma.pl
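The cut approach works here because the first two columns never contain a /, so fields 1-3 are everything up to the end of the host. If that assumption might not hold, a sketch that touches only the third column with awk could look like this. The function name strip_path is hypothetical, and the output columns are joined with single spaces regardless of the whitespace used in the input.

```shell
# Hypothetical helper: rebuild "scheme://host" from the third column only.
strip_path() {
  # split() on "/" turns "http://host/path" into p[1]="http:", p[2]="", p[3]="host", ...
  awk '{ split($3, p, "/"); print $1, $2, p[1] "//" p[3] }' "$@"
}

# Usage with two lines in the question's format:
printf '%s\n' \
  '0 1612291061 http://www.staropolska.pl/' \
  '6867871457052077056 1612534199 http://www.kerygma.pl/katechizm-kkk/kkk-iv-modlitwa/538-kkk-2558-2565' |
  strip_path
```

Calling it as strip_path file.txt reads from the file instead of standard input.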
You just need to allow the grep regex to match anything before https?://:
grep -Eo '.*[[:blank:]]https?://[^/"]+' file
0 1612291061 http://www.staropolska.pl
0 1612450417 http://m.kerygma.pl
6831926761338023936 1612171787 http://www.kerygma.pl
6867871457052077056 1612534199 http://www.kerygma.pl
RegEx Explained:
.* : Match 0 or more of any character
[[:blank:]] : Match one space or tab character
https? : Match https or http
:// : Match ://
[^/"]+ : Match 1+ of any character that is not a / and not a "
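As a quick check of the pieces described above, the following sketch wraps the same grep command in a function (the name keep_domain and the sample lines are made up for illustration) and feeds it an http line, an https line, and a line with no URL.

```shell
# Hypothetical wrapper around the grep command from the answer.
keep_domain() {
  grep -Eo '.*[[:blank:]]https?://[^/"]+' "$@"
}

printf '%s\n' \
  '0 1612291061 http://www.staropolska.pl/' \
  '7 1612534199 https://example.org/a/b?c=d' \
  '8 1612534200 no-url-here' |
  keep_domain
# Prints:
# 0 1612291061 http://www.staropolska.pl
# 7 1612534199 https://example.org
```

Because -o prints only the matched text, a line that contains no URL produces no output at all.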
Alternatively, you may try this sed command as well:
sed -E 's~([[:blank:]]https?://[^/"]+).*~\1~' file
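The sed command captures the blank plus scheme and host in group 1, lets the trailing .* consume the rest of the line, and replaces the whole match with the captured group. Using ~ as the s delimiter avoids escaping the slashes. A minimal sketch of its behavior, with a hypothetical function name trim_url_path and made-up sample lines:

```shell
# Hypothetical wrapper around the sed command from the answer.
trim_url_path() {
  sed -E 's~([[:blank:]]https?://[^/"]+).*~\1~' "$@"
}

printf '%s\n' \
  '0 1612291061 http://www.staropolska.pl/' \
  '8 1612534200 no-url-here' |
  trim_url_path
# Prints:
# 0 1612291061 http://www.staropolska.pl
# 8 1612534200 no-url-here
```

Unlike grep -o, sed passes lines without a URL through unchanged, which may or may not be what you want.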