
Complex changes to a URL with sed

Tags: regex, url, sed

I am trying to parse an RSS feed on the Linux command line which involves formatting the raw output from the feed with sed.

I currently use this command:

feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" | sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/'

This gives me one feed item per line, each looking like this:

Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom

Notice the long URL at the end. I want to shorten this to better fit on the command line. Therefore, I want to change my sed command to produce the following:

Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664

That means cutting everything out of the URL except a dash and the seven-digit number preceding the ".html/blablabla" bit.

Currently my sed command only changes stuff in the date bit. It would have to leave the title and the start of the URL alone, then cut everything out of the URL until it reaches the seven-digit number. It needs to preserve that number and then cut out everything after it. Oh yeah, and we need to leave a dash right in front of that number too.

I have no idea how to do that and can't find the answer after hours of googling. Help?

EDIT:

This is the raw output of a line of feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}", in case it helps:

Sat, 22 Feb 2014 20:33:00 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom

EDIT 2:

It seems I can only pipe that output into one command. Piping it through multiple ones seems to break things. I don't understand why ATM.

fabsh asked Oct 02 '22 12:10


2 Answers

Unfortunately (for me), I could only think of solving this with extended regexp syntax (enabled with the -E or -r flag, depending on the system):

... | sed -E 's|(://[^/]+/).*(-[0-9]+)\.html/.*|\1\2|'

UPDATE: In basic regexp syntax, the best I can do is

... | sed 's|\(://[^/]*/\).*\(-[0-9][0-9]*\)\.html/.*|\1\2|'
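Since the question mentions that only one command can sit after the pipe, here is a minimal sketch of how the date-trimming expression from the question and the basic-regexp substitution above could be combined into a single sed call with two -e expressions. The helper name format_item and the variable raw are made up for illustration; the sample line is the raw feedstail output quoted in the question.

```shell
# Sample raw line as printed by feedstail (copied from the question).
raw='Sat, 22 Feb 2014 20:33:00 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom'

# format_item is a hypothetical helper; it reads feed lines on stdin.
format_item() {
  # 1st expression: trim the date (the asker's original substitution).
  # 2nd expression: keep "://host/" and "-digits", drop the rest of the URL.
  sed -e 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/' \
      -e 's|\(://[^/]*/\).*\(-[0-9][0-9]*\)\.html/.*|\1\2|'
}

printf '%s\n' "$raw" | format_item
# prints: Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664
```

In the real pipeline, format_item (or just the sed call inside it) would replace the single sed command after feedstail.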
mike.dld answered Oct 05 '22 11:10


The key to writing this sort of regular expression is to be very precise about where the parts you expect begin and end, so that the random gunk you want to get rid of doesn't cause you problems. Also, bear in mind that the s command accepts delimiters other than /, which saves you from escaping every slash in a URL.

sed 's!\(http://www\.heise\.de/\)newsticker/meldung/[^./]*\(-[0-9][0-9]*\)\.html[^ ]*!\1\2!'

Note the [0-9][0-9]* for "one or more digits": in basic regular expressions an unescaped + is just a literal plus sign. Be aware that getting the RE right can be quite tricky; assume you'll need to test it! (This is a key part of the “now you have two problems” quote; REs very easily become horrendous.)
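As a rough sketch of what such testing could look like, the snippet below wraps the substitution in a function and compares its output with the expected result for a couple of sample lines. The names shorten_url and check are hypothetical helpers, and the pattern uses [0-9][0-9]* for the digits because an unescaped + is literal in basic regular expressions.

```shell
# shorten_url is a hypothetical wrapper around the sed substitution.
shorten_url() {
  sed 's!\(http://www\.heise\.de/\)newsticker/meldung/[^./]*\(-[0-9][0-9]*\)\.html[^ ]*!\1\2!'
}

# check INPUT EXPECTED: run INPUT through shorten_url and compare the result.
check() {
  actual=$(printf '%s\n' "$1" | shorten_url)
  if [ "$actual" = "$2" ]; then
    echo "ok"
  else
    echo "FAIL: got '$actual'"
    return 1
  fi
}

# The sample item from the question, after the date has been trimmed.
check 'Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom' \
      'Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664'

# A line without a matching URL should pass through unchanged.
check 'Sat 20:33 GMT> no link here' 'Sat 20:33 GMT> no link here'
```

Adding further check lines for any odd-looking feed items is a cheap way to catch surprises before the expression goes into the live pipeline.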

Donal Fellows answered Oct 05 '22 12:10