Complex changes to a URL with sed

Question

I am trying to parse an RSS feed on the Linux command line which involves formatting the raw output from the feed with sed.

I currently use this command:

feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" | sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/'

This gives me a number of feed items per line that look like this:

Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom

Notice the long URL at the end. I want to shorten this to better fit on the command line. Therefore, I want to change my sed command to produce the following:

Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664

That means cutting everything out of the URL except a dash and the seven-digit number preceding the ".html/blablabla" bit.

Currently my sed command only changes stuff in the date bit. It would have to leave the title and the start of the URL alone and then cut stuff out of it until it reaches the seven-digit number. It needs to preserve that and then cut out everything after it. Oh yeah, and we need to leave a dash right in front of that number too.

I have no idea how to do that and can't find the answer after hours of googling. Help?

EDIT:

This is the raw output of a line of feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}", in case it helps:

Sat, 22 Feb 2014 20:33:00 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom

EDIT 2:

It seems I can only pipe that output into one command. Piping it through multiple ones seems to break things. I don't understand why ATM.

mike.dld · Accepted Answer

Unfortunately (for me), I could only think of solving this with extended regexp syntax (either -E or -r flag on different systems):

... | sed -E 's|(://[^/]+/).*(-[0-9]+)\.html/.*|\1\2|'

UPDATE: In basic regexp syntax, the best I can do is

... | sed 's|\(://[^/]*/\).*\(-[0-9][0-9]*\)\.html/.*|\1\2|'
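
As a minimal sketch, the date trimming from the question and the URL shortening above could be combined into a single sed invocation with two -e expressions, which also keeps the pipeline down to one command after feedstail (see EDIT 2 in the question). The function name shorten_feed_line is made up for illustration, the sample line is the raw feedstail output quoted in the question, and the first expression assumes the fixed-width timestamp layout shown there.

```shell
# Hypothetical helper (the name is illustrative, not from the original posts).
# Both substitutions use basic regexp syntax and run in one sed process.
shorten_feed_line() {
    # 1st -e: keep the weekday (3 chars) and " HH:MM" (6 chars),
    #         drop the date (13 chars) and the ":SS" seconds (3 chars).
    # 2nd -e: keep "://host/" and the trailing "-digits", drop the rest of the URL.
    sed -e 's/^\(.\{3\}\).\{13\}\(.\{6\}\).\{3\}/\1\2/' \
        -e 's|\(://[^/]*/\).*\(-[0-9][0-9]*\)\.html/.*|\1\2|'
}

# Sample raw line taken from the question.
line='Sat, 22 Feb 2014 20:33:00 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom'

printf '%s\n' "$line" | shorten_feed_line
```

In the real pipeline the function would simply take the place of the sed command after feedstail.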

Donal Fellows · Answer

The key to writing this sort of regular expression is to be very careful about the boundaries of what you expect to match, so that the random gunk you want to get rid of doesn't cause you problems. Also, bear in mind that you can use characters other than / as the delimiter of an s operation.

sed 's!\(http://www\.heise\.de/\)newsticker/meldung/[^./]*\(-[0-9][0-9]*\)\.html[^ ]*!\1\2!'

Be aware that getting the RE right can be quite tricky; assume you'll need to test it! (This is a key part of the “now you have two problems” quote; REs very easily become horrendous.)
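
One simple way to do that testing is to feed the expression a known input line and compare the result with the output you want. The sketch below does this with the sample line and the desired result from the question; the function name shorten_url and the PASS/FAIL messages are illustrative only. It writes "one or more digits" as [0-9][0-9]* because + is not a portable operator in basic regexp syntax.

```shell
# Hypothetical wrapper (the name is illustrative) around the substitution under test.
shorten_url() {
    sed 's!\(http://www\.heise\.de/\)newsticker/meldung/[^./]*\(-[0-9][0-9]*\)\.html[^ ]*!\1\2!'
}

# Input line and desired output, both taken from the question.
input='Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom'
expected='Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664'

actual=$(printf '%s\n' "$input" | shorten_url)

# Report whether the substitution produced exactly the expected line.
if [ "$actual" = "$expected" ]; then
    echo "PASS"
else
    echo "FAIL: got '$actual'"
fi
```

Adding a few more sample lines (for instance a title that itself contains digits or dashes) is a cheap way to check that the boundaries in the expression hold up.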

Tags: regex, url, sed

fabsh
