Can sed regex simulate lookbehind and lookahead?

Tags: regex, regex-negation, sed, awk, regex-lookarounds

I'm trying to write a sed script that will capture all "naked" URLs in a text file and replace them with <a href=[URL]>[URL]</a>. By "naked" I mean a URL that is not wrapped inside an anchor tag.

My initial thought was that I should match URLs that do not have a " or a > in front of them, and also do not have a < or a " after them. However, I am having difficulty expressing the concept of "do not have in front of or behind" because, as far as I know, sed does not have look-ahead or look-behind.

Sample Input:

[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]

Sample Desired Output:

[Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
<a href="http://test.com">http://test.com</a> other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]

Observe that the third line is unmodified, because its URL is already inside <a href>. The first, second, and fourth lines, on the other hand, are modified. Finally, observe that all non-URL text is unmodified.

Ultimately, I am trying to do something like:

sed s/[^>"](http:\/\/[^\s]\+)/<a href="\1">\1<\/a>/g 2-7-2013

I began by verifying that the following will correctly match and remove a URL:

sed 's/http:\/\/[^\s]\+//g'

I then tried this, but it is not able to match URLs that start at the beginning of file / input:

sed 's/[^\>"]http:\/\/[^\s]\+//g'

Is there a way to work around this in sed, either by simulating lookbehind / lookahead, or explicitly matching beginning of file and end of file?

asked Feb 15 '13 01:02

merlin2011

2 Answers

sed is an excellent tool for simple substitutions on a single line; for any other text manipulation problem, just use awk.

Check the definition I'm using in the BEGIN section below for a regexp that matches URLs. It works for your sample, but I don't know if it captures all possible URL formats. Even if it doesn't, it may be adequate for your needs.

$ cat file
[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]
$
$ awk -f tst.awk file
[Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
<a href="http://test.com">http://test.com</a> other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]
$
$ cat tst.awk
BEGIN{ urlRe="http:[/][/][[:alnum:]._]+" }
{
    head = ""
    tail = $0
    while ( match(tail,urlRe) ) {
       url  = substr(tail,RSTART,RLENGTH)
       href = "href=\"" url "\""

       if (index(tail,href) == (RSTART - 6) ) {
          # this url is inside href="url" so skip processing it and the next url match.
          count = 2
       }

       if (! (count && count--)) {
          url = "<a " href ">" url "</a>"
       }

       head = head substr(tail,1,RSTART-1) url
       tail = substr(tail,RSTART+RLENGTH)
    }

    print head tail
}

answered Sep 30 '22 18:09

Ed Morton

The obvious problem with your command is that you did not escape the parentheses "(" and ")".

This is a quirk of sed regex. Unlike Perl regex, sed's default syntax treats many symbols, including the parentheses, as literal characters. You have to escape them to give them their special function. Try:

s/\([^>"]\?\)\(http:\/\/[^\s]\+\)/\1<a href="\2">\2<\/a>/g
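Since sed has no lookbehind, one common workaround is to capture whatever sits directly in front of the URL (either the start of the line or a single character that is neither > nor ") and put it back in the replacement. The following is only a minimal sketch of that idea, assuming a sed that accepts -E for extended regular expressions (GNU sed and current BSD sed do). The function name linkify is hypothetical, and the URL character class is an assumption loosely based on the pattern in the awk answer, so it will not cover every URL format. It spells out a bracket class instead of [^\s], because inside a POSIX bracket expression a backslash is an ordinary character.

```shell
# linkify: hypothetical helper that wraps naked http:// URLs in anchor tags.
# (^|[^>"])  captures the start of the line, or one character that is neither
#            > nor ", standing in for a negative lookbehind; \1 puts it back.
# The URL character class is an assumption and will not match every URL.
linkify() {
    sed -E 's#(^|[^>"])(http://[[:alnum:]._/-]+)#\1<a href="\2">\2</a>#g' "$@"
}

# Usage: read from stdin, or pass file names as arguments.
printf '%s\n' \
    '[Beginning of File]http://foo.bar arbitrary text' \
    '<a href="http://foobar.com">http://foobar.com</a>' |
    linkify
```

Note that this only inspects the one character before the URL, so a naked URL that happens to follow a > or a " for some other reason (for example inside <p>...</p>) would be skipped as well.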

answered Sep 30 '22 19:09

SwiftMango
