
How to grep for a URL in a file?

Tags:

regex

grep

For example, I have a huge HTML file that contains an image URL: http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg

I want to extract this URL, assuming it's the only URL in the entire file.

cat file.html | grep -o 'http://ex[a-zA-Z.-]*/[a-zA-Z.-]*/[a-zA-Z.,-]*'

This works only if the URL doesn't contain plus signs.

How do I make it work for + signs as well?

687

asked Nov 28 '12 18:11

Leonardo DaVintik

2 Answers

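Your bracket expressions are missing the range 0-9 and the + sign. Keep the hyphen last inside the brackets so that it is read as a literal character rather than as a range operator (this is also a useless use of cat):

grep -o 'http://ex[a-zA-Z.-]*/[a-zA-Z0-9+-]*/[a-zA-Z0-9.,+-]*' file.html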

A slight improvement: use -i for case insensitivity and -E so that \.jpe?g matches only images ending in .jpg or .jpeg.

grep -Eio 'http://ex[a-z.-]*/[a-z0-9+-]*/[a-z0-9.,+-]*\.jpe?g' file.html

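Or how about just the following (.* is greedy, so this assumes nothing later on the same line ends in .jpg or .jpeg):

grep -Eio 'http://ex\.example.*\.jpe?g' file.html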
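As a quick check of the bracket-expression rules these patterns depend on, here is a minimal sketch. It assumes a grep that supports -E, -i, -o and -c (GNU grep does), reuses the sample URL from the question, and the variable names url, full, loose and strict are only illustrative.

```shell
#!/bin/sh
# Sample URL taken from the question; the variable names are illustrative.
url='http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg'

# Inside [...], + is literal, and a hyphen is literal when it comes last.
full=$(printf '%s\n' "$url" |
  grep -Eio 'http://ex[a-z.-]*/[a-z0-9+-]*/[a-z0-9.,+-]*\.jpe?g')
echo "$full"

# [.jpe?g] is a bracket expression: it matches ONE character out of
# . j p e ? g, so a name ending in "g" such as pic.png still matches.
loose=$(printf 'pic.png\n' | grep -c '[.jpe?g]$')
echo "$loose"

# \.jpe?g with -E requires a literal dot, "jp", an optional "e", then "g".
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
strict=$(printf 'pic.png\n' | grep -Ec '\.jpe?g$' || true)
echo "$strict"
```

The first command prints the full URL, while the two counts show why the extension test needs an escaped dot rather than square brackets.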

100

answered Sep 29 '22 09:09

Chris Seymour

The following fixes your regular expression for this specific case (by including digits and plus signs):

http://ex[a-zA-Z.-]*/[a-zA-Z0-9.+-]*/[a-zA-Z0-9.+-]*

Demonstration:

echo "For example, I have a huge HTML file that contains img URL: http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg" | grep -o 'http://ex[a-zA-Z.-]*/[a-zA-Z0-9.+-]*/[a-zA-Z0-9.+-]*'

Output:

http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg

Note that this pattern does not match every valid URL. There are plenty of other answers on this site about URL matching.
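If the host and path components are not known in advance, one possible loosening is to match from the scheme up to the first quote, angle bracket, or whitespace character. The sketch below is an assumption-laden illustration rather than a complete URL matcher: the HTML fragment and the variable names html and found are made up for the example, and it assumes a grep that supports -E and -o.

```shell
#!/bin/sh
# Hypothetical HTML fragment standing in for file.html.
html='<p><img src="http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg" alt="x"></p>'

# Match http:// or https://, then every character up to the first
# quote, angle bracket, or whitespace character.
found=$(printf '%s\n' "$html" | grep -Eo "https?://[^\"'<>[:space:]]+")
echo "$found"
```

This still treats the HTML as plain text, so it will also pick up URLs that appear in comments or scripts.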

answered Sep 29 '22 10:09

Johnsyweb
