RegEx filter links from a document

I am currently learning regex and I am trying to extract all links (e.g. http://www.link.com/folder/file.html) from a document with Notepad++. What I actually want is to delete everything else, so that in the end only the http links are listed.

So far I have tried this: http\:\/\/www\.[a-zA-Z0-9\.\/\-]+

This finds all the links, which is fine, but how do I delete the remaining text so that in the end I have a neat list of all links?

If I try to replace it with nothing followed by \1, the link is obviously deleted, but I want the exact opposite: to have everything else deleted.

So it should be something like:

  • find a string of numbers, letters and special characters up to "http"
  • delete what was found
  • keep searching for more numbers, letters and special characters after "html"
  • delete that as well

Any ideas? Thanks so much.

Phillip asked Dec 15 '22 04:12


1 Answer

In Notepad++, in the Replace dialog (Ctrl+H), you can do the following:

  • Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
  • Replace: $1\n
  • Options: select "Regular expression" as the search mode and check ". matches newline"

This will leave you with a list of all your links. There are two issues, though:

  1. The regex you provided for matching URLs is far from generic enough to match any URL. If it works in your case, that's fine; otherwise, check this question.
  2. It will leave the text after the last matched URL intact, because no further URL follows it for the pattern to match. You have to delete that text manually.
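
For readers who want to try the same idea outside Notepad++, here is a minimal sketch in Python using the standard `re` module. It is not part of the original answer: the sample text is made up for illustration, and `re.DOTALL` is assumed to play the role of the ". matches newline" option. It reproduces the find-and-replace above (including the leftover text from issue 2) and then shows that collecting the matches directly sidesteps that leftover.

```python
import re

# The URL pattern from the question, unchanged.
URL_PATTERN = r"http\:\/\/www\.[a-zA-Z0-9\.\/\-]+"

# Hypothetical sample text, made up for illustration only.
sample = (
    "Intro text http://www.link.com/folder/file.html more text\n"
    "second line http://www.example.org/a-b/c.html trailing words"
)

# Same idea as the Notepad++ replace: lazily skip everything up to each
# URL, then keep only the captured URL followed by a newline.
# re.DOTALL lets "." match newlines as well.
replaced = re.sub(r".*?(" + URL_PATTERN + r")", r"\1\n", sample, flags=re.DOTALL)

# Alternative: collect the matches directly, so nothing is left over
# after the last URL.
links = re.findall(URL_PATTERN, sample)

print(repr(replaced))
print("\n".join(links))
```

In this sketch `replaced` still ends with " trailing words", which is the text after the last URL described in issue 2, while `links` contains only the two URLs.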
psxls answered Dec 24 '22 18:12