Extracting URLs from an Emacs buffer?

Question

How can I write an Emacs Lisp function to find all hrefs in an HTML file and extract all of the links?

Input:

<html>
 <a href="http://www.stackoverflow.com" _target="_blank">StackOverFlow</a>
 <h1>Emacs Lisp</h1>
 <a href="http://news.ycombinator.com" _target="_blank">Hacker News</a>
</html>

Output:

http://www.stackoverflow.com|StackOverFlow
http://news.ycombinator.com|Hacker News

I've seen the re-search-forward function mentioned several times during my search. Here's what I think I need to do, based on what I've read so far.

(defun extra-urls (file)
 ...
 (setq buffer (...
 (while
        (re-search-forward "http://" nil t)
        (when (match-string 0)
...
))
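As a minimal sketch of how re-search-forward and match-string fit together (the helper name my/first-href and the sample URL in the usage line are made up for illustration): the search moves point past the match and records the match data, and match-string then reads a numbered group from that data.

```elisp
(defun my/first-href (html)
  "Return the first href value in the string HTML, or nil if none.
Hypothetical helper for illustration only."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    ;; The third argument (NOERROR) is t, so a failed search returns
    ;; nil instead of signaling an error.
    (when (re-search-forward "href=\"\\([^\"]+\\)\"" nil t)
      ;; Group 0 is the whole match; group 1 is the text captured by
      ;; the first parenthesized group in the regexp.
      (match-string 1))))
```

For example, (my/first-href "<a href=\"http://a.example/\">x</a>") evaluates to "http://a.example/". Wrapping the search in a while loop, as in the outline above, visits every match in turn.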

Admin · Accepted Answer

I took Heinzi's solution and adapted it into what I needed. I can now take a list of files, extract all URLs and titles, and place the results in one output buffer.

(defun extract-urls (fname)
  "Extract href URLs and titles from the HTML file FNAME.
Append them to the buffer \"new-urls.csv\" in | separated format."
  (let ((urls '()))
    (with-temp-buffer                   ; Nothing to clean up afterwards
      (insert-file-contents fname)
      (goto-char (point-min))
      (while (re-search-forward
              "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>" nil t)
        ;; Group 1 is the URL, group 2 is the title
        (push (concat (match-string 1) "|" (match-string 2) "\n") urls)))
    (with-current-buffer (get-buffer-create "new-urls.csv") ; Send results to new buffer
      (goto-char (point-max))
      (mapc #'insert (nreverse urls)))  ; nreverse restores document order
    (switch-to-buffer "new-urls.csv"))) ; Finally, show the new buffer

;; Create a list of files to process
;;
(mapc #'extract-urls '("/tmp/foo.html"
                       "/tmp/bar.html"))
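Because the regexp above starts with ^.*, it finds at most one link per line. A possible variant, sketched here under the assumption that the HTML is already available as a string, drops that prefix and returns the results as a list instead of writing to a buffer. The names my/links-in-string and my/links-as-lines are hypothetical, and regexp matching remains only a rough approximation of HTML parsing.

```elisp
(defun my/links-in-string (html)
  "Return a list of (URL . TITLE) pairs for the anchors in the string HTML.
Hypothetical helper for illustration only."
  (let ((links '()))
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      ;; No leading ^.* here, so the search resumes right after each
      ;; </a> and can find further links on the same line.
      (while (re-search-forward
              "<a href=\"\\([^\"]+\\)\"[^>]*>\\([^<]+\\)</a>" nil t)
        (push (cons (match-string 1) (match-string 2)) links)))
    ;; push builds the list backwards; restore document order.
    (nreverse links)))

(defun my/links-as-lines (html)
  "Return the links in HTML as url|title lines joined by newlines."
  (mapconcat (lambda (link) (concat (car link) "|" (cdr link)))
             (my/links-in-string html)
             "\n"))
```

Returning a list keeps the extraction separate from the output format, so the same pairs could be inserted into a buffer or written elsewhere.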

Heinzi · Answer

If there is at most one link per line and you don't mind some very ugly regular expression hacking, run the following code on your buffer:

(defun getlinks ()
  (beginning-of-buffer)
  (replace-regexp "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>.*$" "LINK:\\1|\\2")
  (beginning-of-buffer)
  (replace-regexp "^\\([^L]\\|\\(L[^I]\\)\\|\\(LI[^N]\\)\\|\\(LIN[^K]\\)\\).*$" "")
  (beginning-of-buffer)
  (replace-regexp "
+" "
")
  (beginning-of-buffer)
  (replace-regexp "^LINK:\\(.*\\)$" "\\1")
)

It replaces all links with LINK:url|description, deletes all lines containing anything else, deletes empty lines, and finally removes the "LINK:".

Detailed HOWTO: (1) Correct the bug in your example HTML file by replacing <href with <a href, (2) copy the above function into the Emacs *scratch* buffer, (3) hit C-x C-e after the final ")" to evaluate the definition, (4) open your example HTML file, (5) execute the function with M-: (getlinks).

Note that the linebreaks in the third replace-regexp are important. Don't indent those two lines.
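The same four steps can be sketched without literal newlines in the source, using re-search-forward with replace-match, and keep-lines to drop the non-link lines. This is an illustrative variant rather than the code above: the name my/getlinks-in-string is hypothetical, it works on a string in a temporary buffer instead of the current buffer, and it keeps the one-link-per-line assumption.

```elisp
(defun my/getlinks-in-string (html)
  "Return the string HTML reduced to url|description lines.
Hypothetical helper; assumes at most one link per line."
  (with-temp-buffer
    (insert html)
    ;; Step 1: rewrite each line holding a link as LINK:url|description.
    (goto-char (point-min))
    (while (re-search-forward
            "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>.*$" nil t)
      ;; FIXEDCASE is t so the replacement is inserted exactly as written.
      (replace-match "LINK:\\1|\\2" t))
    ;; Steps 2 and 3: delete every line that is not a LINK: line,
    ;; which removes the empty lines as well.
    (keep-lines "^LINK:" (point-min) (point-max))
    ;; Step 4: strip the LINK: marker.
    (goto-char (point-min))
    (while (re-search-forward "^LINK:" nil t)
      (replace-match "" t))
    (buffer-string)))
```

Working in a temporary buffer leaves the original HTML buffer unmodified, which the in-place version does not.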

Tags:

elisp
