
Processing text with elisp

Since I've converted to the Church of Emacs, I've been trying to do everything from inside it, and I was wondering how to do some text processing quickly and efficiently with it.

As an example, let's take this list that I was editing a few minutes ago in org-mode.

** Diego: b QI
** bruno-gil: b QI
** Koma: jo
** um: rsrs pr0n
** FelipeAugusto: esp
** GustavoPupo: pinto, tr etc
** GP: lit gtk
** Alan: jo mil pc
** Jost: b hq jo 1997
** Herbert: b rsrs pr0n
** Andre: maia mil pseudo
** Rodrigo: c
** caue: b rsrs 7arte pseudo
** kenny: cri gif
** daniel: gtk mu pr0n rsrs b
** tony: an 1997 esp
** Vitor: b jo mimimi
** raphael: b rpg 7arte
** Luca: b lit gnu pc prog mmu 7arte 1997
** LZZ: an qt
** William: b an jo pc 1997
** Epic: gtk
** Aldo: b pseudo pol mil fur
** GustavoKyon: an gtk
** CarlosIsaksen : an hq jo 7arte gtk 1997
** Peter: pseudo pol mil est 1997 gtk lit lang
** leandro: b jo cb
** frederico: 7arte lit gtk
** rol: b an pseudo mimimi 7arte
** mathias: jo lit
** henrique: 1997 h gtk qt
** eumané: an qt
** walrus: cri de
** FilipePinheiro: lit pseudo
** Igor: pseudo b
** Erick: b jo rpg q 1997 gtk
** Gabriel: pr0n rsrs qt
** george: clo mimimi
** anão: hq jo 1997 rsrs clô b
** jeff: 7arte gtk
** davidatenas:  an 7arte 1997 esp qt
** HHahaah: b 
** Eduardo: b

It is a list of names associated with tags, and I want to get a list of tags associated with names.

In bash, I would first echo the whole pasted list in single quotes and pipe it to awk, looping over each line, appending the name to an array entry for each of its tags, and then tweaking the result until it looks the way I want.

echo '** Diego: b QI
** bruno-gil: b QI
** Koma: jo
** um: rsrs pr0n
** FelipeAugusto: esp
** GustavoPupo: pinto, tr etc
** GP: lit gtk
** Alan: jo mil pc
** Jost: b hq jo 1997
** Herbert: b rsrs pr0n
** Andre: maia mil pseudo
** Rodrigo: c
** caue: b rsrs 7arte pseudo
** kenny: cri gif
** daniel: gtk mu pr0n rsrs b
** tony: an 1997 esp
** Vitor: b jo mimimi
** raphael: b rpg 7arte
** Luca: b lit gnu pc prog mmu 7arte 1997
** LZZ: an qt
** William: b an jo pc 1997
** Epic: gtk
** Aldo: b pseudo pol mil fur
** GustavoKyon: an gtk
** CarlosIsaksen : an hq jo 7arte gtk 1997
** Peter: pseudo pol mil est 1997 gtk lit lang
** leandro: b jo cb
** frederico: 7arte lit gtk
** rol: b an pseudo mimimi 7arte
** mathias: jo lit
** henrique: 1997 h gtk qt
** eumané: an qt
** walrus: cri de
** FilipePinheiro: lit pseudo
** Igor: pseudo b
** Erick: b jo rpg q 1997 gtk
** Gabriel: pr0n rsrs qt
** george: clo mimimi
** anão: hq jo 1997 rsrs clô b
** jeff: 7arte gtk
** davidatenas:  an 7arte 1997 esp qt
** HHahaah: b
** Eduardo: b
' | awk '{sub(":","");for (i=3;i<=NF;i++) members[$i] = members[$i] " " $2}; END{for (j in members) print j ": " members[j]}' | sort

... and TA-DA! The expected output in less than 2 minutes, done in an intuitive and incremental way. Can you show me how to do something like this in elisp, preferably in an emacs buffer, with elegance and simplicity?

Thanks!

konr asked Jan 03 '10 05:01


3 Answers

The first thing I would do is to take advantage of org-mode's tag support. Instead of

** Diego: b QI

You would have

** Diego                          :b:QI:

Which org-mode recognizes as the tags "b" and "QI".

To transform your current format to the standard org-mode format, you can use the following (assuming the buffer with your source is called "asdf")

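(with-current-buffer "asdf"
  ;; Each pass starts from the top of the buffer; `goto-char' is the
  ;; non-interactive equivalent of `beginning-of-buffer'.
  (goto-char (point-min))
  (replace-string " " ":")       ; "**:Diego::b:QI"
  (goto-char (point-min))
  (replace-string "**:" "** ")   ; "** Diego::b:QI"
  (goto-char (point-min))
  (replace-string "::" " :")     ; "** Diego :b:QI"
  (goto-char (point-min))
  (replace-string "\n" ":\n")    ; "** Diego :b:QI:"
  ;; Realign the tags; newer Org releases may expect different arguments here.
  (org-set-tags-command t t))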

It's not pretty or efficient, but it gets the job done.
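
For lines with stray spaces, such as "** CarlosIsaksen : an hq" or a tag list with a trailing blank, a single-pass variant may be easier to adapt. The following is only a sketch and not part of the conversion above: the names my/name-tags-to-org-line and my/convert-buffer-to-org-tags are made up for illustration, dropping commas from tags is an assumption about what is wanted, and it does not realign the tags the way org-mode would.

```elisp
(defun my/name-tags-to-org-line (line)
  "Convert \"** NAME: tag1 tag2\" into \"** NAME :tag1:tag2:\".
Return LINE unchanged when it does not look like that."
  (if (string-match "\\`\\(\\*+\\) +\\([^: ]+\\) *: *\\([^:]*\\)\\'" line)
      (let ((stars (match-string 1 line))
            (name (match-string 2 line))
            ;; Assumption: commas and blanks both separate tags.
            (tags (split-string (match-string 3 line) "[ ,]+" t)))
        (if tags
            (format "%s %s :%s:" stars name (mapconcat #'identity tags ":"))
          (format "%s %s" stars name)))
    line))

(defun my/convert-buffer-to-org-tags ()
  "Rewrite every matching line of the current buffer in place."
  (save-excursion
    (goto-char (point-min))
    (while (not (eobp))
      (let* ((bol (line-beginning-position))
             (eol (line-end-position))
             (old (buffer-substring-no-properties bol eol))
             (new (my/name-tags-to-org-line old)))
        ;; Only touch lines that actually change.
        (unless (string= old new)
          (delete-region bol eol)
          (insert new)))
      (forward-line 1))))
```

Evaluating (with-current-buffer "asdf" (my/convert-buffer-to-org-tags)) would apply it to the source buffer. Lines that are already in the ":tag:tag:" form do not match the pattern, so running it twice should leave them alone.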

After that, you can use the following to produce a buffer that has the format you wanted from the shell script:

(let ((results (get-buffer-create "results"))
      tags)
  (with-current-buffer "asdf"
    (goto-char (point-min))
    ;; Walk the headings and collect each distinct tag once.
    (while (org-at-heading-p)
      ;; Newer Org releases mark `org-get-local-tags' obsolete in favour
      ;; of (org-get-tags nil t).
      (dolist (item (org-get-local-tags))
        (when (and item (not (member item tags)))
          (push item tags)))
      (outline-next-visible-heading 1)))
  (setq tags (sort tags #'string<))
  (with-current-buffer results
    (erase-buffer)
    ;; For each tag, list the headings that carry it.
    (dolist (item tags)
      (insert (format "%s: %s\n"
                      item
                      (mapconcat #'identity
                                 (with-current-buffer "asdf"
                                   (org-map-entries
                                    '(substring-no-properties (org-get-heading t))
                                    item))
                                 " "))))))

This puts the results in a buffer called "results", creating it if it doesn't already exist. Basically, it collects all the tags in the buffer "asdf", sorts them, then loops through each tag, searching "asdf" for every headline with that tag and inserting those headlines into "results".

With a bit of cleaning up, this could be made into a function; basically just replacing "asdf" and "results" with arguments. If you need that done, I can do that.

haxney answered Sep 30 '22 18:09


There is a function shell-command-on-region that pretty much does what it says. You can highlight a region, do M-|, type the name of a shell command, and the data is piped to that command. Give it a prefix argument and the region is replaced with the result of the command.

For a trivial example, highlight a region, type 'C-u 0 M-| wc' (control-u, zero, meta-pipe and then 'wc') and the region will be replaced with the number of lines, words and bytes in that region.
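
The same command can also be called from Lisp. A minimal sketch, assuming a Unix-like shell is available; my/filter-region-through is a made-up wrapper name, while shell-command-on-region and its START END COMMAND OUTPUT-BUFFER REPLACE arguments are the real function:

```elisp
(defun my/filter-region-through (command beg end)
  "Replace the text between BEG and END with the output of shell COMMAND."
  ;; Arguments are START END COMMAND OUTPUT-BUFFER REPLACE.
  ;; A non-nil REPLACE is what the prefix argument requests interactively.
  (shell-command-on-region beg end command nil t))
```

With the list selected, something like (my/filter-region-through "sort" (region-beginning) (region-end)) should sort it in place, and the awk one-liner from the question could be passed as the command string in the same way.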

Another thing you can do is figure out how to manipulate one line, make it a macro, and then run the macro repeatedly. For example, 'C-x ( C-s foo RET bar C-x )' will search for the word "foo", then type the word "bar", changing it to "foobar". You can then do 'C-u 0 C-x e' once, which will keep running the macro until it doesn't find any more occurrences of "foo".
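
If you would rather write that loop down than record it, a rough Lisp equivalent of the macro might look like the sketch below. The function name my/append-after-each is invented for this example; search-forward and insert are the standard primitives.

```elisp
(defun my/append-after-each (needle addition)
  "Insert ADDITION after every occurrence of NEEDLE from point onwards.
NEEDLE must be a non-empty string.  Return the number of insertions."
  (when (string= needle "")
    (error "NEEDLE must not be empty"))
  (let ((count 0))
    ;; With a non-nil third argument, `search-forward' returns nil
    ;; instead of signalling an error once nothing more is found.
    (while (search-forward needle nil t)
      (insert addition)
      (setq count (1+ count)))
    count))
```

Calling (my/append-after-each "foo" "bar") with point at the start of the buffer turns each "foo" into "foobar", much like running the macro until the search fails.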

Bryan Oakley answered Sep 30 '22 16:09


Ok, here is my first attempt in elisp:

  1. I start a buffer with elisp and paredit modes on, open double quotes and paste the text
  2. I bind it to a symbol using let
(let ((foobar "** Diego: b QI
** bruno-gil: b QI
** Koma: jo
** um: rsrs pr0n
** FelipeAugusto: esp
** GustavoPupo: pinto, tr etc
** GP: lit gtk
** Alan: jo mil pc
** Jost: b hq jo 1997
** Herbert: b rsrs pr0n
** Andre: maia mil pseudo
** Rodrigo: c
** caue: b rsrs 7arte pseudo
** kenny: cri gif
** daniel: gtk mu pr0n rsrs b
** tony: an 1997 esp
** Vitor: b jo mimimi
** raphael: b rpg 7arte
** Luca: b lit gnu pc prog mmu 7arte 1997
** LZZ: an qt
** William: b an jo pc 1997
** Epic: gtk
** Aldo: b pseudo pol mil fur
** GustavoKyon: an gtk
** CarlosIsaksen : an hq jo 7arte gtk 1997
** Peter: pseudo pol mil est 1997 gtk lit lang
** leandro: b jo cb
** frederico: 7arte lit gtk
** rol: b an pseudo mimimi 7arte
** mathias: jo lit
** henrique: 1997 h gtk qt
** eumané: an qt
** walrus: cri de
** FilipePinheiro: lit pseudo
** Igor: pseudo b
** Erick: b jo rpg q 1997 gtk
** Gabriel: pr0n rsrs qt
** george: clo mimimi
** anão: hq jo 1997 rsrs clô b
** jeff: 7arte gtk
** davidatenas:  an 7arte 1997 esp qt
** HHahaah: b 
** Eduardo: b 
"))
  foobar)

Now I change foobar to something fancy.

  1. First I remove the symbols with a regexp and split the text in strings using (split-string)
  2. Then I do a mapcar to turn each line into a list of words
(mapcar #'(lambda (y) (split-string y " " t)) (split-string (replace-regexp-in-string "[:\*]" "" foobar) "\n" t))
  1. Then I create a hashmap and bind it to temphash ((temphash (make-hash-table :test 'equal)))
  2. And then I loop into the nested lists to add the elements to the hash-table. I think I'm not supposed to do non-functional programming with mapcar, but nobody is looking ;)
(mapcar #'(lambda (l)
              ;; (car l) is the name, (cdr l) its tags; `cdr' avoids needing the cl package for `rest'.
              (mapcar #'(lambda (m) (puthash m (format "%s %s" (car l) (let ((tempel (gethash m temphash)))
                                                            (if tempel tempel ""))) temphash)) (cdr l)))
          (mapcar #'(lambda (y) (split-string y " " t)) (split-string (replace-regexp-in-string "[:\*]" "" foobar) "\n" t)))
  1. Next, I extract the elements from the hash table into another set of nested lists with a handy function stolen from Xah Lee's webpage.
  2. And finally I pretty-print it to another buffer with M-x pp-eval-last-sexp.

It's a little mind-bending, especially the double mapcar, but it sorta works. Here is the full "code":

;; Stolen from Xah Lee's page


(defun hash-to-list (hashtable)
  "Return a list that represents the hashtable."
  (let (mylist)
    (maphash (lambda (kk vv) (setq mylist (cons (list kk vv) mylist))) hashtable)
    mylist))

;; Code

(let ((foobar "** Diego: b QI
** bruno-gil: b QI
** Koma: jo
** um: rsrs pr0n
** FelipeAugusto: esp
** GustavoPupo: pinto, tr etc
** GP: lit gtk
** Alan: jo mil pc
** Jost: b hq jo 1997
** Herbert: b rsrs pr0n
** Andre: maia mil pseudo
** Rodrigo: c
** caue: b rsrs 7arte pseudo
** kenny: cri gif
** daniel: gtk mu pr0n rsrs b
** tony: an 1997 esp
** Vitor: b jo mimimi
** raphael: b rpg 7arte
** Luca: b lit gnu pc prog mmu 7arte 1997
** LZZ: an qt
** William: b an jo pc 1997
** Epic: gtk
** Aldo: b pseudo pol mil fur
** GustavoKyon: an gtk
** CarlosIsaksen : an hq jo 7arte gtk 1997
** Peter: pseudo pol mil est 1997 gtk lit lang
** leandro: b jo cb
** frederico: 7arte lit gtk
** rol: b an pseudo mimimi 7arte
** mathias: jo lit
** henrique: 1997 h gtk qt
** eumané: an qt
** walrus: cri de
** FilipePinheiro: lit pseudo
** Igor: pseudo b
** Erick: b jo rpg q 1997 gtk
** Gabriel: pr0n rsrs qt
** george: clo mimimi
** anão: hq jo 1997 rsrs clô b
** jeff: 7arte gtk
** davidatenas:  an 7arte 1997 esp qt
** HHahaah: b 
** Eduardo: b 
")
      (temphash  (make-hash-table :test 'equal)))
  ;; For every (NAME TAG...) line, prepend NAME to the string stored under each TAG.
  (mapcar #'(lambda (l)
              (mapcar #'(lambda (m) (puthash m (format "%s %s" (car l) (let ((tempel (gethash m temphash)))
                                                            (if tempel tempel ""))) temphash)) (cdr l)))
          (mapcar #'(lambda (y) (split-string y " " t)) (split-string (replace-regexp-in-string "[:\*]" "" foobar) "\n" t)))
  (hash-to-list temphash))

And here is the output:

(("clô" "anão ")
 ("clo" "george ")
 ("q" "Erick ")
 ("de" "walrus ")
 ("h" "henrique ")
 ("cb" "leandro ")
 ("lang" "Peter ")
 ("est" "Peter ")
 ("fur" "Aldo ")
 ("pol" "Peter Aldo ")
 ("qt" "davidatenas Gabriel eumané henrique LZZ ")
 ("mmu" "Luca ")
 ("prog" "Luca ")
 ("gnu" "Luca ")
 ("rpg" "Erick raphael ")
 ("mimimi" "george rol Vitor ")
 ("an" "davidatenas eumané rol CarlosIsaksen GustavoKyon William LZZ tony ")
 ("mu" "daniel ")
 ("gif" "kenny ")
 ("cri" "walrus kenny ")
 ("7arte" "davidatenas jeff rol frederico CarlosIsaksen Luca raphael caue ")
 ("c" "Rodrigo ")
 ("pseudo" "Igor FilipePinheiro rol Peter Aldo caue Andre ")
 ("maia" "Andre ")
 ("1997" "davidatenas anão Erick henrique Peter CarlosIsaksen William Luca tony Jost ")
 ("hq" "anão CarlosIsaksen Jost ")
 ("pc" "William Luca Alan ")
 ("mil" "Peter Aldo Andre Alan ")
 ("gtk" "jeff Erick henrique frederico Peter CarlosIsaksen GustavoKyon Epic daniel GP ")
 ("lit" "FilipePinheiro mathias frederico Peter Luca GP ")
 ("etc" "GustavoPupo ")
 ("tr" "GustavoPupo ")
 ("pinto," "GustavoPupo ")
 ("esp" "davidatenas tony FelipeAugusto ")
 ("pr0n" "Gabriel daniel Herbert um ")
 ("rsrs" "anão Gabriel daniel caue Herbert um ")
 ("jo" "anão Erick mathias leandro CarlosIsaksen William Vitor Jost Alan Koma ")
 ("QI" "bruno-gil Diego ")
 ("b" "Eduardo HHahaah anão Erick Igor rol leandro Aldo William Luca raphael Vitor daniel caue Herbert Jost bruno-gil Diego "))
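
For comparison, here is a sketch of the same inversion using dolist instead of the nested mapcar, with the tags sorted and the names kept in input order. It is an alternative arrangement rather than the code above; my/invert-name-tags and my/format-inverted are made-up names, and the input is assumed to be lines of the form "** NAME: tag1 tag2".

```elisp
(defun my/invert-name-tags (text)
  "Return an alist of (TAG NAME...) built from TEXT, sorted by TAG.
TEXT holds lines like \"** NAME: tag1 tag2\"."
  (let ((table (make-hash-table :test 'equal))
        result)
    (dolist (line (split-string text "\n" t))
      ;; Drop the stars and the colon, then split into (NAME TAG...).
      (let* ((words (split-string
                     (replace-regexp-in-string "[:*]" "" line) " " t))
             (name (car words)))
        (dolist (tag (cdr words))
          (puthash tag (cons name (gethash tag table)) table))))
    ;; Names were consed on in reverse, so flip them back.
    (maphash (lambda (tag names)
               (push (cons tag (reverse names)) result))
             table)
    (sort result (lambda (a b) (string< (car a) (car b))))))

(defun my/format-inverted (alist)
  "Render ALIST from `my/invert-name-tags' as \"TAG: NAME NAME\" lines."
  (mapconcat (lambda (entry)
               (format "%s: %s"
                       (car entry)
                       (mapconcat #'identity (cdr entry) " ")))
             alist
             "\n"))
```

Something like (insert (my/format-inverted (my/invert-name-tags foobar))) would then write one "tag: names" line per tag into the current buffer.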

konr answered Sep 30 '22 17:09