
Bash: Split text-file into words with non-alphanumeric characters as delimiters

Let's say "textfile" contains the following:

lorem$ipsum-is9simply the.dummy text%of-printing

and that you want to print each word on a separate line. However, words should be defined not only by spaces, but by all non-alphanumeric characters. So the results should look like:

 lorem
 ipsum  
 is9simply  
 the  
 dummy  
 text  
 of  
 printing

How can I accomplish this using the Bash shell?



Some notes:

  • This is not a homework question.

  • The simpler case, where words are delimited only by spaces, is easy. Just writing:

    for i in $(cat textfile); do echo "$i"; done
    

    will do the trick, and return:

     lorem$ipsum-is9simply
     the.dummy
     text%of-printing
    

    For splitting on non-alphanumeric characters, I have seen solutions that use the IFS shell variable (links below), but I would like to avoid IFS for two reasons: 1) it would require (I think) setting IFS to a long list of non-alphanumeric characters; 2) I find it kind of ugly.

  • Here are the two related Q&As I found
    How do I split a string on a delimiter in Bash?
    How to split a line into words separated by one or more spaces in bash?
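For reference, the space-only loop from the notes relies on word splitting of an unquoted command substitution, which also triggers pathname expansion (a lone `*` in the file would be replaced by file names). The following is a minimal sketch of the same space-only split using `read -a` instead; the function name `print_space_words` is hypothetical and chosen only for illustration.

```bash
#!/usr/bin/env bash
# print_space_words: hypothetical helper, not part of the original question.
# Reads stdin and prints each whitespace-delimited word on its own line.
print_space_words() {
  local -a words
  # -r keeps backslashes literal; -a splits each line on the default IFS
  # (space, tab, newline) into the array "words" without glob expansion.
  while read -r -a words; do
    if (( ${#words[@]} > 0 )); then
      printf '%s\n' "${words[@]}"
    fi
  done
}

# Example run on the sample line from the question.
printf '%s\n' 'lorem$ipsum-is9simply the.dummy text%of-printing' | print_space_words
```

This still splits on spaces only, so it reproduces the three-line result shown above rather than the desired eight words.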

Sv1 asked Sep 24 '10 22:09


2 Answers

Use the tr command:

tr -cs 'a-zA-Z0-9' '\n' <textfile

The '-c' is for the complement of the specified characters; the '-s' squeezes out duplicates of the replacements; the 'a-zA-Z0-9' is the set of alphanumeric characters (maybe add _ too?); the '\n' is the replacement character (newline). You could also use a character class which is locale sensitive (and may include more characters than the list above):

tr -cs '[:alnum:]' '\n' <textfile
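As a rough sketch of how this behaves on the sample line, the snippet below wraps the `tr` call in a function. The name `split_words`, the `LC_ALL=C` setting (to pin `[:alnum:]` to ASCII letters and digits), and the trailing `grep .` (to drop the empty line that appears when the input starts with a delimiter) are assumptions added for illustration, not part of the command above.

```bash
#!/usr/bin/env bash
# split_words: hypothetical helper name used only for this sketch.
# Reads stdin and prints each alphanumeric run on its own line.
split_words() {
  # -c complements the set, so every non-alphanumeric character is translated
  # to a newline; -s squeezes runs of those newlines into one.
  # LC_ALL=C (an assumption) restricts [:alnum:] to ASCII letters and digits.
  # grep . keeps only non-empty lines, which removes the blank first line
  # produced when the input begins with a delimiter.
  LC_ALL=C tr -cs '[:alnum:]' '\n' | grep .
}

# Example run on the sample line from the question.
printf '%s\n' 'lorem$ipsum-is9simply the.dummy text%of-printing' | split_words
```

Dropping `LC_ALL=C` lets the current locale decide what counts as alphanumeric, which may be preferable for non-ASCII text.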
Jonathan Leffler answered Nov 06 '22 08:11


$ awk -f splitter.awk < textfile

$ cat splitter.awk
{
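  # "+" treats a run of adjacent delimiters as a single separator
  count0 = split($0, asplit, "[^a-zA-Z0-9]+")
  # skip the empty fields left by a leading or trailing delimiter
  for(i = 1; i <= count0; ++i) { if (asplit[i] != "") print asplit[i] }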
}
DigitalRoss answered Nov 06 '22 06:11