Bash: Split text-file into words with non-alphanumeric characters as delimiters

Question

Let's say "textfile" contains the following:

lorem$ipsum-is9simply the.dummy text%of-printing

and that you want to print each word on a separate line. However, words should be defined not only by spaces, but by all non-alphanumeric characters. So the results should look like:

 lorem
 ipsum  
 is9simply  
 the  
 dummy  
 text  
 of  
 printing

How can I accomplish this using the Bash shell?

Some notes:

This is not a homework question.
The simpler case, where words are delimited only by whitespace, is easy. Just writing:
```
for i in `cat textfile`; do echo $i; done;
```
will do the trick, and return:
```
 lorem$ipsum-is9simply
 the.dummy
 text%of-printing
```
For splitting words on non-alphanumeric characters, I have seen solutions that use the IFS environment variable (links below), but I would like to avoid IFS for two reasons: 1) it would (I think) require setting IFS to a long list of non-alphanumeric characters, and 2) I find it kind of ugly.
Here are the two related Q&As I found:
How do I split a string on a delimiter in Bash?
How to split a line into words separated by one or more spaces in bash?

Jonathan Leffler · Accepted Answer

Use the tr command:

tr -cs 'a-zA-Z0-9' '\n' <textfile

The '-c' takes the complement of the specified characters; the '-s' squeezes runs of the replacement character into one; 'a-zA-Z0-9' is the set of alphanumeric characters (maybe add _ too?); and '\n' is the replacement character (newline). You could also use a character class, which is locale-sensitive (and may include more characters than the list above):

tr -cs '[:alnum:]' '\n' <textfile
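As a rough illustration, the tr approach can be wrapped in a small function and run on the sample line from the question. The name split_words is a hypothetical helper introduced only for this sketch; it reads from standard input rather than from textfile.

```shell
#!/usr/bin/env bash
# split_words is a hypothetical helper name, not part of the answer above.
# It reads text on stdin and prints one alphanumeric word per line.
split_words() {
  # -c: use the complement of the set (everything that is NOT alphanumeric)
  # -s: squeeze each run of replacement newlines into a single newline
  tr -cs 'a-zA-Z0-9' '\n'
}

# Sample line from the question
printf '%s\n' 'lorem$ipsum-is9simply the.dummy text%of-printing' | split_words
```

With the file from the question, the equivalent call would be split_words <textfile.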

DigitalRoss · Answer

$ awk -f splitter.awk < textfile

$ cat splitter.awk
{
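  # Split the line on runs of non-alphanumeric characters
  count0 = split($0, asplit, "[^a-zA-Z0-9]+")
  # Print each non-empty word on its own line
  for(i = 1; i <= count0; ++i) { if (asplit[i] != "") print asplit[i] }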
}
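The same idea can be sketched inline, without a separate script file. This version assumes that runs of delimiters should count as one separator and that empty fields (from leading or trailing delimiters) should be skipped; split_words_awk is a hypothetical name used only here.

```shell
#!/usr/bin/env bash
# split_words_awk is a hypothetical helper name used only for this sketch.
split_words_awk() {
  awk '{
    # Split on runs of non-alphanumeric characters
    n = split($0, words, "[^a-zA-Z0-9]+")
    for (i = 1; i <= n; ++i)
      if (words[i] != "") print words[i]   # skip empty leading/trailing fields
  }'
}

# Leading, trailing, and consecutive delimiters produce no blank lines
printf '%s\n' '--lorem$$ipsum  dolor.' | split_words_awk
```

The check for an empty string matters because split() returns an empty first or last field when the line begins or ends with a delimiter.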
