Let's say "textfile" contains the following:
lorem$ipsum-is9simply the.dummy text%of-printing
and that you want to print each word on a separate line. However, words should be delimited not only by spaces but by any non-alphanumeric character, so the result should look like:
lorem
ipsum
is9simply
the
dummy
text
of
printing
How can I accomplish this using the Bash shell?
Some notes:
This is not a homework question.
The simpler case, where words are delimited only by whitespace, is easy. Just writing:
for i in $(cat textfile); do echo "$i"; done
does the trick and returns:
lorem$ipsum-is9simply
the.dummy
text%of-printing
For splitting words on non-alphanumeric characters, I have seen solutions that use the IFS shell variable (links below), but I would like to avoid IFS for two reasons: 1) it would require (I think) setting IFS to a long list of non-alphanumeric characters; 2) I find it kind of ugly.
Here are the two related Q&As I found:
How do I split a string on a delimiter in Bash?
How to split a line into words separated by one or more spaces in bash?
Use the tr command:
tr -cs 'a-zA-Z0-9' '\n' <textfile
The '-c' is for the complement of the specified characters; the '-s' squeezes out duplicates of the replacements; the 'a-zA-Z0-9' is the set of alphanumeric characters (maybe add _ too?); the '\n' is the replacement character (newline). You could also use a character class which is locale sensitive (and may include more characters than the list above):
tr -cs '[:alnum:]' '\n' <textfile
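To try the tr approach without creating a file, here is a minimal sketch that wraps the command in a small function. The name split_words is made up for this illustration; the sample line is the one from the question.

```shell
# split_words is a hypothetical helper name, not a standard command.
# It reads text on stdin and writes one alphanumeric word per line.
split_words() {
  # -c: act on every character NOT in the set a-zA-Z0-9
  # -s: squeeze each run of replaced characters into a single newline
  tr -cs 'a-zA-Z0-9' '\n'
}

# Feed the sample line from the question through the helper.
printf '%s\n' 'lorem$ipsum-is9simply the.dummy text%of-printing' | split_words
```

One thing to watch for: if the input starts with a non-alphanumeric character, the first output line will be empty, because that leading run is also turned into a newline.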
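Another option is awk:
$ awk -f splitter.awk < textfile
$ cat splitter.awk
{
    # Split the line on each non-alphanumeric character; pieces go into asplit
    count0 = split($0, asplit, "[^a-zA-Z0-9]")
    # Print every piece on its own line
    for (i = 1; i <= count0; ++i) { print asplit[i] }
}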
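The awk script splits on each single non-alphanumeric character, so two adjacent separators (for example "a--b") would produce an empty output line. A possible variation, sketched here as an inline function with the made-up name split_words_awk, splits on runs of separators and skips any empty pieces left at the start or end of a line:

```shell
# split_words_awk is a hypothetical helper name used only for this sketch.
split_words_awk() {
  awk '{
    # Split on runs of one or more non-alphanumeric characters
    n = split($0, parts, /[^a-zA-Z0-9]+/)
    for (i = 1; i <= n; i++)
      # Leading or trailing separators leave empty pieces; skip them
      if (parts[i] != "") print parts[i]
  }'
}

# Adjacent, leading, and trailing separators are all handled.
printf '%s\n' '$lorem$$ipsum--is9simply.' | split_words_awk
```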