
Removing all special characters from a string in Bash

I have a lot of text in lowercase. The only problem is that it contains many special characters, and I want to remove all of them, along with the numbers.

The following command is not strong enough:

tr -cd '[:alpha:]\n '

For characters such as éćščž and some others, it returns "?", but I want to remove all of them. Is there a stronger command?

I use Linux Mint with Bash 4.3.8(1)-release.

Marta Koprivnik asked Apr 28 '16 23:04



2 Answers

You can use tr to keep only the printable characters in the input. Run the command below on your input file:

tr -cd "[:print:]\n" < file1

The -d flag deletes the characters in the given set from the input stream, and -c complements that set (inverts what is provided). Without -c, the command would delete all printable characters from the input; with -c, it deletes everything except printable characters and newlines. The newline character \n is included in the set to preserve the line endings of the input file. Leaving it out would produce the entire output as one long line.
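The effect of the two flags can be seen on a small string. This is a minimal sketch: the sample text and the variable names are made up for illustration, and the sets use plain ASCII ranges rather than [:print:].

```shell
# Hypothetical sample input: lowercase text mixed with digits and punctuation.
sample='abc 123, def!'

# -d alone deletes every character that is in the set (here: the letters).
only_deleted=$(printf '%s\n' "$sample" | tr -d 'a-z')

# -c complements the set, so -cd deletes everything EXCEPT
# lowercase letters, spaces and newlines.
only_kept=$(printf '%s\n' "$sample" | tr -cd 'a-z \n')

printf '%s\n' "$only_deleted"   # prints " 123, !"
printf '%s\n' "$only_kept"      # prints "abc  def"
```

The second result contains two spaces because the digits and the comma between them were deleted, while both spaces were kept.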

[:print:] is a POSIX character class that covers [:alnum:], [:punct:] and the space character. In the C locale, [:alnum:] is the same as [0-9A-Za-z], and [:punct:] includes the characters ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
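Since [:print:] still includes digits and punctuation, a narrower set may be closer to what the question asks for. The sketch below assumes the input is UTF-8 and forces the C locale with LC_ALL=C so that only ASCII counts as printable or alphabetic; the sample string and variable names are illustrative only.

```shell
# Hypothetical sample containing accented letters, digits and punctuation.
sample='žena 42, čaj!'

# In the C locale only ASCII is printable, so the bytes of the
# accented letters are deleted; digits and punctuation remain.
printable=$(printf '%s\n' "$sample" | LC_ALL=C tr -cd '[:print:]\n')

# Narrowing the set to ASCII letters plus space also drops
# the digits and the punctuation.
letters=$(printf '%s\n' "$sample" | LC_ALL=C tr -cd '[:alpha:] \n')

printf '%s\n' "$printable"   # prints "ena 42, aj!"
printf '%s\n' "$letters"     # prints "ena  aj"
```

Without LC_ALL=C the result can depend on the active locale and on the tr implementation, which is one likely reason the accented characters showed up as "?" in the question.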

Inian answered Sep 24 '22 05:09


I am not exactly certain where the text in your question is coming from, but let's say that the "lot of text in lowercase" is in a file called special.txt. You could do something like the following, which focuses on the characters you want to keep rather than the ones you want to remove:

sed 's/[^a-zA-Z ]//g' special.txt

It is a bit like doing surgery with an axe though.
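To make the axe a little more predictable, the same substitution can be run in the C locale, where the bracket expression is matched byte by byte and the ranges cover ASCII letters only. This is a sketch under that assumption; strip_non_letters is a hypothetical helper name, not something from the answer.

```shell
# strip_non_letters is a hypothetical helper, not part of the original answer.
# It reads the named files, or standard input when no file is given.
strip_non_letters() {
    # Delete every byte that is not an ASCII letter or a space.
    LC_ALL=C sed 's/[^a-zA-Z ]//g' "$@"
}

printf '%s\n' 'kava 2x, čaj!' | strip_non_letters   # prints "kava x aj"
```

It can also be called with a file name, for example strip_non_letters special.txt.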

Another possible solution is in the post "Remove non-ascii characters from ...".
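The title of that post is truncated here, so the following is not taken from it. It is only a sketch of one common way to drop non-ASCII bytes with tr, assuming the C locale and an illustrative sample string.

```shell
# Keep only bytes in the ASCII range (octal 000-177); all other bytes are deleted.
# Illustrative sketch, not the method from the linked post.
ascii_only=$(printf '%s\n' 'žena čaj 42' | LC_ALL=C tr -cd '\000-\177')

printf '%s\n' "$ascii_only"   # prints "ena aj 42"
```

Unlike the letter-only filters, this keeps digits and punctuation, so it removes the accented characters but not the numbers.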

If the above doesn't answer your question, please provide a bit more detail and I might be able to give a more actionable answer.

John Mark Mitchell answered Sep 25 '22 05:09