 

C++ Strip non-ASCII Characters from string

Tags: c++, string, ascii

Before you get started: yes, I know this is a duplicate question, and yes, I have looked at the posted solutions. My problem is that I could not get them to work.

bool invalidChar (char c)
{ 
    return !isprint((unsigned)c); 
}
void stripUnicode(string & str)
{
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end()); 
}

I tested this method on "Prusæus, Ægyptians," and it did nothing. I also tried substituting isalnum for isprint.

The real problem occurs when, in another section of my program, I convert string->wstring->string. The conversion balks if there are Unicode chars in the string->wstring step.

Ref:

How can you strip non-ASCII characters from a string? (in C#)

How to strip all non alphanumeric characters from a string in c++?

Edit:

I would still like to remove all non-ASCII chars regardless, but if it helps, here is where I am crashing:

// Convert to wstring
wchar_t* UnicodeTextBuffer = new wchar_t[ANSIWord.length()+1];
wmemset(UnicodeTextBuffer, 0, ANSIWord.length()+1);
mbstowcs(UnicodeTextBuffer, ANSIWord.c_str(), ANSIWord.length());
wWord = UnicodeTextBuffer; //CRASH

Error Dialog

MSVC++ Debug Library

Debug Assertion Failed!

Program: //myproject

File: f:\dd\vctools\crt_bld\self_x86\crt\src\isctype.c

Line: //Above

Expression: (unsigned)(c + 1) <= 256

Edit:

Further compounding the matter: the .txt file I am reading in from is ANSI encoded. Everything within should be valid.

Solution:

bool invalidChar (char c) 
{  
    return !(c>=0 && c <128);   
} 
void stripUnicode(string & str) 
{ 
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());  
}

If someone else would like to copy/paste this, I can check this question off.

EDIT:

For future reference: try the __isascii and iswascii functions.

AnthonyW asked Apr 16 '12

3 Answers

Solution:

bool invalidChar (char c) 
{  
    return !(c>=0 && c <128);   
} 
void stripUnicode(string & str) 
{ 
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());  
}

EDIT:

For future reference: try using the __isascii, iswascii commands

AnthonyW answered Oct 10 '22

At least one problem is in your invalidChar function. It should be:

return !isprint( static_cast<unsigned char>( c ) );

Casting a char to an unsigned is likely to give some very, very big values if the char is negative (UINT_MAX + 1 + c). Passing such a value to isprint is undefined behavior.

James Kanze answered Oct 10 '22


Another solution that doesn't require defining two functions but uses a lambda (anonymous function), available in C++11 and above:

void stripUnicode(string & str) 
{ 
    str.erase(remove_if(str.begin(),str.end(), [](char c){return !(c>=0 && c <128);}), str.end());  
}

I think it looks cleaner.

Fnr answered Oct 10 '22