
How to find out if there is any non-ASCII character in a string with a file path


I have a UTF-8 encoded string that stores a file path, for instance C:\Users\myUser\Downloads\ü.pdf. I have already checked that the string holds a valid file path in the local file system, but since I'm sending this string to a different process that supports only ASCII, I need to figure out whether it contains any non-ASCII characters.

How can I do that?

FrankS101 asked Jan 11 '18 17:01




2 Answers

An ASCII character uses only the lower 7 bits of a char (values 0-127). A non-ASCII Unicode character encoded in UTF-8 is stored as two or more char elements, all of which have the upper bit set. So you can simply iterate over the char elements and check whether any of them has a value above 127, e.g.:

bool containsOnlyASCII(const std::string& filePath) {
  for (auto c: filePath) {
    if (static_cast<unsigned char>(c) > 127) {
      return false;
    }
  }
  return true;
}

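A note on the cast: std::string contains char elements, and the standard leaves it implementation-defined whether char is signed or unsigned. If it is signed, a byte with the upper bit set reads as a negative value, so comparing c > 127 without the cast would never be true. Converting to unsigned char is well-defined: the value is reduced modulo 2^N, where N is the number of bits in unsigned char, so those bytes map to 128-255.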
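To make the cast concrete, here is a minimal sketch that lists the byte values of a string after conversion to unsigned char. The helper name byteValues is hypothetical (it is not part of the answer), and the test string spells out the UTF-8 encoding of ü as the explicit bytes 0xC3 0xBC so the result does not depend on the source file's encoding.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, for illustration only: returns each byte of s
// as a value in the range 0..255, regardless of whether plain char is
// signed or unsigned on this platform.
std::vector<int> byteValues(const std::string& s)
{
    std::vector<int> values;
    values.reserve(s.size());
    for (char c : s) {
        // The conversion to unsigned char is what keeps bytes with the
        // upper bit set from showing up as negative numbers.
        values.push_back(static_cast<unsigned char>(c));
    }
    return values;
}
```

Calling byteValues("\xC3\xBC") yields 195 and 188, both above 127, while byteValues("A") yields 65.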

Cris Luengo answered Oct 06 '22 03:10


As suggested by several comments and highlighted in @CrisLuengo's answer, we can iterate over the characters looking for any with the upper bit set:

#include <iostream>
#include <string>
#include <algorithm>

bool isASCII (const std::string& s)
{
    return !std::any_of(s.begin(), s.end(), [](char c) { 
        return static_cast<unsigned char>(c) > 127; 
    });
}

int main()
{
    std::string s1 { "C:\\Users\\myUser\\Downloads\\Hello my friend.pdf" };   
    std::string s2 { "C:\\Users\\myUser\\Downloads\\ü.pdf" };

    std::cout << std::boolalpha << isASCII(s1) << "\n";
    std::cout << std::boolalpha << isASCII(s2) << "\n";
}

true

false
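If the caller also needs to know where the problem is, for example to report it before handing the path to the ASCII-only process, the same check can return a position instead of a bool. This is only a sketch under that assumption; the name firstNonASCII is hypothetical and does not appear in either answer.

```cpp
#include <algorithm>
#include <string>

// Hypothetical variant of isASCII: returns the index of the first byte
// whose value is above 127, or std::string::npos if every byte is ASCII.
std::string::size_type firstNonASCII(const std::string& s)
{
    auto it = std::find_if(s.begin(), s.end(), [](char c) {
        return static_cast<unsigned char>(c) > 127;
    });
    if (it == s.end()) {
        return std::string::npos;
    }
    // The index counts bytes, not Unicode characters.
    return static_cast<std::string::size_type>(it - s.begin());
}
```

For a string whose bytes are "ab" followed by 0xC3 0xBC and ".pdf", this returns 2; for a pure ASCII string it returns std::string::npos.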

3 revs answered Oct 06 '22 04:10