
How can I get a Path from a raw C string (CStr or *const u8)?

What's the most direct way to use a C string as Rust's Path?

I've got a const char * from FFI and need to use it as a filesystem path in Rust.

  • I'd rather not enforce UTF-8 on the path, so converting through str/String is undesirable.
  • It should work on Windows at least for ASCII paths.

To clarify: I'm just replacing an existing C implementation that passes the path to fopen with a Rust stdlib implementation. It's not my problem whether it's a valid path or encoded properly for a given filesystem, as long as it's not worse than fopen (and I know fopen basically doesn't work on Windows).

Kornel asked Sep 21 '17 11:09


2 Answers

Here's what I've learned:

  • Path/OsStr always use WTF-8 on Windows, and are an encoding-ignorant bag of bytes on Unix.

  • They never ever store any paths using any "wide" encoding like UTF-16 or UCS-2. The Windows-only masquerade of OsStr is to hide the WTF-8 encoding, nothing more.

  • It is extremely unlikely to ever change, because the standard library API supports creation of Path and OsStr from UTF-8 &str without any allocation or mutation of memory (i.e. as_ref() is supported, and its strict API doesn't leave room to implement it as anything other than a pointer cast).
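
To make the last point concrete, here's a minimal sketch of the borrow-only conversion from &str to &Path. The helper names borrow_as_path and shares_storage are made up for illustration, and the pointer comparison only reflects how the current standard library behaves; it isn't a documented guarantee.

```rust
use std::path::Path;

/// Borrow a `&str` as a `&Path`. Hypothetical helper name; the conversion
/// itself is the standard `AsRef<Path>` impl for `str`.
fn borrow_as_path(s: &str) -> &Path {
    s.as_ref()
}

/// Returns true if the path's bytes sit at the same address as the string's,
/// i.e. nothing was copied or re-encoded. This observes current behavior
/// rather than a documented guarantee.
fn shares_storage(s: &str) -> bool {
    let p = borrow_as_path(s);
    match p.to_str() {
        Some(back) => back.as_ptr() == s.as_ptr() && back.len() == s.len(),
        None => false,
    }
}
```

Calling shares_storage("dir/file.txt") should come back true, which is consistent with the conversion being a reinterpretation of the same bytes.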

Unix-only zero-copy version (it doesn't even depend on any implementation details):

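use std::ffi::{CStr, OsStr};
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

// CStr::from_ptr is unsafe: the pointer must be non-null and point to a
// NUL-terminated string that outlives every use of `path`.
let slice = unsafe { CStr::from_ptr(c_null_terminated_string_ptr_here) };
// Reinterpret the bytes (without the trailing NUL) as an OsStr: no copy, no UTF-8 check.
let osstr = OsStr::from_bytes(slice.to_bytes());
let path: &Path = osstr.as_ref();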
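
If you already hold a &CStr, the same idea fits in a small safe function. This is a Unix-only sketch; path_from_cstr is a name I made up, and only the std calls are real.

```rust
use std::ffi::{CStr, OsStr};
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

/// Unix-only: view the bytes of a C string as a `Path` without copying
/// and without any UTF-8 validation. Hypothetical helper name.
fn path_from_cstr(c: &CStr) -> &Path {
    // to_bytes() excludes the trailing NUL.
    Path::new(OsStr::from_bytes(c.to_bytes()))
}
```

A path such as b"/tmp/\xff\xfe.txt\0" goes through unchanged even though it isn't valid UTF-8, which is the "no worse than fopen" behavior the question asks for.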

On Windows, accepting only valid UTF-8 is the best Rust can do without the charade of building a WTF-8 OsString from UTF-16 code units:

…
let str = ::std::str::from_utf8(slice.to_bytes()).expect("keep your surrogates paired");
let path: &Path = str.as_ref();
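
If panicking on bad input isn't acceptable, a variant that hands the error back to the caller could look like this. It's only a sketch: path_from_utf8_cstr is a hypothetical name, and it builds on CStr::to_str, which validates UTF-8 and borrows on success.

```rust
use std::ffi::CStr;
use std::path::Path;
use std::str::Utf8Error;

/// Portable: accept the C string only if it is valid UTF-8, then borrow it
/// as a `Path`. Hypothetical helper name; no allocation on success.
fn path_from_utf8_cstr(c: &CStr) -> Result<&Path, Utf8Error> {
    Ok(Path::new(c.to_str()?))
}
```

ASCII paths always pass the check, so this covers the "works on Windows at least for ASCII paths" requirement.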
Kornel answered Nov 09 '22 16:11


Safely and portably? Insofar as I'm aware, there isn't a way. My advice is to demand UTF-8 and just pray it never breaks.

The problem is that the only thing you can really say about a "C string" is that it's NUL-terminated. You can't really say anything meaningful about how it's encoded. At least, not with any real certainty.
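
As a rough illustration (the helper names are mine, not anything from the standard library): CStr will happily wrap any NUL-terminated bytes, and it's only the decoding step that either fails or silently changes the data.

```rust
use std::ffi::CStr;

/// True if the C string's bytes happen to be valid UTF-8.
/// Hypothetical helper for illustration.
fn is_valid_utf8(c: &CStr) -> bool {
    c.to_str().is_ok()
}

/// True if a lossy decode would alter the bytes (invalid sequences are
/// replaced with U+FFFD), meaning the result no longer names the same file.
fn lossy_decode_changes_bytes(c: &CStr) -> bool {
    c.to_string_lossy().as_bytes() != c.to_bytes()
}
```

For example, "café" encoded as Latin-1 (b"caf\xe9\0") is a perfectly good C string, but it fails is_valid_utf8 and comes out different after a lossy decode, while the UTF-8 encoding (b"caf\xc3\xa9\0") passes both checks.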

Unsafely and/or non-portably? If you're running on Linux (and possibly other modern *NIXen), you can maybe use OsStrExt to do the conversion. This only works assuming the C string was a valid path in the first place. If it came from some string processing code that wasn't using the same encoding as the filesystem (which these days is generally "arbitrary bytes that look like UTF-8 but might not be")... well, you'll have to convert it yourself, first.

On Windows? Hahahaha. This depends on where the string came from. C strings embedded in an executable can be in a variety of encodings depending on how the code was compiled. If it came from the OS itself, it could be in one of two different encodings: the thread's OEM codepage, or the thread's ANSI codepage. I never worked out how to check which it's set to. If it came from the console, it would be in whatever the console's input encoding was set to when you received it... assuming it wasn't piped in from something else that was using a different encoding (hi there, PowerShell!). All of the above require you to roll your own transcoding code, since Rust itself avoids this by never, ever using non-Unicode APIs on Windows.

Oh, and don't forget that there is no 8-bit encoding that can properly store Windows paths, since Windows paths are "arbitrary 16-bit words that look like UTF-16 but might not be". [1]

... so, like I said: demand UTF-8 and just pray it never breaks, because trying to do it "correctly" leads to madness.


[1]: I should clarify: there is such an encoding: WTF-8, which is what Rust uses for OsStr and OsString on Windows. The catch is that nothing else on Windows uses this, so it's never going to be how a C string is encoded.
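
To show what "looks like UTF-16 but might not be" means in practice, here's a small portable sketch. The function name is hypothetical. On Windows itself, std::os::windows::ffi::OsStringExt::from_wide accepts such sequences, which is exactly what WTF-8 exists to represent.

```rust
/// True if the 16-bit units form well-formed UTF-16, i.e. every surrogate
/// is part of a proper high/low pair. Hypothetical helper for illustration.
fn is_well_formed_utf16(units: &[u16]) -> bool {
    // String::from_utf16 rejects unpaired surrogates, because they have no
    // valid UTF-8 encoding.
    String::from_utf16(units).is_ok()
}
```

A sequence like [0x0061, 0xD800] (the letter "a" followed by a lone high surrogate) is the kind of thing Windows permits in a filename, yet it fails this check, so no ordinary String or 8-bit codepage string can hold it.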

DK. answered Nov 09 '22 17:11