
sscanf() and locales. How does one really parse things like "3.14"?

Tags: c++, c, parsing, scanf

Let's say I have to read a file containing a bunch of floating-point numbers. The numbers can look like 1e+10, 5, -0.15, etc., i.e., any generic floating-point number, always written with a decimal point (this is fixed!). However, my code is a plugin for another application, and I have no control over the current locale. It may be Russian, for example, where the LC_NUMERIC rules call for a decimal comma. Thus, Pi is expected to be spelled as "3,1415...", and

sscanf("3.14", "%f", &x); 

returns 1, and x contains 3.0, since parsing stops at the '.' in the string.

I need to ignore the locale for such number-parsing tasks.

How does one do that?

I could write a parseFloat function, but this seems like a waste.
I could also save the current locale, reset it temporarily to "C", read the file, and restore to the saved one. What are the performance implications of this? Could setlocale() be very slow on some OS/libc combo, what does it really do under the hood?
Yet another way would be to use iostreams, but again their performance isn't stellar.
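For reference, the save-and-restore option mentioned above could look roughly like the sketch below. The helper name parse_double_setlocale is made up for illustration. Note that setlocale() changes state for the whole process, so this is only reasonable if no other thread depends on the locale while the switch is in effect.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse a double from `text` using "C" numeric rules
   by temporarily switching LC_NUMERIC for the whole process.
   Returns 1 on success, 0 on failure. Not thread-safe. */
int parse_double_setlocale(const char *text, double *out)
{
    /* setlocale(category, NULL) returns a string that a later setlocale()
       call may overwrite, so copy the name before changing anything. */
    const char *current = setlocale(LC_NUMERIC, NULL);
    char *saved = NULL;
    if (current != NULL) {
        saved = malloc(strlen(current) + 1);
        if (saved == NULL)
            return 0;
        strcpy(saved, current);
    }

    setlocale(LC_NUMERIC, "C");
    int ok = (sscanf(text, "%lf", out) == 1);

    /* Restore whatever the host application had set. */
    if (saved != NULL) {
        setlocale(LC_NUMERIC, saved);
        free(saved);
    }
    return ok;
}
```

When reading a whole file, the switch and restore would wrap the entire read loop rather than each number, so the cost of setlocale() is paid only twice.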

So I'm puzzled. What do you guys do in such situations?

Cheers!

anrieff asked Dec 17 '12 18:12


3 Answers

My personal preference is to never use LC_NUMERIC, i.e. either call setlocale only with other categories, or, after calling setlocale with LC_ALL, call setlocale(LC_NUMERIC, "C"). Otherwise, you're completely out of luck if you want to use the standard library for printing or parsing numbers in a standard form for interchange.
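A minimal sketch of that setup, assuming you own the program's startup code (which a plugin usually does not). The function name is hypothetical.

```c
#include <locale.h>

/* Hypothetical startup helper: adopt the user's locale for everything
   except number formatting and parsing. */
void init_locale_keep_c_numeric(void)
{
    setlocale(LC_ALL, "");       /* take all categories from the environment */
    setlocale(LC_NUMERIC, "C");  /* then force '.' as the decimal point */
}
```

After this call, printf, scanf and strtod keep using '.' for numbers, while the other categories still follow the user's settings.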

If you're lucky enough to be on a POSIX 2008 conforming system, you can use the uselocale and *_l family of functions to make the situation somewhat better. There are at least 2 basic approaches:

  1. Leave the default locale unset (at least the troublesome parts like LC_NUMERIC; LC_CTYPE should probably always be set), and pass a locale_t object for the user's locale to the appropriate *_l functions only when you want to present things to the user in a way that meets their own cultural expectations; otherwise use the default C locale.

  2. Have your code that needs to work with data for interchange keep around a locale_t object for the C locale, and either switch back and forth using uselocale when you need to work with data in a standard form for interchange, or use the appropriate *_l functions (but there is no scanf_l).
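The second approach could be sketched as follows, assuming a POSIX.1-2008 libc such as glibc. The helper name parse_double_uselocale is invented for the example. Unlike setlocale(), uselocale() only affects the calling thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: parse a double with "C" numeric rules by switching
   the calling thread's locale around the sscanf() call.
   Returns 1 on success, 0 on failure. */
int parse_double_uselocale(const char *text, double *out)
{
    /* Create a locale object for the "C" locale. Real code would likely
       create this once and keep it around instead of per call. */
    locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return 0;

    locale_t previous = uselocale(c_loc);   /* switch this thread only */
    int ok = (sscanf(text, "%lf", out) == 1);
    uselocale(previous);                    /* switch back */

    freelocale(c_loc);
    return ok;
}
```

Since there is no scanf_l in POSIX, switching with uselocale() around the call is what makes plain sscanf() usable here.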

Note that implementing your own floating point parser is not easy and is probably not the right solution to the problem unless you're an expert in numerical computing. Getting it right is very hard.


POSIX.1-2008 specifies isalnum_l(), isalpha_l(), isblank_l(), iscntrl_l(), isdigit_l(), isgraph_l(), islower_l(), isprint_l(), ispunct_l(), isspace_l(), isupper_l(), and isxdigit_l().

R.. GitHub STOP HELPING ICE answered Nov 15 '22 18:11


Here's what I've done with this stuff in the past.

The goal is to use locale-dependent numeric converters with a C-locale numeric representation. The ideal, of course, would be to use non-locale-dependent converters, or not change the locale, etc., etc., but sometimes you just have to live with what you've got. Locale support is seriously broken in several ways and this is one of them.</rant>

First, extract the number as a string using something like the C grammar's simple pattern for numeric preprocessing tokens. For use with scanf, I do an even simpler one:

" %1[-+0-9.]%[-+0-9A-Za-z.]"

This could be simplified even more, depending on what else you might expect in the input stream. The only thing you need to do is to not read beyond the end of the number; as long as you don't allow numbers to be followed immediately by letters, without intervening whitespace, the above will work fine.
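As an illustration, the pattern could be used like this. The function name and the 64-byte buffer size are assumptions for the example, and a field width has been added to the second conversion so the buffer cannot overflow.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the next number-like token from `input`
   into `token`. Returns 1 if a token was found, 0 otherwise. */
int extract_number_token(const char *input, char token[64])
{
    char first[2] = "";
    char rest[63] = "";

    /* " %1[-+0-9.]" skips leading whitespace and takes exactly one
       sign, digit or dot; "%62[...]" then takes whatever number-like
       characters follow, up to the buffer limit. */
    int n = sscanf(input, " %1[-+0-9.]%62[-+0-9A-Za-z.]", first, rest);
    if (n < 1)
        return 0;

    /* n == 1 means a single-character token such as "5";
       `rest` is still the empty string in that case. */
    strcpy(token, first);
    strcat(token, rest);
    return 1;
}
```

The token is only a candidate at this point. Something like "1.2.3" is accepted here and has to be rejected by the actual conversion step.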

Now, get the struct lconv (man 7 locale) representing the current locale using localeconv(3). The first entry in that struct is const char* decimal_point; replace all of the '.' characters in your string with that value. (You might also need to replace '+' and '-' characters, although most locales don't change them, and the sign fields in the lconv struct are documented as only applying to currency conversions.) Finally, feed the resulting string through strtod and see if it passes.
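A rough sketch of the substitution step, assuming the token has already been extracted and uses '.' as its decimal point. The helper name and the 128-byte buffer are made up for the example, and the sign characters are left alone.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: rewrite each '.' in `token` as the current
   locale's decimal_point, then convert with strtod().
   Returns 1 if the whole token was consumed, 0 otherwise. */
int parse_with_locale_point(const char *token, double *out)
{
    const char *dp = localeconv()->decimal_point;
    size_t dplen = strlen(dp);   /* may be more than one byte */
    char buf[128];
    size_t used = 0;

    for (const char *p = token; *p != '\0'; ++p) {
        size_t need = (*p == '.') ? dplen : 1;
        if (used + need >= sizeof buf)
            return 0;            /* token too long for this sketch */
        if (*p == '.')
            memcpy(buf + used, dp, dplen);
        else
            buf[used] = *p;
        used += need;
    }
    buf[used] = '\0';
    if (used == 0)
        return 0;

    char *end;
    double value = strtod(buf, &end);
    /* "See if it passes": require that strtod consumed everything. */
    if (end == buf || *end != '\0')
        return 0;

    *out = value;
    return 1;
}
```

Because strtod() follows the current locale, the rewritten string is in exactly the form it expects, whatever decimal_point happens to be.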

This is not a perfect algorithm, particularly since it's not always easy to know how locale-compliant a given library actually is, so you might want to do some autoconf stuff to configure it for the library you're actually compiling with.

rici answered Nov 15 '22 16:11


I am not sure how to solve it in C.

But C++ streams (can) have a unique locale object.

#include <locale>
#include <sstream>

int main()
{
    std::stringstream dataStream;
    dataStream.imbue(std::locale("C"));

    // Note: You must imbue the stream before you do anything with it.
    //       If any operations have been performed then an imbue() can
    //       be silently ignored by the stream (which is a pain to debug).

    dataStream << "3.14";
    float x;
    dataStream >> x;   // parsed with '.' as the decimal point
}
Martin York answered Nov 15 '22 16:11