C - Formatting String size with special characters

So, I was trying to print fixtures from a competition with the strings neatly aligned, but I found that whenever a name contains a special character like 'é', 'í' or 'á', the alignment is off by one column even though I specified the field width.

Here is the code:

printf("=> %-25s (%d) vs (%d) \t%-25s\n", f->home_team_name, f->goals_home_team, f->goals_away_team, f->away_team_name);

For teams with those characters the output is like:

=> Palmeiras               (2) vs (0)   Botafogo               
=> Atlético Mineiro       (4) vs (3)    Grémio                
=> Atlético PR            (3) vs (0)    Palmeiras              
=> Botafogo                (2) vs (2)   Cruzeiro   

But I want the output to look like this, even with special characters:

=> Tottenham Hotspur FC    (0) vs (0)   Leicester City FC      
=> West Ham United FC      (0) vs (0)   Everton FC             
=> Burnley FC              (0) vs (0)   AFC Bournemouth   

I've tried to look for formatting flags but can't find the solution.

chriptus13 asked Oct 17 '22 00:10


1 Answer

The format string in printf does not take multibyte characters into account: the field width of %s is counted in bytes, not in characters.
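
To see the effect in isolation, here is a minimal sketch. Note that format_field is a hypothetical helper (it is not part of your code), and the name is meant to be passed with explicit UTF-8 byte escapes so the result does not depend on the source file encoding. It formats a name into a 10-byte field and appends '|' to mark where the field ends:

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: left-justifies name in a field of at least 10 bytes
// and appends '|' so the end of the field is visible.
// Returns what snprintf returns: the number of bytes the full result needs.
static int format_field(char *buf, size_t size, const char *name)
{
    return snprintf(buf, size, "%-10s|", name);
}
```

With "Gremio" (6 bytes) the buffer holds the name followed by 4 spaces. With "Gr\xC3\xA9mio" (7 bytes, because 'é' takes two bytes in UTF-8) it holds the name followed by only 3 spaces. Both results are 11 bytes long, but the second one is displayed one column narrower on a UTF-8 terminal.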

A possible solution is to count the wide characters of a string with the mbstowcs function. That count is then subtracted from the length of the string in bytes (i.e. strlen). This yields a nonnegative "compensation value" that may be added to printf's field width. For example, "Grémio" occupies 7 bytes in UTF-8 but contains 6 characters, so its compensation value is 1 and the field width becomes 26.

The mbstowcs function is described as:

Converts a multibyte character string from the array whose first element is pointed to by src to its wide character representation. Converted characters are stored in the successive elements of the array pointed to by dst. No more than len wide characters are written to the destination array.

In your case, this means that the UTF-8 encoded octets (stored in an array of char) are converted into a wide representation which guarantees that any multibyte character (up to the locale-specific MB_CUR_MAX bytes) is encoded by no more than one wchar_t object.

The relevant quote from the C11 Standard is in 7.19/2 Common definitions <stddef.h>:

wchar_t

which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales;

For instance, on Linux, wide characters are most likely represented in UCS-4 (also known as UTF-32).

Here is a proof of concept:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>

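static inline size_t widestrlen(const char *str)
{
    // With a NULL destination (POSIX behavior) mbstowcs only counts the
    // wide characters; it returns (size_t)-1 if str is not a valid
    // multibyte string in the current locale.
    size_t n = mbstowcs(NULL, str, 0);

    // On error fall back to the byte length, so the compensation is 0.
    return n == (size_t) -1 ? strlen(str) : n;
}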

static inline size_t compensation(const char *str)
{
    return strlen(str) - widestrlen(str);
}

int main(void)
{
    setlocale(LC_CTYPE, "");

    // Print some debugging information regarding selected locale
    printf("Current locale for LC_TYPE category: %s\n", setlocale(LC_CTYPE, NULL));
    printf("Maximum number of bytes in a multibyte character: %zu\n", MB_CUR_MAX);
    printf("Does current encoding support shift states? : %s\n\n", mblen(NULL, 0) ? "Yes" : "No");

    int goals_home_teams[] = { 4, 0 };
    int goals_away_teams[] = { 3, 0 };
    const char *home_team_names[] = { "Atlético Mineiro", "West Ham United FC" };
    const char *away_team_names[] = { "Grémio", "Everton FC" };

    for (int i = 0; i < 2; i++)
    {
        printf("=> %-*s (%d) vs (%d) \t%-*s\n",
            25 + (int) compensation(home_team_names[i]),
            home_team_names[i], goals_home_teams[i], goals_away_teams[i],
            25 + (int) compensation(away_team_names[i]),
            away_team_names[i]);
    }
    return 0;
}

Result:

Current locale for LC_TYPE category: en_US.UTF-8
Maximum number of bytes in a multibyte character: 6
Does current encoding support shift states? : No

=> Atlético Mineiro          (4) vs (3)     Grémio                   
=> West Ham United FC        (0) vs (0)     Everton FC  
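
If the names are known to be valid UTF-8, the compensation value can also be computed without depending on the locale, by counting continuation bytes (those of the form 10xxxxxx): every character contributes exactly one byte that is not a continuation byte, so the continuation bytes are exactly the "extra" ones. This is a different technique from the mbstowcs approach above and only a sketch. The name utf8_compensation is hypothetical, and like the version above it counts characters, not terminal columns:

```c
#include <stddef.h>

// Hypothetical helper, assuming s is valid UTF-8.
// Returns the number of continuation bytes (bit pattern 10xxxxxx),
// i.e. the byte length minus the number of code points.
static size_t utf8_compensation(const char *s)
{
    size_t extra = 0;

    for (; *s != '\0'; s++)
    {
        if (((unsigned char) *s & 0xC0) == 0x80)
            extra++;
    }
    return extra;
}
```

It could be used in place of compensation() in the printf call above, e.g. 25 + (int) utf8_compensation(home_team_names[i]).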

Grzegorz Szpetkowski answered Nov 12 '22 22:11