
Removing punctuation and capitalizing in C

I'm writing a program for school that has to read text from a file, capitalize everything, and remove the punctuation and spaces. The file "Congress.txt" contains:

(Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the government for a redress of grievances.)

The file reads in correctly, but the code I have so far to remove the punctuation and spaces and capitalize the letters prints junk characters. My code so far is:

void processFile(char line[]) {
    FILE *fp;
    int i = 0;
    char c;

    if (!(fp = fopen("congress.txt", "r"))) {
        printf("File could not be opened for input.\n");
        exit(1);
    }

    line[i] = '\0';
    fseek(fp, 0, SEEK_END);
    fseek(fp, 0, SEEK_SET);
    for (i = 0; i < MAX; ++i) {
        fscanf(fp, "%c", &line[i]);
        if (line[i] == ' ')
            i++;
        else if (ispunct((unsigned char)line[i]))
            i++;
        else if (islower((unsigned char)line[i])) {
            line[i] = toupper((unsigned char)line[i]);
            i++;
        }
        printf("%c", line[i]);
        fprintf(csis, "%c", line[i]);
    }

    fclose(fp);
}

I don't know if it's an issue, but I have MAX defined as 272 because that is the length of the text file, including punctuation and spaces.

The output I am getting is:

    C╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠
    ╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠╠Press any key to continue . . .
Zach Greene asked Apr 19 '15 21:04


2 Answers

The fundamental algorithm needs to be along the lines of:

while next character is not EOF
    if it is alphabetic
        save the upper case version of it in the string
null terminate the string

which translates into C as:

int c;
int i = 0;

while ((c = getc(fp)) != EOF)
{
    if (isalpha(c))
        line[i++] = toupper(c);
}
line[i] = '\0';

This code doesn't need the (unsigned char) cast with the functions from <ctype.h> because c is guaranteed to contain either EOF (in which case it doesn't get into the body of the loop) or the value of a character converted to unsigned char anyway. You only have to worry about the cast when you use char c (as in the code in the question) and try to write toupper(c) or isalpha(c). The problem is that plain char can be a signed type, so some characters, notoriously ÿ (y-umlaut, U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS), will appear as a negative value, and that breaks the requirements on the inputs to the <ctype.h> functions. This code will attempt to case-convert characters that are already upper-case, but that's probably cheaper than a second test.
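To illustrate the case where the cast does matter, here is a minimal sketch that works on a buffer of plain char instead of the int returned by getc(). The function name keepUpperAlpha and its parameters are hypothetical; they are not part of the question's code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies only the letters of src into dst,
 * upper-cased, and null-terminates dst. dst must hold dstSize bytes.
 * Returns the number of letters stored. */
size_t keepUpperAlpha(char *dst, size_t dstSize, const char *src)
{
    size_t n = 0;

    assert(dstSize > 0);

    for (; *src != '\0' && n < dstSize - 1; ++src) {
        /* Plain char may be signed, so convert to unsigned char
         * before handing the value to the <ctype.h> functions. */
        unsigned char uc = (unsigned char)*src;

        if (isalpha(uc))
            dst[n++] = (char)toupper(uc);
    }
    dst[n] = '\0';
    return n;
}
```

Here the value passed to isalpha() and toupper() starts life as a char, so the conversion to unsigned char is what keeps it inside the range those functions accept.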

What else you do in the way of printing, etc., is up to you. The csis file stream is a global variable; that's a bit (tr)icky. You should probably terminate the printed output with a newline.
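One way to avoid relying on a global stream is to pass the output streams in as parameters. The sketch below assumes a hypothetical helper named writeBoth; the question does not show how csis is declared or opened, so the second stream here is simply whatever FILE * the caller supplies.

```c
#include <stdio.h>

/* Hypothetical helper: writes text followed by a newline to two streams,
 * for example stdout and the log file the question calls csis.
 * Returns 0 on success, -1 if either write fails. */
int writeBoth(FILE *screen, FILE *log, const char *text)
{
    if (fprintf(screen, "%s\n", text) < 0)
        return -1;
    if (fprintf(log, "%s\n", text) < 0)
        return -1;
    return 0;
}
```

A caller that already has csis open could then write writeBoth(stdout, csis, line) once the string has been built, and the function no longer depends on any global state.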

The code shown is vulnerable to buffer overflow. If the length of line is MAX, then you can modify the loop condition to:

while (i < MAX - 1 && (c = getc(fp)) != EOF)

If, as would be a better design, you change the function signature to:

void processFile(int size, char line[]) {

and assert that the size is strictly positive:

    assert(size > 0);

then the loop condition changes to:

while (i < size - 1 && (c = getc(fp)) != EOF)

Obviously, you change the call too:

char line[4096];

processFile(sizeof(line), line);
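Putting the pieces above together, a complete version might look like the following sketch. It takes an already-opened FILE * as an extra parameter, which is an assumption made here so the function can be exercised without congress.txt; the name readLetters and the int return value are likewise hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical variant of processFile: reads from fp, stores the
 * upper-cased letters in line, and returns how many were stored.
 * At most size - 1 letters are kept so the terminator always fits. */
int readLetters(FILE *fp, int size, char line[])
{
    int c;      /* int, not char, so EOF can be told apart from data */
    int i = 0;

    assert(size > 0);

    /* The bound is checked first, so no character is read and then
     * discarded once the buffer is full. */
    while (i < size - 1 && (c = getc(fp)) != EOF) {
        if (isalpha(c))
            line[i++] = (char)toupper(c);
    }
    line[i] = '\0';
    return i;
}
```

A caller would open the file, call readLetters(fp, (int)sizeof(line), line), close the file, and then print line followed by a newline.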
Jonathan Leffler answered Sep 28 '22 09:09


In the posted code, there is no intermediate processing of the buffer, so the following code omits the 'line[]' input parameter:

void processFile()
{
    FILE *fp = NULL;

    if (!(fp = fopen("congress.txt", "r")))
    {
        printf("File could not be opened for input.\n");
        exit(1);
    }

    // implied else, fopen successful

    int c; // must be int (not char) so EOF can be distinguished from valid characters
    while( EOF != (c = fgetc(fp) ) )
    {
        if( isalpha(c) ) // a...z or A...Z only; spaces and punctuation are dropped
        {
            // note toupper has no effect on upper case characters
            printf("%c", toupper(c));
            fprintf(csis, "%c", toupper(c));
        }
    }
    printf( "\n" );

    fclose(fp);
} // end function: processFile
user3629249 answered Sep 28 '22 08:09