Cleaning a string of punctuation in C++

Tags: c++

OK, before I even ask my question I want to make one thing clear: I am currently a Computer Science student at NIU, and this relates to one of my class assignments. If anyone has a problem with that, read no further and go about your business.

Now, for anyone who is willing to help, here's the situation. For my current assignment we have to read a file that is just a block of text. For each word in the file we are to clear any punctuation in the word (e.g. "can't" would end up as "can" and "that--to" would end up as "that"; the quotes are not part of the words, they just mark the examples).

The problem I've run into is that I can clean the string fine and insert it into the map we are using, but for some reason my code allows an empty string to be inserted into the map. I've tried everything I can think of to stop this from happening, and the only workaround I've come up with is to use the map's own erase method.

So I am looking for two things: a) suggestions on how to fix this without simply erasing the empty entry, and b) any improvements I could make to the code I have already written.

Here are the functions I have written: one reads in from the file, and the other cleans each word.

Note: the function that reads in from the file calls the clean_entry function to get rid of punctuation before anything is inserted into the map.

Edit: Thank you, Chris. Numbers are allowed :). If anyone has any improvements to the code I've written or any criticisms of something I did, I'll listen. At school we really don't get feedback on the correct, proper, or most efficient way to do things.

int get_words(map<string, int>& mapz)
{
 int cnt = 0;               //set our counter to zero

 map<string, int>::const_iterator mapzIter;

 ifstream input;            //declare instream
 input.open( "prog2.d" ); //open instream
 assert( input );           //assure it is open

 string s;                  //temp strings to read into
 string not_s;

 input >> s;

 while(!input.eof())        //read in until EOF
  {
   not_s = "";
   clean_entry(s, not_s);

   if((int)not_s.length() == 0)
    {
     input >> s;
     clean_entry(s, not_s);
    }    

   mapz[not_s]++;              //increment occurrence
   input >>s;
  }
 input.close();     //close instream 

 for(mapzIter = mapz.begin(); mapzIter != mapz.end(); mapzIter++)
  cnt = cnt + mapzIter->second;

 return cnt;        //return number of words in instream
}


void clean_entry(const string& non_clean, string& clean)
{
 int i, j, begin, end;

 for(i = 0; isalnum(non_clean[i]) == 0 && non_clean[i] != '\0'; i++);

 begin = i;

 if(begin ==(int)non_clean.length())
   return;

 for(j = begin; isalnum(non_clean[j]) != 0 && non_clean[j] != '\0'; j++);

 end = j;

 clean = non_clean.substr(begin, (end-begin));

 for(i = 0; i < (int)clean.size(); i++)
  clean[i] = tolower(clean[i]);

}
asked Sep 22 '08 18:09 by Brandon Haugen




1 Answer

The problem with empty entries is in your while loop. When the cleaned string is empty, you read and clean the next token, then add the result to the map without checking whether it is empty too. Try changing:

not_s = "";
clean_entry(s, not_s);

if((int)not_s.length() == 0)
 {
  input >> s;
  clean_entry(s, not_s);
 }    

mapz[not_s]++;              //increment occurrence
input >>s;

to

not_s = "";
clean_entry(s, not_s);

if((int)not_s.length() > 0)
{
    mapz[not_s]++;              //increment occurrence
}    

input >>s;
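
For reference, here is how the whole loop might look with that check in place. This is only a sketch under some assumptions: first_alnum_run and count_words are hypothetical names (not from your assignment), the cleaning helper returns its result instead of filling an output parameter, and the stream is passed in so the function works on a file or a string alike. It also uses the read itself as the loop condition rather than testing eof(), which avoids the extra read before the loop.

```cpp
#include <cctype>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for clean_entry: returns the first run of
// alphanumeric characters in raw, lower-cased, or "" if there is none.
std::string first_alnum_run(const std::string& raw)
{
    std::string::size_type begin = 0;
    // Skip leading characters that are not letters or digits.
    while (begin < raw.size() &&
           !std::isalnum(static_cast<unsigned char>(raw[begin])))
        ++begin;

    std::string::size_type end = begin;
    // Extend over the run of letters and digits that follows.
    while (end < raw.size() &&
           std::isalnum(static_cast<unsigned char>(raw[end])))
        ++end;

    std::string clean = raw.substr(begin, end - begin);
    for (std::string::size_type i = 0; i < clean.size(); ++i)
        clean[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(clean[i])));
    return clean;
}

// Hypothetical replacement for get_words: counts each cleaned word read
// from input and returns how many words were counted in this call.
int count_words(std::istream& input, std::map<std::string, int>& counts)
{
    int total = 0;
    std::string token;
    while (input >> token) {            // loop ends when a read fails
        const std::string word = first_alnum_run(token);
        if (!word.empty()) {            // never insert the empty string
            ++counts[word];
            ++total;
        }
    }
    return total;
}
```

You could call it with an std::ifstream opened on "prog2.d", or with an std::istringstream for quick testing. Note one difference from your version: it returns the number of words counted during the call, whereas yours sums every value already in the map.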

EDIT: I notice you are checking whether the characters are alphanumeric. If numbers are not allowed, you may need to revisit that check as well.
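
If digits did turn out to be disallowed, a letters-only cleaner might look roughly like the sketch below. The name first_alpha_run is hypothetical, and the only real change from an alphanumeric check is using isalpha in place of isalnum. The casts to unsigned char are there because passing a negative char value to the <cctype> functions is undefined behavior.

```cpp
#include <cctype>
#include <string>

// Hypothetical letters-only cleaner: returns the first run of alphabetic
// characters in raw, lower-cased, or "" if raw contains no letters.
std::string first_alpha_run(const std::string& raw)
{
    std::string::size_type begin = 0;
    // Skip everything that is not a letter (punctuation and digits).
    while (begin < raw.size() &&
           !std::isalpha(static_cast<unsigned char>(raw[begin])))
        ++begin;

    std::string::size_type end = begin;
    // Extend over the run of letters that follows.
    while (end < raw.size() &&
           std::isalpha(static_cast<unsigned char>(raw[end])))
        ++end;

    std::string clean = raw.substr(begin, end - begin);
    for (std::string::size_type i = 0; i < clean.size(); ++i)
        clean[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(clean[i])));
    return clean;
}
```

With this version a token made only of digits cleans to an empty string, so the empty-string check before inserting into the map matters even more.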

answered Oct 23 '22 14:10 by Chris Marasti-Georg