Using C/C++ to efficiently de-serialize a string comprised of floats, tokens and blank lines

I have large strings that resemble the following...

some_text_token

24.325973 -20.638823  

-1.964366 0.753947  
-1.290811 -3.547422  
0.813014 -3.547227  

0.472015 3.723311  
-0.719116 3.676793  

other_text_token  

24.325973 20.638823  

-1.964366 0.753947  
-1.290811 -3.547422  
-1.996611 -2.877422  
0.813014 -3.547227  

1.632365 2.083673  
0.472015 3.723311  
-0.719116 3.676793  

...

...from which I'm trying to efficiently extract, in the same interleaved order in which they appear in the string...

  1. the text tokens
  2. the float values
  3. the blank lines

...but I'm having trouble.

I've tried strtod and successfully grabbed the floats from the string, but I can't seem to get a loop using strtod to report back to me the interleaved text tokens and blank lines. I'm not 100% confident strtod is the "right track" given the interleaved tokens and blank lines that I'm also interested in.

The tokens and blank lines are present in the string to give context to the floats so my program knows what the float values occurring after each token are to be used for, but strtod seems more geared, understandably, toward just reporting back floats it encounters in a string without regard for silly things like blank lines or tokens.

I know this isn't very hard conceptually, but being relatively new to C/C++ I'm having trouble judging what language features I should focus on to take best advantage of the efficiency C/C++ can bring to bear on this problem.

Any pointers? I'm very interested in why various approaches function more or less efficiently. Thanks!!!

Monte Hurd asked Dec 30 '22 05:12


1 Answer

Using C, I would do something like this (untested):

#include <stdio.h>

#define MAX 128

/* Reads fp line by line and decides what each line holds. */
void parse_lines(FILE *fp)
{
    char buf[MAX];

    while (fgets(buf, sizeof buf, fp) != NULL) {
        double d1, d2;
        if (buf[0] == '\n') {
            /* saw a blank line */
        } else if (sscanf(buf, "%lf%lf", &d1, &d2) != 2) {
            /* buf holds the next text token, including the trailing '\n' */
        } else {
            /* use the two doubles, d1 and d2 */
        }
    }
}

The blank-line check comes first because it is cheap: a single character comparison, versus a full sscanf() call. Depending upon your needs:

  1. you might need to increase or otherwise change MAX,
  2. you may need to check whether buf ends with a newline; if it doesn't, the line was longer than the buffer (go to 1 or 3 in that case),
  3. you might need a function that reads full lines from a file, using malloc() and realloc() to grow the buffer dynamically,
  4. you might want to handle special cases such as a single floating-point value on a line (which I assume is not going to happen). sscanf() returns the number of input items successfully matched and assigned, so with this format a return value of 1 identifies that case.
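
Points 2 and 3 can be combined in one helper. What follows is only a minimal sketch: the name read_full_line and the starting capacity of 128 are my own choices for illustration, not anything standard. It keeps calling fgets() and doubling the buffer with realloc() until the buffer ends in a newline or the input runs out.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (illustration only): reads one full line from fp into
 * a heap buffer that grows as needed. The newline, if present, is kept.
 * Returns NULL at end-of-file or on allocation failure.
 * The caller must free() the returned buffer. */
char *read_full_line(FILE *fp)
{
    size_t cap = 128;   /* assumed starting capacity, doubled on demand */
    size_t len = 0;
    char *line = malloc(cap);

    if (line == NULL)
        return NULL;

    while (fgets(line + len, (int)(cap - len), fp) != NULL) {
        len += strlen(line + len);
        if (len > 0 && line[len - 1] == '\n')
            return line;            /* the whole line has been read */

        /* No newline yet: the line did not fit (or the input ends without
         * a newline), so grow the buffer and try to read more. */
        char *bigger = realloc(line, cap * 2);
        if (bigger == NULL) {
            free(line);
            return NULL;
        }
        line = bigger;
        cap *= 2;
    }

    if (len > 0)
        return line;                /* last line, no trailing newline */

    free(line);
    return NULL;
}
```

In the loop above you would then replace the fgets() call with something like `while ((line = read_full_line(fp)) != NULL) { ...; free(line); }`. Allocating per line costs more than reusing a fixed buffer, so this is only worth it if you really cannot bound the line length.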

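I am also assuming that blank lines are really blank (just the newline character by itself). If they can contain spaces or tabs, the buf[0] == '\n' test will miss them and they will fall into the text-token branch, so you will need to skip leading white-space before testing. isspace() in <ctype.h> is useful in that case.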
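
To make the white-space case and the single-value case concrete, here is a rough sketch of a line classifier. The names line_kind, is_blank and classify_line are hypothetical, introduced only for this example. It assumes text tokens never begin with something sscanf() can read as a number (a digit, a sign followed by a digit, "inf", "nan"); a token like that would be misclassified.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical names, for illustration only. */
enum line_kind { LINE_BLANK, LINE_TOKEN, LINE_ONE_VALUE, LINE_TWO_VALUES };

/* Returns 1 if s is empty or holds only white-space characters. */
static int is_blank(const char *s)
{
    while (*s != '\0') {
        if (!isspace((unsigned char)*s))
            return 0;
        s++;
    }
    return 1;
}

/* Classifies one line. d1 (and d2) are filled in only when the result
 * is LINE_ONE_VALUE (or LINE_TWO_VALUES). */
enum line_kind classify_line(const char *line, double *d1, double *d2)
{
    if (is_blank(line))
        return LINE_BLANK;

    /* sscanf() returns how many items it matched and assigned. */
    switch (sscanf(line, "%lf%lf", d1, d2)) {
    case 2:
        return LINE_TWO_VALUES;
    case 1:
        return LINE_ONE_VALUE;
    default:
        return LINE_TOKEN;      /* nothing numeric at the start */
    }
}
```

The cast to unsigned char matters: passing a negative char value to isspace() is undefined behavior.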

fp is a valid FILE * object returned by fopen().
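
If the data is already in memory as one big string, as in the question, the same idea works without a FILE * at all. The sketch below is a different technique from the fgets()/sscanf() loop above: it walks the string in place with strcspn() to find each line end and strtod() to parse the numbers, so nothing is copied. The names walk_stats and walk_string are made up for this example, and the counters and sum are just stand-ins for whatever your program actually does with each item.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type: stand-in for "do something with each item". */
struct walk_stats {
    int blanks;     /* blank (or white-space-only) lines seen */
    int tokens;     /* lines that are not a pair of numbers */
    int pairs;      /* lines holding two floating-point values */
    double sum;     /* sum of all values found in pairs */
};

/* Walks a NUL-terminated string line by line, in order, without copying. */
void walk_string(const char *text, struct walk_stats *st)
{
    const char *p = text;

    while (*p != '\0') {
        size_t n = strcspn(p, "\n");        /* length of this line */
        const char *eol = p + n;            /* points at '\n' or '\0' */
        const char *s = p;

        /* Skip leading spaces/tabs, but never past the end of the line. */
        while (s < eol && isspace((unsigned char)*s))
            s++;

        if (s == eol) {
            st->blanks++;
        } else {
            char *end1, *end2;
            double d1 = strtod(s, &end1);
            double d2 = strtod(end1, &end2);

            /* strtod() skips white-space, newlines included, so reject a
             * second number that was actually found on a later line. */
            if (end1 != s && end2 != end1 && end2 <= eol) {
                st->pairs++;
                st->sum += d1 + d2;
            } else {
                st->tokens++;   /* token text is the n bytes starting at p */
            }
        }

        p = (*eol == '\n') ? eol + 1 : eol;  /* move to the next line */
    }
}
```

Whether this is actually faster than sscanf() depends on your C library, so measure it on your own data before committing to it. The main saving is that each line is scanned where it sits instead of being copied into a separate buffer first.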

Alok Singhal answered Dec 31 '22 20:12