Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest method to deal with string

Tags:

c++

fwrite

fread

I have a text file to read and deal with with 20000 lines. In the text file I want to read the point coordinates and assign to DirectX to render.Snapshot of Text file

I have used std::ifstream, getline, stringstream to get the point coordinates. After building the win32 program then start running, it takes too long to read and store the point coordinates in the array. (5 mins to go through 20000 lines text file). Code as below:

struct PointCoord { std::string PtName; float PtX = 0.0; float PtY = 0.0;}
PointCoord *PointPtr = NULL;
PointCoord pointcoord;

std::ifstream File_read(FileNameTXT);    
while (getline(File_read, TextHandler))
        {
            std::istringstream iss;
            std::string skip;
            if (TextHandler.find("  POINT ") != std::string::npos)
            {
                iss.str(TextHandler);
                std::string TempX, TempY;
                iss >> skip;
                iss >> pointcoord.PtName;                         

                //pointcoord pass value to PointCoord
                iss >> TempX;
                iss >> TempY;

                pointcoord.PtX = std::stof(TempX.c_str());
                pointcoord.PtY = std::stof(TempY.c_str());

                //dynamically store the points coordiantes
                if (PointPtr == NULL)
                {
                    PointPtr = new PointCoord[1];                     

                    //PointCoord pass value to PointPtr
                    PointPtr[0] = pointcoord;
                    Pt_Count++;
                }
                else
                {
                    PointCoord *Temp = PointPtr;
                    PointPtr = new PointCoord[Pt_Count + 1];

                    for (UINT i = 0; i < Pt_Count; i++)
                    {
                        PointPtr[i] = Temp[i];
                    }
                    PointPtr[Pt_Count] = pointcoord;
                    Pt_Count++;
                    delete[]Temp;
                }
            }//end of loading points
      }//end of getline

I also used std::fread to read the whole text file at once in string buffer, which is fast(reading done in few seconds). Then use stringstream in similar code to store the point coordinates in dynamical array, which is also too slow.

Any suggestion are welcome. Many thanks.

like image 950
timhu Avatar asked Oct 22 '18 05:10

timhu


Video Answer


1 Answers

The most damning thing in this code isn't the string parsing; its the resize of the target array for every new point read. The more you read, the worse it gets. Ultimately it becomes a copy operation on the order of O(n^2).

To give you an idea how bad that is, consider the basic summation of n natural numbers, because that's how many object constructions, destructions, and copies you're doing:

n(n+1)/2 = (20000 * 20001)/2 = 200010000 objects created, copied, and destroyed

So, string parsing isn't the problem. The 20000 lines of text being parsed is dwarfed by the over-200-million object constructions, destructions, and copies.


You don't need to do any of that. Use an appropriate container such as std::vector and approximate the initial reserve based on the file size. Then, just generate the points and move them into the container.

An example that does this, including generation of a test file of 100000 points (5x the size you're asking about), appears below:

Code

#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <string>
#include <random>
#include <chrono>

struct Point
{
    std::string name;
    float x;
    float y;
};

std::vector<Point> readFile(std::string const& fname)
{
    std::vector<Point> res;

    std::ifstream inp(fname);
    if (inp.is_open())
    {
        // gather file size for a rough approximation of reserve
        inp.seekg(0, std::ios::end);
        auto flen = inp.tellg();
        inp.seekg(0, std::ios::beg);
        res.reserve(flen/40);

        std::string line;
        while (std::getline(inp, line))
        {
            auto pos = line.find("POINT");
            if (pos != std::string::npos)
            {
                std::istringstream iss(line.substr(pos+5));
                Point pt;
                if (iss >> pt.name >> pt.x >> pt.y)
                    res.emplace_back(std::move(pt));
            }
        }
    }

    return res;
}


int main()
{
    using namespace std::chrono;

    std::mt19937 prng(std::random_device{}());
    std::uniform_real_distribution<float> dist(-100.0, 100.0);

    // generate test data
    std::ofstream outf("testpoints.txt");
    for (int i=1; i<=100000; ++i)
        outf << "POINT  \"" << i << "\" " << dist(prng) << ' ' << dist(prng) << '\n';
    outf.close();

    // rough benchmark
    auto tp0 = steady_clock::now();
    auto v = readFile("testpoints.txt");
    auto tp1 = steady_clock::now();
    std::cout << "Read " << v.size() << " points in ";
    std::cout << duration_cast<milliseconds>(tp1-tp0).count() << "ms\n";
    v.clear();
}

Output

Running on a 2015 duo-core i7 MacBook Air Laptop, a release mode build produces the following result:

Read 100000 points in 164ms

A Possibly More Appropriate Container: std::deque

In the end, what you're really needing is a container that allows quick-insertion on the end, while minimizing (or eliminating) element copying during resizing. Certainly, as the code above shows, setting up a reserve against a std::vector is one way to do that. Another option is to use a container that actuall specializes in end-insertions, while still being mostly cache-friendly (not perfect like std::vector, but certainly better than something like linked lists, which are great for insertions, but dreadful for enumeration).

That's exactly what std::deque will do. The code above, changed for std::deque, allows you to eliminate the reserve-guess, and simply start slamming nodes on the end of the sequence, which will automatically add more pages as the sequence grows:

Code

#include <iostream>
#include <fstream>
#include <sstream>
#include <deque>
#include <string>
#include <random>
#include <chrono>

struct Point
{
    std::string name;
    float x;
    float y;
};

std::deque<Point> readFile(std::string const& fname)
{
    std::deque<Point> res;

    std::ifstream inp(fname);
    if (inp.is_open())
    {
        std::string line;
        while (std::getline(inp, line))
        {
            auto pos = line.find("POINT");
            if (pos != std::string::npos)
            {
                std::istringstream iss(line.substr(pos+5));
                Point pt;
                if (iss >> pt.name >> pt.x >> pt.y)
                    res.emplace_back(std::move(pt));
            }
        }
    }

    return res;
}


int main()
{
    using namespace std::chrono;

    std::mt19937 prng(std::random_device{}());
    std::uniform_real_distribution<float> dist(-100.0, 100.0);

    // generate test data
    std::ofstream outf("testpoints.txt");
    for (int i=1; i<=100000; ++i)
        outf << "POINT  \"" << i << "\" " << dist(prng) << ' ' << dist(prng) << '\n';
    outf.close();

    // rough benchmark
    auto tp0 = steady_clock::now();
    auto v = readFile("testpoints.txt");
    auto tp1 = steady_clock::now();
    std::cout << "Read " << v.size() << " points in ";
    std::cout << duration_cast<milliseconds>(tp1-tp0).count() << "ms\n";
    v.clear();
}

Output

Read 100000 points in 160ms

If your needs require a contiguous sequence, the std::vector approach is the way to go. If you just need random access to the elements and want fast end insertions, a std::deque may be a better fit. Think about that, and choose the best fit for you.


Summary

Get rid of that dreadful expansion algorithm. It's the sore point in your code. Replace it with a geometric resize algorithm, and start with a rough approximation of the number of elements you're going to need from the beginning. Or use a container suited for optimal end-insertions. Either way, it's better than what you have now.

like image 89
WhozCraig Avatar answered Sep 19 '22 23:09

WhozCraig