QTextStream behavior searching for a string not as expected

I have these few lines of code:

QFile file("h:/test.txt");
file.open(QFile::ReadOnly | QFile::Text);
QTextStream in(&file);

bool found = false;
uint pos = 0;

do {
    QString temp = in.readLine();
    int p = temp.indexOf("something");
    if (p < 0) {
        pos += temp.length() + 1;
    } else {
        pos += p;
        found = true;
    }
} while (!found && !in.atEnd());

in.seek(0);
QString text = in.read(pos);
cout << text.toStdString() << endl;

The idea is to search a text file for a specific char sequence, and when found, load the file from the beginning to the occurrence of the searched text. The input I used for testing was:

this is line one, the first line
this is line two, it is second
this is the third line
and this is line 4
line 5 goes here
and finally, there is line number 6

And here comes the strange part: if the searched string is on any line except the last, I get the expected behavior. It works perfectly fine.

BUT if I search for a string that is on the last line (line 6), the result is always 5 characters short. If the file had 7 lines and the string were on the 7th, the result would be 6 characters short, and so on: when the searched string is on the last line, the result is always lineNumber - 1 characters short.

So, is this a bug or am I missing something obvious?

EDIT: Just to clarify, I am not asking for alternative ways to do this, I am asking why I get this behavior.

dtech asked Apr 06 '13 11:04


5 Answers

When the searched string is on the last line, the loop reads the whole input stream, so in.atEnd() returns true. Reaching the end appears to leave the file and the text stream in a bad state or out of sync with each other, so the subsequent seek is no longer valid.

If you replace

in.seek(0);
QString text = in.read(pos);
cout << text.toStdString() << endl;

with

QString text;
if(in.atEnd())
{
    file.close();
    file.open(QFile::ReadOnly | QFile::Text);
    QTextStream in1(&file);
    text = in1.read(pos);
}

else
{
    in.seek(0);
    text = in.read(pos);
}
cout << text.toStdString().c_str() << endl;

It will work as expected. P.S. There might be a cleaner solution than re-opening the file, but the problem definitely comes from reaching the end of both the stream and the file and then trying to operate on them afterwards...

Ilya Kobelevskiy answered Oct 21 '22 15:10


You get this behaviour because readLine() advances the cursor by the length of the line plus its line-delimiter characters (LF, CRLF, or CR, depending on the file). The string returned by readLine() does not contain those characters, so your position calculation cannot tell how many of them were actually skipped.

The solution is to read fixed-size buffers instead of lines. Here is your code, modified:

QFile file("h:/test.txt");
file.open(QFile::ReadOnly | QFile::Text);
QTextStream in(&file);

bool found = false;
uint pos = 0;
qint64 buffSize = 64; // adjust to your needs

do {
    QString temp = in.read(buffSize);
    int p = temp.indexOf("something");
    if (p < 0) {
        uint posAdj = buffSize;
        if (temp.length() < buffSize)
            posAdj = temp.length();
        pos += posAdj;
    } else {
        pos += p;
        found = true;
    }
} while (!found && !in.atEnd());

in.seek(0);
QString text = in.read(pos);
cout << text.toStdString() << endl;

EDIT

The code above contains an error: the searched word might be split across two buffers. Here is a sample input that breaks it (assuming we search for keks):

test test test test test test
test test test test test test  keks
test test test test test test
test test test test test test
test test test test test test
test test test test test test
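To make the failure concrete, here is a minimal sketch in plain standard C++ (no Qt, and not the code from this answer). The function names findInChunksNaive and findInChunksWithOverlap are hypothetical, and the text is held in a std::string rather than read from a file. The first function searches each chunk on its own and misses a match that straddles a chunk boundary; the second carries the last needle.size() - 1 characters forward so the split match is seen.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Searches `text` in fixed-size chunks, looking at each chunk on its own.
// Returns the offset of `needle`, or std::string::npos if no single chunk
// contains the whole needle.
std::size_t findInChunksNaive(const std::string& text, const std::string& needle,
                              std::size_t chunkSize) {
    if (needle.empty() || chunkSize == 0)
        return std::string::npos;
    for (std::size_t start = 0; start < text.size(); start += chunkSize) {
        const std::string chunk = text.substr(start, chunkSize);
        const std::size_t p = chunk.find(needle);
        if (p != std::string::npos)
            return start + p;
    }
    return std::string::npos;
}

// Same scan, but the last needle.size() - 1 characters of the previous window
// are kept in front of the next chunk, so a match split across a chunk
// boundary is still found.
std::size_t findInChunksWithOverlap(const std::string& text, const std::string& needle,
                                    std::size_t chunkSize) {
    if (needle.empty() || chunkSize == 0)
        return std::string::npos;
    std::string carry;           // tail of the previous window
    std::size_t carryStart = 0;  // offset in `text` where the current window begins
    for (std::size_t start = 0; start < text.size(); start += chunkSize) {
        const std::string window = carry + text.substr(start, chunkSize);
        const std::size_t p = window.find(needle);
        if (p != std::string::npos)
            return carryStart + p;
        const std::size_t keep = std::min(window.size(), needle.size() - 1);
        carryStart += window.size() - keep;  // the next window starts at the kept tail
        carry = window.substr(window.size() - keep);
    }
    return std::string::npos;
}
```

With the text "test test keks test" and a chunk size of 12, the first chunk ends in "ke" and the second starts with "ks", so the naive version reports no match while the overlapping version reports offset 10.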

Solution

Here is complete code that works with all the inputs I tried:

#include <QFile>
#include <QTextStream>
#include <iostream>


int findPos(const QString& expr, QTextStream& stream) {
    if (expr.isEmpty())
        return -1;

    // buffer size of same length as searched expr should be OK to go
    qint64 buffSize = quint64(expr.length());

    stream.seek(0);
    QString startBuffer = stream.read(buffSize);
    int pos = 0;

    while(!stream.atEnd()) {
        QString cycleBuffer = stream.read(buffSize);
        QString searchBuffer = startBuffer + cycleBuffer;
        int bufferPos = searchBuffer.indexOf(expr);
        if (bufferPos >= 0)
            return pos + bufferPos + expr.length();
        pos += cycleBuffer.length();
        startBuffer = cycleBuffer;
    }

    return pos;
}

int main(int argc, char *argv[])
{
    Q_UNUSED(argc);
    Q_UNUSED(argv);

    QFile file("test.txt");
    file.open(QFile::ReadOnly | QFile::Text);
    QTextStream in(&file);

    int pos = findPos("keks", in);

    in.seek(0);
    QString text = in.read(pos);
    std::cout << text.toUtf8().data() << std::endl;
}

dant3 answered Oct 21 '22 15:10


Windows and *nix use different line endings (\r\n vs \n). When you open a file in text mode, every \r\n sequence is translated to \n.

The mistake in your original code is that you try to calculate the offset of each skipped line without knowing the exact length of that line in the text file.

length = number_of_chars + number_of_eol_chars
where number_of_chars == QString::length()
and number_of_eol_chars == (1 if \n) or (2 if \r\n)

You cannot detect number_of_eol_chars without raw access to the file, and your code has no such access because it opens the file as text rather than as binary. So the error in your code is that you hardcoded number_of_eol_chars as 1 instead of detecting it. In Windows text files (with \r\n line endings), pos will be off by one for each skipped line.
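The arithmetic above can be checked with a small sketch in plain standard C++ (no Qt). The helper name offsetOfLine is hypothetical. It only illustrates the counting argument and says nothing about what QTextStream does internally: it sums line lengths plus an assumed number of end-of-line characters for every skipped line.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Offset of the start of line `lineIndex` (0-based), assuming every skipped
// line is terminated by `eolChars` characters (1 for \n, 2 for \r\n).
std::size_t offsetOfLine(const std::vector<std::string>& lines,
                         std::size_t lineIndex, std::size_t eolChars) {
    std::size_t pos = 0;
    for (std::size_t i = 0; i < lineIndex && i < lines.size(); ++i)
        pos += lines[i].size() + eolChars;  // number_of_chars + number_of_eol_chars
    return pos;
}
```

For the six-line test file from the question, offsetOfLine(lines, 5, 2) and offsetOfLine(lines, 5, 1) differ by exactly 5, one character per skipped line, which matches the "lineNumber - 1 characters short" observation.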

Fixed code:

#include <QFile>
#include <QTextStream>

#include <iostream>
#include <string>


int main(int argc, char *argv[])
{
    QFile f("test.txt");
    const bool isOpened = f.open( QFile::ReadOnly | QFile::Text );
    if ( !isOpened )
        return 1;
    QTextStream in( &f );

    const QString searchFor = "finally";

    bool found = false;
    qint64 pos = 0;

    do 
    {
        const qint64 lineStartPos = in.pos();
        const QString temp = in.readLine();
        const int ofs = temp.indexOf( searchFor );
        if ( ofs < 0 )
        {
            // Skip this line and advance pos by the exact length of the line.
            // Do not hardcode "1": the line ending may be 1 or 2 chars (\n or \r\n).
            const qint64 length = in.pos() - lineStartPos;
            pos += length;
        }
        else
        {
            pos += ofs;
            found = true;
        }

    } while ( !found && !in.atEnd() );

    in.seek( 0 );
    const QString text = in.read( pos );

    std::cout << text.toStdString() << std::endl;

    return 0;
}

Dmitry Sazonov answered Oct 21 '22 16:10


I'm not entirely sure why you're seeing this behavior but I'd suspect it's related to line endings. I tried your code and I only saw the last line behavior when the file had CRLF line endings AND there was no new line (CRLF) at the end of the file. So yes, weird. If the file had LF line endings then it always worked as expected.

With that said, it's probably not a good idea to keep track of the position by adding + 1 at the end of each line because you won't know if your source file was CRLF or LF and QTextStream will always strip the line endings. Here's a function that should work better. It builds up the output string line by line and I haven't seen any weird behavior with it:

void searchStream( QString fileName, QString searchStr )
{
    QFile file( fileName );
    if ( file.open(QFile::ReadOnly | QFile::Text) == false )
        return;

    QString text;
    QTextStream in(&file);
    QTextStream out(&text);

    bool found = false;

    do {
        QString temp = in.readLine();
        int p = temp.indexOf( searchStr );
        if (p < 0) {
            out << temp << endl;
        } else {
            found = true;
            out << temp.left(p);
        }
    } while (!found && !in.atEnd());

    std::cout << text.toStdString() << std::endl;
}

It doesn't keep track of the position in the original stream, so if you really wanted a position then I'd recommend using QTextStream::pos() as it will be accurate whether the file is CRLF or LF.
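The same idea, asking the stream where it is instead of computing the position from line lengths, can be sketched in plain standard C++. This is an analogy, not Qt's API: it uses std::istringstream and tellg() over the raw bytes, and the function name offsetViaStreamPosition is hypothetical.

```cpp
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>

// Returns the byte offset of the first occurrence of `needle` in `data`,
// searching line by line, or -1 if it is absent. The start of each line is
// taken from the stream itself (tellg), never computed from line lengths.
long long offsetViaStreamPosition(const std::string& data, const std::string& needle) {
    std::istringstream in(data);
    std::string line;
    for (;;) {
        const std::streampos lineStart = in.tellg();  // position before reading the line
        if (!std::getline(in, line))
            break;
        const std::size_t p = line.find(needle);
        if (p != std::string::npos)
            return static_cast<long long>(static_cast<std::streamoff>(lineStart)) +
                   static_cast<long long>(p);
    }
    return -1;
}
```

Because the line start comes from the stream, the result is the same as a direct search whether the data uses \n or \r\n, with no hardcoded end-of-line length.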

Cutterpillow answered Oct 21 '22 16:10


The QTextStream::read() method takes as its parameter the maximum number of characters to read, not a file position. In many environments, the position is not a simple character count: VMS and Windows both come to mind as exceptions. VMS imposes a record structure which uses many hidden bits of metadata within the file, and file positions are "magic cookies".

The only filesystem-independent way to get the right value is to use QTextStream::pos() when the file is already positioned to the correct place, and then keep reading until the file position returns to the same location.

(Redacted because there was an initially unspecified requirement prohibiting multiple allocations to buffer the text.)
However, given the program's requirements, there is no sense in rereading the first part of the file. Start saving text at the beginning and stop when the string is found:

QString out;
do {
    QString temp = in.readLine();
    int p = temp.indexOf("something");
    if (p < 0) {
        // readLine() strips the line ending, so put a newline back
        out += temp + "\n";
    } else {
        // keep only the text before the match
        out += temp.left(p);
        break;
    }
} while (!in.atEnd());

cout << out.toStdString() << endl;

Since you are on Windows, text file processing translates '\r\n' into '\n', and that causes a mismatch between file positioning and character counting. There are several ways to work around this, but perhaps the simplest is to process the file as binary (that is, to drop the QFile::Text flag) to prevent the translation:

file.open(QFile::ReadOnly);

Then the code should work as expected. It doesn't do any harm to output \r\n in Windows, but sometimes can cause nuisance displays when using Windows' text utilities. If that is important, search and replace \r\n with \n once the text is in memory.
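The in-memory replacement mentioned above could look like the following minimal sketch. It is written in plain standard C++ on a std::string rather than a QString, and the function name crlfToLf is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Returns a copy of `raw` with every "\r\n" pair replaced by a single '\n'.
// A lone '\r' that is not followed by '\n' is left untouched.
std::string crlfToLf(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '\r' && i + 1 < raw.size() && raw[i + 1] == '\n')
            continue;  // drop the '\r'; the '\n' is copied on the next iteration
        out += raw[i];
    }
    return out;
}
```

Doing the replacement only after the text has been read keeps the file positions and character counts in agreement while reading, and cleans up the line endings just before display.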

wallyk answered Oct 21 '22 16:10