I am currently working on a project where I have a large text file (15+ GB) and I'm trying to run a function on each line of the file. To speed the task along, I am creating 4 threads and attempting to have them read the file at the same time. This is similar to what I have:
#include <fstream>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

using namespace std;

// Each thread reads one line from the shared stream and prints it.
// Nothing synchronizes access to *wordlist here.
void simpleFunction(ifstream *wordlist){
    string word;
    getline(*wordlist, word);
    cout << word << endl;
}

int main(){
    const int max_concurrent_threads = 4;
    ifstream wordlist("filename.txt");
    vector<thread> all_threads;
    for (int i = 0; i < max_concurrent_threads; ++i) {
        all_threads.emplace_back(simpleFunction, &wordlist);
    }
    for (thread &t : all_threads) {
        t.join();
    }
    return 0;
}
The getline function (along with "*wordlist >> word") seems to increment the pointer and read the value in two separate steps, as I regularly get output like this back:
Item1 Item2 Item3 Item2
So I was wondering if there was a way to atomically read a line of the file? Loading it into an array first won't work because the file is too big, and I would prefer not to load the file in chunks at a time.
I couldn't find anything regarding fstream and the atomicity of getline sadly. If there is an atomic version of readline or even a simple way to use locks to achieve what I want, I'm all ears.
Thanks in advance!
Use std::getline() to read a file line by line. The getline() function is the preferred way of reading a file line by line in C++. It reads characters from the input stream until the delimiter character is encountered and stores them in a string.
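In Java, reads and writes are atomic for all variables declared volatile (including long and double variables). C++ volatile gives no such guarantee, and atomic access to a single variable does not make a read from a stream atomic.
For shared variables, C++ offers atomic types (std::atomic) that are safe to read and write from several threads. They are usually lock-free for small types, although an implementation may fall back to an internal lock. A mutex is a separate tool: if one thread acquires the mutex, no other thread can acquire it until that thread releases it.
The GNU C Library manual notes that, in practice, you can assume int and pointer types are atomic on all of the machines that the GNU C Library supports and on all POSIX systems its authors know of. That statement concerns access from signal handlers. It does not make unsynchronized access from several threads safe in C++, and it says nothing about stream operations such as getline.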
The proper way to do this would be to lock the file, which would prevent all other processes from using it. See Wikipedia: File locking. This is probably too slow for you, because you only read one line at a time. But if you were reading, for example, 1000 or 10000 lines during each function call, it could be the best way to implement it.
If no other processes access the file, and it is enough to keep other threads from accessing it at the same time, you can use a mutex that you lock whenever you access the file.
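// Requires #include <mutex> in addition to the headers used in the question.
void simpleFunction(std::ifstream *wordlist){
    static std::mutex io_mutex;  // shared by every thread that calls this function
    std::string word;
    {
        // Hold the lock only while touching the stream.
        std::lock_guard<std::mutex> lock(io_mutex);
        std::getline(*wordlist, word);
    }
    std::cout << word << std::endl;
}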
Another way to implement your program would be to create a single thread that continuously reads lines into memory, while the other threads request single lines from the class that stores them. You would need something like this:
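#include <deque>
#include <mutex>
#include <string>
#include <utility>

class FileReader {
public:
    // This runs in its own thread
    void readingLoop() {
        // read lines to storage, unless there are too many lines already
    }

    // This is called by other threads
    std::string getline() {
        std::lock_guard<std::mutex> lock(storageMutex);
        // return line from storage, and delete it
        if (storage.empty()) {
            return {};  // placeholder: real code would wait or report end of input
        }
        std::string line = std::move(storage.front());
        storage.pop_front();
        return line;
    }

private:
    std::mutex storageMutex;          // guards storage
    std::deque<std::string> storage;  // lines read but not yet handed out
};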