Remove first line from a file [duplicate]

Possible Duplicate:
Removing the first line of a text file in C#

What would be the fastest and smartest way to remove the first line from a huge (think 2-3 GB) file?

  • I think you probably can't avoid rewriting the whole file chunk by chunk, but I might be wrong.

  • Could using memory-mapped files somehow help to solve this issue?

  • Is it possible to achieve this by operating directly on the file system (NTFS, for example), say by updating the corresponding inode data and changing the file's starting sector so that the first line is ignored? If so, would this approach be really fragile, or are there many other applications besides the OS itself that do something similar?

Yippie-Ki-Yay asked Jul 19 '12 18:07



2 Answers

NTFS by default on most volumes (but importantly not all!) stores data in 4096-byte clusters. These are referenced by the file's $MFT record, which you cannot edit directly because the operating system disallows it (for reasons of sanity). As a result, there is no filesystem-level trick that does what you want. In other words, you cannot truncate a file from the front on NTFS, even in cluster-sized amounts.

Because of the way files are stored in a filesystem, the only answer is that you must rewrite the entire file, or figure out a different way to store your data. A 2-3 GB file is massive and crazy, especially since you refer to lines, which means the data is at least partly text.

You should perhaps look into putting this data into a database, or at the very least organizing it a bit more efficiently.
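
For what it's worth, the rewrite described above might look roughly like the following C++17 sketch. The function name `drop_first_line`, the `.tmp` suffix and the 1 MiB buffer are made up for illustration, and it assumes there is enough free disk space for a second copy of the file.

```cpp
#include <filesystem>
#include <fstream>
#include <limits>
#include <system_error>
#include <vector>

// Copies everything after the first '\n' of `path` into a temporary file,
// then renames the temporary file over the original.
// Returns false if a file could not be opened or the copy/rename failed.
bool drop_first_line(const std::filesystem::path &path) {
    std::filesystem::path tmp = path;
    tmp += ".tmp";  // hypothetical temporary name next to the original

    {
        std::ifstream in(path, std::ios::binary);
        if (!in) return false;
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;

        // Skip up to and including the first newline.
        in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

        // Copy the remainder in fixed-size chunks; the last chunk may be short.
        std::vector<char> buf(1 << 20);
        while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) ||
               in.gcount() > 0) {
            out.write(buf.data(), in.gcount());
        }
        out.flush();
        if (!out) return false;
    }  // both streams are closed here, before the rename

    std::error_code ec;
    std::filesystem::rename(tmp, path, ec);
    return !ec;
}
```

The copy still touches every byte after the first line, so on a 2-3 GB file the running time is bounded by disk throughput rather than by anything clever in the code.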

OmnipotentEntity answered Sep 19 '22 04:09



You can overwrite every character that you want to erase with '\x7f'. Then, when reading in the file, your reader ignores that character. This assumes you have a text file that doesn't ever use the DEL character, of course.

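    #include <istream>
    #include <string>

    // Reads one line, then strips every run of the DEL marker from it.
    std::istream & my_getline (std::istream &in, std::string &s,
                               char del = '\x7f', char delim = '\n') {
        std::getline(in, s, delim);
        std::size_t beg = s.find(del);
        while (beg != s.npos) {
            // end is npos when the run reaches the end of the line;
            // erase then removes everything from beg onward.
            std::size_t end = s.find_first_not_of(del, beg+1);
            s.erase(beg, end-beg);
            beg = s.find(del, beg+1);
        }
        return in;
    }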

As Henk points out, you could choose a different character to act as your DELETE marker. The advantage of this technique is that it works no matter which line you want to remove (it is not limited to the first line), and it doesn't require futzing with the file system.

Using the modified reader, you can periodically "defragment" the file. Or, the defragmentation may occur naturally as the contents are streamed/merged into a different file or archived to a different machine.

Edit: You don't explicitly say it, but I am guessing this is for some kind of logging application, where the goal is to put an upper bound on the size of the log file. If that is the goal, it is much easier to just use a collection of smaller log files. Let's say you maintain log files of roughly 10 MB each, with the total bounded to 4 GB. That would be about 400 files. Once the 401st file is started, for each line written there you could use the DELETE marker on successive lines in the first file. When all lines have been marked for deletion, the file itself can be deleted, leaving you with about 400 files again. There is no hidden O(n²) behavior so long as the first file is not closed while the lines are being deleted.

But easier still is to let your logging system keep both the 1st and the 401st file as is, and remove the 1st file when moving on to the 402nd file.

jxh answered Sep 19 '22 04:09
