
Remove recurring lines from text file with enhanced performance

I have the following performance issue concerning large text file input (~500k lines) and subsequent data parsing.

Consider a text file data.txt with the following example structure. Its peculiarity is that the two header lines can reappear anywhere in the file:

Name Date Val1 val2
--- ------- ---- ----
BA 2013-09-07 123.123 1232.22
BA 2013-09-08 435.65756 2314.34
BA 2013-09-09 234.2342 21342.342

The code I wrote, which works, is the following:

%# Read in file using textscan, read all values as string

inFile = fopen('data.txt','r');
DATA = textscan(inFile, '%s %s %s %s');
fclose(inFile);

%# Remove the header lines everywhere in DATA:
%# Search indices of the first entry in first cell, i.e. 'Name', and remove 
%# all lines corresponding to those indices

[iHeader,~] = find(strcmp(DATA{1},DATA{1}(1)));
for i=1:length(DATA)
    DATA{i}(iHeader)=[];
end

%# Repeat again, the first entry corresponds now to '---'

[iHeader,~] = find(strcmp(DATA{1},DATA{1}(1)));
for i=1:length(DATA)
    DATA{i}(iHeader)=[];
end

%# Now convert the cells for column Val1 and Val2 in data.txt to doubles
%# since they have been read in as strings:

for i=3:4
    [A] = cellfun(@str2double,DATA{i});
    DATA{i} = A;
end

I chose to read in everything as a string in order to be able to remove the header lines everywhere in DATA.

Timing the code tells me that the slowest part is the conversion [A] = cellfun(@str2double,DATA{i}), although str2double is already the faster choice compared to str2num. The second slowest part is textscan.

The question is now, is there a faster way to deal with this problem?

Please let me know if I should clarify further. And forgive me if there is a very obvious solution I haven't seen; I have only been working with MATLAB for three weeks.

Lukas asked Oct 03 '13 13:10

1 Answer

You can use a textscan option called CommentStyle to skip part of your file (the two repeated header lines in your case) and read the file in a single function call.

As the documentation says, CommentStyle can be used in two ways: a single string such as '%' to ignore the characters following that string on the same line, or a cell array of two strings, such as {'/*', '*/'}, to ignore the characters between the two strings (including line endings). We will use the second form here and skip everything between Name and the end of the dashed line. Because the closing string consists of repeated - characters, we need to specify the whole dashed line rather than a single -. With the headers skipped, the two value columns can be read directly as numbers with %f, so no str2double conversion is needed afterwards.

inFile = fopen('data.txt','r');
DATA = textscan(inFile, '%s %s %f %f', ...
      'CommentStyle', {'Name';'--- ------- ---- ----'});
fclose(inFile);
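
If the value columns have to stay as strings for some other reason, note that str2double accepts a whole cell array of strings in one call, which avoids the per-element overhead of cellfun. A minimal sketch, using the sample values from the question (the variable names valStrings and vals are only illustrative):

```matlab
% Sample strings as they would appear in DATA{3} after reading with '%s'
valStrings = {'123.123'; '435.65756'; '234.2342'};

% One vectorized call on the whole cell array instead of
% cellfun(@str2double, valStrings)
vals = str2double(valStrings);

% Entries that cannot be parsed (e.g. a leftover header token) become NaN
mixed = str2double({'1.5'; 'Val1'});
```

The result is a numeric array of the same size as the input cell array.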

You can convert a date string into a meaningful number using datenum.

DATA_date = datenum(DATA{2});
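
datenum also takes an optional format argument. Passing it explicitly spares MATLAB from guessing the format and is likely to help on a file of this size. A small sketch, assuming the dates follow the yyyy-mm-dd pattern of the sample data (dateStrings stands in for DATA{2}):

```matlab
% Date strings as they would appear in DATA{2}
dateStrings = {'2013-09-07'; '2013-09-08'; '2013-09-09'};

% Explicit format: yyyy = year, mm = month, dd = day
dateNums = datenum(dateStrings, 'yyyy-mm-dd');

% Serial date numbers count days, so consecutive dates differ by 1
dayGaps = diff(dateNums(:));
```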

marsei answered Oct 03 '22 00:10