
Quickest way of importing a 500GB text file, taking only the sections wanted

I have about 500GB of text files, separated by month. In these text files the first 43 lines are just connection information (not needed). The next 75 lines are the descriptors for an observation. This is followed by 4 lines (not needed), then the next observation, which is again 75 lines.

The thing is, all I want are those 75 lines (the descriptors are in the same place for every observation), which look like this:

ID: 5523
Date: 20052012
Mixed: <Null>
.
.

And I want to change each observation to CSV format, 5523;20052012;;..., so that I end up with much smaller text files. Since the descriptors are always in the same order, I'll know that the first position, for example, is ID.

Once I finish with one text file I'll open the next one and append to the same output (or would creating a new file be quicker?).

What I've done so far is quite inefficient: I've been opening the file, loading it, and deleting the unwanted lines while going through it line by line. If it's taking a fair while on a test sample, it clearly isn't the best method.

Any suggestions would be great.

asked May 20 '12 by FancyDolphin



2 Answers

You said that you have "about 500GB of text files." If I understand correctly, you don't have a fixed length for each observation (note, I'm not talking about the number of lines, I mean the total length, in bytes, of all of the lines for an observation). This means that you will have to go through the entire file, because you can't know exactly where the newlines are going to be.

Now, depending on how large each individual text file is, you may need to look for a different answer. But if each file is sufficiently small (less than 1 GB?), you might be able to use the linecache module, which handles the seeking-by-line for you.

You'd use it something like this:

import linecache

filename = 'observations1.txt'

# Start at 44th line
curline = 44
lines = []

# Keep looping until no return string is found
# getline() never throws errors, but returns an empty string ''
# if the line wasn't found (if the line was actually empty, it would have
# returned the newline character '\n')
while linecache.getline(filename, curline):
    for _ in range(75):
        lines.append(linecache.getline(filename, curline).rstrip())
        curline += 1

    # Perform work with the set of observation lines
    add_to_observation_log(lines)

    # Skip the unnecessary section and reset the lines list
    curline += 4
    lines = []

I tried a test of this, and it chewed through a 23MB file in five seconds.
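
The add_to_observation_log() call above is just a placeholder. As a rough sketch of what it might look like for the format in the question, assuming every descriptor line has the form Key: Value, that <Null> should become an empty field, and that the output file name (observations1.csv here) is up to you:

def add_to_observation_log(lines, out_path='observations1.csv'):
    # Turn one observation's "Key: Value" descriptor lines into a
    # single semicolon-separated row, e.g. "5523;20052012;;...".
    values = []
    for line in lines:
        # Split on the first colon only; the key name is discarded,
        # since the descriptors are always in the same order.
        _, _, value = line.partition(':')
        value = value.strip()
        # Treat <Null> as an empty field
        values.append('' if value == '<Null>' else value)
    with open(out_path, 'a') as out:
        out.write(';'.join(values) + '\n')

Opening the output file once per observation keeps the sketch simple; in practice you would probably open it once outside the loop and pass the file object in.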

answered Oct 30 '22 by voithos


opening the file, loading it, and deleting the unwanted lines while going through it line by line

What do you mean by "loading it"? If you mean reading the entire thing into a string, then yes this is going to suck. The natural way to handle the file is to take advantage of the fact that the file object is an iterator over the lines of the file:

for line in file:
    if should_use(line): do_something_with(line)
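
Applied to the layout described in the question (43 header lines, then repeating blocks of 75 descriptor lines plus 4 junk lines), that streaming approach might look roughly like the sketch below, using itertools.islice to skip or read fixed numbers of lines. The function name convert_file and the Key: Value / <Null> handling are assumptions, not anything from the original post:

import itertools

def convert_file(in_path, out_path):
    # Stream through the file once; only one 75-line observation
    # is ever held in memory at a time.
    with open(in_path) as src, open(out_path, 'a') as dst:
        # Skip the 43 lines of connection information at the top
        for _ in itertools.islice(src, 43):
            pass
        while True:
            # Read the next observation's 75 descriptor lines
            block = list(itertools.islice(src, 75))
            if len(block) < 75:
                break  # end of file (or a truncated final observation)
            # "ID: 5523" -> "5523", "<Null>" -> empty field
            values = [line.partition(':')[2].strip() for line in block]
            values = ['' if v == '<Null>' else v for v in values]
            dst.write(';'.join(values) + '\n')
            # Skip the 4 unwanted lines between observations
            for _ in itertools.islice(src, 4):
                pass

Because the output is opened in append mode, the same output file can simply be reused for each monthly input file, which also covers the "append or create a new file" part of the question.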
answered Oct 30 '22 by Karl Knechtel