I have about 500GB of text files, separated by month. In these text files the first 43 lines are just connection information (not needed). The next 75 lines are descriptors for an observation. This is followed by 4 lines (not needed), then the next observation, which is again 75 lines.
The thing is, all I want are these 75 lines (the descriptors are in the same place for every observation), which look like this:
ID: 5523
Date: 20052012
Mixed: <Null>
.
.
And I want to change it to CSV format, 5523;20052012;;..., for each observation, so that I end up with much smaller text files. Since the descriptors are always the same, I'll know that the first position, for example, is the ID.
Once I finish with one text file I'll open the next one and append its observations (or would creating a new file be quicker?).
What I've done so far is quite inefficient: I've been opening the file, loading it, and deleting these observations line by line. If it's taking a fair while with a test sample, it clearly isn't the best method.
Any suggestions would be great.
You said that you have "about 500GB of text files." If I understand correctly, you don't have a fixed length for each observation (note, I'm not talking about the number of lines, I mean the total length, in bytes, of all of the lines for an observation). This means that you will have to go through the entire file, because you can't know exactly where the newlines are going to be.
Now, depending on how large each individual text file is, you may need to look for a different answer. But if each file is sufficiently small (less than 1 GB?), you might be able to use the linecache module, which handles the seeking-by-line for you.
You'd use it perhaps like this:
import linecache

filename = 'observations1.txt'
# Skip the 43 lines of connection information; the first observation starts at line 44
curline = 44
lines = []
# Keep looping until no return string is found.
# getline() never throws errors, but returns an empty string ''
# if the line wasn't found (if the line was actually empty, it would have
# returned the newline character '\n').
while linecache.getline(filename, curline):
    # Collect the 75 descriptor lines of one observation
    for _ in range(75):
        lines.append(linecache.getline(filename, curline).rstrip())
        curline += 1
    # Perform work with the set of observation lines
    add_to_observation_log(lines)
    # Skip the 4 unnecessary lines and reset the lines list
    curline += 4
    lines = []
I tried a test of this, and it chewed through a 23MB file in five seconds.
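In the snippet above, add_to_observation_log is just a placeholder. A minimal sketch of what it might do, assuming every descriptor has the "Key: Value" form shown in the question and that <Null> should become an empty field (the output file name is an assumption):

output = open('observations1.csv', 'w')  # hypothetical output file

def add_to_observation_log(lines):
    values = []
    for line in lines:
        # Keep only the part after the first ': ' (e.g. 'ID: 5523' -> '5523')
        value = line.partition(': ')[2]
        # Treat the <Null> marker as an empty CSV field
        values.append('' if value == '<Null>' else value)
    output.write(';'.join(values) + '\n')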
"opening the file. Loading it. Deleting these observations going line by line."
What do you mean by "loading it"? If you mean reading the entire thing into a string, then yes, that is going to suck. The natural way to handle the file is to take advantage of the fact that the file object is an iterator over the lines of the file:
for line in file:
    if should_use(line): do_something_with(line)
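Applied to your layout (43 connection-info lines, then repeating blocks of 75 descriptor lines followed by 4 junk lines), a streaming version might look like the sketch below. The file names and the <Null> handling are assumptions, not part of the pattern above:

import itertools

with open('observations1.txt') as src, open('observations1.csv', 'w') as dst:
    # Throw away the 43 lines of connection information
    for _ in itertools.islice(src, 43):
        pass
    while True:
        # Read the 75 descriptor lines of one observation
        block = list(itertools.islice(src, 75))
        if not block:
            break  # end of file
        # Keep only the value after 'Key: ', turning '<Null>' into an empty field
        values = [line.partition(': ')[2].rstrip() for line in block]
        dst.write(';'.join('' if v == '<Null>' else v for v in values) + '\n')
        # Skip the 4 unneeded lines between observations
        for _ in itertools.islice(src, 4):
            pass

Because this never holds more than one observation in memory at a time, it should cope with files of any size, and appending each month's output to one CSV (open with mode 'a') versus writing a new file per month shouldn't make a noticeable difference to speed.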