I have about 500GB of text files, separated by month. In these text files the first 43 lines are just connection information (not needed). The next 75 lines are descriptors for an observation. This is followed by 4 lines (not needed), then the next observation, which is again 75 lines.
The thing is, all I want are these 75 lines (the descriptors are in the same place for every observation), which look like this:
ID: 5523
Date: 20052012
Mixed: <Null>
.
.
And I want to change it to CSV format, 5523;20052012;;..., for each observation, so that I end up with much smaller text files. Since the descriptors are always the same, I'll know that the first position, for example, is the ID.
Once I finish with one text file I'll open the next one and append its observations (or would creating a new file be quicker?).
What I've done so far is quite inefficient: I've been opening the file, loading it, and deleting these observations line by line. If it's taking a fair while with a test sample, it clearly isn't the best method.
Any suggestions would be great.
You said that you have "about 500GB of text files." If I understand correctly, you don't have a fixed length for each observation (note, I'm not talking about the number of lines, I mean the total length, in bytes, of all of the lines for an observation). This means that you will have to go through the entire file, because you can't know exactly where the newlines are going to be.
Now, depending on how large each individual text file is, you may need to look for a different answer. But if each file is sufficiently small (less than 1 GB?), you might be able to use the linecache module, which handles the seeking-by-line for you.
You'd use it perhaps like this:
import linecache

filename = 'observations1.txt'
# Skip the 43 lines of connection information; the first observation starts at line 44
curline = 44
lines = []
# Keep looping until no return string is found.
# getline() never throws errors, but returns an empty string ''
# if the line wasn't found (if the line was actually empty, it would have
# returned the newline character '\n').
while linecache.getline(filename, curline):
    # Collect the 75 descriptor lines of one observation
    for _ in range(75):
        lines.append(linecache.getline(filename, curline).rstrip())
        curline += 1
    # Perform work with the set of observation lines
    add_to_observation_log(lines)
    # Skip the 4 unnecessary lines and reset the lines list
    curline += 4
    lines = []
I tried a test of this, and it chewed through a 23MB file in five seconds.
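In the snippet above, add_to_observation_log is just a placeholder. A minimal sketch of what it might do, assuming every descriptor has the "Key: Value" form shown in the question and that <Null> should become an empty field (the output file name is an assumption):

output = open('observations1.csv', 'w')  # hypothetical output file

def add_to_observation_log(lines):
    values = []
    for line in lines:
        # Keep only the part after the first ': ' (e.g. 'ID: 5523' -> '5523')
        value = line.partition(': ')[2]
        # Treat the <Null> marker as an empty CSV field
        values.append('' if value == '<Null>' else value)
    output.write(';'.join(values) + '\n')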
"opening the file. Loading it. Deleting these observations going line by line."
What do you mean by "loading it"? If you mean reading the entire thing into a string, then yes, that is going to suck. The natural way to handle the file is to take advantage of the fact that the file object is an iterator over the lines of the file:
for line in file:
    if should_use(line): do_something_with(line)
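Applied to your layout (43 connection-info lines, then repeating blocks of 75 descriptor lines followed by 4 junk lines), a streaming version might look like the sketch below. The file names and the <Null> handling are assumptions, not part of the pattern above:

import itertools

with open('observations1.txt') as src, open('observations1.csv', 'w') as dst:
    # Throw away the 43 lines of connection information
    for _ in itertools.islice(src, 43):
        pass
    while True:
        # Read the 75 descriptor lines of one observation
        block = list(itertools.islice(src, 75))
        if not block:
            break  # end of file
        # Keep only the value after 'Key: ', turning '<Null>' into an empty field
        values = [line.partition(': ')[2].rstrip() for line in block]
        dst.write(';'.join('' if v == '<Null>' else v for v in values) + '\n')
        # Skip the 4 unneeded lines between observations
        for _ in itertools.islice(src, 4):
            pass

Because this never holds more than one observation in memory at a time, it should cope with files of any size, and appending each month's output to one CSV (open with mode 'a') versus writing a new file per month shouldn't make a noticeable difference to speed.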