Sorry if the title doesn't match my question well, I'm still unsure as to how I should put it.
Anyway, I've been using Tcl/Tk on Windows (wish) for a while now and hadn't encountered any problem with the script I wrote until recently. The script is supposed to break down a large text file into smaller files that can be imported into Excel (I'm talking about a file with maybe 25M lines, which comes to around 2.55 GB).
My current script is something like this:
set data [open "file.txt" r]
set data1 [open "File Part1.txt" w]
set data2 [open "File Part2.txt" w]
set data3 [open "File Part3.txt" w]
set data4 [open "File Part4.txt" w]
set data5 [open "File Part5.txt" w]
set count 0
while {[gets $data line] != -1} {
    if {$count > 4000000} {
        puts $data5 $line
    } elseif {$count > 3000000} {
        puts $data4 $line
    } elseif {$count > 2000000} {
        puts $data3 $line
    } elseif {$count > 1000000} {
        puts $data2 $line
    } else {
        puts $data1 $line
    }
    incr count
}
close $data
close $data1
close $data2
close $data3
close $data4
close $data5
And I alter the numbers within the if to get the desired number of lines per file, or add/remove any elseif where required.
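As a rough sketch of the same idea, the if/elseif chain could be replaced by a single lines-per-file parameter, so the thresholds don't need editing by hand. The proc name splitFile, its parameters, and the output prefix are made up for illustration; this only changes how the output file is chosen, not how the input is read.

```tcl
# Sketch: split a text file into numbered parts of at most $linesPerFile lines.
# splitFile, linesPerFile and prefix are illustrative names, not from the post.
proc splitFile {inPath linesPerFile {prefix "File Part"}} {
    set in [open $inPath r]
    set count 0
    set part 0
    set out ""
    while {[gets $in line] >= 0} {
        # Start a new output file every $linesPerFile lines.
        if {$count % $linesPerFile == 0} {
            if {$out ne ""} { close $out }
            incr part
            set out [open "$prefix$part.txt" w]
        }
        puts $out $line
        incr count
    }
    if {$out ne ""} { close $out }
    close $in
    # Return the number of part files written.
    return $part
}
```

Calling splitFile "file.txt" 1000000 would write "File Part1.txt", "File Part2.txt", and so on, each holding up to one million lines.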
The problem is, with the latest file I got, I end up with only about half the data (1.22 GB instead of 2.55 GB) and I was wondering if there was a line which told Tcl to ignore the limit that it can read. I tried to look for it, but I didn't find anything (or anything that I could understand well; I'm still quite the amateur at Tcl ^^;). Can anyone help me?
EDIT (update): I found a program to open large text files and managed to get a preview of the contents of the file directly. There are actually 16,756,263 lines. I changed the script to:
set data [open "file.txt" r]
set data1 [open "File Part1.txt" w]
set count 0
while {[gets $data line] != -1} {
    incr count
}
puts $data1 $count
close $data
close $data1
to find where the script was stopping, and it stopped here:
There's a character in the middle line that the text editor doesn't recognise, showing as a little square. I tried to use fconfigure like evil otto suggested, but I'm afraid I don't quite understand how the channelID, name or value work exactly to escape that character. Um... help?
reEDIT: I managed to find out how fconfigure worked! Thanks evil otto! Um, I'm not sure how I can 'choose' your answer since it's a comment instead of a proper answer...
The Tcl file commands include file, open, close, gets, read, puts, seek, tell, eof, fblocked, fconfigure, flush, and fileevent. One way to get file data in Tcl is to 'slurp' up the file into a text variable. This works really well if the files are known to be small.
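A minimal sketch of the 'slurp' approach, assuming the file is small enough to fit comfortably in memory (so not the multi-gigabyte file from the question). The helper name slurpLines is made up for illustration.

```tcl
# Read a whole (small) file into memory and return it as a list of lines.
# slurpLines is an illustrative helper name, not a built-in command.
proc slurpLines {path} {
    set chan [open $path r]
    # -nonewline drops the final newline so split yields no trailing empty item.
    set contents [read -nonewline $chan]
    close $chan
    return [split $contents "\n"]
}
```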
r: Open the file for reading only; the file must already exist. This is the default value if access is not specified.
r+: Open the file for both reading and writing; the file must already exist.
Is it possible there is any binary data in "file.txt"? Under Windows, Tcl will flag EOF if it reads a ^Z (Ctrl-Z, character 0x1A, the default eofchar) in a file. You can turn this off with fconfigure:
fconfigure $data -eofchar {}
See the docs for full details.
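For context, here is a small sketch of where that call fits in a line-counting loop like the one in the question. In the general form fconfigure channelId name value, the channelId is whatever open returned, the name is the option (-eofchar), and the value is an empty list, meaning no character is treated as end of file. The helper name countLines is made up for illustration.

```tcl
# Count the lines in a file without stopping at an embedded Ctrl-Z (\x1a).
# countLines is an illustrative helper name, not part of Tcl.
proc countLines {path} {
    set chan [open $path r]
    # channelId = $chan, option name = -eofchar, value = {} (no EOF character).
    fconfigure $chan -eofchar {}
    set count 0
    while {[gets $chan line] >= 0} {
        incr count
    }
    close $chan
    return $count
}
```

The call has to come after open and before the first gets, since it changes how subsequent reads on that channel behave.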