I have a file with n lines (n is above 100 million).
I want to output a file containing only 1 line in 10. I can't just split the file into ten parts and keep one of them, because the sample must be a little more random than that: I have to run a statistical analysis on it later, and I can't afford to introduce a strong bias into the data.
I was thinking of reading the file and, for each record, outputting it if the record number mod 10 is 0.
The constraints are:
It's a Windows (likely hardened) computer, possibly XP, Vista or Windows Server 2003.
No development tools are available.
No network, USB or CD-ROM; in other words, no external communication.
Therefore I was thinking of a Windows batch file (I can't assume PowerShell, and VBScript is likely to have been removed). At the moment I am looking at the FOR /F command, but I am not an expert and I don't know how to achieve this.
Thank you Paul for your answer. I reformatted the answer (with Hosam's help) to put it in a batch file:
@echo off
setlocal
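rem Number every line, then keep only the lines whose number ends in 0.
findstr /N . inputFile | findstr /R "^[0-9]*0:" > temporaryFile
rem Strip the number and colon, writing all kept lines to outputFile.
(FOR /F "tokens=1,* delims=:" %%i in (temporaryFile) do echo %%j) > outputFile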
Thanks quux and Pax for the similar alternative solution. However, after a quick test on a larger file, Paul's answer is roughly 8 times faster. I guess the evaluation (in SET) is kind of slow, even if the logic seems brilliant.
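For readers who have not seen it, a counter-based approach of that kind can be sketched roughly as follows. This is my own reconstruction, not quux's or Pax's actual code; inputFile and outputFile are the same placeholder names as in the script above, and the variable names count and rem10 are made up for illustration.

```batch
@echo off
setlocal EnableDelayedExpansion
rem Sketch only: count lines with SET /A and keep each line whose
rem counter is a multiple of 10. FOR /F skips empty lines, so they
rem are not counted here either.
set /a count=0
(
  for /f "usebackq delims=" %%L in ("inputFile") do (
    set /a "count+=1, rem10=count %% 10"
    if !rem10! equ 0 echo(%%L
  )
) > outputFile
```

Two caveats with this sketch: lines containing an exclamation mark get mangled because delayed expansion is enabled, and FOR /F also skips lines that start with a semicolon by default. The per-line SET /A evaluation is presumably where the time goes.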
Ok, I think I've cracked it:
findstr /N . path-to-log-file | findstr /R "^[0-9]*0:"
(Use findstr to add the line number to the beginning of each line, then again to print only the lines whose number ends in zero. The pattern is quoted because cmd otherwise treats the caret as an escape character and drops it, which would let the pattern match "0:" anywhere in a line.)
So you'll get one line in 10, but with the line number and a colon prepended to each line.
If I can think of a way of stripping that out using command-line tools only, I'll edit this answer :)
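A quick way to convince yourself of what the filter keeps is to try it on a small generated file first. The sketch below is only an illustration; sample.txt is a hypothetical scratch file name. The pattern is quoted so that cmd passes the caret through to findstr instead of treating it as an escape character.

```batch
@echo off
rem sample.txt is a hypothetical scratch file with 30 numbered records.
(for /l %%n in (1,1,30) do echo record %%n) > sample.txt
rem Number the lines, then keep only the numbers ending in 0.
findstr /N . sample.txt | findstr /R "^[0-9]*0:"
```

Run from a batch file, this should print the three lines 10:record 10, 20:record 20 and 30:record 30.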
Remove the line number and colon with:
FOR /F "tokens=1,* delims=:" %i in (file-with-linenumbers) do echo %j
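To see how a FOR /F option string of tokens=1,* with delims=: splits a numbered line, here is a small sketch on a single made-up line. The sample text is purely hypothetical, and the percent signs are doubled because this version is meant to be run from a batch file.

```batch
@echo off
rem The quoted string is a hypothetical line as numbered by findstr /N.
for /f "tokens=1,* delims=:" %%i in ("20:12:30:45 server started") do (
  echo number=%%i
  echo text=%%j
)
```

This should print number=20 followed by text=12:30:45 server started: the first token is the line number, and the asterisk hands everything after it to the next variable unchanged, including any further colons in the original line.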
Paul.