Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to randomly delete a number of lines from a big file?

I have a big text file of 13 GB with 158,609,739 lines and I want to randomly select 155,000,000 lines.

I have tried to scramble the file and then cut the 155000000 first lines, but it's seem that my ram memory (16GB) isn't enough big to do this. The pipelines i have tried are:

shuf file | head -n 155000000
sort -R file | head -n 155000000

Now instead of selecting lines, I think is more memory efficient delete 3,609,739 random lines from the file to get a final file of 155000000 lines.

like image 960
Geparada Avatar asked Nov 10 '11 22:11

Geparada


4 Answers

As you copy each line of the file to the output, assess its probability that it should be deleted. The first line should have a 3,609,739/158,609,739 chance of being deleted. If you generate a random number between 0 and 1 and that number is less than that ratio, don't copy it to the output. Now the odds for the second line are 3,609,738/158,609,738; if that line is not deleted, the odds for the third line are 3,609,738/158,609,737. Repeat until done.

Because the odds change with each line processed, this algorithm guarantees the exact line count. Once you've deleted 3,609,739 the odds go to zero; if at any time you would need to delete every remaining line in the file, the odds go to one.

like image 150
Mark Ransom Avatar answered Nov 05 '22 21:11

Mark Ransom


You could always pre-generate which line numbers (a list of 3,609,739 random numbers selected without replacement) you plan on deleting, then just iterate through the file and copy to another, skipping lines as necessary. As long as you have space for a new file this would work.

You could select the random numbers with random.sample E.g.,

random.sample(xrange(158609739), 3609739)
like image 30
Darren Yin Avatar answered Nov 05 '22 22:11

Darren Yin


Proof of Mark Ransom's Answer

Let's use numbers easier to think about (at least for me!):

  • 10 items
  • delete 3 of them

First time through the loop we will assume that the first three items get deleted -- here's what the probabilities look like:

  • first item: 3 / 10 = 30%
  • second item: 2 / 9 = 22%
  • third item: 1 / 8 = 12%
  • fourth item: 0 / 7 = 0 %
  • fifth item: 0 / 6 = 0 %
  • sixth item: 0 / 5 = 0 %
  • seventh item: 0 / 4 = 0 %
  • eighth item: 0 / 3 = 0 %
  • ninth item: 0 / 2 = 0 %
  • tenth item: 0 / 1 = 0 %

As you can see, once it hits zero, it stays at zero. But what if nothing is getting deleted?

  • first item: 3 / 10 = 30%
  • second item: 3 / 9 = 33%
  • third item: 3 / 8 = 38%
  • fourth item: 3 / 7 = 43%
  • fifth item: 3 / 6 = 50%
  • sixth item: 3 / 5 = 60%
  • seventh item: 3 / 4 = 75%
  • eighth item: 3 / 3 = 100%
  • ninth item: 2 / 2 = 100%
  • tenth item: 1 / 1 = 100%

So even though the probability varies per line, overall you get the results you are looking for. I went a step further and coded a test in Python for one million iterations as a final proof to myself -- remove seven items from a list of 100:

# python 3.2
from __future__ import division
from stats import mean  # http://pypi.python.org/pypi/stats
import random

counts = dict()
for i in range(100):
    counts[i] = 0

removed_failed = 0

for _ in range(1000000):
    to_remove = 7
    from_list = list(range(100))
    removed = 0
    while from_list:
        current = from_list.pop()
        probability = to_remove / (len(from_list) + 1)
        if random.random() < probability:
            removed += 1
            to_remove -= 1
            counts[current] += 1
    if removed != 7:
        removed_failed += 1

print(counts[0], counts[1], counts[2], '...',
      counts[49], counts[50], counts[51], '...',
      counts[97], counts[98], counts[99])
print("remove failed: ", removed_failed)
print("min: ", min(counts.values()))
print("max: ", max(counts.values()))
print("mean: ", mean(counts.values()))

and here's the results from one of the several times I ran it (they were all similar):

70125 69667 70081 ... 70038 70085 70121 ... 70047 70040 70170
remove failed:  0
min:  69332
max:  70599
mean:  70000.0

A final note: Python's random.random() is [0.0, 1.0) (doesn't include 1.0 as a possibility).

like image 3
Ethan Furman Avatar answered Nov 05 '22 22:11

Ethan Furman


I believe you're looking for "Algorithm S" from section 3.4.2 of Knuth (D. E. Knuth, The Art of Computer Programming. Volume 2: Seminumerical Algorithms, second edition. Addison-Wesley, 1981).

You can see several implementations at http://rosettacode.org/wiki/Knuth%27s_algorithm_S

The Perlmonks list has some Perl implementations of Algorithm S and Algorithm R that might also prove useful.

These algorithms rely on there being a meaningful interpretation of floating point numbers like 3609739/158609739, 3609738/158609738, etc. which might not have sufficient resolution with a standard Float datatype, unless the Float datatype is implemented using numbers of double precision or larger.

like image 2
sarnold Avatar answered Nov 05 '22 21:11

sarnold