I am preparing for a phone interview. I came upon these questions on the internet. Can anyone tell me some good answers for these?
Suppose I give you a text file and ask you a to write a program that will return a random line from the file (all lines must have equal probability to be returned)
Same as part 1, except this time the entire text file cannot fit into main memory
Same as part 2, except now you have a stream instead of a file.
Please help.
Ok...@Everyone, I really had a few ideas in my mintod before asking this...Seeing the relentless attack by my fellow SOers, I am posting my answers. Please feel free to attack them too...
1: Count the number of '\n' in the file. Generate a random number between 1 and the number and return the line after the number-1 '\n'.
2: Bring the file into main memory part by part and follow the above procedure.
3: I dont have much idea about this and would appreciate any inputs.
Its wonderful that you guys really give an inspiration to push forward.....
Read all lines into an array, return a random line in the range of 1 and the amount of lines.
Most simple: Count the lines, choose a line number at random, go through the file a second time and return it.
You just have to remember one line. Each new line has a probability of 1/N (N being lines read).
Pseudocode:
i = 1
chosen_line = ""
for line in lines:
if random() < 1/i: # random returns a uniform random number in [0,1)
chosen_line = line
i += 1
return chosen_line
Algorithm number 3 could be used for 1 and 2 too.
You can do this without having to read all the lines in memory, thus working well for huge files. Pseudocode:
linenum := 0
ret := ''
while more lines to read:
line := readline()
linenum := linenum + 1
r := uniform_random(0, linenum)
if r < 1:
ret := line
return ret
Proof: We begin by noting that we always save the first line in ret
. If the file has one line, you are going to choose it, and you're done.
For two-line file, ret
will save the first line 100% of the times, and the second line will be saved in ret
50% of the time during the second iteration of the loop. Thus, each line has a probability of 0.5 of being selected.
Now, let's assume that this method works for files of ≤N
lines. To prove that this means it works for N+1
, in the (N+1)
th iteration of the loop, there is a probability of 1/(N+1)
that the last line will be selected (random(0, N+1) < 1
has that probability). Thus, the last line has 1/(N+1)
probability of being selected. The probability of all other lines being selected is still going to be equal to each other, let's call this x
. Then, N*x + 1/(N+1) == 1
, which means that x = 1/(N+1)
.
Proof by induction is complete.
Edit: Oops, didn't see the first answer's third method before replying. Still, I will keep this post here if only for the proof, and the opportunity for other people to correct it if there are any errors in it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With