I have a big file containing 30 million user IDs. The file looks something like this, with one user ID per line:
149905320
1165665384
66969324
886633368
1145241312
286585320
1008665352
1135545396
186217320
132577356
Now I want to pick a random line from that big text file. I know the total number of user IDs in it, but I am not sure what the best way is to select random elements from such a large file. I considered loading all 30 million user IDs into a set and then picking random elements from the HashSet, but that approach fails with an out-of-memory error.
That is why I am trying to select random elements directly from the big text file.
final String id = generateRandomUserId(random);
/**
 * Selects a random element from a big text file.
 *
 * @param r source of randomness
 * @return a randomly chosen user ID
 */
private String generateRandomUserId(Random r) {
File bigFile = new File("C:\\bigfile.txt");
//randomly select elements from a big text file
}
What is the best way to do this?
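You could do it like so:
1. Get the size of the file (via File.length() or RandomAccessFile.length())
2. Pick a random number between 0 and the file size
3. Seek to that byte in the file (file.seek(number))
4. Move forward to the byte just after the next \n character (for example by reading one byte at a time)
5. Read the line (file.readLine())
...for example.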
This way you don't have to store anything.
A theoretical sample snippet could look like this (it has some side effects):
File f = new File("D:/abc.txt");
// try-with-resources closes the file even if an exception is thrown
try (RandomAccessFile file = new RandomAccessFile(f, "r")) {
    long fileSize = file.length();
    long chosenByte = (long) (Math.random() * fileSize);
    file.seek(chosenByte);
    // Skip the rest of the line we landed in; readLine() handles \n, \r and \r\n
    file.readLine();
    // The file pointer is now at the start of the next line
    String s = file.readLine();
    if (s == null) {
        // We landed in the last line: wrap around to the first one
        file.seek(0);
        s = file.readLine();
    }
    if (s != null && !s.trim().isEmpty()) {
        // Long avoids overflow if an ID ever exceeds Integer.MAX_VALUE
        long chosen = Long.parseLong(s.trim());
        System.out.println("Chosen id : \"" + chosen + "\"");
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
EDIT: Full working (theoretically) class:
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ThreadLocalRandom;

public class Main {
    /**
     * Prints the ID on the line that contains a randomly chosen byte of the file.
     * Assumes one numeric ID per line in a single-byte encoding such as ASCII.
     * Lines are picked in proportion to their length in bytes, so the choice
     * is only approximately uniform.
     *
     * @param args unused
     * @throws IOException if the file cannot be read or holds no ID at the chosen spot
     */
    public static void main(String[] args) throws IOException {
        File f = new File("D:/abc.txt");
        try (RandomAccessFile file = new RandomAccessFile(f, "r")) {
            long fileSize = file.length();
            if (fileSize == 0) throw new IOException("File is empty");
            // Let's start: pick a byte position in [0, fileSize)
            long pos = ThreadLocalRandom.current().nextLong(fileSize);
            // If we landed on a line terminator, step back onto the line it ends
            while (pos > 0 && isLineBreak(file, pos)) --pos;
            // Walk left until the previous byte is a line terminator (or we reach the file start)
            while (pos > 0 && !isLineBreak(file, pos - 1)) --pos;
            // Read the whole line from its first byte
            file.seek(pos);
            String line = file.readLine();
            if (line == null || line.trim().isEmpty()) {
                throw new IOException("No ID found at offset " + pos);
            }
            // Parse ID
            long chosenId = Long.parseLong(line.trim());
            System.out.println("Chosen id : " + chosenId);
        }
    }

    /** Returns true if the byte at the given position is \n or \r. */
    private static boolean isLineBreak(RandomAccessFile file, long pos) throws IOException {
        file.seek(pos);
        int b = file.read();
        return b == '\n' || b == '\r';
    }
}
I hope this is not too wrong as an implementation (I write more C++ these days)...
Instead of storing the user IDs in a hash set, you could parse the file once and store just the byte offset of each line in an int[] array; 30 million entries would take about 120 MB of RAM.
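The offset idea could be sketched roughly as follows. This is only an illustration under some assumptions: one ID per line, lines ending in \n (optionally preceded by \r), and a file smaller than 2 GB so that every offset fits in an int. The class name LineOffsetIndex and its methods are made up for this example.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper: remembers where each line starts, then reads one line on demand. */
public class LineOffsetIndex {
    private final String path;
    private final int[] offsets; // offsets[i] = byte position of the first byte of line i

    public LineOffsetIndex(String path) throws IOException {
        this.path = path;
        int[] buf = new int[1024];
        int count = 0;
        long pos = 0;
        boolean atLineStart = true;
        // Single sequential pass over the file; only the offsets are kept in memory
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (count == buf.length) buf = Arrays.copyOf(buf, count * 2);
                    buf[count++] = (int) pos; // assumption: file is smaller than 2 GB
                    atLineStart = false;
                }
                if (b == '\n') atLineStart = true;
                pos++;
            }
        }
        offsets = Arrays.copyOf(buf, count);
    }

    public int lineCount() {
        return offsets.length;
    }

    /** Picks a line index uniformly at random and reads just that line from disk. */
    public String randomLine(Random r) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offsets[r.nextInt(offsets.length)]);
            return file.readLine(); // strips the trailing \n or \r\n
        }
    }
}
```

Because the random choice is made over line indexes rather than bytes, every line is equally likely regardless of its length. Blank lines in the file would be indexed as lines too, so the input is assumed to have none.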
Alternatively, if you can change or preprocess the file in some way, you could convert it to a fixed-width format by padding the user IDs, or use a binary format.
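As a rough sketch of the binary variant: a one-time pass could rewrite the text file as 8-byte records, after which a random ID is a single seek and read. The class name FixedWidthIds, its method names, and the choice of 8 bytes per ID are assumptions made for this example, not something the original file format dictates.

```java
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

/** Hypothetical helper: stores each ID as a fixed-size binary record. */
public class FixedWidthIds {
    static final int RECORD_SIZE = Long.BYTES; // 8 bytes per ID

    /** One-time preprocessing: writes every ID of the text file as a big-endian long. Returns the record count. */
    static long convert(String textPath, String binPath) throws IOException {
        long count = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(textPath));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(binPath)))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue; // skip blank lines
                out.writeLong(Long.parseLong(line));
                count++;
            }
        }
        return count;
    }

    /** Picks a record index at random and reads only that record. */
    static long randomId(String binPath, Random r) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(binPath, "r")) {
            long records = file.length() / RECORD_SIZE;
            long index = (long) (r.nextDouble() * records);
            file.seek(index * RECORD_SIZE);
            return file.readLong(); // same big-endian layout that writeLong produced
        }
    }
}
```

Since every record has the same size, the position of record i is simply i * RECORD_SIZE, so no index has to be kept in memory and each ID is equally likely to be picked.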