I work with text files containing short strings (10 digits). Each file is approximately 1.5 GB, so the number of rows reaches 100 million.
Every day I get another file and need to extract new elements (tens of thousands a day).
What's the best approach to solve my problem?
I tried loading the data into an ArrayList - loading takes around 20 seconds per file, but the subtraction of the arrays takes forever.
I use this code:
dataNew.removeAll(dataOld);
I tried loading the data into HashSets - creation of the HashSets is endless. The same goes for LinkedHashSet.
I tried loading into ArrayLists and sorting only one of them with
Collections.sort(dataNew);
but it didn't speed up the process of
dataNew.removeAll(dataOld);
Also, memory consumption is rather high: sort() only finishes with a heap of 15 GB (13 GB is not enough).
I've also tried the good old Linux utility diff, and it finished the task in 76 minutes (while eating 8 GB of RAM).
So, my goal is to solve the problem in Java within 1 hour of processing time (or less, of course) and with memory consumption of at most 15 GB (or better, 8-10 GB).
Any suggestions, please? Maybe I need something other than alphabetic sorting of the ArrayList?
UPDATE: This is a country-wide list of invalid passports. It is published as one global list, so I need to extract the delta myself.
The data is unsorted and each row is unique, so I must compare 100M elements against 100M elements. A data line is, for example, "2404,107263". Converting to integer is not possible.
Interestingly, when I increased the maximum heap size to 16 GB
java -Xms5G -Xmx16G -jar utils.jar
loading into a HashSet became fast (50 seconds for the first file), but the program gets killed by the system's out-of-memory killer, as it eats enormous amounts of RAM while loading the second file into a second HashSet or ArrayList.
My code is very simple:
List<String> setL = Files.readAllLines(Paths.get("filename"));
HashSet<String> dataNew = new HashSet<>(setL);
On the second file, the program gets
Killed
[1408341.392872] Out of memory: Kill process 20538 (java) score 489 or sacrifice child
[1408341.392874] Killed process 20531 (java) total-vm:20177160kB, anon-rss:16074268kB, file-rss:0kB
UPDATE2:
Thanks for all your ideas!
The final solution is: converting the lines to Long + using the fastutil library (LongOpenHashSet).
RAM consumption became 3.6 GB and processing time only 40 seconds!
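For reference, the conversion itself is a one-liner. This is only a simplified sketch of the idea: stripping the comma turns each 10-digit line into a number that overflows int but fits easily in a long.

// Simplified sketch: "2404,107263" becomes 2404107263 once the comma is removed;
// the 10-digit value overflows int but fits comfortably in a long.
long key = Long.parseLong("2404,107263".replace(",", ""));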
An interesting observation: while starting Java with default settings made loading 100 million Strings into the JDK's native HashSet endless (I interrupted it after 1 hour), starting with -Xmx16G sped the loading up to 1 minute. Memory consumption was ridiculous (around 20 GB), but the processing speed was rather fine: 2 minutes.
If you are not limited by RAM, the native JDK HashSet is not so bad in terms of speed.
P.S. Maybe the task is not clearly explained, but I do not see any way to avoid loading at least one file entirely. So, I doubt memory consumption can be lowered much further.
First of all, don't do Files.readAllLines(Paths.get("filename")) and then pass everything to a Set - that holds unnecessarily huge amounts of data in memory at once. Try to hold as few lines as possible at all times.
Read the files line-by-line and process as you go. This immediately cuts your memory usage by a lot.
Set<String> oldData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("oldData"))) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // process your line, maybe add to the Set for the old data?
        oldData.add(line);
    }
}

Set<String> newData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("newData"))) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // Is it enough just to remove from old data so that you'll end up with only the difference between old and new?
        boolean oldRemoved = oldData.remove(line);
        if (!oldRemoved) {
            newData.add(line);
        }
    }
}
You'll end up with two sets containing only the data that is present exclusively in the old or exclusively in the new dataset, respectively.
Second, try to presize your containers if at all possible. Their backing capacity (usually) doubles when they fill up, and every resize means rehashing and copying, which creates a lot of overhead when dealing with big collections.
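For example, a JDK HashSet can be sized up front so it never rehashes while loading; the 100 million figure below is simply the row count mentioned in the question, and 0.75 is the default load factor:

// Presize so the set never rehashes while loading ~100 million lines;
// the initial capacity has to account for the default load factor of 0.75.
int expectedSize = 100_000_000;
Set<String> oldData = new HashSet<>((int) (expectedSize / 0.75f) + 1);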
Also, if your data are numbers, you could just use a long and hold that instead of trying to hold instances of String. There are a lot of collection libraries that enable you to do this, e.g. Koloboke, HPPC, HPPC-RT, GS Collections, fastutil, Trove. Even their collections for Objects might serve you very well, as a standard HashSet involves a lot of unnecessary object allocation.
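To make that concrete, here is a minimal sketch using fastutil's LongOpenHashSet (one of the libraries listed above, and the one mentioned in the question's UPDATE2). It assumes, purely for illustration, that a line like "2404,107263" maps to a single long once the comma is stripped; the class name and file names are placeholders.

import it.unimi.dsi.fastutil.longs.LongOpenHashSet;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PassportDelta {

    // Assumption for this sketch: strip the comma and parse the remainder as a long.
    static long toKey(String line) {
        return Long.parseLong(line.replace(",", ""));
    }

    public static void main(String[] args) throws IOException {
        // Presized for ~100 million entries; the backing storage is essentially one long[].
        LongOpenHashSet oldData = new LongOpenHashSet(100_000_000);
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("oldData"))) {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                oldData.add(toKey(line));
            }
        }

        // Stream the new file and collect only the keys the old set doesn't contain.
        LongOpenHashSet delta = new LongOpenHashSet();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("newData"))) {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                long key = toKey(line);
                if (!oldData.contains(key)) {
                    delta.add(key);
                }
            }
        }

        System.out.println("New elements: " + delta.size());
    }
}

With no boxing and no per-entry node objects, memory usage stays in the low gigabytes, which matches the 3.6 GB reported in the question.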