Java - how to efficiently store a large amount of String arrays

Tags: java, csv, lua

I'm trying to load large CSV files (typically 200-600 MB) efficiently in Java, using as little memory as possible while keeping access fast. Currently, the program stores the data in a List of String arrays. This was previously handled by a Lua program that used one table per CSV row and an outer table to hold the row tables.

Below is an example of the memory differences and load times:

  • CSV file - 232 MB
  • Lua - 549 MB in memory - 157 seconds to load
  • Java - 1,378 MB in memory - 12 seconds to load

If I remember correctly, duplicate items in a Lua table exist as a reference to the actual value. I suspect in the Java example, the List is holding separate copies of each duplicate value and that may be related to the larger memory usage.
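One quick way to check this suspicion is to compare two equal fields by identity rather than by value. The sketch below assumes naive comma splitting (no quoted fields); the class name `DuplicateCopies` and the sample rows are made up for illustration and are not from the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DuplicateCopies {
    // Splits each line on commas; quoted fields are NOT handled (assumption).
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(line.split(","));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: both rows end with the value "red".
        List<String[]> rows = parse(Arrays.asList("1,apple,red", "2,cherry,red"));
        String first = rows.get(0)[2];
        String second = rows.get(1)[2];
        System.out.println(first.equals(second)); // true: same characters
        System.out.println(first == second);      // false: two separate String objects
    }
}
```

Each split produces fresh String objects, so a value that appears a million times in the file is stored a million times unless it is deduplicated on the way in.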

Below is some background on the data within the CSV files:

  • Each field consists of a String
  • Specific fields within each row may take one of a fixed set of Strings (e.g. field 3 could be "red", "green", or "blue").
  • There are many duplicate Strings within the content.

Below are some examples of what may be required of the loaded data:

  • Search through all Strings attempting to match with a given String and return the matching Strings
  • Display matches in a GUI table (sortable by field).
  • Alter or replace Strings.
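For reference, the three operations above can be sketched directly against a List of String arrays. This is only an illustration: the class `RowQueries` and its method names are hypothetical, and "match" is assumed to mean exact equality rather than substring or pattern matching.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RowQueries {
    // Returns every row that has at least one field equal to target.
    static List<String[]> findRows(List<String[]> rows, String target) {
        List<String[]> matches = new ArrayList<>();
        for (String[] row : rows) {
            for (String field : row) {
                if (field.equals(target)) {
                    matches.add(row);
                    break; // one hit is enough to include the row
                }
            }
        }
        return matches;
    }

    // Sorts the rows in place by the field at the given column index.
    static void sortByColumn(List<String[]> rows, int column) {
        rows.sort(Comparator.comparing((String[] row) -> row[column]));
    }

    // Replaces every field equal to oldValue and returns how many were changed.
    static int replaceAll(List<String[]> rows, String oldValue, String newValue) {
        int replaced = 0;
        for (String[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (row[i].equals(oldValue)) {
                    row[i] = newValue;
                    replaced++;
                }
            }
        }
        return replaced;
    }
}
```

All three are linear scans or in-place sorts, so they work the same whether or not the stored Strings are shared between rows.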

My question - Is there a collection that will require less memory to hold the data yet still offer features to easily and quickly search/sort the data?

user1816198 asked Nov 12 '22 18:11


1 Answer

One easy solution: keep a HashMap that holds a reference to every unique string, and have the ArrayList store only references to the strings already in that HashMap. Equal values then share a single String object instead of each row holding its own copy.

Something like:

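import java.util.HashMap;
import java.util.Map;

// Maps each distinct string to the single instance that should be shared.
private final Map<String, String> hashMap = new HashMap<String, String>();

// Returns the shared instance equal to ns, registering ns if it is new.
public String getUniqueString(String ns) {
   String oldValue = hashMap.get(ns);
   if (oldValue != null) { // assumes there are no null strings in the CSV
      return oldValue;
   }
   hashMap.put(ns, ns);
   return ns;
}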

Simple usage:

// 'a' is an instance of the class that declares getUniqueString
List<String> s = Arrays.asList("Pera", "Zdera", "Pera", "Kobac", "Pera", "Zdera", "rus");
List<String> finS = new ArrayList<String>();
for (String er : s) {
   String ns = a.getUniqueString(er); // shared instance for equal values
   finS.add(ns);
}
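To show how this could fit into the actual loading step, here is a minimal sketch that runs every field through the pool while reading the CSV. The class `PooledCsvLoader`, its method names, and the plain comma splitting are assumptions made for the example, not part of the question's program; a real file with quoted fields would need a proper CSV parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PooledCsvLoader {
    // Maps each distinct field value to the one instance shared by all rows.
    private final Map<String, String> pool = new HashMap<>();

    // Returns the pooled instance equal to s, adding s if it is new.
    String getUniqueString(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    // Reads all lines and replaces each field with its pooled instance.
    List<String[]> load(Reader source) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
                for (int i = 0; i < fields.length; i++) {
                    fields[i] = getUniqueString(fields[i]);
                }
                rows.add(fields);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    // Number of distinct field values seen so far.
    int uniqueCount() {
        return pool.size();
    }
}
```

Usage would look like `new PooledCsvLoader().load(new FileReader(path))`, with `path` pointing at the CSV file. The JDK's `String.intern()` provides similar canonicalization through the JVM's own string table; an explicit map keeps the pool under the program's control, so it can be dropped once loading is finished.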
Igor answered Nov 15 '22 11:11