How do I load 100 million rows into memory?

I need to load 100 million+ rows from a MySQL database into memory. My Java program fails with java.lang.OutOfMemoryError: Java heap space. I have 8 GB of RAM in my machine and I have set -Xmx6144m in my JVM options.

This is my code

public List<Record> loadTrainingDataSet() {

    ArrayList<Record> records = new ArrayList<Record>();
    try {
        Statement s = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY);
        s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings");
        ResultSet rs = s.getResultSet();
        int count = 0;
        while (rs.next()) {

Any idea how to overcome this problem?


UPDATE

I came across this post, and based on it as well as the comments below, I updated my code. It seems I am able to load the data into memory with the same -Xmx6144m setting, but it takes a long time.

Here is my code.

...
import org.apache.mahout.math.SparseMatrix;
...

@Override
public SparseMatrix loadTrainingDataSet() {
    long t1 = System.currentTimeMillis();
    SparseMatrix ratings = new SparseMatrix(NUM_ROWS,NUM_COLS);
    int REC_START = 0;
    int REC_END = 0;

    try {
        for (int i = 1; i <= 101; i++) {
            long t11 = System.currentTimeMillis();
            REC_END = 1000000 * i;
            Statement s = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                    java.sql.ResultSet.CONCUR_READ_ONLY);
            s.setFetchSize(Integer.MIN_VALUE); // MySQL Connector/J hint: stream results row by row instead of buffering them all
            ResultSet rs = s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings LIMIT " + REC_START + "," + REC_END);//100480507
            while (rs.next()) {
                int movieId = rs.getInt("movie_id");
                int customerId = rs.getInt("customer_id");
                byte rating = (byte) rs.getInt("rating");
                ratings.set(customerId,movieId,rating);
            }
            long t22 = System.currentTimeMillis();
            System.out.println("Round " + i + " completed " + (t22 - t11) / 1000 + " seconds");
            rs.close();
            s.close();
        }

    } catch (Exception e) {
        System.err.println("Cannot connect to database server " + e);
    } finally {
        if (conn != null) {
            try {
                conn.close();
                System.out.println("Database connection terminated");
            } catch (Exception e) { /* ignore close errors */ }
        }
    }
    long t2 = System.currentTimeMillis();
    System.out.println(" Took " + (t2 - t1) / 1000 + " seconds");
    return ratings;
}

Loading the first 100,000 rows took 2 seconds. Loading the 29th batch of 100,000 rows took 46 seconds. I stopped the process in the middle since it was taking too much time. Are these acceptable amounts of time? Is there a way to improve the performance of this code? I am running this on an 8 GB RAM, 64-bit Windows machine.

asked by ravindrab

2 Answers

One hundred million records means each record may take up at most about 50 bytes in order to fit within 6 GB, leaving only a little extra room for other allocations. In Java, 50 bytes is nothing; a mere Object[] already takes 32 bytes per element. You must find a way to use the results immediately inside your while (rs.next()) loop and not retain them in full.
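
As an illustration only (this sketch is not from the answer), "use the results immediately" could look roughly like the following: stream the rows with the Connector/J row-by-row fetch hint and fold them into compact primitive arrays instead of keeping one object per row. The per-movie average, the numMovies bound, and the assumption that movie ids run from 1 to numMovies are hypothetical choices made just to keep the example concrete; conn is the same open connection as in the question.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamingAggregator {

    // Sketch: compute a per-movie average while streaming, retaining nothing per row.
    public static double[] averageRatingPerMovie(Connection conn, int numMovies) throws SQLException {
        long[] sum = new long[numMovies + 1];  // a few bytes per movie, not per rating
        int[] count = new int[numMovies + 1];  // assumes movie ids in 1..numMovies

        try (Statement s = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                ResultSet.CONCUR_READ_ONLY)) {
            s.setFetchSize(Integer.MIN_VALUE);  // MySQL Connector/J: stream row by row
            try (ResultSet rs = s.executeQuery(
                    "SELECT movie_id,customer_id,rating FROM ratings")) {
                while (rs.next()) {
                    int movieId = rs.getInt("movie_id");
                    int rating = rs.getInt("rating");
                    sum[movieId] += rating;     // use the row right away...
                    count[movieId]++;           // ...and retain only the running totals
                }
            }
        }

        double[] avg = new double[numMovies + 1];
        for (int m = 1; m <= numMovies; m++) {
            avg[m] = count[m] == 0 ? 0.0 : (double) sum[m] / count[m];
        }
        return avg;
    }
}

With this shape the heap usage is bounded by the size of the aggregate, not by the number of rows in the table.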

answered by Marko Topolnik

"The problem is I get the java.lang.OutOfMemoryError on the s.executeQuery() line itself."

You can split your query into multiple ones:

    s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings LIMIT 0,300"); //shows the first 300 results
    //process this first result
    s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings LIMIT 300,600");//shows 300 results starting from the 300th one
    //process this second result
    //etc

You can wrap this in a while loop that stops when no more results are found.
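
A minimal sketch of what that loop could look like, under the same assumptions as the question (an open connection conn and the ratings table); BATCH_SIZE and processRow are hypothetical placeholders for the caller's batch size and per-row handling:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PagedLoader {

    static final int BATCH_SIZE = 300;  // same chunk size as the snippet above

    public static void loadInBatches(Connection conn) throws SQLException {
        String sql = "SELECT movie_id,customer_id,rating FROM ratings LIMIT ?, ?";
        int offset = 0;
        boolean more = true;

        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            while (more) {
                ps.setInt(1, offset);
                ps.setInt(2, BATCH_SIZE);
                more = false;                       // stays false if this batch is empty
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        more = true;                // at least one row came back
                        processRow(rs.getInt("movie_id"),
                                   rs.getInt("customer_id"),
                                   rs.getInt("rating"));
                    }
                }
                offset += BATCH_SIZE;
            }
        }
    }

    // hypothetical per-row handler; replace with whatever processing is needed
    static void processRow(int movieId, int customerId, int rating) {
    }
}

Keep in mind that MySQL still scans past offset rows for every batch, so later batches get slower as the offset grows; if the table has an indexed key, paging with a WHERE key > lastSeen condition avoids that cost.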

answered by BackSlash