
Optimizing MySQL query on large table

Tags: java, mysql, jdbc

I'm using MySQL with JDBC.

I have a large table containing 6.3 million rows that I am trying to run efficient SELECT queries against. See the schema below: [screenshot of the table definition]

I have created three additional indexes on the table, see below: [screenshot of the index definitions]

Performing a SELECT query like SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "+startTime+" AND "+endTime+" AND HourOfDay=4 AND DayOfWeek=3 has an extremely high run time of 256356 ms, a little over four minutes. EXPLAIN on the same query gives me this: [screenshot of the EXPLAIN output]

My code for retrieving the data is below:

    Connection con = null;
    PreparedStatement pst = null;
    Statement stmt = null;
    ResultSet rs = null;

    String url = "jdbc:mysql://xxx.xxx.xxx.xx:3306/testdb";
    String user = "bigd";
    String password = "XXXXX";

    try {
        Class.forName("com.mysql.jdbc.Driver");
        con = DriverManager.getConnection(url, user, password);
        String query = "SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "+startTime+" AND "+endTime+" AND HourOfDay=4 AND DayOfWeek=3";
        stmt = con.prepareStatement("SELECT latitude, longitude FROM 3dag WHERE timestamp>=" + startTime + " AND timestamp<=" + endTime);
        stmt = con.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);
        rs = stmt.executeQuery(query);

        System.out.println("Start");
        while (rs.next()) {

            int tempLong = (int) ((Double.parseDouble(rs.getString(2))) * 100000);
            int x = (int) (maxLong * 100000) - tempLong;
            int tempLat = (int) ((Double.parseDouble(rs.getString(1))) * 100000);
            int y = (int) (maxLat * 100000) - tempLat;

            if (!(y > matrix.length) || !(y < 0) || !(x > matrix[0].length) || !(x < 0)) {
                matrix[y][x] += 1;
            }
        }
        System.out.println("End");
        JSONObject obj = convertToCRS(matrix);
        return obj;

    }catch (ClassNotFoundException ex){
        Logger lgr = Logger.getLogger(Database.class.getName());
        lgr.log(Level.SEVERE, ex.getMessage(), ex);
        return null;
    }
    catch (SQLException ex) {
        Logger lgr = Logger.getLogger(Database.class.getName());
        lgr.log(Level.SEVERE, ex.getMessage(), ex);
        return null;
    } finally {
        try {
            if (rs != null) {
                rs.close();
            }
            if (pst != null) {
                pst.close();
            }
            if (con != null) {
                con.close();
            }
        } catch (SQLException ex) {
            Logger lgr = Logger.getLogger(Database.class.getName());
            lgr.log(Level.WARNING, ex.getMessage(), ex);
            return null;
        }
    }

Removing every line in the while(rs.next()) loop gives me the same horrible run-time.

My question is: what can I do to optimize this type of query? I am also curious about .setFetchSize() and what the optimal value is here. The documentation shows that Integer.MIN_VALUE results in fetching row by row; is this correct?

Any help is appreciated.

EDIT: After creating a new index on timestamp, DayOfWeek and HourOfDay, my query runs one minute faster, and EXPLAIN gives me this: [screenshot of the new EXPLAIN output]

asked Dec 13 '15 by kongshem


2 Answers

Some ideas up front:

  • Did you actually measure just the SQL execution time (from .executeQuery() until the first row), or does that figure include iterating over all 6.3 million rows?
  • You prepare a PreparedStatement but don't use it?!
  • Use a PreparedStatement and pass timestamp, dayOfWeek and hourOfDay as parameters (see the sketch after this list).
  • Create one index that can satisfy your WHERE condition. Order the keys so that the highest-ranking field eliminates the most rows.
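
As a rough illustration, a parameterized version of the query might look like the sketch below. This is only a sketch: it assumes the con connection from the question, and that startTime and endTime are numeric (epoch-style) values; adjust the setters if they are strings or SQL timestamps.

    String sql = "SELECT latitude, longitude FROM 3dag "
               + "WHERE `timestamp` BETWEEN ? AND ? AND HourOfDay = ? AND DayOfWeek = ?";
    try (PreparedStatement ps = con.prepareStatement(sql)) {
        ps.setLong(1, startTime);   // assumed numeric timestamps
        ps.setLong(2, endTime);
        ps.setInt(3, 4);            // HourOfDay
        ps.setInt(4, 3);            // DayOfWeek
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                double latitude  = rs.getDouble(1);   // avoids getString + parseDouble
                double longitude = rs.getDouble(2);
                // ... accumulate into the matrix as before ...
            }
        }
    }

Besides protecting against SQL injection, this avoids rebuilding the query string on every call and lets the driver handle value formatting.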

The index might look like:

CREATE INDEX stackoverflow on 3dag(hourOfDay, dayOfWeek, Timestamp);

Run the query directly inside MySQL: what time do you get there?

  • Try without stmt.setFetchSize(Integer.MIN_VALUE); this might create many unneeded network roundtrips.
answered Sep 24 '22 by Jan


According to your question, the cardinality of (that is, the number of distinct values in) your Timestamp column is about 1/30th of the cardinality of your Uid column. That is, you have lots and lots of identical timestamps. That doesn't bode well for the efficiency of your query.

That being said, you might try to use the following compound covering index to speed things up.

CREATE INDEX 3dag_q ON 3dag (`Timestamp`, HourOfDay, DayOfWeek, Latitude, Longitude);

Why will this help? Because your whole query can be satisfied from the index with a so-called tight index scan. The MySQL query engine will random-access the index to the entry with the smallest Timestamp value matching your query. It will then read the index in order and pull out the latitude and longitude from the rows that match.
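
To check whether the index is actually used, you can also run EXPLAIN through JDBC. A minimal sketch, reusing the con connection from the question with placeholder bounds for the timestamp range:

    try (Statement st = con.createStatement();
         ResultSet ex = st.executeQuery(
             "EXPLAIN SELECT latitude, longitude FROM 3dag " +
             "WHERE `timestamp` BETWEEN 0 AND 9999999999 AND HourOfDay = 4 AND DayOfWeek = 3")) {
        while (ex.next()) {
            // "key" should name the new index; "Extra" should mention "Using index"
            System.out.println(ex.getString("key") + " | " + ex.getString("Extra"));
        }
    }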

You could try doing some of the summarizing on the MySQL server.

SELECT COUNT(*) number_of_duplicates, 
       ROUND(Latitude,4) Latitude, ROUND(Longitude,4) Longitude
  FROM 3dag
 WHERE timestamp BETWEEN "+startTime+" 
                     AND "+endTime+"
   AND HourOfDay=4
   AND DayOfWeek=3
 GROUP BY ROUND(Latitude,4), ROUND(Longitude,4)

This may return a smaller result set. Edit: This quantizes (rounds off) your lat/long values and then counts the number of items that become duplicates after rounding. The more coarsely you round (that is, the smaller the second argument N in the ROUND(val, N) calls), the more duplicate values you will encounter and the fewer distinct rows your query will generate. Fewer rows save time.
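
On the Java side, the aggregated rows can feed the same matrix. A sketch, assuming maxLat, maxLong and matrix are the variables from the question and rs holds the result of the GROUP BY query above:

    while (rs.next()) {
        int count = rs.getInt("number_of_duplicates");
        int y = (int) (maxLat * 100000) - (int) (rs.getDouble("Latitude") * 100000);
        int x = (int) (maxLong * 100000) - (int) (rs.getDouble("Longitude") * 100000);
        if (y >= 0 && y < matrix.length && x >= 0 && x < matrix[0].length) {
            matrix[y][x] += count;   // add the whole duplicate count in one step
        }
    }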

Finally, if these lat/long values are GPS derived and recorded in degrees, it makes no sense to try to deal with more than about four or five decimal places. Commercial GPS precision is limited to that.

More suggestions

Make your latitude and longitude columns into FLOAT values in your table if they have GPS precision. If they have more precision than GPS, use DOUBLE. Storing and transferring numbers in varchar(30) columns is quite inefficient.

Similarly, make your HourOfDay and DayOfWeek columns into SMALLINT or even TINYINT data types in your table. Using 64-bit integers for values between 0 and 31 is wasteful. With hundreds of rows it doesn't matter; with millions it does.
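
If you decide to change the types, the conversions could look something like the following. This is a hedged sketch: it assumes the column names shown in the question, and you should back up the table and verify the data fits before running DDL like this.

    try (Statement ddl = con.createStatement()) {
        ddl.executeUpdate("ALTER TABLE 3dag MODIFY Latitude DOUBLE NOT NULL");
        ddl.executeUpdate("ALTER TABLE 3dag MODIFY Longitude DOUBLE NOT NULL");
        ddl.executeUpdate("ALTER TABLE 3dag MODIFY HourOfDay TINYINT NOT NULL");
        ddl.executeUpdate("ALTER TABLE 3dag MODIFY DayOfWeek TINYINT NOT NULL");
    }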

Finally, if your queries always look like this

SELECT Latitude, Longitude
   FROM 3dag
  WHERE timestamp BETWEEN SOME_VALUE 
                      AND ANOTHER_VALUE
    AND HourOfDay = SOME_CONSTANT_HOUR
    AND DayOfWeek = SOME_CONSTANT_DAY

this compound covering index should be ideal to accelerate your query.

CREATE INDEX 3dag_hdtll ON 3dag (HourOfDay, DayOfWeek, `timestamp`, Latitude, Longitude);
answered Sep 23 '22 by O. Jones