I'm using MySQL with JDBC.
I have a large example table containing 6.3 million rows that I am trying to run efficient SELECT queries against. See below:
I have created three additional indexes on the table, see below:
Performing a SELECT query like this:

SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "+startTime+" AND "+endTime+" AND HourOfDay=4 AND DayOfWeek=3

has an extremely high run time of 256356 ms, or a little above four minutes. My EXPLAIN on the same query gives me this:
My code for retrieving the data is below:
Connection con = null;
Statement stmt = null;
ResultSet rs = null;

String url = "jdbc:mysql://xxx.xxx.xxx.xx:3306/testdb";
String user = "bigd";
String password = "XXXXX";

try {
    Class.forName("com.mysql.jdbc.Driver");
    con = DriverManager.getConnection(url, user, password);

    String query = "SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "
            + startTime + " AND " + endTime + " AND HourOfDay=4 AND DayOfWeek=3";

    // Forward-only, read-only statement with fetch size Integer.MIN_VALUE
    // tells Connector/J to stream the result row by row.
    stmt = con.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
            java.sql.ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);

    rs = stmt.executeQuery(query);
    System.out.println("Start");

    while (rs.next()) {
        int tempLong = (int) ((Double.parseDouble(rs.getString(2))) * 100000);
        int x = (int) (maxLong * 100000) - tempLong;
        int tempLat = (int) ((Double.parseDouble(rs.getString(1))) * 100000);
        int y = (int) (maxLat * 100000) - tempLat;
        // Only count points that actually fall inside the matrix bounds.
        if (y >= 0 && y < matrix.length && x >= 0 && x < matrix[0].length) {
            matrix[y][x] += 1;
        }
    }
    System.out.println("End");

    JSONObject obj = convertToCRS(matrix);
    return obj;
} catch (ClassNotFoundException ex) {
    Logger lgr = Logger.getLogger(Database.class.getName());
    lgr.log(Level.SEVERE, ex.getMessage(), ex);
    return null;
} catch (SQLException ex) {
    Logger lgr = Logger.getLogger(Database.class.getName());
    lgr.log(Level.SEVERE, ex.getMessage(), ex);
    return null;
} finally {
    try {
        if (rs != null) {
            rs.close();
        }
        if (stmt != null) {
            stmt.close();
        }
        if (con != null) {
            con.close();
        }
    } catch (SQLException ex) {
        Logger lgr = Logger.getLogger(Database.class.getName());
        lgr.log(Level.WARNING, ex.getMessage(), ex);
        return null;
    }
}
Removing every line in the while (rs.next()) loop gives me the same horrible run time.
My question is: what can I do to optimize this type of query? I am curious about .setFetchSize() and what the optimal value should be here. The documentation says that Integer.MIN_VALUE results in fetching row by row, is this correct?
Any help is appreciated.
EDIT: After creating a new index on (timestamp, DayOfWeek, HourOfDay), my query runs 1 minute faster, and EXPLAIN gives me this:
Some ideas up front:

The index might look like:

CREATE INDEX stackoverflow ON 3dag (HourOfDay, DayOfWeek, `timestamp`);

Also, run your SQL directly inside MySQL - what time do you get there? And note that

stmt.setFetchSize(Integer.MIN_VALUE);

might create many unneeded network round trips (see the sketch below).

According to your question, the cardinality of (that is, the number of distinct values in) your Timestamp column is about 1/30th of the cardinality of your Uid column. That is, you have lots and lots of identical timestamps. That doesn't bode well for the efficiency of your query.
That being said, you might try to use the following compound covering index to speed things up.
CREATE INDEX 3dag_q ON 3dag (`Timestamp`, HourOfDay, DayOfWeek, Latitude, Longitude);
Why will this help? Because your whole query can be satisfied from the index with a so-called tight index scan. The MySQL query engine will random-access the index to the entry with the smallest Timestamp value matching your query. It will then read the index in order and pull out the latitude and longitude from the rows that match.
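To confirm the optimizer actually satisfies the query from the index, you can run EXPLAIN from JDBC and look for "Using index" in the Extra column. A sketch, assuming the timestamp column holds numeric values (as the string concatenation in your code suggests):

// Sketch: EXPLAIN the query and print the optimizer's plan. "Using index"
// in the Extra column means the query is covered by the index alone.
String explainSql = "EXPLAIN SELECT latitude, longitude FROM 3dag"
        + " WHERE `timestamp` BETWEEN ? AND ? AND HourOfDay = 4 AND DayOfWeek = 3";
try (PreparedStatement ps = con.prepareStatement(explainSql)) {
    ps.setLong(1, startTime);
    ps.setLong(2, endTime);
    try (ResultSet plan = ps.executeQuery()) {
        while (plan.next()) {
            System.out.println("key=" + plan.getString("key")
                    + ", rows=" + plan.getString("rows")
                    + ", Extra=" + plan.getString("Extra"));
        }
    }
}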
You could try doing some of the summarizing on the MySQL server.
SELECT COUNT(*) number_of_duplicates,
ROUND(Latitude,4) Latitude, ROUND(Longitude,4) Longitude
FROM 3dag
WHERE timestamp BETWEEN "+startTime+"
AND "+endTime+"
AND HourOfDay=4
AND DayOfWeek=3
GROUP BY ROUND(Latitude,4), ROUND(Longitude,4)
This may return a smaller result set. Edit: this quantizes (rounds off) your lat/long values and then counts the number of items made duplicate by the rounding. The more coarsely you round them (that is, the smaller the second number in the ROUND(val,N) function calls), the more duplicate values you will encounter and the fewer distinct rows your query will generate. Fewer rows save time.
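If you take this approach, the Java loop changes accordingly: each returned row now stands for many original rows, so you add the duplicate count instead of 1. A sketch reusing the matrix, maxLat and maxLong variables from your question:

// Weight each cell by how many original rows collapsed into this one.
while (rs.next()) {
    int count = rs.getInt("number_of_duplicates");
    int x = (int) (maxLong * 100000) - (int) (rs.getDouble("Longitude") * 100000);
    int y = (int) (maxLat * 100000) - (int) (rs.getDouble("Latitude") * 100000);
    if (y >= 0 && y < matrix.length && x >= 0 && x < matrix[0].length) {
        matrix[y][x] += count; // one aggregated row instead of `count` rows
    }
}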
Finally, if these lat/long values are GPS derived and recorded in degrees, it makes no sense to try to deal with more than about four or five decimal places. Commercial GPS precision is limited to that.
More suggestions
Make your latitude and longitude columns into FLOAT values in your table if they have GPS precision. If they have more precision than GPS, use DOUBLE. Storing and transferring numbers in varchar(30) columns is quite inefficient.
Similarly, make your HourOfDay and DayOfWeek columns into SMALLINT or even TINYINT data types in your table. 64-bit integers for values between 0 and 31 are wasteful. With hundreds of rows it doesn't matter; with millions it does.
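As a sketch, both type changes might look like the following. The target types are my assumptions based on the reasoning above, and since ALTER TABLE rewrites all 6.3 million rows, try it on a copy of the table first:

// Hypothetical one-off DDL; FLOAT assumes GPS precision, TINYINT fits 0-31.
try (Statement ddl = con.createStatement()) {
    ddl.executeUpdate("ALTER TABLE 3dag"
            + " MODIFY Latitude FLOAT, MODIFY Longitude FLOAT");
    ddl.executeUpdate("ALTER TABLE 3dag"
            + " MODIFY HourOfDay TINYINT, MODIFY DayOfWeek TINYINT");
}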
Finally, if your queries always look like this
SELECT Latitude, Longitude
FROM 3dag
WHERE timestamp BETWEEN SOME_VALUE
AND ANOTHER_VALUE
AND HourOfDay = SOME_CONSTANT_HOUR
AND DayOfWeek = SOME_CONSTANT_DAY
this compound covering index should be ideal to accelerate your query.
CREATE INDEX 3dag_hdtll ON 3dag (HourOfDay, DayOfWeek, `timestamp`, Latitude, Longitude);