I'm building a web crawler and I'm looking for the best way to handle requests and connections between my threads and the database (MySQL).
I have 2 types of threads:
- Fetchers: They crawl websites. They produce URLs and add them to 2 tables: table_url and table_file. They select from table_url to continue the crawl, and update table_url to set visited=1 when they have read a URL, or visited=-1 while they are reading it. They can delete rows.
- Downloaders: They download files. They select from table_file and update table_file to change the Downloaded column. They never insert anything.
Right now I'm working with this: I have a connection pool based on c3p0. Every target (website) has these variables:
private Connection connection_downloader;
private Connection connection_fetcher;
I create both connections only once, when I instantiate a website. Then every thread uses these connections based on its target.
Every thread has these variables:
private Statement statement;
private ResultSet resultSet;
Before every query I open a SQL statement:
public static Statement openSqlStatement(Connection connection){
    try {
        return connection.createStatement();
    } catch (SQLException e) {
        e.printStackTrace();
    }
    return null;
}
And after every query I close the SQL statement and resultSet with:
public static void closeSqlStatement(ResultSet resultSet, Statement statement){
    if (resultSet != null) try { resultSet.close(); } catch (SQLException e) { e.printStackTrace(); }
    if (statement != null) try { statement.close(); } catch (SQLException e) { e.printStackTrace(); }
}
Right now my select helper only handles a single result (I never have to select more than one for now, but this will change soon) and is defined like this:
public static String sqlSelect(String Query, Connection connection, Statement statement, ResultSet resultSet){
    String result = null;
    try {
        resultSet = statement.executeQuery(Query);
        resultSet.next();
        result = resultSet.toString();
    } catch (SQLException e) {
        e.printStackTrace();
    }
    closeSqlStatement(resultSet, statement);
    return result;
}
Insert, delete and update queries use this function:
public static int sqlExec(String Query, Connection connection, Statement statement){
    // Number of affected rows, or -1 if the query failed
    int affectedRows = -1;
    try {
        affectedRows = statement.executeUpdate(Query);
    } catch (SQLException e) {
        e.printStackTrace();
    }
    // There is no ResultSet for an update, so only the statement is closed
    closeSqlStatement(null, statement);
    return affectedRows;
}
My question is simple: can this be improved to be faster? I'm also concerned about mutual exclusion, to prevent a thread from updating a link while another is doing so.
I believe your design is flawed. Permanently assigning one connection to each website will severely limit your overall throughput.
As you have already set up a connection pool, it's perfectly okay to fetch a connection right before you use it (and return it afterwards).
Likewise, using try-with-resources to close all your ResultSets and Statements afterwards will make the code more readable, and using PreparedStatement instead of Statement would not hurt either.
One example (using a static dataSource() call to access your pool):
public static String sqlSelect(String id) throws SQLException {
    // Borrow a connection from the pool; close() returns it when the block exits
    try (Connection con = dataSource().getConnection();
         PreparedStatement ps = con.prepareStatement("SELECT row FROM table WHERE key = ?")) {
        ps.setString(1, id);
        try (ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                return rs.getString(1);
            } else {
                throw new SQLException("Nothing found");
            }
        }
    } catch (SQLException e) {
        e.printStackTrace();
        throw e;
    }
}
Following the same pattern, I suggest you create methods for all the different inserts/updates/selects your application uses as well, each holding the connection only for the short time spent inside the DB logic.
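As for the mutual exclusion concern, one possible approach with the same short-lived-connection pattern is to let the database decide which thread gets a URL. The sketch below is only an illustration under assumptions: the class name UrlClaims, the method claimUrl, an id column on table_url, and visited = 0 meaning "not yet read" are all hypothetical and not taken from your schema. It also assumes an InnoDB table and the default autocommit mode, so that the single UPDATE is applied atomically per row.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

// Hypothetical helper; names and schema are assumptions for illustration.
final class UrlClaims {
    private UrlClaims() {}

    /**
     * Tries to mark one URL as "being read" (visited = -1).
     * Returns true only if this call's UPDATE matched the row,
     * i.e. no other thread had claimed or finished it before.
     */
    static boolean claimUrl(DataSource pool, long urlId) throws SQLException {
        // Assumed schema: table_url(id, visited), visited = 0 means "not yet read".
        String sql = "UPDATE table_url SET visited = -1 WHERE id = ? AND visited = 0";
        // Borrow a connection only for the duration of this statement.
        try (Connection con = pool.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, urlId);
            // One row updated means this thread won the claim.
            return ps.executeUpdate() == 1;
        }
    }
}
```

A fetcher would call claimUrl(pool, id) before reading a URL and simply skip that URL when the result is false, since another thread already holds it.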