I am using an ADO.NET Entity Data Model to query a MySQL database, and I have been happy with its implementation and usage. I decided to see what would happen if I queried 1 million records, and it has serious performance issues that I don't understand.
The system hangs for some time and then I get either a deadlock or a timeout exception.
My code is as follows:
try
{
    // works very fast
    var data = from employees in dataContext.employee_table
                                            .Include("employee_type")
                                            .Include("employee_status")
               orderby employees.EMPLOYEE_ID descending
               select employees;

    // This hangs the system and causes some deadlock exception
    IList<employee_table> result = data.ToList<employee_table>();
    return result;
}
catch (Exception ex)
{
    throw new MyException("Error in fetching all employees", ex);
}
My question is: why is ToList() taking so long? Also, how can I avoid this exception, and what is the ideal way to query a million records?
The ideal way to query a million records is to use an IQueryable<T>, so that you aren't actually executing a query against the database until you need the data. I highly doubt that you need a million records at once.
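For instance, the method from the question could hand back the un-executed query instead of a materialized list. This is only a sketch reusing the dataContext and employee_table names from the question; GetEmployees is a made-up method name for illustration:

// A sketch: return the un-executed query so callers can filter/page
// before anything is sent to MySQL. Assumes the same dataContext and
// employee_table entity as in the question.
public IQueryable<employee_table> GetEmployees()
{
    return dataContext.employee_table
                      .Include("employee_type")
                      .Include("employee_status")
                      .OrderByDescending(e => e.EMPLOYEE_ID);
}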
The reason it is deadlocking is that you are asking the MySQL server to pull all million records from the database, sort them by EMPLOYEE_ID, and then return the whole result to your program. So I imagine the deadlocks come from your program waiting for that to finish and then reading it all into memory. The MySQL problems are probably related to timeout issues.
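As a stopgap while you restructure the query, you could also give the command more time to complete. This is a minimal sketch assuming an EF6 DbContext; an ObjectContext-based EDMX model exposes an equivalent CommandTimeout property on the context itself:

// Stopgap only: allow the single big query more time before it times out.
// Assumes dataContext is an EF6 DbContext; adjust for your context type.
dataContext.Database.CommandTimeout = 300; // seconds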
The reason the var data section works quickly is that you haven't actually done anything yet; you've only constructed the query. When you call ToList(), the SQL is generated, executed, and the results are read back. This is deferred execution (often loosely called lazy loading).
I would suggest trying this instead:
var data = from employees in dataContext.employee_table
                                        .Include("employee_type")
                                        .Include("employee_status")
           orderby employees.EMPLOYEE_ID descending
           select employees;
Then, when you actually need something from the list, just call
data.Where(/* your filter expression */).ToList()
So if you needed the employee with ID 10:
var employee = data.Where(e => e.ID == 10).ToList();
Or if you need all the employees whose last names start with S (I don't know whether your table has a last-name column; it's just an example):
var employees = data.Where(e => e.LastName.StartsWith("S")).ToList();
Or if you want to page through all of the employees in chunks of 100:
var employees = data.Skip(page * 100).Take(100).ToList();
If you want to defer your database calls even further, you can skip ToList() entirely and just enumerate the query when you need it. So let's say you want to add up all of the salaries of the people whose last names start with A:
var salaries = data.Where(s => s.LastName.StartsWith("A"));

decimal salaryTotal = 0;
foreach (var employee in salaries)
{
    salaryTotal += employee.Salary;
}
This would only execute a query that looks something like
SELECT * FROM employee_table WHERE LastName LIKE 'A%'
This results in a fast query that fetches the information only when you need it, and only the information you actually need.
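And if all you ultimately need is the total, you can push the aggregation to the server as well, so no employee rows are transferred at all. This sketch assumes Salary maps to a decimal column; the nullable cast keeps Sum from throwing when nothing matches:

// Let MySQL compute the aggregate instead of streaming rows to the client.
// Assumes Salary is a decimal; the decimal? cast handles an empty match.
decimal salaryTotal = data.Where(e => e.LastName.StartsWith("A"))
                          .Sum(e => (decimal?)e.Salary) ?? 0;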
If for some reason you actually wanted to query all one million records from the database (ignoring the fact that this would eat up a massive amount of system resources), I would suggest doing it in chunks; you would probably need to play around with the chunk size to get the best performance. The general idea is to run smaller queries to avoid timeout issues on the database.
int ChunkSize = 100; // for example purposes
HashSet<employee_table> Employees = new HashSet<employee_table>();

// Assuming it's exactly 1 million records
int RecordsToGet = 1000000;
for (int record = 0; record < RecordsToGet; record += ChunkSize)
{
    // EF requires an OrderBy before Skip/Take
    dataContext.employee_table
               .OrderBy(e => e.EMPLOYEE_ID)
               .Skip(record)
               .Take(ChunkSize)
               .ToList()
               .ForEach(e => Employees.Add(e));
}
I chose a HashSet<T> since it is designed for large sets of data, but I don't know what its performance would look like with 1,000,000 objects.