 

Entity model .net querying 1 million records from MySQL performance issues

I am using an ADO.NET Entity Model to query a MySQL database. I was very happy with its implementation and usage, so I decided to see what would happen if I queried 1 million records. It has serious performance issues, and I don't understand why.

The system hangs for some time, and then I get either:

  • A deadlock exception
  • A MySQL exception

My code is as follows:

    try
    {
        // works very fast
        var data = from employees in dataContext.employee_table
                        .Include("employee_type")
                        .Include("employee_status")
                   orderby employees.EMPLOYEE_ID descending
                   select employees;

        // This hangs the system and causes a deadlock exception
        IList<employee_table> result = data.ToList<employee_table>();

        return result;
    }
    catch (Exception ex)
    {
        throw new MyException("Error in fetching all employees", ex);
    }

My question is: why is ToList() taking such a long time?

Also, how can I avoid this exception, and what is the ideal way to query a million records?

Gurucharan Balakuntla Maheshku asked Mar 13 '11 15:03




1 Answer

The ideal way to query a million records would be to use an IQueryable<T> to make sure you aren't actually executing a query against the database until you need the data. I highly doubt that you need a million records at once.

The reason it is deadlocking is that you are asking the MySQL server to pull those million records from the database, sort them by EMPLOYEE_ID, and then return all of that to your program. I imagine the deadlocks come from your program waiting for that to finish while reading everything into memory, and the MySQL errors are probably timeout related.
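If the failures are timeout related, one mitigation (a hedged sketch: ObjectContext in EF 1/4 exposes a nullable CommandTimeout in seconds, assuming your generated context derives from it) is to give the long-running command more time:

```csharp
// Sketch: allow up to 5 minutes before the provider aborts the command.
// CommandTimeout is null by default, meaning the provider's own default applies.
dataContext.CommandTimeout = 300;
```

This only buys time, though; it doesn't fix the underlying cost of materializing a million rows.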

The reason the var data section runs quickly is that you haven't actually done anything yet; you've only constructed the query. When you call ToList(), the SQL is executed and the results are read. This is known as deferred execution (often described loosely as lazy loading).

I would suggest trying this:

        var data = from employees in dataContext.employee_table
                        .Include("employee_type")
                        .Include("employee_status")
                   orderby employees.EMPLOYEE_ID descending                            
                   select employees;

Then, when you actually need something from the list, just call:

data.Where(/* your filter expression */).ToList()

So if you needed the employee with ID 10:

var employee = data.Where(e => e.ID == 10).ToList();

Or if you need all the employees whose last name starts with S (I don't know if your table has a last-name column; this is just an example):

var employees = data.Where(e => e.LastName.StartsWith("S")).ToList();

Or if you want to page through all of the employees in chunks of 100:

var employees = data.Skip(page * 100).Take(100).ToList();
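One caveat worth noting: Skip/Take paging only behaves predictably over a stable ordering. The data query above already has an orderby, but if you page a bare table, add one first (a sketch; EMPLOYEE_ID is assumed from the question, and page is a hypothetical zero-based page index):

```csharp
// Sketch: deterministic paging requires an explicit ordering,
// otherwise rows can repeat or be skipped between pages.
var pageOfEmployees = dataContext.employee_table
                                 .OrderBy(e => e.EMPLOYEE_ID)
                                 .Skip(page * 100)
                                 .Take(100)
                                 .ToList();
```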

If you want to defer your database calls even further, you can skip calling ToList() and just iterate the query when you need the data. Say you want to add up the salaries of everyone whose last name starts with A:

 var salaries = data.Where(e => e.LastName.StartsWith("A"));

 decimal salaryTotal = 0;
 foreach (var employee in salaries)
 {
     salaryTotal += employee.Salary;
 }

This issues a single query that would look something like

Select * From employee_table Where LastName Like 'A%'

so the database does the filtering, only the matching rows come back, and they are fetched only when you actually iterate.
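If all you ultimately need is the aggregate, you can push that to the server too. As a sketch (assuming hypothetical LastName and Salary columns, as in the examples above), Sum() is translated into SQL, so only a single scalar crosses the wire:

```csharp
// Roughly: SELECT SUM(Salary) FROM employee_table WHERE LastName LIKE 'A%'
decimal salaryTotal = dataContext.employee_table
                                 .Where(e => e.LastName.StartsWith("A"))
                                 .Sum(e => e.Salary);
```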

If for some crazy reason you actually want to query all million records, then, ignoring the fact that this will eat up a massive amount of system resources, I would suggest doing it in chunks; you will probably need to experiment with the chunk size to get the best performance.

The general idea is to issue smaller queries to avoid timeout issues from the database.

int chunkSize = 100; // for example purposes
var employees = new HashSet<employee_table>();

// Assuming it's exactly 1 million records
int recordsToGet = 1000000;

for (int record = 0; record < recordsToGet; record += chunkSize)
{
    // Skip/Take needs a stable ordering to page reliably
    var chunk = dataContext.employee_table
                           .OrderBy(e => e.EMPLOYEE_ID)
                           .Skip(record)
                           .Take(chunkSize)
                           .ToList();
    foreach (var e in chunk)
    {
        employees.Add(e);
    }
}

I chose a HashSet<T> since it is designed for large sets of data, but I don't know what performance would look like with 1,000,000 objects.
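If you don't need all million entities alive at once, an alternative sketch is to stream them: iterating the query directly keeps a data reader open and yields one entity at a time, and turning off change tracking (MergeOption.NoTracking on an ObjectQuery<T> in EF 1/4; this is an assumption about your context type) avoids the per-entity bookkeeping cost:

```csharp
// Sketch: process rows as they arrive instead of materializing a list.
var query = dataContext.employee_table;
query.MergeOption = MergeOption.NoTracking; // assumes an ObjectQuery<T>
foreach (var employee in query)
{
    Process(employee); // hypothetical per-row handler
}
```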

msarchet answered Nov 15 '22 09:11