
How to execute large amount of sql queries asynchronous and in threads

Problem: I have a huge number of SQL queries (around 10k-20k) and I want to run them asynchronously in 50 (or more) threads.

I wrote a PowerShell script for this job, but it is very slow (it took about 20 hours to execute them all). The desired result is 3-4 hours max.

Question: How can I optimize this PowerShell script? Should I reconsider and use another technology like Python or C#?

I think it's a PowerShell issue, because when I check with whoisactive the queries themselves execute fast. Creating, exiting, and unloading the jobs takes a lot of time, because a separate PowerShell instance is created for each job.
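
To illustrate that overhead, here is a minimal timing sketch (not part of the original setup; the job count and body are arbitrary) that measures how long it takes just to create, wait for, and remove 50 do-nothing jobs:

# Timing sketch: Start-Job launches a separate PowerShell process per job,
# so even trivial jobs carry a noticeable creation/teardown cost.
Measure-Command {
    $jobs = 1..50 | ForEach-Object { Start-Job -ScriptBlock { 1 } }
    $jobs | Wait-Job | Receive-Job | Out-Null
    $jobs | Remove-Job
}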

My code:

$NumberOfParallelThreads = 50;


$Arr_AllQueries = @('Exec [mystoredproc] @param1=1, @param2=2',
                    'Exec [mystoredproc] @param1=11, @param2=22',
                    'Exec [mystoredproc] @param1=111, @param2=222')

#Creating the batches
$counter = [pscustomobject] @{ Value = 0 };
$Batches_AllQueries = $Arr_AllQueries | Group-Object -Property { 
    [math]::Floor($counter.Value++ / $NumberOfParallelThreads) 
};

forEach ($item in $Batches_AllQueries) {
    $tmpBatch = $item.Group;

    $tmpBatch | % {

        $ScriptBlock = {
            # accept the loop variable across the job-context barrier
            param($query) 
            # Execute a command

            Try 
            {
                Write-Host "[processing '$query']"
                $objConnection = New-Object System.Data.SqlClient.SqlConnection;
                $objConnection.ConnectionString = 'Data Source=...';

                $ObjCmd = New-Object System.Data.SqlClient.SqlCommand;
                $ObjCmd.CommandText = $query;
                $ObjCmd.Connection = $objConnection;
                $ObjCmd.CommandTimeout = 0;

                $objAdapter = New-Object System.Data.SqlClient.SqlDataAdapter;
                $objAdapter.SelectCommand = $ObjCmd;
                $objDataTable = New-Object System.Data.DataTable;
                $objAdapter.Fill($objDataTable)  | Out-Null;

                $objConnection.Close();
                $objConnection = $null;
            } 
            Catch 
            { 
                $ErrorMessage = $_.Exception.Message
                $FailedItem = $_.Exception.ItemName
                Write-Host "[Error processing: $($query)]" -BackgroundColor Red;
                Write-Host $ErrorMessage 
            }

        }

        # pass the loop variable across the job-context barrier
        Start-Job $ScriptBlock -ArgumentList $_ | Out-Null
    }

    # Wait for all to complete
    While (Get-Job -State "Running") { Start-Sleep 2 }

    # Display output from all jobs
    Get-Job | Receive-Job | Out-Null

    # Cleanup
    Remove-Job *

}

UPDATE:

Resources: The DB server is on a remote machine with:

  • 24GB RAM,
  • 8 cores,
  • 500GB Storage,
  • SQL Server 2016

We want to use the maximum CPU power.

Framework limitation: The only limitation is not to drive the execution from SQL Server itself. The requests should come from an outside source such as PowerShell, C#, Python, etc.

asked Jul 12 '19 by nyagolova




1 Answer

A RunspacePool is the way to go here; try this:

$AllQueries = @( ... )
$MaxThreads = 5

# Each thread keeps its own connection but shares the query queue
$ScriptBlock = {
    Param($WorkQueue)

    $objConnection = New-Object System.Data.SqlClient.SqlConnection
    $objConnection.ConnectionString = 'Data Source=...'

    $objCmd = New-Object System.Data.SqlClient.SqlCommand
    $objCmd.Connection = $objConnection
    $objCmd.CommandTimeout = 0

    $query = ""

    while ($WorkQueue.TryDequeue([ref]$query)) {
        $objCmd.CommandText = $query
        $objAdapter = New-Object System.Data.SqlClient.SqlDataAdapter $objCmd
        $objDataTable = New-Object System.Data.DataTable
        $objAdapter.Fill($objDataTable) | Out-Null
    }

    $objConnection.Close()

}

# create a pool
$pool = [RunspaceFactory]::CreateRunspacePool(1, $MaxThreads)
$pool.ApartmentState  = 'STA'
$pool.Open()

# convert the query array into a concurrent queue
$workQueue = New-Object System.Collections.Concurrent.ConcurrentQueue[object]
$AllQueries | % { $workQueue.Enqueue($_) }

$threads = @()

# Create each powershell thread and add them to the pool
1..$MaxThreads | % {
    $ps = [powershell]::Create()
    $ps.RunspacePool = $pool
    $ps.AddScript($ScriptBlock) | Out-Null
    $ps.AddParameter('WorkQueue', $workQueue) | Out-Null
    $threads += [pscustomobject]@{
        Ps = $ps
        Handle = $null
    }
}

# Start all the threads
$threads | % { $_.Handle = $_.Ps.BeginInvoke() }

# Wait for all the threads to complete - errors will still set the IsCompleted flag
while ($threads | ? { !$_.Handle.IsCompleted }) {
    Start-Sleep -Seconds 1
}

# Get any results and display any errors
$threads | % {
    $_.Ps.EndInvoke($_.Handle) | Write-Output
    if ($_.Ps.HadErrors) {
        $_.Ps.Streams.Error.ReadAll() | Write-Error
    }
}

Unlike PowerShell jobs, the runspaces in a RunspacePool can share resources. So there is one concurrent queue of all the queries, and each thread keeps its own connection to the database.
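
As a usage note (not from the original answer), the elided $AllQueries array could be populated from a plain text file of statements before being enqueued, for example:

# Hypothetical example: load one query per line from a text file, skipping blank lines.
# The file name is an assumption for illustration only.
$AllQueries = Get-Content -Path '.\queries.txt' | Where-Object { $_.Trim() }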

As others have said though - unless you're stress testing your database, you're probably better off reorganising the queries into bulk inserts.
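
If you do go the bulk-insert route, a minimal sketch using System.Data.SqlClient.SqlBulkCopy might look like this (the target table, columns, and sample rows are assumptions for illustration only):

# Hypothetical bulk-insert sketch with SqlBulkCopy.
# Build a DataTable whose columns match the target table.
$dataTable = New-Object System.Data.DataTable
[void]$dataTable.Columns.Add('Param1', [int])
[void]$dataTable.Columns.Add('Param2', [int])
[void]$dataTable.Rows.Add(1, 2)
[void]$dataTable.Rows.Add(11, 22)

$connection = New-Object System.Data.SqlClient.SqlConnection 'Data Source=...'
$connection.Open()

$bulkCopy = New-Object System.Data.SqlClient.SqlBulkCopy $connection
$bulkCopy.DestinationTableName = 'dbo.MyTargetTable'   # hypothetical target table
$bulkCopy.BatchSize = 5000
$bulkCopy.WriteToServer($dataTable)

$connection.Close()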

answered Oct 11 '22 by antonyoni