Almost all examples of Numba, CuPy, etc. available online are simple array additions, showing the speedup of going from a single CPU core/thread to a GPU, and the documentation mostly lacks good examples. This post is intended to provide a more comprehensive example.
The initial code is provided here. It is a simple implementation of the classic cellular automaton (Conway's Game of Life). Originally it doesn't even use NumPy, just plain Python and the Pyglet module for visualization.
My goal is to extend this code to a specific problem (which will be very large), but first I thought it best to optimize it for GPU usage.
The game_of_life.py is this:
```
import random as rnd
import pyglet
#import numpy as np
#from numba import vectorize, cuda, jit

class GameOfLife:
    def __init__(self, window_width, window_height, cell_size, percent_fill):
        self.grid_width = int(window_width / cell_size) # cell_size
        self.grid_height = int(window_height / cell_size) #
        self.cell_size = cell_size
        self.percent_fill = percent_fill
        self.cells = []
        self.generate_cells()

    def generate_cells(self):
        for row in range(0, self.grid_height):
            self.cells.append([])
            for col in range(0, self.grid_width):
                if rnd.random() < self.percent_fill:
                    self.cells[row].append(1)
                else:
                    self.cells[row].append(0)

    def run_rules(self):
        temp = []
        for row in range(0, self.grid_height):
            temp.append([])
            for col in range(0, self.grid_width):
                cell_sum = sum([self.get_cell_value(row - 1, col),
                                self.get_cell_value(row - 1, col - 1),
                                self.get_cell_value(row,     col - 1),
                                self.get_cell_value(row + 1, col - 1),
                                self.get_cell_value(row + 1, col),
                                self.get_cell_value(row + 1, col + 1),
                                self.get_cell_value(row,     col + 1),
                                self.get_cell_value(row - 1, col + 1)])
                if self.cells[row][col] == 0 and cell_sum == 3:
                    temp[row].append(1)
                elif self.cells[row][col] == 1 and (cell_sum == 3 or cell_sum == 2):
                    temp[row].append(1)
                else:
                    temp[row].append(0)
        self.cells = temp

    def get_cell_value(self, row, col):
        if row >= 0 and row < self.grid_height and col >= 0 and col < self.grid_width:
            return self.cells[row][col]
        return 0

    def draw(self):
        for row in range(0, self.grid_height):
            for col in range(0, self.grid_width):
                if self.cells[row][col] == 1:
                    #(0, 0) (0, 20) (20, 0) (20, 20)
                    square_coords = (row * self.cell_size,                  col * self.cell_size,
                                     row * self.cell_size,                  col * self.cell_size + self.cell_size,
                                     row * self.cell_size + self.cell_size, col * self.cell_size,
                                     row * self.cell_size + self.cell_size, col * self.cell_size + self.cell_size)
                    pyglet.graphics.draw_indexed(4, pyglet.gl.GL_TRIANGLES,
                                                 [0, 1, 2, 1, 2, 3],
                                                 ('v2i', square_coords))
```
Firstly, I could switch to NumPy by adding `self.cells = np.asarray(self.cells)` at the end of `generate_cells` and `self.cells = np.asarray(temp)` at the end of `run_rules`, since doing the conversion any earlier wouldn't give a speedup, as presented here. (Actually, changing to NumPy this way didn't give a noticeable speedup.)
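For reference, here is roughly what I understand a fully vectorised NumPy version of `run_rules` would have to look like (just my own sketch, not benchmarked; `step` is a name I made up, and it keeps the dead-border behaviour of `get_cell_value`):

```
import numpy as np

def step(cells):
    """One Game of Life generation; cells is a 2D array of 0/1."""
    # Zero-pad the border so out-of-range neighbours count as dead,
    # matching get_cell_value returning 0 outside the grid.
    padded = np.pad(cells, 1, mode='constant', constant_values=0)
    # Sum of the 8 neighbours of every cell, computed with shifted slices.
    neighbours = (padded[:-2, :-2] + padded[:-2, 1:-1] + padded[:-2, 2:] +
                  padded[1:-1, :-2]                    + padded[1:-1, 2:] +
                  padded[2:,  :-2] + padded[2:,  1:-1] + padded[2:,  2:])
    # Birth on exactly 3 neighbours, survival on 2 or 3.
    return ((neighbours == 3) | ((cells == 1) & (neighbours == 2))).astype(cells.dtype)

# inside run_rules this would replace the double loop:
# self.cells = step(self.cells)
```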
Regarding GPUs: for example, I added `@jit` before every function, and everything became very slow. I also tried `@vectorize(['float32(float32, float32)'], target='cuda')`, but this raised a question: how do you use `@vectorize` on functions whose only input argument is `self`?
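To make those two attempts concrete, this is roughly what they looked like (only a sketch of what I tried, not working GPU code):

```
from numba import jit, vectorize

class GameOfLife:
    # ...
    @jit                     # "added @jit before every function": the body still works on
    def run_rules(self):     # nested Python lists, which numba cannot really accelerate,
        ...                  # so for me it only got slower

# @vectorize expects an elementwise scalar function with an explicit signature,
# like this toy example:
@vectorize(['float32(float32, float32)'], target='cuda')
def add(a, b):
    return a + b

# ...which is exactly where my question comes from: run_rules only receives self,
# so there is no scalar/array argument for @vectorize to map over.
```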
I also tried substituting NumPy with CuPy, like `self.cells = cupy.asarray(self.cells)`, but it also became very slow.
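For completeness, this is where I imagine CuPy would actually have to be applied, i.e. the whole update expressed as array operations instead of just wrapping the lists (untested sketch on my part, mirroring the NumPy version above):

```
import cupy as cp

def gpu_step(cells_gpu):
    """Same slicing as the NumPy sketch, but on a cupy array, so every
    operation runs as a GPU kernel instead of a Python-level loop."""
    padded = cp.pad(cells_gpu, 1, mode='constant', constant_values=0)
    neighbours = (padded[:-2, :-2] + padded[:-2, 1:-1] + padded[:-2, 2:] +
                  padded[1:-1, :-2]                    + padded[1:-1, 2:] +
                  padded[2:,  :-2] + padded[2:,  1:-1] + padded[2:,  2:])
    return ((neighbours == 3) | ((cells_gpu == 1) & (neighbours == 2))).astype(cells_gpu.dtype)

# the grid would stay on the GPU between generations, coming back to the host
# only when draw() needs it, e.g. cells_host = cp.asnumpy(cells_gpu)
```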
Following the initial idea of an extended example of GPU usage, what would be the proper approach to this problem? Where is the right place to put the modifications/vectorizations/parallelizations/numba/cupy etc.? And, most importantly, why?
Additional info: besides the code provided, here's the main.py file:
```
import pyglet
from game_of_life import GameOfLife

class Window(pyglet.window.Window):
    def __init__(self):
        super().__init__(800, 800)
        self.gameOfLife = GameOfLife(self.get_size()[0],
                                     self.get_size()[1],
                                     15,  # the smaller this value, the more computation-intensive it gets
                                     0.5)
        pyglet.clock.schedule_interval(self.update, 1.0/24.0)  # 24 frames per second

    def on_draw(self):
        self.clear()
        self.gameOfLife.draw()

    def update(self, dt):
        self.gameOfLife.run_rules()

if __name__ == '__main__':
    window = Window()
    pyglet.app.run()
```
I don't quite understand your example, but I only needed GPU computing. After a few days of pain I think I understand how to use it, so I'll show it to you in the hope that it helps. In addition, I need to point out that in the call `...kernel(cuts, cuts, ...)` I pass `cuts` twice. The first one has its type specified when it is passed in, so the kernel uses it as the element being traversed and it cannot be read by index; I use the second (raw) one to read data at arbitrary indices. (A minimal example of this typed-vs-raw distinction follows after the code.)
```
import cupy as cp

binsort_kernel = cp.ElementwiseKernel(
'int32 I,raw T cut,raw T ind,int32 row,int32 col,int32 q','raw T out,raw T bin,raw T num',
'''
int i_x = i / col;
int i_y = i % col;
int b_f = i_x*col;
int b_l = b_f+col;
int n_x = i_x * q;
int inx = i_x%row*col;
////////////////////////////////////////////////////////////////////////////////////////
int r_x = 0; int adi = 0; int adb = 0;
////////////////////////////////////////////////////////////////////////////////////////
    if (i_y == 0)
    {
        for (size_t j = b_f; j < b_l; j++) {
            if (cut[j] < q) {
                r_x = inx + j - b_f;
                adb = n_x + cut[j];
                adi = bin[adb] + num[adb];
                out[adi] = ind[r_x];
                num[adb] += 1;
            }
        }
    }
////////////////////////////////////////////////////////////////////////////////////////
''','binsort')
binsort_kernel(cuts, cuts, ind, row, col, q, iout, bins, bnum)
```
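Here is a minimal, self-contained sketch of the same idea with made-up names, just to show the mechanics: the typed argument is what `ElementwiseKernel` iterates over element by element, while a `raw` argument can be indexed freely inside the kernel body, which is why the same array is passed twice.

```
import cupy as cp

# 'x' drives the elementwise loop (the body runs once per element of x),
# while 'raw T data' is the same array passed again so it can be read at any index.
shift_kernel = cp.ElementwiseKernel(
    'T x, raw T data, int32 n',     # x: current element, data: freely indexable, n: length
    'T y',
    '''
    int j = (i + 1) % n;            // i = flat index of the current element
    y = x + data[j];                // combine the iterated element with an indexed read
    ''',
    'shift_kernel')

a = cp.arange(8, dtype=cp.float32)
out = shift_kernel(a, a, a.size)    # same array passed twice, like binsort_kernel(cuts, cuts, ...)
```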