C - 2D array on CUDA
I want to dynamically allocate a global 2D array in CUDA. How can I achieve this?

In main I am calling my kernel in a loop. Before each kernel call I need to allocate memory on the GPU, and after the kernel call a single integer is sent from the GPU to the CPU to report whether the problem was solved or not.

If the problem was not solved, I do not free the old memory, since there is further need for it; instead I should allocate new memory on the GPU and call the kernel again.

Some pseudocode is shown below:
    int n = 0, i = 0;
    while (n == 0) {
        // allocate 2D memory for mem[i++]
        // call kernel(mem, i)
        // get n back from the kernel
    }

    __global__ void kernel(mem, int i)
    {
        mem[0][5] = 1;
        mem[1][0] = mem[0][5] + 23; // can only be used if mem[1] was allocated before the kernel call
    }

Any suggestions? Thank you.
Two opening comments: using a dynamically allocated 2D array is a bad idea in CUDA, and doing repetitive memory allocations in a loop is also not a good idea. Both incur needless performance penalties.
For the host code, something like this:
    size_t allocsize = 16000 * sizeof(float);
    int n_allocations = 16;
    float *dpointer;
    cudaMalloc((void **)&dpointer, n_allocations * allocsize);

    float *dcurrent = dpointer;
    int n = 0;
    for (int i = 0; (n == 0) && (i < n_allocations); i++,
            dcurrent += allocsize / sizeof(float)) { // advance one slot; allocsize is in bytes, dcurrent is a float*
        // whatever happens before the kernel call
        kernel <<< gridsize, blocksize >>> (dcurrent, .....);
        // whatever happens after the kernel call
    }

is preferable. Here you call cudaMalloc only once and pass offsets into that single allocation, which makes memory allocation and management free inside the loop. The loop structure also means the loop cannot run endlessly and exhaust GPU memory.
On the 2D array question itself, there are two reasons why it is a bad idea. Firstly, allocating a 2D array with N rows requires (N+1) cudaMalloc calls plus a host-to-device copy of the row pointers, which is slow and ugly. Secondly, inside the kernel code the GPU must perform two global memory reads to get at the data: one pointer indirection to fetch the row address, and one to fetch the data from the row. That is much slower than this alternative:
    #define IDX(i,j,lda) ( (j) + ((i)*(lda)) )

    __global__ void kernel(float *mem, int lda, ....)
    {
        mem[IDX(0,5,lda)] = 1; // equivalent to mem[0][5] = 1;
    }

which uses indexing into a single 1D allocation. On the GPU, memory transactions are expensive, but FLOPs and IOPs are cheap, and a single integer multiply-add is the most efficient way to do the indexing. If you need to access results from a previous kernel call, pass the offset to the previous results and use two pointers inside the kernel, something like this:
    __global__ void kernel(float *mem, int lda, int current, int previous)
    {
        float *mem0 = mem + current;   // this call's results
        float *mem1 = mem + previous;  // the previous call's results
        ....
    }

(The parameters are named current and previous here because this is a reserved word in the C++ that CUDA code is compiled as.) This is one of the ways in which efficient distributed-memory programs (and CUDA is really a type of distributed-memory programming) start to look like Fortran after a while, but that is the price you pay for the portability, transparency, and efficiency.
Hope this helped.