Let's talk about the GPU implementation of SPH. I have tested it against the CPU implementation.
Remember that this value was obtained with NUM_CPU_THREADS set to 1, which means only one CPU thread runs the program.
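For context, NUM_CPU_THREADS is just the constant I use to pick the OpenMP thread count. Below is a minimal sketch of how such a constant is typically applied; the loop body is a simplified stand-in, not my actual density/force pass.

```cpp
// Sketch only: shows how a NUM_CPU_THREADS constant controls OpenMP parallelism.
// The per-particle work here is a placeholder, not the real SPH computation.
#include <omp.h>
#include <vector>

#define NUM_CPU_THREADS 1   // set to 1 for the single-thread measurement

void computeDensities(std::vector<float>& density,
                      const std::vector<float>& mass,
                      int numParticles)
{
    omp_set_num_threads(NUM_CPU_THREADS);
    #pragma omp parallel for
    for (int i = 0; i < numParticles; ++i) {
        // per-particle SPH density accumulation would go here
        density[i] = mass[i];   // placeholder
    }
}
```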
I also measured the FPS of the GPU implementation.
The FPS is 32, about 2.7x the CPU's. If we turn the light source on and display shadows (this feature is already implemented in the framework I use), the CPU implementation's FPS does not drop, but the GPU's drops to 26.
Since I optimized the CPU implementation with OpenMP, I also tried running the CPU code with 8 CPU threads...
I was quite confused by this comparison result and looked into my code to see what causes it. I think it is related to memory access issues, and I am still working on it to see if I can improve the performance of the GPU implementation.
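To make the kind of memory-access issue I mean concrete, here is a sketch of one common culprit on the GPU: storing particle data as an array of structs rather than a struct of arrays. The kernels and names below are illustrative only, not taken from my code.

```cpp
// Illustration of AoS vs. SoA particle layouts and their effect on coalescing.
struct ParticleAoS { float x, y, z, pad; };

// AoS: each scalar field is read with a 16-byte stride across the warp,
// so the loads are spread over many memory segments.
__global__ void densityAoS(const ParticleAoS* particles, float* density, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    density[i] = particles[i].x + particles[i].y + particles[i].z; // placeholder math
}

// SoA: consecutive threads read consecutive floats, so each field is
// fetched with fully coalesced loads.
__global__ void densitySoA(const float* px, const float* py, const float* pz,
                           float* density, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    density[i] = px[i] + py[i] + pz[i]; // placeholder math
}
```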
Since the framework I use has implemented the fluid simulation on both the CPU and the GPU, I checked their performance. To my surprise, the GPU implementation is beaten by the CPU implementation, even though that CPU implementation is not even optimized with OpenMP...
I checked the code and realized that the framework's original implementation is not computation intensive. In fact, it has significantly shorter loops than mine, but it needs more memory accesses. It has a global vector called the neighbor table that stores the neighbor information of each particle. This neighbor table helps reduce the loop length, but the trade-off is that it increases the memory accesses. That's why the GPU implementation of this method is beaten by the CPU implementation. This result strengthens my idea that the optimization of my GPU implementation should focus on the memory access issue.
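To illustrate the neighbor-table idea, here is a simplified sketch (not the framework's actual code): a flat list of neighbor indices per particle, so the per-particle loop only visits real neighbors, but every iteration pays for an extra indirect memory access.

```cpp
// Simplified sketch of a neighbor table: short loops, extra indirection.
#include <vector>

struct NeighborTable {
    std::vector<int> offsets;   // offsets[i]..offsets[i+1] delimit particle i's neighbors
    std::vector<int> indices;   // flattened neighbor indices for all particles
};

float densityOfParticle(int i, const NeighborTable& table,
                        const std::vector<float>& mass)
{
    float density = 0.0f;
    for (int k = table.offsets[i]; k < table.offsets[i + 1]; ++k) {
        int j = table.indices[k];   // extra indirection = extra memory traffic
        density += mass[j];         // placeholder for mass[j] * W(r_ij, h)
    }
    return density;
}
```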
When profiling and optimizing, you can turn off rendering to get a more apples-to-apples comparison of GPU vs. CPU performance. After tweaking, I expect the GPU to do quite well. Parallel Nsight may be of use to you: http://developer.nvidia.com/nvidia-parallel-nsight