Hi Folks,
I ran some measurements on the ADIOSWriter plugin and wanted to check whether there are reference numbers to validate against. I could only run it on an M2090, so my numbers might not agree, but I still wanted to get a ballpark comparison.
Experiment description:
* -g 128 128 128
* -d 1 1 1
* Single node
Performance data:
* Avg simulation timestep: 2.1 sec
* ADIOSWriter: 338 sec
    * Field: 1.8 sec
    * Species1: 165 sec
        * kernel 'copySpecies': 163.6 sec
    * Species2: 165.1 sec
        * kernel 'copySpecies': 163.6 sec
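To put these numbers in proportion, here is a small arithmetic sketch. It assumes that the Field and Species entries are sub-steps of the ADIOSWriter total and that 'copySpecies' runs once per species; both are my reading of the profile, not something I have confirmed in the code. The constant and function names are just for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Timings in seconds, copied from the performance data above.
constexpr double kTimestep    = 2.1;
constexpr double kAdiosWriter = 338.0;
constexpr double kField       = 1.8;
constexpr double kSpecies1    = 165.0;
constexpr double kSpecies2    = 165.1;
constexpr double kCopyKernel  = 163.6;  // 'copySpecies', assumed once per species

// Share of `part` in `total`, in percent.
double percentOf(double part, double total) {
    assert(total > 0.0);
    return 100.0 * part / total;
}

// Print how the ADIOSWriter time splits up under the assumptions above.
void printBreakdown() {
    const double listed = kField + kSpecies1 + kSpecies2;
    std::printf("listed sub-steps:        %.1f sec\n", listed);
    std::printf("not itemized:            %.1f sec\n", kAdiosWriter - listed);
    std::printf("copySpecies share:       %.1f %%\n",
                percentOf(2.0 * kCopyKernel, kAdiosWriter));
    std::printf("one output in timesteps: %.0f\n", kAdiosWriter / kTimestep);
}
```

By this arithmetic the two 'copySpecies' launches account for roughly 97% of the ADIOSWriter time, and a single output costs about as much as 160 simulation timesteps.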
Questions:
* Why is the destination buffer (deviceFrame) of 'copySpecies' allocated in pinned host memory rather than in device memory?
* Does an execution time of about 163 sec for 'copySpecies' look reasonable for the chosen simulation size, even on an M2090?
* If the source buffer (speciesTmp->getDeviceParticlesBox()) is copied to host memory and a CPU version of 'copySpecies' is run instead, would it be semantically the same?
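To make the last question concrete, this is roughly what I mean by a CPU version. It is only a sketch under a big assumption: that 'copySpecies' gathers the valid particles from fixed-size frames into one contiguous buffer. `Frame`, `FlatParticles`, `gatherOnHost` and the two attributes are hypothetical stand-ins, not the real PIConGPU types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical frame: a fixed number of slots, only some of which
// hold a valid particle. This is NOT the PIConGPU frame type.
constexpr std::size_t kSlotsPerFrame = 4;

struct Frame {
    float positionX[kSlotsPerFrame];
    float momentumX[kSlotsPerFrame];
    bool  valid[kSlotsPerFrame];
};

// Hypothetical contiguous output, one array per attribute.
struct FlatParticles {
    std::vector<float> positionX;
    std::vector<float> momentumX;
};

// Sequential gather on the host: append every valid particle,
// frame by frame and slot by slot, to the flat output arrays.
FlatParticles gatherOnHost(const std::vector<Frame>& frames) {
    FlatParticles out;
    for (const Frame& frame : frames) {
        for (std::size_t slot = 0; slot < kSlotsPerFrame; ++slot) {
            if (!frame.valid[slot]) {
                continue;  // skip empty slots
            }
            out.positionX.push_back(frame.positionX[slot]);
            out.momentumX.push_back(frame.momentumX[slot]);
        }
    }
    // Every attribute array must describe the same set of particles.
    assert(out.positionX.size() == out.momentumX.size());
    return out;
}
```

A loop like this always writes particles in frame order. If the GPU kernel hands out output offsets in a different way (I have not checked), the two versions could produce the same particles in a different order, so "semantically the same" would mean the same up to ordering.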
To see what the copy to host in the last question would cost, I timed the following calls:
    speciesTmp->synchronize();
    cudaDeviceSynchronize();
The measured times are:
Species1 : 369 ms
Species2 : 443 ms
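For reference, the kind of host-side wall-clock timing I have in mind looks like the sketch below. `elapsedMs` and `timeSleepExample` are names made up for this post, and the sleep only stands in for the real work so that the snippet runs without a GPU.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Wall-clock time of `work` in milliseconds, measured on the host.
// `work` must block until it is really finished; for asynchronous GPU
// work that means synchronizing inside the callable.
template <typename Work>
double elapsedMs(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    assert(ms >= 0.0);  // steady_clock never runs backwards
    return ms;
}

// Stand-in workload: sleep for 20 ms instead of copying particle data.
double timeSleepExample() {
    return elapsedMs([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
    });
}
```

In the actual measurement the callable would wrap the two calls shown above, with `cudaDeviceSynchronize()` last so that the timer stops only after the device has finished.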
Thanks,
Anshuman