picongpu-users@hzdr.de Mailing List Archive

Hi Folks,

I ran some measurements on the ADIOSWriter plugin and wanted to check whether there are reference numbers I can validate them against. I could only run on an M2090, so the numbers might not match, but I would still like a ballpark comparison.

Experiment description:

* -g 128 128 128

* -d 1 1 1

* Single node

Performance data:

* Avg simulation timestep : 2.1 sec

* ADIOSWriter : 338 sec

  * Field : 1.8 sec

  * Species1 : 165 sec

    * kernel 'copySpecies' : 163.6 sec

  * Species2 : 165.1 sec

    * kernel 'copySpecies' : 163.6 sec
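A minimal sketch of the arithmetic behind these numbers, assuming the two 'copySpecies' timings are part of the ADIOSWriter total: the two kernel runs (2 x 163.6 sec) account for roughly 97% of the 338 sec. The helper name `share` is hypothetical and is not part of PIConGPU.

```cpp
#include <cassert>

// Hypothetical helper: fraction of `total` that is spent in `part`.
// Both arguments must use the same unit (seconds here).
double share(double part, double total) {
    assert(total > 0.0);
    return part / total;
}

// Share of the ADIOSWriter time spent in both 'copySpecies' kernels,
// using the timings reported above (assumes they are nested in the total).
double copySpeciesShare() {
    const double adiosWriterSec = 338.0;
    const double copySpeciesSec = 163.6;  // per species, two species
    return share(2.0 * copySpeciesSec, adiosWriterSec);
}
```

Calling `copySpeciesShare()` gives the fraction of the writer time that the kernel is responsible for, which is the quantity the questions below are about.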

Questions:

* Why is the destination buffer (deviceFrame) of 'copySpecies' allocated in pinned host memory and not in device memory?

* Does an execution time of about 163 sec for 'copySpecies' look reasonable for the chosen simulation size, even on an M2090?

* If the source buffer (speciesTmp->getDeviceParticlesBox()) is copied to host memory and a CPU version of 'copySpecies' is run instead, would the result be semantically the same?

-----> To estimate the cost of the copy to host memory, I timed the following calls:

speciesTmp->synchronize();

cudaDeviceSynchronize();

The timings are:

Species1 : 369 ms

Species2 : 443 ms
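For anyone who wants to repeat this kind of measurement, here is a minimal host-side sketch of a wall-clock timer. The function name `meanMillis` is hypothetical and is not part of PIConGPU. It assumes the timed work blocks until it is finished, which is why the measured snippet above ends with cudaDeviceSynchronize().

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: run `work` `runs` times and return the mean
// wall-clock time per run in milliseconds.
double meanMillis(const std::function<void()>& work, int runs = 1) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    const auto start = clock::now();
    for (int i = 0; i < runs; ++i) {
        // `work` must not return before it is done; asynchronous GPU work
        // would need a blocking call such as cudaDeviceSynchronize().
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed =
        clock::now() - start;
    return elapsed.count() / runs;
}
```

In the measurement above, the callable passed to the helper would wrap speciesTmp->synchronize(); followed by cudaDeviceSynchronize();.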

Thanks,

Anshuman