From: Huebl, Axel <a.huebl@hzdr.de>
Subject: Re: [PIConGPU-Users] Adios performance data
Date: Thu, 07 May 2015 16:46:26 +0200
To: <picongpu-users@hzdr.de>
Hi Anshuman,


some general questions we need answered with each request:
  - what version of PIConGPU are you running
    (beta-rc6, release-0.1.0, latest dev)?
  - what system/compilers/third-party libraries are you using?

Now let's go through your questions as far as we can.

> * Why is the destination buffer (deviceFrame) of 'copySpecies' alloc'd
> on host pinned memory and not on device memory?

In the current development (dev) version of PIConGPU, we no longer use
double-buffering for particles. Instead, we now use *mapped* memory for
host-side access, e.g., for dumps.

Before that, we used *pinned* memory on the host side as double
buffers, filled via asynchronous copies, for the operations mentioned
above.
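
For illustration only, a minimal CUDA sketch of the two variants (this
is not PIConGPU code; numBytes and stream are placeholders):

  // (a) mapped ("zero-copy") memory: one page-locked allocation that both
  //     host and device can access, so no explicit copy is needed for dumps
  //     (older setups may need cudaSetDeviceFlags(cudaDeviceMapHost) first)
  float* hostBuf;
  float* deviceAlias;
  cudaHostAlloc( (void**)&hostBuf, numBytes, cudaHostAllocMapped );
  cudaHostGetDevicePointer( (void**)&deviceAlias, hostBuf, 0 );
  // kernels write through deviceAlias, the host later reads hostBuf

  // (b) pinned host memory as a double buffer: device data is staged to
  //     the host via an asynchronous copy
  float* deviceBuf;
  float* pinnedBuf;
  cudaMalloc( (void**)&deviceBuf, numBytes );
  cudaHostAlloc( (void**)&pinnedBuf, numBytes, cudaHostAllocDefault );
  cudaMemcpyAsync( pinnedBuf, deviceBuf, numBytes,
                   cudaMemcpyDeviceToHost, stream );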

> *  If the source buffer (speciesTmp->getDeviceParticlesBox()) is
> copied to host memory and a CPU version of 'copySpecies' is run
> instead, would it be same semantically?

The speciesTmp->getDeviceParticlesBox() is a kind of iterator pointing
to non-contiguous memory on the device (for each super-cell it holds
the doubly linked list of particle frames).

Due to the nested structure of device pointers in that object, a simple
copy to the host is not possible without leaving the pointers in an
undefined state (pointer values that are valid on the device do not map
to valid addresses in the host's RAM).
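
As a toy example of the problem (this is not the real ParticlesBox;
deviceSuperCell is a made-up device pointer):

  struct Frame;            // a chunk of particle attributes (details omitted)

  struct SuperCell
  {
      Frame* firstFrame;   // device pointer into a linked list of frames
      Frame* lastFrame;    // device pointer
  };

  // copying the struct to the host copies only the *pointer values*; they
  // still refer to device addresses, so dereferencing them on the host is
  // undefined -- a usable host copy would require a deep copy that
  // re-links every frame
  SuperCell hostCopy;
  cudaMemcpy( &hostCopy, deviceSuperCell, sizeof(SuperCell),
              cudaMemcpyDeviceToHost );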

> * Does the 163sec of execution time of 'copySpecies' for the chosen
> simulation size look reasonable even for an M2090?

That's pretty hard to judge from the provided information.

How many time steps did you run?
How often did you dump (adios.period)?
How many particles did you use in total?
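
Those numbers follow directly from the invocation; something along the
following lines (the step count and period here are made up, and I am
assuming the usual command-line spelling of the plugin option) would run
1000 steps and trigger 10 ADIOS dumps:

  picongpu -d 1 1 1 -g 128 128 128 -s 1000 --adios.period 100

The total time spent in copySpecies scales with the number of dumps, so
these values matter for judging the 163 sec.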

Also, the M2090 ("Fermi") generation is pretty old, so the performance
of the Fermi implementation for mapped memory may no longer be
comparable to modern hardware ("Kepler" has been out since Nov 2012).
Please use modern hardware for benchmarks; we will not optimize for
that generation (even though we still support general operation on it).

We have not benchmarked `copySpecies` in detail yet, but since we only
use it on Kepler, the "163.6 sec" sounds a bit unrealistic (or way
off). Can you compare it against the Kepler cards in your node, please?

> -----> To do the above, I measured the following -
>            speciesTmp->synchronize();
>            cudaDeviceSynchronize();
>         It comes to -
>              Species1 : 369 ms
>              Species2 : 443 ms

The
  speciesTmp->synchronize();
call has no implementation: it would usually synchronize data from
device to host, but it is not implemented for particles (due to their
nested memory structure, see above).

So basically what you measured is the time of a
  cudaDeviceSynchronize();
and *all the kernels that were still running* before that, which can be
anything.

Also: before you measure kernels, always run a
  cudaDeviceSynchronize();
otherwise you include the unpredictable load of asynchronous kernels
that you did not intend to measure ;)
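
A minimal sketch of such a measurement with CUDA events (not PIConGPU
code; myKernel, grid and block are placeholders):

  cudaEvent_t start, stop;
  cudaEventCreate( &start );
  cudaEventCreate( &stop );

  cudaDeviceSynchronize();                  // drain kernels still in flight
  cudaEventRecord( start );
  myKernel<<< grid, block >>>( /* ... */ ); // the kernel you want to measure
  cudaEventRecord( stop );
  cudaEventSynchronize( stop );             // wait until this kernel finished

  float ms = 0.0f;
  cudaEventElapsedTime( &ms, start, stop );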

Nevertheless, we did measure the overall speedup of ADIOS compared to
HDF5 and saw significant gains when running on several hundred
GPUs/nodes, dumping around 6 GB from each (per adios.period).


Best regards,
Axel & René

On 06.05.2015 16:35, Anshuman Goswami wrote:
> Hi Folks,
>
> I ran some measurements on the ADIOSWriter plugin and wanted to check if
> there are some reference numbers to validate against. I could only run
> it on a M2090 so numbers might not agree but still wanted to get a
> ballpark comparison.
>
> Experiment description:
> * -g 128 128 128
> * -d 1 1 1
> * Single node
>
> Performance data:
> * Avg simulation timestep : 2.1sec
> * ADIOSWriter : 338sec
>     * Field : 1.8sec
>     * Species1 : 165sec
>         * kernel 'copySpecies' : 163.6sec
>     * Species2 : 165.1sec
>         * kernel 'copySpecies' : 163.6sec
>
> Questions:
> * Why is the destination buffer (deviceFrame) of 'copySpecies' alloc'd
> on host pinned memory and not on device memory?
> * Does the 163sec of execution time of 'copySpecies' for the chosen
> simulation size look reasonable even for an M2090?
> *  If the source buffer (speciesTmp->getDeviceParticlesBox()) is copied
> to host memory and a CPU version of 'copySpecies' is run instead, would
> it be same semantically?
> -----> To do the above, I measured the following -
>            speciesTmp->synchronize();
>            cudaDeviceSynchronize();
>         It comes to -
>              Species1 : 369 ms
>              Species2 : 443 ms
>
> Thanks,
> Anshuman

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
          Prof. Dr.Dr.h.c. P. Joehnk
VR 1693 beim Amtsgericht Dresden
