Mailing List picongpu-users@hzdr.de Message #12
From: Huebl, Axel <a.huebl@hzdr.de>
Subject: Re: [PIConGPU-Users] Adios performance data
Date: Thu, 07 May 2015 17:27:05 +0200
To: <picongpu-users@hzdr.de>
Hi Anshuman,


the "copySpecies" kernel basically *is* a deep-copy.
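
A rough conceptual sketch of what "deep copy" means here (made-up types,
not the actual PIConGPU kernel): walk the per-supercell frame lists on the
device and copy the particle data element by element into a flat
destination buffer, instead of copying one contiguous block.

  struct Particle { float pos[3]; float mom[3]; };
  struct Frame    { Particle particles[256]; int numParticles; Frame *next; };

  __global__ void deepCopySketch(Frame *firstFrame, Particle *dst, int *counter)
  {
      for (Frame *f = firstFrame; f != nullptr; f = f->next)
          for (int i = threadIdx.x; i < f->numParticles; i += blockDim.x)
          {
              int out = atomicAdd(counter, 1); // claim the next free slot in dst
              dst[out] = f->particles[i];      // element-wise copy of one particle
          }
  }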


Best,
Axel

On 07.05.2015 17:20, Anshuman Goswami wrote:
> Thanks a ton :+1:
> Getting back with the environment details shortly....
>
> Key points, I think, are (please correct me where I am wrong) -
>
>   * Need to rerun on the K40
>   * Perform a deep copy of the species data, otherwise the copy doesn't
>     make sense

On 07.05.2015 16:46, Huebl, Axel wrote:
> Hi Anshuman,
>
>
> some general questions we need answered with each request:
>   - what version of PIConGPU are you running
>     (beta-rc6, release-0.1.0, latest dev?)
>   - what system/compilers/third-party libraries are you using?
>
> Now let's start with your questions as far as we get.
>
>> * Why is the destination buffer (deviceFrame) of 'copySpecies' alloc'd
>> on host pinned memory and not on device memory?
>
> In the current development (dev) version of PIConGPU, we do not use
> double-buffering for particles any more. Instead, we nowadays use
> *mapped* memory for access from the host side, e.g., for dumps.
>
> Before that, we used *pinned* memory on the host side for asynchronous
> copies that were used as double-buffers for the previously mentioned
> operations.
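>
> A minimal sketch of the two allocation flavours in plain CUDA (not
> PIConGPU's actual buffer classes; names and sizes are made up):
>
>   // (a) *mapped* (zero-copy) host memory: kernels access host RAM
>   //     directly through a device-side alias of the same allocation
>   cudaSetDeviceFlags(cudaDeviceMapHost);  // enable host-memory mapping
>   size_t numBytes = 1 << 20;
>   float *hostMapped = nullptr, *devAlias = nullptr;
>   cudaHostAlloc((void**)&hostMapped, numBytes, cudaHostAllocMapped);
>   cudaHostGetDevicePointer((void**)&devAlias, hostMapped, 0);
>   // a kernel writing through devAlias ends up filling host RAM
>
>   // (b) *pinned* host memory as a double-buffer: a separate device
>   //     buffer is copied asynchronously into page-locked host memory
>   float *hostPinned = nullptr, *devBuffer = nullptr;
>   cudaStream_t stream;
>   cudaStreamCreate(&stream);
>   cudaMallocHost((void**)&hostPinned, numBytes);
>   cudaMalloc((void**)&devBuffer, numBytes);
>   cudaMemcpyAsync(hostPinned, devBuffer, numBytes,
>                   cudaMemcpyDeviceToHost, stream);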
>
>> *  If the source buffer (speciesTmp->getDeviceParticlesBox()) is
>> copied to host memory and a CPU version of 'copySpecies' is run
>> instead, would it be semantically the same?
>
> The speciesTmp->getDeviceParticlesBox() is kind-of an iterator pointing
> to non-contiguous memory on the device (for each super-cell it contains
> the doubly-linked list of particle frames).
>
> Due to the nested structure of device pointers we use in that object,
> simple copies to host scope are not possible without leaving the
> pointers in an undefined state (since pointer values on the device are
> not valid in the address range of the host RAM).
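>
> A tiny illustration (made-up types, not the real ParticlesBox) of why a
> flat copy is not enough:
>
>   struct Frame     { float momentum[256]; Frame *next; }; // 'next' is a device address
>   struct SuperCell { Frame *firstFrame; };                // device address as well
>
>   // assume 'deviceSuperCell' was cudaMalloc'd and filled by a kernel
>   SuperCell *deviceSuperCell = nullptr;
>   SuperCell sc;
>   cudaMemcpy(&sc, deviceSuperCell, sizeof(SuperCell), cudaMemcpyDeviceToHost);
>
>   // sc.firstFrame still holds a *device* address: dereferencing it on
>   // the host is undefined. Each frame (and its whole 'next' chain)
>   // would have to be chased and copied separately, i.e. a deep copy.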
>
>> * Does the 163sec of execution time of 'copySpecies' for the chosen
>> simulation size look reasonable even for an M2090?
>
> That's pretty hard to judge from the provided information.
>
> How many time steps did you run?
> How often did you dump (adios.period)?
> How many particles did you use in total?
>
> Also, the M2090 ("Fermi") generation is "pretty old", meaning the
> performance of the Fermi implementation for mapped memory might no
> longer be comparable to modern hardware ("Kepler" has been out since
> Nov/2012). Please use modern hardware for benchmarks; we won't
> optimize for that generation (even though we still support general
> operation on it).
>
> We have not benchmarked `copySpecies` in detail yet, but since we only
> use it on Kepler, the "163.6 sec" sounds a bit unrealistic (or way
> off). Can you compare it on the Kepler cards in your node, please?
>
>> -----> To do the above, I measured the following -
>>            speciesTmp->synchronize();
>>            cudaDeviceSynchronize();
>>         It comes to -
>>              Species1 : 369 ms
>>              Species2 : 443 ms
>
> The
>   speciesTmp->synchronize();
> call is not implemented for particles - it would usually synchronize
> data from device to host, but due to their nested memory structure
> (see above) it does nothing here.
>
> So basically what you measured is the time of a
>   cudaDeviceSynchronize();
> and *all the kernels that were still running* before that, which can be
> anything.
>
> Also: before you measure kernels, always run a
>   cudaDeviceSynchronize();
> else you include the unpredictable load of asynchronous kernels that
> you did not intend to measure ;)
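>
> A generic measurement pattern along these lines (plain CUDA events,
> not PIConGPU-specific; 'myKernel', 'grid' and 'block' are placeholders):
>
>   cudaDeviceSynchronize();              // drain everything launched earlier
>
>   cudaEvent_t start, stop;
>   cudaEventCreate(&start);
>   cudaEventCreate(&stop);
>
>   cudaEventRecord(start);
>   myKernel<<<grid, block>>>(/* ... */); // only the kernel under test
>   cudaEventRecord(stop);
>   cudaEventSynchronize(stop);
>
>   float ms = 0.f;
>   cudaEventElapsedTime(&ms, start, stop); // time of *this* kernel only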
>
> Nevertheless, we measured the overall speedup of ADIOS compared to
> HDF5 and saw significant gains when running on several hundred
> GPUs/nodes, dumping around 6 GB from each (per adios.period).
>
>
> Best regards,
> Axel & René
>
> On 06.05.2015 16:35, Anshuman Goswami wrote:
>> Hi Folks,
>>
>> I ran some measurements on the ADIOSWriter plugin and wanted to check if
>> there are some reference numbers to validate against. I could only run
>> it on an M2090, so numbers might not agree, but I still wanted to get a
>> ballpark comparison.
>>
>> Experiment description:
>> * -g 128 128 128
>> * -d 1 1 1
>> * Single node
>>
>> Performance data:
>> * Avg simulation timestep : 2.1sec
>> * ADIOSWriter : 338sec
>>     * Field : 1.8sec
>>     * Species1 : 165sec
>>         * kernel 'copySpecies' : 163.6sec
>>     * Species2 : 165.1sec
>>         * kernel 'copySpecies' : 163.6sec
>>
>> Questions:
>> * Why is the destination buffer (deviceFrame) of 'copySpecies' alloc'd
>> on host pinned memory and not on device memory?
>> * Does the 163sec of execution time of 'copySpecies' for the chosen
>> simulation size look reasonable even for an M2090?
>> *  If the source buffer (speciesTmp->getDeviceParticlesBox()) is copied
>> to host memory and a CPU version of 'copySpecies' is run instead, would
>> it be semantically the same?
>> -----> To do the above, I measured the following -
>>            speciesTmp->synchronize();
>>            cudaDeviceSynchronize();
>>         It comes to -
>>              Species1 : 369 ms
>>              Species2 : 443 ms
>>
>> Thanks,
>> Anshuman
>

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
          Prof. Dr.Dr.h.c. P. Joehnk
VR 1693 beim Amtsgericht Dresden
