Mailing List picongpu-users@hzdr.de Message #13
From: Huebl, Axel <a.huebl@hzdr.de>
Subject: Re: Adios performance data
Date: Fri, 08 May 2015 09:20:50 +0200
To: <picongpu-users@hzdr.de>
Due to missing headers and reply problems on the list (reported to
listmaster, will be fixed soon), we will continue this thread on GitHub:

  https://github.com/ComputationalRadiationPhysics/picongpu/issues/858

Axel

On 07.05.2015 17:27, Huebl, Axel wrote:
> Hi Anshuman,
>
>
> the "copySpecies" kernel basically *is* a deep-copy.
>
>
> Best,
> Axel
>
> On 07.05.2015 17:20, Anshuman Goswami wrote:
>> Thanks a ton :+1:
>> Getting back with the environment details shortly....
>>
>> Key points, I think, are (please correct me where I am wrong):
>>
>>   * Need to rerun on the K40
>>   * Perform a deep copy of the species data otherwise the copy doesn't
>>     make sense
>
> On 07.05.2015 16:46, Huebl, Axel wrote:
>> Hi Anshuman,
>>
>>
>> some general questions we need with each request:
>>   - what version of PIConGPU are you running
>>     (beta-rc6, release-0.1.0, latest dev?)
>>   - what system/compilers/third-party libraries are you using?
>>
>> Now let's start with your questions as far as we get.
>>
>>> * Why is the destination buffer (deviceFrame) of 'copySpecies' alloc'd
>>> on host pinned memory and not on device memory?
>>
>> In the current development (dev) version of PIConGPU, we do not use
>> double-buffering for particles any more. For that, we nowadays use
>> *mapped* memory for access from the host side, e.g., for dumps.
>>
>> Before that, we used *pinned* memory on the host side for asynchronous
>> copies that were used as double-buffers for the previously mentioned
>> operations.
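>>
>> A rough sketch of the difference (a hypothetical minimal example, not
>> PIConGPU source) - both buffers are page-locked, but only the mapped
>> one is addressable from device code:

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;

    // Pinned host buffer: page-locked, enables cudaMemcpyAsync - the
    // basis of the old double-buffering scheme.
    float* pinned = nullptr;
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);

    // Mapped (zero-copy) host buffer: page-locked *and* accessible from
    // kernels via a device-side alias - what dev uses for dumps.
    float* mapped = nullptr;
    cudaHostAlloc(&mapped, bytes, cudaHostAllocMapped);
    float* mappedDevPtr = nullptr;
    cudaHostGetDevicePointer(&mappedDevPtr, mapped, 0);

    cudaFreeHost(pinned);
    cudaFreeHost(mapped);
    return 0;
}
```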
>>
>>> * If the source buffer (speciesTmp->getDeviceParticlesBox()) is
>>> copied to host memory and a CPU version of 'copySpecies' is run
>>> instead, would it be semantically the same?
>>
>> The speciesTmp->getDeviceParticlesBox() is kind of an iterator pointing
>> to non-contiguous memory on the device (it contains, for each super-cell,
>> the doubly-linked list of frames of particles).
>>
>> Due to the nested structure of device pointers we use in that object,
>> simple copies to host scope are not possible without leaving the
>> pointers in an undefined state (pointer values valid in device memory
>> do not refer to valid addresses in host RAM).
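>>
>> A minimal sketch (simplified, not actual PIConGPU source) of why such a
>> flat copy breaks - the struct names here are hypothetical:

```cuda
// A frame list on the device stores device pointers; copying the head
// struct to the host copies those pointer *values*, which still address
// device memory.
struct Frame
{
    float* data;      // device pointer to particle attributes
    Frame* previous;  // device pointers forming the doubly-linked list
    Frame* next;
};

// Shallow copy to the host:
//   Frame h;
//   cudaMemcpy(&h, dFramePtr, sizeof(Frame), cudaMemcpyDeviceToHost);
//
// h.data / h.next now hold device addresses - dereferencing them on the
// host is undefined. A valid host-side copy must walk the list and
// deep-copy each frame's payload, which is what the `copySpecies` kernel
// effectively does.
```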
>>
>>> * Does the 163sec of execution time of 'copySpecies' for the chosen
>>> simulation size look reasonable even for an M2090?
>>
>> That's pretty hard to judge from the provided information.
>>
>> How many time steps did you run?
>> How often did you dump (adios.period)?
>> How many particles did you use in total?
>>
>> Also, the M2090 ("Fermi") generation is pretty old, so the performance
>> of the Fermi implementation of mapped memory may no longer be
>> comparable to modern hardware ("Kepler" has been out since Nov 2012).
>> Please use modern hardware for benchmarks; we won't optimize for that
>> generation (even though we still support general operation on it).
>>
>> We did not benchmark `copySpecies` in detail yet, but since we only use
>> it on Kepler, the "163.6sec" sounds a bit unrealistic (or way off). Can
>> you compare it against the Kepler cards in your node, please?
>>
>>> -----> To do the above, I measured the following -
>>>            speciesTmp->synchronize();
>>>            cudaDeviceSynchronize();
>>>         It comes to -
>>>              Species1 : 369 ms
>>>              Species2 : 443 ms
>>
>> The
>>   speciesTmp->synchronize();
>> call has no implementation - it would usually synchronize data from
>> device to host, but it is not implemented for particles (due to their
>> nested memory structure, see above).
>>
>> So basically what you measured is the time of a
>>   cudaDeviceSynchronize();
>> and *all the kernels that were still running* before that, which can be
>> anything.
>>
>> Also: before you measure kernels, always run a
>>   cudaDeviceSynchronize();
>> first, otherwise you also pick up the unpredictable load of
>> asynchronous kernels that you didn't intend to measure ;)
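>>
>> A sketch of a clean measurement (a hypothetical minimal example; the
>> kernel name is made up) - drain pending work first, then bracket only
>> the kernel of interest with CUDA events:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel() {}

int main()
{
    cudaDeviceSynchronize();          // flush previously queued kernels

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);       // wait only for this kernel

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```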
>>
>> Nevertheless, we compared the overall throughput of HDF5 and ADIOS and
>> found significant speedups when running on several hundred GPUs/nodes,
>> dumping around 6 GB from each (per adios.period).
>>
>>
>> Best regards,
>> Axel & René
>>
>> On 06.05.2015 16:35, Anshuman Goswami wrote:
>>> Hi Folks,
>>>
>>> I ran some measurements on the ADIOSWriter plugin and wanted to check if
>>> there are some reference numbers to validate against. I could only run
>>> it on a M2090 so numbers might not agree but still wanted to get a
>>> ballpark comparison.
>>>
>>> Experiment description:
>>> * -g 128 128 128
>>> * -d 1 1 1
>>> * Single node
>>>
>>> Performance data:
>>> * Avg simulation timestep : 2.1sec
>>> * ADIOSWriter : 338sec
>>>     * Field : 1.8sec
>>>     * Species1 : 165sec
>>>         * kernel 'copySpecies' : 163.6sec
>>>     * Species2 : 165.1sec
>>>         * kernel 'copySpecies' : 163.6sec
>>>
>>> Questions:
>>> * Why is the destination buffer (deviceFrame) of 'copySpecies' alloc'd
>>> on host pinned memory and not on device memory?
>>> * Does the 163sec of execution time of 'copySpecies' for the chosen
>>> simulation size look reasonable even for an M2090?
>>> * If the source buffer (speciesTmp->getDeviceParticlesBox()) is copied
>>> to host memory and a CPU version of 'copySpecies' is run instead, would
>>> it be semantically the same?
>>> -----> To do the above, I measured the following -
>>>            speciesTmp->synchronize();
>>>            cudaDeviceSynchronize();
>>>         It comes to -
>>>              Species1 : 369 ms
>>>              Species2 : 443 ms
>>>
>>> Thanks,
>>> Anshuman
>>
>

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
          Prof. Dr.Dr.h.c. P. Joehnk
VR 1693 beim Amtsgericht Dresden
