Mailing List picongpu-users@hzdr.de Message #12
From: Huebl, Axel <a.huebl@hzdr.de>
Subject: Re: [PIConGPU-Users] Adios performance data
Date: Thu, 07 May 2015 17:27:05 +0200
To: <picongpu-users@hzdr.de>
Hi Anshuman,


the "copySpecies" kernel basically *is* a deep-copy.
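
A rough conceptual sketch of what "deep copy" means here (made-up types,
not the actual PIConGPU kernel): walk the per-supercell frame lists on the
device and copy the particle data element by element into a flat
destination buffer, instead of copying one contiguous block.

  struct Particle { float pos[3]; float mom[3]; };
  struct Frame    { Particle particles[256]; int numParticles; Frame *next; };

  __global__ void deepCopySketch(Frame *firstFrame, Particle *dst, int *counter)
  {
      for (Frame *f = firstFrame; f != nullptr; f = f->next)
          for (int i = threadIdx.x; i < f->numParticles; i += blockDim.x)
          {
              int out = atomicAdd(counter, 1); // claim the next free slot in dst
              dst[out] = f->particles[i];      // element-wise copy of one particle
          }
  }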


Best,
Axel

On 07.05.2015 17:20, Anshuman Goswami wrote:
> Thanks a ton :+1:
> Getting back with the environment details shortly....
>
> Key points, I think, are (please correct me where I am wrong) -
>
>   * Need to rerun on the K40
>   * Perform a deep copy of the species data, otherwise the copy doesn't
>     make sense

On 07.05.2015 16:46, Huebl, Axel wrote:
> Hi Anshuman,
>
>
> some general questions we need answered with each request:
>   - what version of PIConGPU are you running
>     (beta-rc6, release-0.1.0, latest dev?)
>   - what system/compilers/third-party libraries are you using?
>
> Now let's start with your questions as far as we get.
>
>> * Why is the destination buffer (deviceFrame) of 'copySpecies' alloc'd
>> on host pinned memory and not on device memory?
>
> In the current development (dev) version of PIConGPU, we do not use
> double-buffering for particles any more. Instead, we nowadays use
> *mapped* memory for access from the host side, e.g., for dumps.
>
> Before that, we used *pinned* memory on the host side for asynchronous
> copies that were used as double-buffers for the previously mentioned
> operations.
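>
> A minimal sketch of the two allocation flavours in plain CUDA (not
> PIConGPU's actual buffer classes; names and sizes are made up):
>
>   // (a) *mapped* (zero-copy) host memory: kernels access host RAM
>   //     directly through a device-side alias of the same allocation
>   cudaSetDeviceFlags(cudaDeviceMapHost);  // enable host-memory mapping
>   size_t numBytes = 1 << 20;
>   float *hostMapped = nullptr, *devAlias = nullptr;
>   cudaHostAlloc((void**)&hostMapped, numBytes, cudaHostAllocMapped);
>   cudaHostGetDevicePointer((void**)&devAlias, hostMapped, 0);
>   // a kernel writing through devAlias ends up filling host RAM
>
>   // (b) *pinned* host memory as a double-buffer: a separate device
>   //     buffer is copied asynchronously into page-locked host memory
>   float *hostPinned = nullptr, *devBuffer = nullptr;
>   cudaStream_t stream;
>   cudaStreamCreate(&stream);
>   cudaMallocHost((void**)&hostPinned, numBytes);
>   cudaMalloc((void**)&devBuffer, numBytes);
>   cudaMemcpyAsync(hostPinned, devBuffer, numBytes,
>                   cudaMemcpyDeviceToHost, stream);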
>
>> *  If the source buffer (speciesTmp->getDeviceParticlesBox()) is
>> copied to host memory and a CPU version of 'copySpecies' is run
>> instead, would it be semantically the same?
>
> The speciesTmp->getDeviceParticlesBox() is kind-of an iterator pointing
> to non-contiguous memory on the device (for each super-cell it contains
> the doubly-linked list of particle frames).
>
> Due to the nested structure of device pointers we use in that object,
> simple copies to host scope are not possible without leaving the
> pointers in an undefined state (since pointer values on the device are
> not valid in the address range of the host RAM).
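>
> A tiny illustration (made-up types, not the real ParticlesBox) of why a
> flat copy is not enough:
>
>   struct Frame     { float momentum[256]; Frame *next; }; // 'next' is a device address
>   struct SuperCell { Frame *firstFrame; };                // device address as well
>
>   // assume 'deviceSuperCell' was cudaMalloc'd and filled by a kernel
>   SuperCell *deviceSuperCell = nullptr;
>   SuperCell sc;
>   cudaMemcpy(&sc, deviceSuperCell, sizeof(SuperCell), cudaMemcpyDeviceToHost);
>
>   // sc.firstFrame still holds a *device* address: dereferencing it on
>   // the host is undefined. Each frame (and its whole 'next' chain)
>   // would have to be chased and copied separately, i.e. a deep copy.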
>
>> * Does the 163sec of execution time of 'copySpecies' for the chosen
>> simulation size look reasonable even for an M2090?
>
> That's pretty hard to judge from the provided information.
>
> How many time steps did you run?
> How often did you dump (adios.period)?
> How many particles did you use in total?
>
> Also, the M2090 ("Fermi") generation is "pretty old", meaning the
> performance of the Fermi implementation for mapped memory might no
> longer be comparable to modern hardware ("Kepler" has been out since
> Nov/2012). Please use modern hardware for benchmarks; we won't
> optimize for that generation (even though we still support general
> operation on it).
>
> We have not benchmarked `copySpecies` in detail yet, but since we only
> use it on Kepler, the "163.6 sec" sounds a bit unrealistic (or way
> off). Can you compare it on the Kepler cards in your node, please?
>
>> -----> To do the above, I measured the following -
>>            speciesTmp->synchronize();
>>            cudaDeviceSynchronize();
>>         It comes to -
>>              Species1 : 369 ms
>>              Species2 : 443 ms
>
> The
>   speciesTmp->synchronize();
> call is not implemented for particles - it would usually synchronize
> data from device to host, but due to their nested memory structure
> (see above) it does nothing here.
>
> So basically what you measured is the time of a
>   cudaDeviceSynchronize();
> and *all the kernels that were still running* before that, which can be
> anything.
>
> Also: before you measure kernels, always run a
>   cudaDeviceSynchronize();
> else you include the unpredictable load of asynchronous kernels that
> you did not intend to measure ;)
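>
> A generic measurement pattern along these lines (plain CUDA events,
> not PIConGPU-specific; 'myKernel', 'grid' and 'block' are placeholders):
>
>   cudaDeviceSynchronize();              // drain everything launched earlier
>
>   cudaEvent_t start, stop;
>   cudaEventCreate(&start);
>   cudaEventCreate(&stop);
>
>   cudaEventRecord(start);
>   myKernel<<<grid, block>>>(/* ... */); // only the kernel under test
>   cudaEventRecord(stop);
>   cudaEventSynchronize(stop);
>
>   float ms = 0.f;
>   cudaEventElapsedTime(&ms, start, stop); // time of *this* kernel only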
>
> Nevertheless, we measured the overall speedup of ADIOS compared to
> HDF5 and saw significant gains when running on several hundred
> GPUs/nodes, dumping around 6 GB from each (per adios.period).
>
>
> Best regards,
> Axel & René
>
> On 06.05.2015 16:35, Anshuman Goswami wrote:
>> Hi Folks,
>>
>> I ran some measurements on the ADIOSWriter plugin and wanted to check if
>> there are some reference numbers to validate against. I could only run
>> it on an M2090, so numbers might not agree, but I still wanted to get a
>> ballpark comparison.
>>
>> Experiment description:
>> * -g 128 128 128
>> * -d 1 1 1
>> * Single node
>>
>> Performance data:
>> * Avg simulation timestep : 2.1sec
>> * ADIOSWriter : 338sec
>>     * Field : 1.8sec
>>     * Species1 : 165sec
>>         * kernel 'copySpecies' : 163.6sec
>>     * Species2 : 165.1sec
>>         * kernel 'copySpecies' : 163.6sec
>>
>> Questions:
>> * Why is the destination buffer (deviceFrame) of 'copySpecies' alloc'd
>> on host pinned memory and not on device memory?
>> * Does the 163sec of execution time of 'copySpecies' for the chosen
>> simulation size look reasonable even for an M2090?
>> *  If the source buffer (speciesTmp->getDeviceParticlesBox()) is copied
>> to host memory and a CPU version of 'copySpecies' is run instead, would
>> it be semantically the same?
>> -----> To do the above, I measured the following -
>>            speciesTmp->synchronize();
>>            cudaDeviceSynchronize();
>>         It comes to -
>>              Species1 : 369 ms
>>              Species2 : 443 ms
>>
>> Thanks,
>> Anshuman
>

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
          Prof. Dr.Dr.h.c. P. Joehnk
VR 1693 beim Amtsgericht Dresden
