Mailing List picongpu-users@hzdr.de Message #13
From: Huebl, Axel <a.huebl@hzdr.de>
Subject: Re: Adios performance data
Date: Fri, 08 May 2015 09:20:50 +0200
To: <picongpu-users@hzdr.de>
Due to missing headers and reply problems on the list (reported to
listmaster, will be fixed soon), we will continue this thread on GitHub:

  https://github.com/ComputationalRadiationPhysics/picongpu/issues/858

Axel

On 07.05.2015 17:27, Huebl, Axel wrote:
> Hi Anshuman,
>
>
> the "copySpecies" kernel basically *is* a deep-copy.
>
>
> Best,
> Axel
>
> On 07.05.2015 17:20, Anshuman Goswami wrote:
>> Thanks a ton :+1:
>> Getting back with the environment details shortly....
>>
>> Key points, I think, are (please correct me where I am wrong):
>>
>>   * Need to rerun on the K40
>>   * Perform a deep copy of the species data otherwise the copy doesn't
>>     make sense
>
> On 07.05.2015 16:46, Huebl, Axel wrote:
>> Hi Anshuman,
>>
>>
>> some general questions we need with each request:
>>   - what version of PIConGPU are you running
>>     (beta-rc6, release-0.1.0, latest dev?)
>>   - what system/compilers/third-party libraries are you using?
>>
>> Now let's start with your questions as far as we get.
>>
>>> * Why is the destination buffer (deviceFrame) of 'copySpecies' alloc'd
>>> on host pinned memory and not on device memory?
>>
>> In the current development (dev) version of PIConGPU, we do not use
>> double-buffering for particles any more. For that, we nowadays use
>> *mapped* memory for access from the host side, e.g., for dumps.
>>
>> Before that, we used *pinned* memory on the host side for asynchronous
>> copies that were used as double-buffers for the previously mentioned
>> operations.
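>>
>> A rough sketch of the difference (a hypothetical minimal example, not
>> PIConGPU source) - both buffers are page-locked, but only the mapped
>> one is addressable from device code:

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;

    // Pinned host buffer: page-locked, enables cudaMemcpyAsync - the
    // basis of the old double-buffering scheme.
    float* pinned = nullptr;
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);

    // Mapped (zero-copy) host buffer: page-locked *and* accessible from
    // kernels via a device-side alias - what dev uses for dumps.
    float* mapped = nullptr;
    cudaHostAlloc(&mapped, bytes, cudaHostAllocMapped);
    float* mappedDevPtr = nullptr;
    cudaHostGetDevicePointer(&mappedDevPtr, mapped, 0);

    cudaFreeHost(pinned);
    cudaFreeHost(mapped);
    return 0;
}
```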
>>
>>> * If the source buffer (speciesTmp->getDeviceParticlesBox()) is
>>> copied to host memory and a CPU version of 'copySpecies' is run
>>> instead, would it be semantically the same?
>>
>> The speciesTmp->getDeviceParticlesBox() is kind of an iterator pointing
>> to non-contiguous memory on the device (it contains, for each super-cell,
>> the doubly-linked list of frames of particles).
>>
>> Due to the nested structure of device pointers we use in that object,
>> simple copies to host scope are not possible without leaving the
>> pointers in an undefined state (pointer values valid in device memory
>> do not refer to valid addresses in host RAM).
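>>
>> A minimal sketch (simplified, not actual PIConGPU source) of why such a
>> flat copy breaks - the struct names here are hypothetical:

```cuda
// A frame list on the device stores device pointers; copying the head
// struct to the host copies those pointer *values*, which still address
// device memory.
struct Frame
{
    float* data;      // device pointer to particle attributes
    Frame* previous;  // device pointers forming the doubly-linked list
    Frame* next;
};

// Shallow copy to the host:
//   Frame h;
//   cudaMemcpy(&h, dFramePtr, sizeof(Frame), cudaMemcpyDeviceToHost);
//
// h.data / h.next now hold device addresses - dereferencing them on the
// host is undefined. A valid host-side copy must walk the list and
// deep-copy each frame's payload, which is what the `copySpecies` kernel
// effectively does.
```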
>>
>>> * Does the 163sec of execution time of 'copySpecies' for the chosen
>>> simulation size look reasonable even for an M2090?
>>
>> That's pretty hard to judge from the provided information.
>>
>> How many time steps did you run?
>> How often did you dump (adios.period)?
>> How many particles did you use in total?
>>
>> Also, the M2090 ("Fermi") generation is pretty old, so the performance
>> of the Fermi implementation of mapped memory may no longer be
>> comparable to modern hardware ("Kepler" has been out since Nov 2012).
>> Please use modern hardware for benchmarks; we won't optimize for that
>> generation (even though we still support general operation on it).
>>
>> We did not benchmark `copySpecies` in detail yet, but since we only use
>> it on Kepler, the "163.6sec" sounds a bit unrealistic (or way off). Can
>> you compare it against the Kepler cards in your node, please?
>>
>>> -----> To do the above, I measured the following -
>>>            speciesTmp->synchronize();
>>>            cudaDeviceSynchronize();
>>>         It comes to -
>>>              Species1 : 369 ms
>>>              Species2 : 443 ms
>>
>> The
>>   speciesTmp->synchronize();
>> call has no implementation - it would usually synchronize data from
>> device to host, but it is not implemented for particles (due to their
>> nested memory structure, see above).
>>
>> So basically what you measured is the time of a
>>   cudaDeviceSynchronize();
>> and *all the kernels that were still running* before that, which can be
>> anything.
>>
>> Also: before you measure kernels, always run a
>>   cudaDeviceSynchronize();
>> first, otherwise you also pick up the unpredictable load of
>> asynchronous kernels that you didn't intend to measure ;)
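>>
>> A sketch of a clean measurement (a hypothetical minimal example; the
>> kernel name is made up) - drain pending work first, then bracket only
>> the kernel of interest with CUDA events:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel() {}

int main()
{
    cudaDeviceSynchronize();          // flush previously queued kernels

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);       // wait only for this kernel

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```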
>>
>> Nevertheless, we compared the overall throughput of HDF5 and ADIOS and
>> found significant speedups when running on several hundred GPUs/nodes,
>> dumping around 6 GB from each (per adios.period).
>>
>>
>> Best regards,
>> Axel & René
>>
>> On 06.05.2015 16:35, Anshuman Goswami wrote:
>>> Hi Folks,
>>>
>>> I ran some measurements on the ADIOSWriter plugin and wanted to check if
>>> there are some reference numbers to validate against. I could only run
>>> it on a M2090 so numbers might not agree but still wanted to get a
>>> ballpark comparison.
>>>
>>> Experiment description:
>>> * -g 128 128 128
>>> * -d 1 1 1
>>> * Single node
>>>
>>> Performance data:
>>> * Avg simulation timestep : 2.1sec
>>> * ADIOSWriter : 338sec
>>>     * Field : 1.8sec
>>>     * Species1 : 165sec
>>>         * kernel 'copySpecies' : 163.6sec
>>>     * Species2 : 165.1sec
>>>         * kernel 'copySpecies' : 163.6sec
>>>
>>> Questions:
>>> * Why is the destination buffer (deviceFrame) of 'copySpecies' alloc'd
>>> on host pinned memory and not on device memory?
>>> * Does the 163sec of execution time of 'copySpecies' for the chosen
>>> simulation size look reasonable even for an M2090?
>>> * If the source buffer (speciesTmp->getDeviceParticlesBox()) is copied
>>> to host memory and a CPU version of 'copySpecies' is run instead, would
>>> it be semantically the same?
>>> -----> To do the above, I measured the following -
>>>            speciesTmp->synchronize();
>>>            cudaDeviceSynchronize();
>>>         It comes to -
>>>              Species1 : 369 ms
>>>              Species2 : 443 ms
>>>
>>> Thanks,
>>> Anshuman
>>
>

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
          Prof. Dr.Dr.h.c. P. Joehnk
VR 1693 beim Amtsgericht Dresden
