From: Huebl, Axel <a.huebl@hzdr.de>
Subject: Re: [PIConGPU-Users] Adios performance data
Date: Thu, 07 May 2015 16:46:26 +0200
To: <picongpu-users@hzdr.de>
Hi Anshuman,


some general questions we need answered with each request:
  - what version of PIConGPU are you running
    (beta-rc6, release-0.1.0, latest dev)?
  - what system/compilers/third-party libraries are you using?

Now let's go through your questions as far as we can.

> * Why is the destination buffer (deviceFrame) of 'copySpecies' alloc'd
> on host pinned memory and not on device memory?

In the current development (dev) version of PIConGPU, we no longer use
double-buffering for particles. Instead, we now use *mapped* memory for
host-side access, e.g., for dumps.

Before that, we used *pinned* memory on the host side as double
buffers, filled via asynchronous copies, for the operations mentioned
above.
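
For illustration only, a minimal CUDA sketch of the two variants (this
is not PIConGPU code; numBytes and stream are placeholders):

  // (a) mapped ("zero-copy") memory: one page-locked allocation that both
  //     host and device can access, so no explicit copy is needed for dumps
  //     (older setups may need cudaSetDeviceFlags(cudaDeviceMapHost) first)
  float* hostBuf;
  float* deviceAlias;
  cudaHostAlloc( (void**)&hostBuf, numBytes, cudaHostAllocMapped );
  cudaHostGetDevicePointer( (void**)&deviceAlias, hostBuf, 0 );
  // kernels write through deviceAlias, the host later reads hostBuf

  // (b) pinned host memory as a double buffer: device data is staged to
  //     the host via an asynchronous copy
  float* deviceBuf;
  float* pinnedBuf;
  cudaMalloc( (void**)&deviceBuf, numBytes );
  cudaHostAlloc( (void**)&pinnedBuf, numBytes, cudaHostAllocDefault );
  cudaMemcpyAsync( pinnedBuf, deviceBuf, numBytes,
                   cudaMemcpyDeviceToHost, stream );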

> *  If the source buffer (speciesTmp->getDeviceParticlesBox()) is
> copied to host memory and a CPU version of 'copySpecies' is run
> instead, would it be same semantically?

The speciesTmp->getDeviceParticlesBox() is a kind of iterator pointing
to non-contiguous memory on the device (for each super-cell it holds
the doubly linked list of particle frames).

Due to the nested structure of device pointers in that object, a simple
copy to the host is not possible without leaving the pointers in an
undefined state (pointer values that are valid on the device do not map
to valid addresses in the host's RAM).
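
As a toy example of the problem (this is not the real ParticlesBox;
deviceSuperCell is a made-up device pointer):

  struct Frame;            // a chunk of particle attributes (details omitted)

  struct SuperCell
  {
      Frame* firstFrame;   // device pointer into a linked list of frames
      Frame* lastFrame;    // device pointer
  };

  // copying the struct to the host copies only the *pointer values*; they
  // still refer to device addresses, so dereferencing them on the host is
  // undefined -- a usable host copy would require a deep copy that
  // re-links every frame
  SuperCell hostCopy;
  cudaMemcpy( &hostCopy, deviceSuperCell, sizeof(SuperCell),
              cudaMemcpyDeviceToHost );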

> * Does the 163sec of execution time of 'copySpecies' for the chosen
> simulation size look reasonable even for an M2090?

That's pretty hard to judge from the provided information.

How many time steps did you run?
How often did you dump (adios.period)?
How many particles did you use in total?
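
Those numbers follow directly from the invocation; something along the
following lines (the step count and period here are made up, and I am
assuming the usual command-line spelling of the plugin option) would run
1000 steps and trigger 10 ADIOS dumps:

  picongpu -d 1 1 1 -g 128 128 128 -s 1000 --adios.period 100

The total time spent in copySpecies scales with the number of dumps, so
these values matter for judging the 163 sec.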

Also, the M2090 ("Fermi") generation is pretty old, so the performance
of the Fermi implementation for mapped memory may no longer be
comparable to modern hardware ("Kepler" has been out since Nov 2012).
Please use modern hardware for benchmarks; we will not optimize for
that generation (even though we still support general operation on it).

We have not benchmarked `copySpecies` in detail yet, but since we only
use it on Kepler, the "163.6 sec" sounds a bit unrealistic (or way
off). Can you compare it against the Kepler cards in your node, please?

> -----> To do the above, I measured the following -
>            speciesTmp->synchronize();
>            cudaDeviceSynchronize();
>         It comes to -
>              Species1 : 369 ms
>              Species2 : 443 ms

The
  speciesTmp->synchronize();
call has no implementation: it would usually synchronize data from
device to host, but it is not implemented for particles (due to their
nested memory structure, see above).

So basically what you measured is the time of a
  cudaDeviceSynchronize();
and *all the kernels that were still running* before that, which can be
anything.

Also: before you measure kernels, always run a
  cudaDeviceSynchronize();
otherwise you include the unpredictable load of asynchronous kernels
that you did not intend to measure ;)
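
A minimal sketch of such a measurement with CUDA events (not PIConGPU
code; myKernel, grid and block are placeholders):

  cudaEvent_t start, stop;
  cudaEventCreate( &start );
  cudaEventCreate( &stop );

  cudaDeviceSynchronize();                  // drain kernels still in flight
  cudaEventRecord( start );
  myKernel<<< grid, block >>>( /* ... */ ); // the kernel you want to measure
  cudaEventRecord( stop );
  cudaEventSynchronize( stop );             // wait until this kernel finished

  float ms = 0.0f;
  cudaEventElapsedTime( &ms, start, stop );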

Nevertheless, we did measure the overall speedup of ADIOS compared to
HDF5 and saw significant gains when running on several hundred
GPUs/nodes, dumping around 6 GB from each (per adios.period).


Best regards,
Axel & René

On 06.05.2015 16:35, Anshuman Goswami wrote:
> Hi Folks,
>
> I ran some measurements on the ADIOSWriter plugin and wanted to check if
> there are some reference numbers to validate against. I could only run
> it on a M2090 so numbers might not agree but still wanted to get a
> ballpark comparison.
>
> Experiment description:
> * -g 128 128 128
> * -d 1 1 1
> * Single node
>
> Performance data:
> * Avg simulation timestep : 2.1sec
> * ADIOSWriter : 338sec
>     * Field : 1.8sec
>     * Species1 : 165sec
>         * kernel 'copySpecies' : 163.6sec
>     * Species2 : 165.1sec
>         * kernel 'copySpecies' : 163.6sec
>
> Questions:
> * Why is the destination buffer (deviceFrame) of 'copySpecies' alloc'd
> on host pinned memory and not on device memory?
> * Does the 163sec of execution time of 'copySpecies' for the chosen
> simulation size look reasonable even for an M2090?
> *  If the source buffer (speciesTmp->getDeviceParticlesBox()) is copied
> to host memory and a CPU version of 'copySpecies' is run instead, would
> it be same semantically?
> -----> To do the above, I measured the following -
>            speciesTmp->synchronize();
>            cudaDeviceSynchronize();
>         It comes to -
>              Species1 : 369 ms
>              Species2 : 443 ms
>
> Thanks,
> Anshuman

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
          Prof. Dr.Dr.h.c. P. Joehnk
VR 1693 beim Amtsgericht Dresden
