Mailing List picongpu-users@hzdr.de Message #215
From: Khikhlukha Danila <Danila.Khikhlukha@eli-beams.eu>
Subject: RE: [PIConGPU-Users] Restart failure
Date: Thu, 16 Feb 2017 08:58:14 +0000
To: picongpu-users@hzdr.de <picongpu-users@hzdr.de>
Hi Axel,
let me answer step by step. Adding --restart-step 1024 didn't help. The latest commit in my local branch is aabd50e.

I decided to debug the issue with HDF5. Following your instructions, I tried to compile with debug flags on; however, my debug (gdb) build fails. Here is what I did:
1. Configure the makefiles as usual with this command, issued in $BUILD_DIR:
$PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29 -DPMACC_VERBOSE_LVL=7" $INCLUDE_DIR

2. Then, in $BUILD_DIR, run `ccmake .` and set these flags:
CUDA_SHOW_CODELINES = ON
PMACC_BLOCKING_KERNEL = ON
CUDA_NVCC_FLAGS_DEBUG = -g;-G (I found this flag in advanced mode. Maybe it is worth mentioning on the wiki page.)

3. Save and generate the new Makefile, launch `make`, and get an error:
ptxas error   : Entry function '<...really long function name...>' uses too much shared data (0xc00a bytes, 0xc000 max).
which puzzles me a bit. It is clear that the debug binary needs more static memory to store the debug info, but is it really that much? Or did I do something wrong?

4. If I skip step 2, `make` compiles without problems in the normal way.
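
For scale, the two sizes in the ptxas message from step 3 can be converted to decimal with plain shell arithmetic. This is only a small sketch; the variable names are made up for illustration and are not part of the build:

```shell
# Sizes copied from the ptxas error message in step 3.
used=$((0xc00a))    # shared data requested by the kernel, in bytes
limit=$((0xc000))   # maximum reported by ptxas, in bytes

# How far the kernel is over the limit.
overflow=$((used - limit))

echo "used=${used} limit=${limit} overflow=${overflow}"
```

So the limit is 49152 bytes (48 KiB), and the debug build is over it by 10 bytes.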

Best,
Danila.
________________________________________
From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
Sent: Tuesday, February 14, 2017 11:02 PM
To: picongpu-users@hzdr.de
Subject: Re:  [PIConGPU-Users] Restart failure

Hi Danila,

so far that looks normal, can you try specifying your restart step
explicitly with e.g.
  --restart-step 1024
?
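
In a tbg .cfg file like the one in the diff quoted further down, this could look roughly as follows. It is a sketch only: it reuses the checkpoint path and the TBG_restart variable from that diff, and it assumes the extra option is simply appended to the existing restart options:

```shell
# Hypothetical edit to 0008gpus.cfg: the restart options from the
# quoted diff, plus an explicit restart step (assumed placement).
TBG_restart="--restart --restart-directory /work/hhh20/hhh20z/run_0002/simOutput/checkpoints --restart-step 1024"

# Print the resulting options for a quick visual check.
echo "$TBG_restart"
```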


Which exact version of PIConGPU did you use?

We are currently not aware of any problems with restarts. If we can
narrow down the source of your segfault, that would be wonderful
(although I suspect it might be in a third-party lib). So if you are
able to set up a small 1-node example and run it under gdb as described at
  https://github.com/ComputationalRadiationPhysics/picongpu/wiki/Debugging

we might get a better understanding.

That said, you can also add ADIOS to your compile chain, which will
automatically use ADIOS .bp files for checkpointing & restarting. You
can still use HDF5 for regular output (or use ADIOS for both) in case
HDF5 misbehaves too badly and you need to move fast.


Axel

On 13.02.2017 15:47, Khikhlukha Danila wrote:
> Hi René,
> sure, please see the attachment. Please let me know if more information is needed.
>
> D.
> ________________________________________
> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of René Widera [r.widera@hzdr.de]
> Sent: Monday, February 13, 2017 3:39 PM
> To: picongpu-users@hzdr.de
> Subject: Re:  [PIConGPU-Users] [PIConGPU-Users] Restart failure
>
> Dear Danila,
>
> could you please send us the `stdout`, `stderr` and the files from the
> `tbg` folder?
>
> best,
>
> René
>
> On 02/13/2017 03:11 PM, Khikhlukha Danila wrote:
>> Dear all,
>> I am currently trying to set up PoG on the JURECA machine. It all worked
>> fine for the LWFA example; however, when I tried to restart the
>> simulation, I received a segfault almost immediately.
>> My tool chain is as follows
>>
>> GCC/5.4.0
>> CUDA/8.0.44
>> MVAPICH2/2.2-GDR
>> HDF5/1.8.17
>> Boost/1.61.0
>>
>> So, the first run didn't have any problems -- pictures, save points and
>> data dumps were created. When I tried to launch the restart, it crashed,
>> although I explicitly specified the savepoint directory.
>>
>> test$ diff -r 0002/submit/ 0002_restart/submit/
>> diff -r 0002/submit/0008gpus.cfg 0002_restart/submit/0008gpus.cfg
>> 39c39
>> < TBG_steps="-s 1024"
>> ---
>>> TBG_steps="-s 2048"
>> 41a42
>>> TBG_restart="--restart --restart-directory
>> /work/hhh20/hhh20z/run_0002/simOutput/checkpoints"
>> 67a69
>>>                    !TBG_restart      \
>>
>> I also checked that it exists and is accessible. I tried to switch on some
>> debug information with the following command:
>>
>> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29
>> -DPMACC_VERBOSE_LVL=7"
>>
>> however I didn't find any information except a standard message:
>> [jrc0007:mpi_rank_4][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>>
>> Could you please advise me whether there is another way to diagnose the
>> problem (other than launching gdb)? Maybe I'm doing something wrong?
>> However, restart used to work on other machines...
>>
>>
>> Thank you in advance,
>> Danila.
>>
>
> --
> René Widera
> Abteilung Laser-Teilchenbeschleunigung (FWKT)
> Helmholtz-Zentrum Dresden-Rossendorf
> Tel: +49 (0351) 260 3543
> r.widera@hzdr.de
> http://www.hzdr.de
>
> Vorstand: Prof. Dr. Dr. h. c. Roland Sauerbrey,
>            Prof. Dr. Dr. h. c. Peter Joehnk
> Vereinsregister: VR 1693 beim Amtsgericht Dresden
>
> #############################################################
> This message is sent to you because you are subscribed to
>   the mailing list <picongpu-users@hzdr.de>.
> To unsubscribe, E-mail to: <picongpu-users-off@hzdr.de>
> To switch to the DIGEST mode, E-mail to <picongpu-users-digest@hzdr.de>
> To switch to the INDEX mode, E-mail to <picongpu-users-index@hzdr.de>
> Send administrative queries to  <picongpu-users-request@hzdr.de>
>

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
          Prof. Dr.Dr.h.c. P. Joehnk
VR 1693 beim Amtsgericht Dresden
