Mailing List picongpu-users@hzdr.de Message #218
From: Axel Huebl <a.huebl@hzdr.de>
Subject: Re: Restart failure
Date: Tue, 28 Feb 2017 21:05:51 +0100
To: <picongpu-users@hzdr.de>
Hi Danila,

sorry, your mail got lost.

`-G` (device-side debug info) is currently not possible due to a problem in
cuSTL (a library inside PMacc that is used in some places). You won't need
it anyway; `-g` (host-side debug info) is enough.

Can you try debugging it like that?
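
As a side note on the ptxas error from step 3 quoted below: decoding the two hex values shows how far the `-G` build overshoots the limit. This is only a minimal shell sketch; the variable names `used`, `limit`, and `over` are made up for illustration, and the values are copied from the quoted error message.

```shell
# Values copied from the ptxas error message quoted below (step 3).
used=$((0xc00a))    # shared data requested by the kernel, in bytes
limit=$((0xc000))   # per-block maximum reported by ptxas, in bytes
over=$((used - limit))

# Print both values in decimal, plus the limit in KiB and the overshoot.
echo "used=${used} B, limit=${limit} B ($((limit / 1024)) KiB), over by ${over} B"
# prints: used=49162 B, limit=49152 B (48 KiB), over by 10 B
```

So the kernel exceeds the 48 KiB limit by only 10 bytes.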


Best,
Axel

On 16.02.2017 09:58, Khikhlukha Danila wrote:
> Hi Axel,
> let me answer step by step. Adding --restart-step 1024 didn't help. The latest commit in my local branch is aabd50e.
>
> I decided to debug the issue with HDF5. Following your instructions, I tried to compile with debug flags on; however, the debug build fails. Here is what I did:
> 1. Configure the makefiles as usual, with this command issued in $BUILD_DIR:
> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29 -DPMACC_VERBOSE_LVL=7" $INCLUDE_DIR
>
> 2. Then, in $BUILD_DIR, run `ccmake .` and set these flags:
> CUDA_SHOW_CODELINES = ON
> PMACC_BLOCKING_KERNEL = ON
> CUDA_NVCC_FLAGS_DEBUG = -g;-G (I found this flag in advanced mode; maybe it is worth mentioning on the wiki page)
>
> 3. Save, generate the new Makefile, run `make`, and get an error:
> ptxas error   : Entry function '<...really long function name...>' uses too much shared data (0xc00a bytes, 0xc000 max).
> This puzzles me a bit. It is clear that the debug binary needs more static memory to store the debug info, but is it really that much? Or did I do something wrong?
>
> 4. If I skip step 2, `make` compiles normally without problems.
>
> Best,
> Danila.
> ________________________________________
> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
> Sent: Tuesday, February 14, 2017 11:02 PM
> To: picongpu-users@hzdr.de
> Subject: Re:  [PIConGPU-Users] Restart failure
>
> Hi Danila,
>
> so far that looks normal, can you try specifying your restart step
> explicitly with e.g.
>   --restart-step 1024
> ?
>
>
> Which exact version of PIConGPU did you use?
>
> We are currently not aware of any problems with restarts. If we can
> narrow down the source of your segfault, that would be wonderful
> (although I suspect it might be in a third-party lib). So if you
> are able to set up a small 1-node example and hook in gdb as described at
>   https://github.com/ComputationalRadiationPhysics/picongpu/wiki/Debugging
>
> we might get a better understanding.
>
> That said, you can also add ADIOS to your compile chain, which will
> automatically use ADIOS .bp files for checkpointing & restarting. You
> can still use HDF5 for regular output (or use ADIOS for both) in case
> HDF5 misbehaves too much and you need to move fast.
>
>
> Axel
>
> On 13.02.2017 15:47, Khikhlukha Danila wrote:
>> Hi René,
>> sure, pls. see the attachment. Please let me know if more information is needed.
>>
>> D.
>> ________________________________________
>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of René Widera [r.widera@hzdr.de]
>> Sent: Monday, February 13, 2017 3:39 PM
>> To: picongpu-users@hzdr.de
>> Subject: Re:  [PIConGPU-Users] [PIConGPU-Users] Restart failure
>>
>> Dear Danila,
>>
>> could you please send us the `stdout`, `stderr` and the files from the
>> `tbg` folder?
>>
>> best,
>>
>> René
>>
>> On 02/13/2017 03:11 PM, Khikhlukha Danila wrote:
>>> Dear all,
>>> I am currently trying to set up PoG on the Jureca machine. It all worked
>>> fine for the LWFA example; however, when I tried to restart the
>>> simulation, I received a segfault almost immediately.
>>> My tool chain is as follows:
>>>
>>> GCC/5.4.0
>>> CUDA/8.0.44
>>> MVAPICH2/2.2-GDR
>>> HDF5/1.8.17
>>> Boost/1.61.0
>>>
>>> So, the first run didn't have any problems -- pictures, save points and
>>> data dumps were created. When I tried to launch the restart, it crashed,
>>> although I explicitly specified the savepoint directory.
>>>
>>> test$ diff -r 0002/submit/ 0002_restart/submit/
>>> diff -r 0002/submit/0008gpus.cfg 0002_restart/submit/0008gpus.cfg
>>> 39c39
>>> < TBG_steps="-s 1024"
>>> ---
>>>> TBG_steps="-s 2048"
>>> 41a42
>>>> TBG_restart="--restart --restart-directory
>>> /work/hhh20/hhh20z/run_0002/simOutput/checkpoints"
>>> 67a69
>>>>                    !TBG_restart      \
>>>
>>> I also checked that it exists and is accessible. I tried to switch on some
>>> debug information with the following command:
>>>
>>> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29
>>> -DPMACC_VERBOSE_LVL=7"
>>>
>>> However, I didn't find any information beyond the standard message:
>>> [jrc0007:mpi_rank_4][error_sighandler] Caught error: Segmentation fault
>>> (signal 11)
>>>
>>> Could you please advise me whether there is another way to diagnose the
>>> problem (other than launching gdb)? Maybe I'm doing something wrong?
>>> However, restart used to work on other machines...
>>>
>>>
>>> Thank you in advance,
>>> Danila.
>>>
>>
>> --
>> René Widera
>> Abteilung Laser-Teilchenbeschleunigung (FWKT)
>> Helmholtz-Zentrum Dresden-Rossendorf
>> Tel: +49 (0351) 260 3543
>> r.widera@hzdr.de
>> http://www.hzdr.de
>>
>> Vorstand: Prof. Dr. Dr. h. c. Roland Sauerbrey,
>>            Prof. Dr. Dr. h. c. Peter Joehnk
>> Vereinsregister: VR 1693 beim Amtsgericht Dresden
>>
>> #############################################################
>> This message is sent to you because you are subscribed to
>>   the mailing list <picongpu-users@hzdr.de>.
>> To unsubscribe, E-mail to: <picongpu-users-off@hzdr.de>
>> To switch to the DIGEST mode, E-mail to <picongpu-users-digest@hzdr.de>
>> To switch to the INDEX mode, E-mail to <picongpu-users-index@hzdr.de>
>> Send administrative queries to  <picongpu-users-request@hzdr.de>
>>
>
> --
>
> Axel Huebl
> Phone +49 351 260 3582
> https://www.hzdr.de/crp
> Computational Radiation Physics
> Laser Particle Acceleration Division
> Helmholtz-Zentrum Dresden - Rossendorf e.V.
>
> Bautzner Landstrasse 400, 01328 Dresden
> POB 510119, D-01314 Dresden
> Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
>           Prof. Dr.Dr.h.c. P. Joehnk
> VR 1693 beim Amtsgericht Dresden
>
>

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
          Prof. Dr.Dr.h.c. P. Joehnk
VR 1693 beim Amtsgericht Dresden