From: Axel Huebl
Organization: HZDR
To: picongpu-users@hzdr.de
Subject: Re: Restart failure
Date: Tue, 28 Feb 2017 21:05:51 +0100
Message-ID: <0c61d4ef-1d79-20ef-9372-23ac10815c69@hzdr.de>

Hi Danila,

sorry your mail got lost.

-G is currently not possible due to a problem in cuSTL (a library inside
PMacc that is used at some points). You won't need it anyway, `-g` is
enough.

Can you try debugging it like that?

Best,
Axel

On 16.02.2017 09:58, Khikhlukha Danila wrote:
> Hi Axel,
> let me please answer step by step. Adding --restart-step 1024 didn't help. The latest commit I have in my local branch is aabd50e.
>
> I decided to debug the issue with HDF5. Following your instructions I tried to compile with debug flags on, however the debug compilation fails. This is what I did:
> 1. Configure the make files as usual with this command, issued in $BUILD_DIR:
> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29 -DPMACC_VERBOSE_LVL=7" $INCLUDE_DIR
>
> 2. Then in $BUILD_DIR run `ccmake .` and set these flags:
> CUDA_SHOW_CODELINES = ON
> PMACC_BLOCKING_KERNEL = ON
> CUDA_NVCC_FLAGS_DEBUG = -g;-G (I found this flag in advanced mode. Maybe it is worth mentioning on the wiki page.)
>
> 3. Save and generate the new Makefile, launch `make` and get an error:
> ptxas error : Entry function '<...really long function name...>' uses too much shared data (0xc00a bytes, 0xc000 max).
> which puzzles me a bit. It is clear that the debug binary will need more static memory to store the debug info, but is it that large? Or did I do something wrong?
>
> 4. If I skip step 2, `make` compiles normally without problems.
>
> Best,
> Danila.
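
A quick check of the numbers in the ptxas message quoted above, as a minimal sketch. The two hexadecimal values are copied from the error text; the variable names are only for illustration. It converts both values to decimal and prints the difference.

```shell
# Values copied from the ptxas error message quoted above.
used=$((0xc00a))    # shared data requested by the kernel, in bytes
limit=$((0xc000))   # maximum reported by ptxas, in bytes

# Number of bytes by which the kernel exceeds the limit.
over=$((used - limit))

echo "used:  $used bytes"
echo "limit: $limit bytes ($((limit / 1024)) KiB)"
echo "over:  $over bytes"
```

With these values the kernel asks for 49162 bytes against a limit of 49152 bytes (48 KiB), so it is over by 10 bytes. The overshoot is small, not a large block of extra data.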
> ________________________________________
> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
> Sent: Tuesday, February 14, 2017 11:02 PM
> To: picongpu-users@hzdr.de
> Subject: Re: [PIConGPU-Users] Restart failure
>
> Hi Danila,
>
> so far that looks normal, can you try specifying your restart step
> explicitly with e.g.
>   --restart-step 1024
> ?
>
>
> Which exact version of PIConGPU did you use?
>
> We are currently not aware of any problems with restarts. If we can
> narrow down the source of your segfault this would be wonderful
> (although I suspect it might be in a third party lib), so in case
> you are able to set up a small 1-node example and attach gdb as
> described here
>   https://github.com/ComputationalRadiationPhysics/picongpu/wiki/Debugging
>
> we might get a better understanding.
>
> That said, you can also add ADIOS to your compile chain which will
> automatically use ADIOS .bp files for checkpointing & restarting and you
> can still use HDF5 for regular output (or use ADIOS for both) in case
> HDF5 behaves too nasty and you need to move fast.
>
>
> Axel
>
> On 13.02.2017 15:47, Khikhlukha Danila wrote:
>> Hi René,
>> sure, pls. see the attachment. Please let me know if more information is needed.
>>
>> D.
>> ________________________________________
>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of René Widera [r.widera@hzdr.de]
>> Sent: Monday, February 13, 2017 3:39 PM
>> To: picongpu-users@hzdr.de
>> Subject: Re: [PIConGPU-Users] [PIConGPU-Users] Restart failure
>>
>> Dear Danila,
>>
>> could you please send us the `stdout`, `stderr` and the files from the
>> `tbg` folder?
>>
>> best,
>>
>> René
>>
>> On 02/13/2017 03:11 PM, Khikhlukha Danila wrote:
>>> Dear all,
>>> currently I am trying to set up PoG on the Jureca machine. It all worked
>>> fine for the LWFA example, however when I tried to restart the
>>> simulation I received a segfault almost immediately.
>>> My tool chain is as follows:
>>>
>>> GCC/5.4.0
>>> CUDA/8.0.44
>>> MVAPICH2/2.2-GDR
>>> HDF5/1.8.17
>>> Boost/1.61.0
>>>
>>> So, the first run didn't have any problems -- pictures, save points and
>>> data dumps were created. When I try to launch the restart it crashes
>>> although I explicitly specify the savepoint directory.
>>>
>>> test$ diff -r 0002/submit/ 0002_restart/submit/
>>> diff -r 0002/submit/0008gpus.cfg 0002_restart/submit/0008gpus.cfg
>>> 39c39
>>> < TBG_steps="-s 1024"
>>> ---
>>> > TBG_steps="-s 2048"
>>> 41a42
>>> > TBG_restart="--restart --restart-directory /work/hhh20/hhh20z/run_0002/simOutput/checkpoints"
>>> 67a69
>>> > !TBG_restart \
>>>
>>> I also checked that it exists and is accessible. I tried to switch on some
>>> debug information with the following command:
>>>
>>> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29 -DPMACC_VERBOSE_LVL=7"
>>>
>>> however I didn't find any information except a standard message:
>>> [jrc0007:mpi_rank_4][error_sighandler] Caught error: Segmentation fault (signal 11)
>>>
>>> Could you please advise me if there is another way to diagnose the
>>> problem (except launching gdb)? Maybe I'm doing something wrong?
>>> However, restart used to work on other machines...
>>>
>>>
>>> Thank you in advance,
>>> Danila.
>>>
>>
>> --
>> René Widera
>> Laser Particle Acceleration Division (FWKT)
>> Helmholtz-Zentrum Dresden-Rossendorf
>> Tel: +49 (0351) 260 3543
>> r.widera@hzdr.de
>> http://www.hzdr.de
>>
>> Board of Directors: Prof. Dr. Dr. h. c. Roland Sauerbrey,
>>                     Prof. Dr. Dr. h. c. Peter Joehnk
>> Register of Associations: VR 1693, Amtsgericht Dresden (Dresden District Court)
>>
>> #############################################################
>> This message is sent to you because you are subscribed to
>> the mailing list .
>> To unsubscribe, E-mail to:
>> To switch to the DIGEST mode, E-mail to
>> To switch to the INDEX mode, E-mail to
>> Send administrative queries to
>>
>
> --
>
> Axel Huebl
> Phone +49 351 260 3582
> https://www.hzdr.de/crp
> Computational Radiation Physics
> Laser Particle Acceleration Division
> Helmholtz-Zentrum Dresden - Rossendorf e.V.
>
> Bautzner Landstrasse 400, 01328 Dresden
> POB 510119, D-01314 Dresden
> Board of Directors: Prof. Dr.Dr.h.c. R. Sauerbrey
>                     Prof. Dr.Dr.h.c. P. Joehnk
> VR 1693, Amtsgericht Dresden (Dresden District Court)
>

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Board of Directors: Prof. Dr.Dr.h.c. R. Sauerbrey
                    Prof. Dr.Dr.h.c. P. Joehnk
VR 1693, Amtsgericht Dresden (Dresden District Court)
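
A minimal sketch of the debug configuration discussed in this thread, using `-g` only. It assumes that the options Danila set interactively in `ccmake` can also be passed as `-D` definitions through the `-c` option of `configure`; that is an assumption and is not confirmed anywhere in the thread. The variable names `debug_flags` and `cmake_opts` are illustrative. The command is only printed for inspection, not executed.

```shell
# Host-side debug info only; the device-side -G flag is left out,
# as advised at the top of this thread.
debug_flags="-g"

# Verbosity options as used in the thread.
cmake_opts="-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29 -DPMACC_VERBOSE_LVL=7"

# Options that were set via ccmake in the thread; passing them with -D
# instead is an assumption.
cmake_opts="$cmake_opts -DCUDA_SHOW_CODELINES=ON -DPMACC_BLOCKING_KERNEL=ON"
cmake_opts="$cmake_opts -DCUDA_NVCC_FLAGS_DEBUG=$debug_flags"

# Print the resulting command; it would be run from $BUILD_DIR.
printf '%s\n' "\$PICSRC/configure -c\"$cmake_opts\" \$INCLUDE_DIR"
```

Keeping the debug flags in a single variable makes it easy to compare the `-g` build with the normal build without touching the other options.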