From: Khikhlukha Danila
To: "picongpu-users@hzdr.de"
Subject: RE: [PIConGPU-Users] Restart failure
Date: Thu, 16 Feb 2017 08:58:14 +0000

Hi Axel,
let me please answer step by step. Adding --restart-step 1024 didn't help.
The latest commit I have in my local branch is aabd50e.

I decided to debug the issue with HDF5. Following your instructions I tried to compile with debug flags on, however my debug (gdb) build fails. Here is what I did:
1. Configure the makefiles as usual, with the command issued in $BUILD_DIR:
   $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29 -DPMACC_VERBOSE_LVL=7" $INCLUDE_DIR

2. Then, in $BUILD_DIR, run `ccmake .` and set the flags:
   CUDA_SHOW_CODELINES = ON
   PMACC_BLOCKING_KERNEL = ON
   CUDA_NVCC_FLAGS_DEBUG = -g;-G (I found this flag in advanced mode. Maybe it is worth mentioning on the wiki page.)

3. Save and generate the new Makefile, launch `make`, and get an error:
   ptxas error : Entry function '<...really long function name...>' uses too much shared data (0xc00a bytes, 0xc000 max).
   which puzzles me a bit. It is clear that the debug binary will need more static memory to store the debug info, but is it that large? Or did I do something wrong?

4. If I skip step 2, `make` compiles in the normal way without problems.

Best,
Danila.
________________________________________
From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
Sent: Tuesday, February 14, 2017 11:02 PM
To: picongpu-users@hzdr.de
Subject: Re: [PIConGPU-Users] Restart failure

Hi Danila,

so far that looks normal, can you try specifying your restart step
explicitly with e.g.
  --restart-step 1024
?


Which exact version of PIConGPU did you use?

We are currently not aware of known problems with restarts.
If we can
narrow down the source of your segfault this would be wonderful
(although I suspect it might be in a third-party lib), so in case
you are able to do a small 1-node example and can attach gdb there
  https://github.com/ComputationalRadiationPhysics/picongpu/wiki/Debugging

we might get a better understanding.

That said, you can also add ADIOS to your compile chain, which will
automatically use ADIOS .bp files for checkpointing & restarting, and you
can still use HDF5 for regular output (or use ADIOS for both) in case
HDF5 misbehaves too badly and you need to move fast.


Axel

On 13.02.2017 15:47, Khikhlukha Danila wrote:
> Hi René,
> sure, pls. see the attachment. Please let me know if more information is needed.
>
> D.
> ________________________________________
> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of René Widera [r.widera@hzdr.de]
> Sent: Monday, February 13, 2017 3:39 PM
> To: picongpu-users@hzdr.de
> Subject: Re: [PIConGPU-Users] [PIConGPU-Users] Restart failure
>
> Dear Danila,
>
> could you please send us the `stdout`, `stderr` and the files from the
> `tbg` folder?
>
> best,
>
> René
>
> On 02/13/2017 03:11 PM, Khikhlukha Danila wrote:
>> Dear all,
>> I was recently trying to set up PoG on the Jureca machine. It all worked
>> fine for the LWFA example, however when I tried to restart the
>> simulation I received a segfault almost immediately.
>> My toolchain is as follows:
>>
>> GCC/5.4.0
>> CUDA/8.0.44
>> MVAPICH2/2.2-GDR
>> HDF5/1.8.17
>> Boost/1.61.0
>>
>> So, the first run didn't have any problems -- pictures, save points and
>> data dumps were created.
>> When I tried to launch the restart it crashed,
>> although I explicitly specified the savepoint directory.
>>
>> test$ diff -r 0002/submit/ 0002_restart/submit/
>> diff -r 0002/submit/0008gpus.cfg 0002_restart/submit/0008gpus.cfg
>> 39c39
>> < TBG_steps="-s 1024"
>> ---
>> > TBG_steps="-s 2048"
>> 41a42
>> > TBG_restart="--restart --restart-directory /work/hhh20/hhh20z/run_0002/simOutput/checkpoints"
>> 67a69
>> > !TBG_restart \
>>
>> I also checked that it exists and is accessible. I tried to switch on some
>> debug information with the following command:
>>
>> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29 -DPMACC_VERBOSE_LVL=7"
>>
>> however I didn't find any information except a standard message:
>> [jrc0007:mpi_rank_4][error_sighandler] Caught error: Segmentation fault (signal 11)
>>
>> Could you please advise me if there is another way to diagnose the
>> problem (other than launching gdb)? Maybe I'm doing something wrong?
>> However, restart used to work on other machines...
>>
>>
>> Thank you in advance,
>> Danila.
>>
>
> --
> René Widera
> Abteilung Laser-Teilchenbeschleunigung (FWKT)
> Helmholtz-Zentrum Dresden-Rossendorf
> Tel: +49 (0351) 260 3543
> r.widera@hzdr.de
> http://www.hzdr.de
>
> Vorstand: Prof. Dr. Dr. h. c. Roland Sauerbrey,
> Prof. Dr. Dr. h. c.
> Peter Joehnk
> Vereinsregister: VR 1693 beim Amtsgericht Dresden
>
> #############################################################
> This message is sent to you because you are subscribed to
> the mailing list .
> To unsubscribe, E-mail to:
> To switch to the DIGEST mode, E-mail to
> To switch to the INDEX mode, E-mail to
> Send administrative queries to
>

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
          Prof. Dr.Dr.h.c. P. Joehnk
VR 1693 beim Amtsgericht Dresden
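
A side note on the ptxas error quoted in step 3 of the top message: converting the two hexadecimal figures to decimal shows how far the debug kernel overshoots the reported limit. The snippet below is only a small arithmetic sketch in Python; the variable names are made up for illustration and nothing in it is part of PIConGPU or the CUDA toolchain.

```python
# Values copied from the ptxas message quoted in step 3:
#   "uses too much shared data (0xc00a bytes, 0xc000 max)"
used_bytes = 0xC00A   # shared data requested by the debug kernel
limit_bytes = 0xC000  # maximum reported by ptxas

# Illustrative names only; this merely converts hex to decimal
# and takes the difference.
overshoot = used_bytes - limit_bytes

print(f"used : {used_bytes} bytes")
print(f"limit: {limit_bytes} bytes ({limit_bytes // 1024} KiB)")
print(f"over : {overshoot} bytes")
```

In decimal, the kernel asks for 49162 bytes against a limit of 49152 bytes (48 KiB), so the debug build exceeds the limit by 10 bytes rather than by a large margin.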