From: Khikhlukha Danila
To: "picongpu-users@hzdr.de"
Subject: RE: [PIConGPU-Users] Restart failure
Date: Mon, 27 Mar 2017 13:27:31 +0000

Hi Axel,
so I tried to launch a restart on a single device under gdb.
The segmentation fault seems to come from the checkpoint reader (see the log below). If needed, I can provide a more detailed log. Could you please advise what I should check next?

srun gdb -ex r -ex tb --args bin/picongpu -d 1 1 1 -g 128 1024 128 -s 2048 --restart --restart-directory /work/hhh20/hhh20z/run_0002_s/simOutput/checkpoints --restart-step 1024 --e_png.period 32 --e_png.axis yx --e_png.slicePoint 0.5 --e_png.folder pngElectronsYX --e_png.period 32 --e_png.axis yz --e_png.slicePoint 0.5 --e_png.folder pngElectronsYZ --hdf5.period 128 --hdf5.file simData --checkpoints 512
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
......
Program received signal SIGSEGV, Segmentation fault.
0x0000000000a545b6 in splash::ParallelDataCollector::readDataSetMeta(int, int, char const*, splash::Dimensions, splash::Dimensions, splash::Dimensions, splash::Dimensions&, unsigned int&) ()
Temporary breakpoint 1 at 0xa545b6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 glibc-2.17-106.el7_2.8.x86_64 libibverbs-1.2.1mlnx1-3.OFED.4.0.0.1.3.40101.el7.centos.x86_64 libicu-50.1.2-15.el7.x86_64 libnl-1.1.4-3.el7.x86_64 libpng-1.5.13-7.el7_2.x86_64

bt
(gdb) #0 0x0000000000a545b6 in splash::ParallelDataCollector::readDataSetMeta(int, int, char const*, splash::Dimensions, splash::Dimensions, splash::Dimensions, splash::Dimensions&, unsigned int&) ()
#1 0x0000000000a54d1a in splash::ParallelDataCollector::readMeta(int, char const*, splash::Dimensions, splash::Dimensions, splash::Dimensions&) ()
#2 0x00000000007a3254 in picongpu::hdf5::openPMD::PatchReader::checkSpatialTypeSize (this=0x7ffffffef03f, dc=0x17e1fa8, availableRanks=1, id=1024,
    particlePatchPathComponent=...)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/patchReader.cpp:49
#3 0x00000000007a33ed in picongpu::hdf5::openPMD::PatchReader::readPatchAttribute (this=0x7ffffffef03f, dc=0x17e1fa8, availableRanks=1, id=1024,
    particlePatchPathComponent=..., dest=0x1bf1850)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/patchReader.cpp:76
#4 0x00000000007a3665 in picongpu::hdf5::openPMD::PatchReader::operator() (
    this=0x7ffffffef03f, dc=0x17e1fa8, availableRanks=1, dimensionality=3,
    id=1024, particlePatchPath=...)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/patchReader.cpp:102
#5 0x0000000000857c7b in picongpu::hdf5::LoadSpecies, PMacc::math::CT::Vector, mpl_::integral_c, mpl_::integral_c >, boost::mpl::v_item, boost::mpl::vector0, 0>, 0>, 0>, boost::mpl::vector, picongpu::placeholder_definition27::shape, picongpu::placeholder_definition32::interpolation, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition33::current, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition36::massRatio, picongpu::placeholder_definition37::chargeRatio, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, PMacc::HandleGuardRegion, boost::mpl::vector0, boost::mpl::vector0 > > >::operator() (this=0x7fffffff0bca,
    params=0x10d6d98, restartChunkSize=1000000)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/plugins/hdf5/restart/LoadSpecies.hpp:126
#6 0x00000000007f8d41 in operator() (t1=@0x10d6ee0: 1000000, t0=@0x7fffffff0530: 0x10d6d98, this=<optimized out>)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/algorithms/ForEach.hpp:158
#7 operator() (
    t1=@0x10d6ee0: 1000000, t0=, this=)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/algorithms/ForEach.hpp:239
#8 picongpu::hdf5::HDF5Writer::restart
    (this=0x10d6d80, restartStep=1024,
    restartDirectory=...)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp:226
#9 0x00000000007dedb8 in PMacc::PluginConnector::restartPlugins (
    this=0x10ac700,
    restartStep=1024, restartDirectory=...)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/pluginSystem/PluginConnector.hpp:181
#10 0x00000000007e503d in picongpu::InitialiserController::restart (
    this=0x10d6010, restartStep=1024, restartDirectory=...)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/initialization/InitialiserController.hpp:84
#11 0x000000000080ab73 in picongpu::MySimulation::fillSimulation (
    this=0x10d5e20)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/simulationControl/MySimulation.hpp:413
#12 0x000000000084d5f7 in PMacc::SimulationHelper<3u>::startSimulation (
    this=0x10d5e20)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/simulationControl/SimulationHelper.hpp:223
#13 0x00000000008315b8 in picongpu::SimulationStarter::start (
    this=0x7fffffff1930)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/simulationControl/SimulationStarter.hpp:83
#14 0x00000000007cc249 in main (argc=38, argv=0x7fffffff1ad8)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/main.cu:10



________________________________________
From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
Sent: Tuesday, February 28, 2017 9:05 PM
To: picongpu-users@hzdr.de
Subject: Re: [PIConGPU-Users] Restart failure

Hi Danila,

sorry your mail got lost.

-G is currently not possible due to a problem in cuSTL (a library inside
PMacc that is used at some points).
You won't need it anyway, `-g` is
enough.

Can you try debugging it like that?


Best,
Axel

On 16.02.2017 09:58, Khikhlukha Danila wrote:
> Hi Axel,
> let me answer step by step. Adding --restart-step 1024 didn't help. The latest commit in my local branch is aabd50e.
>
> I decided to debug the issue with HDF5. Following your instructions, I tried to compile with the debug flags on, but the debug build fails. Here is what I did:
> 1. Configure the makefiles as usual, with this command issued in $BUILD_DIR:
> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29 -DPMACC_VERBOSE_LVL=7" $INCLUDE_DIR
>
> 2. Then, in $BUILD_DIR, run `ccmake .` and set these flags:
> CUDA_SHOW_CODELINES = ON
> PMACC_BLOCKING_KERNEL = ON
> CUDA_NVCC_FLAGS_DEBUG = -g;-G (I found this flag in advanced mode. Maybe it is worth mentioning on the wiki page.)
>
> 3. Save and generate the new Makefile, launch `make`, and get an error:
> ptxas error : Entry function '<...really long function name...>' uses too much shared data (0xc00a bytes, 0xc000 max).
> which puzzles me a bit. It is clear that the debug binary needs more static memory to store the debug info, but is it that large? Or did I do something wrong?
>
> 4. If I skip step 2, `make` compiles normally without problems.
>
> Best,
> Danila.
> ________________________________________
> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
> Sent: Tuesday, February 14, 2017 11:02 PM
> To: picongpu-users@hzdr.de
> Subject: Re: [PIConGPU-Users] Restart failure
>
> Hi Danila,
>
> so far that looks normal, can you try specifying your restart step
> explicitly with e.g.
> --restart-step 1024
> ?
>
>
> Which exact version of PIConGPU did you use?
>
> We are currently not aware of known problems with restarts. If we can
> narrow down the source of your segfault this would be wonderful
> (although I suspect it might be in a third party lib), so in case that
> you are able to do a small 1 node example and can hang in gdb there
> https://github.com/ComputationalRadiationPhysics/picongpu/wiki/Debugging
>
> we might get a better understanding.
>
> That said, you can also add ADIOS to your compile chain which will
> automatically use ADIOS .bp files for checkpointing & restarting and you
> can still use HDF5 for regular output (or use ADIOS for both) in case
> HDF5 behaves too nasty and you need to move fast.
>
>
> Axel
>
> On 13.02.2017 15:47, Khikhlukha Danila wrote:
>> Hi René,
>> sure, pls. see the attachment. Please let me know if more information is needed.
>>
>> D.
>> ________________________________________
>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of René Widera [r.widera@hzdr.de]
>> Sent: Monday, February 13, 2017 3:39 PM
>> To: picongpu-users@hzdr.de
>> Subject: Re: [PIConGPU-Users] [PIConGPU-Users] Restart failure
>>
>> Dear Danila,
>>
>> could you please send us the `stdout`, `stderr` and the files from the
>> `tbg` folder?
>>
>> best,
>>
>> René
>>
>> On 02/13/2017 03:11 PM, Khikhlukha Danila wrote:
>>> Dear all,
>>> I have been trying to set up PoG on the Jureca machine. It all worked
>>> fine for the LWFA example, however when I tried to restart the
>>> simulation I received a segfault almost immediately.
>>> My tool chain is as follows
>>>
>>> GCC/5.4.0
>>> CUDA/8.0.44
>>> MVAPICH2/2.2-GDR
>>> HDF5/1.8.17
>>> Boost/1.61.0
>>>
>>> So, the first run didn't have any problems -- pictures, save points and
>>> data dumps were created. When I tried to launch the restart it crashed,
>>> although I explicitly specified the savepoint directory.
>>>
>>> test$ diff -r 0002/submit/ 0002_restart/submit/
>>> diff -r 0002/submit/0008gpus.cfg 0002_restart/submit/0008gpus.cfg
>>> 39c39
>>> < TBG_steps="-s 1024"
>>> ---
>>> > TBG_steps="-s 2048"
>>> 41a42
>>> > TBG_restart="--restart --restart-directory
>>> /work/hhh20/hhh20z/run_0002/simOutput/checkpoints"
>>> 67a69
>>> > !TBG_restart \
>>>
>>> I also checked that the directory exists and is accessible. I tried to switch on some
>>> debug information, with the following command:
>>>
>>> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29
>>> -DPMACC_VERBOSE_LVL=7"
>>>
>>> however I didn't find any information except a standard message:
>>> [jrc0007:mpi_rank_4][error_sighandler] Caught error: Segmentation fault
>>> (signal 11)
>>>
>>> Could you please advise me whether there is another way to diagnose the
>>> problem (other than launching gdb)? Maybe I'm doing something wrong?
>>> However, restart used to work on other machines...
>>>
>>>
>>> Thank you in advance,
>>> Danila.
>>>
>>
>> --
>> René Widera
>> Abteilung Laser-Teilchenbeschleunigung (FWKT)
>> Helmholtz-Zentrum Dresden-Rossendorf
>> Tel: +49 (0351) 260 3543
>> r.widera@hzdr.de
>> http://www.hzdr.de
>>
>> Vorstand: Prof. Dr. Dr. h. c. Roland Sauerbrey,
>> Prof. Dr. Dr. h. c. Peter Joehnk
>> Vereinsregister: VR 1693 beim Amtsgericht Dresden
>>
>> #############################################################
>> This message is sent to you because you are subscribed to
>> the mailing list .
>> To unsubscribe, E-mail to:
>> To switch to the DIGEST mode, E-mail to
>> To switch to the INDEX mode, E-mail to
>> Send administrative queries to
>>
>
> --
>
> Axel Huebl
> Phone +49 351 260 3582
> https://www.hzdr.de/crp
> Computational Radiation Physics
> Laser Particle Acceleration Division
> Helmholtz-Zentrum Dresden - Rossendorf e.V.
>
> Bautzner Landstrasse 400, 01328 Dresden
> POB 510119, D-01314 Dresden
> Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
> Prof. Dr.Dr.h.c. P. Joehnk
> VR 1693 beim Amtsgericht Dresden
>
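
A side note on the ptxas error quoted in the thread ("uses too much shared data (0xc00a bytes, 0xc000 max)"): converting the two hexadecimal values to decimal shows how far the debug build is over the limit. The sketch below only does that arithmetic; the variable names are illustrative and the two numbers are copied from the error message, nothing else is assumed about the build.

```python
# Values copied from the ptxas error message quoted in the thread.
used = 0xC00A   # shared data requested by the kernel, in bytes
limit = 0xC000  # maximum reported by ptxas, in bytes

# How far the -G debug build exceeds the limit.
overshoot = used - limit

print(f"used  = {used} bytes")
print(f"limit = {limit} bytes ({limit // 1024} KiB)")
print(f"over the limit by {overshoot} bytes")
```

So the kernel is only slightly over the reported maximum rather than far beyond it, which fits the observation in the thread that the build succeeds once `-G` is left out.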