Return-Path:
Received: from mx2.fz-rossendorf.de ([149.220.142.12] verified) by hzdr.de (CommuniGate Pro SMTP 6.1.16) with ESMTP id 16562910 for picongpu-users@cg.hzdr.de; Mon, 05 Jun 2017 13:32:03 +0200
Received: from localhost (localhost [127.0.0.1]) by mx2.fz-rossendorf.de (Postfix) with ESMTP id 92EEB4117A for ; Mon, 5 Jun 2017 13:32:03 +0200 (CEST)
X-Virus-Scanned: Debian amavisd-new at mx2.fz-rossendorf.de
Received: from mx2.fz-rossendorf.de ([127.0.0.1]) by localhost (mx2.fz-rossendorf.de [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id NAfaQDWUjGRv for ; Mon, 5 Jun 2017 13:31:57 +0200 (CEST)
Received-SPF: Pass (sender SPF authorized) identity=mailfrom; client-ip=147.231.234.10; helo=mailgw.eli-beams.eu; envelope-from=prvs=1329da72ef=danila.khikhlukha@eli-beams.eu; receiver=picongpu-users@hzdr.de
Received: from mailgw.eli-beams.eu (mailgw.eli-beams.eu [147.231.234.10]) by mx2.fz-rossendorf.de (Postfix) with ESMTPS id 63DAF40A9A for ; Mon, 5 Jun 2017 13:31:57 +0200 (CEST)
Received: from mail.eli-beams.eu ([10.1.5.17]) by mailgw.eli-beams.eu with ESMTP id v55BVXqK028513-v55BVXqM028513 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=CAFAIL) for ; Mon, 5 Jun 2017 13:31:33 +0200
Received: from BRAUN.eli-beams.eu ([::1]) by braun.eli-beams.eu ([::1]) with mapi id 14.03.0319.002; Mon, 5 Jun 2017 13:31:33 +0200
From: Khikhlukha Danila
To: "picongpu-users@hzdr.de"
Subject: RE: [PIConGPU-Users] [PIConGPU-Users] Restart failure
Thread-Topic: [PIConGPU-Users] [PIConGPU-Users] Restart failure
Thread-Index: AQHS2rgHbartddxUTEucJBpyIxlFK6IWJXLR
Date: Mon, 5 Jun 2017 11:31:32 +0000
Message-ID:
References:
In-Reply-To:
Accept-Language: en-US, cs-CZ
Content-Language: en-US
X-MS-Has-Attach: yes
X-MS-TNEF-Correlator:
x-originating-ip: [10.36.30.5]
Content-Type: multipart/mixed; boundary="_003_BA7C853FEE430847B9C35FFCC6E5B2A555B64328braunelibeamseu_"
MIME-Version: 1.0

--_003_BA7C853FEE430847B9C35FFCC6E5B2A555B64328braunelibeamseu_
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Hi Axel,
unfortunately, this didn't work either. Also, there was definitely no code modification between writing and reading the simulation.
1. With the latest versions of pngwriter (8bae057) and libSplash (4aa0c03), I tried PoG v.0.3.0 (d634052).
2. I followed the procedure I described in my private email to you.
3. You can check the difference between the write simulation and the standard LWFA example in the attached file diff_log_1:
> $PICSRC/pic-create $PICSRC/examples/LaserWakefield/ lwfa_ref && diff -r lwfa_ref 0000_1 >& diff_log_1
The diff between the read and write simulations is in the file diff_log_2:
> diff --exclude b* -r 0000_1 0000_2 >& diff_log_2

4. The problem is still there: the read simulation fails with a segfault at the very beginning. Frankly, I'm quite puzzled. Do you think it is reasonable to try ADIOS instead of libSplash?
D.
________________________________________
From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
Sent: Thursday, June 01, 2017 11:17 AM
To: picongpu-users@hzdr.de
Subject: Re: [PIConGPU-Users] [PIConGPU-Users] Restart failure

Hi,

another reason could have been that you updated PIConGPU's source
between writing and restarting the simulation.

If you don't mind, can you repeat the simulation and restart with the
`release-0.3.0` branch? This is the nearly-finished next stable release
and we just had a similar case in-house that suffered a similar segfault
issue during restart due to a code-base change.


Axel

On 24.05.2017 11:13, Axel Huebl wrote:
> Hi Danila,
>
>
>> I hope I don't use any specific particle attributes, as I'm running
>> a slightly modified LWFA example.
>
> one last check, can you send me the diff you did for the LWFA example? A
> private mail if you don't want to share your setup publicly would be
> fine, too. Maybe it's really just a particle attribute that has been
> forgotten to be handled in the restart I/O.
>
>
> Otherwise, looking remotely at your problem I am kind of running out of
> ideas without having a look at the system directly.
>
> I can't promise anything, but if there is a way for us to create a
> regular user account on your cluster one of us might be able to set up
> an environment and debug it in more depth.
>
> From here, it looks very much like an environment problem that I can't
> reproduce on our clusters right now.
>
>
> Best regards,
> Axel
>
> On 15.05.2017 15:32, Khikhlukha Danila wrote:
>> Hi Axel,
>> after some time I came back to this issue. I have updated the whole tool-chain and my PoG installation. However, unfortunately, the problem is still there...
>>
>> To answer your question, I added the following line to my *.tpl file:
>> > ldd !TBG_dstPath/picongpu/bin/picongpu
>> just before calling picongpu. I can see nothing suspicious at runtime -- all libraries seem to be consistent with the ones I used for the pngwriter/libSplash compilation...
>>
>> I hope I don't use any specific particle attributes, as I'm running a slightly modified LWFA example.
>> I'm afraid I have currently run out of ideas about what might be wrong here. Could you advise me what else I can check?
>>
>> Best regards,
>> Danila.
>>
>> ________________________________________
>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
>> Sent: Wednesday, April 05, 2017 10:08 AM
>> To: picongpu-users@hzdr.de
>> Subject: Re: [PIConGPU-Users] Restart failure
>>
>> Oh, the last output helps indeed!
>>
>> Is it possible that you just linked an older libhdf5 or libsplash vs.
>> the one you use at runtime as with pngwriter?
>>
>> Otherwise we need to check the file you are restarting from, maybe you
>> used a specific particle attribute we did not expect.
>>
>>
>> Axel
>>
>> On 28.03.2017 17:45, Khikhlukha Danila wrote:
>>> Just a small update on this topic:
>>> 1. I forced cmake to use only one version of libpng by tweaking the CMakeLists.txt file of libpngwriter. The compilation issue is gone.
>>>
>>> 2. I compiled libsplash in hdf5-verbose mode. Please find attached the restart log with libsplash + hdf5 debug output.
>>>
>>> D.
>>> ________________________________________
>>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Khikhlukha Danila
>>> Sent: Tuesday, March 28, 2017 3:09 PM
>>> To: picongpu-users@hzdr.de
>>> Subject: Re: [PIConGPU-Users] [PIConGPU-Users] Restart failure
>>>
>>> Sorry for the confusion. Please find the right stderr file attached.
>>>
>>> Answer to the "not important" part:
>>> the pngwriter compilation error is caused by the fact that cmake got confused between the two available versions of libpng: the system's native 1.5, and 1.6, which is available after the `module load` call.
>>> As soon as I force cmake to use a specific version of libpng, I will report whether the problem is still there...
>>>
>>> Cheers,
>>> D.
>>> ________________________________________
>>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
>>> Sent: Tuesday, March 28, 2017 3:01 PM
>>> To: picongpu-users@hzdr.de
>>> Subject: Re: [PIConGPU-Users] Restart failure
>>>
>>> Thanks!
>>>
>>> I don't see any SPLASH_VERBOSE output in the stderr file you attached.
>>> Is it the right one?
>>>
>>> How did you see that the other calls to libSplash succeeded?
>>> Is the file it tries to read in good shape and readable with
>>> HDFView/HDFCompass or h5py?
>>> The particle patches are stored in
>>> /data//particles//particlePatches/*
>>>
>>> Not important for now:
>>> Are you using pngwriter 0.5.6 or `dev`? That call in libpng should be
>>> fixed in all recent releases, otherwise just report g++ version, OS,
>>> pngwriter version, libpng version and the exact compile error in
>>> https://github.com/pngwriter/pngwriter/releases
>>>
>>> and we can have a look.
>>>
>>>
>>> Cheers,
>>> Axel
>>>
>>> On 28.03.2017 14:49, Khikhlukha Danila wrote:
>>>> So, I checked the picongpu binary and libsplash.so with ldd and found no errors there. I also checked my PATH and LD path and found them OK.
>>>> I also recompiled libsplash just to be on the safe side.
>>>> Unfortunately, this didn't help -- the restart is still failing with the same error.
>>>> The output with the SPLASH_VERBOSE env. variable switched on just confirmed the previous finding: the libsplash reader fails when trying to read metadata from the checkpoint (see the attachment). It's interesting that libsplash didn't have any problem reading the actual data...
>>>> I guess I will try to compile libSplash in verbose mode. Maybe the HDF5 log will give more insight into what is happening there...
>>>>
>>>> BTW, I did have some problems trying to recompile pngwriter. I receive a compile error connected to the png_convert_to_rfc1123_buffer function. So I just left it as it is, thinking that the problem is hardly connected to it. After all, the pictures are created with no problems.
>>>>
>>>> Cheers,
>>>> D.
>>>> ________________________________________
>>>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
>>>> Sent: Monday, March 27, 2017 8:39 PM
>>>> To: picongpu-users@hzdr.de
>>>> Subject: Re: [PIConGPU-Users] Restart failure
>>>>
>>>> Very weird, the error originates from this [1] libSplash read call.
>>>>
>>>> Are you sure you are running with the same libSplash and HDF5 as you
>>>> compiled against (ldd bin/picongpu)? Is your environment
>>>> (LD_LIBRARY_PATH) in good shape? Did the compiler change?
>>>>
>>>> It might be that the system updated your HDF5 module, MPI or something
>>>> along the line and you might want to rebuild libSplash? Are you
>>>> restarting with the same device configuration (-d)?
>>>>
>>>> Please redo the same as you did but enable more libSplash verbosity via:
>>>>   export SPLASH_VERBOSE=3
>>>> (needs no recompile).
>>>>
>>>> You can even go one step further and compile libSplash with
>>>>   -DDEBUG_VERBOSE=ON
>>>> which also adds (some) HDF5 RT checks.
>>>>
>>>> Be aware we currently do not support re-partitioning a job during
>>>> restart (e.g.
>>>> start with -d 2 2 2, resume with -d 2 2 2).
>>>>
>>>> Best,
>>>> Axel
>>>>
>>>> [1]
>>>> https://github.com/ComputationalRadiationPhysics/picongpu/blob/0.2.4/src/picongpu/patchReader.cpp#L44-L49
>>>>
>>>> On 27.03.2017 15:27, Khikhlukha Danila wrote:
>>>>> Hi Axel,
>>>>> so I tried to launch a restart on a single device under gdb. The segmentation fault is coming from the checkpoint reader, I guess (see log below)... If needed, I can provide a more detailed log... Could you please advise me what I can check next?
>>>>>
>>>>> srun gdb -ex r -ex tb --args bin/picongpu -d 1 1 1 -g 128 1024 128 -s 2048 --restart --restart-directory /work/hhh20/hhh20z/run_0002_s/simOutput/checkpoints --restart-step 1024 --e_png.period 32 --e_png.axis yx --e_png.slicePoint 0.5 --e_png.folder pngElectronsYX --e_png.period 32 --e_png.axis yz --e_png.slicePoint 0.5 --e_png.folder pngElectronsYZ --hdf5.period 128 --hdf5.file simData --checkpoints 512
>>>>> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
>>>>> Copyright (C) 2013 Free Software Foundation, Inc.
>>>>> License GPLv3+: GNU GPL version 3 or later
>>>>> ......
>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>> 0x0000000000a545b6 in splash::ParallelDataCollector::readDataSetMeta(int, int, char const*, splash::Dimensions, splash::Dimensions, splash::Dimensions, splash::Dimensions&, unsigned int&) ()
>>>>> Temporary breakpoint 1 at 0xa545b6
>>>>> Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 glibc-2.17-106.el7_2.8.x86_64 libibverbs-1.2.1mlnx1-3.OFED.4.0.0.1.3.40101.el7.centos.x86_64 libicu-50.1.2-15.el7.x86_64 libnl-1.1.4-3.el7.x86_64 libpng-1.5.13-7.el7_2.x86_64
>>>>>
>>>>> bt
>>>>> (gdb) #0 0x0000000000a545b6 in splash::ParallelDataCollector::readDataSetMeta(int, int, char const*, splash::Dimensions, splash::Dimensions, splash::Dimensions, splash::Dimensions&, unsigned int&) ()
>>>>> #1 0x0000000000a54d1a in splash::ParallelDataCollector::readMeta(int, char const*, splash::Dimensions, splash::Dimensions, splash::Dimensions&) ()
>>>>> #2 0x00000000007a3254 in picongpu::hdf5::openPMD::PatchReader::checkSpatialTypeSize (this=0x7ffffffef03f, dc=0x17e1fa8, availableRanks=1, id=1024,
>>>>>     particlePatchPathComponent=...)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/patchReader.cpp:49
>>>>> #3 0x00000000007a33ed in picongpu::hdf5::openPMD::PatchReader::readPatchAttribute (this=0x7ffffffef03f, dc=0x17e1fa8, availableRanks=1, id=1024,
>>>>>     particlePatchPathComponent=..., dest=0x1bf1850)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/patchReader.cpp:76
>>>>> #4 0x00000000007a3665 in picongpu::hdf5::openPMD::PatchReader::operator() (
>>>>>     this=0x7ffffffef03f, dc=0x17e1fa8, availableRanks=1, dimensionality=3,
>>>>>     id=1024, particlePatchPath=...)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/patchReader.cpp:102
>>>>> #5 0x0000000000857c7b in picongpu::hdf5::LoadSpecies, PMacc::math::CT::Vector, mpl_::integral_c, mpl_::integral_c >, boost::mpl::v_item, boost::mpl::vector0, 0>, 0>, 0>, boost::mpl::vector, picongpu::placeholder_definition27::shape, picongpu::placeholder_definition32::interpolation, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition33::current, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition36::massRatio, picongpu::placeholder_definition37::chargeRatio, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, PMacc::HandleGuardRegion, boost::mpl::vector0, boost::mpl::vector0 > > >::operator() (this=0x7fffffff0bca,
>>>>>     params=0x10d6d98, restartChunkSize=1000000)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/plugins/hdf5/restart/LoadSpecies.hpp:126
>>>>> #6 0x00000000007f8d41 in operator() (t1=@0x10d6ee0: 1000000, t0=@0x7fffffff0530: 0x10d6d98, this=)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/algorithms/ForEach.hpp:158
>>>>> #7 operator() (
>>>>>     t1=@0x10d6ee0: 1000000, t0=, this=)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/algorithms/ForEach.hpp:239
>>>>> #8 picongpu::hdf5::HDF5Writer::restart (this=0x10d6d80, restartStep=1024,
>>>>>     restartDirectory=...)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp:226
>>>>> #9 0x00000000007dedb8 in PMacc::PluginConnector::restartPlugins (
>>>>>     this=0x10ac700 ,
>>>>>     restartStep=1024, restartDirectory=...)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/pluginSystem/PluginConnector.hpp:181
>>>>> #10 0x00000000007e503d in picongpu::InitialiserController::restart (
>>>>>     this=0x10d6010, restartStep=1024, restartDirectory=...)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/initialization/InitialiserController.hpp:84
>>>>> #11 0x000000000080ab73 in picongpu::MySimulation::fillSimulation (
>>>>>     this=0x10d5e20)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/simulationControl/MySimulation.hpp:413
>>>>> #12 0x000000000084d5f7 in PMacc::SimulationHelper<3u>::startSimulation (
>>>>>     this=0x10d5e20)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/simulationControl/SimulationHelper.hpp:223
>>>>> #13 0x00000000008315b8 in picongpu::SimulationStarter::start (
>>>>>     this=0x7fffffff1930)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/simulationControl/SimulationStarter.hpp:83
>>>>> #14 0x00000000007cc249 in main (argc=38, argv=0x7fffffff1ad8)
>>>>>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/main.cu:10
>>>>>
>>>>>
>>>>>
>>>>> ________________________________________
>>>>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
>>>>> Sent: Tuesday, February 28, 2017 9:05 PM
>>>>> To: picongpu-users@hzdr.de
>>>>> Subject: Re: [PIConGPU-Users] Restart failure
>>>>>
>>>>> Hi Danila,
>>>>>
>>>>> sorry your mail got lost.
>>>>>
>>>>> -G is currently not possible due to a problem in cuSTL (a library inside
>>>>> PMacc that is used at some points). You won't need it anyway, `-g` is
>>>>> enough.
>>>>>
>>>>> Can you try debugging it like that?
>>>>>
>>>>>
>>>>> Best,
>>>>> Axel
>>>>>
>>>>> On 16.02.2017 09:58, Khikhlukha Danila wrote:
>>>>>> Hi Axel,
>>>>>> let me please answer step by step. Adding --restart-step 1024 didn't help. The latest commit I have in my local branch is aabd50e.
>>>>>>
>>>>>> I decided to debug the issue with HDF5. Following your instructions, I tried to compile with debug flags on; however, my debug compilation fails. So, what I did:
>>>>>> 1. Configure the makefiles as usual, with the command issued in $BUILD_DIR:
>>>>>> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29 -DPMACC_VERBOSE_LVL=7" $INCLUDE_DIR
>>>>>>
>>>>>> 2. Then, in $BUILD_DIR, run `ccmake .` and set the flags:
>>>>>> CUDA_SHOW_CODELINES = ON
>>>>>> PMACC_BLOCKING_KERNEL = ON
>>>>>> CUDA_NVCC_FLAGS_DEBUG = -g;-G (I found this flag in advanced mode. Maybe it is worth mentioning on the wiki page.)
>>>>>>
>>>>>> 3. Save and generate a new Makefile, launch `make`, and get an error:
>>>>>> ptxas error : Entry function '<...really long function name...>' uses too much shared data (0xc00a bytes, 0xc000 max).
>>>>>> which puzzles me a bit.
>>>>>> It is clear that the debug binary will need more static memory to store the debug info, but is it that large? Or did I do something wrong?
>>>>>>
>>>>>> 4. If I skip step #2, `make` has no problems compiling in the normal way.
>>>>>>
>>>>>> Best,
>>>>>> Danila.
>>>>>> ________________________________________
>>>>>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
>>>>>> Sent: Tuesday, February 14, 2017 11:02 PM
>>>>>> To: picongpu-users@hzdr.de
>>>>>> Subject: Re: [PIConGPU-Users] Restart failure
>>>>>>
>>>>>> Hi Danila,
>>>>>>
>>>>>> so far that looks normal, can you try specifying your restart step
>>>>>> explicitly with e.g.
>>>>>>   --restart-step 1024
>>>>>> ?
>>>>>>
>>>>>>
>>>>>> Which exact version of PIConGPU did you use?
>>>>>>
>>>>>> We are currently not aware of known problems with restarts. If we can
>>>>>> narrow down the source of your segfault this would be wonderful
>>>>>> (although I suspect it might be in a third party lib), so in case that
>>>>>> you are able to do a small 1 node example and can hang in gdb there
>>>>>>   https://github.com/ComputationalRadiationPhysics/picongpu/wiki/Debugging
>>>>>>
>>>>>> we might get a better understanding.
>>>>>>
>>>>>> That said, you can also add ADIOS to your compile chain which will
>>>>>> automatically use ADIOS .bp files for checkpointing & restarting and you
>>>>>> can still use HDF5 for regular output (or use ADIOS for both) in case
>>>>>> HDF5 behaves too nasty and you need to move fast.
>>>>>>
>>>>>>
>>>>>> Axel
>>>>>>
>>>>>> On 13.02.2017 15:47, Khikhlukha Danila wrote:
>>>>>>> Hi René,
>>>>>>> sure, pls. see the attachment. Please let me know if more information is needed.
>>>>>>>
>>>>>>> D.
>>>>>>> ________________________________________
>>>>>>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of René Widera [r.widera@hzdr.de]
>>>>>>> Sent: Monday, February 13, 2017 3:39 PM
>>>>>>> To: picongpu-users@hzdr.de
>>>>>>> Subject: Re: [PIConGPU-Users] [PIConGPU-Users] Restart failure
>>>>>>>
>>>>>>> Dear Danila,
>>>>>>>
>>>>>>> could you please send us the `stdout`, `stderr` and the files from the
>>>>>>> `tbg` folder?
>>>>>>>
>>>>>>> best,
>>>>>>>
>>>>>>> René
>>>>>>>
>>>>>>> On 02/13/2017 03:11 PM, Khikhlukha Danila wrote:
>>>>>>>> Dear all,
>>>>>>>> I am currently trying to set up PoG on the Jureca machine. It all worked
>>>>>>>> fine for the LWFA example; however, when I tried to restart the
>>>>>>>> simulation, I received a segfault almost immediately.
>>>>>>>> My tool chain is as follows:
>>>>>>>>
>>>>>>>> GCC/5.4.0
>>>>>>>> CUDA/8.0.44
>>>>>>>> MVAPICH2/2.2-GDR
>>>>>>>> HDF5/1.8.17
>>>>>>>> Boost/1.61.0
>>>>>>>>
>>>>>>>> So, the first run didn't have any problems -- pictures, save points and
>>>>>>>> data dumps were created. When I try to launch the restart, it crashes,
>>>>>>>> although I explicitly specify the savepoint directory.
>>>>>>>>
>>>>>>>> test$ diff -r 0002/submit/ 0002_restart/submit/
>>>>>>>> diff -r 0002/submit/0008gpus.cfg 0002_restart/submit/0008gpus.cfg
>>>>>>>> 39c39
>>>>>>>> < TBG_steps="-s 1024"
>>>>>>>> ---
>>>>>>>> > TBG_steps="-s 2048"
>>>>>>>> 41a42
>>>>>>>> > TBG_restart="--restart --restart-directory
>>>>>>>> /work/hhh20/hhh20z/run_0002/simOutput/checkpoints"
>>>>>>>> 67a69
>>>>>>>> > !TBG_restart \
>>>>>>>>
>>>>>>>> I also checked that it exists and is accessible. I tried to switch on some
>>>>>>>> debug information with the following command:
>>>>>>>>
>>>>>>>> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29
>>>>>>>> -DPMACC_VERBOSE_LVL=7"
>>>>>>>>
>>>>>>>> however, I didn't find any information except a standard message:
>>>>>>>> [jrc0007:mpi_rank_4][error_sighandler] Caught error: Segmentation fault
>>>>>>>> (signal 11)
>>>>>>>>
>>>>>>>> Could you please advise me whether there is another way to diagnose the
>>>>>>>> problem (other than launching gdb)? Maybe I'm doing something wrong?
>>>>>>>> However, restart used to work on other machines...
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you in advance,
>>>>>>>> Danila.
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> René Widera
>>>>>>> Laser Particle Acceleration Division (FWKT)
>>>>>>> Helmholtz-Zentrum Dresden-Rossendorf
>>>>>>> Tel: +49 (0351) 260 3543
>>>>>>> r.widera@hzdr.de
>>>>>>> http://www.hzdr.de
>>>>>>>
>>>>>>> Board of Directors: Prof. Dr. Dr. h. c. Roland Sauerbrey,
>>>>>>> Prof. Dr. Dr. h. c.
>>>>>>> Peter Joehnk
>>>>>>> Register of associations: VR 1693, Dresden District Court (Amtsgericht Dresden)
>>>>>>>
>>>>>>> #############################################################
>>>>>>> This message is sent to you because you are subscribed to
>>>>>>> the mailing list .
>>>>>>> To unsubscribe, E-mail to:
>>>>>>> To switch to the DIGEST mode, E-mail to
>>>>>>> To switch to the INDEX mode, E-mail to
>>>>>>> Send administrative queries to
>>>>>>>

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Board of Directors: Prof. Dr.Dr.h.c. R. Sauerbrey
Prof. Dr.Dr.h.c. P.
Joehnk
VR 1693, Dresden District Court (Amtsgericht Dresden)

#############################################################
This message is sent to you because you are subscribed to
the mailing list .
To unsubscribe, E-mail to:
To switch to the DIGEST mode, E-mail to
To switch to the INDEX mode, E-mail to
Send administrative queries to

--_003_BA7C853FEE430847B9C35FFCC6E5B2A555B64328braunelibeamseu_
Content-Type: application/octet-stream; name="diff_log_1"
Content-Description: diff_log_1
Content-Disposition: attachment; filename="diff_log_1"; size=2373; creation-date="Mon, 05 Jun 2017 11:30:28 GMT"; modification-date="Mon, 05 Jun 2017 11:30:28 GMT"
Content-Transfer-Encoding: base64

ZGlmZiAtciBsd2ZhX3JlZi9zdWJtaXQvMDAwOGdwdXMuY2ZnIDAwMDBfMS9zdWJtaXQvMDAwOGdw dXMuY2ZnCjEsMTljMSwxOQo8ICMgQ29weXJpZ2h0IDIwMTMtMjAxNyBBeGVsIEh1ZWJsLCBSZW5l IFdpZGVyYSwgRmVsaXggU2NobWl0dAo8ICMKPCAjIFRoaXMgZmlsZSBpcyBwYXJ0IG9mIFBJQ29u R1BVLgo8ICMKPCAjIFBJQ29uR1BVIGlzIGZyZWUgc29mdHdhcmU6IHlvdSBjYW4gcmVkaXN0cmli dXRlIGl0IGFuZC9vciBtb2RpZnkKPCAjIGl0IHVuZGVyIHRoZSB0ZXJtcyBvZiB0aGUgR05VIEdl bmVyYWwgUHVibGljIExpY2Vuc2UgYXMgcHVibGlzaGVkIGJ5CjwgIyB0aGUgRnJlZSBTb2Z0d2Fy ZSBGb3VuZGF0aW9uLCBlaXRoZXIgdmVyc2lvbiAzIG9mIHRoZSBMaWNlbnNlLCBvcgo8ICMgKGF0 IHlvdXIgb3B0aW9uKSBhbnkgbGF0ZXIgdmVyc2lvbi4KPCAjCjwgIyBQSUNvbkdQVSBpcyBkaXN0 cmlidXRlZCBpbiB0aGUgaG9wZSB0aGF0IGl0IHdpbGwgYmUgdXNlZnVsLAo8ICMgYnV0IFdJVEhP VVQgQU5ZIFdBUlJBTlRZOyB3aXRob3V0IGV2ZW4gdGhlIGltcGxpZWQgd2FycmFudHkgb2YKPCAj IE1FUkNIQU5UQUJJTElUWSBvciBGSVRORVNTIEZPUiBBIFBBUlRJQ1VMQVIgUFVSUE9TRS4gIFNl ZSB0aGUKPCAjIEdOVSBHZW5lcmFsIFB1YmxpYyBMaWNlbnNlIGZvciBtb3JlIGRldGFpbHMuCjwg Iwo8ICMgWW91IHNob3VsZCBoYXZlIHJlY2VpdmVkIGEgY29weSBvZiB0aGUgR05VIEdlbmVyYWwg UHVibGljIExpY2Vuc2UKPCAjIGFsb25nIHdpdGggUElDb25HUFUuCjwgIyBJZiBub3QsIHNlZSA8 aHR0cDovL3d3dy5nbnUub3JnL2xpY2Vuc2VzLz4uCjwgIwo8IAotLS0KPiAjIENvcHlyaWdodCAy MDEzLTIwMTYgQXhlbCBIdWVibCwgUmVuZSBXaWRlcmEsIEZlbGl4IFNjaG1pdHQKPiAjIAo+ICMg
VGhpcyBmaWxlIGlzIHBhcnQgb2YgUElDb25HUFUuIAo+ICMgCj4gIyBQSUNvbkdQVSBpcyBmcmVl IHNvZnR3YXJlOiB5b3UgY2FuIHJlZGlzdHJpYnV0ZSBpdCBhbmQvb3IgbW9kaWZ5IAo+ICMgaXQg dW5kZXIgdGhlIHRlcm1zIG9mIHRoZSBHTlUgR2VuZXJhbCBQdWJsaWMgTGljZW5zZSBhcyBwdWJs aXNoZWQgYnkgCj4gIyB0aGUgRnJlZSBTb2Z0d2FyZSBGb3VuZGF0aW9uLCBlaXRoZXIgdmVyc2lv biAzIG9mIHRoZSBMaWNlbnNlLCBvciAKPiAjIChhdCB5b3VyIG9wdGlvbikgYW55IGxhdGVyIHZl cnNpb24uIAo+ICMgCj4gIyBQSUNvbkdQVSBpcyBkaXN0cmlidXRlZCBpbiB0aGUgaG9wZSB0aGF0 IGl0IHdpbGwgYmUgdXNlZnVsLCAKPiAjIGJ1dCBXSVRIT1VUIEFOWSBXQVJSQU5UWTsgd2l0aG91 dCBldmVuIHRoZSBpbXBsaWVkIHdhcnJhbnR5IG9mIAo+ICMgTUVSQ0hBTlRBQklMSVRZIG9yIEZJ VE5FU1MgRk9SIEEgUEFSVElDVUxBUiBQVVJQT1NFLiAgU2VlIHRoZSAKPiAjIEdOVSBHZW5lcmFs IFB1YmxpYyBMaWNlbnNlIGZvciBtb3JlIGRldGFpbHMuIAo+ICMgCj4gIyBZb3Ugc2hvdWxkIGhh dmUgcmVjZWl2ZWQgYSBjb3B5IG9mIHRoZSBHTlUgR2VuZXJhbCBQdWJsaWMgTGljZW5zZSAKPiAj IGFsb25nIHdpdGggUElDb25HUFUuICAKPiAjIElmIG5vdCwgc2VlIDxodHRwOi8vd3d3LmdudS5v cmcvbGljZW5zZXMvPi4gCj4gIyAKPiAgCjI1YzI1CjwgIyMgICAgICAgICAgICAgICAgICAgICAg ZG9jcy9UQkdfbWFjcm9zLmNmZwotLS0KPiAjIyAgICAgICAgICAgICAgICAgICAgICBkb2MvVEJH X21hY3Jvcy5jZmcKMzUsNDBjMzUsMzkKPCBUQkdfZ3B1X3g9Mgo8IFRCR19ncHVfeT0yCjwgVEJH X2dwdV96PTIKPCAKPCBUQkdfZ3JpZFNpemU9Ii1nIDE5MiA1MTIgMTkyIgo8IFRCR19zdGVwcz0i LXMgNDAwMCIKLS0tCj4gVEJHX2dwdV94PTEKPiBUQkdfZ3B1X3k9OAo+IFRCR19ncHVfej0xCj4g VEJHX2dyaWRTaXplPSItZyAyNTYgMTAyNCAyNTYiCj4gVEJHX3N0ZXBzPSItcyAxMDI0Igo0MmQ0 MAo8ICMgbGVhdmUgVEJHX21vdmluZ1dpbmRvdyBlbXB0eSB0byBkaXNhYmxlIG1vdmluZyB3aW5k b3cKNTBhNDksNTAKPiBUQkdfaGRmNT0iLS1oZGY1LnBlcmlvZCAxMjggLS1oZGY1LmZpbGUgc2lt RGF0YSIKPiBUQkdfY2hlY2twb2ludHM9Ii0tY2hlY2twb2ludHMgNTEyIgo1Myw1NGM1Myw1NQo8 ICAgICAgICAgICAgICAgIVRCR19wbmdZWiAgICAgICAgICAgICAgICAgICAgXAo8ICAgICAgICAg ICAgICAgLS1lX21hY3JvUGFydGljbGVzQ291bnQucGVyaW9kIDEwMCIKLS0tCj4gICAgICAgICAg ICAgICFUQkdfcG5nWVogICAgICAgICAgICAgICAgICAgIFwKPiAgICAgICAgICAgICAgIVRCR19o ZGY1ICAgICAgICAgICAgICAgICAgICAgXAo+ICAgICAgICAgICAgICAhVEJHX2NoZWNrcG9pbnRz IgpPbmx5IGluIDAwMDBfMS9zdWJtaXQ6IGp1cmVjYS1memoK 
--_003_BA7C853FEE430847B9C35FFCC6E5B2A555B64328braunelibeamseu_
Content-Type: application/octet-stream; name="diff_log_2"
Content-Description: diff_log_2
Content-Disposition: attachment; filename="diff_log_2"; size=302; creation-date="Mon, 05 Jun 2017 11:30:28 GMT"; modification-date="Mon, 05 Jun 2017 11:30:28 GMT"
Content-Transfer-Encoding: base64

ZGlmZiAtLWV4Y2x1ZGUgJ2IqJyAtciAwMDAwXzEvc3VibWl0LzAwMDhncHVzLmNmZyAwMDAwXzIv c3VibWl0LzAwMDhncHVzLmNmZwozOWMzOQo8IFRCR19zdGVwcz0iLXMgMTAyNCIKLS0tCj4gVEJH X3N0ZXBzPSItcyAyMDQ4Igo0MWE0Mgo+IFRCR19yZXN0YXJ0PSItLXJlc3RhcnQgLS1yZXN0YXJ0 LWRpcmVjdG9yeSAvd29yay9oaGgyMC9oaGgyMHovcnVuXzAwMDBfMS9zaW1PdXRwdXQvY2hlY2tw b2ludHMgLS1yZXN0YXJ0LXN0ZXAgMTAyNCIKNjdhNjkKPiAgICAgICAgICAgICAgICAgICAgIVRC R19yZXN0YXJ0ICAgICAgXAo=

--_003_BA7C853FEE430847B9C35FFCC6E5B2A555B64328braunelibeamseu_--