Mailing List picongpu-users@hzdr.de Message #232
From: Axel Huebl <a.huebl@hzdr.de>
Subject: Re: Restart failure
Date: Mon, 27 Mar 2017 20:39:33 +0200
To: <picongpu-users@hzdr.de>
Very weird, the error originates from this [1] libSplash read call.

Are you sure you are running with the same libSplash and HDF5 that you
compiled against (check with ldd bin/picongpu)? Is your environment
(LD_LIBRARY_PATH) in good shape? Did the compiler change?
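
A minimal sketch of that environment check, assuming a POSIX shell on the node where the restart fails. The function names are made up for illustration; only ldd, grep, printf and tr are used:

```shell
# Keep only the lines of an ldd listing that mention libSplash, HDF5 or MPI.
# Reads the listing from stdin, so it also works on saved output.
filter_io_libs() {
    grep -i -E 'splash|hdf5|mpi'
}

# Print LD_LIBRARY_PATH with one directory per line (empty line if unset).
print_ld_library_path() {
    printf '%s\n' "${LD_LIBRARY_PATH:-}" | tr ':' '\n'
}

# Usage, with the binary path from the message above:
#   ldd bin/picongpu | filter_io_libs
#   print_ld_library_path
```

Comparing the resolved paths against the modules that were loaded at compile time should show whether a different libSplash or HDF5 is picked up at runtime.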

It might be that the system updated your HDF5 module, MPI or something
along the line, and you might want to rebuild libSplash. Are you
restarting with the same device configuration (-d)?

Please repeat the same run, but enable more libSplash verbosity via:
  export SPLASH_VERBOSE=3
(no recompile needed).
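
One possible way to capture the extra output, as a sketch: the helper name and the log file name are hypothetical, and the wrapper simply sets the variable for the given command and collects stdout and stderr in one file.

```shell
# Run a command with SPLASH_VERBOSE=3 in its environment and
# collect stdout and stderr in one log file.
# Usage: run_with_splash_verbose <logfile> <command> [args...]
run_with_splash_verbose() {
    log="$1"
    shift
    SPLASH_VERBOSE=3 "$@" > "$log" 2>&1
}

# Example (same restart arguments as in the failing run):
#   run_with_splash_verbose restart.log srun bin/picongpu <arguments as before>
```

The resulting log can then be attached to a reply on the list.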

You can even go one step further and compile libSplash with
  -DDEBUG_VERBOSE=ON
which also adds (some) HDF5 runtime checks.

Be aware that we currently do not support re-partitioning a job during
restart (e.g. start with -d 2 2 2, resume with -d 1 2 4).
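
To rule out an accidental change of the device layout, the -d arguments of the original and the restart command line can be compared. A rough sketch, assuming both command lines are available as plain strings; the function names are invented for this example:

```shell
# Print the three numbers that follow -d in an argument list.
# Returns 1 if no complete "-d x y z" is found.
device_layout() {
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "-d" ] && [ "$#" -ge 4 ]; then
            echo "$2 $3 $4"
            return 0
        fi
        shift
    done
    return 1
}

# Succeed only if both command lines (passed as two strings) carry the
# same -d layout; return 2 if either one has no -d at all.
# The strings are deliberately left unquoted so they split into words.
same_layout() {
    a="$(device_layout $1)" || return 2
    b="$(device_layout $2)" || return 2
    [ "$a" = "$b" ]
}

# Example:
#   same_layout "-d 2 2 2 -g 128 1024 128 -s 1024" \
#               "-d 2 2 2 -g 128 1024 128 -s 2048 --restart" && echo "layout unchanged"
```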

Best,
Axel

[1]
https://github.com/ComputationalRadiationPhysics/picongpu/blob/0.2.4/src/picongpu/patchReader.cpp#L44-L49

On 27.03.2017 15:27, Khikhlukha Danila wrote:
> Hi Axel,
> so I tried to launch a restart on a single device under gdb. The segmentation fault comes from the checkpoint reader, I guess (see log below). If needed, I can provide a more detailed log. Could you please advise me what I can check next?
>
> srun gdb -ex r -ex tb --args bin/picongpu -d 1 1 1 -g 128 1024 128 -s 2048 --restart --restart-directory /work/hhh20/hhh20z/run_0002_s/simOutput/checkpoints --restart-step 1024 --e_png.period 32 --e_png.axis yx --e_png.slicePoint 0.5 --e_png.folder pngElectronsYX --e_png.period 32 --e_png.axis yz --e_png.slicePoint 0.5 --e_png.folder pngElectronsYZ  --hdf5.period 128 --hdf5.file simData --checkpoints 512
> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> ......
> Program received signal SIGSEGV, Segmentation fault.
> 0x0000000000a545b6 in splash::ParallelDataCollector::readDataSetMeta(int, int, char const*, splash::Dimensions, splash::Dimensions, splash::Dimensions, splash::Dimensions&, unsigned int&) ()
> Temporary breakpoint 1 at 0xa545b6
> Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 glibc-2.17-106.el7_2.8.x86_64 libibverbs-1.2.1mlnx1-3.OFED.4.0.0.1.3.40101.el7.centos.x86_64 libicu-50.1.2-15.el7.x86_64 libnl-1.1.4-3.el7.x86_64 libpng-1.5.13-7.el7_2.x86_64
>
> bt
> (gdb) #0  0x0000000000a545b6 in splash::ParallelDataCollector::readDataSetMeta(int, int, char const*, splash::Dimensions, splash::Dimensions, splash::Dimensions, splash::Dimensions&, unsigned int&) ()
> #1  0x0000000000a54d1a in splash::ParallelDataCollector::readMeta(int, char const*, splash::Dimensions, splash::Dimensions, splash::Dimensions&) ()
> #2  0x00000000007a3254 in picongpu::hdf5::openPMD::PatchReader::checkSpatialTypeSize (this=0x7ffffffef03f, dc=0x17e1fa8, availableRanks=1, id=1024,
>     particlePatchPathComponent=...)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/patchReader.cpp:49
> #3  0x00000000007a33ed in picongpu::hdf5::openPMD::PatchReader::readPatchAttribute (this=0x7ffffffef03f, dc=0x17e1fa8, availableRanks=1, id=1024,
>     particlePatchPathComponent=..., dest=0x1bf1850)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/patchReader.cpp:76
> #4  0x00000000007a3665 in picongpu::hdf5::openPMD::PatchReader::operator() (
>     this=0x7ffffffef03f, dc=0x17e1fa8, availableRanks=1, dimensionality=3,
>     id=1024, particlePatchPath=...)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/patchReader.cpp:102
> #5  0x0000000000857c7b in picongpu::hdf5::LoadSpecies<picongpu::Particles<PMacc::ParticleDescription<boost::mpl::string<101, 0, 0, 0, 0, 0, 0, 0>, PMacc::math::CT::Vector<mpl_::integral_c<int, 8>, mpl_::integral_c<int, 8>, mpl_::integral_c<int, 4> >, boost::mpl::v_item<picongpu::placeholder_definition23::weighting, boost::mpl::v_item<picongpu::placeholder_definition21::momentum, boost::mpl::v_item<picongpu::placeholder_definition18::position<picongpu::placeholder_definition20::position_pic, PMacc::placeholder_definition15::pmacc_isAlias>, boost::mpl::vector0<mpl_::na>, 0>, 0>, 0>, boost::mpl::vector<picongpu::placeholder_definition28::particlePusher<picongpu::particles::pusher::Boris, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition27::shape<picongpu::particles::shapes::TSC, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition32::interpolation<picongpu::FieldToParticleInterpolation<picongpu::particles::shapes::TSC, picongpu::AssignedTrilinearInterpolation>, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition33::current<picongpu::currentSolver::Esirkepov<picongpu::particles::shapes::TSC, 3u>, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition36::massRatio<picongpu::placeholder_definition55::MassRatioElectrons, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition37::chargeRatio<picongpu::placeholder_definition56::ChargeRatioElectrons, PMacc::placeholder_definition15::pmacc_isAlias>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, PMacc::HandleGuardRegion<PMacc::particles::policies::ExchangeParticles, PMacc::particles::policies::DeleteParticles>, boost::mpl::vector0<mpl_::na>, boost::mpl::vector0<mpl_::na> > > >::operator() (this=0x7fffffff0bca,
>     params=0x10d6d98, restartChunkSize=1000000)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/plugins/hdf5/restart/LoadSpecies.hpp:126
> #6  0x00000000007f8d41 in operator()<picongpu::hdf5::ThreadParams*, unsigned int> (t1=@0x10d6ee0: 1000000, t0=@0x7fffffff0530: 0x10d6d98, this=<optimized out>)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/algorithms/ForEach.hpp:158
> #7  operator()<picongpu::hdf5::ThreadParams*, unsigned int> (
>     t1=@0x10d6ee0: 1000000, t0=<optimized out>, this=<optimized out>)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/algorithms/ForEach.hpp:239
> #8  picongpu::hdf5::HDF5Writer::restart (this=0x10d6d80, restartStep=1024,
>     restartDirectory=...)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp:226
> #9  0x00000000007dedb8 in PMacc::PluginConnector::restartPlugins (
>     this=0x10ac700 <PMacc::PluginConnector::getInstance()::instance>,
>     restartStep=1024, restartDirectory=...)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/pluginSystem/PluginConnector.hpp:181
> #10 0x00000000007e503d in picongpu::InitialiserController::restart (
>     this=0x10d6010, restartStep=1024, restartDirectory=...)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/initialization/InitialiserController.hpp:84
> #11 0x000000000080ab73 in picongpu::MySimulation::fillSimulation (
>     this=0x10d5e20)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/simulationControl/MySimulation.hpp:413
> #12 0x000000000084d5f7 in PMacc::SimulationHelper<3u>::startSimulation (
>     this=0x10d5e20)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/simulationControl/SimulationHelper.hpp:223
> #13 0x00000000008315b8 in picongpu::SimulationStarter<picongpu::InitialiserController, picongpu::PluginController, picongpu::MySimulation>::start (
>     this=0x7fffffff1930)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/simulationControl/SimulationStarter.hpp:83
> #14 0x00000000007cc249 in main (argc=38, argv=0x7fffffff1ad8)
>     at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/main.cu:10
>
>
>
> ________________________________________
> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
> Sent: Tuesday, February 28, 2017 9:05 PM
> To: picongpu-users@hzdr.de
> Subject: Re:  [PIConGPU-Users] Restart failure
>
> Hi Danila,
>
> sorry your mail got lost.
>
> -G is currently not possible due to a problem in cuSTL (a library inside
> PMacc that is used at some points). You won't need it anyway, `-g` is
> enough.
>
> Can you try debugging it like that?
>
>
> Best,
> Axel
>
> On 16.02.2017 09:58, Khikhlukha Danila wrote:
>> Hi Axel,
>> let me answer step by step. Adding --restart-step 1024 didn't help. The latest commit in my local branch is aabd50e.
>>
>> I decided to debug the issue with HDF5. Following your instructions, I tried to compile with debug flags on, but the debug build fails. Here is what I did:
>> 1. Configure the Makefiles as usual, with this command issued in $BUILD_DIR:
>> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29 -DPMACC_VERBOSE_LVL=7" $INCLUDE_DIR
>>
>> 2. Then, in $BUILD_DIR, run `ccmake .` and set these flags:
>> CUDA_SHOW_CODELINES = ON
>> PMACC_BLOCKING_KERNEL = ON
>> CUDA_NVCC_FLAGS_DEBUG = -g;-G (I found this flag in advanced mode. Maybe it is worth mentioning on the wiki page.)
>>
>> 3. Save, generate the new Makefile, launch `make`, and get an error:
>> ptxas error   : Entry function '<...really long function name...>' uses too much shared data (0xc00a bytes, 0xc000 max).
>> which puzzles me a bit. It is clear that the debug binary needs more static memory to store the debug info, but is it that large? Or did I do something wrong?
>>
>> 4. If I skip step 2, `make` compiles without problems.
>>
>> Best,
>> Danila.
>> ________________________________________
>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
>> Sent: Tuesday, February 14, 2017 11:02 PM
>> To: picongpu-users@hzdr.de
>> Subject: Re:  [PIConGPU-Users] Restart failure
>>
>> Hi Danila,
>>
>> so far that looks normal, can you try specifying your restart step
>> explicitly with e.g.
>>   --restart-step 1024
>> ?
>>
>>
>> Which exact version of PIConGPU did you use?
>>
>> We are currently not aware of any problems with restarts. If we can
>> narrow down the source of your segfault, that would be wonderful
>> (although I suspect it might be in a third-party lib). So if you
>> are able to set up a small 1-node example and hook in gdb as described here
>>   https://github.com/ComputationalRadiationPhysics/picongpu/wiki/Debugging
>>
>> we might get a better understanding.
>>
>> That said, you can also add ADIOS to your compile chain which will
>> automatically use ADIOS .bp files for checkpointing & restarting and you
>> can still use HDF5 for regular output (or use ADIOS for both) in case
>> HDF5 behaves too nasty and you need to move fast.
>>
>>
>> Axel
>>
>> On 13.02.2017 15:47, Khikhlukha Danila wrote:
>>> Hi René,
>>> sure, pls. see the attachment. Please let me know if more information is needed.
>>>
>>> D.
>>> ________________________________________
>>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of René Widera [r.widera@hzdr.de]
>>> Sent: Monday, February 13, 2017 3:39 PM
>>> To: picongpu-users@hzdr.de
>>> Subject: Re:  [PIConGPU-Users] [PIConGPU-Users] Restart failure
>>>
>>> Dear Danila,
>>>
>>> could you please send us the `stdout`, `stderr` and the files from the
>>> `tbg` folder?
>>>
>>> best,
>>>
>>> René
>>>
>>> On 02/13/2017 03:11 PM, Khikhlukha Danila wrote:
>>>> Dear all,
>>>> I am currently trying to set up PoG on the Jureca machine. It all worked
>>>> fine for the LWFA example; however, when I tried to restart the
>>>> simulation, I received a segfault almost immediately.
>>>> My tool chain is as follows:
>>>>
>>>> GCC/5.4.0
>>>> CUDA/8.0.44
>>>> MVAPICH2/2.2-GDR
>>>> HDF5/1.8.17
>>>> Boost/1.61.0
>>>>
>>>> So, the first run didn't have any problems: pictures, checkpoints and
>>>> data dumps were created. When I try to launch the restart, it crashes,
>>>> although I explicitly specify the checkpoint directory.
>>>>
>>>> test$ diff -r 0002/submit/ 0002_restart/submit/
>>>> diff -r 0002/submit/0008gpus.cfg 0002_restart/submit/0008gpus.cfg
>>>> 39c39
>>>> < TBG_steps="-s 1024"
>>>> ---
>>>>> TBG_steps="-s 2048"
>>>> 41a42
>>>>> TBG_restart="--restart --restart-directory
>>>> /work/hhh20/hhh20z/run_0002/simOutput/checkpoints"
>>>> 67a69
>>>>>                    !TBG_restart      \
>>>>
>>>> I also checked that it exists and is accessible. I tried to switch on some
>>>> debug information with the following command:
>>>>
>>>> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29
>>>> -DPMACC_VERBOSE_LVL=7"
>>>>
>>>> however I didn't find any information except a standard message:
>>>> [jrc0007:mpi_rank_4][error_sighandler] Caught error: Segmentation fault
>>>> (signal 11)
>>>>
>>>> Could you please advise me whether there is another way to diagnose the
>>>> problem (other than launching gdb)? Maybe I'm doing something wrong?
>>>> However, restart used to work on other machines...
>>>>
>>>>
>>>> Thank you in advance,
>>>> Danila.
>>>>
>>>
>>> --
>>> René Widera
>>> Abteilung Laser-Teilchenbeschleunigung (FWKT)
>>> Helmholtz-Zentrum Dresden-Rossendorf
>>> Tel: +49 (0351) 260 3543
>>> r.widera@hzdr.de
>>> http://www.hzdr.de
>>>
>>> Vorstand: Prof. Dr. Dr. h. c. Roland Sauerbrey,
>>>            Prof. Dr. Dr. h. c. Peter Joehnk
>>> Vereinsregister: VR 1693 beim Amtsgericht Dresden
>>>
>>> #############################################################
>>> This message is sent to you because you are subscribed to
>>>   the mailing list <picongpu-users@hzdr.de>.
>>> To unsubscribe, E-mail to: <picongpu-users-off@hzdr.de>
>>> To switch to the DIGEST mode, E-mail to <picongpu-users-digest@hzdr.de>
>>> To switch to the INDEX mode, E-mail to <picongpu-users-index@hzdr.de>
>>> Send administrative queries to  <picongpu-users-request@hzdr.de>
>>>
>>
>> --
>>
>> Axel Huebl
>> Phone +49 351 260 3582
>> https://www.hzdr.de/crp
>> Computational Radiation Physics
>> Laser Particle Acceleration Division
>> Helmholtz-Zentrum Dresden - Rossendorf e.V.
>>
>> Bautzner Landstrasse 400, 01328 Dresden
>> POB 510119, D-01314 Dresden
>> Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
>>           Prof. Dr.Dr.h.c. P. Joehnk
>> VR 1693 beim Amtsgericht Dresden
>>
>
>

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
          Prof. Dr.Dr.h.c. P. Joehnk
VR 1693 beim Amtsgericht Dresden