Mailing List picongpu-users@hzdr.de Message #231
From: Khikhlukha Danila <Danila.Khikhlukha@eli-beams.eu>
Subject: RE: [PIConGPU-Users] Restart failure
Date: Mon, 27 Mar 2017 13:27:31 +0000
To: picongpu-users@hzdr.de <picongpu-users@hzdr.de>
Hi Axel,
so I tried to launch a restart on a single device under gdb. I guess the segmentation fault is coming from the checkpoint reader (see the log below). If needed, I can provide a more detailed log. Could you please advise what I should check next?

srun gdb -ex r -ex tb --args bin/picongpu -d 1 1 1 -g 128 1024 128 -s 2048 --restart --restart-directory /work/hhh20/hhh20z/run_0002_s/simOutput/checkpoints --restart-step 1024 --e_png.period 32 --e_png.axis yx --e_png.slicePoint 0.5 --e_png.folder pngElectronsYX --e_png.period 32 --e_png.axis yz --e_png.slicePoint 0.5 --e_png.folder pngElectronsYZ  --hdf5.period 128 --hdf5.file simData --checkpoints 512
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
......
Program received signal SIGSEGV, Segmentation fault.
0x0000000000a545b6 in splash::ParallelDataCollector::readDataSetMeta(int, int, char const*, splash::Dimensions, splash::Dimensions, splash::Dimensions, splash::Dimensions&, unsigned int&) ()
Temporary breakpoint 1 at 0xa545b6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 glibc-2.17-106.el7_2.8.x86_64 libibverbs-1.2.1mlnx1-3.OFED.4.0.0.1.3.40101.el7.centos.x86_64 libicu-50.1.2-15.el7.x86_64 libnl-1.1.4-3.el7.x86_64 libpng-1.5.13-7.el7_2.x86_64

(gdb) bt
#0  0x0000000000a545b6 in splash::ParallelDataCollector::readDataSetMeta(int, int, char const*, splash::Dimensions, splash::Dimensions, splash::Dimensions, splash::Dimensions&, unsigned int&) ()
#1  0x0000000000a54d1a in splash::ParallelDataCollector::readMeta(int, char const*, splash::Dimensions, splash::Dimensions, splash::Dimensions&) ()
#2  0x00000000007a3254 in picongpu::hdf5::openPMD::PatchReader::checkSpatialTypeSize (this=0x7ffffffef03f, dc=0x17e1fa8, availableRanks=1, id=1024,
    particlePatchPathComponent=...)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/patchReader.cpp:49
#3  0x00000000007a33ed in picongpu::hdf5::openPMD::PatchReader::readPatchAttribute (this=0x7ffffffef03f, dc=0x17e1fa8, availableRanks=1, id=1024,
    particlePatchPathComponent=..., dest=0x1bf1850)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/patchReader.cpp:76
#4  0x00000000007a3665 in picongpu::hdf5::openPMD::PatchReader::operator() (
    this=0x7ffffffef03f, dc=0x17e1fa8, availableRanks=1, dimensionality=3,
    id=1024, particlePatchPath=...)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/patchReader.cpp:102
#5  0x0000000000857c7b in picongpu::hdf5::LoadSpecies<picongpu::Particles<PMacc::ParticleDescription<boost::mpl::string<101, 0, 0, 0, 0, 0, 0, 0>, PMacc::math::CT::Vector<mpl_::integral_c<int, 8>, mpl_::integral_c<int, 8>, mpl_::integral_c<int, 4> >, boost::mpl::v_item<picongpu::placeholder_definition23::weighting, boost::mpl::v_item<picongpu::placeholder_definition21::momentum, boost::mpl::v_item<picongpu::placeholder_definition18::position<picongpu::placeholder_definition20::position_pic, PMacc::placeholder_definition15::pmacc_isAlias>, boost::mpl::vector0<mpl_::na>, 0>, 0>, 0>, boost::mpl::vector<picongpu::placeholder_definition28::particlePusher<picongpu::particles::pusher::Boris, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition27::shape<picongpu::particles::shapes::TSC, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition32::interpolation<picongpu::FieldToParticleInterpolation<picongpu::particles::shapes::TSC, picongpu::AssignedTrilinearInterpolation>, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition33::current<picongpu::currentSolver::Esirkepov<picongpu::particles::shapes::TSC, 3u>, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition36::massRatio<picongpu::placeholder_definition55::MassRatioElectrons, PMacc::placeholder_definition15::pmacc_isAlias>, picongpu::placeholder_definition37::chargeRatio<picongpu::placeholder_definition56::ChargeRatioElectrons, PMacc::placeholder_definition15::pmacc_isAlias>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, PMacc::HandleGuardRegion<PMacc::particles::policies::ExchangeParticles, PMacc::particles::policies::DeleteParticles>, boost::mpl::vector0<mpl_::na>, boost::mpl::vector0<mpl_::na> > > >::operator() (this=0x7fffffff0bca,
    params=0x10d6d98, restartChunkSize=1000000)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/plugins/hdf5/restart/LoadSpecies.hpp:126
#6  0x00000000007f8d41 in operator()<picongpu::hdf5::ThreadParams*, unsigned int> (t1=@0x10d6ee0: 1000000, t0=@0x7fffffff0530: 0x10d6d98, this=<optimized out>)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/algorithms/ForEach.hpp:158
#7  operator()<picongpu::hdf5::ThreadParams*, unsigned int> (
    t1=@0x10d6ee0: 1000000, t0=<optimized out>, this=<optimized out>)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/algorithms/ForEach.hpp:239
#8  picongpu::hdf5::HDF5Writer::restart (this=0x10d6d80, restartStep=1024,
    restartDirectory=...)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp:226
#9  0x00000000007dedb8 in PMacc::PluginConnector::restartPlugins (
    this=0x10ac700 <PMacc::PluginConnector::getInstance()::instance>,
    restartStep=1024, restartDirectory=...)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/pluginSystem/PluginConnector.hpp:181
#10 0x00000000007e503d in picongpu::InitialiserController::restart (
    this=0x10d6010, restartStep=1024, restartDirectory=...)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/initialization/InitialiserController.hpp:84
#11 0x000000000080ab73 in picongpu::MySimulation::fillSimulation (
    this=0x10d5e20)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/simulationControl/MySimulation.hpp:413
#12 0x000000000084d5f7 in PMacc::SimulationHelper<3u>::startSimulation (
    this=0x10d5e20)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/../libPMacc/include/simulationControl/SimulationHelper.hpp:223
#13 0x00000000008315b8 in picongpu::SimulationStarter<picongpu::InitialiserController, picongpu::PluginController, picongpu::MySimulation>::start (
    this=0x7fffffff1930)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/include/simulationControl/SimulationStarter.hpp:83
#14 0x00000000007cc249 in main (argc=38, argv=0x7fffffff1ad8)
    at /homeb/hhh20/hhh20z/tools/picongpu/src/picongpu/main.cu:10
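
Frames #0 to #4 show the crash happening while libSplash reads particle-patch metadata from the checkpoint, so one cheap next check is whether the files for step 1024 are present and non-empty in the restart directory. The sketch below is only an illustration: the function name `check_checkpoint_files` is made up, and it assumes the step number appears in the file name after a `_` and before a `.` or `_`, which may not match the actual naming.

```shell
# List the files for one step in a checkpoint directory and flag
# unreadable or empty ones.
# Assumption: the file name contains "_<step>." or "_<step>_";
# adjust the glob if your checkpoint files are named differently.
check_checkpoint_files() {
    dir=$1
    step=$2
    found=0
    for f in "$dir"/*_"$step"[._]*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        found=$((found + 1))
        if [ ! -r "$f" ]; then
            echo "UNREADABLE $f"
        elif [ ! -s "$f" ]; then
            echo "EMPTY $f"
        else
            echo "OK $f"
        fi
    done
    [ "$found" -gt 0 ] || echo "NONE no file for step $step in $dir"
}
```

It would be called as `check_checkpoint_files /work/hhh20/hhh20z/run_0002_s/simOutput/checkpoints 1024`. If the HDF5 command-line tools are installed, any file reported as OK can then be listed with `h5ls -r <file>` to see whether the `particlePatches` groups that the PatchReader frames refer to are there.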



________________________________________
From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
Sent: Tuesday, February 28, 2017 9:05 PM
To: picongpu-users@hzdr.de
Subject: Re:  [PIConGPU-Users] Restart failure

Hi Danila,

sorry your mail got lost.

-G is currently not possible due to a problem in cuSTL (a library inside
PMacc that is used at some points). You won't need it anyway, `-g` is
enough.
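
For reference, the two hexadecimal sizes in the ptxas message quoted further down can be converted with plain shell arithmetic. This is only a sketch of the numbers involved and says nothing about why the `-G` build needs the extra bytes; the variable names are illustrative.

```shell
# Sizes taken from the ptxas error "(0xc00a bytes, 0xc000 max)".
used=$((0xc00a))     # shared data requested by the kernel
limit=$((0xc000))    # maximum reported by ptxas

echo "used:   $used bytes"
echo "limit:  $limit bytes ($((limit / 1024)) KiB)"
echo "excess: $((used - limit)) bytes"
```

So the limit is 48 KiB (49152 bytes), and the kernel in the `-G` build asks for 10 bytes more than that.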

Can you try debugging it like that?
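
A host-side debug build along those lines might be configured as sketched below. The function only prints the command so that it can be reviewed before running it. Assumptions: that `configure -c` forwards `CMAKE_BUILD_TYPE` and `CUDA_NVCC_FLAGS_DEBUG` to CMake in the same way as the `-D` options used elsewhere in this thread, and the helper name `debug_configure_cmd` is made up for illustration.

```shell
# Print (do not run) a configure line for a debug build with -g only.
debug_configure_cmd() {
    picsrc=$1        # PIConGPU source directory ($PICSRC in this thread)
    include_dir=$2   # directory passed as $INCLUDE_DIR in this thread
    # -g requests host debug symbols; -G (device debug) is left out on purpose.
    opts="-DCMAKE_BUILD_TYPE=Debug -DCUDA_NVCC_FLAGS_DEBUG=-g"
    printf '%s/configure -c"%s" %s\n' "$picsrc" "$opts" "$include_dir"
}
```

Running `debug_configure_cmd "$PICSRC" "$INCLUDE_DIR"` inside `$BUILD_DIR` prints the line to execute; `make` and the gdb run then follow as before.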


Best,
Axel

On 16.02.2017 09:58, Khikhlukha Danila wrote:
> Hi Axel,
> let me answer step by step. Adding --restart-step 1024 didn't help. The latest commit in my local branch is aabd50e.
>
> I decided to debug the issue with HDF5. Following your instructions, I tried to compile with debug flags on; however, the debug build fails. Here is what I did:
> 1. Configure the makefiles as usual, with this command issued in $BUILD_DIR:
> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29 -DPMACC_VERBOSE_LVL=7" $INCLUDE_DIR
>
> 2. Then, in $BUILD_DIR, run `ccmake .` and set these flags:
> CUDA_SHOW_CODELINES = ON
> PMACC_BLOCKING_KERNEL = ON
> CUDA_NVCC_FLAGS_DEBUG = -g;-G (I found this flag in advanced mode; maybe it is worth mentioning on the wiki page)
>
> 3. Save, generate the new Makefile, run `make`, and get an error:
> ptxas error   : Entry function '<...really long function name...>' uses too much shared data (0xc00a bytes, 0xc000 max).
> which puzzles me a bit. It is clear that the debug binary will need more static memory to store the debug info, but is it that large? Or did I do something wrong?
>
> 4. If I skip step 2, `make` compiles normally without problems.
>
> Best,
> Danila.
> ________________________________________
> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of Axel Huebl [a.huebl@hzdr.de]
> Sent: Tuesday, February 14, 2017 11:02 PM
> To: picongpu-users@hzdr.de
> Subject: Re:  [PIConGPU-Users] Restart failure
>
> Hi Danila,
>
> so far that looks normal, can you try specifying your restart step
> explicitly with e.g.
>   --restart-step 1024
> ?
>
>
> Which exact version of PIConGPU did you use?
>
> We are currently not aware of any problems with restarts. If we can
> narrow down the source of your segfault this would be wonderful
> (although I suspect it might be in a third party lib), so if you are
> able to set up a small single-node example and run it under gdb as
> described at
>   https://github.com/ComputationalRadiationPhysics/picongpu/wiki/Debugging
>
> we might get a better understanding.
>
> That said, you can also add ADIOS to your compile chain, which will
> automatically use ADIOS .bp files for checkpointing & restarting, and you
> can still use HDF5 for regular output (or use ADIOS for both) in case
> HDF5 behaves too badly and you need to move fast.
>
>
> Axel
>
> On 13.02.2017 15:47, Khikhlukha Danila wrote:
>> Hi René,
>> sure, please see the attachment. Please let me know if more information is needed.
>>
>> D.
>> ________________________________________
>> From: picongpu-users@hzdr.de [picongpu-users@hzdr.de] on behalf of René Widera [r.widera@hzdr.de]
>> Sent: Monday, February 13, 2017 3:39 PM
>> To: picongpu-users@hzdr.de
>> Subject: Re:  [PIConGPU-Users] [PIConGPU-Users] Restart failure
>>
>> Dear Danila,
>>
>> could you please send us the `stdout`, `stderr` and the files from the
>> `tbg` folder?
>>
>> best,
>>
>> René
>>
>> On 02/13/2017 03:11 PM, Khikhlukha Danila wrote:
>>> Dear all,
>>> I have been trying to set up PoG on the Jureca machine. It all worked
>>> fine for the LWFA example; however, when I tried to restart the
>>> simulation I received a segfault almost immediately.
>>> My toolchain is as follows:
>>>
>>> GCC/5.4.0
>>> CUDA/8.0.44
>>> MVAPICH2/2.2-GDR
>>> HDF5/1.8.17
>>> Boost/1.61.0
>>>
>>> So, the first run didn't have any problems -- images, checkpoints and
>>> data dumps were created. When I tried to launch the restart, it crashed
>>> although I explicitly specified the checkpoint directory.
>>>
>>> test$ diff -r 0002/submit/ 0002_restart/submit/
>>> diff -r 0002/submit/0008gpus.cfg 0002_restart/submit/0008gpus.cfg
>>> 39c39
>>> < TBG_steps="-s 1024"
>>> ---
>>>> TBG_steps="-s 2048"
>>> 41a42
>>>> TBG_restart="--restart --restart-directory
>>> /work/hhh20/hhh20z/run_0002/simOutput/checkpoints"
>>> 67a69
>>>>                    !TBG_restart      \
>>>
>>> I also checked that it exists and is accessible. I tried to switch on some
>>> debug information with the following command:
>>>
>>> $PICSRC/configure -c"-DCMAKE_VERBOSE_MAKEFILE=ON -DPIC_VERBOSE_LVL=29
>>> -DPMACC_VERBOSE_LVL=7"
>>>
>>> However, I didn't find any information except the standard message:
>>> [jrc0007:mpi_rank_4][error_sighandler] Caught error: Segmentation fault
>>> (signal 11)
>>>
>>> Could you please advise me whether there is another way to diagnose the
>>> problem (other than launching gdb)? Maybe I'm doing something wrong?
>>> However, restart used to work on other machines...
>>>
>>>
>>> Thank you in advance,
>>> Danila.
>>>
>>
>> --
>> René Widera
>> Abteilung Laser-Teilchenbeschleunigung (FWKT)
>> Helmholtz-Zentrum Dresden-Rossendorf
>> Tel: +49 (0351) 260 3543
>> r.widera@hzdr.de
>> http://www.hzdr.de
>>
>> Vorstand: Prof. Dr. Dr. h. c. Roland Sauerbrey,
>>            Prof. Dr. Dr. h. c. Peter Joehnk
>> Vereinsregister: VR 1693 beim Amtsgericht Dresden
>>
>> #############################################################
>> This message is sent to you because you are subscribed to
>>   the mailing list <picongpu-users@hzdr.de>.
>> To unsubscribe, E-mail to: <picongpu-users-off@hzdr.de>
>> To switch to the DIGEST mode, E-mail to <picongpu-users-digest@hzdr.de>
>> To switch to the INDEX mode, E-mail to <picongpu-users-index@hzdr.de>
>> Send administrative queries to  <picongpu-users-request@hzdr.de>
>>
>>
>
> --
>
> Axel Huebl
> Phone +49 351 260 3582
> https://www.hzdr.de/crp
> Computational Radiation Physics
> Laser Particle Acceleration Division
> Helmholtz-Zentrum Dresden - Rossendorf e.V.
>
> Bautzner Landstrasse 400, 01328 Dresden
> POB 510119, D-01314 Dresden
> Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
>           Prof. Dr.Dr.h.c. P. Joehnk
> VR 1693 beim Amtsgericht Dresden
>
>

--

Axel Huebl
Phone +49 351 260 3582
https://www.hzdr.de/crp
Computational Radiation Physics
Laser Particle Acceleration Division
Helmholtz-Zentrum Dresden - Rossendorf e.V.

Bautzner Landstrasse 400, 01328 Dresden
POB 510119, D-01314 Dresden
Vorstand: Prof. Dr.Dr.h.c. R. Sauerbrey
          Prof. Dr.Dr.h.c. P. Joehnk
VR 1693 beim Amtsgericht Dresden
