Geant4 Benchmarking and Profiling
The Geant4 (G4) Computing Performance Task (G4CP Task) is part of
the Geant4 testing and QA group. The charge of the G4CP Task is to
monitor G4 software through its development cycle for expected and unexpected
changes in computing performance, identify problems and opportunities for code
improvement and optimization, and communicate the results and findings to the
appropriate G4 working group leaders and the Steering Board.
We propose the following Geant4 benchmarking/profiling protocol for
timing and profiling the official, reference and candidate releases:
Contents
- The Goal
- Tools and Observables
- Applications to be used
- Event Samples
- Physics List
- Hardware Platform
- Estimate of the number of benchmarking "runs"
- Procedure
- Additions for Multithreaded Applications (Proposed)
- Reports
1. The Goal
- provide general Geant4 benchmarking/profiling information to the Geant4 developers
2. Tools and Observables
- timing tools: G4Timer or other generic timing tools (POSIX timers, rusage)
- average event timing for a given architecture/kernel/OS:
(user time, system time, and real elapsed time between the beginning and the end of each event; see the sketch at the end of this section)
- CPU performance profiling tools: FAST/simple profiler (user's manual)
- FAST is a set of tools for collecting, managing, and analyzing data about code performance
- top 10-20 function/library counts/fractions profiles
- memory profiling tools: igprof (home page)
- igProf is a simple tool for measuring and analyzing application memory and performance characteristics
- memory profile (live, max, total, difference)
- live: the memory that has not been freed - snapshot of the heap, i.e. a heap profile
- max: the largest single allocation by any function
- total: the total amount of memory allocated by any function - a snapshot of poor memory locality
- difference: difference in live memory between N events - a snapshot of direct memory leaks
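As an illustration of the per-event timing observable above, here is a minimal sketch of a user event action that records user, system, and real time for each event with G4Timer; the class name TimingEventAction is a hypothetical example and is not part of any of the applications listed below.

    #include "G4Timer.hh"
    #include "G4UserEventAction.hh"
    #include "G4Event.hh"
    #include "G4ios.hh"

    // Hypothetical event action: starts a G4Timer at the beginning of each event
    // and prints the per-event user/system/real elapsed times at the end.
    class TimingEventAction : public G4UserEventAction {
    public:
      virtual void BeginOfEventAction(const G4Event*) { fTimer.Start(); }
      virtual void EndOfEventAction(const G4Event* event) {
        fTimer.Stop();
        G4cout << "Event " << event->GetEventID()
               << " user=" << fTimer.GetUserElapsed() << " s"
               << " system=" << fTimer.GetSystemElapsed() << " s"
               << " real=" << fTimer.GetRealElapsed() << " s" << G4endl;
      }
    private:
      G4Timer fTimer;
    };

The per-event numbers printed this way can then be averaged offline to obtain the average event timing quoted above.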
3. Applications to be used
- SimplifiedCalo - the common base for all the releases
(standard configuration file)
adapted from A. Dotti and modified to read PYTHIA input events and to add performance measures (timing, igprof)
(ReadMe)
Optional applications: they evolve with time; it may be difficult to profile many Geant4 releases with them:
- cmsExp - an extended application with CMS geometry and magnetic field
(standard configuration file)
cmsExp(ress) is a standalone Geant4 application using the GDML interface to read the CMS geometry
(ReadMe)
- CMSSW - as time permits, once the code is made compatible
- mu2e (geometry only?) - as time permits
Share the applications and tools used for performance benchmarking: put standalone applications
in the G4 benchmarks repository and relevant tools in the G4 tools repository, while production code
will remain available in local FNAL repositories.
4. Event Samples
A set of samples for different event types and configurations:
- single particles (electrons, negative pions, protons, anti-protons) at 1, 5, 10, 50 GeV
for each sample, 2000 events per job are generated on the fly
(or read from static input files if necessary); see the generator sketch at the end of this section
- general pp physics events (SimplifiedCalo/PYTHIA H->ZZ and CMSSW/Z' decay)
50 events of 14 TeV pp -> 500 GeV H->ZZ (Z->all decays) per job are processed from an input file
100 events of 14 TeV pp -> 700 GeV Z'->dijet (udsc quarks) per job are processed from an input file
additional samples may be prepared (QCD jets, ttbar, W/Z, SUSY, etc.)
- magnetic field ON for all samples
- magnetic field OFF only for single electrons and pions
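As a minimal sketch of how the on-the-fly single-particle samples could be generated, the following uses the standard G4ParticleGun; the class name SingleParticleGenerator and its constructor arguments are hypothetical placeholders, and the actual generator code lives in the benchmarking applications themselves.

    #include "G4VUserPrimaryGeneratorAction.hh"
    #include "G4ParticleGun.hh"
    #include "G4ParticleTable.hh"
    #include "G4ThreeVector.hh"
    #include "G4Event.hh"

    // Hypothetical generator for the single-particle samples: one particle of a
    // fixed type and energy (e.g. "pi-" at 10 GeV) is shot along +z in each event.
    class SingleParticleGenerator : public G4VUserPrimaryGeneratorAction {
    public:
      SingleParticleGenerator(const G4String& particle, G4double energy)
        : fGun(new G4ParticleGun(1)) {
        fGun->SetParticleDefinition(
            G4ParticleTable::GetParticleTable()->FindParticle(particle));
        fGun->SetParticleEnergy(energy);
        fGun->SetParticleMomentumDirection(G4ThreeVector(0., 0., 1.));
      }
      virtual ~SingleParticleGenerator() { delete fGun; }
      virtual void GeneratePrimaries(G4Event* event) {
        fGun->GeneratePrimaryVertex(event);
      }
    private:
      G4ParticleGun* fGun;
    };

    // Example usage: SingleParticleGenerator("pi-", 10*GeV), then /run/beamOn 2000.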
5. Physics List
- FTFP_BERT (default) for all samples
- QGSP_BERT, QGSP_BIC and LHEP for single pions only
- the default physics list can be changed at the discretion of the testing and QA team
6. Hardware Platform
- platform to run profiling jobs at FNAL
- current: condor batch queues which consist of Intel(R) Xeon(R) E5430 (CPU 2666 MHz) and
Quad-Core AMD Opteron(tm) Processor 2389 (CPU 2914 MHz)
- future: PBS batch queues which consist of 5 x 32-Core AMD Opteron(tm) Processor 6128 (CPU 2000 MHz)
- option for a partial mode with a selected list of samples for quick results
7. Estimate of the number of benchmarking "runs"
The above would mean the following number of benchmarking jobs for a "standard" SimplifiedCalo run:
- simple profiler with SimplifiedCalo for single particle samples:
- 4 (particle types) x 4 (energies) event samples with magnetic field ON times 1 "standard" physics list = 16
- 2 (e- and pion-) x 4 (energies) event samples with magnetic field OFF times 1 "standard" physics list = 8
- 1 (pion-) x 4 (energies) event samples with magnetic field ON times 3 "additional" physics lists = 4 x 3 = 12
for a total of 36 runs (note that a single "run" is actually 32 jobs on a batch system)
which means about 36 x 32 = 1152 actual batch jobs - it may be more for inhomogeneous
computing clusters to get enough statistics for a given architecture/configuration
- simple profiler with SimplifiedCalo for general physics events:
- 1 sample with magnetic field ON times 1 "standard" physics list = 1
1 run is equivalent to 128 jobs on a batch system
- igprof: (a single job is enough to profile memory usage over N events)
- same samples as above, but only one job per sample, which gives 37 jobs
- processing time for profiling all samples with SimplifiedCalo (excluding time for the post analysis)
- projected CPU hours per job on the future platform: ~8 hours for physics events and <1 hour for each single particle sample
- total CPU time: 2220 hours = 1152 x 1 (single-particle timing jobs) + 128 x 8.0 (physics-event timing jobs) + 36 x 1 (single-particle igprof jobs) + 1 x 8.0 (physics-event igprof job)
- estimated processing time = 2220/(5x32 cores) = ~14 hours
- keep the total processing time to finish the entire profiling jobs for a given release to less than 1 day
- keep the post analysis time to less than 1 day
8. Procedure
- apply the profiling protocol defined above to each major development release (approximately 5 times a
year: one reference release between January and the beta release (~March),
the beta release (June), two reference releases before the candidate release between June and
November (more toward November), and the candidate release).
- perform a visual scan of the results, report a summary and any obvious performance issues to the
relevant Geant4 working group leaders, and communicate with them to perform further analysis
or to sign off on the performance check for the release.
- communicate results and actions after each benchmark exercise to the Steering Board.
9. Additions for Multithreaded Applications (Proposed)
- build Geant4 and the applications in multithreaded mode
- applications: cmsExpMT or SimplifiedCaloMT
- repeat the profiling/benchmarking task for multithreaded applications with a single thread
- tools: FAST/igprof (if possible), otherwise use Open|SpeedShop (or HPCToolkit)
- physics list: FTFP_BERT (default)
- samples: Higgs->ZZ, single pions and electrons (1, 5, 10, 50 GeV) - 9 samples x 1 job
- run scalability tests for event time and memory as a function of
the number of threads (1, 2, 4, 8, 16, 32); see the sketch at the end of this section
- samples: single pions and electrons (5, 50 GeV) - 2 samples x 6 jobs
- plot throughput (events/sec) versus the number of threads and verify linearity
- plot memory (RSS-shared) versus the number of threads
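A minimal sketch of one scalability point is given below, assuming the G4MTRunManager interface of multithreaded Geant4; the helper name RunScalabilityPoint and the commented-out user classes (which would be those of cmsExpMT or SimplifiedCaloMT) are placeholders, not part of the actual applications.

    #include "G4MTRunManager.hh"
    #include "G4Timer.hh"
    #include "G4ios.hh"

    // Hypothetical driver for one scalability point: run nEvents of the same
    // sample with a given number of worker threads and report the throughput
    // (events/sec), to be plotted against the thread count.
    void RunScalabilityPoint(G4int nThreads, G4int nEvents) {
      G4MTRunManager* runManager = new G4MTRunManager;
      runManager->SetNumberOfThreads(nThreads);  // 1, 2, 4, 8, 16, 32
      // runManager->SetUserInitialization(new DetectorConstruction);  // application's own
      // runManager->SetUserInitialization(new FTFP_BERT);             // default physics list
      // runManager->SetUserInitialization(new ActionInitialization);  // application's own
      runManager->Initialize();

      G4Timer timer;
      timer.Start();
      runManager->BeamOn(nEvents);
      timer.Stop();

      G4cout << nThreads << " threads: "
             << nEvents / timer.GetRealElapsed() << " events/s" << G4endl;
      delete runManager;
    }

Repeating this for each thread count yields the points for the throughput-versus-threads plot; the memory (RSS-shared) measurement would be taken separately for each run.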
10. Reports
- maintain a web page containing summaries/plots and detailed benchmarking results
(see an example web page here)
- provide or maintain links to the relevant information related to the profiling tasks/tools
(tool description, processes/configuration details, log files etc.)
- look for a storage system (enstore or database?) for archiving raw data and analysis results
We would start the above procedure with Geant4 release 9.5; we would also
re-benchmark 9.4, 9.4.p01, and 9.4.p02 for continuity purposes.