Geant4 Benchmarking and Profiling
The Geant4 (G4) Computing Performance Task (G4CP Task) is part of
the Geant4 testing and QA group. The charge of the G4CP Task is to
monitor G4 software through its development cycle for expected and unexpected
changes in computing performance, identify problems and opportunities for code
improvement and optimization, and communicate the results and findings to the
appropriate G4 working group leaders and the Steering Board.
We propose the following Geant4 benchmarking/profiling protocol for
timing and profiling the official, reference and candidate releases:
Contents
- The Goal
- Tools and Observables
- Applications to be used
- Event Samples
- Physics List
- Hardware Platform
- Estimate of the number of benchmarking "runs"
- Procedure
- Additions for Multithreaded Applications (Proposed)
- Reports
1. The Goal
- provide general Geant4 benchmarking/profiling information to the Geant4 developers
2. Tools and Observables
- timing tools: G4Timer or other generic timing tools (POSIX timers, rusage)
- average event timing for a given architecture/kernel/OS:
(user time, system time, and real elapsed time between the beginning and the end of each event; see the sketch at the end of this section)
- CPU performance profiling tools: FAST/simple profiler (user's manual)
- FAST is a set of tools for collecting, managing, and analyzing data about code performance
- top 10-20 function/library counts/fractions profiles
- memory profiling tools: igprof (home page)
- igProf is a simple tool for measuring and analyzing application memory and performance characteristics
- memory profile (live, max, total, difference)
- live: the memory that has not been freed - snapshot of the heap, i.e. a heap profile
- max: the largest single allocation by any function
- total: the total amount of memory allocated by any function - a snapshot of poor memory locality
- difference: difference in live memory between N events - a snapshot of direct memory leaks
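As an illustration of the per-event timing observable above, here is a minimal sketch of a user event action that records user, system, and real time for each event with G4Timer; the class name TimingEventAction is a hypothetical example and is not part of any of the applications listed below.

    #include "G4Timer.hh"
    #include "G4UserEventAction.hh"
    #include "G4Event.hh"
    #include "G4ios.hh"

    // Hypothetical event action: starts a G4Timer at the beginning of each event
    // and prints the per-event user/system/real elapsed times at the end.
    class TimingEventAction : public G4UserEventAction {
    public:
      virtual void BeginOfEventAction(const G4Event*) { fTimer.Start(); }
      virtual void EndOfEventAction(const G4Event* event) {
        fTimer.Stop();
        G4cout << "Event " << event->GetEventID()
               << " user=" << fTimer.GetUserElapsed() << " s"
               << " system=" << fTimer.GetSystemElapsed() << " s"
               << " real=" << fTimer.GetRealElapsed() << " s" << G4endl;
      }
    private:
      G4Timer fTimer;
    };

The per-event numbers printed this way can then be averaged offline to obtain the average event timing quoted above.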
3. Applications to be used
- SimplifiedCalo - the common base for all the releases
(standard configuration file)
adapted from A. Dotti and modified to read PYTHIA input events and to add performance measures (timing, igprof)
(ReadMe)
Optional applications: they evolve with time; it may be difficult to profile many Geant4 releases with them:
- cmsExp - an extended application with CMS geometry and magnetic field
(standard configuration file)
cmsExp(ress) is a standalone Geant4 application using the GDML interface to read the CMS geometry
(ReadMe)
- CMSSW - as time permits, once the code is made compatible
- mu2e (geometry only?) - as time permits
Share the applications and tools used for performance benchmarking: put standalone applications
in the G4 benchmarks repository and relevant tools in the G4 tools repository, while production code
will remain available in local FNAL repositories.
4. Event Samples
A set of samples for different event types and configurations:
- single particles (electrons, negative pions, protons, anti-protons) at 1, 5, 10, 50 GeV
for each sample, 2000 events per job are generated on the fly
(or read from static input files if necessary); see the generator sketch at the end of this section
- general pp physics events (SimplifiedCalo/PYTHIA H->ZZ and CMSSW/Z' decay)
50 events of 14 TeV pp -> 500 GeV H->ZZ (Z->all decays) per job are processed from an input file
100 events of 14 TeV pp -> 700 GeV Z'->dijet (udsc quarks) per job are processed from an input file
additional samples may be prepared (QCD jets, ttbar, W/Z, SUSY, etc.)
- magnetic field ON for all samples
- magnetic field OFF only for single electrons and pions
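As a minimal sketch of how the on-the-fly single-particle samples could be generated, the following uses the standard G4ParticleGun; the class name SingleParticleGenerator and its constructor arguments are hypothetical placeholders, and the actual generator code lives in the benchmarking applications themselves.

    #include "G4VUserPrimaryGeneratorAction.hh"
    #include "G4ParticleGun.hh"
    #include "G4ParticleTable.hh"
    #include "G4ThreeVector.hh"
    #include "G4Event.hh"

    // Hypothetical generator for the single-particle samples: one particle of a
    // fixed type and energy (e.g. "pi-" at 10 GeV) is shot along +z in each event.
    class SingleParticleGenerator : public G4VUserPrimaryGeneratorAction {
    public:
      SingleParticleGenerator(const G4String& particle, G4double energy)
        : fGun(new G4ParticleGun(1)) {
        fGun->SetParticleDefinition(
            G4ParticleTable::GetParticleTable()->FindParticle(particle));
        fGun->SetParticleEnergy(energy);
        fGun->SetParticleMomentumDirection(G4ThreeVector(0., 0., 1.));
      }
      virtual ~SingleParticleGenerator() { delete fGun; }
      virtual void GeneratePrimaries(G4Event* event) {
        fGun->GeneratePrimaryVertex(event);
      }
    private:
      G4ParticleGun* fGun;
    };

    // Example usage: SingleParticleGenerator("pi-", 10*GeV), then /run/beamOn 2000.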
5. Physics List
- FTFP_BERT (default) for all samples
- QGSP_BERT, QGSP_BIC and LHEP for single pions only
- the default physics list can be changed at the discretion of the testing and QA team
6. Hardware Platform
- platform to run profiling jobs at FNAL
- current: condor batch queues which consist of Intel(R) Xeon(R) E5430 (CPU 2666 MHz) and
Quad-Core AMD Opteron(tm) Processor 2389 (CPU 2914 MHz)
- future: PBS batch queues which consist of 5 x 32-Core AMD Opteron(tm) Processor 6128 (CPU 2000 MHz)
- option for a partial mode with a selected list of samples for quick results
7. Estimate of the number of benchmarking "runs"
The above would mean the following number of benchmarking jobs for a "standard" SimplifiedCalo run:
- simple profiler with SimplifiedCalo for single particle samples:
- 4 (particle types) x 4 (energies) event samples with magnetic field ON times 1 "standard" physics list = 16
- 2 (e- and pion-) x 4 (energies) event samples with magnetic field OFF times 1 "standard" physics list = 8
- 1 (pion-) x 4 (energies) event samples with magnetic field ON times 3 "additional" physics lists = 4 x 3 = 12
for a total of 36 runs (note that a single "run" is actually 32 jobs on a batch system)
which means about 36 x 32 = 1152 actual batch jobs - it may be more for inhomogeneous
computing clusters to get enough statistics for a given architecture/configuration
- simple profiler with SimplifiedCalo for general physics events:
- 1 sample with magnetic field ON times 1 "standard" physics list = 1
1 run is equivalent to 128 jobs on a batch system
- igprof: (a single job is enough to profile memory usage over N events)
- same samples as above, but only one job per sample, which gives 37 jobs
- processing time for profiling all samples with SimplifiedCalo (excluding time for the post analysis)
- projected CPU hours per job on the future platform: ~8 hours for physics events and <1 hour for each single particle sample
- total CPU time: 2220 hours = 1152 x 1 (single-particle timing jobs) + 128 x 8.0 (physics-event timing jobs) + 36 x 1 (single-particle igprof jobs) + 1 x 8.0 (physics-event igprof job)
- estimated processing time = 2220/(5x32 cores) = ~14 hours
- keep the total processing time to finish the entire profiling jobs for a given release to less than 1 day
- keep the post analysis time to less than 1 day
8. Procedure
- apply the profiling protocol defined above to each major development release (approximately 5 times a
year: one reference release between January and the beta release (~March),
the beta release (June), two reference releases before the candidate release between June and
November (more toward November), and the candidate release).
- perform a visual scan of the results, report a summary and any obvious performance issues to the
relevant Geant4 working group leaders, and communicate with them to perform further analysis
or to sign off on the performance check for the release.
- communicate results and actions after each benchmark exercise to the Steering Board.
9. Additions for Multithreaded Applications (Proposed)
- build Geant4 and the applications in multithreaded mode
- applications: cmsExpMT or SimplifiedCaloMT
- repeat the profiling/benchmarking task for multithreaded applications with a single thread
- tools: FAST/igprof (if possible), otherwise use Open|SpeedShop (or HPCToolkit)
- physics list: FTFP_BERT (default)
- samples: Higgs->ZZ, single pions and electrons (1, 5, 10, 50 GeV) - 9 samples x 1 job
- run scalability tests for event time and memory as a function of
the number of threads (1, 2, 4, 8, 16, 32); see the sketch at the end of this section
- samples: single pions and electrons (5, 50 GeV) - 2 samples x 6 jobs
- plot throughput (events/sec) versus the number of threads and verify linearity
- plot memory (RSS-shared) versus the number of threads
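A minimal sketch of one scalability point is given below, assuming the G4MTRunManager interface of multithreaded Geant4; the helper name RunScalabilityPoint and the commented-out user classes (which would be those of cmsExpMT or SimplifiedCaloMT) are placeholders, not part of the actual applications.

    #include "G4MTRunManager.hh"
    #include "G4Timer.hh"
    #include "G4ios.hh"

    // Hypothetical driver for one scalability point: run nEvents of the same
    // sample with a given number of worker threads and report the throughput
    // (events/sec), to be plotted against the thread count.
    void RunScalabilityPoint(G4int nThreads, G4int nEvents) {
      G4MTRunManager* runManager = new G4MTRunManager;
      runManager->SetNumberOfThreads(nThreads);  // 1, 2, 4, 8, 16, 32
      // runManager->SetUserInitialization(new DetectorConstruction);  // application's own
      // runManager->SetUserInitialization(new FTFP_BERT);             // default physics list
      // runManager->SetUserInitialization(new ActionInitialization);  // application's own
      runManager->Initialize();

      G4Timer timer;
      timer.Start();
      runManager->BeamOn(nEvents);
      timer.Stop();

      G4cout << nThreads << " threads: "
             << nEvents / timer.GetRealElapsed() << " events/s" << G4endl;
      delete runManager;
    }

Repeating this for each thread count yields the points for the throughput-versus-threads plot; the memory (RSS-shared) measurement would be taken separately for each run.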
10. Reports
- maintain a web page containing summaries/plots and detailed benchmarking results
(see an example web page here)
- provide or maintain links to the relevant information related to the profiling tasks/tools
(tool description, processes/configuration details, log files etc.)
- look for a storage system (enstore or database?) for archiving raw data and analysis results
We would start the above procedure with Geant4 release 9.5; we would also
re-benchmark 9.4, 9.4.p01, and 9.4.p02 for continuity purposes.