Experience from implementing Fasjet clustering with the Julia language

## Experience from implementing Fastjet clustering with the Julia language Benedikt Hegner¹, <ins>Philippe Gras²</ins>, Graeme A. Stewart¹ ¹CERN ²Irfu, CEA, Université Paris-Saclay  Julia-for-HEP Discussion HSF PyHEP working group November 9, 2022 --- ## Motivation Jet clustering is high-CPU demanding at LHC due to the high particle multiplicity. Addressed by [Fastjet](https://fastjet.fr), that uses optimized algorithm. **Jet clustering is a good testbench to evaluate Julia as a replacement of C++ for HEP applications.** --- ## Two evaluations performed One by Atell Krasnopolski, [IRIS-HEP fellow](https://iris-hep.org/fellows/gojakuch.html) Jun – Sep, 2022 mentored by Benedikt and Graeme. Based on the **N² plain algorithm**, adapted for low particle multiplicity, e.g. at LEP. One by Philippe, based on the **N² tile algorithm**, adapted to LHC events. Fastjet includes more algorithms and by default selects the most suitable for the particle multiplicity of the event. --- ## Atell et al. excercise, plain N² Work presented [here](https://indico.cern.ch/event/1195272/) Constant profiling needed to catch performance bottlenecks not obvious to C/C++ programmers Puzzling conclusion: Julia implementation indecently slower with respect to Fasjet. ![](https://codimd.web.cern.ch/uploads/upload_355a7e9fc665d1c54f303328ad4ed9ec.png =40x) https://github.com/JuliaHEP/JetReconstruction.jl --- ## Update on Atell et al. excercise Difference was due to use of different algorithms: plain N² for Julia and "Best". _Benchmark done on 10 Pythia8 LHC QCD multijet events $\hat{p_{\textrm{T}}}>1\,$TeV_ --- ## Update on Atell et al. excercise ![](https://codimd.web.cern.ch/uploads/upload_92288bca1dd83e928c09b4e1493ee987.png =60%x) The apples-to-apples comparison --- ## Atell et al. exercise conclusion Julia code performance similar to the original C++ code _Caveat: several optimizations implemented in Julia (fastmath, function value caching). Such implementation easy to implement in Julia. Would C++ profit of from the same optimization?_ --- ## Tiled N² exercise Originally done for Philippe's talks at PyHEP2021, but not finished on time to be presented. Work resumed last month. ![](https://codimd.web.cern.ch/uploads/upload_355a7e9fc665d1c54f303328ad4ed9ec.png =40x) https://github.com/grasph/AntiKt.jl --- ## Tiled N² exercise Started from FastJet code Methodology: 1. Extracted code from Fastjet to produce a standalone C++ code with only the part to port to Julia; 2. Ported the code to Julia; 3. Tested, compared, improved. --- ## Tile N² algorithm The ($\eta, \varphi$) plan is partitioned to limit the extend of the search area when looking for the nearest neighbours. --- ## Performance status Tested on 100 QCD pp collision multijet events produced with Pythia8 at √s = 13 TeV $$\frac{\textrm{Julia}}{\textrm{C++}} = \frac{318 ± 5 μs}{263 ± 9 μs} = 1.2$$ --- ## Optimization history --- ### Preamble Fastjet code has been optimised over years. Every single clock tic accounted: e.g. pointer used instead of arrays because of slight better measured performance; $\Delta \varphi$ function optimised; ... $\Rightarrow$ Not expecting to get same performance from the first implementation --- ## First debugged implementation x5.4 --- ## Added a type annotations in a `struct` x4.4 To be considered as a bug/mistake: annotating the struct fields was part of our coding rules. --- ### Fixed type unstability of an iterator over a Union{T, Nothing} collection x2.7 Tricky: a common Julia pattern to replace `nullptr`. The Julia manual says _``The compiler is able to generate efficient code when working with Union{T, Nothing} arguments or fields.''_ Here it was a **collection of Union{T, Nothing}** in a return type... --- ## Removed the `Union{Nothing, X}`types (x2) x1.8 Even better without Union{Nothing, X} collection at all! --- How the `Union{Nothing, X}` was removed ? ```julia= mutable struct TiledJet ... NN::TiledJet # ← recursive type previous::TiledJet # ← recursive type next::TiledJet # ← recursive type TiledJet(::Type{Nothing}) = begin t = new(...) t.NN = t.previous = t.next = t # ← Not mandatory end end # The "null" reference: const noTiledJet::TiledJet = TiledJet(Nothing) ``` --- ## Replaced a Vector by a static array x1.5 A common optimization. --- ## Optimised tile neighbour navigation x1.3 Tile navigation implemented in the C++ code with chained lists and pointers. First implementation tried to mimic it. Reimplemented using index manipulation without use of precalculated tile links. --- ## Removed several boundary checks base on profiling (`@inbounds`, `unsafe_trunc`) Small improvement (4%) x1.2 The best performance so far. --- ## The remaining 20% Code profiling has not permitted so far to find the origin --- ## Code profiling <ul> <li> Run profiler from Julia standard libraries</li> <ul><li>Used `perf` for C++</li></ul> <li> Wrote small test codes when a feature was suspected to be slower in Julia.</li> </ul> --- ## Profiler data analysis ### Julia Easy-to-use tools to produce Flamegraph and reports * Occurence of GC and runtime dispatch identified in Flame graph by colors. * Report (time spent in each function and line of code) with hyperlinks produce with StatProfilerHTML. --- ## Profiler data analysis ### C++ Used¹ `perf record` command and [FlameGraph](https://www.brendangregg.com/flamegraphs.html). Note: Note: it was more difficult for C++ to find the tools and learn to use them. ¹ [hotspot](https://github.com/KDAB/hotspot) is an excellent GUI alernative. --- ## Profiler data analysis ### C++ Used¹ `perf record` command and [FlameGraph](https://www.brendangregg.com/flamegraphs.html). Note: it was more difficult for C++ to find the tools and learn to use them. ¹ [hotspot](https://github.com/KDAB/hotspot) is an excellent GUI alernative. --- ## Flame graph <iframe src="https://pgras.web.cern.ch/codim/2022-11-09/profsvg.svg" width="1024" height="512" style="border:none;"></iframe> Red: runtime dispatch; Brown: GC; Blue: OK --- ## Review from Oliver _"I've played around with AntiKt a bit, here's a PR that reduces mem-alloc-cost by about 50% (there's room for more): Profiling looks pretty "clean" now - most of the time is spent in "search for-loops" that should take almost exactly the same amount of time in C++."_ --- ## The for-loop Looks for the two pseujets to combine at each interation. The most costly part. Peforms as well as C++: checked with profiler and by replacing it by a call to c code. --- ## Learnt Experienced --- ## Type instability --- Type instability was not difficult to handle in this exercise, apart the collections of `Union{T, Nothing}` (due to lack of experience). Can be more difficult in some use case like "recursive" data structure: see Pere's talk. --- Followed the [Julia manual guidelines](https://docs.julialang.org/en/v1/manual/performance-tips/) while writing the code. --- Used type annotation for all fields of structs. For functions, only when required for multiple dispatch (i.e. specialization of a function for a set of argument types). --- Many well-done tool to check this: `@code_warnig`, `Test.@inferred`, indication in Flame graphs. --- ## Runtime dispatch --- Related with type instability When the dispatch cannot be resolved at compile (just-in) time, it's resolved at runtime Expensive: * x24 C++ virtual call from our tests * x7 if done explicity with a "switch" on the type. --- In this use case, dispatches resolved nicely at compiled time. --- ## Memory allocations --- Tricky part for C++ developers --- ## Fast for immutable ## Slow for mutable --- ### Measurements for heap allocation of a struct with one Int/long field: * Ref. C+: 3 μs * Fast for immutable: ~20 ns * Slow for mutable: ~20 μs (numbers to be cross-checked) --- ### Immutables are faster to allocate but... ### ...changing a value requires a copy Cost increases with struct size (immutable vs mutable): * Already 0.56 ns vs 0.37 ns 16 bytes * 2.14 ns vs 0.95 ns for 160 bytes --- ## Convenient macro to do change-field-and-copy ```julia= using Accessors a = @set a.x = 3 ``` --- ## Use of an immutable is not always easy ```julia= mutable struct PseudoJet<:FourMomentum # construct a pseudojet from explicit components px::Float64 py::Float64 pz::Float64 E::Float64 _cluster_hist_index::Int _rap::Float64 # ← Cache computed at first need _phi::Float64 # ← ... _pt2::Float64 end ``` --- Immutable ``copy'' is cheap ⇒ need to think differently than with C++ --- ## Time spread --- Large spread of timing measurement >> C++ Makes timing more difficult Mitigated by the BenchmarkTools benchmark statistical tool. --- ## Conclusions --- Similar performance as original C++ code for plain N² algorithm Fair performance for Tailed N² algorithm: 20% slower than C++ → would need to find the origin of the difference to draw a final conclusion. --- ## When used in place of Python, performance comes from free. --- ## When used in place of C++ for high-performance, the high-level comes with a price. --- ### The price, <ul style="font-size:80%;"> <li>Less clear where time is spent, when comes to optimization;</li> <li>Need of different programming practise than with C++.</li> </ul> ### But, easy-to-use tools and excellent community support. ### Julia performance continuously improving.