## <span style="font-size:80%">Experience from implementing Fastjet clustering with the Julia language</span>
<span style="font-size:80%">Benedikt Hegner¹, <ins>Philippe Gras²</ins>, Graeme A. Stewart¹</span>
<span style="font-size:60%">¹CERN
²Irfu, CEA, Université Paris-Saclay
<!-- ²IPPP Durham University - TODO: not sure why that is there?-->
</span>
<span style="font-size:70%;color:CornflowerBlue">
Julia-for-HEP Discussion<br/>
HSF PyHEP working group<br/>
November 9, 2022
</span>
---
## Motivation
Jet clustering is high-CPU demanding at LHC due to the high particle multiplicity. Addressed by [Fastjet](https://fastjet.fr), that uses optimized algorithm.
<span style="color:CornflowerBlue;">**Jet clustering is a good testbench to evaluate Julia as a replacement of C++ for HEP applications.**</span>
---
## Two evaluations performed
One by Atell Krasnopolski, [IRIS-HEP fellow](https://iris-hep.org/fellows/gojakuch.html) Jun – Sep, 2022 mentored by Benedikt and Graeme. Based on the <span style="color:CadetBlue">**N² plain algorithm**</span>, adapted for low particle multiplicity, e.g. at LEP.
One by Philippe, based on the <span style="color:Chocolate">**N² tile algorithm**</span>, adapted to LHC events.
Fastjet includes more algorithms and by default selects the most suitable for the particle multiplicity of the event.
---
## Atell et al. excercise, plain N²
Work presented [here](https://indico.cern.ch/event/1195272/)
Constant profiling needed to catch performance bottlenecks not obvious to C/C++ programmers
Puzzling conclusion: Julia implementation indecently slower with respect to Fasjet.
 https://github.com/JuliaHEP/JetReconstruction.jl
---
## Update on Atell et al. excercise
Difference was due to use of different algorithms: plain N² for Julia and "Best".
_Benchmark done on 10 Pythia8 LHC QCD multijet events $\hat{p_{\textrm{T}}}>1\,$TeV_
---
## Update on Atell et al. excercise

The apples-to-apples comparison
---
## Atell et al. exercise conclusion
Julia code performance similar to the original C++ code
<span style="font-size:60%;">_Caveat: several optimizations implemented in Julia (fastmath, function value caching). Such implementation easy to implement in Julia. Would C++ profit of from the same optimization?_</span>
---
## Tiled N² exercise
Originally done for Philippe's talks at PyHEP2021, but not finished on time to be presented.
Work resumed last month.
 https://github.com/grasph/AntiKt.jl
---
## Tiled N² exercise
Started from FastJet code
Methodology:
1. Extracted code from Fastjet to produce a standalone C++ code with only the part to port to Julia;
2. Ported the code to Julia;
3. Tested, compared, improved.
---
## Tile N² algorithm
The ($\eta, \varphi$) plan is partitioned to limit the extend of the search area when looking for the nearest neighbours.
---
## Performance status
Tested on 100 QCD pp collision multijet events produced with Pythia8 at √s = 13 TeV
$$\frac{\textrm{Julia}}{\textrm{C++}} = \frac{318 ± 5 μs}{263 ± 9 μs} = 1.2$$
---
## Optimization history
---
### Preamble
Fastjet code has been optimised over years.
Every single clock tic accounted: e.g. pointer used instead of arrays because of slight better measured performance; $\Delta \varphi$ function optimised; ...
$\Rightarrow$ Not expecting to get same performance from the first implementation
---
## First debugged implementation
x5.4
---
## Added a type annotations in a `struct`
x4.4
To be considered as a bug/mistake: annotating the struct fields was part of our coding rules.
---
### Fixed type unstability of an iterator over a Union{T, Nothing} collection
x2.7
Tricky: a common Julia pattern to replace `nullptr`.
<span style="font-size:80%;">The Julia manual says _``The compiler is able to generate efficient code when working with Union{T, Nothing} arguments or fields.''_</span>
Here it was a <span style="color:crimson;">**collection of Union{T, Nothing}**</span> in a return type...
---
## Removed the `Union{Nothing, X}`types (x2)
x1.8
Even better without Union{Nothing, X} collection at all!
---
How the `Union{Nothing, X}` was removed ?
```julia=
mutable struct TiledJet
...
NN::TiledJet # ← recursive type
previous::TiledJet # ← recursive type
next::TiledJet # ← recursive type
TiledJet(::Type{Nothing}) = begin
t = new(...)
t.NN = t.previous = t.next = t # ← Not mandatory
end
end
# The "null" reference:
const noTiledJet::TiledJet = TiledJet(Nothing)
```
---
## Replaced a Vector by a static array
x1.5
A common optimization.
---
## Optimised tile neighbour navigation
x1.3
Tile navigation implemented in the C++ code with chained lists and pointers. First implementation tried to mimic it. Reimplemented using index manipulation without use of precalculated tile links.
---
## Removed several boundary checks base on profiling
(`@inbounds`, `unsafe_trunc`)
Small improvement (4%)
x1.2
The best performance so far.
---
## The remaining 20%
Code profiling has not permitted so far to find the origin
---
## Code profiling
<span style="text-align:left;">
<ul>
<li>
Run profiler from Julia standard libraries</span></li>
<ul><li><i>Used `perf` for C++</i></li></ul>
<li> Wrote small test codes when a feature was suspected to be slower in Julia.</li>
</ul>
</span>
---
## Profiler data analysis
### Julia
Easy-to-use tools to produce Flamegraph and reports
* Occurence of GC and runtime dispatch identified in Flame graph by colors.
* Report (time spent in each function and line of code) with hyperlinks produce with StatProfilerHTML.
---
## Profiler data analysis
### C++
Used¹ `perf record` command and [FlameGraph](https://www.brendangregg.com/flamegraphs.html).
Note:
Note: it was more difficult for C++ to find the tools and learn to use them.
<br/>
¹ [hotspot](https://github.com/KDAB/hotspot) is an excellent GUI alernative.
---
## Profiler data analysis
### C++
Used¹ `perf record` command and [FlameGraph](https://www.brendangregg.com/flamegraphs.html).
Note: it was more difficult for C++ to find the tools and learn to use them.
¹ [hotspot](https://github.com/KDAB/hotspot) is an excellent GUI alernative.
---
## Flame graph
<iframe src="https://pgras.web.cern.ch/codim/2022-11-09/profsvg.svg"
width="1024" height="512" style="border:none;"></iframe>
<span style="font-size:50%">Red: runtime dispatch; Brown: GC; Blue: OK</span>
---
## Review from Oliver
_<span style="font-size:70%;">"I've played around with AntiKt a bit, here's a PR that
reduces mem-alloc-cost by about 50% (there's room for more):<br/>
Profiling looks pretty "clean" now - most of the time
is spent in "search for-loops" that should take almost
exactly the same amount of time in C++."</span>_
---
## The for-loop
Looks for the two pseujets to combine at each interation.
The most costly part.
<span style="color:CornflowerBlue;">Peforms as well as C++: checked with profiler and by replacing it by a call to c code.</span>
---
## Learnt Experienced
---
## Type instability
---
Type instability was not difficult to handle in this exercise, apart the collections of `Union{T, Nothing}` (due to lack of experience).
Can be more difficult in some use case like "recursive" data structure: see Pere's talk.
---
Followed the [Julia manual guidelines](https://docs.julialang.org/en/v1/manual/performance-tips/) while writing the code.
---
Used type annotation for all fields of structs. For functions, only when required for multiple dispatch (i.e. specialization of a function for a set of argument types).
---
Many well-done tool to check this: `@code_warnig`, `Test.@inferred`, indication in Flame graphs.
---
## Runtime dispatch
---
Related with type instability
When the dispatch cannot be resolved at compile (just-in) time, it's resolved at runtime
Expensive:
* x24 C++ virtual call from our tests
* x7 if done explicity with a "switch" on the type.
---
In this use case, dispatches resolved nicely at compiled time.
---
## Memory allocations
---
Tricky part for C++ developers
---
## Fast for immutable
## Slow for mutable
---
### Measurements for heap allocation of a struct with one Int/long field:
* Ref. C+: 3 μs
* Fast for immutable: ~20 ns
* Slow for mutable: ~20 μs
<span style="font-size:60%;">(numbers to be cross-checked)</span>
---
### Immutables are faster to allocate but...
### ...changing a value requires a copy
Cost increases with struct size<br/> (immutable vs mutable):
* Already 0.56 ns vs 0.37 ns 16 bytes
* 2.14 ns vs 0.95 ns for 160 bytes
---
## Convenient macro to do change-field-and-copy
```julia=
using Accessors
a = @set a.x = 3
```
---
## Use of an immutable is not always easy
```julia=
mutable struct PseudoJet<:FourMomentum
# construct a pseudojet from explicit components
px::Float64
py::Float64
pz::Float64
E::Float64
_cluster_hist_index::Int
_rap::Float64 # ← Cache computed at first need
_phi::Float64 # ← ...
_pt2::Float64
end
```
---
Immutable ``copy'' is cheap
⇒ need to think differently than with C++
---
## Time spread
---
Large spread of timing measurement >> C++
Makes timing more difficult
Mitigated by the BenchmarkTools benchmark statistical tool.
---
## Conclusions
---
Similar performance as original C++ code for plain N² algorithm
Fair performance for Tailed N² algorithm: 20% slower than C++
→ would need to find the origin of the difference to draw a final conclusion.
---
## When used in place of Python, performance comes from free.
---
## When used in place of C++ for high-performance, the high-level comes with a price.
---
### The price,
<ul style="font-size:80%;">
<li>Less clear where time is spent, when comes to optimization;</li>
<li>Need of different programming practise than with C++.</li>
<br/>
</ul>
### But, easy-to-use tools and excellent community support.
### Julia performance continuously improving.
{"type":"slide","slideOptions":{"transition":"slide"},"title":"Experience from implementing Fasjet clustering with the Julia language"}