Cristian discovered that I made a mistake in the performance measurements when comparing v3.3.1 and v3.3.2 of SimGrid. I compared the constant time network model of the new version against the regular old version. That's quite unfair since this model (currently under test and to be put in production in 3.3.3 or later) is more than 20% faster than the regular one.
So, the new version is not 5% faster than the old one, but over 15% slower. It's not a small gain anymore, but a performance disaster! Ouch, that hurts.
(only keep reading if you want some hardcore technical details about the problem analysis we did so far...)
We launched kcachegrind to understand what's going on, and the surf refactoring seems to be the reason for this huge performance loss.
It seems that extracting the routing logic makes the creation of communication actions twice as slow as before. The get_route function is now called through a pointer where it used to be inlined, and it now needs an extra pointer dereference to access its data. That must be part of the reason.
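To make the suspected overhead concrete, here is a minimal sketch of the two call styles. It is not the actual surf code: route_table, routing_t, used_routing and the comm_route_* functions are hypothetical names invented for the illustration, and the "route" is reduced to a single integer.

```c
#include <assert.h>

#define N_HOSTS 4

/* Hypothetical flat routing table: route_table[src][dst] holds a link id. */
int route_table[N_HOSTS][N_HOSTS];

/* "Before" style: a static inline helper reading the global table directly,
 * so the compiler is free to inline it at every call site. */
static inline int get_route_direct(int src, int dst)
{
  assert(src >= 0 && src < N_HOSTS && dst >= 0 && dst < N_HOSTS);
  return route_table[src][dst];
}

/* "After" style: the routing logic sits behind a function pointer, and the
 * table itself is reached through one more pointer stored in the struct. */
typedef struct routing {
  int (*get_route)(const struct routing *self, int src, int dst);
  int (*table)[N_HOSTS];
} routing_t;

static int table_get_route(const routing_t *self, int src, int dst)
{
  assert(src >= 0 && src < N_HOSTS && dst >= 0 && dst < N_HOSTS);
  return self->table[src][dst];
}

routing_t used_routing = { table_get_route, route_table };

/* Stand-ins for "creating a communication action": both return the same
 * route, but the second one needs an indirect call and an extra load. */
int comm_route_direct(int src, int dst)
{
  return get_route_direct(src, dst);
}

int comm_route_indirect(const routing_t *routing, int src, int dst)
{
  return routing->get_route(routing, src, dst);
}
```

Both variants return the same value for the same arguments. The difference is only in what the compiler can do: the indirect call usually cannot be inlined, because the target is only known at run time.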
I still think that we need this logic extraction until we solve the scalability issue posed by the big fat routing table we have, but configuring it at run time may be useless. I guess we should select it at compilation time with a pair of #defines, and include surf_routing.c where needed so that gcc inlines these functions again. And once we find the right way to address the routing (Arnaud has interesting ideas), only one solution should be kept.
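A rough sketch of what such a compile-time selection could look like follows. The macro names (ROUTING_MODEL, ROUTING_TABLE, ROUTING_NONE) and the second "no routing" variant are assumptions made up for the example. In the real tree the inline definitions would come from including surf_routing.c; the sketch simply writes them in the same file to stay self-contained.

```c
#include <assert.h>

/* Hypothetical model identifiers, chosen at compilation time,
 * e.g. with -DROUTING_MODEL=ROUTING_NONE on the gcc command line. */
#define ROUTING_TABLE 1
#define ROUTING_NONE  2

#ifndef ROUTING_MODEL
#define ROUTING_MODEL ROUTING_TABLE
#endif

#define N_HOSTS 4

/* Hypothetical flat routing table: route_table[src][dst] holds a link id. */
int route_table[N_HOSTS][N_HOSTS];

#if ROUTING_MODEL == ROUTING_TABLE
/* Table-based variant: a plain lookup that gcc can inline. */
static inline int get_route(int src, int dst)
{
  assert(src >= 0 && src < N_HOSTS && dst >= 0 && dst < N_HOSTS);
  return route_table[src][dst];
}
#elif ROUTING_MODEL == ROUTING_NONE
/* Placeholder variant: no routing information at all. */
static inline int get_route(int src, int dst)
{
  (void)src;
  (void)dst;
  return -1;
}
#else
#error "unknown ROUTING_MODEL"
#endif

/* The caller uses get_route() directly: no function pointer is involved,
 * so the selected implementation can be inlined here. */
int comm_route(int src, int dst)
{
  return get_route(src, dst);
}
```

The price is that the routing model can no longer be changed without recompiling, which is the trade-off discussed above.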
It's always a bit difficult to compare performance with callgrind since it counts cycles rather than elapsed time. For example, it reports that SIMIX_get_host_by_name now takes about 2,400,000 cycles vs. 1,300,000 before, even though this code and its dependencies were not modified (as far as I remember). But actually, kcachegrind reports 1,000,000,000 cycles for the whole new main vs ~800,000,000 for the old one. That's about the performance loss we observe, so kcachegrind is still relevant (I need another excuse).
A strange thing is that every function seems to be slower. A very troubling element is that the function xbt_dict_get only calls strcmp in 3.3.1, whereas it also calls the function xbt_dict_hash in 3.3.2. Using that function is completely fine and nothing changed here, but it should be marked inline and thus not appear in the kcachegrind output...
I'm now suspecting some stupid error. Something like a breakage in the compiler detection that makes our portability layer decide not to inline any function and/or not to pass all the compilation flags to gcc...
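For illustration, this is the kind of mechanism I have in mind, written as a standalone sketch and not as the real portability header. MY_INLINE, MY_INLINE_ENABLED, my_hash and lookup_hash are hypothetical names, and the hash is a simple multiply-by-33 stand-in, not the actual xbt_dict_hash.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical portability macro: the inline hint depends on which
 * compiler was detected. If the detection breaks, we silently end up
 * in the last branch and nothing is marked inline anymore. */
#if defined(__GNUC__) || defined(__clang__)
#  define MY_INLINE static inline
#  define MY_INLINE_ENABLED 1
#elif defined(_MSC_VER)
#  define MY_INLINE static __inline
#  define MY_INLINE_ENABLED 1
#else
#  define MY_INLINE static
#  define MY_INLINE_ENABLED 0
#endif

/* Stand-in hash function (multiply by 33, add the character). */
MY_INLINE unsigned int my_hash(const char *key)
{
  unsigned int hash = 5381u;
  assert(key != NULL);
  while (*key != '\0')
    hash = hash * 33u + (unsigned char)*key++;
  return hash;
}

/* Caller: when my_hash is really inlined, a profiler shows only this
 * function; otherwise my_hash appears as a separate callee. */
unsigned int lookup_hash(const char *key)
{
  return my_hash(key);
}

/* Reports which branch of the detection was taken. */
int inline_hint_enabled(void)
{
  return MY_INLINE_ENABLED;
}
```

Note that even with the hint in place, gcc does not inline such functions when optimization is disabled, so missing compilation flags would produce the same symptom as a broken detection.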
But it's now time to sleep.