(this post is close to my mail on the simgrid-user mailing list, but a bit more detailed)
This is the "Need for Speed" release
I know we did a release yesterday, but the timings done to validate it then were faulty. Instead of being 5% faster as announced, v3.3.2 is 15% slower (compared to 3.3.1).
We got quite upset about this with Cristian, and we struggled to find where it comes from. Yesterday, we were observing a slowdown of 15%, with very strange things (such as functions marked inline but not inlined). But actually, that disappeared after a recompilation. I guess I passed some bad options to ./configure yesterday.
We really need some way to retrieve which were the compilation arguments passed to a library from the .so file. A while ago, I seem to remember that libpthread had an entry point, displaying its configuration when you tried to execute it directly (like typing ./libpthread.so). But when I tried to implement this in simgrid, libtool came in my way and prevented me to pass the right arguments to the linker. We should try it again (and succeed, if possible).
About the performance stagnation instead of improvement we had between v331 and v332, the funny thing is that it had nothing to do with all the modularity we added through refactoring. Instead, the problem was a conversion from a manually handled vector to xbt_dynar_t on the critical path. Since xbt_dynar_foreach calls functions to increment the cursor and retrieve the element, it induced a whole bunch of stack management crap (function calls).
We inlined these functions and xbt_dynar_foreach is now breath taking. We also inlined xbt_swag_belong on the way because kcachegrind meant that it was the most often called function of the simulation (about 106 times per second) and was a one-line accessor.
Here are some approximate speedup measurements (on master/slaves simulations lasting between 10s and 20s each, rounded to 5%):
From To Speedup
v3.3.1 v3.3.2 almost none
v3.3.2 v3.3.3 40%
v3.3.1 v3.3.3 40%
v3.3.1 (with inline patch) v3.3.3 30%
Our reading is that the refactoring which occurred in 3.3.2 made us suffer much more from the xbt_dynar_foreach low performance. But once we solved this, this refactoring proved to be very performance effective. From the 40% speedup between v3.3.1 and v3.3.3, I like to think that 10% are due to the inlining and 30% to the refactoring.
About inlining
That's a pitty that gcc cannot inline functions placed in other files alone. We have to choose between:
- break the encapsulation (by putting private data structures and accessors in headers files to help gcc)
- live with low performance
- switch to a decent compiler such as icc (not quite possible because of its licence).
Actually, another solution may be possible. I played a while trying to setup a "SuperNovae compilation mode" into SimGrid. The idea would be to have only one C file including all the source files composing the library using the preprocessor #include directive.
Of course, that would not be really pleasant to hack onto SimGrid since every compilation would last a few minutes, but it would give this stupid gcc the opportunity to inline all the little accessor functions we have. When in maintainer-mode, we would compile as it is now, and would only change for the users, when compiling from a tarball. Naturally, the "SuperNovae" name come from the fact that we throw all source files into only one file so that they meld and fusion together... I love stupid names, sorry.
But my attempt (20mn) did not succeed:
There is static variables and functions of the same name in several files of SimGrid (every network model defines a communicate() function). Of course, it does not work when everything is thrown in the same file. But that's easy to fix: change the names to avoid the conflicts.
We have 2 flex generated parser and they also declare static variables of the same name. We need to put them in separate object files, but they are somehow disconnected from the rest of SimGrid already. Moreover, having only 2 separate files still qualifies for the SuperNovae name since the lib is about 50,000 lines these days.
Then, the logs kinda rely on the fact that there is a variable called _simgrid_log_category__default containing the address of the variable describing the log channel used by default. This does not work anymore when everything is in the same file. Instead, we need to use the preprocessor also to set the default log category:
#define simgrid_log_default xbt_dynar
#include "xbt/log.h"
Within log.h, we could redefine every LOG1...6() macro to use this simgrid_log_default value to choose where to log into.
That would be a bit crude, but if inlining only dynars traversal and one function of swags gives a dozen of percents, such a radical approach could gain a order of magnitude I think.
Or we could also do nothing and wait until someone comes and fix the issue in GCC directly
Future releases
Next release, by Arnaud+Bruno+Pedro will bring a very cleaver improvement to the solving function laying directly on the critical path. This should bring arbitrary large performance improvement if you build your scenarios right (ie, the speedup is directly proportional to an input parameter).
In other words, it will be real algorithmical achievement leading to algorithmic complexity reduction, not a software engineering improvement like Cristian and I did last month for a few dozen of percents, nor a cheap technical hack like the one I did this afternoon to circumvent a tool limitation