The fight was hard and long, but we succeeded. I accumulated the problems: gforge was down for a while (preventing me to efficiently code on my laptop and test on a 64bits box where the bug was triggered), my connexion went mad (make this even harder) and I accidentally erased my git working directory (loosing all the branches and uncommited changes). But we fixed the data marshaling madness. So, I'll release the 3.3.3 version of SimGrid soon (when git is done integrating the svn tree again).
The bug was vicious because it is triggered in an awful part of the SimGrid code: the GRAS data marshaler. It copies a pointer forest from one host to another through the network. The code is very difficult to read because it's full of weird dereferencing, such as
void some_function(char *dst) {
*(void **) dst = l_referenced;
}
I know it's my fault, I should have written this code with the correct type declarations to functions, but it now works and it's 1000 lines of such craziness. I don't want to change it now unless forced.
The bug was triggered by the cycle detection. This is because the marshaling works even if requested to pass a pointers forest containing cycles (ie, situations where a->b->c = a). For that, when asked to pass a reference (a pointer) over the network , it checks where it was already passed and if yes, don't actually pass it. For that, it uses a hash table (actually, a xbt_dict_t) where the key is the pointer to be sent and the value the local pointer corresponding. So, when I want to pass 0x1234 for example, I check whether 0x1234 is already an entry of the hash table and if yes, I know that it is a cycle. The sender don't send the actual content of the pointer and the receiver use the content of the hash as pointer content. If that address is not already in the hash, I do pass it over the network and add it to the hash. Well, it's not very clear, but it's also because the code is complicated.
I spent an awful amount of time trying to understand again this code, adding debug logs and so on. And then Cristian had a look at it and recognized a bug he already experienced: somewhere in the dict, the key was copied with strcpy. So when using a binary key, the copy stopped on the first \0x0.
That is why cycle detection was broken from time to time, and more often on 64bits: it's not always that the pointer contain a 0x0 (because linux loads dynamic libraries at random position in memory to make harder to write overflow attacks), and since pointers are longer on 64bits, the probability to contain a 0x0 is larger there. At the end, the fix is a simple as :
- strncpy(element->key, key, key_len);
+ memcpy((void *)element->key, (void *)key, key_len);
Damn it. For the record, Cristian already suffered this bug, and even committed a fix on 2008-11-25, but on an experimental branch that was abandoned since then. Our workflow with git definitely needs improvement...
Anyway, the svn import is now finished, I go release that damn 3.3.2 version.