simgrid 3.3.2 progress

The next release of simgrid is really getting into shape. We could even have released it today if:

I hadn't spent so much time this morning hacking around to get ikiwiki installed
Gforge hadn't had a (scheduled) downtime this evening

What's new in this release

This version achieves the following main points:

I restructured and simplified surf:

writing new models should now get easier (I mean, possible for people other than the author of this code)
the routing logic is now externalized from the network modeling. It already allows some people to test new ideas in this area, which represents a long-standing scalability limit. Some of Silas's new models are already in.

Cristian restructured and simplified simix:

we could maybe add new context handlers (such as Windows fibers), although it is not on our todo list currently. I mean, we'd better make the Windows port more robust before thinking about its performance...
This would only partially help to write more bindings: I think it would be more useful to have the C++ bindings working and add a layer of SWIG over them. I mean, both are mandatory, but the C++ bindings were never integrated into the build chain, and fixing this is not high on my TODO list...
It was absolutely mandatory for Cristian to work on model-checking in simgrid: the prototype we have works by saving the whole Unix process running the simulation. We really need to save the state of each simulated process to get a finer granularity, and this simix restructuring is the first step on this road.

Stephane Genaud continued his work on SMPI.

There still seem to be some issues with it, but it's also getting into good shape. I have been promising Stephane for weeks to help him debug some of the remaining issues, but I haven't found the time to do so yet.

VPATH support much improved.

A while ago, automake was unable to build Java sources under VPATH, so I gave up this neat feature, which lets us build the pthread and ucontext backends at the same time in separate directories. But I tested again recently, and it now works. So, I fixed all the cruft that had accumulated over time on our side to let this feature work again. I'm not completely done: the Java tests still don't find their platform file, and I'm not sure how to pass an argument to them.

Status point

As I said, it seems to be in good shape, and the only remaining issues are the following. I really need to hurry since Pedro and Bruno seem to be done with the partial updates of the linear system.

On the one hand, I'd like two separate releases for the simplifications and these speedups because they are quite different, but we may merge them if I'm too slow.

teshsuite/gras/datadesc failures

From what I understand, the bug is in the code dealing with cycles in forests of marshalled pointers. But I'm very puzzled since it only happens from time to time, and only on some platforms. From my tests, it happens about 70% of the time on bob (a Debian/amd64 box), but never on Cristian's laptop (an Ubuntu/amd64 box).

Also, it never happens in valgrind. So I thought that it may be some sort of bug in the libc (since valgrind provides a new implementation for large parts of it), but the bug also happens on Mac OS X...

It may be because of the compiler; I'll have to check the versions on each box.

It may be because I only take part of the pointer address into account to detect cycles. I mean, cycle detection is achieved by storing the pointer address in a dict, so if I store only the beginning of it, two distinct pointers may end up under the same key and I may get bogus results sometimes. That seems quite unlikely.
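To make that last hypothesis concrete, here is a minimal Python sketch (not the actual GRAS code). It assumes 64-bit pointers stored in little-endian byte order, and the names `pointer_key` and `false_cycles` are hypothetical. It shows how keying the dict on only the first bytes of an address can make two distinct pointers look like the same, already visited one:

```python
POINTER_SIZE = 8  # assumption: 64-bit pointers


def pointer_key(address, kept_bytes=POINTER_SIZE):
    """Build the dict key for a pointer: the first kept_bytes bytes of
    its in-memory (little-endian) representation."""
    raw = address.to_bytes(POINTER_SIZE, "little")
    return raw[:kept_bytes]


def false_cycles(addresses, kept_bytes=POINTER_SIZE):
    """Return pairs of distinct addresses that share the same key,
    i.e. pointers that would wrongly be reported as a cycle."""
    seen = {}
    false_hits = []
    for address in addresses:
        key = pointer_key(address, kept_bytes)
        previous = seen.get(key)
        if previous is None:
            seen[key] = address          # first time this key is met
        elif previous != address:
            false_hits.append((previous, address))  # key collision
    return false_hits


# Two made-up addresses that differ only in their fifth byte.
a = 0x00007F0000001000
b = 0x00007F0100001000
print(false_cycles([a, b]))                # full key: no false cycle
print(false_cycles([a, b], kept_bytes=4))  # truncated key: a and b collide
```

Whether such a collision occurs depends on where the allocator places the data, which would at least be compatible with a failure that shows up on some boxes and not on others.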

I definitely need to dig further into this: it is the showstopper at the moment.

examples/gras/pmm erratic failures

This test fails about 30% of the time on my box, and I completely fail to see why. It never fails on Cristian's laptop. It has been like this for months (if not years), so it will stay that way for now.

Spurious tesh failures on Mac OS X

Some of the tesh self-tests fail from time to time because rm -rf dir cannot remove a directory. Since it complains that the directory is not empty, I guess the directory contains some sort of system metadata.

One day, I'll have to mark these commands as teardown, telling tesh that a failure on them is not critical... For now, we'll live with it.
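The teardown idea could look roughly like the following Python sketch. This is neither tesh syntax nor tesh code: the `run_script` function and the `is_teardown` flag are hypothetical, and only illustrate the intended behaviour, where a failing teardown command is reported but does not fail the test.

```python
import subprocess
import sys


def run_script(steps):
    """Run (argv, is_teardown) steps in order.

    A failing regular command aborts the run; a failing teardown
    command is only recorded as a warning.
    Returns (success, warnings)."""
    warnings = []
    for argv, is_teardown in steps:
        code = subprocess.run(argv).returncode
        if code == 0:
            continue
        if is_teardown:
            warnings.append((argv, code))  # not critical: keep going
        else:
            return False, warnings         # real failure: stop here
    return True, warnings


# Stand-ins for a passing test command and a failing cleanup command.
passing = [sys.executable, "-c", "pass"]
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
print(run_script([(passing, False), (failing, True)]))
```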

OpenSolaris still completely broken

I gave it a try on http://pipol.inria.fr, but this platform drives me nuts at the moment. And who cares about this arch anyway?

FAQ quite outdated.

For example, it still mentions that we have difficulties integrating flexml 1.7, although Arnaud did that maybe a year ago. I'm not sure I'm going to find the motivation to fix it before 3.3.2.

Future

After this release, we'll have to work on:

Partial updates of the linear system (if not integrated in this release)
Fix up the DTD:
- Laurent noticed that our usage of XML is ... not regular
- Fred noticed that the code builds real crap when using route-multi (i.e., clusters)
Plug memleaks: nothing critical yet, but we are definitely not clean in this area. This may be an assignment of the to-be-hired engineers.
Come up with a visualization solution: we killed Paje in 3.3... Bruno?

Plus some other still unclear points...