I blogged recently about an article on the didactics of informatics that I read for a working group on that topic that we are starting here in Nancy. I currently consider the didactics of informatics as my third research area, alongside model checking and dynamic verification of distributed applications as a second area, and the simulation of distributed applications as my main area. As you guessed, this blog post aims at giving some information about what we are doing in the second of these research areas. Since I'm currently on a plane (probably somewhere above Canada) to visit Henri in Hawai'i for three weeks, I guess that I will have several occasions to blog about my first occupation in the coming days.

Motivations and personal background

We started investigating how to couple model checking with the SimGrid simulator of distributed applications maybe five years ago. My motivation comes from my main usage of the simulator: I wanted to test the code that I develop for distributed settings. That's why I developed GRAS, and how I got involved in the SimGrid effort in the first place. Then, when I started using this for real (on the Naimi-Trehel algorithm), I discovered that simulation was only a partial answer: the experimenter has to manually determine the set of scenarios on which s/he tests the application. That explains why I came to model checking: I am looking for a way to exhaustively verify distributed applications.

After several years, I'm still very far from being a specialist of model checking myself. In fact, I'm not sure that I actually want to become such a hyper-specialist. I didn't leave my previous field of specialty (the simulation of distributed applications), and that's on purpose: I want to apply this methodology to my main research interest. If I can contribute to pure model-checking theory along the way, I'm glad, but that's not my main goal. After co-advising Cristian's thesis, for example, I'm not a complete novice in that field either, and this post aims at helping me organize my thoughts on it. Sorry if this post only interests me and myself... That's a blog after all ;)

The multiple ways of Model-Checking

Actually, the expression "model checking" covers several approaches and research communities. The basic idea is to take (i) a system (be it a pure abstract model, a real software implementation, or something in between) and (ii) a set of properties, and to ensure that the state of the system cannot evolve in a way that would violate these properties. This is done by exploring all the possible evolution paths of the system. The main advantage is that when a property violation is detected, we can determine the exact scenario that led the system into this bogus state, which proves extremely precious when debugging and fixing the system so that this evolution becomes impossible. The bad point is the inherent exponential complexity of the exploration, known as the state space explosion problem, which constitutes a major obstacle to the practical applicability of the approach.

There are several difficulties in model checking. First, one has to provide a system execution controller that allows one to run the studied system while controlling every source of non-determinism (such as thread/process scheduling, networking delays leading to message reordering, timeouts, failures, and other sources of randomness). Then, an exploration mechanism must be provided to explore the various evolution paths that the system could take in response to this non-determinism. As hinted above, this exploration must rely on a reduction mechanism that limits the number of states to explore without loss of generality, so that the whole process remains applicable in practice.
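
To fix ideas, here is a fully self-contained toy illustration of what "exploring all the evolution paths" means in the simplest possible setting (everything in it is invented for this post, it is not SimGrid code): the system is two processes performing a non-atomic increment of a shared counter, the only non-determinism source is the scheduling, and the explorer enumerates every interleaving and checks at the end that the counter reached 2. Running it reports the interleavings exhibiting the classical lost-update defect.

```c
/* Toy explicit-state explorer, written for this post (not SimGrid code). */
#include <stdio.h>

typedef struct {
  int shared;     /* the shared counter                                   */
  int local[2];   /* per-process value read from the counter              */
  int pc[2];      /* per-process program counter: 0=read, 1=write, 2=done */
} state_t;

static int violations = 0;

static void explore(state_t s)
{
  int someone_ran = 0;
  for (int p = 0; p < 2; p++) {        /* each scheduling choice opens a branch */
    if (s.pc[p] == 2)
      continue;                        /* this process is already done */
    someone_ran = 1;
    state_t next = s;                  /* states are copied, so backtracking is free */
    if (next.pc[p] == 0)
      next.local[p] = next.shared;     /* read step  */
    else
      next.shared = next.local[p] + 1; /* write step */
    next.pc[p]++;
    explore(next);
  }
  if (!someone_ran && s.shared != 2) { /* safety check on terminal states */
    printf("violation: final counter is %d\n", s.shared);
    violations++;
  }
}

int main(void)
{
  state_t initial = { 0, {0, 0}, {0, 0} };
  explore(initial);
  printf("%d violating interleaving(s) found\n", violations);
  return 0;
}
```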

Abstract vs. Implementation-Level Model-Checking

The first efforts worked with abstract state machines representing the system to verify, but this requires a difficult, time-consuming and error-prone manual modeling of the real implementation, or an equally difficult manual conversion of the model into a real implementation. Several attempts try to mechanize these conversions to remove, or at least reduce, these difficulties. Refinement methods were for example proposed to automatically derive an implementation from the model, but they are rarely used in practice. I guess that they pose performance issues or that they are not always applicable. On the other side, static analysis approaches were proposed to automatically build a model from a given implementation. But the amount of information available from the application source code alone is very limited, and does not allow building effective models that could be verified later on. In particular, run-time information can prove crucial to reduce the state space to explore.

This explains why a lot of efforts aim at verifying the real implementation directly. But this greatly complicates an already complex problem. First, a verification at the implementation level has to deal with an increased complexity over a model-level verification, since abstract models hide many details of real systems. Then, the system execution controller is much more complex to devise when dealing with real applications than when dealing with abstract models. It introduces system programming issues that add to the more theoretical considerations classically found in model checking. Actually, I must confess that this latter point is something I like very much in this research area: it mixes very practical system programming difficulties with very theoretical considerations. For me, the best way to look gifted to everyone is to be bad at many things...

Application domains: Concurrent vs. Distributed

Model checking was initially applied mainly to concurrent / multi-threaded applications, looking for issues such as race conditions, deadlocks, or fairness problems such as livelocks. I apply it to distributed applications, which present specific difficulties in addition to the classical concurrent defects: the lack of global state prevents entities from having accurate global information about the system, while the lack of global time makes it difficult to order events happening on different entities.

Distributed settings are simpler to model-check than concurrent ones for several reasons. First, the number of interleavings to consider is much smaller: entities only interact through message passing in distributed settings, while every memory access has to be considered in concurrent settings. Then, the lack of global time turns into an advantage in the sense that the respective interleaving of the local events of several entities has no impact on the global evolution of the system. That is, how the local events of entity A are interleaved with the local events of entity B is not relevant. This can greatly reduce the number of interleavings to explore, provided that you can accurately determine which events are dependent and which are independent. That's even one of the strong results of Cristian's thesis: we proposed (and proved) decision rules that are less conservative than previous works (thus leading to better reduction) while remaining conservative enough to ensure the soundness of the process.
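
To make the independence idea concrete, here is a deliberately simplified caricature of such a decision rule (it is not the actual rule set proved in Cristian's thesis, and every name is invented): two transitions of different processes are considered independent when at least one of them is purely local, or when they use distinct rendez-vous points.

```c
/* Caricatural independence check for a message-passing system. */
#include <stdbool.h>

typedef enum { LOCAL_COMPUTATION, SEND, RECV } action_kind_t;

typedef struct {
  int process;          /* which process executes the transition             */
  action_kind_t kind;   /* what it does                                      */
  int mailbox;          /* rendez-vous point used by SEND/RECV, or -1        */
} transition_t;

/* Two independent transitions can be explored in either order without
 * changing the reachable states, so only one order needs to be explored. */
bool independent(transition_t a, transition_t b)
{
  if (a.process == b.process)
    return false;                       /* same process: program order matters  */
  if (a.kind == LOCAL_COMPUTATION || b.kind == LOCAL_COMPUTATION)
    return true;                        /* purely local events never conflict   */
  return a.mailbox != b.mailbox;        /* communications on distinct mailboxes */
}
```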

On the other hand, real distributed systems are harder to verify than pure concurrent ones because things are more intricate in real systems. In most distributed applications, each process is multi-threaded for performance reasons, forcing us to combine the methods applicable to distributed systems with those applicable to concurrent ones. Note: I seem to remember that there is an impossibility result here, stating that it is impossible to exhaustively verify a system made of distributed multi-threaded entities without an infinitely large exploration stack, but I'll have to check it again since I just read an article on Dometer, which seems to do this anyway. Sorry, the article is quite short, and being on a plane makes any further bibliographical investigation impossible. This impossibility would be very unfortunate, given the practical importance of distributed applications, and I'd be happy to be wrong on this one (that is, if checking distributed multi-threaded processes turned out to be possible).

Safety vs. Liveness properties

The kinds of properties that you want to check about your system fall into two main categories. Safety properties express things such as "nothing bad will ever happen". They are very easy to verify: these assertions are expressed as boolean functions that must be evaluated on every possible state of the system, so the model-checker just has to traverse all possible states. It may safely prune the exploration of paths that lead into portions of the state space equivalent to portions already explored.
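
In practice, with SimGrid such an assertion is stated directly in the application code through MC_assert(); the snippet below is only a sketch (the surrounding scenario is invented for the example, and the header name may differ depending on the SimGrid version).

```c
/* Sketch of a safety property expressed as a boolean assertion. */
#include <simgrid/modelchecker.h>

static int tokens_in_system = 0;   /* hypothetical application state */

void on_token_received(void)
{
  tokens_in_system++;
  /* Safety: "there is never more than one token in the system".
   * The model-checker evaluates this on every explored state and, if it
   * ever evaluates to false, reports the execution path leading there. */
  MC_assert(tokens_in_system <= 1);
}
```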

Liveness properties are harder to verify. They make it possible to express things such as "good things always show up eventually". They allow one to reason not only about isolated system states, but about state paths (if I reach a state presenting this and that property, then I am sure that I will eventually reach a(nother) state presenting that and this property). Actually, liveness properties can be arbitrarily complex logical formulas. Their atomic propositions usually correspond to the variables of the system to verify. Marie Duflot, who recently joined our research lab, even uses a Turing-complete formalism to express the properties, with integer variables and everything. I'm really excited about the collaboration opportunities here.
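
A textbook example of such a path property, written in linear temporal logic (LTL, one of the usual formalisms here), is the classical "every request is eventually granted":

    G (request -> F grant)

that is, at every point of every execution path, if request holds then grant must hold at some later point.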

Verifying liveness properties is much harder than verifying safety ones. Since we are looking for a counter-example of the property on the system S, the basic approach consists in expressing the negation of the property as a state automaton P, and exploring the state space of the cross-product automaton SxP. Note that this is a generalization of the previous case, since a safety property can be expressed as an automaton with only two states: an initial state in which the property still holds, and a final accepting state in which the property is violated. So, every model-checker able to verify liveness properties can trivially verify safety properties too.
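
As an illustration of that remark, here is what such a two-state monitor could look like (all names are made up for the example, this is not SimGrid code): it jumps to its accepting (error) state as soon as the assertion is violated, and stays there, so the exploration of SxP can report the counter-example path.

```c
/* Two-state property automaton encoding the negation of a safety property:
 * it accepts exactly the executions in which the assertion gets violated. */
#include <stdbool.h>

enum prop_state { PROP_OK, PROP_VIOLATED /* accepting */ };

/* One step of the property automaton, driven by the atomic proposition
 * evaluated on the current system state during the exploration of SxP. */
enum prop_state prop_step(enum prop_state q, bool assertion_holds)
{
  if (q == PROP_VIOLATED || !assertion_holds)
    return PROP_VIOLATED;   /* once violated, stay violated */
  return PROP_OK;
}
```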

The main difficulty posed by this generalization is that it makes the state space explosion problem even worse, since we are not exploring S alone, but SxP. S alone was already exponential, and we now have to deal with the product of two exponential objects. To make things worse, the reduction is harder for liveness properties than it is for safety ones. Since we are reasoning about execution paths, the verification algorithm must carefully ensure that the reduction does not break any existing cycle or add any new cycle on these paths. As a matter of fact, the reduction techniques proposed during Cristian's thesis are not applicable in this context for this very reason.

Current research efforts on Model-checking in SimGrid

The thesis of Marion, which started a few months ago, naturally builds upon Cristian's. Previously, we leveraged and extended SimGrid into a working execution controller of distributed applications. We stand somewhere between the abstract level and the bare implementation level: the programs that we verify are raw C programs, but they use a specific API of ours to express their message exchanges. On the properties side, we took the easy way and restricted ourselves to local safety properties, expressed by the user as boolean functions. If these properties were made global by sharing some state between entities for verification purposes, our exploration algorithm would become unsound, since the reduction could then become too aggressive.

With Marion, we are now working on generalizing this on several points. A long-term goal (that we may or may not reach) is to verify any C program by intercepting its system interactions directly at the system-call level. During her master thesis, Marion implemented a prototype of such an interceptor. It could be used to implement an emulator, as was done in the MicroGrid project that inspired us, but another possibility would be to couple it to the model-checker instead of the simulator. There is still a long path to cover for that, but it would be so cool to achieve that we will at least try. It would open the door to a sort of systematic debugger for generic applications... Our experience with Cristian shows that you have to model the application semantics to get practical reductions. This lets me dream of some sort of semantic debugger for distributed applications, but that will probably need at least another thesis to become doable. The backup plan here is to allow the emulation, through simulation, of legacy applications. We have good reasons to think that we can do so in a more extensible way than MicroGrid did, for example.
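
For the curious, one classical way to build such a system-call interceptor on Linux is ptrace(); I am not claiming that Marion's prototype works exactly this way, the sketch below only illustrates the kind of mechanism involved.

```c
/* Rough sketch of a system-call interceptor based on ptrace() (Linux, x86_64). */
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  if (argc < 2)
    return 1;
  pid_t child = fork();
  if (child == 0) {                             /* the traced application */
    ptrace(PTRACE_TRACEME, 0, NULL, NULL);
    execvp(argv[1], argv + 1);
    return 1;
  }
  int status;
  waitpid(child, &status, 0);                   /* initial stop after execvp */
  while (!WIFEXITED(status)) {
    ptrace(PTRACE_SYSCALL, child, NULL, NULL);  /* run until next syscall entry/exit */
    waitpid(child, &status, 0);
    if (WIFSTOPPED(status)) {
      struct user_regs_struct regs;
      ptrace(PTRACE_GETREGS, child, NULL, &regs);
      /* orig_rax holds the syscall number; this is where an execution
       * controller could reinterpret network calls for the simulator,
       * or let the model-checker decide what happens next. */
      fprintf(stderr, "syscall %lld\n", (long long) regs.orig_rax);
    }
  }
  return 0;
}
```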

The current front on which we are fighting is to ensure that temporal properties can be verified in our framework. To the best of my knowledge, no existing solution allows checking temporal properties at the implementation level. The blocking point that we are currently facing concerns state equality detection. Indeed, the algorithm verifying temporal properties terminates when it finds an accepting state that is equal to a state already explored on the current path. But for implementation-level verification, the system state is defined by the globals, stack and heap of the OS process executing the system. The execution controller already needs to save and restore that state to enable the exploration, so saving the state to another location and comparing the current state with an already explored one could be as simple as changing the memcpy() into a memcmp().
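
Here is a naive sketch of that idea (the names and the single-region simplification are mine, this is not SimGrid's actual snapshotting code): the state is the raw content of a monitored memory region, saving it is a memcpy(), and comparing it to the current content is a memcmp().

```c
/* Naive implementation-level snapshotting over one memory region. */
#include <stdlib.h>
#include <string.h>

typedef struct {
  void  *start;   /* beginning of the monitored region (e.g. the heap) */
  size_t size;    /* its length in bytes                               */
  char  *copy;    /* saved content                                     */
} snapshot_t;

snapshot_t snapshot_take(void *start, size_t size)
{
  snapshot_t snap = { start, size, malloc(size) };
  memcpy(snap.copy, start, size);            /* save: memcpy() */
  return snap;
}

int snapshot_equals_current(const snapshot_t *snap)
{
  /* compare: the "just turn memcpy into memcmp" idea from the text */
  return memcmp(snap->copy, snap->start, snap->size) == 0;
}
```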

But unfortunately, doing so leads to a tremendous amount of false negatives, where the checker thinks that two states differ although they are equal from the application's point of view. This is because the considered memory areas contain a lot of information whose details do not matter to the application semantics, such as the numerical address returned by malloc, or the port to which the socket service got bound. This is particularly bad since it leads to undetected accepting states, and possibly to unsound explorations.

Our current investigation thus explores this notion of equality between states. We'd like to first reduce the amount of false negatives, and ultimately move on to a notion of distance between states instead. For that, we compute the Hamming distance between states, to help the user determine whether two states reported as different could in fact be a false negative. In addition, we use a reimplementation of malloc that stores a lot of extra meta-data, such as the full backtrace of each allocation. This way, the state comparison algorithm can report the locations at which the modified blocks were allocated, helping the user diagnose these false negatives. From our first findings, the heaps often differ only by the order of their blocks, so we are now implementing a heap canonicalization solution to make these differences irrelevant to our diff algorithm. Later on, a mechanism similar to valgrind's suppressions could be used to declare irrelevant differences that should be ignored, for example.
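
For instance, the distance computation itself can be as simple as a bitwise Hamming distance over the saved memory regions; here is an illustrative version (not our actual code):

```c
/* Bitwise Hamming distance between two saved memory regions, as a rough
 * measure of how far apart two states are. */
#include <stddef.h>

size_t hamming_distance(const unsigned char *a, const unsigned char *b, size_t len)
{
  size_t dist = 0;
  for (size_t i = 0; i < len; i++) {
    unsigned char diff = a[i] ^ b[i];   /* bits that differ in this byte   */
    while (diff) {                      /* count them (Kernighan's trick)  */
      diff &= diff - 1;
      dist++;
    }
  }
  return dist;
}
```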

Arguably, our solution will never be perfect and will always present false positives, ruining the soundness of our exploration: it means that we will always miss some defects in the verified systems. But nothing is perfect in this world, unfortunately. Since I don't think it feasible to prove the correctness of our tool itself, the possibility of missing bugs is something that we definitely have to live with. Even if we proved our tool, we would also have to prove the compiler and the standard library that it uses, which seems less and less feasible. Even if not absolutely perfect, I'm confident that the resulting tool will prove helpful to find and eradicate some defects, which is already something.

Side note: blogging about live science is difficult. I really want to help the dissemination of ideas here, but I don't want to prevent Marion from publishing her ideas properly. The exercise is really difficult :-/ Damn damn evaluation based solely on the publication count. It really hinders scientific progress. Hopefully, the paper that we are working on with Marion will soon be ready to share on an open-access platform such as HAL.