Remote-SimGrid working prototype

This article is part of a series about my recent work on SimGrid:

Binge coding in SimGrid
Remote-SimGrid working prototype
Remote-SimGrid now with ProtoBuf and in Java
SimGrid is back on Windows

Two weeks later, it's time for a little status update on my current work on SimGrid. As expected, I did not code frantically this time but rather coded a bit instead of watching TV at night. Here is what I've managed to do.

In SimGrid

In SimGrid v3.12, I didn't do much. I fixed a bunch of user-reported bugs, spell-checked some comments and such. I tried to start some cleanups in the simcall mess, but without much success. I cannot understand why the implementations of comm_wait and exec_wait are so different. But that's something I'll investigate another day. Converting Simix to C++ could make this easier, but C++ will not interact nicely with our simcall infrastructure.

This code is too obscure to survive as is. Yesterday, I wanted to clean the send simcall that takes the sender as its first argument. Since the simcall issuer is already provided elsewhere, that's kinda useless and I wanted to sanitize it. But it's impossible because SMPI is misusing this simcall: in the context of RMA (remote memory access), the send action is sometimes initiated by the receiver, making this pimple mandatory. So I cleaned things up the other way: the recv() simcall now takes the receiver as a parameter, making it possible to initiate a receive action on a remote process. Given the current state, any trick simplifying the internals is welcome...

In S4U

In S4U (aka SimGrid v3.13 or SimGrid pre-4.0), I changed my mind and used the name Mailbox again. I was thinking about calling this Channel in the hope that it would be less disturbing to our users, but I was not convinced so I reverted my change. The MSG processes are now named Actors. That's much better than Processes since some people search for the SimGrid Threads along with these Processes. I'm considering the word Agent instead, but it seems too vague and overused. Do you have a better idea? Drop me an email! I'll add your comment to this page if you wish.

The big change in S4U is the asynchronous actions. I only implemented communications so far, and some advanced parts of the existing API are not accessible yet, but I'm still happy with this functional POC. The drawback is that every S4U asynchronous action is implemented as an additional layer over the simix synchronization, with something like this:

class simgrid::s4u::Async {
...
private:
   smx_synchro_t p_inferior = NULL;
...
};

It sounds a bit like adding an extra layer of shit to the massive tome, but as I said, I did not manage to untangle the simcalls related to the synchronizations yet. I hope that having a clean interface above will give me a clear overview of what the internals should look like. But you can already enjoy this nicer interface. Here is the code of Actor::send for example. It initializes the communication, and then fills in all the parameters before starting the communication. It's much better than the long litany of parameters that we have in MSG, don't you think?

void s4u::Actor::send(Mailbox &chan, void *payload, size_t simulatedSize) {
   Comm c = Comm::send_init(this,chan);
   c.setRemains(simulatedSize);
   c.setSrcData(payload);
   // c.start() is optional.
   c.wait();
}

Remote SimGrid: working prototype

My biggest achievement these days is in Remote SimGrid. I'm done with the protocol connecting the server and the representatives that are embedded in the application, and it's working. Here is a simple client application that can run within RSG:

#include <stdio.h>
#include <rsg/actor.hpp>

int main(int argc, char **argv) {
    simgrid::rsg::Actor &self = simgrid::rsg::Actor::self();

    self.sleep(42);
    self.execute(8095000000); // That's the power of my host on the used platform
    self.send("toto","message from client");
    char * msg = self.recv("toto");
    fprintf(stderr, "Client: Received message: '%s'\n",msg);

    self.quit();
}

And the corresponding server:

#include <stdio.h>
#include <rsg/actor.hpp>

int main(int argc, char **argv) {
    simgrid::rsg::Actor &self = simgrid::rsg::Actor::self();

    char * msg = self.recv("toto");
    fprintf(stderr, "Server: Received message: '%s'\n",msg);
    self.send("toto", "Message from server");
    self.quit();
}

The trick is that the RSG server sets an environment variable RSG_PORT before fork+exec()ing the processes, so that simgrid::rsg::Actor::self() knows where to connect back to the infrastructure.
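
To make the mechanism concrete, here is a minimal sketch of how a client could read that variable. The helper name rsg_port_from_env() and the -1 error convention are my assumptions for illustration; this is not the actual RSG code, which is not shown in this article.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (not part of RSG): returns the port found in the
// RSG_PORT environment variable, or -1 if it is unset or not a valid port.
int rsg_port_from_env() {
    const char *value = std::getenv("RSG_PORT");
    if (value == nullptr || *value == '\0')
        return -1;                      // not started by the RSG server
    char *end = nullptr;
    long port = std::strtol(value, &end, 10);
    if (*end != '\0' || port < 1 || port > 65535)
        return -1;                      // trailing garbage or out of range
    return static_cast<int>(port);
}
```

A client started by hand, outside of the RSG server, would get -1 here and could report a clear error instead of trying to connect to nowhere.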

And next comes the deployment file. The function name of each process is only used in the log messages, while the first and only argument is the command line that we have to fork+exec. Here, it seems over-engineered as the command line is very simple, but I guess that this will be handy when starting real processes with tons of parameters.

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid.dtd">
<platform version="3">
   <process host="host0" function="client">
      <argument value="./dumb_client"/>
   </process>
   <process host="host1" function="server">
      <argument value="./dumb_server"/>
   </process>
</platform>

All together, this gives the following output. The lines with a regular SimGrid format (specifying the host and process names along with the timestamp) are within the RSG server, where the simulation takes place. These debug lines are probably too verbose for production, of course. The naked lines without all this information were produced directly in the client and server, whose code is given above.

$ ./rsg two_hosts_platform.xml deploy.xml 9999
[host0:client:(0) 0.000000] [rsg_server/INFO] sleep(42.000000)
[host0:client:(0) 42.000000] [rsg_server/INFO] execute(8095000000.000000)
[host0:client:(0) 43.000000] [rsg_server/INFO] send(toto,message from client)
[host1:server:(0) 43.001301] [rsg_server/INFO] recv(toto) ~> message from client
Server: Received message: 'message from client'
[host1:server:(0) 43.001301] [rsg_server/INFO] send(toto,Message from server)
[host1:server:(0) 43.002602] [rsg_server/INFO] quit()
[host0:client:(0) 43.002602] [rsg_server/INFO] recv(toto) ~> Message from server
Client: Received message: 'Message from server'
[host0:client:(0) 43.002602] [rsg_server/INFO] quit()
[43.002602] [rsg_server/INFO] Simulation done
$

So here we are, I have a fully working prototype of Remote SimGrid, at last.

Remote SimGrid: a glance at the internals

Internally, the RSG server (in charge of running the simulation) and the RSG actors (the representatives on the client side, called self above) exchange JSON requests, such as:

{cmd:send,mailbox:"toto",content:"message from client"}
{ret:send,clock:43.001301}

I used Jasmine to parse the JSON, a lightweight and blazing fast parser in raw C, and every buffer is reused, reducing the memory management overhead to the bare minimum. I think that the performance will be rather good, even though I have not tested it yet.

Naturally, shared memory and some semaphores for the messaging would have been faster, but this design is simpler and thus more robust. Also, it will be easier to scale further by locating some client processes on remote machines, or to write the client library for other languages. I personally need Java, and JNI seems too complicated compared to a reimplementation of this JSON-based communication protocol.

Adding a new command to the protocol is really easy. Its name and prototype are defined in the source as follows. sleep takes one parameter, named "duration", and that's a double (think of the %f notation). recv takes one parameter "mailbox" that is a string, and returns a string.

command_t commands[] = {
    {CMD_SLEEP, "sleep",  1,{{"duration",'f'},NOARG,NOARG,NOARG,NOARG,NOARG},          VOID},
    {CMD_EXEC,  "execute",1,{{"flops",'f'},NOARG,NOARG,NOARG,NOARG,NOARG},             VOID},
    {CMD_QUIT,  "quit",   0,{NOARG,NOARG,NOARG,NOARG,NOARG,NOARG},                     VOID},
    {CMD_SEND,  "send",   2,{{"mailbox",'s'},{"content",'s'},NOARG,NOARG,NOARG,NOARG}, VOID},
    {CMD_RECV,  "recv",   1,{{"mailbox",'s'},NOARG,NOARG,NOARG,NOARG,NOARG},           's'}
};
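
To illustrate why such a table makes new commands cheap, here is a rough sketch of a request builder driven by a prototype. Everything in it is a simplified assumption of mine (the ArgSpec and CommandSpec types, the build_request() name, the lack of string escaping); the real rsg_request() is not shown in this article and certainly differs.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical, simplified mirror of the table above (not the RSG types).
struct ArgSpec { const char *name; char type; };  // 'f' = double, 's' = string
struct CommandSpec { const char *name; int argc; ArgSpec args[6]; };

static const CommandSpec SLEEP_SPEC = {"sleep", 1, {{"duration", 'f'}}};
static const CommandSpec SEND_SPEC  = {"send",  2, {{"mailbox", 's'}, {"content", 's'}}};

// Walks the prototype and consumes one variadic argument per declared
// parameter. String contents are copied as is, without any escaping.
std::string build_request(const CommandSpec *spec, ...) {
    std::string out = std::string("{cmd:") + spec->name;
    va_list ap;
    va_start(ap, spec);
    for (int i = 0; i < spec->argc; i++) {
        out += std::string(",") + spec->args[i].name + ":";
        if (spec->args[i].type == 'f') {
            char buf[64];
            std::snprintf(buf, sizeof buf, "%f", va_arg(ap, double));
            out += buf;
        } else {
            assert(spec->args[i].type == 's');  // only two types in this sketch
            out += std::string("\"") + va_arg(ap, const char *) + "\"";
        }
    }
    va_end(ap);
    return out + "}";
}
```

With this sketch, build_request(&SEND_SPEC, "toto", "message from client") produces a request in the same shape as the one shown earlier. As with printf, the caller must pass arguments matching the declared types: a double parameter needs 42.0, not 42.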

The client-side code is straightforward. The rsg_request() function takes care of generating the JSON, sending it to the server, parsing its answer, and retrieving the result.

void rsg::Actor::execute(double flops) {
    rsg_request(p_sock, p_workspace, CMD_EXEC, flops);
}
void rsg::Actor::send(const char*mailbox, const char*content) {
    rsg_request(p_sock, p_workspace, CMD_SEND, mailbox, content);
}
char *rsg::Actor::recv(const char*mailbox) {
    char *content;
    rsg_request(p_sock, p_workspace, CMD_RECV, mailbox, &content);
    return content;
}

And so is the server-side code. rsg_request_getargs() parses the JSON sent by the client and fills the variables to retrieve the parameters, while rsg_request_doanswer() generates the JSON and sends it to the client.

s4u::Actor *self = s4u::Actor::current();
...
switch (cmd) {
    case CMD_SEND: {
        char* mailbox, *content;
        rsg_request_getargs(parsespace, cmd, &mailbox, &content);
        XBT_INFO("send(%s,%s)",mailbox,content);
        self->send(*s4u::Mailbox::byName(mailbox), xbt_strdup(content), strlen(content));
        rsg_request_doanswer(mysock, parsespace,cmd);
        break;
    }
    case CMD_RECV: {
        char* mailbox;
        rsg_request_getargs(parsespace, cmd, &mailbox);
        char *content = (char*)self->recv(*s4u::Mailbox::byName(mailbox));
        XBT_INFO("recv(%s) ~> %s",mailbox, content);
        rsg_request_doanswer(mysock, parsespace,cmd, content);
        free(content);
        break;
    }
}

One magic thing is that there is no need for any kind of extra synchronization. The RSG clients are blocked by the socket communications until their representative within the RSG server (an S4U actor) can serve them. So, the simulation runs just as usual: the maestro dispatches the control to each S4U actor when its actions within the simulation are done. If you don't understand that paragraph, that's perfectly fine: it all works seamlessly.
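
If you want to see the blocking behavior in isolation, here is a toy sketch that involves no SimGrid at all. It is only an assumption-laden illustration: a socketpair and a forked child stand in for the real connection and for the representative, and the function name blocking_round_trip() is made up. The parent plays the client and simply waits in read() until the other side answers.

```cpp
#include <cassert>
#include <string>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

// Toy illustration (not RSG code): the caller sends a request and blocks
// in read() until the forked "representative" answers. Messages are short,
// so this sketch assumes that each one arrives in a single read().
std::string blocking_round_trip(const std::string &request) {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return "";
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return "";
    }
    if (pid == 0) {                       // child: the representative
        close(fds[0]);
        char buf[256];
        ssize_t n = read(fds[1], buf, sizeof buf);
        if (n > 0) {
            std::string answer = "{ret:" + std::string(buf, n) + "}";
            if (write(fds[1], answer.data(), answer.size()) < 0)
                _exit(1);
        }
        close(fds[1]);
        _exit(0);
    }
    close(fds[1]);                        // parent: the client
    std::string answer;
    if (write(fds[0], request.data(), request.size()) >= 0) {
        char buf[256];
        ssize_t n = read(fds[0], buf, sizeof buf);  // blocks until served
        if (n > 0)
            answer.assign(buf, n);
    }
    close(fds[0]);
    waitpid(pid, nullptr, 0);
    return answer;
}
```

The client needs no lock and no condition variable: the blocking read() is the only synchronization point, which is the property described above.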

What's next?

Of course, I'm still far from being done. SimGrid 3.12 still has messy internals, S4U is still lacking most of the MSG API and Remote SimGrid is really far from being finished.

My priority for the next days/weeks will be to port the async actions to RSG. The trick will be to create proxy objects within the RSG clients that will represent the actual S4U objects created on the RSG server. I will also develop a Java POC for Remote SimGrid, making it possible to connect Java applications back to the RSG server.

Once this is done, I think that the major difficulties of the RSG project will be solved, and porting the rest of the S4U interface will be a somewhat mechanical task. More tests will certainly show bugs that will need to be fixed. The kind of testing infrastructure that I can set up for RSG remains an open question, too.

Then, I will still have to flesh out S4U. And of course, the SimGrid internals still need a lot of cleanups, as always. Actually, I should probably not say that, but I think that I somehow like cleaning the internals of SimGrid. Some people tend their Japanese garden, some others solve Sudokus, but I prefer refactoring the SimGrid internals...

(read the follow-up to this article here: Remote-SimGrid now with ProtoBuf and in Java)