An ever-increasing amount of data is expected to
be generated at the individual side (e.g.,
wearables, smart-meters) (i) often as time-series
and (ii) stored on a personal device for being
analyzed and vizualized.
Chiaroscuro^{1} aims at:
Chiaroscuro avoids two major common obstacles to personal data analytics.It does not copy the personal time-series on any central server (no information about a personal time-series leaves in a non-protected form its personal host devices). And it does not use any prohibitively-costly cryptographic protocol for extracting the profiles.
Chiaroscuro rather promotes an innovative approach that discloses perturbed agregate information all along the execution sequence by intertwining encryption schemes with sanitization schemes.
1. T. Allard, G. Hébrail, F. Masseglia, and E. Pacitti, "Chiaroscuro: Transparency and privacy for massive personal time-series clustering," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15. ACM, 2015, pp. 779–794. Available here.
k-Means algorithm: iterative computation of groups of similar time-series. Representative profiles are the centroids.
ExampleGossip-based modus operandi: simple point-to-point exchanges for robust and efficient agregate computation.
Learn MoreHomomorphic encryption (HE) & probabilistic differential privacy (DP): obfuscation for simple operations & disclosures for the rest.
Learn More (HE) Learn More (DP)Chiaroscuro has been implemented within Peersim^{1} a well-known framework for simulating massively-distributed algorithms. The graphical user interface (that you are currently reading!) reads the logs of a set of previous executions and allows you :
Since the simulation is executed on a single computer, we had to shrink the number of participants to a few hundreds (i.e., the shrinkFactor parameter) and to disable the encryption scheme. This impacts the simulation as follows: first, the noise magnitude is shrunk too in order to remain proportional to the shrunk population size, second, the performances of the encryption scheme are measured outside the simulation.
Figure 2 depicts the demonstration platform. Chiaroscuro is written in Java and implements Peersim's nextCycle method by the core of its execution sequence. This method is the same for all participants and is the entry point for Peersim each time it calls a participant. The graphical user interface is based on Bootstrap and can be displayed in any web browser. It allows visualizing and interacting with the logs of the execution scenarios that have previously been captured and stored in a local MongoDB instance.
1. A. Montresor and M. Jelasity, “PeerSim: A scalable P2P
simulator,” in P2P, 2009, pp. 99–100.
Parameter | Value |
shrinkFactor | 5000 |
Population size | 3.10^{6} / shrinkFactor = 600 participants |
Key sizes | {512, 1024, 2048} bits |
Nb key shares | 10 |
Nb noise shares | 3.10^{6} / shrinkFactor = 600 |
Initial nb of centroids | 20 |
In this use-case, each times-series represents the evolution of a tumor size over twenty weeks, with one measure per week. Computing representative profiles can help individuals or health professionals identify profiles where the tumor size decreases, healthy profiles in some sense, in order to analyze then the associated medicines or life-styles for example. The set of time-series was generated synthetically from mathematical models^{1. }
1. L. Claret, M. Gupta, K. Han, A. Joshi, N. Sarapa, J. He, B. Powell, and R. Bruno. Evaluation of tumor-size response metrics to predict overall survival in western and chinese patients with first-line metastatic colorectal cancer. J. Clin. Onc., 31(17):2110–2114, 2013.
Four participants are sampled at random. We plot for each iteration, their own time-series with the closest perturbed centroid.
Current iteration : #0
Any individual obtaining the resulting set of common profiles, possibly helped by a user-friendly GUI, can use them for identifying those of particular interest. For example, in the tumor-growth use-case, Bob could benefit from the profiles that show an initial growth similar to his own time-series, but that is followed then by a steady decrease in size.
We show below the time-series of participant #213, say Bob, together with the profiles that are the closest to the sub-sequence that falls into the window (in grey). You can play with the window to select the sub-sequence that is used for similarity-matching. Extending the profiles matching with additional criterion is easy (e.g., largest distance with the complementary sub-sequence).
We illustrate the impact of the perturbation on the centroids quality by showing below the first four centroids, (1) at each iteration (2) before and after adding noise. We stress that we show them for illustrative purposes, the noise values never appear in the clear to any participant.
In the second use-case, time-series represent daily electrical consumption made of twenty-four hourly measures. Letting individuals compute, access, and use representative profiles can enable various benefits. For example, as a simple feedback. Indeed, simply visualizing the positioning of one's own electrical consumption with respect to the closest representative profile can favor virtuous behaviors. Or, similarly to well-being use-cases, one could identify the interesting profiles in order to analyze then, in an additional step, the electrical devices used at home or the materials isolating the walls for example. The time-series used for obtaining the numbers below have been extracted from the real set of time-series gathered by the Irish Commission for Energy Regulation and available at the Irish Social Science Data Archive^{1}. It contains the electricity consumption time-series of thousands of Irish homes and businesses collected between 2009 and 2010 at the rate of one sample every half hour.
1. ISSDA. The Commission for Energy Regulation, Electricity Customer Behaviour Trial, 2012. http://www.ucd.ie/issda.
Four participants are sampled at random. We plot for each iteration, their own time-series with the closest perturbed centroid.
Current iteration : #0
The quality of a given clustering can be measured by a set of objective measures including the intra-cluster inertia. Intuitively, the intra-cluster inertia quantifies the relevance of a set of centroids to a dataset. It can be defined as the weighted squared sum of the distances between each data and its closest centroid (in our context, both are time-series). The figure below displays the evolution of the intra-cluster inertia of the perturbed centroids along the iterations of Chiaroscuro.
The operations performed by the Damgard-Jurik encryption scheme (the additively-homomorphic encryption scheme used in the current version of Chiaroscuro) consist in modular multiplications and modular exponentiations, which are computation-intensive operations. They are performed by personal devices when computing the candidate centroids at each iteration. The figures below show the average time (in seconds) spent by a participant in a complete iteration, at each iteration, for encrypting, decrypting, or adding candidate centroids. We measured times for three sizes of key (512 bits, 1024 bits, and 2048 bits) and on two personal devices (a current laptop and a decade-old laptop).
It is normal to observe times that decrease along the iterations because the number of centroids decreases monotonically. Indeed, the centroids that correspond to few data are strongly impacted by the differentially private perturbations, and become more and more irrelevant with respect to the actual data. They end up being "ejected" from the set of centroids. This phenomenon exists with non-perturbed k-Means, centroids that correspond to no data at all are also "ejected", but it is made stronger by differential privacy.
Configuration 1: Mac OSX, four cores at 2.8 GHZ, 16 GB Ram, Java 1.7.
Configuration 2: Linux Ubuntu 14, two cores at 1.8 GHZ, 2 GB Ram, Java 1.7.
The homomorphic encryption also impacts the network traffic because it increases the size of the data exchanged. The figure below shows the average network traffic (in KB) per participant along the iterations for different key sizes (512 bits, 1024 bits, 2048 bits). We observe a steady reduction of the traffic due to the phenomenon of centroids loss described above.
Univ. Rennes 1 / Irisa Lab Website |
EDF Lab Website |
Inria Website |
Univ. Montpellier / Lirmm Lab Website |
SED @Inria Rennes Website |
SED @Inria Rennes Website |