PhD Thesis: LLM-Powered Continuous Evolution of Scientific Computing Software

Jan 1, 2024

PhD Description

Context

Marc Andreessen argued in the WSJ that software is eating the world [1]. This also holds for scientific computing, which leverages computing capabilities to understand and solve complex problems in science (chemistry, physics, maths, biology…), industry (health, space, aeronautics, etc.), and public authorities.

In scientific computing, there is a significant disconnect between the lifetime of physics simulation codes (~20 years), HPC programming paradigms (~10 years), and supercomputers (<5 years). This puts a heavy burden on the developers of these applications, as they are primarily physicists and numerical analysts, but nevertheless have to address software engineering and high performance computing (HPC) concerns when coding, and keep pace with advances in those fields [2].

To achieve proper separation of concerns, the use of domain-specific languages (DSLs) tailored to the needs of the domain experts [3] is a promising perspective to allow physicists and numerical analysts to address concerns specific to their domains, while the language developers address the software engineering and HPC concerns.

However, to integrate and experiment with cutting-edge advances in software engineering and HPC, it is not feasible to start from scratch or manually rewrite the existing code, given the long lifetime and large size of physics simulation codes. Worse, as both software and hardware capabilities continuously evolve, developers must continuously update and maintain (i.e., co-evolve) their code accordingly. Unfortunately, this task is still manual and remains a burden for developers.

Today, advances in AI show promising results, and a plethora of LLMs are reshaping developers' daily activities [4]. Scientific computing is no exception, which opens up several perspectives and opportunities for automation.

Objectives

The objective of this PhD thesis is to provide building blocks enabling the rapid evaluation and adoption of cutting-edge advances in software engineering and HPC, in the context of scientific computing software, and simulation codes in particular, by leveraging LLMs.

The overall objectives are, beyond a survey of the state of the art [5, 6] on this topic and in adjacent contexts (i.e., non-scientific software), to explore the feasibility of powering continuous code evolution with LLMs. Among the many existing challenges, we aim to:

Investigate the balance between human intervention and automation required for this task (e.g., rewrite some parts by hand to kickstart the automated process).
Investigate and explore what kind of “evolution harness” must be built around the application, and to what extent this can be automated.
Experiment with various LLM pipelines.
Investigate and explore how to enable incremental evolution (e.g., composition of components, interoperability), in particular when the target of the evolution is another language: the complete application cannot be evolved all at once, and each evolution increment must be validated. This will also help address the scalability challenge of evolving large, complex code.
Investigate the extraction of evolutions at the language level from evolutions at the source code level, such as identifying emerging language constructs from source code evolutions.

Environment

The candidate will join the DiverSE team, a joint team of CNRS (IRISA) and Inria, and the Laboratory in High Performance Computing for Calculation and Simulation (LiHPC) of CEA DAM, affiliated with the University of Paris-Saclay. The candidate will be supervised by Benoit Combemale (https://people.irisa.fr/Benoit.Combemale/) and Djamel Khelladi (http://people.irisa.fr/Djamel-Eddine.Khelladi/) from Inria, and Dorian Leroy from CEA DAM. The candidate can be based either at Inria in Rennes or at CEA DAM in Bruyères-le-Châtel, and will regularly visit the other site.

The PhD will be funded by the NumPEx program (https://numpex.org/).

Prerequisites

A degree (and strong background) in computer science (esp. software engineering)
Skills in programming and modeling languages, and supporting environments
Interest in machine learning (esp. LLMs)
Professional proficiency in English
Presentation and writing skills
Autonomy, rigor, and a strong work ethic

References

[1] Andreessen, M. (2011). Why Software Is Eating The World. The Wall Street Journal. https://www.wsj.com/articles/SB10001424053111903480904576512250915629460, https://a16z.com/why-software-is-eating-the-world/

[2] Leroy, D., Sallou, J., Bourcier, J., & Combemale, B. (2021). When scientific software meets software engineering. Computer, 54(12), 60-71.

[3] Fowler, M. (2010). Domain-specific languages. Pearson Education.

[4] Ishaani, M., Omidvar-Tehrani, B., & Anubhai, A. (2024). Evaluating human-AI partnership for LLM-based code migration.

[5] Busch, D., Bainczyk, A., & Steffen, B. (2023, October). Towards LLM-Based System Migration in Language-Driven Engineering. In International Conference on Engineering of Computer-Based Systems (pp. 191-200). Cham: Springer Nature Switzerland.

[6] Almeida, A., Xavier, L., & Valente, M. T. (2024). Automatic Library Migration Using Large Language Models: First Results. arXiv preprint arXiv:2408.16151.

How to apply

Send your CV, a motivation letter, and your bachelor's and master's transcripts, together with the diplomas.

Benoit Combemale

Full Professor of Software Engineering

Agility and safety for wild software