By Roma Koulikov 08 Dec, 2023
Project Github repo

One Friday afternoon, while planning the following week's software development work, a thought crossed my mind: "Wouldn't it be nice if I could issue a set of instructions about the intended feature and have the machine take at least a first pass at writing the relevant functions for me?" Large language models (hereafter just LMs) have gotten a lot of attention in 2023. So the idea was to see how well these LMs, finetuned on our company's code (which focuses on predicting energy output from PV plants), perform on a much simpler task.

First, to get it out of the way: I am of course familiar with Github Copilot. But Copilot is paid, and I would also like control over the internals of the LMs rather than just a black box.

Designing a system to create interrelated blocks of code that integrate into a functioning codebase in response to a user command is a very challenging endeavor. So I limited the scope to something much more manageable: generating detailed code from Python function documentation (hereafter, docstrings).

Background

In our codebase, we strive to adhere to standards for both docstrings and functions. Every docstring has, at minimum, the same sections: a description of what the function does, its inputs, and its outputs. We intentionally write layperson-friendly explanations of the pertinent engineering and solar concepts (although we don't repeat these detailed explanations across functions). In our code, we follow the Google style guide and strive for consistent variable naming and a certain coding style (e.g. Pandas/numpy-heavy vectorization, writing for humans, DRY, etc.).

So we can use the docstrings, which in essence describe what each function does, to generate the function itself. The way we achieve that is by finetuning existing LMs trained on code (ideally Python).

The questions

What LMs can we test?
How much do they improve if we finetune them, as opposed to just using them out of the box?
How good (or bad) is the code they generate? Does it even run?

Models

I decided to use the following code-specific models. Note that the values refer to the number of parameters in the model:

SalesForce Codegen 350M: trained on 71.5B Python tokens
Decicoder 1B: trained on 446B tokens - the Python, Java, and JavaScript subset of the Starcoder Training Dataset
CodeParrot 1.5B: based on GPT-2

My original intent was to also finetune CodeLlama, released by Meta in August 2023. It is a 7B-parameter model and has achieved top performance metrics on code generation tasks. However, I encountered memory issues training on expensive GPUs of various sizes and had to halt that work temporarily. I'll detail my efforts and the results in a future article.

Data preparation

Our codebase consists of about 10 modules (aka Python files), some of which contain classes. In total there are approximately 200 functions. The functions are of course connected to each other semantically (i.e. related in meaning). To simplify the problem, though, I ignored class definitions and the connections between functions. I then separated each function into an input section containing the docstring and an output section containing the function itself (a minimal sketch of this split follows below). I left 3 functions out of the finetuning dataset in order to test how well the models perform. This is a very small number to base metrics on - I sacrificed metric generalization in order to use as much data as possible for getting the best model.
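A minimal sketch of that docstring/function split using Python's ast module - the source directory, output file name, and JSONL format here are assumptions rather than the repo's exact preprocessing:

import ast
import json
from pathlib import Path

def extract_pairs(source):
    # Yield (docstring, function_source) pairs for every documented function,
    # ignoring class structure and cross-function connections as described above
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node)
            if docstring:
                # Recover the exact source of the function, signature included
                yield docstring, ast.get_source_segment(source, node)

records = []
for module in Path("src").glob("*.py"):   # assumed location of the ~10 modules
    source = module.read_text()
    records.extend(
        {"input": doc, "output": func} for doc, func in extract_pairs(source)
    )

Path("finetune_data.jsonl").write_text(
    "\n".join(json.dumps(r) for r in records) + "\n"
)

From there, a few records can be held out as the test set and the rest fed to the finetuning job.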
Modeling

I modeled using AzureML in order to make the experiment architecture transparent and reproducible and to leverage cloud compute. Details are in the Github repo. I finetuned Codegen and Decicoder for 10 epochs with a batch size of 20, and CodeParrot for 6 epochs with a batch size of 100. For all models, I used a sequence length of 500 tokens on a Standard_E8s_v3 machine (64 GB RAM, 128 GB storage, 16 cores, $0.64/hr). Training took around 10.5 hours.

Baseline predictions

In order to ascertain that finetuning really has an effect, it's instructive to predict on our test functions using the LMs out of the box (a short sketch of this generation step is shown below). Note that for the test set, I chose 3 functions that represent the range of complexity within our code. I will share 2 of them - the 3rd one has our secret sauce for uncertainty quantification in energy losses.

Function 1: Calculate PV efficiency loss
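As referenced above, here is a minimal sketch of the out-of-the-box prediction step used for these test functions, assuming the Hugging Face hub checkpoint for Codegen 350M, a placeholder signature-plus-docstring prompt, and greedy decoding (the repo's actual prompt format and generation settings may differ):

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Salesforce/codegen-350M-mono"   # assumed hub name for Codegen 350M
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Placeholder prompt: a hypothetical signature followed by its docstring,
# mirroring the input/output split used for finetuning
prompt = (
    "def calc_pv_efficiency_loss(measured, expected):\n"
    '    """Calculate PV efficiency loss ..."""\n'
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=500,                 # generation budget; training used 500-token sequences
    do_sample=False,                    # greedy decoding for a repeatable baseline
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The same loop runs against the finetuned checkpoints to compare outputs on each held-out docstring.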