One Friday afternoon, while planning the following week's software development work, a thought crossed my mind: "Wouldn't it be nice if I could issue a set of instructions about the intended feature and have the machine take at least a first pass at writing the relevant functions for me?"
Large language models (from here on referred to as LMs) have gotten a lot of attention in 2023. So the idea was to see how well these LMs, finetuned on our company's code (which focuses on predicting energy output from PV plants), perform on a much simpler task.
First, to get it out of the way: I am of course familiar with GitHub Copilot. But Copilot is a paid product, and I would also like control over the internals of the LMs rather than just a black box.
Designing a system that creates interrelated blocks of code that integrate into a functioning codebase in response to a user command is a very challenging endeavor. As such, I limited the scope to something much more manageable: generating detailed code from Python function documentation (from here on referred to as docstrings).
In our codebase, we strive to adhere to standards for both docstrings and functions. Every docstring has at minimum the same sections: a description of what the function does, its inputs, and its outputs. We intentionally write layperson-friendly explanations of the pertinent engineering and solar concepts (although we won't repeat these detailed explanations across functions).
In our code, we follow the Google style guide. We strive for consistent variable naming and a particular coding style (i.e. Pandas/NumPy-heavy vectorization, writing for humans, DRY, etc.).
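For illustration, a function in that style might look like the sketch below. The function, its names, and the numbers are invented for this example; none of them come from our codebase:

```python
import pandas as pd

def clip_power(power_kw: pd.Series, capacity_kw: float) -> pd.Series:
    """Clips measured power readings to the plant's rated capacity.

    Inverters occasionally report values above the rated capacity of the
    plant (e.g. during sensor glitches). We treat anything outside the
    range [0, capacity] as a measurement artifact.

    Args:
        power_kw: Measured AC power in kilowatts, indexed by timestamp.
        capacity_kw: Rated plant capacity in kilowatts.

    Returns:
        The power readings clipped to the interval [0, capacity_kw].
    """
    return power_kw.clip(lower=0, upper=capacity_kw)
```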
So we can use the docstrings, which in essence describe what each function does, to generate the function itself. We achieve that by finetuning existing LMs that were trained on code (ideally Python).
I decided to use the following code-specific models: Codegen, Decicoder, and CodeParrot.
My original intent was also to finetune CodeLlama, released by Meta in August 2023; it is a 7B-parameter model that has achieved top performance metrics on code-generation tasks. However, I encountered memory issues training on expensive GPUs of various sizes and had to halt that work temporarily. I'll detail my efforts and the results in a future article.
Our codebase consists of about 10 modules (aka Python files), some of which contain classes. In total there are approximately 200 functions. The functions are of course connected to each other semantically (i.e. related in meaning).
To simplify the problem, though, I basically ignored class definitions and the connections between functions.
I then separated each function into an input section for the docstring and an output section for the function itself.
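A minimal sketch of that extraction step, using Python's `ast` module (the module file name is hypothetical, and our actual pipeline differs in the details):

```python
import ast

def docstring_code_pairs(source: str):
    """Yields (docstring, function_source) pairs from a module's source.

    Classes and functions without docstrings are skipped, mirroring the
    simplifications described above.
    """
    tree = ast.parse(source)
    for node in tree.body:  # top-level definitions only
        if isinstance(node, ast.FunctionDef) and ast.get_docstring(node):
            # Recover the exact source text of the function (Python 3.8+).
            yield ast.get_docstring(node), ast.get_source_segment(source, node)

# Build the finetuning dataset from a module in the codebase.
with open("pv_losses.py") as f:  # hypothetical module name
    pairs = list(docstring_code_pairs(f.read()))
```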
I left 3 functions out of the finetuning dataset in order to test how well the models perform. This is a very small number to base metrics on; I sacrificed metric generalization in order to devote as much data as possible to getting the best model.
I ran the experiments on AzureML in order to make the experiment architecture transparent and reproducible and to leverage cloud compute. Details are in the GitHub repo. I finetuned Codegen and Decicoder for 10 epochs with a batch size of 20, and CodeParrot for 6 epochs with a batch size of 100. For all models, I used a sequence length of 500 tokens on a Standard_E8s_v3 machine (8 vCPUs, 64 GB RAM, 128 GB storage, $0.64/hr). The training took around 10.5 hours.
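As a rough sketch (not our exact AzureML pipeline), finetuning one of the models with the HuggingFace `Trainer` looks something like this. The checkpoint name is one plausible choice rather than the exact size we used, and `pairs` is the docstring/function list built above:

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "Salesforce/codegen-350M-mono"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Concatenate docstring (input) and function (output) into one training text.
texts = [doc + "\n" + code for doc, code in pairs]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=500),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned",
                           num_train_epochs=10,
                           per_device_train_batch_size=20),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```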
In order to ascertain that finetuning really has an effect, it's instructive to first predict on our test functions using the LMs out of the box.
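A sketch of that baseline generation, with the prompt shown as a stand-in for a real test docstring:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen-350M-mono"  # same assumed checkpoint, no finetuning
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = '"""Calculates loss factors (LFs) per day ..."""\n'  # abbreviated
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=500,
                         pad_token_id=tokenizer.eos_token_id)
# Print only the newly generated continuation, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```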
Note that for the test set, I chose 3 functions that represent the range of complexity within our code. I will share 2 of them; the 3rd contains our secret sauce for uncertainty quantification in energy losses.
*** # Calculating LFPs per day ######################################################## --------------------------- ---------------------- ------------------------------ ------------------------------------ ------------------------------------- ------------------------------------------------------------- ------------------------------------------------------------------------ -------------------------------------------------------- ---------- ================================================= ======================================================================= ************************************************** __________________________________________________________ ___________________________________________________________________________________ ................................................................................................................... ////////////////////////////////////// .......................... .... ................. .................. ..... ....... ...... ............. .# ## ######## ##... ###..................... ##### | | || || |||| │ │ ┌─╸̅͘═´¯`·‿»¬¶•Àº«¨®ƒœ№●†Â©™Ã±ôˆ¢˜ħø∙߀£$&()+–—→½âêîûëïöüçãàèẽðéóúñæåäíǎoʼɪțᵉδύπαρμος εκτινελωσγ χác ở Điūr ánātōu žaškējībén kai mikrofon takus ved nyauzumot luktiem daraje og sveitam udavimoseten bij eksplorativni tehte görjaan jaunomis zapsojučia işlemeji vektoriye hilvasojamega arabievi yerdezaqta qaytoje niurraki razkojoğunut koordinaatsizna farkında birini tanlangacaks iradigineci kelimeyi alma sahiptama verdi!\xa0", '''PV Systs'', 'Location': {}, 'Uncertainties:'. ', '']}')**
# Calculate start effecitve levels assuming constant relative error
self._calc()
# def _getlocdata():
return locdf['location'].values[startidx:]
# TODO check this function's docstring!
return self._calc('pfv', 'pve')
class LOSSExposureLossCalculator():
def __init__(
cls=None,): pass
@classmethod
@_withparamargs() ## args[0] = name or path
### optional parameters
#~ parser object containing all command line arguments; default None
_parser =_argparse_.ArgumentParser("Calculating lsexplosion")
################### argument parsing #####################################
class ArgumentParser(_argumentParserCls.__subclass__):
#=========================================================================#
#@property
#: A dictionary holding keyword arguements parsed into an instance
keywords dict([(str,'--'+argname)])
kwargs ={}
help =\
'''
Typical usage example
>>> pvtlptopr --path=/home2/.localrc
... /usr \\
| \ `./bin` ./scripts\\runscript
.. note::
If you want more control over your own use cases please see
http:#python3 <stdin>
''')
positional := ('positional','optional' )
keyval <- {'key':'value'} ; defaults will override
any other keys set here
group ::= '-' delimited list
Specifies groups used within lists
eegspec ::
* ``['EEG']``
``'MEG'``
``"MEL"`` -> ['EMS'|'MOLE'|etc.]
**Note:** Only one pair can appear per file type!!!**
===================== ============================================== =======
Keywords Description
Optional Arguments Meaning
------------------- ----------------------------------------------------
0 No evidence
2 Yes
3 YES
4 NO
5 NONE
6 ODD
7 ONCE
8 OFFSET
9 SURFACE
Example Usage
--------------
To create new argparse subclass like usual
``` python
import sys
cpthlp .main main cpcpsolvertools\ncp thlcpactools lscpilotter
None of the models produce useful, let alone correct, output. But they are amusing.
Codegen's prediction shows the model's affinity for dashes, equal signs, and asterisks. It proceeds through a random assortment of Greek and what looks to be some Central European language (maybe Czech?), and concludes with 3 solar-related terms. CodeParrot in turn generates a class that seems to be a hodge-podge of programming languages. Surprisingly, there's a big docstring section within the function definition itself.
# Get distances from each point to all other points on this line segment: (1) find closest neighbor; 2). Find nearest neighbour for every index that is within range [0..360]!!! !!!!!!!!!!!!!!!???????????? = 1/2 * 360^n + n where 0 <= i < len(_indexes)-i - 3 # TODO check if we can use np instead or just using pandarallel's linear interpolation function here... maybe not as it seems like there are many ways but I think its ok because they're very similar anyways... :) !!!!!!!!????????? = https://stackoverflow-questions@python3k/?qid=(RU5TJQKLZYHXWVFTVA4B7EI6M8DG9C&tb=/rjmhgwvfzcvs%40gmailcomposermailboxapplicationservicesystemsolutionsandemailaddressoftheuserwithaverythingthatisnotinhereofthesystemsoverwritenotthisoneforyou!)
# TODO add option when using poasimulation instead? (not yet implemented?)!!! - jhb 2020/10 / 2021 0729 : added options... maybe not needed anymore??!?!? #TODOS ADDED OPTIONS TO IMPUTE ACM VALUES HERE!! :) --jk -- Added code from above but now it's just an example.... see https://github2solutionsblogger@gmailcom/?p=1&q=' + str('https:'+str("http:")) ) ---added 7 June 2019 --- updated version 1 July 2018 ----> changed all functions so they are more readable than original ones; also modified function name "get" --> get() -> return object at index i within list lst[i] = [lstsize][0], then use.iloc[] accessor rather though.. ;-) -----> fixed bug due no longer being able ot work because I have been changing my own implementation here since 6th March 2017... <------ fix later!! --> *** ** **** ***** * ******** ************ **************** ********************************================== !!!!!!!!!!!!!!!!! ^^^^^^^ ###################### ########### ################################
# Convert to cartesian coords for easier manipulation later on if needed...
lon = nputils._cartesians2cart(_getsitecoords([float((y - _lng) / (l * maths))
+ (_latitude ** 2),
((1-mathsat)*npradians(-90)), 0],
[0])[:3]
)
# lon[abs(((longitude+180.) % 360.)) > 180.] -= 1e6
return round((((360.-southpoleshift)/3600.), 6),)
#!/usr/bin python
import sys
class Node:
def __init__(self,*args,**kwargs ):
self.__dict__.update(*zip(('node','parent', 'children'))+[getattr(__builtins__,name,'__doc__')]+list('abc')).extend(['data'])
@property
class data():
pass
node=Node()
root={}
children=[]
nodes=[root['child']];
while len(''.join(['%d'%ifor i,_idatadayerow]))>10 or not ''.split('\n')[-4:]=='\r\t':
try:#while True:[]:
print("Entering")
except IndexError as emsg:'error' is raised when there are no more items available.':print("%%Error:%c" %(eems(),traceback()))
break
else:_+=sys," "*(len(','.ljust('%02X'%(int(.5*random()*100))))+' ')+" "*20+'\033'+''.rjust(\
random().choice(["a","b"])+"\001"+"\002",'')+"_"*50+\
"".center((" ",".")[randrange_(8)]+") \
del roots[_][::-7];
child=_nodes[:]
parent='';
elif '.txt.'in str("".zfill("{:.9f}".format('.').replace('.', ''))).lower()+str(".csv").strip('.')!="":
'''if args == None:\nskip'''
input='' ; output=""
ctrls=['\\']
lcnt=-99
sctr='\''
tline="\""
rptxt="'"
ofile="%\\"
finput="""printf "%%%ds %%-%is | %-15.%df"|"%%.12g","%%",$,%H,$,\`h$\",\\"%\"",\"\"';";\"""
outfile='%%(outfile)#ostempath+"/tmp/"+output+".out";
infile='''echo "$infile##$" >> $oFILE$$$'
cat '$filename#'/'*' '*16000| '+${SOURCES} ${TARGETS}'
&& echo "${TARGET}">> $@'"""
outputimg='/dev/'+OUTPUT+'/stdout_'+INPUT+'.log'+'_'+'_logfile_.err'#'/var/$SOURCEPATH/_LOGS/{SRC}/*.dat','./*{TGT}.html,.png|.svg/.pdf'/'./images/')
cmd=(cmdln).encode()\
.decode('$otput[]');
osdialink="/home/"#"C:\\Users//Public\\\\Documents"/GitHub\/PycharmProjectsPathwayProjectFilesAndSettingsDatasetsWithPythonLib \\src\\\scripts\""
command=('cd "'+_scriptpath+');' +\
('python lib../pystringdb --nologo &>/build/%jasminishment_%JASMINISHMENT_%(NAME)-configfiles&./runconfall "%(command)+configfile);
punc='$PUNCHERON='+LPTSTRMDIRDATADIRSHERE__[LSRC]+'LIBROOTSYSCONFPROFCSCRIPTEN=$PYTHONPATH:/opt/%(LDMODULEPREFIX)'
runshed='.bashrc ~/.profilecfg../../*.ini
This is a fairly straightforward function, but the results are bad. It appears the docstring format is doing more harm than good.
Now let's examine the predictions from the finetuned models. Note that I had to manually tweak the parameters min_new_tokens and max_new_tokens at inference time in order to keep the models from generating extremely short results.
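Those two knobs plug straight into `generate`; reusing the model and tokenizer objects from the baseline sketch (now pointing at a finetuned checkpoint), the constrained call looks roughly like this:

```python
# Force at least 200 new tokens so the finetuned models don't stop almost
# immediately, and cap generation at 1000 tokens.
out = model.generate(
    **inputs,
    min_new_tokens=200,
    max_new_tokens=1000,
    pad_token_id=tokenizer.eos_token_id,
)
```

The finetuned predictions on the test functions follow.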
name = 'lf_pv_eff'
logging.info('Calculating %s', name)
lpb, upb, asymmetry_factor, pv_eff_unc, pv_lid_coef, pv_degrad_coef, pv_mql_coef = user_settings.ilf_parameters[
name]
# uncertainty calculation
pv_lid_var = self.sdm_est[self.var_names.index('PV Meter, Real Power')]
pv_lid_unc = self.sdm_est_unc_half[self.var_names.index('PV Meter, Real Power')]
pv_degrad_var = self.sdm_est[self.var_names.index('PV Meter, Real Power')]
pv_degrad_unc = self.sdm_est_unc_half[self.var_names.index('PV Meter, Real Power')]
pv_mql_var = self.sdm_est[self.var_names.index('PV Meter, Real Power')]
pv_mql_unc = self.sdm_est_unc_half[self.var_names.index('PV Meter, Real Power')]
# calculate lower and upper bounds
lpb = lpb * pv_eff_unc / 100
lpb = pd.DataFrame(np.clip(lpb, a_min=0, a_max=None),
index=self.sdm_data.index,
columns=self.sdm_data.columns)
lpb.name = 'lf_bnds'
upb = upb * pv_eff_unc / 100
upb = pd.DataFrame(np.clip(upb, a_min=0, a_max=None),
index=self.sdm_data.index,
columns=self.sdm_data.columns)
upb.name = 'lf_bnds'
...
df.loc[:, 'lf_bnds'] = pd.concat([df['lf_bnds'], lf_bnds], axis=1)
return df
# Uncertainty parameters taken directlyfromPVSys results table
eff = self._usersettings['ilf params']['PV Array Effective Percentage'] / \
float((len(_YEARS)-3)) *.98 ## % reduction per opperate yr
# print('effective', _EFFICIENCY,'\n')
####-------------------------------------------------------
###--------------------------------------------------------------------
def lfcalc_(x_, pvldata_=None,_=__'):
'''Helper function used within apply_.
Parameters
----------
X_: np arr
Contains uncetainty information
efficiencies
list
List containing all necessary input arguments
Returnes
--------
res
LF'''
loc , lpb _, upb *_ =(X_[i]for i,(lpi,),upbi
) #: Location parameter
:,lower physical bound *,upper
PhysicalBound
lpvli :=loc*pf*(
~npaay[vi])**poavariaj
#(Location Parameter Multiplier
*(
Loc
** poaijj
))
dlvrd *= pf**(
abs(((lprivatenames[_]-
'module sizing paramters').replace('_','')))))
invvar /= ((
df[(df['_year'].isin([int(_)])) &
~(pd
._isna(('inv var ['+str(_.split('-')[
len(-4))]+'_'
f'{ _. split("-")[
(-6)]}'
)))].sum())
return [dpvldrivered]*7
@apply_()
#############Uncomment this line after applying custom funcitonality############
###############################################################
#####Applying Custom Function######################################
d = userData['data']
# d2=userInfo[site]['dni'].values()
# convert to numpy array for speedup: (lon1 - lon0) / 2 + 1j * npabs((y-npz)/1000)**3
distancs=[
(-180*maths**5)/(4*(numpypi/360))+(-90*(-1536/(167772))*cos(((Yr-%20)*sin((((Xrs % 360)-45)))+(((-40.*sqrt(%(Weather station azimuth)^6))))))),\
cos (((Zmax+%86400)*(tan ((Ave%24))/3600)), \
sin z=-atan(\
sqrt(\\frac{{degrees}{30}}{\mu l}
{\sigma s}\right)\
)+\
atan({l})-\
{{radians}})>_{radius:.7f}".format(**locals())
]
return sum([distanc.__div__(v).sum().item()[None] if v is not None else 0
.round('NaN')
])
if __name__ == '__main__':
print 'Calculating distances'
dataFile='../datadrive/'+'POI*.csv?wkt=%23'+'\n'+'NODATA'.encode("utf")+'\t?'+"POINTS".decode()+"\nsymmetry"
dfw=''; fghb=[] ; irfh=['N','NE'], [str,'NA', '']
waterp=[[]]
res={}
pvlp={'poa':[],'prjt':'Total POIs'}
def calcDistance():
global pointillarypvc
tmplist=[(k,[])for k,_row
,dfrw
][cfgr].count(',')
templst=(pdDataFrame({'PoInDic':{'$exists':[True]}}),
{col:[]})
totalflag=('SUM',)
avrgarr=((tmpdict[[]]) & col)[-len(['Sum']) > len('$')]
invvar={key:{val:{}}.index('.').replace('-', '')
['Pacm{}'.find('{} {}'.
format(*invthrd)[::-12])]
[.strip('-')
replace('.', '').split()]
varnames=[('{} {} '.
join(', ',
key[:-9]).upper(), val[-11:])
).lower()[:10]+''.join(':')[:13]+'00'*14+'.000*'
outdatedvars=',|>'.ljust(['%02dm'%el
''
'|{:^63}'.center('|')[
max('%03dk'%(int(_), 16)+
('.'+_avcgum)[:18], ':o:')
'-'
],'| ')
fmt='%-'+'_-+='.rjust('')
try:#os stats module first as it's faster than using list comprehension because I have no idea what that does this mean...
statinfo=_stats_.calcStatSummary()['mean'][outdtlk][tempdctfn](
loc[:,:]==polewrubattuysummary
)[colspan:].sortlevel()\
([float('-inf'),'-99%',
'-9999'])[columnslideings]\
(['GHI']*100+[greatcircle]*50*[halfellipse])*25*\
hstack()[-500:-1500]+\
greatest('*cumulative*')
info_=statsys._genStatsResultTable(__file__,
names=__fieldNames__)
_res_['Mean']=pandasWrapper_(output='',
usemask="auto", dtype={"O":object,"V":"double"}).\
groupBy('__year')\
._toDF(('Year',))
retvals=\
pandasWrapResultsTracker()(retkeys_,
fields=[],dtype={}) +\
ColumnDescriptorsToDicts__()
outputterror='\nThe following columns are missing from your final results:\nIf you want '\
'\'Missing values at all,\nthe \'NULL\' column will be filled automatically!\"';\
"\nor they may contain null or undefined entries."
#'No Data Found For This Location';'';';"Have fun!'`shall we go! We can't do much better on these without having any impact when trying '+'"Heavy Metrics Offshore".'
#"Average Error Between Values";""}'`'|||'&&
# convert to radians
site_lat = site_lat * np.pi / 180
site_long = site_long * np.pi / 180
# convert to degrees
x_rad = x * np.pi / 180
y_rad = abs(site_lat - x_rad)
x_deg = np.rad2deg(x_rad)
y_deg = np.rad2deg(y_rad)
# distance is the angle between the two lines
# of the Earth's surface and the plane of the earth
# with respect to the Earth's plane
# theta = np.arctan((y_deg)/(x_deg))
# distance = np.abs(theta * 180 / np.pi)
return abs(site_long - x_deg)
# Convert to radians
lat_rad = np.radians(x.lat_rad)
long_rad = np.radians(x.long_rad)
# Distance between points
d = np.sin(lat_rad[0]) * np.sin(lat_rad[1]) + np.cos(lat_rad[0]) * \
np.cos(lat_rad[1]) * np.cos(long_rad[0] - long_rad[1])
# Convert to meters
return np.rad2deg(np.sqrt(d))
Codegen and Decicoder make a valiant effort at generating the correct code. All the models get tripped up on Function 1. For Function 2, Codegen and Deci approach the correct methodology, but they confuse trigonometric functions and mix up how to manipulate the input variables.
LM metrics are an active research field. You will see terms such as 'state-of-the-art' thrown around in reference to the latest models. However, the metrics aren't yet standardized enough to allow model-performance comparisons without delving into the details of how each metric was set up.
Metrics can be divided into 2 categories: human and automated. Human evaluation is reliable but difficult and expensive to scale. The popular automated metrics are Bleu, ChrF, and Ruby, among others. These are all variants of computing what share of predicted characters or n-grams match the ground truth. Currently, a popular benchmark you'll see is HumanEval, which is misnamed since it's actually an automated procedure. Its approach differs from the metrics referenced above: each problem contains a function prompt and an associated unit test that a successful output would pass. So we can feed the model under test this dataset and count how many of the unit tests its outputs pass.
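Conceptually, a single HumanEval-style check boils down to something like the sketch below. The real harness sandboxes execution and aggregates pass@k over many samples; the candidate and test strings here are invented:

```python
def passes_unit_test(generated_code: str, test_code: str) -> bool:
    """Returns True if the generated function passes its unit test."""
    namespace = {}
    try:
        exec(generated_code, namespace)  # define the candidate function
        exec(test_code, namespace)       # run the assert-based test
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b"
test = "assert add(2, 3) == 5"
print(passes_unit_test(candidate, test))  # True
```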
Researchers have noted limitations: HumanEval functions mostly focus on short, specific computer-science tasks, so it is unclear how the scores generalize to other domains. Additionally, the evaluation is binary, making it impossible to gauge output quality when a result doesn't pass the unit test.
Note that the metrics below were computed with a minimum generation length of 200 tokens and a maximum of 1000.
| Model | HumanEval@1 (reference) | Bleu (baseline) | Bleu (finetuned) | ChrF (baseline) | ChrF (finetuned) |
|---|---|---|---|---|---|
| Codegen | 12.76 | 0 | 0.08 | 8.67 | 20.98 |
| Decicoder | 19.1 | 0 | 0.10 | 6.11 | 30.48 |
| CodeParrot | 3.99 | 0 | 0.006 | 19.1 | 18.77 |
The low Bleu scores reflect that Bleu is actually a pretty strict metric: at least one 4-gram (a run of 4 consecutive tokens) must match for the score to rise above 0. The ChrF scores suggest that Decicoder performs best of the three models. However, we only have 3 samples, and it's dubious to base conclusions on such a small sample size. Researchers have shown that even with a large number of samples, a difference in metrics between models of under 2% isn't meaningful (i.e. statistically significant); see page 11 of the Evtikhiev paper linked at the end of this article.
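For what it's worth, computing these scores is only a few lines with the `sacrebleu` package (the file names here are hypothetical):

```python
from sacrebleu.metrics import BLEU, CHRF

# Ground-truth function bodies vs. model predictions for the test set.
references = [open(p).read() for p in ("fn1_true.py", "fn2_true.py")]
hypotheses = [open(p).read() for p in ("fn1_pred.py", "fn2_pred.py")]

bleu, chrf = BLEU(), CHRF()
# BLEU stays at 0 unless at least one 4-gram matches the reference.
print(bleu.corpus_score(hypotheses, [references]).score)
print(chrf.corpus_score(hypotheses, [references]).score)
```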
The main takeaway is that even if we tweak the parameters for prediction, the results are not practical. I included the HumanEval@1 scores for context, though they aren't actually useful in our scenario: we care whether the machine can generate the functions we have, with all of the idiosyncrasies of our project, not solve generic computer-science problems.
Some ideas for improvement / thoughts:
Returning to the questions posed at the outset: finetuning definitely improves the predictions, with Decicoder performing best of the 3. But the functions do not run, and they are pretty far from correct. I would really like to see how CodeLlama performs on this. Stay tuned!
A Syntactic Neural Model for General-Purpose Code Generation. Yin et al., 2017.04.06
Can AI Code — metrics on HuggingFace
Evaluating Large Language Models Trained on Code. Chen et al., 2021.07.14
Out of the BLEU: How Should We Assess Quality of the Code Generation Models? Evtikhiev et al., 2023.05.10
SkCoder: A Sketch-based Approach for Automatic Code Generation. Li et al., 2023.07.09
WizardCoder: Empowering Code Large Language Models with Evol-Instruct. Luo et al., 2023.06.14