One Friday afternoon, while planning the following week's software development work, a thought crossed my mind: "Wouldn't it be nice if I could issue a set of instructions about the intended feature and have the machine take at least a first pass at writing the relevant functions for me?"
Large language models (from here on referred to as LMs) have gotten a lot of attention in 2023. So the idea was to see how well these LMs, finetuned on our company's code (which focuses on predicting energy output from PV plants), perform on a much simpler task.
First, to get it out of the way: I am of course familiar with GitHub Copilot. But Copilot is a paid product, and I would also like control over the internals of the LMs rather than just a black box.
Designing a system that creates interrelated blocks of code that integrate into a functioning codebase in response to a user command is a very challenging endeavor. As such, I limited the scope to something much more manageable: generating detailed code from Python function documentation (from here on referred to as docstrings).
In our codebase, we strive to adhere to standards for both docstrings and functions. Every docstring has at minimum the same sections: a description of what the function does, its inputs, and its outputs. We intentionally write layperson-friendly explanations of the pertinent engineering and solar concepts (although we won't repeat these detailed explanations across functions).
In our code, we follow the Google style guide. We strive for consistent variable naming and a particular coding style (i.e. Pandas/NumPy-heavy vectorization, writing for humans, DRY, etc.).
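For illustration, a function in that style might look like the sketch below. The function, its names, and the numbers are invented for this example; none of them come from our codebase:

```python
import pandas as pd

def clip_power(power_kw: pd.Series, capacity_kw: float) -> pd.Series:
    """Clips measured power readings to the plant's rated capacity.

    Inverters occasionally report values above the rated capacity of the
    plant (e.g. during sensor glitches). We treat anything outside the
    range [0, capacity] as a measurement artifact.

    Args:
        power_kw: Measured AC power in kilowatts, indexed by timestamp.
        capacity_kw: Rated plant capacity in kilowatts.

    Returns:
        The power readings clipped to the interval [0, capacity_kw].
    """
    return power_kw.clip(lower=0, upper=capacity_kw)
```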
So we can use the docstrings, which in essence describe what each function does, to generate the function itself. We achieve that by finetuning existing LMs that were trained on code (ideally Python).
I decided to use the following code-specific models: Codegen, Decicoder, and CodeParrot.
My original intent was also to finetune CodeLlama, released by Meta in August 2023; it is a 7B-parameter model that has achieved top performance metrics on code-generation tasks. However, I encountered memory issues training on expensive GPUs of various sizes and had to halt that work temporarily. I'll detail my efforts and the results in a future article.
Our codebase consists of about 10 modules (aka Python files), some of which contain classes. In total there are approximately 200 functions. The functions are of course connected to each other semantically (i.e. related in meaning).
To simplify the problem, though, I basically ignored class definitions and the connections between functions.
I then separated each function into an input section for the docstring and an output section for the function itself.
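A minimal sketch of that extraction step, using Python's `ast` module (the module file name is hypothetical, and our actual pipeline differs in the details):

```python
import ast

def docstring_code_pairs(source: str):
    """Yields (docstring, function_source) pairs from a module's source.

    Classes and functions without docstrings are skipped, mirroring the
    simplifications described above.
    """
    tree = ast.parse(source)
    for node in tree.body:  # top-level definitions only
        if isinstance(node, ast.FunctionDef) and ast.get_docstring(node):
            # Recover the exact source text of the function (Python 3.8+).
            yield ast.get_docstring(node), ast.get_source_segment(source, node)

# Build the finetuning dataset from a module in the codebase.
with open("pv_losses.py") as f:  # hypothetical module name
    pairs = list(docstring_code_pairs(f.read()))
```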
I left 3 functions out of the finetuning dataset in order to test how well the models perform. This is a very small number to base metrics on; I sacrificed metric generalization in order to devote as much data as possible to getting the best model.
I ran the experiments on AzureML in order to make the experiment architecture transparent and reproducible and to leverage cloud compute. Details are in the GitHub repo. I finetuned Codegen and Decicoder for 10 epochs with a batch size of 20, and CodeParrot for 6 epochs with a batch size of 100. For all models, I used a sequence length of 500 tokens on a Standard_E8s_v3 machine (8 vCPUs, 64 GB RAM, 128 GB storage, $0.64/hr). The training took around 10.5 hours.
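As a rough sketch (not our exact AzureML pipeline), finetuning one of the models with the HuggingFace `Trainer` looks something like this. The checkpoint name is one plausible choice rather than the exact size we used, and `pairs` is the docstring/function list built above:

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "Salesforce/codegen-350M-mono"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Concatenate docstring (input) and function (output) into one training text.
texts = [doc + "\n" + code for doc, code in pairs]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=500),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned",
                           num_train_epochs=10,
                           per_device_train_batch_size=20),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```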
In order to ascertain that finetuning really has an effect, it's instructive to first predict on our test functions using the LMs out of the box.
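A sketch of that baseline generation, with the prompt shown as a stand-in for a real test docstring:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen-350M-mono"  # same assumed checkpoint, no finetuning
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = '"""Calculates loss factors (LFs) per day ..."""\n'  # abbreviated
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=500,
                         pad_token_id=tokenizer.eos_token_id)
# Print only the newly generated continuation, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```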
Note that for the test set, I chose 3 functions that represent the range of complexity within our code. I will share 2 of them; the 3rd contains our secret sauce for uncertainty quantification in energy losses.
*** # Calculating LFPs per day ######################################################## --------------------------- ---------------------- ------------------------------ ------------------------------------ ------------------------------------- ------------------------------------------------------------- ------------------------------------------------------------------------ -------------------------------------------------------- ---------- ================================================= ======================================================================= ************************************************** __________________________________________________________ ___________________________________________________________________________________ ................................................................................................................... ////////////////////////////////////// .......................... .... ................. .................. ..... ....... ...... ............. .# ## ######## ##... ###..................... ##### | | || || |||| │ │ ┌─╸̅͘═´¯`·‿»¬¶•Àº«¨®ƒœ№●†Â©™Ã±ôˆ¢˜ħø∙߀£$&()+–—→½âêîûëïöüçãàèẽðéóúñæåäíǎoʼɪțᵉδύπαρμος εκτινελωσγ χác ở Điūr ánātōu žaškējībén kai mikrofon takus ved nyauzumot luktiem daraje og sveitam udavimoseten bij eksplorativni tehte görjaan jaunomis zapsojučia işlemeji vektoriye hilvasojamega arabievi yerdezaqta qaytoje niurraki razkojoğunut koordinaatsizna farkında birini tanlangacaks iradigineci kelimeyi alma sahiptama verdi!\xa0", '''PV Systs'', 'Location': {}, 'Uncertainties:'. ', '']}')**
# Calculate start effecitve levels assuming constant relative error
self._calc()
# def _getlocdata():
return locdf['location'].values[startidx:]
# TODO check this function's docstring!
return self._calc('pfv', 'pve')
class LOSSExposureLossCalculator():
def __init__(
cls=None,): pass
@classmethod
@_withparamargs() ## args[0] = name or path
### optional parameters
#~ parser object containing all command line arguments; default None
_parser =_argparse_.ArgumentParser("Calculating lsexplosion")
################### argument parsing #####################################
class ArgumentParser(_argumentParserCls.__subclass__):
#=========================================================================#
#@property
#: A dictionary holding keyword arguements parsed into an instance
keywords dict([(str,'--'+argname)])
kwargs ={}
help =\
'''
Typical usage example
>>> pvtlptopr --path=/home2/.localrc
... /usr \\
| \ `./bin` ./scripts\\runscript
.. note::
If you want more control over your own use cases please see
http:#python3 <stdin>
''')
positional := ('positional','optional' )
keyval <- {'key':'value'} ; defaults will override
any other keys set here
group ::= '-' delimited list
Specifies groups used within lists
eegspec ::
* ``['EEG']``
``'MEG'``
``"MEL"`` -> ['EMS'|'MOLE'|etc.]
**Note:** Only one pair can appear per file type!!!**
===================== ============================================== =======
Keywords Description
Optional Arguments Meaning
------------------- ----------------------------------------------------
0 No evidence
2 Yes
3 YES
4 NO
5 NONE
6 ODD
7 ONCE
8 OFFSET
9 SURFACE
Example Usage
--------------
To create new argparse subclass like usual
``` python
import sys
cpthlp .main main cpcpsolvertools\ncp thlcpactools lscpilotter
None of the models produce useful, let alone correct, output. But they are amusing.
Codegen's prediction shows the model's affinity for dashes, equal signs, and asterisks. It proceeds through a random assortment of Greek and what looks to be some Central European language (maybe Czech?), and concludes with 3 solar-related terms. CodeParrot in turn generates a class that seems to be a hodge-podge of programming languages. Surprisingly, there's a big docstring section within the function definition itself.
# Get distances from each point to all other points on this line segment: (1) find closest neighbor; 2). Find nearest neighbour for every index that is within range [0..360]!!! !!!!!!!!!!!!!!!???????????? = 1/2 * 360^n + n where 0 <= i < len(_indexes)-i - 3 # TODO check if we can use np instead or just using pandarallel's linear interpolation function here... maybe not as it seems like there are many ways but I think its ok because they're very similar anyways... :) !!!!!!!!????????? = https://stackoverflow-questions@python3k/?qid=(RU5TJQKLZYHXWVFTVA4B7EI6M8DG9C&tb=/rjmhgwvfzcvs%40gmailcomposermailboxapplicationservicesystemsolutionsandemailaddressoftheuserwithaverythingthatisnotinhereofthesystemsoverwritenotthisoneforyou!)
# TODO add option when using poasimulation instead? (not yet implemented?)!!! - jhb 2020/10 / 2021 0729 : added options... maybe not needed anymore??!?!? #TODOS ADDED OPTIONS TO IMPUTE ACM VALUES HERE!! :) --jk -- Added code from above but now it's just an example.... see https://github2solutionsblogger@gmailcom/?p=1&q=' + str('https:'+str("http:")) ) ---added 7 June 2019 --- updated version 1 July 2018 ----> changed all functions so they are more readable than original ones; also modified function name "get" --> get() -> return object at index i within list lst[i] = [lstsize][0], then use.iloc[] accessor rather though.. ;-) -----> fixed bug due no longer being able ot work because I have been changing my own implementation here since 6th March 2017... <------ fix later!! --> *** ** **** ***** * ******** ************ **************** ********************************================== !!!!!!!!!!!!!!!!! ^^^^^^^ ###################### ########### ################################
# Convert to cartesian coords for easier manipulation later on if needed...
lon = nputils._cartesians2cart(_getsitecoords([float((y - _lng) / (l * maths))
+ (_latitude ** 2),
((1-mathsat)*npradians(-90)), 0],
[0])[:3]
)
# lon[abs(((longitude+180.) % 360.)) > 180.] -= 1e6
return round((((360.-southpoleshift)/3600.), 6),)
#!/usr/bin python
import sys
class Node:
def __init__(self,*args,**kwargs ):
self.__dict__.update(*zip(('node','parent', 'children'))+[getattr(__builtins__,name,'__doc__')]+list('abc')).extend(['data'])
@property
class data():
pass
node=Node()
root={}
children=[]
nodes=[root['child']];
while len(''.join(['%d'%ifor i,_idatadayerow]))>10 or not ''.split('\n')[-4:]=='\r\t':
try:#while True:[]:
print("Entering")
except IndexError as emsg:'error' is raised when there are no more items available.':print("%%Error:%c" %(eems(),traceback()))
break
else:_+=sys," "*(len(','.ljust('%02X'%(int(.5*random()*100))))+' ')+" "*20+'\033'+''.rjust(\
random().choice(["a","b"])+"\001"+"\002",'')+"_"*50+\
"".center((" ",".")[randrange_(8)]+") \
del roots[_][::-7];
child=_nodes[:]
parent='';
elif '.txt.'in str("".zfill("{:.9f}".format('.').replace('.', ''))).lower()+str(".csv").strip('.')!="":
'''if args == None:\nskip'''
input='' ; output=""
ctrls=['\\']
lcnt=-99
sctr='\''
tline="\""
rptxt="'"
ofile="%\\"
finput="""printf "%%%ds %%-%is | %-15.%df"|"%%.12g","%%",$,%H,$,\`h$\",\\"%\"",\"\"';";\"""
outfile='%%(outfile)#ostempath+"/tmp/"+output+".out";
infile='''echo "$infile##$" >> $oFILE$$$'
cat '$filename#'/'*' '*16000| '+${SOURCES} ${TARGETS}'
&& echo "${TARGET}">> $@'"""
outputimg='/dev/'+OUTPUT+'/stdout_'+INPUT+'.log'+'_'+'_logfile_.err'#'/var/$SOURCEPATH/_LOGS/{SRC}/*.dat','./*{TGT}.html,.png|.svg/.pdf'/'./images/')
cmd=(cmdln).encode()\
.decode('$otput[]');
osdialink="/home/"#"C:\\Users//Public\\\\Documents"/GitHub\/PycharmProjectsPathwayProjectFilesAndSettingsDatasetsWithPythonLib \\src\\\scripts\""
command=('cd "'+_scriptpath+');' +\
('python lib../pystringdb --nologo &>/build/%jasminishment_%JASMINISHMENT_%(NAME)-configfiles&./runconfall "%(command)+configfile);
punc='$PUNCHERON='+LPTSTRMDIRDATADIRSHERE__[LSRC]+'LIBROOTSYSCONFPROFCSCRIPTEN=$PYTHONPATH:/opt/%(LDMODULEPREFIX)'
runshed='.bashrc ~/.profilecfg../../*.ini
This is a fairly straightforward function, but the results are bad. It appears the docstring format is doing more harm than good.
Now let's examine the predictions from the finetuned models. Note that I had to manually tweak the parameters min_new_tokens and max_new_tokens at inference time in order to keep the models from generating extremely short results.
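Those two knobs plug straight into `generate`; reusing the model and tokenizer objects from the baseline sketch (now pointing at a finetuned checkpoint), the constrained call looks roughly like this:

```python
# Force at least 200 new tokens so the finetuned models don't stop almost
# immediately, and cap generation at 1000 tokens.
out = model.generate(
    **inputs,
    min_new_tokens=200,
    max_new_tokens=1000,
    pad_token_id=tokenizer.eos_token_id,
)
```

The finetuned predictions on the test functions follow.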
name = 'lf_pv_eff'
logging.info('Calculating %s', name)
lpb, upb, asymmetry_factor, pv_eff_unc, pv_lid_coef, pv_degrad_coef, pv_mql_coef = user_settings.ilf_parameters[
name]
# uncertainty calculation
pv_lid_var = self.sdm_est[self.var_names.index('PV Meter, Real Power')]
pv_lid_unc = self.sdm_est_unc_half[self.var_names.index('PV Meter, Real Power')]
pv_degrad_var = self.sdm_est[self.var_names.index('PV Meter, Real Power')]
pv_degrad_unc = self.sdm_est_unc_half[self.var_names.index('PV Meter, Real Power')]
pv_mql_var = self.sdm_est[self.var_names.index('PV Meter, Real Power')]
pv_mql_unc = self.sdm_est_unc_half[self.var_names.index('PV Meter, Real Power')]
# calculate lower and upper bounds
lpb = lpb * pv_eff_unc / 100
lpb = pd.DataFrame(np.clip(lpb, a_min=0, a_max=None),
index=self.sdm_data.index,
columns=self.sdm_data.columns)
lpb.name = 'lf_bnds'
upb = upb * pv_eff_unc / 100
upb = pd.DataFrame(np.clip(upb, a_min=0, a_max=None),
index=self.sdm_data.index,
columns=self.sdm_data.columns)
upb.name = 'lf_bnds'
...
df.loc[:, 'lf_bnds'] = pd.concat([df['lf_bnds'], lf_bnds], axis=1)
return df
# Uncertainty parameters taken directlyfromPVSys results table
eff = self._usersettings['ilf params']['PV Array Effective Percentage'] / \
float((len(_YEARS)-3)) *.98 ## % reduction per opperate yr
# print('effective', _EFFICIENCY,'\n')
####-------------------------------------------------------
###--------------------------------------------------------------------
def lfcalc_(x_, pvldata_=None,_=__'):
'''Helper function used within apply_.
Parameters
----------
X_: np arr
Contains uncetainty information
efficiencies
list
List containing all necessary input arguments
Returnes
--------
res
LF'''
loc , lpb _, upb *_ =(X_[i]for i,(lpi,),upbi
) #: Location parameter
:,lower physical bound *,upper
PhysicalBound
lpvli :=loc*pf*(
~npaay[vi])**poavariaj
#(Location Parameter Multiplier
*(
Loc
** poaijj
))
dlvrd *= pf**(
abs(((lprivatenames[_]-
'module sizing paramters').replace('_','')))))
invvar /= ((
df[(df['_year'].isin([int(_)])) &
~(pd
._isna(('inv var ['+str(_.split('-')[
len(-4))]+'_'
f'{ _. split("-")[
(-6)]}'
)))].sum())
return [dpvldrivered]*7
@apply_()
#############Uncomment this line after applying custom funcitonality############
###############################################################
#####Applying Custom Function######################################
d = userData['data']
# d2=userInfo[site]['dni'].values()
# convert to numpy array for speedup: (lon1 - lon0) / 2 + 1j * npabs((y-npz)/1000)**3
distancs=[
(-180*maths**5)/(4*(numpypi/360))+(-90*(-1536/(167772))*cos(((Yr-%20)*sin((((Xrs % 360)-45)))+(((-40.*sqrt(%(Weather station azimuth)^6))))))),\
cos (((Zmax+%86400)*(tan ((Ave%24))/3600)), \
sin z=-atan(\
sqrt(\\frac{{degrees}{30}}{\mu l}
{\sigma s}\right)\
)+\
atan({l})-\
{{radians}})>_{radius:.7f}".format(**locals())
]
return sum([distanc.__div__(v).sum().item()[None] if v is not None else 0
.round('NaN')
])
if __name__ == '__main__':
print 'Calculating distances'
dataFile='../datadrive/'+'POI*.csv?wkt=%23'+'\n'+'NODATA'.encode("utf")+'\t?'+"POINTS".decode()+"\nsymmetry"
dfw=''; fghb=[] ; irfh=['N','NE'], [str,'NA', '']
waterp=[[]]
res={}
pvlp={'poa':[],'prjt':'Total POIs'}
def calcDistance():
global pointillarypvc
tmplist=[(k,[])for k,_row
,dfrw
][cfgr].count(',')
templst=(pdDataFrame({'PoInDic':{'$exists':[True]}}),
{col:[]})
totalflag=('SUM',)
avrgarr=((tmpdict[[]]) & col)[-len(['Sum']) > len('$')]
invvar={key:{val:{}}.index('.').replace('-', '')
['Pacm{}'.find('{} {}'.
format(*invthrd)[::-12])]
[.strip('-')
replace('.', '').split()]
varnames=[('{} {} '.
join(', ',
key[:-9]).upper(), val[-11:])
).lower()[:10]+''.join(':')[:13]+'00'*14+'.000*'
outdatedvars=',|>'.ljust(['%02dm'%el
''
'|{:^63}'.center('|')[
max('%03dk'%(int(_), 16)+
('.'+_avcgum)[:18], ':o:')
'-'
],'| ')
fmt='%-'+'_-+='.rjust('')
try:#os stats module first as it's faster than using list comprehension because I have no idea what that does this mean...
statinfo=_stats_.calcStatSummary()['mean'][outdtlk][tempdctfn](
loc[:,:]==polewrubattuysummary
)[colspan:].sortlevel()\
([float('-inf'),'-99%',
'-9999'])[columnslideings]\
(['GHI']*100+[greatcircle]*50*[halfellipse])*25*\
hstack()[-500:-1500]+\
greatest('*cumulative*')
info_=statsys._genStatsResultTable(__file__,
names=__fieldNames__)
_res_['Mean']=pandasWrapper_(output='',
usemask="auto", dtype={"O":object,"V":"double"}).\
groupBy('__year')\
._toDF(('Year',))
retvals=\
pandasWrapResultsTracker()(retkeys_,
fields=[],dtype={}) +\
ColumnDescriptorsToDicts__()
outputterror='\nThe following columns are missing from your final results:\nIf you want '\
'\'Missing values at all,\nthe \'NULL\' column will be filled automatically!\"';\
"\nor they may contain null or undefined entries."
#'No Data Found For This Location';'';';"Have fun!'`shall we go! We can't do much better on these without having any impact when trying '+'"Heavy Metrics Offshore".'
#"Average Error Between Values";""}'`'|||'&&
# convert to radians
site_lat = site_lat * np.pi / 180
site_long = site_long * np.pi / 180
# convert to degrees
x_rad = x * np.pi / 180
y_rad = abs(site_lat - x_rad)
x_deg = np.rad2deg(x_rad)
y_deg = np.rad2deg(y_rad)
# distance is the angle between the two lines
# of the Earth's surface and the plane of the earth
# with respect to the Earth's plane
# theta = np.arctan((y_deg)/(x_deg))
# distance = np.abs(theta * 180 / np.pi)
return abs(site_long - x_deg)
# Convert to radians
lat_rad = np.radians(x.lat_rad)
long_rad = np.radians(x.long_rad)
# Distance between points
d = np.sin(lat_rad[0]) * np.sin(lat_rad[1]) + np.cos(lat_rad[0]) * \
np.cos(lat_rad[1]) * np.cos(long_rad[0] - long_rad[1])
# Convert to meters
return np.rad2deg(np.sqrt(d))
Codegen and Decicoder make a valiant effort at generating the correct code. All the models get tripped up on Function 1. For Function 2, Codegen and Deci approach the correct methodology, but they confuse trigonometric functions and mix up how to manipulate the input variables.
LM metrics are an active research field. You will see terms such as 'state-of-the-art' thrown around in reference to the latest models. However, the metrics aren't yet standardized enough to allow model-performance comparisons without delving into the details of how each metric was set up.
Metrics can be divided into 2 categories: human and automated. Human evaluation is reliable but difficult and expensive to scale. The popular automated metrics are Bleu, ChrF, and Ruby, among others. These are all variants of computing what share of predicted characters or n-grams match the ground truth. Currently, a popular benchmark you'll see is HumanEval, which is misnamed since it's actually an automated procedure. Its approach differs from the metrics referenced above: each problem contains a function prompt and an associated unit test that a successful output would pass. So we can feed the model under test this dataset and count how many of the unit tests its outputs pass.
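Conceptually, a single HumanEval-style check boils down to something like the sketch below. The real harness sandboxes execution and aggregates pass@k over many samples; the candidate and test strings here are invented:

```python
def passes_unit_test(generated_code: str, test_code: str) -> bool:
    """Returns True if the generated function passes its unit test."""
    namespace = {}
    try:
        exec(generated_code, namespace)  # define the candidate function
        exec(test_code, namespace)       # run the assert-based test
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b"
test = "assert add(2, 3) == 5"
print(passes_unit_test(candidate, test))  # True
```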
Researchers have noted limitations: HumanEval functions mostly focus on short, specific computer-science tasks, so it is unclear how the scores generalize to other domains. Additionally, the evaluation is binary, making it impossible to gauge output quality when a result doesn't pass the unit test.
Note that the metrics below were computed with a minimum generation length of 200 tokens and a maximum of 1000.
| Model | HumanEval@1 (reference) | Bleu (baseline) | Bleu (finetuned) | ChrF (baseline) | ChrF (finetuned) |
|---|---|---|---|---|---|
| Codegen | 12.76 | 0 | 0.08 | 8.67 | 20.98 |
| Decicoder | 19.1 | 0 | 0.10 | 6.11 | 30.48 |
| CodeParrot | 3.99 | 0 | 0.006 | 19.1 | 18.77 |
The low Bleu scores reflect that Bleu is actually a pretty strict metric: at least one 4-gram (a run of 4 consecutive tokens) must match for the score to rise above 0. The ChrF scores suggest that Decicoder performs best of the three models. However, we only have 3 samples, and it's dubious to base conclusions on such a small sample size. Researchers have shown that even with a large number of samples, a difference in metrics between models of under 2% isn't meaningful (i.e. statistically significant); see page 11 of the Evtikhiev paper linked at the end of this article.
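For what it's worth, computing these scores is only a few lines with the `sacrebleu` package (the file names here are hypothetical):

```python
from sacrebleu.metrics import BLEU, CHRF

# Ground-truth function bodies vs. model predictions for the test set.
references = [open(p).read() for p in ("fn1_true.py", "fn2_true.py")]
hypotheses = [open(p).read() for p in ("fn1_pred.py", "fn2_pred.py")]

bleu, chrf = BLEU(), CHRF()
# BLEU stays at 0 unless at least one 4-gram matches the reference.
print(bleu.corpus_score(hypotheses, [references]).score)
print(chrf.corpus_score(hypotheses, [references]).score)
```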
The main takeaway is that even if we tweak the parameters for prediction, the results are not practical. I included the HumanEval@1 scores for context, though they aren't actually useful in our scenario: we care whether the machine can generate the functions we have, with all of the idiosyncrasies of our project, not solve generic computer-science problems.
Some ideas for improvement / thoughts:
Returning to the questions posed at the outset: finetuning definitely improves the predictions, with Decicoder performing best of the 3. But the functions do not run, and they are pretty far from correct. I would really like to see how CodeLlama performs on this. Stay tuned!
A Syntactic Neural Model for General-Purpose Code Generation. Yin et al., 2017.04.06
Can AI Code — metrics on HuggingFace
Evaluating Large Language Models Trained on Code. Chen et al., 2021.07.14
Out of the BLEU: How Should We Assess Quality of the Code Generation Models? Evtikhiev et al., 2023.05.10
SkCoder: A Sketch-based Approach for Automatic Code Generation. Li et al., 2023.07.09
WizardCoder: Empowering Code Large Language Models with Evol-Instruct. Luo et al., 2023.06.14