HK Analytics | Data science enhanced by human intelligence

Finetuning large language models for software development

Roma Koulikov • Dec 08, 2023

Introduction

Project Github repo


One Friday afternoon, while planning the following week's software development work, a thought crossed my mind, "Wouldn't it be nice if I could issue a set of instructions about the intended feature and have the machine take at least a first pass at writing the relevant functions for me."


Large language models (from here on referred to simply as LMs) got a lot of attention in 2023. So the idea was to see how well these LMs, finetuned on our company's code (which focuses on predicting energy output from PV plants), perform on a much simpler task.

First, to get it out of the way: I am of course familiar with GitHub Copilot. But Copilot is paid, and I also want control over the internals of the LMs rather than just a black box.

Designing a system that creates interrelated blocks of code that integrate into a functioning codebase in response to a user command is a very challenging endeavor. As such, I limited the scope to something much more manageable: generating detailed code from Python function documentation (from here on referred to as docstrings).


Background

In our codebase, we strive to adhere to standards for both docstrings and functions. Every docstring contains, at a minimum, the same sections: a description of what the function does along with its inputs and outputs. We intentionally write layperson-friendly explanations of the pertinent engineering and solar concepts (although we don't repeat these detailed explanations across functions).

In our code, we follow the Google style guide. We strive for consistent variable naming and a particular coding style (i.e. Pandas/NumPy-heavy vectorization, writing for humans, DRY, etc.).
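To make the convention concrete, here is a small, invented example in that style (the function, its name, and the crude physics are all hypothetical illustrations, not taken from our codebase):

```python
import math

def calc_tilt_irradiance(ghi, tilt_deg):
    """Estimate irradiance on a tilted plane from global horizontal irradiance.

    A deliberately simplified, hypothetical example of our docstring
    convention: a layperson-friendly description followed by standard
    Parameters and Returns sections.

    Parameters
    ----------
    ghi : float
        Global horizontal irradiance, in W/m^2.
    tilt_deg : float
        Module tilt angle from horizontal, in degrees.

    Returns
    -------
    float
        Crude estimate of plane-of-array irradiance, in W/m^2.
    """
    return ghi * math.cos(math.radians(tilt_deg))
```

The key point is the uniformity: every function pairs one such docstring with one implementation, which is exactly the structure a finetuned model can learn to exploit.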

So we can use the docstrings, which in essence describe what the function does, to generate the function itself.  The way we achieve that is by finetuning existing LMs trained on code (ideally Python). 
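As a sketch of what a finetuning example might then look like - the separator and exact formatting here are my assumptions for illustration, not necessarily what the repo uses - the docstring becomes the input and the function source the target:

```python
# Assumed delimiter between input and target; not taken from the repo.
SEPARATOR = "\n# --- implementation ---\n"

def make_training_example(docstring, function_source):
    """Concatenate a docstring (model input) and function source (target)
    into one sequence suitable for causal-LM finetuning."""
    return '"""' + docstring + '"""' + SEPARATOR + function_source

ex = make_training_example(
    "Get distance between two geographical coordinates.",
    "def get_distance(x, site_lat, site_long): ...",
)
```

At inference time, the model is given only the docstring portion (everything up to the separator) and asked to complete the rest.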


The questions

  • What LMs can we test?   
  • How much do they improve if we finetune them, as opposed to just using them out of the box?
  • How good (or bad) is the code they generate? Does it even run?


Models

I decided to use the following code-specific models, which differ in their number of parameters: CodeGen, DeciCoder, and CodeParrot.

My original intent was to also finetune CodeLlama, released by Meta in August 2023. It is a 7B-parameter model that has achieved top performance metrics on code generation tasks. However, I encountered memory issues training on expensive GPUs of various sizes and had to halt that work temporarily. I'll detail my efforts and the results in a future article.


Data preparation

Our codebase consists of about 10 modules (i.e. Python files), some of which contain classes. In total there are approximately 200 functions. The functions are, of course, connected to each other semantically (i.e. pertaining to meaning).

To simplify the problem, though, I basically ignored class definitions and the connections between functions. 

I then separated each function into an input section for the docstring and the output section for the function itself. 

I left 3 functions out of the finetuning dataset in order to test how well the models perform. This is a very small number to base metrics on - I sacrificed metric generalization in order to use as much of the data as possible to get the best model.
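A minimal sketch of this preparation step using only the standard library (requires Python 3.9+ for ast.unparse; the helper name is invented):

```python
import ast

def split_docstring_and_body(source):
    """Split each function in a module into a (docstring, body) pair.

    The docstring becomes the model input and the remaining function
    source becomes the target output. Class structure and cross-function
    links are ignored, mirroring the simplification described above.
    """
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            doc = ast.get_docstring(node)
            if doc is None:
                continue  # undocumented functions can't serve as examples
            # Drop the docstring statement, keep the rest of the body.
            body = node.body[1:] if isinstance(node.body[0], ast.Expr) else node.body
            code = "\n".join(ast.unparse(stmt) for stmt in body)
            pairs.append((doc, code))
    return pairs

example = '''
def add(a, b):
    """Add two numbers."""
    return a + b
'''
pairs = split_docstring_and_body(example)
```

Each resulting pair then maps directly to one input/output training example.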


Modeling

I modeled using AzureML in order to make the experiment architecture transparent and reproducible, and to leverage cloud compute. Details are in the GitHub repo. I finetuned CodeGen and DeciCoder for 10 epochs with a batch size of 20, and CodeParrot for 6 epochs with a batch size of 100. For all models, I used a sequence length of 500 tokens on a Standard_E8s_v3 machine (64 GB RAM, 128 GB storage, 16 cores, $0.64/hr). The training took around 10.5 hours.
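As a quick back-of-the-envelope check on these settings (taking the ~200 functions mentioned above minus the 3 held out as the training set size), the number of optimizer steps per run is small:

```python
import math

def optimizer_steps(n_examples, batch_size, epochs):
    # Steps in a full finetuning run, counting the last partial
    # batch of each epoch as one step.
    return math.ceil(n_examples / batch_size) * epochs

n_train = 200 - 3  # approximate: ~200 functions minus 3 held-out test functions
codegen_steps = optimizer_steps(n_train, batch_size=20, epochs=10)
codeparrot_steps = optimizer_steps(n_train, batch_size=100, epochs=6)
```

With so few gradient updates, most of the adaptation has to come from the models' pretrained knowledge of code rather than from the finetuning data itself.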


Baseline predictions

In order to ascertain that finetuning really has an effect, it's instructive to first predict on our test functions using the LMs out of the box.

Note that for the test set, I chose 3 functions that represent the range of complexity within our code. I will share 2 of them - the 3rd contains our secret sauce for uncertainty quantification in energy losses.


Function 1: Calculate PV efficiency loss


This function calculates the losses and uncertainties associated with the PV modules. It includes both base PV cell efficiency as well as natural and light-induced degradation losses.

def calc_lf_pv_eff(self):
    """Calculate photovoltaic conversion losses.

    Assumes uncertainty consists of a fixed fractional component of the
    location value, and a variable component that increases with the age
    of the plant. As degradation proceeds, we have decreasing confidence
    in its actual level. Both the actual degradation and degradation
    uncertainty are re-calculated at a daily level to avoid quantum jumps
    in values at the beginning of every year.

    Note that the user should also specify a positive asymmetry factor.
    This ensures that even though our uncertainty increases with time,
    the increase in uncertainty is asymmetrical - it is higher above the
    location (estimated value), since below the location, we have a floor
    on the value, since we know that losses due to PV efficiency cannot
    be lower than 1 - (pv efficiency + some uncertainty).

    The efficiency and degradation methodology when running on simulated
    POA data entails calculating the starting efficiency of the
    prediction time period based on the number of operational years.
    Since we run trials to predict future outcomes, we then sample from a
    normal distribution using the calculated starting efficiency as the
    mean. For the standard deviation we currently assume a normal
    distribution and assume that the location value +/- the uncertainty
    covers x% of the data, where x is the confidence level specified by
    the user. We calculate the value of 1 standard deviation under these
    assumptions. We then also sample for the degradation and its
    associated uncertainty in the same way.

    Returns
    -------
    pd dataframe
        For each timestamp, contains the location, lower and upper
        uncertainty bounds, and probabilities at lower and upper bounds.

    Notes
    -----
    "In PVsyst, the evaluation of the "Losses" of a PV array (as for the
    definition of the normalized performance ratio), takes as starting
    point the energy which would be produced if the system worked always
    at STC conditions (1000 W/m², 25°C, AM1.5)."
    Source: https://www.pvsyst.com/help/irradiance_loss.html

    Loss is 1 - sum(efficiency factors)

    Efficiency factors:
        pv_eff - base efficiency
        pv_lid_coef - light induced degradation
        pv_degrad_coef - degradation coefficient
        pv_mql - module quality loss

    Future
    ------
    Create IV curve to compare V-dc mes to V_dc theoretical (10hrs)
    """
    name = 'lf_pv_eff'
    logging.info('Calculating %s', name)
    pv_eff = sui_configs.module_info.pv_eff
    pv_eff_unc = sui_configs.module_info.pv_eff_unc
    pv_lid_coef = sui_configs.module_info.pv_lid_coef
    pv_lid_coef_unc = sui_configs.module_info.pv_lid_coef_unc
    pv_degrad_coef = sui_configs.module_info.pv_degrad_coef
    pv_degrad_coef_unc = sui_configs.module_info.pv_degrad_coef_unc
    pv_mql = sui_configs.module_info.pv_mql
    pv_mql_unc = sui_configs.module_info.pv_mql_unc
    conf_level = user_settings.variable_uncertainty.general.conf_level
    # -------------------------------------------------------------------
    if self.calc_type == 'measured':
        day_diff = self.index - first_op_day
    else:  # simulated option
        # note first_sim_day and first_op_day are global variables
        # created in config
        day_diff = self.index - first_sim_day
        n_years = (first_sim_day - first_op_day) / np.timedelta64(1, 'D') / 365
        pv_eff_mu = pv_eff - (pv_degrad_coef * n_years)
        # Number of standard deviations from mean to cover conf_level%
        # of data assuming normal distribution
        zscore = stats.norm.ppf(1 - (1 - conf_level) / 2)
        # Divide the uncertainty by # of std to get the value of 1
        # standard deviation; uncertainty is two-way and zscore is
        # one-way, so divide by 2
        sigma_eff = pv_eff_unc / zscore / 2
        # eff at beginning of simulation period
        pv_eff = stats.norm.rvs(pv_eff_mu, sigma_eff)
        sigma_degrad = pv_degrad_coef_unc / zscore
        pv_degrad_coef = stats.norm.rvs(pv_degrad_coef, sigma_degrad)
    time_degrad = day_diff / np.timedelta64(1, 'D') * pv_degrad_coef / 365
    pv_eff = pv_eff - pv_lid_coef - pv_mql - time_degrad
    lf = pd.Series(1 - pv_eff, index=self.index, name=name)
    # multiply number of days by the daily increase in uncertainty
    # based on temp value
    unc_var = (self.index - first_op_day) / np.timedelta64(1, 'D') * pv_degrad_coef_unc / 365
    unc_fixed = pv_eff_unc + pv_lid_coef_unc + pv_mql_unc
    lf_unc = pd.Series(unc_var + unc_fixed, index=self.index, name='lf_unc')
    self.validate_assert_write_helper(name, lf, lf_unc)

*** # Calculating LFPs per day ######################################################## --------------------------- ---------------------- ------------------------------ ------------------------------------ ------------------------------------- ------------------------------------------------------------- ------------------------------------------------------------------------ -------------------------------------------------------- ---------- ================================================= ======================================================================= ************************************************** __________________________________________________________ ___________________________________________________________________________________ ................................................................................................................... ////////////////////////////////////// .......................... .... ................. .................. ..... ....... ...... ............. .# ## ######## ##... ###..................... ##### | | || || |||| │ │ ┌─╸̅͘═´¯`·‿»¬¶•Àº«¨®ƒœ№●†Â©™Ã±ôˆ¢˜ħø∙߀£$&()+–—→½âêîûëïöüçãàèẽðéóúñæåäíǎoʼɪțᵉδύπαρμος εκτινελωσγ χác ở Điūr ánātōu žaškējībén kai mikrofon takus ved nyauzumot luktiem daraje og sveitam udavimoseten bij eksplorativni tehte görjaan jaunomis zapsojučia işlemeji vektoriye hilvasojamega arabievi yerdezaqta qaytoje niurraki razkojoğunut koordinaatsizna farkında birini tanlangacaks iradigineci kelimeyi alma sahiptama verdi!\xa0", '''PV Systs'', 'Location': {}, 'Uncertainties:'. ', '']}')**

 # Calculate start effecitve levels assuming constant relative error


     self._calc()



#    def _getlocdata():

         return locdf['location'].values[startidx:]

 # TODO check this function's docstring!


     return self._calc('pfv', 'pve')



class LOSSExposureLossCalculator():

  def __init__(

      cls=None,): pass


  @classmethod

   @_withparamargs()   ## args[0] = name or path

       ### optional parameters

           #~ parser object containing all command line arguments; default None


   _parser   =_argparse_.ArgumentParser("Calculating lsexplosion")



   ################### argument parsing #####################################



  class ArgumentParser(_argumentParserCls.__subclass__):


      #=========================================================================# 

      #@property

   #: A dictionary holding keyword arguements parsed into an instance

          keywords            dict([(str,'--'+argname)])

         kwargs              ={}


         help                  =\

            '''

             Typical usage example


             >>> pvtlptopr           --path=/home2/.localrc

             ...     /usr             \\

               |              \      `./bin`               ./scripts\\runscript


             .. note::

                  If you want more control over your own use cases please see 

                  http:#python3 <stdin>


            ''')


          positional          := ('positional','optional' ) 

          keyval                 <- {'key':'value'} ; defaults will override

                          any other keys set here



            group                   ::= '-' delimited list

                        Specifies groups used within lists

                    eegspec                     ::

                     

                         * ``['EEG']``

                      

                     ``'MEG'``

                  

                 ``"MEL"``       -> ['EMS'|'MOLE'|etc.]


                    **Note:** Only one pair can appear per file type!!!**


           ===================== ============================================== =======

                Keywords        Description


                Optional Arguments         Meaning                                                      

                                        ------------------- ----------------------------------------------------

                                      

                                 0                           No evidence

                                         2                         Yes

                                                    

                             3                          YES

                                                

                            4                        NO

                                                 

                           5                            NONE

                                               

                              6                             ODD

                                           

                               7                               ONCE

                                            

                                8                              OFFSET

                                                   

                                                     

                                                                                                               

                                                        

                                                       

                                    9                                  SURFACE

                                                           

                                                      

                                                                    

        Example Usage


        --------------

        

            To create new argparse subclass like usual

            

            ``` python 

            import sys

     cpthlp                     .main                    main                                                                                                   cpcpsolvertools\ncp thlcpactools lscpilotter

None of the models produce useful, let alone correct, output. But they are amusing.


Codegen's prediction shows the model's affinity for dashes, equals signs, and asterisks. It proceeds to a random assortment of Greek and what looks to be some Central European language (maybe Czech?), and concludes with 3 solar-related terms. CodeParrot, in turn, generates a class that seems to be a hodge-podge of programming languages. Surprisingly, there's a big docstring section within the function definition itself.

Function 2: Get distance

def get_distance(x, site_lat, site_long):
    """Get distance between two geographical coordinates.

    Parameters
    ----------
    x : pd Series
        Pandas Series containing information about a neighbouring site.
    site_lat : float
        Site latitude, in decimal degrees.
    site_long : float
        Site longitude, in decimal degrees.

    Returns
    -------
    float
        Distance between site and neighbouring site.

    See Also
    --------
    data_import.legacy_get_nearest_site_nrel_info
    """
    d_lat = math.radians(site_lat) - x['lat_rad']
    d_lng = math.radians(site_long) - x['long_rad']
    temp = (math.sin(d_lat / 2)**2 +
            math.cos(x['lat_rad']) * math.cos(32.03914409) *
            math.sin(d_lng / 2)**2)
    return 6373.0 * (2 * math.atan2(math.sqrt(temp), math.sqrt(1 - temp)))

This function calculates the distance between 2 points.

def get_distance(x, site_lat, site_long):
    """Get distance between two geographical coordinates.

    Parameters
    ----------
    x : pd Series
        Pandas Series containing the indices lat_rad and long_rad with
        the latitude and longitude, respectively, in radians of a
        comparison site.
    site_lat : float
        Site latitude, in decimal degrees.
    site_long : float
        Site longitude, in decimal degrees.

    Returns
    -------
    float
        Distance between site and neighboring site.

    See Also
    --------
    data_import.legacy_get_nearest_site_nrel_info
    """
    d_lat = math.radians(site_lat) - x['lat_rad']
    d_lng = math.radians(site_long) - x['long_rad']
    temp = (math.sin(d_lat / 2)**2 +
            math.cos(x['lat_rad']) * math.cos(32.03914409) *
            math.sin(d_lng / 2)**2)
    return 6373.0 * (2 * math.atan2(math.sqrt(temp), math.sqrt(1 - temp)))

# Get distances from each point to all other points on this line segment: (1) find closest neighbor; 2). Find nearest neighbour for every index that is within range [0..360]!!! !!!!!!!!!!!!!!!???????????? = 1/2 * 360^n + n where 0 <= i < len(_indexes)-i - 3 # TODO check if we can use np instead or just using pandarallel's linear interpolation function here... maybe not as it seems like there are many ways but I think its ok because they're very similar anyways... :)  !!!!!!!!?????????     = https://stackoverflow-questions@python3k/?qid=(RU5TJQKLZYHXWVFTVA4B7EI6M8DG9C&tb=/rjmhgwvfzcvs%40gmailcomposermailboxapplicationservicesystemsolutionsandemailaddressoftheuserwithaverythingthatisnotinhereofthesystemsoverwritenotthisoneforyou!)

# TODO add option when using poasimulation instead? (not yet implemented?)!!!   - jhb 2020/10 / 2021 0729 : added options... maybe not needed anymore??!?!? #TODOS ADDED OPTIONS TO IMPUTE ACM VALUES HERE!! :)     --jk -- Added code from above but now it's just an example.... see https://github2solutionsblogger@gmailcom/?p=1&q=' + str('https:'+str("http:")) )      ---added 7 June 2019 --- updated version 1 July 2018 ----> changed all functions so they are more readable than original ones; also modified function name "get" --> get() -> return object at index i within list lst[i] = [lstsize][0], then use.iloc[] accessor rather though.. ;-)                               -----> fixed bug due no longer being able ot work because I have been changing my own implementation here since 6th March 2017...       <------ fix later!!                -->            ***          **           ****             *****         *               ********              ************                       ****************                      ********************************==================                    !!!!!!!!!!!!!!!!!                   ^^^^^^^                    ######################                  ###########                 ################################


 # Convert to cartesian coords for easier manipulation later on if needed...

  lon = nputils._cartesians2cart(_getsitecoords([float((y - _lng) / (l * maths)) 

                                                 + (_latitude ** 2),

                                                  ((1-mathsat)*npradians(-90)), 0],

                                                [0])[:3]

                         )

     

#    lon[abs(((longitude+180.) % 360.)) > 180.] -= 1e6

    

    return round((((360.-southpoleshift)/3600.), 6),)


#!/usr/bin python

import sys



 class Node:

    def __init__(self,*args,**kwargs ):

        self.__dict__.update(*zip(('node','parent', 'children'))+[getattr(__builtins__,name,'__doc__')]+list('abc')).extend(['data'])


    @property 

    class data():

      pass

    

    node=Node()

    

  root={}

  


  children=[]  


      

    

      

          

        

         

            

             

           

               

                

              

                    

                   

                 

                       

        

    nodes=[root['child']];



    while len(''.join(['%d'%ifor i,_idatadayerow]))>10 or not ''.split('\n')[-4:]=='\r\t':


        try:#while True:[]:


            print("Entering")

        

        except IndexError as emsg:'error' is raised when there are no more items available.':print("%%Error:%c" %(eems(),traceback()))        

      

     break 

        else:_+=sys," "*(len(','.ljust('%02X'%(int(.5*random()*100))))+' ')+" "*20+'\033'+''.rjust(\

                            random().choice(["a","b"])+"\001"+"\002",'')+"_"*50+\

                                "".center((" ",".")[randrange_(8)]+") \

   

         del roots[_][::-7];            


            


                child=_nodes[:]                

          

           

                     

                                       

 

                parent='';                          

                                     

                        

                      

                              

              


               


                                                                

                            

                                       

                                          

                                   

                                      

                                        

                                    

                                  

                                         

                                

                                elif '.txt.'in str("".zfill("{:.9f}".format('.').replace('.', ''))).lower()+str(".csv").strip('.')!="":

                                           

                                                       

                                                   

                                                    

                                                        

                                                           

                                                                

                                                     

                                                               

                                                            

                                                                   

                                                                                                          

                                                                  

                                                         

                                                          

                                                             

                                                              

                                                                    

                                                                 

                                                      

                                                                                                     

                                      


               

               

                   

                    

                   

                       

           

                

               

                

                    

              

                 

        

        

        

        '''if args == None:\nskip'''

        

        input='' ; output=""


        ctrls=['\\'] 

        lcnt=-99

    

        sctr='\''    

        tline="\""        

        rptxt="'"  

        ofile="%\\"

      

         

          

                  

             

            

            

           


          

            

        finput="""printf "%%%ds %%-%is | %-15.%df"|"%%.12g","%%",$,%H,$,\`h$\",\\"%\"",\"\"';";\""" 

            outfile='%%(outfile)#ostempath+"/tmp/"+output+".out";


            infile='''echo "$infile##$" >> $oFILE$$$' 

                cat '$filename#'/'*' '*16000| '+${SOURCES} ${TARGETS}' 

                 && echo "${TARGET}">> $@'"""

        

            outputimg='/dev/'+OUTPUT+'/stdout_'+INPUT+'.log'+'_'+'_logfile_.err'#'/var/$SOURCEPATH/_LOGS/{SRC}/*.dat','./*{TGT}.html,.png|.svg/.pdf'/'./images/')

                

            cmd=(cmdln).encode()\

                     .decode('$otput[]');


            

                         

                           

                

        osdialink="/home/"#"C:\\Users//Public\\\\Documents"/GitHub\/PycharmProjectsPathwayProjectFilesAndSettingsDatasetsWithPythonLib \\src\\\scripts\""

        


        command=('cd "'+_scriptpath+');' +\

     ('python lib../pystringdb --nologo &>/build/%jasminishment_%JASMINISHMENT_%(NAME)-configfiles&./runconfall "%(command)+configfile);



        punc='$PUNCHERON='+LPTSTRMDIRDATADIRSHERE__[LSRC]+'LIBROOTSYSCONFPROFCSCRIPTEN=$PYTHONPATH:/opt/%(LDMODULEPREFIX)' 


        runshed='.bashrc ~/.profilecfg../../*.ini

This is a fairly straightforward function, but the results are bad. It appears the docstring format is doing more harm than good.


Finetuned predictions

Now let's examine the predictions from the finetuned models. Note that I had to manually tweak the min_new_tokens and max_new_tokens parameters at inference time to keep the models from generating extremely short results.

Function 1: Calculate PV efficiency loss

(Input: the same description and docstring for calc_lf_pv_eff shown in the baseline predictions section above.)

name = 'lf_pv_eff'

    logging.info('Calculating %s', name)

    lpb, upb, asymmetry_factor, pv_eff_unc, pv_lid_coef, pv_degrad_coef, pv_mql_coef = user_settings.ilf_parameters[

        name]


    # uncertainty calculation

    pv_lid_var = self.sdm_est[self.var_names.index('PV Meter, Real Power')]

    pv_lid_unc = self.sdm_est_unc_half[self.var_names.index('PV Meter, Real Power')]

    pv_degrad_var = self.sdm_est[self.var_names.index('PV Meter, Real Power')]

    pv_degrad_unc = self.sdm_est_unc_half[self.var_names.index('PV Meter, Real Power')]

    pv_mql_var = self.sdm_est[self.var_names.index('PV Meter, Real Power')]

    pv_mql_unc = self.sdm_est_unc_half[self.var_names.index('PV Meter, Real Power')]


    # calculate lower and upper bounds

    lpb = lpb * pv_eff_unc / 100

    lpb = pd.DataFrame(np.clip(lpb, a_min=0, a_max=None),

                        index=self.sdm_data.index,

                        columns=self.sdm_data.columns)

    lpb.name = 'lf_bnds'


    upb = upb * pv_eff_unc / 100

    upb = pd.DataFrame(np.clip(upb, a_min=0, a_max=None),

                        index=self.sdm_data.index,

                        columns=self.sdm_data.columns)

    upb.name = 'lf_bnds'

...

    df.loc[:, 'lf_bnds'] = pd.concat([df['lf_bnds'], lf_bnds], axis=1)


    return df

 # Uncertainty parameters taken directlyfromPVSys results table


     eff = self._usersettings['ilf params']['PV Array Effective Percentage'] / \

         float((len(_YEARS)-3)) *.98   ## % reduction per opperate yr

#     print('effective', _EFFICIENCY,'\n')

      ####-------------------------------------------------------


      ###--------------------------------------------------------------------


        def lfcalc_(x_, pvldata_=None,_=__'):

            '''Helper function used within apply_.


            Parameters

             ----------

                X_: np arr

                    Contains uncetainty information


                efficiencies

                 list

                     List containing all necessary input arguments


             Returnes

              --------

               res

                   LF'''



            loc   , lpb      _, upb       *_         =(X_[i]for i,(lpi,),upbi

                                            )          #: Location parameter

                                                    :,lower physical bound *,upper

                                                            PhysicalBound


                                            lpvli        :=loc*pf*(

                                                ~npaay[vi])**poavariaj

                                             #(Location Parameter Multiplier

                                                                            *(

                                                                                Loc

                                                                                    ** poaijj

                                                                        ))



                            dlvrd            *= pf**(

                                    abs(((lprivatenames[_]-

                                       'module sizing paramters').replace('_','')))))


                            invvar           /= ((

                                df[(df['_year'].isin([int(_)])) &

                                 ~(pd

                                    ._isna(('inv var ['+str(_.split('-')[

                                         len(-4))]+'_'

                                                                f'{ _. split("-")[

                                                                    (-6)]}'

                                                        )))].sum())


                        return [dpvldrivered]*7



        @apply_()

       #############Uncomment this line after applying custom funcitonality############


       ###############################################################


         #####Applying Custom Function######################################

 d = userData['data']

#  d2=userInfo[site]['dni'].values()


        # convert to numpy array for speedup: (lon1 - lon0) / 2 + 1j * npabs((y-npz)/1000)**3



        distancs=[

            (-180*maths**5)/(4*(numpypi/360))+(-90*(-1536/(167772))*cos(((Yr-%20)*sin((((Xrs % 360)-45)))+(((-40.*sqrt(%(Weather station azimuth)^6))))))),\

                cos (((Zmax+%86400)*(tan ((Ave%24))/3600)), \

                    sin z=-atan(\

                        sqrt(\\frac{{degrees}{30}}{\mu l}

                            {\sigma s}\right)\

                      )+\

                  atan({l})-\

                     {{radians}})>_{radius:.7f}".format(**locals())


            ]



    return sum([distanc.__div__(v).sum().item()[None] if v is not None else 0

            .round('NaN')

         ])


if __name__ == '__main__':

   print 'Calculating distances'


   dataFile='../datadrive/'+'POI*.csv?wkt=%23'+'\n'+'NODATA'.encode("utf")+'\t?'+"POINTS".decode()+"\nsymmetry"



   dfw=''; fghb=[] ; irfh=['N','NE'], [str,'NA', '']

  waterp=[[]]


  res={}

  

  pvlp={'poa':[],'prjt':'Total POIs'}



  def calcDistance():

      global pointillarypvc


      tmplist=[(k,[])for k,_row

                  ,dfrw

                 ][cfgr].count(',')

              templst=(pdDataFrame({'PoInDic':{'$exists':[True]}}),

                       {col:[]})


              totalflag=('SUM',)

               avrgarr=((tmpdict[[]]) & col)[-len(['Sum']) > len('$')]


          invvar={key:{val:{}}.index('.').replace('-', '')

                                ['Pacm{}'.find('{} {}'.

                                                  format(*invthrd)[::-12])]

                           [.strip('-')

                                 replace('.', '').split()]

          varnames=[('{} {} '.

                              join(', ',

                                      key[:-9]).upper(), val[-11:])

                         ).lower()[:10]+''.join(':')[:13]+'00'*14+'.000*'


       outdatedvars=',|>'.ljust(['%02dm'%el

                                                ''

                                             '|{:^63}'.center('|')[

                                                    max('%03dk'%(int(_), 16)+

                                              ('.'+_avcgum)[:18], ':o:')

                                                 '-'

                                            ],'| ')

       fmt='%-'+'_-+='.rjust('')


     try:#os stats module first as it's faster than using list comprehension because I have no idea what that does this mean...


         statinfo=_stats_.calcStatSummary()['mean'][outdtlk][tempdctfn](

                                    loc[:,:]==polewrubattuysummary

                                                      )[colspan:].sortlevel()\

                                   ([float('-inf'),'-99%',

                                        '-9999'])[columnslideings]\

                               (['GHI']*100+[greatcircle]*50*[halfellipse])*25*\

                                         hstack()[-500:-1500]+\

                                       greatest('*cumulative*')


             info_=statsys._genStatsResultTable(__file__,

                                                   names=__fieldNames__)

           _res_['Mean']=pandasWrapper_(output='',

                                          usemask="auto", dtype={"O":object,"V":"double"}).\

                          groupBy('__year')\

                            ._toDF(('Year',))


           retvals=\

     pandasWrapResultsTracker()(retkeys_,

                                           fields=[],dtype={}) +\

     ColumnDescriptorsToDicts__()



            outputterror='\nThe following columns are missing from your final results:\nIf you want '\

                                  '\'Missing values at all,\nthe \'NULL\' column will be filled automatically!\"';\

                                                                             "\nor they may contain null or undefined entries."


                    #'No Data Found For This Location';'';';"Have fun!'`shall we go! We can't do much better on these without having any impact when trying '+'"Heavy Metrics Offshore".'


  #"Average Error Between Values";""}'`'|||'&&


Function 2 - Get distance

def get_distance(x, site_lat, site_long):
    """Get distance between two geographical coordinates.

    Parameters
    ----------
    x : pd Series
        Pandas Series containing the indices lat_rad and long_rad with the
        latitude and longitude, respectively, in radians of a comparison site.
    site_lat : float
        Site latitude, in decimal degrees.
    site_long : float
        Site longitude, in decimal degrees.

    Returns
    -------
    float
        Distance between site and neighboring site.

    See Also
    --------
    data_import.legacy_get_nearest_site_nrel_info
    """
    d_lat = math.radians(site_lat) - x['lat_rad']
    d_lng = math.radians(site_long) - x['long_rad']

    temp = (math.sin(d_lat / 2)**2 + math.cos(x['lat_rad'])
            * math.cos(32.03914409)
            * math.sin(d_lng / 2)**2)

    return 6373.0 * (2 * math.atan2(math.sqrt(temp), math.sqrt(1 - temp)))

# convert to radians

    site_lat = site_lat * np.pi / 180

    site_long = site_long * np.pi / 180


    # convert to degrees

    x_rad = x * np.pi / 180

    y_rad = abs(site_lat - x_rad)

    x_deg = np.rad2deg(x_rad)

    y_deg = np.rad2deg(y_rad)


    # distance is the angle between the two lines

    # of the Earth's surface and the plane of the earth

    # with respect to the Earth's plane

    # theta = np.arctan((y_deg)/(x_deg))

    # distance = np.abs(theta * 180 / np.pi)


    return abs(site_long - x_deg)

 # Convert to radians

    lat_rad = np.radians(x.lat_rad)

    long_rad = np.radians(x.long_rad)


    # Distance between points

    d = np.sin(lat_rad[0]) * np.sin(lat_rad[1]) + np.cos(lat_rad[0]) * \

        np.cos(lat_rad[1]) * np.cos(long_rad[0] - long_rad[1])


    # Convert to meters

    return np.rad2deg(np.sqrt(d))




Codegen and Decicoder make a valiant effort at generating the correct code, but all the models get tripped up on Function 1.  For Function 2, Codegen and Decicoder approach the correct methodology, but they confuse trigonometric functions and mix up how to manipulate the input variables.

Metrics

LM metrics are an active research field.  You will see terms such as 'state-of-the-art' thrown around in reference to the latest model.  However, the metrics aren't yet standardized enough to allow model performance comparisons without delving into the details of how each metric was set up.


Metrics can be divided into two categories: human and automated.  Human evaluation is reliable, but difficult and expensive to scale.  The popular automated metrics are BLEU, ChrF, and RUBY, among others.  These are all variants of computing statistics on what share of predicted characters or n-grams match the ground truth.  Currently, a popular metric you'll see is the HumanEval benchmark, misnamed since it's actually an automated procedure.  Its approach is different from the metrics referenced above: each problem contains a function prompt and an associated unit test that a successful output would pass.  So we can feed this dataset to the model we're testing and see how many of the unit tests the model's outputs pass.
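To make the HumanEval procedure concrete, here is a toy sketch of the pass@1 loop.  The problem entry, the generate stand-in, and the harness names are all illustrative, not the actual HumanEval code:

```python
# Toy sketch of a HumanEval-style pass@1 loop: each problem is a function
# prompt plus a unit test, and a completion "passes" if the assembled
# program runs the test without raising.  Illustrative only.

problems = [
    {
        "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
        "test": "assert add(2, 3) == 5",
    },
]

def generate(prompt):
    # Stand-in for a model call; a real harness would sample the LM here.
    return "    return a + b\n"

def pass_at_1(problems):
    passed = 0
    for p in problems:
        program = p["prompt"] + generate(p["prompt"]) + "\n" + p["test"]
        try:
            exec(program, {})  # real harnesses sandbox this step
            passed += 1
        except Exception:
            pass
    return passed / len(problems)
```

With the hard-coded completion above, the single toy problem passes and pass_at_1 returns 1.0; swapping in a wrong completion would drop it to 0.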


Researchers have noted limitations: HumanEval functions mostly focus on short, specific computer-science tasks, so it is unclear how the scores generalize to other domains.  Additionally, the evaluation is binary, making it impossible to gauge how close a result comes when it doesn't pass the unit test.


Note the metrics I include below were computed with a minimum length of 200 and a maximum length of 1000.

Model        HumanEval@1 (reference)    BLEU (baseline)    BLEU (finetuned)    ChrF (baseline)    ChrF (finetuned)
Codegen      12.76                      0                  0.08                8.67               20.98
Decicoder    19.1                       0                  0.10                6.11               30.48
CodeParrot   3.99                       0                  0.006               19.1               18.77

The low BLEU scores are because BLEU is actually a pretty strict metric: at least one 4-gram (a sequence of four tokens) needs to match to score above 0.  The ChrF scores do suggest that Decicoder performs best of the three models.  However, we only have 3 samples, and it's dubious to base conclusions on such a small sample size.  Researchers have shown that even with a large number of samples, a difference in metrics between models of under 2% isn't meaningful (i.e., statistically significant).  Refer to page 11 of the Evtikhiev paper linked at the end of the article.
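To see why BLEU behaves this way, here is a simplified sketch of its n-gram precisions (it omits BLEU's brevity penalty and smoothing, so it is an illustration rather than the reference implementation).  A candidate can share most of its tokens with the reference and still score 0 because BLEU takes a geometric mean of the 1- to 4-gram precisions, and a single zero precision zeroes the whole score:

```python
# Simplified BLEU-4 sketch (no brevity penalty, no smoothing): geometric
# mean of 1- to 4-gram precisions, so one zero precision zeroes the score.
from collections import Counter
import math

def ngram_precision(cand, ref, n):
    """Fraction of candidate n-grams that also appear in the reference."""
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matches = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return matches / total if total else 0.0

def bleu4(cand, ref):
    precisions = [ngram_precision(cand, ref, n) for n in range(1, 5)]
    if min(precisions) == 0:  # no matching 4-gram -> score collapses to 0
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / 4)

ref = "return math.radians(site_lat) - x".split()
exact = "return math.radians(site_lat) - x".split()
close = "return radians(site_lat) - x".split()  # 3 of 4 tokens match
```

Here `bleu4(exact, ref)` is 1.0, while `bleu4(close, ref)` is 0.0 even though the candidate differs by a single token, because no 4-gram survives intact.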

Conclusion

The main takeaway is that even if we tweak the prediction parameters, the results are not practical.  I included the published HumanEval scores for context, though that benchmark isn't actually useful in our scenario.  We care whether the machine can generate the functions we have, with all of the idiosyncrasies of our project, not whether it can solve generic computer-science problems.


Some ideas for improvement / thoughts:

  • Experiment with basic prompts instead of the raw docstring.
  • I treated all functions as separate inputs, so the model has no concept of how functions relate to each other.  One way to address this is to use agents that mimic the software development cycle when automating the creation of a new function or set of functions.  Other approaches rely on recreating the Python Abstract Syntax Tree.  I will include some relevant papers below.
  • Train on larger models (CodeLlama, CodeWizard).
  • I noticed the docstrings themselves can be improved to leave no room for ambiguity about the inputs.  This is tricky because a developer may not need every last detail spelled out, so tweaking these docstrings comes with a cost.
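Several of these ideas rest on the same preprocessing step: extracting docstring–function pairs from a codebase as training data.  Here is a minimal sketch of that extraction using Python's ast module; the sample source string is a stand-in for a real module file:

```python
# Hedged sketch: build (docstring, function source) training pairs from a
# module's source using the stdlib ast module.  The sample below is a
# stand-in for a real codebase file.
import ast

source = '''
def get_distance(x, site_lat, site_long):
    """Get distance between two geographical coordinates."""
    return abs(site_lat - site_long)
'''

def docstring_function_pairs(src):
    """Return (docstring, full function source) pairs for each documented function."""
    tree = ast.parse(src)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # skip undocumented functions
                pairs.append((doc, ast.get_source_segment(src, node)))
    return pairs

pairs = docstring_function_pairs(source)
```

Running this over a whole repository (one file at a time) yields the input/output pairs used for finetuning, and the same AST walk is a natural starting point for the structure-aware approaches mentioned above.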



Returning to the questions posed at the outset: finetuning definitely improves the predictions, with Decicoder performing best of the three. But the functions do not run, and they are pretty far from correct. I would really like to see how CodeLlama performs on this. Stay tuned!


References
