
BigCode & StarCoder: Large Language Model Projects Specialized in Programming

Last Updated on 2023-05-10 by Clay

Introduction

BigCode

First, let's introduce BigCode! BigCode is an open scientific collaboration co-led by Hugging Face and ServiceNow, with the goal of jointly developing large language models (LLMs) specialized in programming.

Recently (2023/05/04 - 2023/05/10), I stumbled upon news about StarCoder, and a colleague recommended it to me as a fantastic model. That is how I learned about this large language model dedicated to code.

As an AI/NLP engineer, I am naturally very interested in this model.


StarCoder

StarCoder can already be found on the Hugging Face Model Hub, which includes:

- StarCoderBase
- StarCoder

Both are large language models targeting code generation and development, trained on permissively licensed data from GitHub (is there really such permission? If you don't mind, my code is welcome to be used for training). In short, the training data covers 80+ programming languages and includes Git commits, GitHub issues, and Jupyter notebooks, so we can expect the models to have some ability to fix bugs.
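Both checkpoints can be loaded through the transformers library. Here is a minimal sketch, assuming you have accepted the model license on the Hub and installed transformers (plus accelerate, for device_map="auto"):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "bigcode/starcoder" is the Python-tuned model; the base model lives at
# "bigcode/starcoderbase". Both are gated, so accept the license on the Hub
# and log in with `huggingface-cli login` first.
checkpoint = "bigcode/starcoder"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# Ask the model to complete a function body from its signature.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```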

Similar in scale to LLaMA, the recently popular open-source large language model for research purposes, the development team trained a model with approximately 15.5 billion parameters on about one trillion tokens.

What is the difference between StarCoderBase and StarCoder? StarCoder is simply StarCoderBase fine-tuned on an additional 35 billion Python tokens, so we can expect StarCoder to be more proficient at generating Python.

Next, let's briefly discuss the models' performance. The development team found that StarCoderBase outperforms existing open code LLMs on popular programming benchmarks and may, in some respects (emphasis on some respects), match or exceed OpenAI's code-cushman-001, the original Codex model.

Additionally, the StarCoder models support a context length of more than 8k tokens, allowing longer inputs than many open-source LLMs and thus handling longer code better. Furthermore, with suitably designed prompts, the StarCoder models can act as a technical assistant that explains or modifies code.

Lastly, the StarCoder models are released under the OpenRAIL license, which appears to permit commercial use; no wonder the documentation says this licensing simplifies the process of integrating the models into products for enterprises.


Training Data

The StarCoder models are trained on The Stack v1.2, a dataset that contains only permissively licensed code. If you have concerns about this and are a contributor, you can choose to remove your code from the dataset (refer to "Am I in The Stack?").

Interestingly, I was surprised to find that my GitHub repo was actually included.

However, I naturally wouldn't choose to opt out. If you don't mind, please use my other repos as datasets as well!
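For the curious, The Stack itself is hosted on the Hugging Face Hub. Here is a hedged sketch of how one might peek at a slice of it with the datasets library, assuming you have accepted the dataset's terms (the data_dir layout follows the dataset card):

```python
from datasets import load_dataset

# Stream a single language slice instead of downloading the multi-terabyte
# dataset; gated access requires accepting the terms on the Hub first.
the_stack = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",
    split="train",
    streaming=True,
)

sample = next(iter(the_stack))
print(sample["content"][:200])  # the "content" field holds the raw source code
```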


Model Evaluation

As mentioned earlier, the development team conducted a comprehensive evaluation of StarCoder and several comparable models across various benchmarks.

A popular Python benchmark is HumanEval, which tests how effectively code LLMs can design and complete a function given only its name and docstring. The development team stated that even though the StarCoder models are smaller in scale, they still outperform larger language models such as Google's PaLM, LaMDA, and LLaMA, as well as CodeGen-16B-Mono and OpenAI's code-cushman-001 (12B).
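To make the benchmark concrete, here is a hedged sketch of inspecting a HumanEval task with the datasets library (the "openai_humaneval" dataset id and its fields follow the public dataset card):

```python
from datasets import load_dataset

# HumanEval ships 164 hand-written Python problems in a single "test" split.
humaneval = load_dataset("openai_humaneval", split="test")

task = humaneval[0]
print(task["prompt"])        # function signature + docstring to complete
print(task["entry_point"])   # name of the function the unit tests will call
```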

One odd issue is that the model tends to automatically generate comments like "# Solution here," probably because such comments appear in exercise-and-solution code in the training data.

To steer the model toward actual solutions, the development team prepended a prompt:

```
<filename>solutions/solution_1.py\n# Here is the correct implementation of the code exercise
```

This significantly improved the StarCoder models' HumanEval scores, raising them from roughly 34% to over 40% and setting a new record (see the table below).
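As a hedged illustration of the trick (the variable and function names here are my own), it amounts to prepending that string to each benchmark prompt before generation:

```python
# Hypothetical sketch of the prompting trick; names are illustrative only.
SOLUTION_PREFIX = (
    "<filename>solutions/solution_1.py\n"
    "# Here is the correct implementation of the code exercise\n"
)

def build_prompt(humaneval_prompt: str) -> str:
    """Prepend the 'solutions file' framing to a HumanEval problem."""
    return SOLUTION_PREFIX + humaneval_prompt

# The result is fed to model.generate() exactly as in the earlier example.
```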

| Model              | HumanEval | MBPP |
|--------------------|-----------|------|
| LLaMA-7B           | 10.5      | 17.7 |
| LaMDA-137B         | 14.0      | 14.8 |
| LLaMA-13B          | 15.8      | 22.0 |
| CodeGen-16B-Multi  | 18.3      | 20.9 |
| LLaMA-33B          | 21.7      | 30.2 |
| CodeGeeX           | 22.9      | 24.4 |
| LLaMA-65B          | 23.7      | 37.7 |
| PaLM-540B          | 26.2      | 36.8 |
| CodeGen-16B-Mono   | 29.3      | 35.3 |
| StarCoderBase      | 30.4      | 49.0 |
| code-cushman-001   | 33.5      | 45.9 |
| StarCoder          | 33.6      | 52.7 |
| StarCoder-Prompted | 40.8      | 49.5 |

Technical Assistant

In addition to being great at writing code, StarCoder can also serve as a technical assistant: with a suitably constructed prompt, it can answer programming-related questions.
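A hedged sketch of what such a prompt might look like, reusing the tokenizer and model loaded earlier (this is an illustrative prompt of my own, not the exact Tech Assistant prompt the StarCoder team published):

```python
# An illustrative assistant-style prompt of my own; the official prompt
# published by the StarCoder team differs in wording and length.
TA_PROMPT = (
    "Below is a dialogue between a human and a helpful technical assistant "
    "who answers programming questions correctly and concisely.\n"
    "-----\n"
    "Human: How do I reverse a list in Python?\n"
    "Assistant:"
)

inputs = tokenizer(TA_PROMPT, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```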

