Last Updated on 2024-08-15 by Clay
Introduction
CodeBERT is a pre-trained model based on the Transformer architecture, proposed in the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages.
Like most BERT-style pre-trained models, it takes the input information and produces a set of feature representations that can be used for downstream tasks.
The applicable downstream tasks of CodeBERT include:
- Natural Language Code Search
- NL-PL Probing
- Code Documentation Generation
- Generalization to Programming Languages Not in Pre-training
Tasks (Experiments)
Natural Language Code Search
Given a natural language query, find the most semantically relevant code among a collection of code snippets.
To allow comparison with different retrieval methods, the authors train on the CodeSearchNet dataset (https://github.com/github/CodeSearchNet) and use MRR as the evaluation metric.
Mean Reciprocal Rank (MRR) is an evaluation method that measures the quality of ranked retrieval results.
For each query, the score is the reciprocal of the rank at which the first correct answer appears, so the earlier the match, the higher the score; if no correct answer appears, the score is 0. The average of these scores over all queries is the MRR score.
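As a quick illustration (not code from the paper), here is a minimal sketch of how MRR could be computed, assuming that for each query we already know the rank (1-based) of the first correct answer, with 0 meaning no correct answer was returned:

def mean_reciprocal_rank(ranks):
    # ranks: 1-based rank of the first correct answer per query, 0 if not found
    scores = [1.0 / r if r > 0 else 0.0 for r in ranks]
    return sum(scores) / len(scores)

# Example: the correct answer is ranked 1st, 3rd, and not found at all
print(mean_reciprocal_rank([1, 3, 0]))  # (1 + 1/3 + 0) / 3 ≈ 0.444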
NL-PL Probing
The model parameters are frozen, and the goal is to study what kind of domain knowledge CodeBERT has learned.
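As an illustrative sketch only (the paper's probing is formulated as a multiple-choice cloze task, and using the microsoft/codebert-base-mlm masked-LM checkpoint here is my own assumption), you can mask a token in a snippet and see what the model predicts for that position:

import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = AutoModelWithLMHead.from_pretrained("microsoft/codebert-base-mlm")

# Mask the operator in a tiny code snippet and ask the model to recover it
code = "if a <mask> b: return a"
input_ids = tokenizer.encode(code, return_tensors="pt")
mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()

with torch.no_grad():
    logits = model(input_ids)[0]

# Show the 5 most likely tokens for the masked position
top_ids = torch.topk(logits[0, mask_pos], 5)[1].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))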
Code Documentation Generation (covering both languages seen and not seen in pre-training)
This is a good habit that has been advocated in recent years: writing the documentation at the same time as the code.
If your system or API has good documentation, others do not need to dig through the source code. In other words, the documentation is clear enough that anyone can understand how to operate the system without reading the code.
That includes yourself a few months later.
In this downstream task, CodeBERT automatically generates a natural language description of the given program code.
There are 6 programming languages used in training:
- Go
- Java
- JavaScript
- PHP
- Python
- Ruby
The programming language not used in training is C#.
By the way, the dataset used in this experiment is CodeNN (https://github.com/sriniiyer/codenn).
Paper Summary
CodeBERT is the first pre-trained model for both programming languages and natural language, and it achieved SOTA (state-of-the-art) results on the downstream tasks in the experiments.
The paper also points out that adding the AST structure information of the programming language is a promising research direction. As far as I know, they followed up on this direction in the GraphCodeBERT paper (Comment-Code-DataFlow).
How to use CodeBERT (Code Documentation Generation)
For the detailed usage, you can refer to the CodeBERT paper and GitHub repository. Here I briefly introduce how to use CodeBERT, taking Code Documentation Generation as an example.
Installation
pip3 install torch==1.4.0
pip3 install transformers==2.5.0
pip3 install filelock
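To quickly confirm that the packages were installed with the expected versions, you can run:

import torch
import transformers

# Expect 1.4.0 and 2.5.0 if you followed the commands above
print(torch.__version__)
print(transformers.__version__)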
Data Preprocessing
The data preprocessing for this task is as follows (a rough sketch of these filters is shown after the list):
- Remove comments in the code.
- Remove examples whose code cannot be parsed into an abstract syntax tree.
- Remove examples whose documentation contains fewer than 3 or more than 256 tokens.
- Remove examples whose documentation contains special tokens (e.g. <img ...> or https:...).
- Remove examples whose documentation is not written in English.
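These rules come from the CodeBERT paper. As a rough illustration only, not the authors' actual preprocessing script, the document-side filters might look roughly like this (the helper name keep_example and the ASCII heuristic for the English check are my own assumptions):

import re

def keep_example(doc_tokens):
    # Token-count filter: fewer than 3 or more than 256 documentation tokens
    if len(doc_tokens) < 3 or len(doc_tokens) > 256:
        return False
    doc = " ".join(doc_tokens)
    # Special-token filter: HTML tags or URLs in the documentation
    if re.search(r"<img\s|https?:", doc):
        return False
    # Crude non-English heuristic: require the text to be mostly ASCII
    if sum(ch.isascii() for ch in doc) / max(len(doc), 1) < 0.9:
        return False
    return True

print(keep_example("Return the maximum of two values .".split()))   # True
print(keep_example("See https://example.com for details".split()))  # False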
You can download the dataset from here, or use the following commands:
pip3 install gdown
mkdir -p data/code2nl
cd data/code2nl
gdown https://drive.google.com/uc?id=1rd2Tc6oUWBo7JouwexW3ksQ0PaOhUr6h
unzip Cleaned_CodeSearchNet.zip
rm Cleaned_CodeSearchNet.zip
cd ../..
Using the tree command, you can see the following directory structure:
tree data/code2nl/CodeSearchNet/
Output:
data/code2nl/CodeSearchNet/
├── go
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
├── java
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
├── javascript
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
├── php
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
├── python
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
└── ruby
├── test.jsonl
├── train.jsonl
└── valid.jsonl
6 directories, 18 files
As you can see, every language has its own training, validation, and testing data files.
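If you want to peek inside one of these files, each line is a JSON object. Assuming the standard CodeSearchNet-style fields such as code_tokens and docstring_tokens (adjust the keys if your download differs), you can inspect the first example like this:

import json

# Read the first example of the Ruby training split created above
with open("data/code2nl/CodeSearchNet/ruby/train.jsonl") as f:
    example = json.loads(f.readline())

print(example.keys())
print(" ".join(example["docstring_tokens"]))
print(" ".join(example["code_tokens"])[:200])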
Run the program
Go to https://github.com/microsoft/CodeBERT/tree/master/CodeBERT/code2nl, copy the run.py, bleu.py, and model.py files, and put them into the data/code2nl folder.
Run the following command. I changed batch_size=128 to batch_size=4 because of the limited memory of my GPU.
lang=php #programming language
beam_size=10
batch_size=4
source_length=256
target_length=128
output_dir=model/$lang
data_dir=../data/code2nl/CodeSearchNet
dev_file=$data_dir/$lang/valid.jsonl
test_file=$data_dir/$lang/test.jsonl
test_model=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test
python run.py --do_test --model_type roberta --model_name_or_path microsoft/codebert-base --load_model_path $test_model --dev_filename $dev_file --test_filename $test_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --eval_batch_size $batch_size
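The script reports a smoothed BLEU-4 score between the generated documentation and the reference (computed by bleu.py in the repository). As a rough illustration of the metric only, not the repository's exact implementation, a smoothed sentence-level BLEU can be computed with NLTK (requires pip3 install nltk):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["returns the maximum of two values".split()]
candidate = "return the maximum value of two numbers".split()

# Smoothed BLEU-4 (illustrative only; bleu.py uses its own smoothing)
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method2)
print(score)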
How to call CodeBERT
If you only want to use the feature representations produced by CodeBERT, you can refer to the following sample code:
from transformers import AutoTokenizer, AutoModel
import torch
# Init
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
# Tokenization
nl_tokens = tokenizer.tokenize("return maximum value")
code_tokens = tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
tokens = [tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.sep_token]
# Convert tokens to ids
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
context_embeddings = model(torch.tensor(tokens_ids)[None,:])[0]
# Print
print(context_embeddings)
Output:
tensor([[-0.1423, 0.3766, 0.0443, ..., -0.2513, -0.3099, 0.3183],
[-0.5739, 0.1333, 0.2314, ..., -0.1240, -0.1219, 0.2033],
[-0.1579, 0.1335, 0.0291, ..., 0.2340, -0.8801, 0.6216],
...,
[-0.4042, 0.2284, 0.5241, ..., -0.2046, -0.2419, 0.7031],
[-0.3894, 0.4603, 0.4797, ..., -0.3335, -0.6049, 0.4730],
[-0.1433, 0.3785, 0.0450, ..., -0.2527, -0.3121, 0.3207]],
grad_fn=<SelectBackward>)
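If you need a single fixed-size vector for the whole input instead of per-token embeddings, one common choice (my own suggestion, not something the paper prescribes) is to take the vector at the first token (the <s> / [CLS] position). Continuing from the snippet above:

# The hidden states have shape (batch_size, sequence_length, hidden_size)
last_hidden_state = model(torch.tensor(tokens_ids)[None, :])[0]
sentence_embedding = last_hidden_state[0, 0]  # <s> token of the single example
print(sentence_embedding.shape)  # torch.Size([768])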
If you want to learn more about the HuggingFace team's transformers package, you can refer to [PyTorch] How to Use HuggingFace Transformers Package (With BERT Example).