
[Machine Learning] CodeBERT Introduction (With Example)

Introduction

CodeBERT is a pre-trained model based on the Transformer architecture, proposed in the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

Like most BERT-style pre-trained models, it takes the input and produces a set of feature representations that can be used for downstream tasks.

The applicable downstream tasks of CodeBERT include:

  • Natural Language Code Search
  • NL-PL Probing
  • Code Documentation Generation
  • Generalization to Programming Languages Not in Pre-training

Tasks (Experiments)

Natural Language Code Search

Given a natural language description, find the most semantically relevant code snippet from a collection of candidates.

To compare against different retrieval methods, the authors select the CodeSearchNet dataset (https://github.com/github/CodeSearchNet) for training and use MRR as the evaluation metric.

Mean Reciprocal Rank (MRR) is an evaluation metric for the quality of retrieval results.

Assume a query returns several results: the earlier the correct answer appears in the ranked list, the higher the score (the reciprocal of its rank). If no correct answer appears, the score for that query is 0. The average of these scores over all queries is the MRR.
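
As a quick illustration, here is a minimal sketch of how MRR can be computed; the helper function below is mine, not part of the paper's evaluation code:

def mean_reciprocal_rank(ranks):
    """ranks[i] is the 1-based position of the correct answer for query i,
    or None if the correct answer was not returned at all (counts as 0)."""
    scores = [0.0 if rank is None else 1.0 / rank for rank in ranks]
    return sum(scores) / len(scores)

# Three queries: correct answer at rank 1, at rank 4, and not found.
print(mean_reciprocal_rank([1, 4, None]))  # (1 + 0.25 + 0) / 3 ≈ 0.4167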


NL-PL Probing

Freeze the model parameters and probe which kinds of domain knowledge CodeBERT has learned.


Code Documentation Generation (for languages seen and unseen in pre-training)

This is a good habit that has been advocated in recent years: write the documentation at the same time as the code.

If your system or API has good documentation, others do not need to dig through the source code. In other words, the documentation is clear enough that you can understand how to operate the system without reading the source.

That includes yourself a few months later.

This downstream task of CodeBERT is to automatically generate documentation for program code.

There are 6 programming languages used in training:

  • Go
  • Java
  • JavaScript
  • PHP
  • Python
  • Ruby

The programming language not used in training is C#.

By the way, the dataset used in this experiment is CodeNN (https://github.com/sriniiyer/codenn).


Paper Summary

CodeBERT is the first pre-trained model for both programming language and natural language, and it achieved SOTA (state-of-the-art) results on the downstream tasks in the experiments.

The paper puts forward that adding the AST structure information of the programming language is a promising research direction. As far as I know, they also realized this direction in the GraphCodeBERT paper (Comment-Code-DataFlow).


How to use CodeBERT (Code Documentation Generation)

For detailed usage you can refer to the CodeBERT paper and GitHub repository. Here I briefly introduce how to use CodeBERT, taking Code Documentation Generation as an example.


Installation

pip3 install torch==1.4.0
pip3 install transformers==2.5.0
pip3 install filelock
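
To check that the versions match, you can run a quick sanity check (my own snippet, not from the repository):

import torch
import transformers

print(torch.__version__)         # expected: 1.4.0
print(transformers.__version__)  # expected: 2.5.0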



Data Preprocessing

The data preprocessing in this task is as follows (a rough sketch of these filters is shown after the list):

  • Remove comments in the code.
  • Remove examples whose code cannot be parsed into an abstract syntax tree.
  • Remove examples whose documents have fewer than 3 or more than 256 tokens.
  • Remove examples whose documents contain special tokens (e.g. <img …> or https:…).
  • Remove examples whose documents are not written in English.
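
The authors' exact filtering script is not shown here, so the following is only a rough sketch of the document-level rules, assuming each example is a dict with a docstring field. The comment-removal and AST-parsing rules need a language-specific parser and are omitted:

import re

def keep_example(example):
    """Rough sketch of the document-level filtering rules above (not the authors' script)."""
    doc = example["docstring"].strip()
    doc_tokens = doc.split()

    # Drop documents with fewer than 3 or more than 256 tokens.
    if len(doc_tokens) < 3 or len(doc_tokens) > 256:
        return False

    # Drop documents containing special tokens such as <img ...> or URLs.
    if re.search(r"<img|https?:", doc):
        return False

    # Drop non-English documents (a crude ASCII heuristic here; the paper
    # does not specify which language detector was used).
    if not doc.isascii():
        return False

    return True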

You can download the dataset from here, or use the following commands:

pip3 install gdown
mkdir -p data/code2nl
cd data/code2nl
gdown https://drive.google.com/uc?id=1rd2Tc6oUWBo7JouwexW3ksQ0PaOhUr6h
unzip Cleaned_CodeSearchNet.zip
rm Cleaned_CodeSearchNet.zip
cd ../..



Using the tree command, you can see the following directory structure:

tree data/code2nl/CodeSearchNet/


Output:

data/code2nl/CodeSearchNet/
├── go
│   ├── test.jsonl
│   ├── train.jsonl
│   └── valid.jsonl
├── java
│   ├── test.jsonl
│   ├── train.jsonl
│   └── valid.jsonl
├── javascript
│   ├── test.jsonl
│   ├── train.jsonl
│   └── valid.jsonl
├── php
│   ├── test.jsonl
│   ├── train.jsonl
│   └── valid.jsonl
├── python
│   ├── test.jsonl
│   ├── train.jsonl
│   └── valid.jsonl
└── ruby
    ├── test.jsonl
    ├── train.jsonl
    └── valid.jsonl

6 directories, 18 files


As you can see, every language has its own training, validation, and testing data files.
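
To get a feel for the data format, you can peek at one example. I assume here that the cleaned files keep the usual CodeSearchNet fields such as code_tokens and docstring_tokens; check the printed keys yourself if they differ:

import json

with open("data/code2nl/CodeSearchNet/python/train.jsonl") as f:
    example = json.loads(f.readline())

print(example.keys())
print(" ".join(example.get("code_tokens", []))[:200])       # the function, tokenized
print(" ".join(example.get("docstring_tokens", []))[:200])  # its documentation, tokenized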


Run the program

Go to https://github.com/microsoft/CodeBERT/tree/master/CodeBERT/code2nl, download the run.py, bleu.py, and model.py files, and put them into the data/code2nl folder.

Run the following command. I changed batch_size=128 to batch_size=4 because of the memory limit of my GPU.

lang=php #programming language
beam_size=10
batch_size=4
source_length=256
target_length=128
output_dir=model/$lang
data_dir=../data/code2nl/CodeSearchNet
dev_file=$data_dir/$lang/valid.jsonl
test_file=$data_dir/$lang/test.jsonl
test_model=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test

python run.py --do_test --model_type roberta --model_name_or_path microsoft/codebert-base --load_model_path $test_model --dev_filename $dev_file --test_filename $test_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --eval_batch_size $batch_size



How to call CodeBERT

If you only want to use the feature representations produced by CodeBERT, you can refer to the following sample code:

from transformers import AutoTokenizer, AutoModel
import torch


# Init
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")


# Tokenization 
nl_tokens = tokenizer.tokenize("return maximum value")
code_tokens = tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
tokens = [tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.sep_token]


# Convert tokens to ids
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
context_embeddings = model(torch.tensor(tokens_ids)[None,:])[0]  # last hidden states: one 768-dim contextual vector per token


# Print
print(context_embeddings)


Output:

tensor([[-0.1423,  0.3766,  0.0443,  ..., -0.2513, -0.3099,  0.3183],
        [-0.5739,  0.1333,  0.2314,  ..., -0.1240, -0.1219,  0.2033],
        [-0.1579,  0.1335,  0.0291,  ...,  0.2340, -0.8801,  0.6216],
        ...,
        [-0.4042,  0.2284,  0.5241,  ..., -0.2046, -0.2419,  0.7031],
        [-0.3894,  0.4603,  0.4797,  ..., -0.3335, -0.6049,  0.4730],
        [-0.1433,  0.3785,  0.0450,  ..., -0.2527, -0.3121,  0.3207]],
       grad_fn=<SelectBackward>)
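
The per-token vectors above can be reduced to a single vector per input and compared, for example with cosine similarity. The sketch below is mine: it uses the raw pre-trained model and takes the [CLS] vector as a summary, which is a common convention rather than the paper's fine-tuned code-search model, so the scores are only a rough signal:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(text):
    # Wrap the tokens with [CLS] ... [SEP] and take the [CLS] vector as a summary.
    tokens = [tokenizer.cls_token] + tokenizer.tokenize(text) + [tokenizer.sep_token]
    ids = tokenizer.convert_tokens_to_ids(tokens)
    with torch.no_grad():
        hidden = model(torch.tensor(ids)[None, :])[0]  # (1, sequence_length, 768)
    return hidden[0, 0]

query = embed("return maximum value")
code_a = embed("def max(a,b): if a>b: return a else return b")
code_b = embed("def hello(): print('hello world')")

# The code that actually returns a maximum should score higher than the unrelated one.
print(torch.cosine_similarity(query, code_a, dim=0))
print(torch.cosine_similarity(query, code_b, dim=0))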


For more on how to use the HuggingFace transformers package, you can refer to [PyTorch] How to Use HuggingFace Transformers Package (With BERT Example).

