	---
	language:
	- code
	extra_gated_prompt: >-
	  ## Model License Agreement

	  Please read the BigCode [OpenRAIL-M
	  license](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement)
	  agreement before accepting it.

	extra_gated_fields:
	  I accept the above license agreement, and will use the Model complying with the set of use restrictions and sharing requirements: checkbox
	---
	# StarEncoder

	## Table of Contents

	1. [Model Summary](#model-summary)
	2. [Training](#training)
	3. [Use](#use)
	4. [Limitations](#limitations)
	5. [License](#license)

	## Model Summary

	This is an encoder-only model (i.e., bi-directionally self-attentive Transformers) trained on [The Stack](https://huggingface.co/datasets/bigcode/the-stack) dataset.

	- Project Website: [bigcode-project.org](https://www.bigcode-project.org)
	- Point of Contact: [contact@bigcode-project.org](mailto:contact@bigcode-project.org)
	- Languages: 80+ Programming languages


	We leveraged the two pre-training objectives from [BERT](https://arxiv.org/abs/1810.04805):
	- Masked Language Modelling (MLM): the model predicts masked-out tokens from an input sentence.
	- Next Sentence Prediction (NSP): the model predicts whether a pair of sentences occur as neighbors in a document.
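
The two objectives above can be sketched in plain Python to show what a training example looks like. This is a minimal illustration, not the project's data pipeline: the 15% masking rate, the 80/10/10 replacement split, and the 50/50 pair sampling are BERT's published defaults and are assumed here, since this card does not state them. The helper names `mask_tokens` and `make_nsp_pair` are hypothetical.

```python
import random

MASK_RATE = 0.15  # assumption: BERT's default, not stated in this card
IGNORE = -100     # label for positions that are not scored by the loss


def mask_tokens(token_ids, mask_id, vocab_size, rng, special_ids=frozenset()):
    """Build one MLM example: corrupted inputs plus per-position labels."""
    inputs = list(token_ids)
    labels = [IGNORE] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never corrupt special tokens; select roughly MASK_RATE of the rest.
        if tok in special_ids or rng.random() >= MASK_RATE:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the original token in place
    return inputs, labels


def make_nsp_pair(document, other_document, rng):
    """Build one NSP example: (sentence_a, sentence_b, is_neighbor)."""
    i = rng.randrange(len(document) - 1)
    if rng.random() < 0.5:
        return document[i], document[i + 1], 1  # true neighbor
    # Negative pair: draw sentence B from a different document.
    return document[i], rng.choice(other_document), 0
```

Positions labeled `IGNORE` contribute nothing to the MLM loss, so the model is only scored on the tokens that were selected for corruption.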

	## Training

	We train for 100,000 steps with a global batch size of 4,096 sequences, each with a maximum length of 1,024 tokens, so that approximately 400B tokens are observed. This takes roughly two days using 64 NVIDIA A100 GPUs.
	Details about the model architecture are reported in the table below.

	| Hyperparameter           | Value      |
	|--------------------------|------------|
	| Hidden size              | 768        |
	| Intermediate size        | 3072       |
	| Max. position embeddings | 1024       |
	| Num. of attention heads  | 12         |
	| Num. of hidden layers    | 12         |
	| Attention                | Multi-head |
	| Num. of parameters       | ≈125M      |
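
The figures in this section can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the token budget from the stated steps, batch size, and sequence length, then estimates the parameter count from the table. Two values are assumptions that do not appear in this card: a vocabulary of 49,152 tokens and two segment-type embeddings, as in BERT. The estimate also leaves out the pooler and the MLM/NSP heads.

```python
# Training budget as stated in the card.
steps = 100_000
batch_size = 4_096
max_seq_len = 1_024
# Upper bound: assumes every sequence is filled to the maximum length.
tokens_seen = steps * batch_size * max_seq_len

# Architecture values from the table.
hidden = 768
intermediate = 3_072
max_positions = 1_024
layers = 12

# ASSUMPTIONS (not stated in the card).
vocab_size = 49_152
type_vocab = 2


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Parameters of a layer norm: scale plus shift."""
    return 2 * dim


embeddings = (
    vocab_size * hidden       # token embeddings
    + max_positions * hidden  # position embeddings
    + type_vocab * hidden     # segment embeddings
    + layer_norm(hidden)
)
attention = 4 * linear(hidden, hidden) + layer_norm(hidden)  # Q, K, V, output
feed_forward = (
    linear(hidden, intermediate)
    + linear(intermediate, hidden)
    + layer_norm(hidden)
)
total_params = embeddings + layers * (attention + feed_forward)

print(f"tokens seen: {tokens_seen / 1e9:.1f}B")
print(f"parameters:  {total_params / 1e6:.1f}M")
```

Under these assumptions the token budget comes to about 419B and the encoder to about 124M parameters, both in line with the rounded figures of 400B tokens and ≈125M parameters reported here.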


	## Use

	This model was trained on code in 86 programming languages from GitHub, together with GitHub issues and Git commits, and can be efficiently fine-tuned for both code- and text-related tasks.
	We fine-tuned it on a token classification task to detect PII and released the resulting [StarPII](https://huggingface.co/bigcode/starpii) model.
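
As a starting point for feature extraction, the sketch below loads the checkpoint with the `transformers` library and mean-pools the final hidden states into a single vector. Several things here are assumptions rather than documented behavior: that the checkpoint loads through `AutoTokenizer` and `AutoModel`, that you have accepted the license and authenticated with the Hub, and that mean pooling is a reasonable way to summarize a snippet. The function name `embed_snippet` is hypothetical.

```python
def embed_snippet(code, model_id="bigcode/starencoder", max_length=1024):
    """Return one mean-pooled embedding (a list of floats) for a code snippet."""
    # Imported lazily so the heavy dependencies load only when this is called.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # Assumption: the gated checkpoint is accessible and loads via the Auto classes.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    # One snippet at a time, so no padding token is required.
    inputs = tokenizer(
        code, return_tensors="pt", truncation=True, max_length=max_length
    )
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

    # Mean pooling over the sequence dimension is a choice made for this sketch.
    return hidden.mean(dim=1).squeeze(0).tolist()
```

If the checkpoint loads as assumed, `embed_snippet("def add(a, b): return a + b")` should return a vector whose length matches the hidden size of 768. For downstream tasks such as PII detection, fine-tuning a task-specific head is likely to work better than using pooled features directly.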


	## Limitations
	There are limitations to consider when using StarEncoder. It is an encoder-only model, which limits its flexibility in certain code generation or completion tasks,
	and it was trained on data containing PII, which could pose privacy concerns. Performance may vary across the 80+ supported programming languages,
	particularly for less common ones, and the model might struggle with understanding domains outside programming languages.

	## License

	The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).