| language: | |
| - code | |
| extra_gated_prompt: >- | |
| ## Model License Agreement | |
| Please read the BigCode [OpenRAIL-M | |
| license](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) | |
| agreement before accepting it. | |
| extra_gated_fields: | |
| I accept the above license agreement, and will use the Model complying with the set of use restrictions and sharing requirements: checkbox | |
| # StarEncoder | |
| ## Table of Contents | |
| 1. [Model Summary](#model-summary) | |
| 2. [Training](#training) | |
| 3. [Use](#use) | |
| 4. [Limitations](#limitations) | |
| 5. [License](#license) | |
| ## Model Summary | |
| This is an encoder-only model (i.e., a Transformer with bidirectional self-attention) trained on [The Stack](https://huggingface.co/datasets/bigcode/the-stack) dataset. | |
| - **Project Website:** [bigcode-project.org](https://www.bigcode-project.org) | |
| - **Point of Contact:** [contact@bigcode-project.org](mailto:contact@bigcode-project.org) | |
| - **Languages:** 80+ Programming languages | |
| We leveraged the following training objectives from [BERT](https://arxiv.org/abs/1810.04805): | |
| - Masked Language Modelling (MLM): predicting masked-out tokens from an input sentence. | |
| - Next Sentence Prediction (NSP): predicting whether a pair of sentences occur as neighbors in a document. | |
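
To make the MLM objective concrete, here is a minimal sketch of how inputs could be corrupted before prediction. It assumes the masking recipe from the BERT paper (15% of positions selected; of those, 80% replaced by a mask symbol, 10% by a random token, 10% left unchanged); the card does not state which rates StarEncoder used. The `mask_tokens` helper and the `[MASK]` string are hypothetical names for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask symbol; the real tokenizer's may differ


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt `tokens` for MLM.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions the model must predict and None everywhere else.
    The 15% / 80-10-10 rates are the BERT defaults (assumed here).
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN          # replace with the mask symbol
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # replace with a random token
        # otherwise keep the original token in place
    return corrupted, labels


if __name__ == "__main__":
    code = ["def", "add", "(", "a", ",", "b", ")", ":"]
    print(mask_tokens(code, vocab=code, mask_prob=0.5))
```

The loss is computed only at positions where `labels` is not `None`; NSP adds a separate binary classification over sentence pairs.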
| ## Training | |
| We train for 100,000 steps with a global batch size of 4,096 sequences of a maximum length of 1,024 so that approximately 400B tokens are observed. This takes roughly two days using 64 NVIDIA A100 GPUs. | |
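
The token budget follows directly from the figures above. The short calculation below is a sanity check, not part of the training code; it treats every sequence as full length, so it gives an upper bound on the tokens observed.

```python
steps = 100_000
global_batch_size = 4_096   # sequences per step
max_seq_len = 1_024         # tokens per sequence (upper bound)

# Upper bound on tokens seen during training.
max_tokens_seen = steps * global_batch_size * max_seq_len
print(f"{max_tokens_seen:,} tokens")  # 419,430,400,000 tokens

# Approximate compute: "roughly two days" on 64 GPUs.
gpu_hours = 2 * 24 * 64
print(f"{gpu_hours:,} A100-hours")    # 3,072 A100-hours
```

This is consistent with the stated figure of approximately 400B tokens.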
| Details about the model architecture are reported in the table below. | |
| | Hyperparameter | Value | | |
| |--------------------------|-----------| | |
| | Hidden size | 768 | | |
| | Intermediate size | 3072 | | |
| | Max. position embeddings | 1024 | | |
| | Num. of attention heads | 12 | | |
| | Num. of hidden layers | 12 | | |
| | Attention | Multi-head| | |
| | Num. of parameters | ≈125M | | |
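
As a rough cross-check of the parameter count, the sketch below tallies the weights of a standard BERT-style encoder from the hyperparameters in the table. Two inputs are assumptions that do not appear in this card: a vocabulary size of 49,152 and 2 token-type embeddings. The `estimate_bert_params` helper is hypothetical and counts embeddings, encoder layers, and the pooler, but no task-specific heads.

```python
def estimate_bert_params(hidden, intermediate, layers, max_pos,
                         vocab_size, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder."""
    layer_norm = 2 * hidden  # scale + bias

    # Token, position and token-type embeddings, plus their LayerNorm.
    embeddings = (vocab_size + max_pos + type_vocab) * hidden + layer_norm

    # Q, K, V and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two feed-forward projections (weights + biases), plus LayerNorm.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + layer_norm)

    # Dense pooler applied to the first token's representation.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


total = estimate_bert_params(
    hidden=768, intermediate=3072, layers=12, max_pos=1024,
    vocab_size=49_152,  # assumption: not stated in this card
)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the ≈125M reported in the table; a different vocabulary size shifts the embedding term accordingly.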
| ## Use | |
| This model is trained on GitHub code in 86 programming languages, together with GitHub issues and Git commits, and can be efficiently fine-tuned for both code- and text-related tasks. | |
| We fine-tuned it on a token classification task to detect PII and released the [StarPII](https://huggingface.co/bigcode/starpii) model. | |
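
A minimal sketch of preparing the checkpoint for a token classification task of this kind is shown below. It assumes the checkpoint loads through the `transformers` Auto classes with a freshly initialized classification head, and that you have accepted the license agreement and are authenticated with the Hub, since the repository is gated. The label set is hypothetical and is not the one used by StarPII.

```python
MODEL_ID = "bigcode/starencoder"

# Hypothetical label set for illustration only; not the labels used by StarPII.
EXAMPLE_LABELS = ["O", "B-EMAIL", "I-EMAIL", "B-KEY", "I-KEY"]


def build_label_maps(labels):
    """Build the id<->label mappings stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_for_token_classification(labels=EXAMPLE_LABELS):
    """Load tokenizer and encoder with an untrained token classification head."""
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_ID,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_for_token_classification()
    print(model.config.id2label)
```

The classification head starts from random weights, so the model needs fine-tuning on labeled data before its predictions are meaningful.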
| ## Limitations | |
| There are limitations to consider when using StarEncoder. It is an encoder-only model, which limits its flexibility in certain code generation or completion tasks, | |
| and it was trained on data containing PII, which could pose privacy concerns. Performance may vary across the 80+ supported programming languages, | |
| particularly for less common ones, and the model might struggle with understanding domains outside programming languages. | |
| ## License | |
| The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement). |