Martin Müller committed on
Commit
78e44d1
1 Parent(s): bb1e236

initial commit

README.md ADDED
@@ -0,0 +1,47 @@
+ ---
+ language: fr
+ license: mit
+ tags:
+ - pytorch
+ - causal-lm
+ datasets:
+ - c4
+ ---
+
+ # Cedille AI
+ Cedille is a project to bring large language models to non-English languages.
+
+ ## Boris
+ Boris is a 6B parameter autoregressive language model based on the GPT-J architecture and trained using the [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax) codebase.
+
+ Boris was trained on around 78B tokens of French text from the [C4](https://huggingface.co/datasets/c4) dataset. We started training from GPT-J, which was trained on [The Pile](https://pile.eleuther.ai/); as a consequence, the model still performs well in English. Boris uses the unmodified GPT-2 tokenizer.
+
+ Boris is named after the great French writer [Boris Vian](https://en.wikipedia.org/wiki/Boris_Vian).
+
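+ Since Boris uses the standard GPT-J architecture and the GPT-2 tokenizer, it should load with the Hugging Face `transformers` library (a recent version with GPT-J support) like any other GPT-J checkpoint. The snippet below is a minimal, untested sketch; the local path is an assumption and should point to a directory containing the files from this repository.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Assumption: this directory holds config.json, pytorch_model.bin and the tokenizer files.
+ model_dir = "."
+
+ tokenizer = AutoTokenizer.from_pretrained(model_dir)
+ model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16)
+
+ prompt = "Le camembert est"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.9)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+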
+ # How do I test Cedille?
+ For the time being, the easiest way to test the model is to use our [publicly accessible playground](https://en.cedille.ai/).
+
+ Cedille is a relatively large model and running it in production can get expensive. Consider contacting us for API access.
+
+ # How do I cite Cedille?
+ If you build on top of Cedille, we would appreciate a citation. For the time being, please reference our work as follows:
+ ```bibtex
+ @misc{cedille2021,
+ author = {Laurent, F. and Muller, M.},
+ title = {Cedille AI},
+ year = {2021},
+ publisher = {GitHub},
+ journal = {GitHub repository},
+ howpublished = {\url{https://github.com/coteries/cedille-ai}}
+ }
+ ```
+
+ ## Contact us
+ For any custom development, please contact us at [email protected].
+
+ ## Links
+ * [Official website](https://en.cedille.ai/)
+ * [Blog](https://en.cedille.ai/blog)
+ * [GitHub](https://github.com/coteries/cedille-ai)
+ * [Twitter](https://twitter.com/CedilleAI)
+
config.json ADDED
@@ -0,0 +1,41 @@
+ {
+   "_name_or_path": ".",
+   "activation_function": "gelu_new",
+   "architectures": [
+     "GPTJForCausalLM"
+   ],
+   "attn_pdrop": 0.0,
+   "bos_token_id": 50256,
+   "embd_pdrop": 0.0,
+   "eos_token_id": 50256,
+   "gradient_checkpointing": false,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-05,
+   "model_type": "gptj",
+   "n_ctx": 2048,
+   "n_embd": 4096,
+   "n_head": 16,
+   "n_inner": null,
+   "n_layer": 28,
+   "n_positions": 2048,
+   "resid_pdrop": 0.0,
+   "rotary_dim": 64,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.1,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "task_specific_params": {
+     "text-generation": {
+       "do_sample": true,
+       "max_length": 50,
+       "temperature": 1.0
+     }
+   },
+   "tokenizer_class": "GPT2Tokenizer",
+   "torch_dtype": "float32",
+   "transformers_version": "4.12.5",
+   "use_cache": true,
+   "vocab_size": 50400
+ }
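These hyperparameters follow the GPT-J-6B layout: 28 layers, a hidden size of 4096, 16 attention heads, rotary embeddings over 64 dimensions, and a 2048-token context window. As a quick sanity check, the config can be loaded and inspected with `transformers`; this is a hedged sketch that assumes config.json sits in the current directory.

```python
from transformers import AutoConfig

# Assumption: config.json from this commit is in the current directory.
config = AutoConfig.from_pretrained(".")

print(config.model_type)    # gptj
print(config.n_layer)       # 28 transformer blocks
print(config.n_embd)        # 4096-dimensional hidden states
print(config.n_positions)   # 2048-token context window
print(config.vocab_size)    # 50400 vocabulary entries
```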
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6f7816e3333f19cd82f088aacd5298bd6e86f2a1a14b36d4a9da6b8ec97bc317
+ size 24321098883
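This is a Git LFS pointer rather than the weights themselves; the actual ~24 GB float32 checkpoint is fetched by `git lfs pull` or the Hub's download tooling. To confirm that a downloaded pytorch_model.bin is intact, its size and SHA-256 digest can be compared against the pointer above. A minimal sketch (the file path is an assumption):

```python
import hashlib
import os

# Assumption: path to the fully downloaded weights, not the LFS pointer file.
path = "pytorch_model.bin"

# Expected values taken from the LFS pointer above.
expected_sha256 = "6f7816e3333f19cd82f088aacd5298bd6e86f2a1a14b36d4a9da6b8ec97bc317"
expected_size = 24321098883

assert os.path.getsize(path) == expected_size, "unexpected file size"

sha256 = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        sha256.update(chunk)

assert sha256.hexdigest() == expected_sha256, "checksum mismatch"
print("pytorch_model.bin matches the LFS pointer")
```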
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "<|endoftext|>", "bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "add_prefix_space": false, "model_max_length": 1024, "special_tokens_map_file": null, "name_or_path": "gpt2", "tokenizer_class": "GPT2Tokenizer"}
vocab.json ADDED
The diff for this file is too large to render. See raw diff