system HF staff committed on
Commit
1a7edc5
0 Parent(s):

add models

.gitattributes ADDED
@@ -0,0 +1,21 @@
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tar.gz filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+
+ *.gensim filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+
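To see which files from this commit the patterns above route through Git LFS, the rules can be approximated with Python's `fnmatch`. This is a rough sketch rather than Git's own matcher: it assumes that a pattern without a slash is compared against the file's basename, and the helper name `is_lfs_tracked` is made up for illustration.

```python
from fnmatch import fnmatchcase
from posixpath import basename

# Patterns copied from the .gitattributes hunk above.
LFS_PATTERNS = [
    "*.bin.*", "*.lfs.*", "*.bin", "*.h5", "*.tflite", "*.tar.gz", "*.ot",
    "*.onnx", "*.arrow", "*.ftz", "*.joblib", "*.model", "*.msgpack",
    "*.pb", "*.pt", "*.pth", "*tfevents*", "*.gensim", "*.npy",
]


def is_lfs_tracked(path: str) -> bool:
    """Hypothetical helper: approximate Git's matching by testing the basename."""
    name = basename(path)
    return any(fnmatchcase(name, pattern) for pattern in LFS_PATTERNS)


# File names taken from this commit.
for path in [
    "README.md",
    "default_config.json",
    "cbow.v300.m8.hs.mwe.w2v.gensim",
    "cbow.v300.m8.hs.mwe.w2v.gensim.vectors.npy",
    "test/dummy.model.gensim",
]:
    print(path, is_lfs_tracked(path))
```

Under this approximation the `.gensim` and `.npy` files are LFS-tracked while the Markdown and JSON files are stored as regular Git blobs, which agrees with the pointer files shown later in the diff.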
README.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ language: pl
+ tags:
+ - word2vec
+ datasets:
+ - KGR10
+ ---
+
+ # KGR10 word2vec Polish word embeddings
+
+ Distributional language models for Polish trained on the KGR10 corpora.
+
+ ## Models
+
+ The repository contains two models, selected after evaluation (see the table below).
+ The best-performing model is the default model/config (see `default_config.json`).
+
+ |method|dimension|hs|mwe||
+ |---|---|---|---|---|
+ |cbow|300|true|true| <-- default |
+ |skipgram|300|false|true| |
+
+
+ ## Usage
+
+ The easiest way to use these embedding models is to install the [embeddings](https://github.com/CLARIN-PL/embeddings) library.
+
+ ```bash
+ pip install clarinpl-embeddings
+ ```
+
+ ### Utilising the default model (the easiest way)
+
+ Word embedding:
+
+ ```python
+ from embeddings.embedding.auto_flair import AutoFlairWordEmbedding
+ from flair.data import Sentence
+
+ sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
+
+ embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/word2vec-kgr10")
+ embedding.embed([sentence])
+
+ for token in sentence:
+     print(token)
+     print(token.embedding)
+ ```
+
+ Document embedding (averaged over words):
+
+ ```python
+ from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
+ from flair.data import Sentence
+
+ sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
+
+ embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/word2vec-kgr10")
+ embedding.embed([sentence])
+
+ print(sentence.embedding)
+ ```
+
+ ### Customisable way
+
+ Word embedding:
+
+ ```python
+ from embeddings.embedding.static.embedding import AutoStaticWordEmbedding
+ from embeddings.embedding.static.word2vec import KGR10Word2VecConfig
+ from flair.data import Sentence
+
+ config = KGR10Word2VecConfig(method='skipgram', hs=False)
+ embedding = AutoStaticWordEmbedding.from_config(config)
+
+ sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
+ embedding.embed([sentence])
+
+ for token in sentence:
+     print(token)
+     print(token.embedding)
+ ```
+
+ Document embedding (averaged over words):
+
+ ```python
+ from embeddings.embedding.static.embedding import AutoStaticDocumentEmbedding
+ from embeddings.embedding.static.word2vec import KGR10Word2VecConfig
+ from flair.data import Sentence
+
+ config = KGR10Word2VecConfig(method='skipgram', hs=False)
+ embedding = AutoStaticDocumentEmbedding.from_config(config)
+
+ sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
+ embedding.embed([sentence])
+
+ print(sentence.embedding)
+ ```
+
+ ## Citation
+
+ ```
+ Piasecki, Maciej; Janz, Arkadiusz; Kaszewski, Dominik; et al., 2017, Word Embeddings for Polish, CLARIN-PL digital repository, http://hdl.handle.net/11321/442.
+ ```
+
+ or
+
+
+ ```
+ @misc{11321/442,
+   title = {Word Embeddings for Polish},
+   author = {Piasecki, Maciej and Janz, Arkadiusz and Kaszewski, Dominik and Czachor, Gabriela},
+   url = {http://hdl.handle.net/11321/442},
+   note = {{CLARIN}-{PL} digital repository},
+   copyright = {{GNU} {GPL3}},
+   year = {2017}
+ }
+ ```
+
cbow.v300.m8.hs.mwe.w2v.gensim ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7354e9b6ca9f5e5ab285a1e486ad94475d4d815cb3522cb0850d88aa6f9affcc
+ size 138822242
cbow.v300.m8.hs.mwe.w2v.gensim.vectors.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c972cacbe38ce49a5e7430432290cf951f301740f2c49badc305e99c37e67f18
+ size 2740052528
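The size recorded in the pointer above allows a rough estimate of how many vectors the file holds. The sketch below assumes, without confirmation from this commit, that the `.npy` file stores a two-dimensional `float32` array with 300 columns behind a 128-byte NumPy header. Under those assumptions the payload divides evenly into rows.

```python
FILE_SIZE = 2_740_052_528  # bytes, from the LFS pointer above
HEADER_BYTES = 128         # assumption: common .npy header length
DIMENSION = 300            # vector dimension listed in the README table
BYTES_PER_VALUE = 4        # assumption: float32 values

row_bytes = DIMENSION * BYTES_PER_VALUE
rows, remainder = divmod(FILE_SIZE - HEADER_BYTES, row_bytes)

# A remainder of zero is consistent with the assumed layout.
print(rows, remainder)
```

If any of the assumptions is wrong (for example `float64` values or a different header length), the row count changes accordingly, so the figure should be treated as an estimate only.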
default_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "method": "cbow",
+   "dimension": 300,
+   "hs": true,
+   "mwe": true
+ }
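The model file names in this commit appear to encode the same fields as the config above. The sketch below illustrates that apparent convention. The function `model_filename` is hypothetical, the `m8` segment is copied verbatim because its meaning is not stated here, and the `hs`/`ns` mapping is inferred from the two model file names only.

```python
def model_filename(method: str, dimension: int, hs: bool, mwe: bool) -> str:
    """Hypothetical helper: build a file name following the pattern seen in this commit."""
    if not mwe:
        # No file without the "mwe" segment appears in this commit,
        # so its naming is unknown and is not guessed here.
        raise ValueError("naming for mwe=False is not shown in this commit")
    sampling = "hs" if hs else "ns"  # inferred from the two file names
    return f"{method}.v{dimension}.m8.{sampling}.mwe.w2v.gensim"


# Values copied from default_config.json.
default_config = {"method": "cbow", "dimension": 300, "hs": True, "mwe": True}
print(model_filename(**default_config))
```

With the default config this reproduces the name of the CBOW model file added in this commit, and `method="skipgram", hs=False` reproduces the skip-gram one.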
module.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "type": "embeddings.embedding.static.word2vec.KGR10Word2VecEmbedding"
+ }
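A `type` field holding a dotted path, like the one above, is commonly resolved to a class with `importlib`. How the embeddings library actually consumes `module.json` is not shown in this commit. The helper `resolve_type` below is a hypothetical sketch, demonstrated with a standard-library class so that it runs without the library installed.

```python
import importlib
import json


def resolve_type(dotted_path: str):
    """Hypothetical helper: import the module part and return the named attribute."""
    module_name, _, attribute = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Stand-in document with the same shape as module.json; the path points at a
# standard-library class instead of the embeddings class.
module_json = json.loads('{"type": "collections.OrderedDict"}')
cls = resolve_type(module_json["type"])
print(cls.__name__)
```

With the embeddings package installed, the same call on the path from `module.json` would be expected to return the `KGR10Word2VecEmbedding` class, provided the module layout matches the path.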
skipgram.v300.m8.ns.mwe.w2v.gensim ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7354e9b6ca9f5e5ab285a1e486ad94475d4d815cb3522cb0850d88aa6f9affcc
+ size 138822242
skipgram.v300.m8.ns.mwe.w2v.gensim.vectors.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:36da9ee479a7858f4b0a5a80ac3fe8298026e8edae0ea47036833fcf3eaec53f
+ size 2740052528
test/dummy.model.gensim ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8d162cf4354c63549da5b47745a17538ca869b69bfface2c020fb5bf782d2638
+ size 28197
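Each model file in this commit is stored as a Git LFS pointer: three `key value` lines giving the spec version, the content hash, and the size in bytes. The following minimal sketch parses such a pointer and checks downloaded bytes against it. The helper names `parse_lfs_pointer` and `matches_pointer` are illustrative and not part of any Git LFS tooling.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a pointer file into a dictionary."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Return True when both the byte length and the hash agree with the pointer."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algorithm"], data).hexdigest() == pointer["digest"]


# Pointer text copied from test/dummy.model.gensim above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:8d162cf4354c63549da5b47745a17538ca869b69bfface2c020fb5bf782d2638\n"
    "size 28197\n"
)
print(pointer["algorithm"], pointer["size"])
```

For the multi-gigabyte vector files it would be preferable to hash the file in chunks instead of holding it in memory; the in-memory version here keeps the sketch short.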