---
language: pl
tags:
- word2vec
datasets:
- KGR10
---

# KGR10 word2vec Polish word embeddings

Distributional language models for Polish trained on the KGR10 corpora.

## Models

This repository contains two models, selected after evaluation (see the table below).
The best-performing model is the default model/config (see `default_config.json`).

|method|dimension|hs|mwe||
|---|---|---|---|---|
|cbow|300|false|true|<-- default|
|skipgram|300|true|true||

## Usage

The easiest way to use these embedding models is to install the [embeddings](https://github.com/CLARIN-PL/embeddings) library.

```bash
pip install clarinpl-embeddings
```

### Utilising the default model (the easiest way)

Word embedding:

```python
from embeddings.embedding.auto_flair import AutoFlairWordEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

# Load the default model from the hub and embed the sentence in place
embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/word2vec-kgr10")
embedding.embed([sentence])

# Each token now carries its own word vector
for token in sentence:
    print(token)
    print(token.embedding)
```
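
A common next step with word vectors is to compare them. The sketch below computes cosine similarity in plain Python. The function name `cosine_similarity` and the three-dimensional `toy_vectors` are made up for illustration; they are not part of the `embeddings` library and are not real KGR10 values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: an all-zero vector is similar to nothing
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors standing in for real token embeddings
toy_vectors = {
    "myśl": [0.2, 0.7, 0.1],
    "słowo": [0.1, 0.8, 0.2],
    "leci": [0.9, 0.0, 0.3],
}

print(cosine_similarity(toy_vectors["myśl"], toy_vectors["słowo"]))
print(cosine_similarity(toy_vectors["myśl"], toy_vectors["leci"]))
```

To apply this to real embeddings, assuming `token.embedding` is a PyTorch tensor as in flair, `token.embedding.tolist()` should yield a list that can be passed to the function.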

Document embedding (averaged over words):

```python
from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

# Load the default model from the hub and embed the sentence in place
embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/word2vec-kgr10")
embedding.embed([sentence])

# The sentence now carries a single vector for the whole text
print(sentence.embedding)
```
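
To make "averaged over words" concrete, here is a minimal sketch of component-wise averaging in plain Python. It illustrates the idea only and is not the library's implementation; `mean_pool` and the toy vectors are hypothetical.

```python
def mean_pool(word_vectors):
    """Average equal-length word vectors component-wise."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dim = len(word_vectors[0])
    if any(len(v) != dim for v in word_vectors):
        raise ValueError("all word vectors must have the same dimension")
    count = len(word_vectors)
    return [sum(v[i] for v in word_vectors) / count for i in range(dim)]


# Hypothetical 3-dimensional word vectors for a three-word document
toy_word_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

document_vector = mean_pool(toy_word_vectors)
print(document_vector)  # [2.0, 2.0, 1.0]
```

The result has the same dimension as the individual word vectors, so with a 300-dimensional model the document vector would also have 300 components.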

### Customisable way

Word embedding:

```python
from embeddings.embedding.static.embedding import AutoStaticWordEmbedding
from embeddings.embedding.static.word2vec import KGR10Word2VecConfig
from flair.data import Sentence

# Choose a model variant explicitly instead of relying on the default config
config = KGR10Word2VecConfig(method='skipgram', hs=False)
embedding = AutoStaticWordEmbedding.from_config(config)

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])

# Each token now carries its own word vector
for token in sentence:
    print(token)
    print(token.embedding)
```

Document embedding (averaged over words):

```python
from embeddings.embedding.static.embedding import AutoStaticDocumentEmbedding
from embeddings.embedding.static.word2vec import KGR10Word2VecConfig
from flair.data import Sentence

# Choose a model variant explicitly instead of relying on the default config
config = KGR10Word2VecConfig(method='skipgram', hs=False)
embedding = AutoStaticDocumentEmbedding.from_config(config)

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])

# The sentence now carries a single vector for the whole text
print(sentence.embedding)
```

## Citation

```
Piasecki, Maciej; Janz, Arkadiusz; Kaszewski, Dominik; et al., 2017, Word Embeddings for Polish, CLARIN-PL digital repository, http://hdl.handle.net/11321/442.
```

or

```
@misc{11321/442,
 title = {Word Embeddings for Polish},
 author = {Piasecki, Maciej and Janz, Arkadiusz and Kaszewski, Dominik and Czachor, Gabriela},
 url = {http://hdl.handle.net/11321/442},
 note = {{CLARIN}-{PL} digital repository},
 copyright = {{GNU} {GPL3}},
 year = {2017}
}
```