# Model Card for Maken Books
Maken Books is a Doc2Vec model trained using Gensim on almost 600,000 books from the National Library of Norway.
## Model Details

### Model Description
- Developed by: Rolv-Arild Braaten and Javier de la Rosa
- Shared by: Javier de la Rosa
- Model type: Doc2Vec (`gensim==4.3.2`)
- Language(s) (NLP): Norwegian and English
- License: Apache 2.0
### Model Sources
- Repository: https://nb.no/
- Demo: https://nb.no/maken
## Uses

### Direct Use
It allows clustering book-length texts or finding similarities between long-form texts using an embedding space of 1024 dimensions. The model is used in production at the Maken site.
```python
import re
from pathlib import Path

from gensim.models.doc2vec import Doc2Vec
from huggingface_hub import snapshot_download

model = Doc2Vec.load(str(
    Path(snapshot_download("NbAiLab/maken-books")) / "model.bin"
))
book = "A long text"
words = [c for c in re.split(r"\W+", book) if len(c) > 0]
embedding = model.infer_vector(words)
# array([ 0.01048528, -0.00491689,  0.01981961, ...,  0.00250911,
#        -0.00657777, -0.01207202], dtype=float32)
```
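Inferred vectors can then be compared to measure how similar two books are. The snippet below is an illustrative sketch, not part of the original card: it uses a plain-Python cosine similarity over hypothetical embedding values (in practice, Gensim's `model.dv.most_similar` offers ready-made nearest-neighbor lookup over the trained document vectors).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical truncated embeddings standing in for two books:
emb_a = [0.0105, -0.0049, 0.0198]
emb_b = [0.0100, -0.0050, 0.0200]
similarity = cosine_similarity(emb_a, emb_b)  # close to 1.0 for near-identical vectors
```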
## Bias, Risks, and Limitations
The majority of books used for training are written in Norwegian, either Bokmål or Nynorsk. As such, the semantics of the embedding space might not work as expected for books in other languages, as no work has been done to align them.
## Training Details

### Training Data
Books from the National Library of Norway, up to November 21st, 2022.
### Training Procedure

#### Preprocessing

Plain text files split on non-word characters with `re.split(r"\W+", book)`.
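As an illustrative sketch (the sample text is made up, not from the training data), this split yields word tokens but can also produce empty strings at punctuation boundaries, which is why the inference snippet above filters out zero-length items:

```python
import re

book = "Hello, world! Dette er en bok."
tokens = [t for t in re.split(r"\W+", book) if len(t) > 0]
# tokens == ['Hello', 'world', 'Dette', 'er', 'en', 'bok']
```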
#### Training Hyperparameters

- Training regime: 12 workers for 8 hours
- Epochs: 10
- Learning rate: `alpha` 0.025, `min_alpha` 0.0001
- Embedding space: `dims` 1024
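These hyperparameters map directly onto Gensim's `Doc2Vec` constructor. The following is a configuration sketch, not the authors' actual training script; the `min_count=1000` value is taken from the Technical Specifications section below, and no corpus is attached here.

```python
from gensim.models.doc2vec import Doc2Vec

# Configuration sketch only; pass a corpus of TaggedDocument objects to train.
model = Doc2Vec(
    dm=1,              # PV-DM (distributed memory), per the architecture section
    vector_size=1024,  # 1024-dimensional embedding space
    alpha=0.025,       # initial learning rate
    min_alpha=0.0001,  # learning-rate floor
    epochs=10,
    workers=12,
    min_count=1000,    # ignore words with total frequency lower than 1000
)
```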
## Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: Intel Xeon Gold 6248R
- Hours used: 8 hours (TDP of 150W)
- Cloud Provider: On premises
- Compute Region: Norway
- Carbon Emitted: 0.52 kgCO$_2$eq (0.432 kgCO$_2$eq/kWh)
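The reported figure is consistent with a simple back-of-the-envelope calculation (this arithmetic is an illustration, not from the original card): 8 hours at a 150 W TDP gives 1.2 kWh, multiplied by the stated grid intensity of 0.432 kgCO2eq/kWh.

```python
tdp_kw = 150 / 1000           # 150 W TDP expressed in kW
hours = 8
grid_intensity = 0.432        # kgCO2eq per kWh, as stated above

energy_kwh = tdp_kw * hours   # 1.2 kWh
emissions = energy_kwh * grid_intensity
print(round(emissions, 2))    # 0.52
```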
## Technical Specifications

### Model Architecture and Objective
Doc2Vec using distributed memory (PV-DM) with 1024 dimensions, ignoring all words with a total frequency lower than 1000.
### Compute Infrastructure

#### Hardware

```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       46 bits physical, 48 bits virtual
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Vendor ID:           GenuineIntel
Model name:          Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
CPU family:          6
Model:               85
Thread(s) per core:  2
Core(s) per socket:  24
```
#### Software

- Python 3.10.6
- `gensim==4.3.2`
- `numpy==1.24.3`
- `huggingface-hub==0.17.1` (for downloading the model)
## Citation
Coming soon...
## Model Card Authors and Contact
Javier de la Rosa ([email protected])
## Disclaimer

The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models themselves, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence. In no event shall the owner of the models (the National Library of Norway) be liable for any results arising from the use made by third parties of these models.