# Model Card for Maken Books
Maken Books is a Doc2Vec model trained using Gensim on almost 600,000 books from the National Library of Norway.
## Model Details

### Model Description
- Developed by: Rolv-Arild Braaten and Javier de la Rosa
- Shared by: Javier de la Rosa
- Model type: Doc2Vec (`gensim==4.3.2`)
- Language(s) (NLP): Norwegian and English
- License: Apache 2.0
### Model Sources
- Repository: https://nb.no/
- Demo: https://nb.no/maken
## Uses

### Direct Use
It allows clustering book-length texts or finding similarities between long-form texts using an embedding space of 1024 dimensions. The model is used in production at the Maken site.
```python
import re
from pathlib import Path

from gensim.models.doc2vec import Doc2Vec
from huggingface_hub import snapshot_download

model = Doc2Vec.load(str(
    Path(snapshot_download("NbAiLab/maken-books")) / "model.bin"
))
book = "A long text"
words = [c for c in re.split(r"\W+", book) if len(c) > 0]
embedding = model.infer_vector(words)
# array([ 0.01048528, -0.00491689,  0.01981961, ...,  0.00250911,
#        -0.00657777, -0.01207202], dtype=float32)
```
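Inferred vectors can then be compared to measure how similar two books are. The snippet below is an illustrative sketch, not part of the original card: it uses a plain-Python cosine similarity over hypothetical embedding values (in practice, Gensim's `model.dv.most_similar` offers ready-made nearest-neighbor lookup over the trained document vectors).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical truncated embeddings standing in for two books:
emb_a = [0.0105, -0.0049, 0.0198]
emb_b = [0.0100, -0.0050, 0.0200]
similarity = cosine_similarity(emb_a, emb_b)  # close to 1.0 for near-identical vectors
```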
## Bias, Risks, and Limitations
The majority of books used for training are written in Norwegian, either Bokmål or Nynorsk. As such, the semantics of the embedding space might not work as expected for books in other languages, as no work has been done to align them.
## Training Details

### Training Data
Books from the National Library of Norway, up to November 21st, 2022.
### Training Procedure

#### Preprocessing

Plain text files split on non-word characters with `re.split(r"\W+", book)`.
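As an illustrative sketch (the sample text is made up, not from the training data), this split yields word tokens but can also produce empty strings at punctuation boundaries, which is why the inference snippet above filters out zero-length items:

```python
import re

book = "Hello, world! Dette er en bok."
tokens = [t for t in re.split(r"\W+", book) if len(t) > 0]
# tokens == ['Hello', 'world', 'Dette', 'er', 'en', 'bok']
```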
#### Training Hyperparameters

- Training regime: 12 workers for 8 hours
- Epochs: 10
- Learning rate: `alpha` 0.025, `min_alpha` 0.0001
- Embedding space: `dims` 1024
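These hyperparameters map directly onto Gensim's `Doc2Vec` constructor. The following is a configuration sketch, not the authors' actual training script; the `min_count=1000` value is taken from the Technical Specifications section below, and no corpus is attached here.

```python
from gensim.models.doc2vec import Doc2Vec

# Configuration sketch only; pass a corpus of TaggedDocument objects to train.
model = Doc2Vec(
    dm=1,              # PV-DM (distributed memory), per the architecture section
    vector_size=1024,  # 1024-dimensional embedding space
    alpha=0.025,       # initial learning rate
    min_alpha=0.0001,  # learning-rate floor
    epochs=10,
    workers=12,
    min_count=1000,    # ignore words with total frequency lower than 1000
)
```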
## Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: Intel Xeon Gold 6248R
- Hours used: 8 hours (TDP of 150W)
- Cloud Provider: On premises
- Compute Region: Norway
- Carbon Emitted: 0.52 kgCO$_2$eq (0.432 kgCO$_2$eq/kWh)
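The reported figure is consistent with a simple back-of-the-envelope calculation (this arithmetic is an illustration, not from the original card): 8 hours at a 150 W TDP gives 1.2 kWh, multiplied by the stated grid intensity of 0.432 kgCO2eq/kWh.

```python
tdp_kw = 150 / 1000           # 150 W TDP expressed in kW
hours = 8
grid_intensity = 0.432        # kgCO2eq per kWh, as stated above

energy_kwh = tdp_kw * hours   # 1.2 kWh
emissions = energy_kwh * grid_intensity
print(round(emissions, 2))    # 0.52
```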
## Technical Specifications

### Model Architecture and Objective
Doc2Vec using distributed memory (PV-DM) with 1024 dimensions, ignoring all words with a total frequency lower than 1000.
### Compute Infrastructure

#### Hardware

```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       46 bits physical, 48 bits virtual
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Vendor ID:           GenuineIntel
Model name:          Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
CPU family:          6
Model:               85
Thread(s) per core:  2
Core(s) per socket:  24
```
#### Software

- Python 3.10.6
- `gensim==4.3.2`
- `numpy==1.24.3`
- `huggingface-hub==0.17.1` (for downloading the model)
## Citation
Coming soon...
## Model Card Authors and Contact
Javier de la Rosa ([email protected])
## Disclaimer

The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models themselves, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence. In no event shall the owner of the models (the National Library of Norway) be liable for any results arising from the use made by third parties of these models.