metadata

license: apache-2.0
language:
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gu
  - ha
  - he
  - hi
  - hr
  - hu
  - hy
  - id
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - sa
  - sd
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - tl
  - tr
  - ug
  - uk
  - ur
  - uz
  - vi
  - xh
  - yi
  - zh

AffilGood-AffilXLM

For the first two tasks, we fine-tuned two models, RoBERTa and XLM-RoBERTa, for (predominantly) English and multilingual datasets, respectively. Gururangan et al. (2020) show that continued pre-training of language models on task-relevant unlabeled data can improve the performance of the final fine-tuned task-specific models, particularly in low-resource situations. Because affiliation strings have a structure of their own, different from that of free natural language, we explore whether our affiliation span identification and NER models benefit from being fine-tuned from models that have been further pre-trained on raw affiliation strings with the masked token prediction objective.

We adapt the models to 10 million randomly sampled raw affiliation strings from OpenAlex and report perplexity on 50k randomly held-out affiliation strings. In what follows, we refer to our adapted models as AffilRoBERTa (adapted RoBERTa) and AffilXLM (adapted XLM-RoBERTa).

Specific details of the adaptive pre-training procedure can be found in Duran-Silva et al. (2024).
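Since the adapted model was trained for masked token prediction, one way to probe it is to mask a word in an affiliation string and inspect the predicted fillers. The following is a minimal sketch, not the authors' code: `mask_word` and `predict_masked` are hypothetical helper names, the affiliation string is an illustrative example, and the Hub repository id of this model is not given in this card, so it is left as a required argument. It assumes the `transformers` library is installed and that the tokenizer uses `<mask>` as its mask token, as RoBERTa and XLM-RoBERTa do.

```python
def mask_word(text: str, word: str, mask_token: str = "<mask>") -> str:
    """Replace the first occurrence of `word` in `text` with the mask token."""
    if word not in text:
        raise ValueError(f"{word!r} not found in text")
    return text.replace(word, mask_token, 1)


def predict_masked(model_id: str, masked_text: str, top_k: int = 5):
    """Return (token, score) pairs for a string containing one mask token.

    `model_id` is the Hub repository id of the model (an assumption:
    it must be supplied by the caller, it is not stated in this card).
    """
    # Imported lazily so that mask_word can be used without transformers.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    return [(p["token_str"], p["score"]) for p in fill(masked_text, top_k=top_k)]


# Illustrative affiliation string (not taken from the training data).
affiliation = "Department of Physics, University of Barcelona, Spain"
masked = mask_word(affiliation, "Barcelona")
# masked is "Department of Physics, University of <mask>, Spain"

# Uncomment and supply the repository id to query the model:
# for token, score in predict_masked("<repository id>", masked):
#     print(f"{token}\t{score:.3f}")
```

Model loading is kept inside `predict_masked` so that nothing is downloaded until a repository id is actually supplied.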

Evaluation

We report the masked language modeling loss as perplexity (PPL) on 50k randomly sampled held-out raw affiliation strings.

Model	PPL_base	PPL_adapt
RoBERTa	1.972	1.106
XLM-RoBERTa	1.997	1.101
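To make the perplexity figures easier to interpret, the sketch below shows the usual conversion between a mean masked language modeling loss and perplexity, and computes the relative reduction from the values in the table above. It assumes the loss is a mean cross-entropy in nats, so that PPL = exp(loss); the card does not state this explicitly, and the function names are hypothetical.

```python
import math


def perplexity(mean_mlm_loss: float) -> float:
    """Perplexity from a mean cross-entropy loss in nats (assumed convention)."""
    return math.exp(mean_mlm_loss)


def relative_reduction(ppl_base: float, ppl_adapt: float) -> float:
    """Fraction by which perplexity drops after adaptation."""
    return (ppl_base - ppl_adapt) / ppl_base


# (PPL_base, PPL_adapt) copied from the table above.
reported = {
    "RoBERTa": (1.972, 1.106),
    "XLM-RoBERTa": (1.997, 1.101),
}

for name, (base, adapt) in reported.items():
    # The implied loss is ln(PPL) under the assumed convention.
    print(
        f"{name}: implied loss {math.log(base):.3f} -> {math.log(adapt):.3f}, "
        f"relative PPL reduction {relative_reduction(base, adapt):.1%}"
    )
```

Under this convention both models see their perplexity drop by roughly 44 to 45 percent after adaptation.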

The adapted models (AffilRoBERTa and AffilXLM) achieve competitive performance on two affiliation string processing tasks compared to the base models:

Task	RoBERTa	XLM	AffilRoBERTa	AffilXLM (this model)
AffilGood-NER	.910	.915	.920	.925
AffilGood-SPAN	.929	.931	.938	.927

Citation

@inproceedings{duran-silva-etal-2024-affilgood,
    title = "{A}ffil{G}ood: Building reliable institution name disambiguation tools to improve scientific literature analysis",
    author = "Duran-Silva, Nicolau  and
      Accuosto, Pablo  and
      Przyby{\l}a, Piotr  and
      Saggion, Horacio",
    editor = "Ghosal, Tirthankar  and
      Singh, Amanpreet  and
      Waard, Anita  and
      Mayr, Philipp  and
      Naik, Aakanksha  and
      Weller, Orion  and
      Lee, Yoonjoo  and
      Shen, Shannon  and
      Qin, Yanxia",
    booktitle = "Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.sdp-1.13",
    pages = "135--144",
}

Disclaimer

The model published in this repository is intended for general-purpose use and is made available to third parties under an Apache v2.0 License.

Please keep in mind that the model may contain biases and/or other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using this model (or a system based on it), or become users of the model themselves, it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties.