Summary of the tokenizers
On this page, we will take a closer look at tokenization.
As we saw in the preprocessing tutorial, tokenizing a text means splitting it into words or subwords, which are then converted to ids through a look-up table. Converting words or subwords to ids is straightforward, so in this summary we will focus on splitting a text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, and show examples of which model uses which tokenizer type.
On each model page, you can look at the documentation of the associated tokenizer to know which tokenizer type was used by the pretrained model. For instance, if we look at BertTokenizer, we can see that the model uses WordPiece.
Introduction
Splitting a text into smaller chunks is a harder task than it looks, and there are multiple ways of doing so. For instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do."
A simple way of tokenizing this text is to split it by spaces, which would give:
["Don't", "you", "love", "ð€", "Transformers?", "We", "sure", "do."]
This is a sensible first step, but if we look at the tokens "Transformers?" and "do.", we notice that the punctuation is attached to the words "Transformers" and "do", which is suboptimal. We should take the punctuation into account so that a model does not have to learn a different representation of a word and every possible punctuation symbol that could follow it, which would explode the number of representations the model has to learn. Taking punctuation into account, tokenizing our exemplary text would give:
["Don", "'", "t", "you", "love", "ð€", "Transformers", "?", "We", "sure", "do", "."]
Better. However, it is disadvantageous how the tokenization dealt with the word "Don't". "Don't" stands for "do not", so it would be better tokenized as ["Do", "n't"]. This is where things start getting complicated, and it is part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, a different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an input that was tokenized with the same rules that were used to tokenize its training data.
spaCy and Moses are two popular rule-based tokenizers. Applying them to our example, spaCy and Moses would output something like:
["Do", "n't", "you", "love", "ð€", "Transformers", "?", "We", "sure", "do", "."]
As can be seen, space and punctuation tokenization as well as rule-based tokenization are used here. Both are examples of word tokenization, which is loosely defined as splitting sentences into words. While it is the most intuitive way to split texts into smaller chunks, this tokenization method can lead to problems for massive text corpora. In that case, space and punctuation tokenization usually generates a very big vocabulary (the set of all unique words and tokens used). E.g., Transformer-XL uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!
Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size greater than 50,000, especially if they are pretrained only on a single language.
So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters?
While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder for the model to learn meaningful input representations. E.g., learning a meaningful context-independent representation for the letter "t" is much harder than learning a context-independent representation for the word "today". Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds, transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.
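Character tokenization itself is trivial, e.g.:

>>> list("today")
['t', 'o', 'd', 'a', 'y']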
Subword tokenization
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". Both "annoying" and "ly" as stand-alone subwords would appear more frequently, while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.
Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful context-independent representations. In addition, subword tokenization enables the model to process words it has never seen before, by decomposing them into known subwords. For instance, BertTokenizer tokenizes "I have a new GPU!" as follows:
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> tokenizer.tokenize("I have a new GPU!")
["i", "have", "a", "new", "gp", "##u", "!"]
Because we are considering the uncased model, the sentence was lowercased first. We can see that the words ["i", "have", "a", "new"] are present in the tokenizer's vocabulary, but the word "gpu" is not. Consequently, the tokenizer splits "gpu" into known subwords: ["gp", "##u"]. "##" means that the rest of the token should be attached to the previous one, without space (for decoding or reversal of the tokenization).
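Continuing the example, the tokens are mapped to ids through the look-up table, and decoding re-attaches the "##" pieces:

>>> ids = tokenizer.convert_tokens_to_ids(["i", "have", "a", "new", "gp", "##u", "!"])
>>> tokenizer.decode(ids)
'i have a new gpu!'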
As another example, XLNetTokenizer tokenizes our previously exemplary text as follows:
>>> from transformers import XLNetTokenizer
>>> tokenizer = XLNetTokenizer.from_pretrained("xlnet/xlnet-base-cased")
>>> tokenizer.tokenize("Don't you love ð€ Transformers? We sure do.")
["âDon", "'", "t", "âyou", "âlove", "â", "ð€", "â", "Transform", "ers", "?", "âWe", "âsure", "âdo", "."]
We will get back to the meaning of those "▁" when we look at SentencePiece. As one can see, the rare word "Transformers" has been split into the more frequent subwords "Transform" and "ers".
Let's now look at how the different subword tokenization algorithms work. Note that all of those tokenization algorithms rely on some form of training, which is usually done on the corpus the corresponding model will be trained on.
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) was introduced in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015). BPE relies on a pre-tokenizer that splits the training data into words. Pre-tokenization can be as simple as space tokenization, e.g. GPT-2 and RoBERTa. More advanced pre-tokenization includes rule-based tokenization, e.g. XLM and FlauBERT, which use Moses for most languages, or GPT, which uses spaCy and ftfy, to count the frequency of each word in the training corpus.
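A minimal sketch of such a frequency count, assuming simple space pre-tokenization on a toy corpus:

>>> from collections import Counter
>>> Counter("hug pug pun bun hug hugs".split())
Counter({'hug': 2, 'pug': 1, 'pun': 1, 'bun': 1, 'hugs': 1})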
After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of the symbols that occur in the set of unique words, and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to define before training the tokenizer.
As an example, let's assume that after pre-tokenization, the following set of words including their frequencies has been determined:
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
Consequently, the base vocabulary is ["b", "g", "h", "n", "p", "s", "u"]. Splitting all words into symbols of the base vocabulary, we obtain:
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In the example above, "h" followed by "u" is present 10 + 5 = 15 times (10 times in the 10 occurrences of "hug", 5 times in the 5 occurrences of "hugs"). However, the most frequent symbol pair is "u" followed by "g", occurring 10 + 5 + 5 = 20 times in total (10 times in "hug", 5 times in "pug", 5 times in "hugs"). Thus, the first merge rule the tokenizer learns is to group all "u" symbols followed by a "g" symbol together. Next, "ug" is added to the vocabulary. The set of words then becomes:
("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
BPE then identifies the next most common symbol pair. It's "u" followed by "n", which occurs 16 times. "u", "n" is merged to "un" and added to the vocabulary. The next most frequent symbol pair is "h" followed by "ug", occurring 15 times. Again the pair is merged, and "hug" can be added to the vocabulary.
At this stage, the vocabulary is ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"] and our set of unique words is represented as:
("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
Assuming that the Byte-Pair Encoding training would stop at this point, the learned merge rules would then be applied to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance, the word "bug" would be tokenized to ["b", "ug"], but "mug" would be tokenized as ["<unk>", "ug"] since the symbol "m" is not in the base vocabulary. In general, single letters such as "m" are not replaced by the "<unk>" symbol because the training data usually includes at least one occurrence of each letter, but it is likely to happen for very special characters like emojis.
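Applying the learned merges to an unseen word could then be sketched like this (again with illustrative names):

merges = [("u", "g"), ("u", "n"), ("h", "ug")]  # learned merge rules, in order
base_vocab = {"b", "g", "h", "n", "p", "s", "u"}

def bpe_tokenize(word):
    # Unknown base characters become "<unk>", then merges are applied in order.
    symbols = [c if c in base_vocab else "<unk>" for c in word]
    for a, b in merges:
        i, out = 0, []
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(bpe_tokenize("bug"))  # ['b', 'ug']
print(bpe_tokenize("mug"))  # ['<unk>', 'ug']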
As mentioned before, the vocabulary size, i.e. the base vocabulary size + the number of merges, is a hyperparameter to choose. For instance, GPT has a vocabulary size of 40,478 since they have 478 base characters and chose to stop training after 40,000 merges.
Byte-level BPE
A base vocabulary that includes all possible base characters can be quite large if e.g. all Unicode characters are considered as base characters. To have a better base vocabulary, GPT-2 uses bytes as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that every base character is included in the vocabulary. With some additional rules to deal with punctuation, the GPT-2 tokenizer can tokenize every text without the need for the <unk> symbol. GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 byte-level base tokens, a special end-of-text token, and the symbols learned with 50,000 merges.
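We can check the vocabulary size directly:

>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> tokenizer.vocab_size
50257

Tokenizing text that contains an emoji yields byte-level pieces that look garbled when printed, but never the <unk> token.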
WordPiece
WordPiece is the subword tokenization algorithm used for BERT, DistilBERT, and Electra. The algorithm was outlined in Japanese and Korean Voice Search (Schuster et al., 2012) and is very similar to BPE. In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is equivalent to finding the symbol pair whose probability, divided by the probabilities of its first symbol followed by its second symbol, is the greatest among all symbol pairs. E.g., "u" followed by "g" would only have been merged if the probability of "ug" divided by the probabilities of "u" and "g" had been greater than for any other symbol pair. Intuitively, WordPiece is slightly different from BPE in that it evaluates what it loses by merging two symbols, to make sure it's worth it.
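Written as a formula, the merged pair \((a, b)\) is the one maximizing

$$\text{score}(a, b) = \frac{P(ab)}{P(a)\,P(b)}$$

where the probabilities are estimated from the frequencies of the symbols in the current training data.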
Unigram
Unigram is a subword tokenization algorithm introduced in Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018). In contrast to BPE or WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims down each symbol to obtain a smaller vocabulary. The base vocabulary could for instance correspond to all pre-tokenized words and the most common substrings. Unigram is not used directly for any of the models in transformers, but it is used in conjunction with SentencePiece.
At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm computes how much the overall loss would increase if the symbol were removed from the vocabulary. Unigram then removes p (with p usually being 10% or 20%) percent of the symbols whose loss increase is the lowest, i.e. those symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized.
Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of tokenizing new text after training. As an example, if a trained Unigram tokenizer exhibits the vocabulary

["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"],

"hugs" could be tokenized both as ["hug", "s"], ["h", "ug", "s"], or ["h", "u", "g", "s"]. So which one to choose? Unigram saves the probability of each token in the training corpus on top of saving the vocabulary, so that the probability of each possible tokenization can be computed after training. In practice, the algorithm simply picks the most likely tokenization, but it also offers the possibility to sample a possible tokenization according to their probabilities.
Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of the words \(x_{1}, \dots, x_{N}\) and that the set of all possible tokenizations for a word \(x_{i}\) is defined as \(S(x_{i})\), then the overall loss is defined as:

$$\mathcal{L} = -\sum_{i=1}^{N} \log \left( \sum_{x \in S(x_{i})} p(x) \right)$$
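Under the unigram model, the probability of one tokenization is simply the product of its tokens' probabilities. A toy sketch with made-up probabilities for the vocabulary above:

import math

# Hypothetical unigram probabilities for the toy vocabulary.
probs = {"h": 0.05, "u": 0.04, "g": 0.03, "s": 0.06, "ug": 0.15, "hug": 0.20}

def log_prob(tokenization):
    # log p(x) for one candidate tokenization of a word
    return sum(math.log(probs[token]) for token in tokenization)

candidates = [["hug", "s"], ["h", "ug", "s"], ["h", "u", "g", "s"]]
print(max(candidates, key=log_prob))  # the most likely tokenization: ['hug', 's']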
SentencePiece
All tokenization algorithms described so far have the same problem: they assume that the input text uses spaces to separate words. However, not all languages use spaces to separate words. One possible solution is to use language-specific pre-tokenizers, e.g. XLM uses specific Chinese, Japanese, and Thai pre-tokenizers. To solve this problem more generally, SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018) treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram algorithm to construct the appropriate vocabulary.
The XLNetTokenizer uses SentencePiece, for example, which is also why the "▁" character was included in the vocabulary in the example above. Decoding with SentencePiece is very easy, since all tokens can just be concatenated and "▁" is replaced by a space.
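A sketch of that decoding step:

>>> tokens = ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?"]
>>> "".join(tokens).replace("▁", " ").strip()
"Don't you love 🤗 Transformers?"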
All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models using SentencePiece are ALBERT, XLNet, Marian, and T5.