---
license: cc-by-nc-4.0
pipeline_tag: fill-mask
widget:
  - text: >-
      The PDF contains an action object. Upon a victim opening the PDF it will send a query to Google: http://www[.]google[.]com/url?q=http%3A%2F%2F9348243249382479234343284324023432748892349702394023.xyz&sa=D&sntz=1&usg=AFQjCNFWmVffgSGlrrv-2U9sSOJYzfUQqw. This link is a typical <mask> attack.
tags:
  - cybersecurity
---

# CyBERTuned

CyBERTuned is a BERT-like model trained with a non-linguistic element (NLE) aware pretraining method tuned for the cybersecurity domain.

## Sample Usage

```python
>>> from transformers import pipeline

>>> folder_dir = "CyBERTuned"
>>> unmasker = pipeline('fill-mask', model=folder_dir)
>>> unmasker("RagnarLocker, LockBit, and REvil are types of <mask>.")
[{'score': 0.8489783406257629, 'token': 25346, 'token_str': ' ransomware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of ransomware.'},
 {'score': 0.1364559829235077, 'token': 16886, 'token_str': ' malware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of malware.'},
 {'score': 0.0022238395176827908, 'token': 1912, 'token_str': ' attacks', 'sequence': 'RagnarLocker, LockBit, and REvil are types of attacks.'},
 {'score': 0.001197474543005228, 'token': 11341, 'token_str': ' infections', 'sequence': 'RagnarLocker, LockBit, and REvil are types of infections.'},
 {'score': 0.0009669850114732981, 'token': 6773, 'token_str': ' files', 'sequence': 'RagnarLocker, LockBit, and REvil are types of files.'}]

>>> # text requiring url comprehension (redirection attack), modified from https://intezer.com/blog/research/targeted-phishing-attack-against-ukrainian-government-expands-to-georgia/
>>> url_text = 'The PDF contains an action object. Upon a victim opening the PDF it will send a query to Google: http://www[.]google[.]com/url?q=http%3A%2F%2F9348243249382479234343284324023432748892349702394023.xyz&sa=D&sntz=1&usg=AFQjCNFWmVffgSGlrrv-2U9sSOJYzfUQqw. This link is a typical <mask> attack.'
>>> unmasker(url_text)[0]
{'score': 0.1701660305261612, 'token': 30970, 'token_str': ' redirect', 'sequence': 'The PDF contains an action object. Upon a victim opening the PDF it will send a query to Google: http://www[.]google[.]com/url?q=http%3A%2F%2F9348243249382479234343284324023432748892349702394023.xyz&sa=D&sntz=1&usg=AFQjCNFWmVffgSGlrrv-2U9sSOJYzfUQqw. This link is a typical redirect attack.'}

>>> # extract contextual embeddings from the base model
>>> from transformers import AutoModel, AutoTokenizer

>>> model = AutoModel.from_pretrained(folder_dir)
>>> tokenizer = AutoTokenizer.from_pretrained(folder_dir)
>>> text = "Cybersecurity information is often technically complex and relayed through unstructured text, making automation of cyber threat intelligence highly challenging."
>>> encoded = tokenizer(text, return_tensors="pt")
>>> output = model(**encoded)
>>> output[0].shape
torch.Size([1, 27, 768])
```

# Citation

If you use CyBERTuned, please cite the following paper:

```
Eugene Jang, Jian Cui, Dayeon Yim, Youngjin Jin, Jin-Woo Chung, Seungwon Shin, and Yongjae Lee. 2024. Ignore Me But Don’t Replace Me: Utilizing Non-Linguistic Elements for Pretraining on the Cybersecurity Domain. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 29–42, Mexico City, Mexico. Association for Computational Linguistics.
```

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 0.0006
- train_batch_size: 64
- eval_batch_size: 32
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 8
- total_train_batch_size: 2048
- total_eval_batch_size: 128
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.048
- num_epochs: 200

### Framework versions

- Transformers 4.27.0.dev0
- Pytorch 1.12.1
- Datasets 2.6.1
- Tokenizers 0.13.2
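
The total batch sizes reported under Training hyperparameters are consistent with the per-device values, the device count, and the gradient accumulation steps. The sketch below checks that arithmetic, assuming the usual product rule for multi-GPU training with gradient accumulation; the variable names are illustrative and not taken from the training code.

```python
# Values copied from the hyperparameter list; variable names are illustrative.
per_device_train_batch_size = 64
per_device_eval_batch_size = 32
num_devices = 4
gradient_accumulation_steps = 8

# Assumed rule: each optimizer step sees every device's batch,
# accumulated over several forward/backward passes.
total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)

# Evaluation does not accumulate gradients, so only the device count multiplies.
total_eval_batch_size = per_device_eval_batch_size * num_devices

print(total_train_batch_size)  # 2048
print(total_eval_batch_size)   # 128
```

When reproducing training on a different number of GPUs, adjusting `gradient_accumulation_steps` so that the product stays at 2048 would keep the effective training batch size unchanged.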