SecureBERT: A Domain-Specific Language Model for Cybersecurity

Ehsan Aghaei*, Xi Niu, Waseem Shadid, Ehab Al-Shaer

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

89 Scopus citations

Abstract

Natural Language Processing (NLP) has recently gained wide attention in cybersecurity, particularly in Cyber Threat Intelligence (CTI) and cyber automation. Increased connectivity and automation have revolutionized the world’s economic and cultural infrastructures, but they have also introduced risks in the form of cyber attacks. CTI is information that helps cybersecurity analysts make intelligent security decisions. It is often delivered as natural language text, which must be transformed into a machine-readable format through an automated procedure before it can be used for automated security measures. This paper proposes SecureBERT, a cybersecurity language model capable of capturing text connotations in cybersecurity text (e.g., CTI) and therefore able to automate many critical cybersecurity tasks that would otherwise rely on human expertise and time-consuming manual effort. SecureBERT has been trained on a large corpus of cybersecurity text. To make SecureBERT effective not only in retaining general English understanding but also when applied to text with cybersecurity implications, we developed a customized tokenizer as well as a method to alter pre-trained weights. SecureBERT is evaluated using the standard Masked Language Model (MLM) test as well as two additional standard NLP tasks. Our evaluation shows that SecureBERT outperforms existing similar models, confirming its capability for solving crucial NLP tasks in cybersecurity.

Original language: English
Title of host publication: Security and Privacy in Communication Networks - 18th EAI International Conference, SecureComm 2022, Proceedings
Editors: Fengjun Li, Kaitai Liang, Zhiqiang Lin, Sokratis K. Katsikas
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 39-56
Number of pages: 18
ISBN (Print): 9783031255373
State: Published - 2023
Externally published: Yes
Event: 18th EAI International Conference on Security and Privacy in Communication Networks, SecureComm 2022 - Virtual, Online
Duration: 17 Oct 2022 to 19 Oct 2022

Publication series

Name: Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST
Volume: 462 LNICST
ISSN (Print): 1867-8211
ISSN (Electronic): 1867-822X

Conference

Conference: 18th EAI International Conference on Security and Privacy in Communication Networks, SecureComm 2022
City: Virtual, Online
Period: 17/10/22 to 19/10/22

Bibliographical note

Publisher Copyright:
© 2023, ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering.

Keywords

  • Cyber automation
  • Cyber threat intelligence
  • Language model

ASJC Scopus subject areas

  • Computer Networks and Communications
