Behavioral analysis in cybersecurity using machine learning: a study based on graph representation, class imbalance and temporal dissection

Zola, Francesco

Files

Tesis Zola, Francesco.pdf (6.21 MB)

Date

2022

Authors

Zola, Francesco

Director

Galar Idoate, Mikel
Bruse, Jan Lukas

Open access

Doctoral thesis

Abstract

The main goal of this thesis is to improve behavioral cybersecurity analysis using machine learning by exploiting graph structures, applying temporal dissection, and addressing class imbalance problems. This main objective is divided into four specific goals.

OBJ1: To study the influence of the temporal resolution on highlighting micro-dynamics in the entity behavior classification problem. In real use cases, time-series information may not be enough to describe entity behavior. For this reason, we plan to exploit graph structures to integrate both structured and unstructured data in a representation of entities and their relationships. In this way, it will be possible to appreciate not only single temporal communications but the whole behavior of these entities. Nevertheless, entity behaviors evolve over time, and therefore a static graph may not be enough to describe all these changes. For this reason, we propose to use temporal dissection to create temporal subgraphs and thereby analyze the influence of the temporal resolution on the graph creation and on the entity behaviors within. Furthermore, we propose to study how the temporal granularity should be used to highlight network micro-dynamics and short-term behavioral changes, which can be a hint of suspicious activities.

OBJ2: To develop novel sampling methods that work with disconnected graphs for addressing imbalanced problems while avoiding component topology changes. Graph imbalance is a very common and challenging problem, and traditional graph sampling techniques that work directly on these structures cannot be used without modifying the graph’s intrinsic information or introducing bias. Furthermore, existing techniques have been shown to be limited when disconnected graphs are used. For this reason, novel resampling methods for balancing the number of nodes, which can be applied directly to disconnected graphs without altering component topologies, need to be introduced. In particular, we propose to take advantage of the existence of disconnected graphs to detect and replicate the most relevant graph components without changing their topology, while considering traditional data-level strategies for handling the entity behaviors within.

OBJ3: To study the usefulness of generative adversarial networks (GANs) for addressing the class imbalance problem in cybersecurity applications. Although traditional data-level pre-processing techniques have been shown to be effective for addressing class imbalance problems, they have also shown drawbacks when highly variable datasets are used, as happens in cybersecurity. For this reason, new techniques that can exploit the overall data distribution for learning highly variable behaviors should be investigated. In this sense, GANs have shown promising results in the image and video domains; however, their extension to tabular data is not trivial. For this reason, we propose to adapt GANs to work with cybersecurity data and to exploit their ability to learn and reproduce the input distribution for addressing the class imbalance problem (as an oversampling technique). Furthermore, since it is not possible to find a unique GAN solution that works for every scenario, we propose to study several GAN architectures with several training configurations to detect which is the best option for a cybersecurity application.

OBJ4: To analyze temporal data trends and performance drift for enhancing cyber threat analysis. Temporal dynamics and incoming new data can affect the quality of the predictions, compromising model reliability. This phenomenon causes models to become outdated without anyone noticing. In this sense, it is very important to be able to extract more insightful information from the application domain by analyzing data trends, learning processes, and performance drifts over time. For this reason, we propose to develop a systematic approach for analyzing how data quality and quantity affect the learning process. Moreover, in the context of cyber threat intelligence (CTI), we propose to study the relations between temporal performance drifts and the input data distribution for detecting possible model limitations, enhancing cyber threat analysis.
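
As an illustration of the temporal dissection idea in OBJ1, the following is a minimal sketch and not the method used in the thesis. It assumes a hypothetical input format of (source, target, timestamp) tuples and a hypothetical helper named temporal_subgraphs, and it simply groups edges into fixed-width time windows so that each window yields its own subgraph. Changing the window width changes how many subgraphs appear and which connections each one contains.

```python
from collections import defaultdict


def temporal_subgraphs(edges, window):
    """Split timestamped edges into one adjacency map per time window.

    edges  : iterable of (source, target, timestamp) tuples (assumed format)
    window : width of each time window, in the same unit as the timestamps
    """
    if window <= 0:
        raise ValueError("window must be positive")
    graphs = defaultdict(lambda: defaultdict(set))
    for src, dst, ts in edges:
        bucket = int(ts // window)  # index of the window this edge falls in
        graphs[bucket][src].add(dst)
    # Return plain dicts, ordered by window index, for easier inspection
    return {
        bucket: {node: set(nbrs) for node, nbrs in adj.items()}
        for bucket, adj in sorted(graphs.items())
    }


# Toy data (invented for illustration): timestamps in seconds
edges = [("a", "b", 5), ("a", "c", 12), ("b", "c", 14), ("a", "b", 31)]

coarse = temporal_subgraphs(edges, window=60)  # all edges share one window
fine = temporal_subgraphs(edges, window=10)    # edges spread over several windows
```

With the toy data, the 60-second window merges every edge into a single graph, while the 10-second window separates the early, middle, and late communications into distinct subgraphs. This is the kind of resolution effect the objective sets out to study.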

Keywords

Machine learning, Cybersecurity, Graph representation, Class imbalance, Temporal dissection

Department

Estadística, Informática y Matemáticas / Estatistika, Informatika eta Matematika

Doctorate program

Programa de Doctorado en Ciencias y Tecnologías Industriales (RD 99/2011)
Industria Zientzietako eta Teknologietako Doktoretza Programa (ED 99/2011)

URI

https://academica-e.unavarra.es/handle/2454/44259

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Collections

Tesis doctorales DEIM - EIMS Doktoretza tesiak
Tesis doctorales - Doktoretza tesiak
