Browsing by Author "Bruse, Jan Lukas"
Now showing 1 - 7 of 7
Results Per Page
Sort Options
Publication Open Access Attacking bitcoin anonymity: generative adversarial networks for improving bitcoin entity classification(Springer, 2022) Zola, Francesco; Segurola-Gil, Lander; Bruse, Jan Lukas; Galar Idoate, Mikel; Orduna Urrutia, Raúl; Institute of Smart Cities - ISCClassification of Bitcoin entities is an important task to help Law Enforcement Agencies reduce anonymity in the Bitcoin blockchain network and to detect classes more tied to illegal activities. However, this task is strongly conditioned by a severe class imbalance in Bitcoin datasets. Existing approaches for addressing the class imbalance problem can be improved considering generative adversarial networks (GANs) that can boost data diversity. However, GANs are mainly applied in computer vision and natural language processing tasks, but not in Bitcoin entity behaviour classification where they may be useful for learning and generating synthetic behaviours. Therefore, in this work, we present a novel approach to address the class imbalance in Bitcoin entity classification by applying GANs. In particular, three GAN architectures were implemented and compared in order to find the most suitable architecture for generating Bitcoin entity behaviours. More specifically, GANs were used to address the Bitcoin imbalance problem by generating synthetic data of the less represented classes before training the final entity classifier. The results were used to evaluate the capabilities of the different GAN architectures in terms of training time, performance, repeatability, and computational costs. Finally, the results achieved by the proposed GAN-based resampling were compared with those obtained using five well-known data-level preprocessing techniques. Models trained with data resampled with our GAN-based approach achieved the highest accuracy improvements and were among the best in terms of precision, recall and f1-score. Together with Random Oversampling (ROS), GANs proved to be strong contenders in addressing Bitcoin class imbalance and consequently in reducing Bitcoin entity anonymity (overall and per-class classification performance). To the best of our knowledge, this is the first work to explore the advantages and limitations of GANs in generating specific Bitcoin data and “attacking” Bitcoin anonymity. The proposed methods ultimately demonstrate that in Bitcoin applications, GANs are indeed able to learn the data distribution and generate new samples starting from a very limited class representation, which leads to better detection of classes related to illegal activities.Publication Open Access Behavioral analysis in cybersecurity using machine learning: a study based on graph representation, class imbalance and temporal dissection(2022) Zola, Francesco; Galar Idoate, Mikel; Bruse, Jan Lukas; Estadística, Informática y Matemáticas; Estatistika, Informatika eta MatematikaThe main goal of this thesis is to improve behavioral cybersecurity analysis using machine learning, exploiting graph structures, temporal dissection, and addressing imbalance problems.This main objective is divided into four specific goals: OBJ1: To study the influence of the temporal resolution on highlighting micro-dynamics in the entity behavior classification problem. In real use cases, time-series information could be not enough for describing the entity behavior classification. For this reason, we plan to exploit graph structures for integrating both structured and unstructured data in a representation of entities and their relationships. In this way, it will be possible to appreciate not only the single temporal communication but the whole behavior of these entities. Nevertheless, entity behaviors evolve over time and therefore, a static graph may not be enoughto describe all these changes. For this reason, we propose to use a temporal dissection for creating temporal subgraphs and therefore, analyze the influence of the temporal resolution on the graph creation and the entity behaviors within. Furthermore, we propose to study how the temporal granularity should be used for highlighting network micro-dynamics and short-term behavioral changes which can be a hint of suspicious activities. OBJ2: To develop novel sampling methods that work with disconnected graphs for addressing imbalanced problems avoiding component topology changes. Graph imbalance problem is a very common and challenging task and traditional graph sampling techniques that work directly on these structures cannot be used without modifying the graph’s intrinsic information or introducing bias. Furthermore, existing techniques have shown to be limited when disconnected graphs are used. For this reason, novel resampling methods for balancing the number of nodes that can be directly applied over disconnected graphs, without altering component topologies, need to be introduced. In particular, we propose to take advantage of the existence of disconnected graphs to detect and replicate the most relevant graph components without changing their topology, while considering traditional data-level strategies for handling the entity behaviors within. OBJ3: To study the usefulness of the generative adversarial networks for addressing the class imbalance problem in cybersecurity applications. Although traditional data-level pre-processing techniques have shown to be effective for addressing class imbalance problems, they have also shown downside effects when highly variable datasets are used, as it happens in cybersecurity. For this reason, new techniques that can exploit the overall data distribution for learning highly variable behaviors should be investigated. In this sense, GANs have shown promising results in the image and video domain, however, their extension to tabular data is not trivial. For this reason, we propose to adapt GANs for working with cybersecurity data and exploit their ability in learning and reproducing the input distribution for addressing the class imbalance problem (as an oversampling technique). Furthermore, since it is not possible to find a unique GAN solution that works for every scenario, we propose to study several GAN architectures with several training configurations to detect which is the best option for a cybersecurity application. OBJ4: To analyze temporal data trends and performance drift for enhancing cyber threat analysis. Temporal dynamics and incoming new data can affect the quality of the predictions compromising the model reliability. This phenomenon makes models get outdated without noticing. In this sense, it is very important to be able to extract more insightful information from the application domain analyzing data trends, learning processes, and performance drifts over time. For this reason, we propose to develop a systematic approach for analyzing how the data quality and their amount affect the learning process. Moreover, in the contextof CTI, we propose to study the relations between temporal performance drifts and the input data distribution for detecting possible model limitations, enhancing cyber threat analysis.Publication Open Access Bitcoin and cybersecurity: temporal dissection of blockchain data to unveil changes in entity behavioral patterns(MDPI, 2019) Zola, Francesco; Bruse, Jan Lukas; Eguimendia, María; Galar Idoate, Mikel; Orduna Urrutia, Raúl; Institute of Smart Cities - ISCThe Bitcoin network not only is vulnerable to cyber-attacks but currently represents the most frequently used cryptocurrency for concealing illicit activities. Typically, Bitcoin activity is monitored by decreasing anonymity of its entities using machine learning-based techniques, which consider the whole blockchain. This entails two issues: first, it increases the complexity of the analysis requiring higher efforts and, second, it may hide network micro-dynamics important for detecting short-term changes in entity behavioral patterns. The aim of this paper is to address both issues by performing a 'temporal dissection' of the Bitcoin blockchain, i.e., dividing it into smaller temporal batches to achieve entity classification. The idea is that a machine learning model trained on a certain time-interval (batch) should achieve good classification performance when tested on another batch if entity behavioral patterns are similar. We apply cascading machine learning principles'a type of ensemble learning applying stacking techniques'introducing a 'k-fold cross-testing' concept across batches of varying size. Results show that blockchain batch size used for entity classification could be reduced for certain classes (Exchange, Gambling, and eWallet) as classification rates did not vary significantly with batch size; suggesting that behavioral patterns did not change significantly over time. Mixer and Market class detection, however, can be negatively affected. A deeper analysis of Mining Pool behavior showed that models trained on recent data perform better than models trained on older data, suggesting that 'typical' Mining Pool behavior may be represented better by recent data. This work provides a first step towards uncovering entity behavioral changes via temporal dissection of blockchain data.Publication Open Access Generative adversarial networks for bitcoin data augmentation(IEEE, 2020) Zola, Francesco; Bruse, Jan Lukas; Etxeberria Barrio, Xabier; Galar Idoate, Mikel; Orduna Urrutia, Raúl; Institute of Smart Cities - ISCIn Bitcoin entity classification, results are strongly conditioned by the ground-truth dataset, especially when applying supervised machine learning approaches. However, these ground-truth datasets are frequently affected by significant class imbalance as generally they contain much more information regarding legal services (Exchange, Gambling), than regarding services that may be related to illicit activities (Mixer, Service). Class imbalance increases the complexity of applying machine learning techniques and reduces the quality of classification results, especially for underrepresented, but critical classes.In this paper, we propose to address this problem by using Generative Adversarial Networks (GANs) for Bitcoin data augmentation as GANs recently have shown promising results in the domain of image classification. However, there is no 'one-fits-all' GAN solution that works for every scenario. In fact, setting GAN training parameters is non-trivial and heavily affects the quality of the generated synthetic data. We therefore evaluate how GAN parameters such as the optimization function, the size of the dataset and the chosen batch size affect GAN implementation for one underrepresented entity class (Mining Pool) and demonstrate how a 'good' GAN configuration can be obtained that achieves high similarity between synthetically generated and real Bitcoin address data. To the best of our knowledge, this is the first study presenting GANs as a valid tool for generating synthetic address data for data augmentation in Bitcoin entity classification.Publication Open Access Network traffic analysis through node behaviour classification: a graph-based approach with temporal dissection and data-level preprocessing(Elsevier, 2022) Zola, Francesco; Segurola-Gil, Lander; Bruse, Jan Lukas; Galar Idoate, Mikel; Orduna Urrutia, Raúl; Institute of Smart Cities - ISCNetwork traffic analysis is an important cybersecurity task, which helps to classify anomalous, potentially dangerous connections. In many cases, it is critical not only to detect individual malicious connections, but to detect which node in a network has generated malicious traffic so that appropriate actions can be taken to reduce the threat and increase the system's cybersecurity. Instead of analysing connections only, node behavioural analysis can be performed by exploiting the graph information encoded in a connection network. Network traffic, however, is temporal data and extracting graph information without a fixed time scope may only unveil macro-dynamics that are less related to cybersecurity threats. To address these issues, a threefold approach is proposed here: firstly, temporal dissection for extracting graph-based information is applied. As the resulting graphs are typically affected by class imbalance (i.e. malicious nodes are under-represented), two novel graph data-level preprocessing techniques - R-hybrid and SM-hybrid - are introduced, which focus on exploiting the most relevant graph substructures. Finally, a Neural Network (NN) and two Graph Convolutional Network (GCN) approaches are compared when performing node behaviour classification. Furthermore, we compare the node classification performance of these supervised models with traditional unsupervised anomaly detection techniques. Results show that temporal dissection parameters affected classification performance, while the data-level preprocessing strategies reduced class imbalance and led to improved supervised node behaviour classification, outperforming anomaly detection models. In particular, Neural Network (NN) outperformed Graph Convolutional Network (GCN) approaches for two attack families and was less affected by class imbalance, yet one GCN performed best overall. The presented study successfully applies a temporal graph-based approach for malicious actor detection in network traffic data.Publication Open Access Temporal analysis of distribution shifts in malware classification for digital forensics(IEEE, 2023) Zola, Francesco; Bruse, Jan Lukas; Galar Idoate, Mikel; Institute of Smart Cities - ISCIn recent years, malware diversity and complexity have increased substantially, so the detection and classification of malware families have become one of the key objectives of information security. Machine learning (ML)-based approaches have been proposed to tackle this problem. However, most of these approaches focus on achieving high classification performance scores in static scenarios, without taking into account a key feature of malware: it is constantly evolving. This leads to ML models being outdated and performing poorly after only a few months, leaving stakeholders exposed to potential security risks. With this work, our aim is to highlight the issues that may arise when applying ML-based classification to malware data. We propose a three-step approach to carry out a forensics exploration of model failures. In particular, in the first step, we evaluate and compare the concept drift generated by models trained using a rolling windows approach for selecting the training dataset. In the second step, we evaluate model drift based on the amount of temporal information used in the training dataset. Finally, we perform an in-depth misclassification and feature analysis to emphasize the interpretation of the results and to highlight drift causes. We conclude that caution is warranted when training ML models for malware analysis, as concept drift and clear performance drops were observed even for models trained on larger datasets. Based on our results, it may be more beneficial to train models on fewer but recent data and re-train them after a few months in order to maintain performance.Publication Open Access Verification system based on long-range iris and Graph Siamese Neural Networks(ACM, 2022) Zola, Francesco; Fernandez-Carrasco, José Álvaro; Bruse, Jan Lukas; Galar Idoate, Mikel; Geradts, Zeno; Estadística, Informática y Matemáticas; Estatistika, Informatika eta Matematika; Institute of Smart Cities - ISCBiometric systems represent valid solutions in tasks like user authentication and verification, since they are able to analyze physical and behavioural features with high precision. However, especially when physical biometrics are used, as is the case of iris recognition, they require specific hardware such as retina scanners, sensors, or HD cameras to achieve relevant results. At the same time, they require the users to be very close to the camera to extract high-resolution information. For this reason, in this work, we propose a novel approach that uses long-range (LR) distance images for implementing an iris verification system. More specifically, we present a novel methodology for converting LR iris images into graphs and then use Graph Siamese Neural Networks (GSNN) to predict whether two graphs belong to the same person. In this study, we not only describe this methodology but also evaluate how the spectral components of these images can be used for improving the graph extraction and the final classification task. Results demonstrate the suitability of this approach, encouraging the community to explore graph application in biometric systems.