Comparativa de técnicas de extracción de características de texto para la detección de Fake News

Amatriain García, Alejandro

dc.contributor.advisorTFE	Sesma Sara, Mikel
dc.contributor.affiliation	Escuela Técnica Superior de Ingeniería Agronómica y Biociencias	es_ES
dc.contributor.affiliation	Nekazaritzako Ingeniaritzako eta Biozientzietako Goi Mailako Eskola Teknikoa	eu
dc.contributor.author	Amatriain García, Alejandro
dc.date.accessioned	2022-10-04T05:47:16Z
dc.date.available	2022-10-04T05:47:16Z
dc.date.issued	2022
dc.date.updated	2022-10-03T09:24:15Z
dc.description.abstract	En la actualidad, la divulgación de noticias falsas es una estrategia de manipulación a la sociedad, que repercute en la visión y opinión que esta se forja respecto a diversos temas. En el ámbito del procesamiento del lenguaje natural el proceso de distinción entre una noticia real y una falsa comenzaría por definir una estructura de cómo se caracteriza el contenido de un texto, para después, conforme a esa estructura, encontrar los patrones y/o tendencias que se dan en un tipo de noticia falsa y una real (tendencias que serán detectadas a partir de aplicar dicha estructura a un conjunto de fake news y a otro conjunto de noticias verdaderas que disponemos, conjuntos que llamaremos de entrenamiento o train). A estas estructuras se les llama Técnicas de extracción de características. Forman parte del proceso de aprendizaje automático. El trabajo consistirá en aplicar y comprobar la eficiencia de varias de ellas. En el presente trabajo abordaremos cuatro técnicas: Bag of Words TF-IDF, GloVe, Word2vec y FastText. La primera (TF-IDF) se trata de una técnica supervisada ya que se comienza por definir un vocabulario. Su forma de caracterizar un texto es dar pesos a los términos presentes en el texto que forman parte del vocabulario. Las otras tres pertenecen a la familia de Representación vectorial de palabras. Grosso modo, consisten en aprender un vector numérico para cada término del vocabulario para después agregar todos los vectores de los términos presentes. Dichos métodos aprenden por sí mismos el vector de características de cada palabra basándose en la idea de que los vectores de características de dos términos que aparecen juntos con mucha frecuencia, deben ser parecidos. Con estos métodos, el programador debe especificar únicamente el número de componentes del vector de características. Tras el proceso de entrenamiento, desconoceremos qué representa cada componente del vector, al contrario que con Bag of Words.	es_ES
dc.description.abstract	Nowadays, the dissemination of fake news is a strategy for manipulating society, one that affects the views and opinions people form on various issues. In the field of natural language processing, distinguishing between real and fake news begins by defining a structure that characterizes the content of a text and then, according to that structure, finding the patterns and/or trends that occur in fake news and in real news (trends detected by applying that structure to a set of fake news items and to a set of true news items that we have available, sets that we will call training sets, or “train”). These structures are called feature extraction techniques, and they are part of the machine learning process. This work applies several of them and checks their efficiency. It covers four techniques: Bag of Words with TF-IDF, GloVe, Word2vec and FastText. The first (TF-IDF) is a supervised technique, since it begins by defining a vocabulary. It characterizes a text by assigning weights to the terms in the text that belong to the vocabulary. The other three belong to the family of word vector representations. Roughly speaking, they learn a numerical vector for each vocabulary term and then aggregate the vectors of all the terms present in the text. These methods learn the feature vector of each word by themselves, based on the idea that the feature vectors of two terms that appear together very frequently should be similar. With these methods, the programmer only needs to specify the number of components of the feature vector. After training, and unlike with Bag of Words, we will not know what each component of the vector represents.	en
dc.description.degree	Graduado o Graduada en Ciencia de Datos por la Universidad Pública de Navarra	es_ES
dc.description.degree	Datu Zientzietan Graduatua Nafarroako Unibertsitate Publikoan	eu
dc.format.mimetype	application/pdf	en
dc.identifier.uri	https://academica-e.unavarra.es/handle/2454/44153
dc.language.iso	spa	en
dc.rights.accessRights	info:eu-repo/semantics/openAccess
dc.subject	Fake news	es_ES
dc.subject	Aprendizaje automático	es_ES
dc.subject	Extracción de características	es_ES
dc.subject	Clasificación	es_ES
dc.subject	NLP	es_ES
dc.subject	Fake news	en
dc.subject	Machine learning	en
dc.subject	Feature extraction	en
dc.subject	Classification	en
dc.subject	NLP	en
dc.title	Comparativa de técnicas de extracción de características de texto para la detección de Fake News	es_ES
dc.type	info:eu-repo/semantics/bachelorThesis
dspace.entity.type	Publication
relation.isAdvisorTFEOfPublication	3a541442-8e82-49d5-903d-60e0aedbc1f6
relation.isAdvisorTFEOfPublication.latestForDiscovery	3a541442-8e82-49d5-903d-60e0aedbc1f6
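
The two families of techniques described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names `tfidf_vectors` and `mean_embedding`, the toy documents, the smoothed IDF formula and the hand-written two-dimensional embeddings are all assumptions made for illustration. The abstract says only that the word vectors are aggregated, so averaging is likewise an assumed choice. In practice the embeddings would be learned by GloVe, Word2vec or FastText rather than written by hand.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Weight each vocabulary term in each tokenized document by tf * idf."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for d in docs if t in d) for t in vocabulary}
    # Smoothed idf (an assumed variant): log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d) or 1  # avoid dividing by zero on an empty document
        # One component per vocabulary term, so each component is interpretable.
        vectors.append([counts[t] / total * idf[t] for t in vocabulary])
    return vectors


def mean_embedding(tokens, embeddings, dim):
    """Aggregate a text by averaging the vectors of its known tokens."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy corpus (hypothetical), already tokenized.
docs = [
    "aliens endorse candidate".split(),
    "parliament approves budget".split(),
    "aliens approve budget".split(),
]
vocabulary = sorted({t for d in docs for t in d})
tfidf = tfidf_vectors(docs, vocabulary)

# Hand-written 2-dimensional vectors standing in for learned embeddings.
embeddings = {
    "aliens": [1.0, 0.0],
    "budget": [0.0, 1.0],
    "candidate": [1.0, 1.0],
}
doc_vector = mean_embedding(docs[2], embeddings, dim=2)
```

In the TF-IDF vectors every position corresponds to a named vocabulary term, whereas the components of `doc_vector` carry no individual meaning, which is the contrast the abstract draws between Bag of Words and the embedding-based methods.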

Files

Original bundle

Name: TFG_AlejandroAmatriainGarcia.pdf
Size: 1.76 MB
Format: Adobe Portable Document Format

Collections

Trabajos Fin de Grado ETSIA - NIGMET Gradu Amaierako Lanak
Trabajos Fin de Grado - Gradu Amaierako Lanak