Machine Learning Approaches in Bioinformatics: Advances in Transcription and Protein Fitness Prediction

Barbero Aparicio, José Antonio

doi:10.36443/10259/9060

dc.contributor.advisor	García Osorio, César
dc.contributor.advisor	Diez Pastor, José Francisco
dc.contributor.author	Barbero Aparicio, José Antonio
dc.contributor.other	Universidad de Burgos. Departamento de Ingeniería Informática
dc.date.accessioned	2024-04-26T10:33:24Z
dc.date.available	2024-04-26T10:33:24Z
dc.date.issued	2023
dc.date.submitted	2023-12-01
dc.identifier.uri	http://hdl.handle.net/10259/9060
dc.description.abstract	As we move deeper into the information age, bioinformatics has become increasingly important in modern biology, largely due to its critical role in processing and analyzing the vast amounts of complex data generated in the field. Traditional methods are often overwhelmed by the large volume and complexity of this data, positioning machine learning techniques as an optimal solution. Exploring the intersection of machine learning and bioinformatics offers numerous opportunities to develop and improve computational tools designed to handle and gain insights from these vast datasets. The main objective of this thesis is to develop a thorough exploration of the possibilities of machine learning in the field of bioinformatics, with a particular focus on specific problems such as transcription start and protein fitness prediction. Furthermore, given the similarities between bioinformatics sequence data and the natural language processing domain, the research emphasizes the use of sequence-based methods. Our research has resulted in several contributions to the field in the form of three scientific papers. The first two focus on transcription start prediction. In the first, we discovered that the integration of biophysical simulations in conjunction with the DNA sequence can improve the results of machine learning methods. Additionally, in our second paper we concluded that, while support vector machines have been a traditional choice for transcription start prediction, our research suggests that deep learning methods outperform them, marking a paradigm shift in the field. In addition, we presented custom-built datasets using Ensembl data, providing a valuable resource for future studies. 
The third paper addresses protein fitness prediction, specifically in scenarios where data are scarce, and concludes that deep transfer learning methods are the best alternative when compared with other strategies well suited to such situations, such as semi-supervised learning.	en
dc.description.abstract	A medida que nos seguimos adentrando en la era de la información, la bioinformática está pasando a ser cada vez más importante en la biología moderna, en gran parte debido a su papel crítico en el procesamiento y análisis de la gran cantidad de datos complejos generados en el campo. A menudo los métodos tradicionales encuentran dificultades debidas al gran volumen y complejidad de estos datos, posicionando a las técnicas de aprendizaje automático como una solución más óptima. La exploración de la intersección entre aprendizaje automático y la bioinformática ofrece numerosas oportunidades para el desarrollo y la mejora de herramientas computacionales diseñadas para manejar y obtener información crítica sobre estos grandes conjuntos de datos. El objetivo principal de esta tesis es el desarrollo de una exploración exhaustiva de las posibilidades del aprendizaje automático en bioinformática, con un enfoque específico en problemas como la predicción del inicio de la transcripción y la predicción del fitness en las proteínas. Además, dadas las similitudes entre las secuencias bioinformáticas y el campo del procesamiento del lenguaje natural, la investigación tiene un claro énfasis en el uso de métodos basados en secuencias. Este trabajo ha resultado en la producción de varias contribuciones al campo en forma de tres artículos científicos. Los dos primeros se centran en la predicción del inicio de la transcripción. En el primero de ellos descubrimos que la integración de simulaciones biofísicas en conjunto con la secuencia de ADN puede mejorar los resultados de los métodos de aprendizaje automático. En el segundo, además, llegamos a la conclusión de que, mientras que las máquinas de soporte vectorial han sido una opción muy establecida en el campo de la predicción del inicio de la transcripción, nuestra investigación sugiere que los métodos de aprendizaje profundo los superan, marcando un cambio de paradigma en el área. 
Además, presentamos conjuntos de datos personalizados a partir de datos de Ensembl, proporcionando un recurso valioso para futuros estudios. El tercer artículo aborda la predicción del fitness en proteínas, específicamente en escenarios con conjuntos de datos escasos y concluye que los métodos de deep transfer learning se establecen como la mejor alternativa ante otras estrategias bien adaptadas a tales situaciones, como los métodos de aprendizaje semi-supervisado.	es
dc.description.sponsorship	This thesis has been funded through a pre-doctoral grant by the University of Burgos. The work included in this thesis has also been supported by the Junta de Castilla y León under project BU055P20 (JCyL/FEDER, UE), by the Ministry of Science and Innovation under project PID2019-109481GB-I00 and by the Junta de Andalucía under project UCO1264182, in both cases co-financed through European Union FEDER funds, and by Fundación Bancaria la Caixa under project 2020/00062/001. NVIDIA Corporation donated the TITAN Xp GPUs used in this research.	en
dc.format.mimetype	application/pdf
dc.language.iso	eng	es
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 Internacional	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject	Bioinformatics	en
dc.subject	Machine learning	en
dc.subject	Transcription start site	en
dc.subject	Deep learning	en
dc.subject	Protein fitness	en
dc.subject	Bioinformática	es
dc.subject	Aprendizaje automático	es
dc.subject	Sitio de inicio de la transcripción	es
dc.subject	Aprendizaje profundo	es
dc.subject	Aptitud de la proteína	es
dc.subject.other	Informática	es
dc.subject.other	Computer science	en
dc.title	Machine Learning Approaches in Bioinformatics: Advances in Transcription and Protein Fitness Prediction	en
dc.type	info:eu-repo/semantics/doctoralThesis	es
dc.rights.accessRights	info:eu-repo/semantics/embargoedAccess	es
dc.identifier.doi	10.36443/10259/9060
dc.subject.unesco	1203.04 Inteligencia Artificial
dc.relation.projectID	info:eu-repo/grantAgreement/Junta de Castilla y León//BU055P20//Métodos y Aplicaciones Industriales del Aprendizaje Semisupervisado/	es
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-109481GB-I00/ES/NUEVA APROXIMACION A LA CONSTRUCCION DE ENJAMBRES PARA APRENDIZAJE MULTI-ETIQUETA: APLICACION A LA QUEMINFORMATICA Y LA BIOINFORMATICA/	es
dc.relation.projectID	info:eu-repo/grantAgreement/Junta de Andalucía//UCO-1264182/	es
dc.relation.projectID	info:eu-repo/grantAgreement/Fundación Bancaria Caixa d'Estalvis i Pensions de Barcelona//2020%2F00062%2F001/	es
dc.type.hasVersion	info:eu-repo/semantics/acceptedVersion	es

Files in this item

Name: Barbero_Aparicio_Jose_Antonio- ... Embargoed until: 2026-12-02
Size: 6.817Mb
Format: Adobe PDF