RT info:eu-repo/semantics/doctoralThesis
T1 Machine Learning Approaches in Bioinformatics: Advances in Transcription and Protein Fitness Prediction
A1 Barbero Aparicio, José Antonio
A2 Universidad de Burgos. Departamento de Ingeniería Informática
K1 Bioinformatics
K1 Machine learning
K1 Transcription start site
K1 Deep learning
K1 Protein fitness
K1 Bioinformática
K1 Aprendizaje automático
K1 Sitio de inicio de la transcripción
K1 Aprendizaje profundo
K1 Aptitud de la proteína
K1 Informática
K1 Computer science
K1 1203.04 Inteligencia Artificial
AB As we move deeper into the information age, bioinformatics has become increasingly important in modern biology, largely because of its critical role in processing and analyzing the vast amounts of complex data generated in the field. Traditional methods are often overwhelmed by the volume and complexity of these data, positioning machine learning techniques as an optimal solution. The intersection of machine learning and bioinformatics offers numerous opportunities to develop and improve computational tools designed to handle, and gain insights from, these datasets. The main objective of this thesis is to explore thoroughly the possibilities of machine learning in bioinformatics, with a particular focus on specific problems such as transcription start site prediction and protein fitness prediction. Furthermore, given the similarities between bioinformatics sequence data and the natural language processing domain, the research emphasizes the use of sequence-based methods. Our research has resulted in three scientific papers. The first two focus on transcription start site prediction. In the first, we found that combining biophysical simulations with the DNA sequence can improve the results of machine learning methods. In the second, we concluded that, although support vector machines have been a traditional choice for transcription start site prediction, deep learning methods outperform them, marking a paradigm shift in the field. We also presented custom datasets built from Ensembl data, providing a valuable resource for future studies.
The third paper addresses protein fitness prediction in scenarios where labeled data are scarce, and concludes that deep transfer learning methods are the best alternative when compared with other strategies well suited to such situations, such as semi-supervised learning.
YR 2023
FD 2023
LK http://hdl.handle.net/10259/9060
UL http://hdl.handle.net/10259/9060
LA eng
NO This thesis has been funded through a pre-doctoral grant by the University of Burgos. The work included in this thesis has also been supported by the Junta de Castilla y León under project BU055P20 (JCyL/FEDER, UE), by the Ministry of Science and Innovation under project PID2019-109481GB-I00 and the Junta de Andalucía under project UCO1264182, in both cases co-financed through European Union FEDER funds, and by Fundación Bancaria la Caixa under project 2020/00062/001. NVIDIA Corporation donated the TITAN Xp GPUs used in this research.
DS Repositorio Institucional de la Universidad de Burgos
RD 19-may-2024