RT info:eu-repo/semantics/doctoralThesis
T1 Machine Learning Approaches in Bioinformatics: Advances in Transcription and Protein Fitness Prediction
A1 Barbero Aparicio, José Antonio
A2 Universidad de Burgos. Departamento de Ingeniería Informática
K1 Bioinformatics
K1 Machine learning
K1 Transcription start site
K1 Deep learning
K1 Protein fitness
K1 Bioinformática
K1 Aprendizaje automático
K1 Sitio de inicio de la transcripción
K1 Aprendizaje profundo
K1 Aptitud de la proteína
K1 Informática
K1 Computer science
K1 1203.04 Inteligencia Artificial
AB As we move deeper into the information age, bioinformatics has become increasingly important in modern biology, largely because of its critical role in processing and analyzing the vast amounts of complex data generated in the field. Traditional methods are often overwhelmed by the volume and complexity of these data, which makes machine learning techniques a natural alternative. The intersection of machine learning and bioinformatics offers numerous opportunities to develop and improve computational tools that handle these datasets and extract insights from them. The main objective of this thesis is to explore thoroughly the possibilities of machine learning in bioinformatics, with a particular focus on two problems: transcription start site prediction and protein fitness prediction. Furthermore, given the similarities between biological sequence data and the natural language processing domain, the research emphasizes sequence-based methods. Our research has resulted in three scientific papers. The first two focus on transcription start site prediction. In the first, we found that combining biophysical simulations with the DNA sequence can improve the results of machine learning methods.
In the second, we concluded that, although support vector machines have been a traditional choice for transcription start site prediction, deep learning methods outperform them, marking a paradigm shift in the field. We also presented custom datasets built from Ensembl data, providing a valuable resource for future studies. The third paper addresses protein fitness prediction in scenarios where data are scarce and concludes that deep transfer learning methods are the best alternative when compared with other strategies suited to such situations, such as semi-supervised learning.
YR 2023
FD 2023
LK http://hdl.handle.net/10259/9060
UL http://hdl.handle.net/10259/9060
LA eng
NO This thesis has been funded through a pre-doctoral grant by the University of Burgos. The work included in this thesis has also been supported by the Junta de Castilla y León under project BU055P20 (JCyL/FEDER, UE), by the Ministry of Science and Innovation under project PID2019-109481GB-I00 and the Junta de Andalucía under project UCO1264182, in both cases co-financed through European Union FEDER funds, and by Fundación Bancaria la Caixa under project 2020/00062/001. NVIDIA Corporation donated the TITAN Xp GPUs used in this research.
DS Repositorio Institucional de la Universidad de Burgos
RD 11-Dec-2024