<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<title>BEST-AI Articles</title>
<link>https://hdl.handle.net/10259/5378</link>
<description/>
<pubDate>Fri, 17 Apr 2026 08:59:40 GMT</pubDate>
<dc:date>2026-04-17T08:59:40Z</dc:date>
<item>
<title>Semi-supervised prediction of protein fitness for data-driven protein engineering</title>
<link>https://hdl.handle.net/10259/11430</link>
<description>Semi-supervised prediction of protein fitness for data-driven protein engineering
Olivares Gil, Alicia; Barbero Aparicio, José Antonio; Rodríguez Diez, Juan José; Diez Pastor, José Francisco; García Osorio, César; Davari, Mehdi D.
Protein fitness prediction plays a crucial role in the advancement of protein engineering endeavours. However, the combinatorial complexity of the protein sequence space and the limited availability of assay-labelled data hinder the efficient optimisation of protein properties. Data-driven strategies based on machine learning methods have emerged as a promising solution, yet their dependence on labelled training datasets poses a significant obstacle. To overcome this challenge, in this work we explore several ways of introducing the latent information present in evolutionarily related (homologous) sequences into the training process. To do so, we establish several strategies based on semi-supervised learning (unsupervised pre-processing and wrapper methods) and perform a comprehensive comparison using 19 datasets containing protein-fitness pairs. Our findings reveal that exploiting the information present in homologous sequences can improve model performance, especially when the number of available labelled sequences is very low. Specifically, the combination of a sequence encoding based on Direct Coupling Analysis (DCA) with MERGE (a hybrid regression framework that combines evolutionary information with supervised learning) and an SVM regressor outperforms other encodings (PAM250, UniRep, eUniRep) and other semi-supervised wrapper methods (Tri-Training Regressor, Co-Training Regressor). In summary, the demonstrated performance gains of this strategy mark a substantial leap towards more robust and reliable predictive models for protein engineering tasks. This advancement holds the potential to streamline the design and optimisation of proteins for diverse applications in biotechnology and therapeutics.
</description>
<pubDate>Mon, 01 Dec 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">https://hdl.handle.net/10259/11430</guid>
<dc:date>2025-12-01T00:00:00Z</dc:date>
</item>
<item>
<title>Deep learning and support vector machines for transcription start site identification</title>
<link>https://hdl.handle.net/10259/11429</link>
<description>Deep learning and support vector machines for transcription start site identification
Barbero Aparicio, José Antonio; Olivares Gil, Alicia; Diez Pastor, José Francisco; García Osorio, César
Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems, such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have proven exceptionally effective for such tasks, but their use in transcription start site identification has not yet been explored in depth. Moreover, the very few existing works neither compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The small number of papers published on this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied to related problems with remarkable results, we compared their performance in transcription start site prediction, concluding that SVMs are computationally much slower and that deep learning methods, especially long short-term memory neural networks (LSTMs), are better suited to working with sequences. For this purpose, we used the reference human genome GRCh38. Additionally, we studied two aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models was also tested on the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices for transcription start site identification, as well as a method to generate transcription start site datasets, including negative instances, for any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solving this problem, being more efficient and better adapted to long sequences and large amounts of data.
We also provide a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
</description>
<pubDate>Sat, 01 Apr 2023 00:00:00 GMT</pubDate>
<guid isPermaLink="false">https://hdl.handle.net/10259/11429</guid>
<dc:date>2023-04-01T00:00:00Z</dc:date>
</item>
<item>
<title>Label prediction on issue tracking systems using text mining</title>
<link>https://hdl.handle.net/10259/11283</link>
<description>Label prediction on issue tracking systems using text mining
Alonso-Abad, Jesús M.; López Nozal, Carlos; Maudes Raedo, Jesús M.; Marticorena Sánchez, Raúl
Issue tracking systems are general-purpose change-management tools in software development. The issue-solving life cycle is a complex socio-technical activity that requires team discussion and knowledge sharing between members. In that process, issue classification facilitates the understanding and analysis of issues. Issue tracking systems permit the tagging of issues with default labels (e.g., bug, enhancement) or with customized team labels (e.g., test failures, performance). However, a current problem is that many issues in open-source projects remain unlabeled. The aim of this paper is to improve maintenance tasks in development teams by evaluating models that can suggest a label for an issue using its text comments. We analyze data on issues from several GitHub trending projects, first by extracting issue information and then by applying text mining classifiers (i.e., support vector machine and naive Bayes multinomial). The results suggest that suitable classifiers can be obtained to label the issues or, at least, to suggest the most likely candidate labels.
</description>
<pubDate>Sun, 01 Sep 2019 00:00:00 GMT</pubDate>
<guid isPermaLink="false">https://hdl.handle.net/10259/11283</guid>
<dc:date>2019-09-01T00:00:00Z</dc:date>
</item>
<item>
<title>An Extensive Performance Comparison between Feature Reduction and Feature Selection Preprocessing Algorithms on Imbalanced Wide Data</title>
<link>https://hdl.handle.net/10259/11282</link>
<description>An Extensive Performance Comparison between Feature Reduction and Feature Selection Preprocessing Algorithms on Imbalanced Wide Data
Ramos Pérez, Ismael; Barbero Aparicio, José Antonio; Canepa Oneto, Antonio Jesús; Arnaiz González, Álvar; Maudes Raedo, Jesús M.
The most common preprocessing techniques used to deal with datasets having high dimensionality and a low number of instances—or wide data—are feature reduction (FR), feature selection (FS), and resampling. This study explores the use of FR and resampling techniques, expanding the limited comparisons between FR and filter FS methods in the existing literature, especially in the context of wide data. We compare the optimal outcomes from a previous comprehensive study of FS against new experiments conducted using FR methods. Two specific challenges associated with the use of FR are outlined in detail: finding FR methods that are compatible with wide data and the need for a reduction estimator of nonlinear approaches to process out-of-sample data. The experimental study compares 17 techniques, including supervised, unsupervised, linear, and nonlinear approaches, using 7 resampling strategies and 5 classifiers. The results demonstrate which configurations are optimal, according to their performance and computation time. Moreover, the best configuration—namely, k Nearest Neighbor (KNN) + the Maximal Margin Criterion (MMC) feature reducer with no resampling—is shown to outperform state-of-the-art algorithms.
</description>
<pubDate>Mon, 01 Apr 2024 00:00:00 GMT</pubDate>
<guid isPermaLink="false">https://hdl.handle.net/10259/11282</guid>
<dc:date>2024-04-01T00:00:00Z</dc:date>
</item>
</channel>
</rss>
