Addressing data scarcity in protein fitness landscape analysis: A study on semi-supervised and deep transfer learning techniques

Barbero Aparicio, José Antonio; Olivares Gil, Alicia; Rodríguez Diez, Juan José; García Osorio, César; Diez Pastor, José Francisco

doi:10.1016/j.inffus.2023.102035

dc.contributor.author	Barbero Aparicio, José Antonio
dc.contributor.author	Olivares Gil, Alicia
dc.contributor.author	Rodríguez Diez, Juan José
dc.contributor.author	García Osorio, César
dc.contributor.author	Diez Pastor, José Francisco
dc.date.accessioned	2024-06-17T10:39:50Z
dc.date.available	2024-06-17T10:39:50Z
dc.date.issued	2024-02
dc.identifier.issn	1566-2535
dc.identifier.uri	http://hdl.handle.net/10259/9281
dc.description.abstract	This paper presents a comprehensive analysis of deep transfer learning methods, supervised methods, and semi-supervised methods in the context of protein fitness prediction, with a focus on small datasets. The analysis includes the exploration of the combination of different data sources to enhance the performance of the models. While deep learning and deep transfer learning methods have shown remarkable performance in situations with abundant data, this study aims to address the more realistic scenario faced by wet lab researchers, where labeled data is often limited. The novelty of this work lies in its examination of deep transfer learning in the context of small datasets and its consideration of semi-supervised methods and multi-view strategies. While previous research has extensively explored deep transfer learning in large dataset scenarios, little attention has been given to its efficacy in small dataset settings or its comparison with semi-supervised approaches. Our findings suggest that deep transfer learning, exemplified by ProteinBERT, shows promising performance in this context compared to the rest of the methods across various evaluation metrics, not only in small dataset contexts but also in large dataset scenarios. This highlights the robustness and versatility of deep transfer learning in protein fitness prediction tasks, even with limited labeled data. The results of this study shed light on the potential of deep transfer learning as a state-of-the-art approach in the field of protein fitness prediction. By leveraging pre-trained models and fine-tuning them on small datasets, researchers can achieve competitive performance surpassing traditional supervised and semi-supervised methods. These findings provide valuable insights for wet lab researchers who face the challenge of limited labeled data, enabling them to make informed decisions when selecting the most effective methodology for their specific protein fitness prediction tasks. Additionally, the study investigated the combination of two different sources of information (encodings) through our enhanced semi-supervised methods, yielding noteworthy results improving their base model and providing valuable insights for further research. The presented analysis contributes to a better understanding of the capabilities and limitations of different learning approaches in small dataset scenarios, ultimately aiding in the development of improved protein fitness prediction methods.	en
dc.description.sponsorship	This work is supported by the Junta de Castilla y León, Spain under project BU055P20 (JCyL/FEDER, UE), and the Ministry of Science and Innovation, Spain under project PID2020-119894GB-I00, co-financed through European Union FEDER funds. José A. Barbero-Aparicio is funded through a pre-doctoral grant from the University of Burgos, and Alicia Olivares-Gil is funded by a pre-doctoral grant from the Department of Education of the Junta de Castilla y León (VA) (ORDEN EDU/875/2021) (Spain).	en
dc.format.mimetype	application/pdf
dc.language.iso	eng	es
dc.publisher	Elsevier	en
dc.relation.ispartof	Information Fusion. 2024, V. 102, 102035	en
dc.rights	Attribution-NonCommercial 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc/4.0/	*
dc.subject	Bioinformatics	en
dc.subject	Machine learning	en
dc.subject	Transfer learning	en
dc.subject	Semi-supervised learning	en
dc.subject	Protein fitness prediction	en
dc.subject	Small datasets	en
dc.subject.other	Informática	es
dc.subject.other	Computer science	en
dc.subject.other	Bioinformática	es
dc.subject.other	Bioinformatics	en
dc.title	Addressing data scarcity in protein fitness landscape analysis: A study on semi-supervised and deep transfer learning techniques	en
dc.type	info:eu-repo/semantics/article	es
dc.rights.accessRights	info:eu-repo/semantics/openAccess	es
dc.relation.publisherversion	https://doi.org/10.1016/j.inffus.2023.102035	es
dc.identifier.doi	10.1016/j.inffus.2023.102035
dc.journal.title	Information Fusion	en
dc.volume.number	102	es
dc.type.hasVersion	info:eu-repo/semantics/publishedVersion	es

Files in this item

Name: Barbero-if_2024.pdf
Size: 727.8 KB
Format: Adobe PDF
