
Feng, Xuening; Jiang, Zhaohui; Kaufmann, Timo (ORCID: https://orcid.org/0000-0001-5193-8574); Xu, Puchen; Hüllermeier, Eyke (ORCID: https://orcid.org/0000-0002-9944-4108); Weng, Paul and Zhu, Yifei (April 2025): DUO: Diverse, Uncertain, On-Policy Query Generation and Selection for Reinforcement Learning from Human Feedback. The 39th Annual AAAI Conference on Artificial Intelligence, Philadelphia, Pennsylvania, USA, 25 February - 4 March 2025. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, No. 16, pp. 16604-16612.

Full text not available on 'Open Access LMU'.

Abstract

Defining a reward function is usually a challenging but critical task for the system designer in reinforcement learning, especially when specifying complex behaviors. Reinforcement learning from human feedback (RLHF) emerges as a promising approach to circumvent this. In RLHF, the agent typically learns a reward function by querying a human teacher using pairwise comparisons of trajectory segments. A key question in this domain is how to reduce the number of queries necessary to learn an informative reward function since asking a human teacher too many queries is impractical and costly. To tackle this question, we propose DUO, a novel method for diverse, uncertain, on-policy query generation and selection in RLHF. Our method produces queries that are (1) more relevant for policy training (via an on-policy criterion), (2) more informative (via a principled measure of epistemic uncertainty), and (3) diverse (via a clustering-based filter). Experimental results on a variety of locomotion and robotic manipulation tasks demonstrate that our method can outperform state-of-the-art RLHF methods given the same total budget of queries, while being robust to possibly irrational teachers.
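The paper itself is not available here, but the abstract names three concrete selection criteria. As a rough illustration only (not the authors' implementation), the following Python sketch shows one common way such criteria can be combined: candidate segment pairs are assumed to come from recent on-policy rollouts, epistemic uncertainty is approximated by disagreement within a reward-model ensemble, and a clustering step enforces diversity. All function and variable names are hypothetical.

    # Illustrative sketch of the three criteria named in the abstract:
    # on-policy candidates, ensemble-disagreement uncertainty, clustering-based diversity.
    import numpy as np
    from sklearn.cluster import KMeans

    def select_queries(segment_pairs, pair_features, reward_ensemble, budget):
        """Pick `budget` segment pairs to show the human teacher.

        segment_pairs   : list of (segment_a, segment_b) from recent on-policy rollouts
        pair_features   : (n_pairs, d) array of feature vectors, one per pair
        reward_ensemble : list of reward models, each mapping a segment to a scalar return
        budget          : number of queries to select
        """
        # Epistemic-uncertainty proxy: how much the ensemble members disagree
        # about which segment in each pair is preferred.
        prefer_a = np.array([
            [float(rm(a) > rm(b)) for rm in reward_ensemble]
            for a, b in segment_pairs
        ])                                   # (n_pairs, n_models) of 0/1 votes
        p = prefer_a.mean(axis=1)            # mean vote per pair
        uncertainty = p * (1.0 - p)          # largest when the ensemble is split 50/50

        # Diversity filter: cluster the candidates and keep the most uncertain
        # pair from each cluster, so the selected queries are not near-duplicates.
        k = min(budget, len(segment_pairs))
        labels = KMeans(n_clusters=k, n_init="auto").fit_predict(pair_features)
        selected = []
        for c in range(k):
            members = np.where(labels == c)[0]
            selected.append(members[np.argmax(uncertainty[members])])
        return [segment_pairs[i] for i in selected]

This is only a sketch under stated assumptions; the paper's actual query-generation procedure, uncertainty measure, and clustering details may differ.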
