We've fine-tuned GPT-2 using human feedback for tasks such as summarizing articles, matching the preferences of human labelers (if not always our own). We're hoping this brings safety methods closer to machines that learn values by talking with humans.
Yikes! If we're going to keep using human preferences & raters as a crucial part of training AI systems (which IMO is necessary if we're going to use AI and have it go OK!), we need to design robust & humane processes for those workers!
We've applied reward learning to language! Lots of lessons learned along the way (data quality is hard). Also watch out for sign flips. :)
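That sign-flip warning is worth unpacking: in preference-based reward learning there are two easy places to get a sign backwards, the pairwise reward-model loss and the reward handed to the RL optimizer. Here is a minimal sketch of both, not the authors' code; the function names, the Bradley-Terry-style loss, and the KL-penalty form are illustrative assumptions.

```python
# A minimal sketch (not the paper's implementation) of where a reward sign
# can silently flip in preference-based fine-tuning. `preference_loss` and
# `rl_reward` are hypothetical helpers, not APIs from the released code.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry-style) loss: the human-chosen sample should
    score higher. Swapping the arguments flips the logit's sign and trains a
    reward model that prefers the samples humans rejected."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def rl_reward(reward: torch.Tensor,
              logprob: torch.Tensor,
              ref_logprob: torch.Tensor,
              kl_coef: float = 0.1) -> torch.Tensor:
    """Reward passed to the policy optimizer: learned reward minus a KL
    penalty toward the original language model. Flipping the sign of either
    term sends the policy toward the opposite objective."""
    return reward - kl_coef * (logprob - ref_logprob)

if __name__ == "__main__":
    torch.manual_seed(0)
    chosen, rejected = torch.randn(8), torch.randn(8)
    print("reward-model loss:", preference_loss(chosen, rejected).item())
    print("sign-flipped loss:", preference_loss(rejected, chosen).item())
```

The flipped call trains just as smoothly as the correct one, which is why this class of bug is easy to miss until you look at the samples the fine-tuned model actually produces.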
Links for my conversees this weekend:
- Ads' 10% user penalty:
- Anime NNs:
- RL-based poetry generation: