Understanding Limitations of Current Reward Models
Although reward models play a crucial role in Reinforcement Learning from Human Feedback (RLHF), many of today's top-performing open models still struggle to reflect the full range of complex human preferences. Even with sophisticated training techniques, meaningful progress has been limited. A major reason appears to be the shortcomings of current preference datasets, which are often too narrow, synthetically generated, or poorly vetted. While some rule-based systems are effective for clear-cut tasks such as math or coding, they generally fail to capture nuanced human judgment. Moreover, common benchmarks such as RewardBench are becoming less reliable indicators of real-world RM performance, showing poor correlation with downstream task success.
Challenges in Preference Data Creation and New Approaches
Creating high-quality preference data has traditionally relied on human annotators, but this approach is time-consuming, costly, and sometimes inconsistent. To address this, recent methods such as RLAIF use LLMs to automate annotation, sometimes even outperforming humans. Newer approaches aim to combine the strengths of both by integrating LLM-generated data with human-verified labels. Meanwhile, reward models have evolved from simple scoring systems, such as the Bradley-Terry model, to more complex frameworks, including generative and direct optimization methods. Despite the availability of numerous robust open models and datasets, challenges persist in accurately capturing nuanced human preferences across diverse tasks and languages.
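To make the Bradley-Terry formulation concrete, here is a minimal PyTorch sketch of the pairwise objective a scalar reward model is commonly trained with. The function and tensor names are illustrative and not taken from the paper; it simply shows the negative log-likelihood of preferring the chosen response over the rejected one.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss for a scalar reward model.

    P(chosen > rejected) = sigmoid(r_chosen - r_rejected);
    the loss is the negative log-likelihood of that probability.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example: scalar rewards produced by a reward-model head for a batch of preference pairs.
chosen = torch.tensor([1.2, 0.7, 2.1])
rejected = torch.tensor([0.3, 0.9, 1.0])
print(bradley_terry_loss(chosen, rejected).item())
```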
Introducing SynPref-40M: Large-Scale Human-AI Preference Dataset
Researchers from 2050 Research and Skywork AI introduce SynPref-40M, a massive dataset of 40 million preference pairs curated through a two-stage human-AI pipeline. Human annotators ensure quality through strict verification, while LLMs scale up data curation under human guidance. From this, the team develops Skywork-Reward-V2, a family of eight reward models (0.6B–8B parameters) trained on a high-quality subset of 26 million pairs. These models achieve state-of-the-art results across seven leading benchmarks, excelling in alignment, safety, objectivity, and robustness. The study highlights that success comes not just from data volume, but from careful, iterative curation that blends human expertise with AI scalability.
Scalable Two-Stage Human-AI Curation Pipeline
Current open reward models often suffer from overfitting to narrow benchmarks such as RewardBench, which limits their real-world usefulness. To address this, the researchers introduce a two-stage human-AI pipeline for curating large-scale preference data. Stage 1 begins with human-verified annotations that guide LLMs in labeling diverse preference attributes, followed by iterative training and error analysis to refine the reward model. Stage 2 scales this process using consistency checks between the best current model and a human-trained "gold" reward model, filtering reliable samples without further human input. This approach strikes a balance between quality and scalability, ultimately enabling the creation of tens of millions of high-quality preference pairs.
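As a rough illustration of the stage-2 idea, the sketch below keeps only pairs on which a gold reward model and the current best reward model agree about which response is preferred. The exact agreement criterion, thresholds, and scoring interfaces are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, Iterable, List, Tuple

PreferencePair = Tuple[str, str, str]  # (prompt, chosen, rejected)

def filter_by_consistency(
    pairs: Iterable[PreferencePair],
    gold_rm_score: Callable[[str, str], float],
    current_rm_score: Callable[[str, str], float],
) -> List[PreferencePair]:
    """Keep pairs where both reward models prefer the 'chosen' response.

    Hypothetical consistency filter: samples that both models rank the same
    way are treated as reliable and retained without further human review.
    """
    kept = []
    for prompt, chosen, rejected in pairs:
        gold_agrees = gold_rm_score(prompt, chosen) > gold_rm_score(prompt, rejected)
        current_agrees = current_rm_score(prompt, chosen) > current_rm_score(prompt, rejected)
        if gold_agrees and current_agrees:
            kept.append((prompt, chosen, rejected))
    return kept
```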
Benchmarking Skywork-Reward-V2: Compact Yet Powerful Models
The Skywork-Reward-V2 series demonstrates strong performance across multiple benchmarks, outperforming both larger models (e.g., 70B parameters) and emerging generative reward models. Trained on Qwen3 (0.6B–8B) and Llama 3.1/3.2 (1B–8B) backbones, these models achieve high scores on RewardBench, PPE, RM-Bench, and JudgeBench, with the best-performing variant (Llama-3.1-8B-40M) surpassing all others with an average score of 88.6. Despite their smaller sizes, Skywork-Reward-V2 models benefit from high-quality preference data (SynPref-40M) and efficient training setups, enabling them to generalize better in real-world RLHF scenarios. Notably, even mid-sized models such as Qwen3-1.7B outperform some 70B models, underscoring the impact of training data quality and methodology over sheer parameter count.
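For readers who want to try a released checkpoint, the snippet below is a minimal scoring sketch using the Hugging Face transformers library, assuming the reward model is published as a sequence-classification checkpoint with a single reward head. The repository identifier shown is an assumption; check the Skywork collection on Hugging Face for the exact name of the variant you want.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical repository id; verify the exact name on the Hugging Face hub.
model_name = "Skywork/Skywork-Reward-V2-Llama-3.1-8B-40M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

conversation = [
    {"role": "user", "content": "Explain what a reward model does in RLHF."},
    {"role": "assistant", "content": "A reward model scores candidate responses so RL can prefer better ones."},
]

# Format the conversation with the model's chat template and read the scalar
# logit as the reward assigned to the assistant response.
input_ids = tokenizer.apply_chat_template(conversation, tokenize=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    reward = model(input_ids).logits[0][0].item()
print(f"Reward score: {reward:.3f}")
```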

Conclusion and Future Outlook: Scaling with Precision
In conclusion, SynPref-40M is a large-scale preference dataset built through a two-stage human-AI collaboration that combines human judgment with LLM-based scalability. Using a curated subset of 26 million preference pairs, the team developed Skywork-Reward-V2, a suite of eight reward models (0.6B–8B parameters) that outperform existing models across seven key benchmarks. These models show strong generalization in aligning with human values, ensuring correctness, safety, and robustness to bias. Extensive studies confirm that both data quality and the curation method are key drivers of performance. Looking ahead, the researchers aim to explore new training strategies as reward models become central to LLM development and alignment.
Check out the Paper, Model on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, YouTube, and Spotify, and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.