
https://www.bitchute.com/video/52kMBrAI_IM/


ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)


First published at 01:38 UTC on May 2nd, 2024.

Yannic Kilcher


Paper: https://arxiv.org/abs/2403.07691

Abstract:
While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on AlpacaEval2.0 (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B).

Authors: Jiwoo Hong, Noah Lee, James Thorne
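
To make the abstract's "minor penalty for the disfavored generation style" concrete, here is a minimal sketch of an odds-ratio objective for a single preference pair, as I understand it from the paper. It is an illustration, not the authors' released code. Assumptions: the function names `log_odds` and `orpo_loss` are hypothetical, each response is summarized by its average per-token log-probability under the model being trained, and the weight `lam=0.1` is a placeholder rather than a recommended setting.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp).

    avg_logp is an average per-token log-probability (assumed < 0),
    so p lies strictly between 0 and 1.
    """
    # log1p(-exp(x)) evaluates log(1 - p) without forming 1 - p directly.
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              lam: float = 0.1) -> float:
    """SFT loss on the favored response plus a weighted odds-ratio penalty.

    No reference model appears anywhere: both log-probabilities come
    from the single model being fine-tuned.
    """
    # Standard SFT term: negative log-likelihood of the favored response.
    nll = -avg_logp_chosen

    # Log odds ratio between the favored and disfavored responses.
    z = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)

    # -log(sigmoid(z)) = softplus(-z), written in a numerically stable form.
    penalty = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

    return nll + lam * penalty


if __name__ == "__main__":
    # Hypothetical log-probabilities, chosen only for illustration.
    print(orpo_loss(avg_logp_chosen=-0.5, avg_logp_rejected=-2.0))
    print(orpo_loss(avg_logp_chosen=-2.0, avg_logp_rejected=-0.5))
```

The second call yields the larger loss: the model assigns higher likelihood to the disfavored response, so both the SFT term and the odds-ratio penalty grow. Because the penalty is simply added to the SFT loss, alignment happens in one training phase, which is the "monolithic" part of the title.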

Links:
Homepage: https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
LinkedIn: https://www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B..

