Will Meta officially release any model that performs better than the equivalent-size (± 10% in parameter count) LLaMa 3.1 model, where "performs better" means "at least 0.5 percentage points more accurate on MMLU"? Base models only.
For example, LLaMa 3.1 70B's MMLU score is 83.6% (an improvement over LLaMa 3.0 70B's 79.5%). A hypothetical LLaMa 3.2 70B would need to score at least 84.1% on MMLU to resolve this market YES. Note that any model in the family (8B, 70B, 405B) performing at least 0.5 points better than its LLaMa 3.1 counterpart is enough to resolve this market.
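The resolution rule above can be sketched as a simple threshold check. This is a minimal illustration, not part of the market mechanics; the function name and the dictionary of new-model scores are hypothetical, and only the 70B and 405B baseline scores stated in this description are filled in.

```python
THRESHOLD = 0.5  # percentage points of MMLU accuracy

# LLaMa 3.1 base MMLU scores stated in this market description (8B omitted).
LLAMA_3_1_BASE_MMLU = {"70B": 83.6, "405B": 85.2}

def resolves_yes(new_scores: dict[str, float]) -> bool:
    """True if any size beats its LLaMa 3.1 counterpart by >= 0.5 points."""
    return any(
        new_scores.get(size, float("-inf")) >= old + THRESHOLD
        for size, old in LLAMA_3_1_BASE_MMLU.items()
    )

# A hypothetical LLaMa 3.2 70B at 84.1% just clears the 83.6 + 0.5 bar.
print(resolves_yes({"70B": 84.1}))  # True
print(resolves_yes({"70B": 84.0}))  # False: only 0.4 points better
```

Note the comparison is on absolute percentage points (83.6 → 84.1), not a 0.5% relative improvement.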
Multimodal models are eligible, but only text MMLU performance will be evaluated. Models that were fine-tuned, DPO'd, RLHF'd, or CPT'd on synthetic data will not resolve this market.
For reference, LLaMa 3.0 70B's MMLU score was 79.5%, GPT-4o's score is 88.7%, and LLaMa 3.1 405B base's score is 85.2%. (LLaMa 3.1 405B Instruct's score is 88.7%.)
Please add liquidity to this market! This is an important question that I care about; I've already added M3000 myself.