
New secret math benchmark stumps AI models and PhDs alike

FrontierMath’s difficult questions remain unpublished so that AI companies can’t train against it.

On Friday, research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world because it contains hundreds of expert-level problems that leading AI models solve less than 2 percent of the time, according to Epoch AI. The benchmark tests AI language models (such as GPT-4o, which powers ChatGPT) against original mathematics problems that typically require hours or days for specialist mathematicians to complete.

FrontierMath’s performance results, revealed in a preprint research paper, paint a stark picture of current AI model limitations. Even with access to Python environments for testing and verification, top models like Claude 3.5 Sonnet, GPT-4o, o1-preview, and Gemini 1.5 Pro scored extremely poorly. This contrasts with their high performance on simpler math benchmarks—many models now score above 90 percent on tests like GSM8K and MATH.

The design of FrontierMath differs from many existing AI benchmarks because the problem set remains private and unpublished to prevent data contamination. Many existing AI models have been trained on data that includes problems from public benchmarks, allowing them to answer those problems from memory and appear more generally capable than they actually are. Many experts cite this contamination as evidence that current large language models (LLMs) are poor generalist learners.
