Baseline leaderboard

The baseline leaderboard measures how models perform "out of the box," without extra tools, context, or scaffolding. Models are selected by the ForecastBench team for standardized evaluation, so the leaderboard provides a consistent benchmark for tracking progress in LLM forecasting accuracy.

Performance on dataset and market questions is scored using the difficulty-adjusted Brier score to account for differences in question difficulty across question sets. Forecasters are ranked by their overall score, which is the equal-weighted average of the dataset and market scores.
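As a rough illustration of the scoring, the sketch below computes a plain Brier score for a set of binary forecasts and combines the dataset and market scores with equal weights. The function names and numbers are hypothetical, and the difficulty adjustment ForecastBench applies on top of the raw Brier score is not reproduced here.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts and binary (0/1) outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def overall_score(dataset_score, market_score):
    """Equal-weighted average of the dataset-question and market-question scores."""
    return 0.5 * (dataset_score + market_score)

# Hypothetical example: lower Brier scores are better.
dataset_brier = brier_score([0.7, 0.2, 0.9], [1, 0, 1])  # dataset questions
market_brier = brier_score([0.4, 0.8], [0, 1])           # market questions
print(round(overall_score(dataset_brier, market_brier), 4))
```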

Hover over the column titles to see tooltips with further explanations. Notes are at the bottom of the page.

‡Notes

  • To ensure leaderboard stability, models are added to the leaderboard 50 days after forecast submission.
  • Human comparison groups are highlighted in red.
  • The zero-shot and scratchpad prompts used for the models run by ForecastBench can be found on GitHub.
  • The ForecastBench baseline forecasters are described in the Changelog.