The tournament leaderboard captures the frontier of LLM forecasting ability and is open to external submissions. Teams may enhance models however they choose—with tools, added context, fine-tuning, ensembling, or other methods. Models submitted by the ForecastBench team receive the crowd forecast as context for market questions.
Performance on dataset and market questions is scored using the difficulty-adjusted Brier score, which accounts for differences in question difficulty across question sets. Each score is then converted to a Brier Index, an interpretable 0-100% scale where higher is better (100% = perfect accuracy, 50% = maximally uninformed, 0% = maximally wrong). Forecasters are then ranked by their Overall Brier Index: the equal-weighted average of their dataset and market difficulty-adjusted Brier scores, converted to a Brier Index.
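The exact transformation is defined in the ForecastBench paper; the sketch below is only a rough illustration that assumes a piecewise-linear mapping through the three anchor points stated above (a Brier score of 0 maps to 100%; 0.25, the score of always forecasting 0.5, maps to 50%; and 1 maps to 0%). The function name `brier_index` is ours, not ForecastBench's.

```python
def brier_index(brier: float) -> float:
    """Map a Brier score onto a 0-100% scale (higher is better).

    Illustrative piecewise-linear mapping through the stated anchors:
    0.0 (perfect) -> 100%, 0.25 (always forecasting 0.5, i.e. maximally
    uninformed) -> 50%, 1.0 (maximally wrong) -> 0%. ForecastBench's
    actual transformation may differ from this sketch.
    """
    if brier <= 0.25:
        return 100.0 - 200.0 * brier       # 0.0 -> 100, 0.25 -> 50
    return (1.0 - brier) / 0.75 * 50.0     # 0.25 -> 50, 1.0 -> 0

print(brier_index(0.0))    # 100.0
print(brier_index(0.25))   # 50.0
print(brier_index(1.0))    # 0.0
```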
Hover over the column titles to see tooltips with further explanations. Notes are at the bottom of the page.
‡Notes
- To ensure leaderboard stability, models are included on the leaderboard 50 days after forecast submission.
- The Overall Brier Index is computed by first averaging the difficulty-adjusted Brier scores across the dataset and market categories, then applying the Brier Index transformation. Since this transformation is nonlinear, the Overall Brier Index, in general, does not equal the average of the Dataset and Market Brier Indices (a numerical sketch follows these notes).
- Human comparison groups (highlighted in red) were last surveyed in July 2024 and therefore answered a different set of questions than models evaluated after that date. To enable fair ranking across forecasters despite non-overlapping question sets, we use a two-way fixed-effects model that adjusts for question difficulty (see the Addendum to the paper for details; a simplified sketch also follows these notes).
- The zero-shot and scratchpad prompts used for the models run by ForecastBench can be found on GitHub.
- The ForecastBench baseline forecasters are described on the Changelog.
- The "crowd forecast" provided to models run by ForecastBench were valid 10 days before the forecast due date. This delay exists to allow us to run human surveys periodically. Also note that these crowd forecasts only impact Market questions as there is no crowd forecast for Dataset questions.
Benchmark your model!
Would you like to have your model's forecasting capabilities evaluated on ForecastBench? We're building a community of forecasters who are working with LLMs to discover the frontier of their forecasting abilities. Your setup does not need to be made open source, but we do maintain a growing list of ForecastBench participants who are open to collaboration. We'll be in touch with top performers to discuss their forecasting strategies and, potentially, feature them in a Forecasting Research Institute blog post.
To participate, follow the instructions on how to submit.