Leaderboard: overall

  • Ranking: The forecaster's position on the leaderboard, ordered by Overall Score (lowest score first)
  • Organization: The group responsible for the model or forecasts
  • Model: The LLM and prompt type, or the human group and forecast aggregation method
    • zero shot: the model was given a zero-shot prompt
    • scratchpad: the model was given a scratchpad prompt whose instructions outline a procedure for reasoning about the question
    • with freeze values: for questions from market sources, the prompt was supplemented with the aggregate human forecast from the relevant platform on the day the question set was generated
    • with news: the prompt was supplemented with relevant news summaries obtained through an automated process
  • Dataset Score: The average Brier score (lower is better) across all questions sourced from datasets
  • Market Score (resolved): The average Brier score across all resolved questions sourced from prediction markets and forecast aggregation platforms
  • Market Score (unresolved): The average Brier score across all unresolved questions sourced from prediction markets and forecast aggregation platforms
  • Market Score (overall): The average Brier score across all questions sourced from prediction markets and forecast aggregation platforms
  • Overall Resolved Score: The average of the Dataset Score and the Market Score (resolved) columns
  • Overall Score: The average of the Dataset Score and the Market Score (overall) columns
  • Overall Score 95% CI: The 95% confidence interval for the Overall Score
  • Pairwise p-value comparing to No. 1 (bootstrapped): The p-value obtained by bootstrapping the differences in overall score between this forecaster and the best forecaster (rank 1), under the null hypothesis that there is no difference
  • Pct. more accurate than No. 1: The percent of questions where this forecaster had a better overall score than the best forecaster (with rank 1)
  • Pct. Imputed: The percent of questions for which this forecaster did not provide a forecast and therefore had one imputed: 0.5 for dataset questions, and the aggregate human forecast on the forecast due date for questions sourced from prediction markets or forecast aggregation platforms
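
The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the function names, the dictionary keys (`source`, `forecast`, `crowd_forecast`, `outcome`) and the toy questions are all hypothetical, and `outcome` is assumed to be whatever value the benchmark scores a question against. The sketch shows the Brier score for a single question, the imputation rule for missing forecasts, and the Overall Score as the average of the two per-source means rather than a pooled mean over all questions.

```python
from statistics import fmean


def brier(forecast: float, outcome: float) -> float:
    """Squared error between a probability forecast and the outcome (lower is better)."""
    return (forecast - outcome) ** 2


def fill_missing(forecast, source, crowd_forecast=None):
    """Impute a missing forecast: 0.5 for dataset questions, crowd value for market ones."""
    if forecast is not None:
        return forecast
    if source == "dataset":
        return 0.5
    # Market question: fall back to the aggregate human forecast (assumed to be supplied).
    return crowd_forecast


def overall_score(questions):
    """Average of the dataset mean and the market mean of per-question Brier scores."""
    by_source = {"dataset": [], "market": []}
    for q in questions:
        p = fill_missing(q["forecast"], q["source"], q.get("crowd_forecast"))
        by_source[q["source"]].append(brier(p, q["outcome"]))
    return (fmean(by_source["dataset"]) + fmean(by_source["market"])) / 2


# Hypothetical toy data: two dataset questions (one unanswered) and one market question.
toy = [
    {"source": "dataset", "forecast": 0.8, "outcome": 1.0},
    {"source": "dataset", "forecast": None, "outcome": 0.0},
    {"source": "market", "forecast": None, "crowd_forecast": 0.1, "outcome": 0.0},
]
print(overall_score(toy))
```

Because the two source means are averaged with equal weight, the 912 market questions count as much as the 2,737 dataset questions. The rank 1 row illustrates this: its Dataset Score of 0.175 and Market Score (overall) of 0.051 average to 0.113, which is its Overall Score.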
Ranking Organization Model Dataset Score (N=2,737) Market Score (resolved) (N=85) Market Score (unresolved) (N=827) Market Score (overall) (N=912) Overall Resolved Score (N=2,822) Overall Score (N=3,649) Overall Score 95% CI Pairwise p-value comparing to No. 1 (bootstrapped) Pct. more accurate than No. 1 Pct. Imputed
1 OpenAI GPT-4-Turbo-2024-04-09 (scratchpad with freeze values) 0.175 0.109 0.045 0.051 0.142 0.113 [0.107, 0.118] N/A 0% 0%
2 Anthropic Claude-3-5-Sonnet-20240620 (scratchpad with freeze values) 0.179 0.110 0.041 0.047 0.145 0.113 [0.108, 0.119] 0.442 53% 0%
3 OpenAI GPT-4o (scratchpad with freeze values) 0.201 0.095 0.033 0.039 0.148 0.120 [0.115, 0.125] <0.001 45% 0%
4 Google Gemini-1.5-Pro (scratchpad with freeze values) 0.168 0.135 0.069 0.075 0.152 0.121 [0.116, 0.127] <0.001 42% 0%
5 OpenAI GPT-4o (scratchpad with news with freeze values) 0.196 0.105 0.044 0.050 0.150 0.123 [0.117, 0.128] <0.001 41% 0%
6 Anthropic Claude-3-Opus-20240229 (zero shot with freeze values) 0.192 0.094 0.058 0.061 0.143 0.127 [0.12, 0.133] <0.001 44% 0%
7 Google Gemini-1.5-Pro (scratchpad with news with freeze values) 0.172 0.143 0.076 0.082 0.157 0.127 [0.121, 0.132] <0.001 41% 0%
8 OpenAI GPT-4-Turbo-2024-04-09 (zero shot with freeze values) 0.199 0.106 0.050 0.055 0.153 0.127 [0.12, 0.133] <0.001 47% 0%
9 Qwen Qwen1.5-110B-Chat (scratchpad with freeze values) 0.176 0.126 0.073 0.078 0.151 0.127 [0.122, 0.133] <0.001 32% 0%
10 OpenAI GPT-4-Turbo-2024-04-09 (scratchpad) 0.175 0.162 0.074 0.083 0.169 0.129 [0.124, 0.134] <0.001 7% 0%
11 Anthropic Claude-3-5-Sonnet-20240620 (scratchpad with news with freeze values) 0.197 0.118 0.062 0.067 0.158 0.132 [0.127, 0.138] <0.001 43% 0%
12 Google Gemini-1.5-Pro (scratchpad) 0.168 0.167 0.090 0.097 0.168 0.133 [0.127, 0.138] <0.001 39% 0%
13 OpenAI GPT-4 (scratchpad with freeze values) 0.198 0.135 0.061 0.067 0.166 0.133 [0.127, 0.138] <0.001 33% 0%
14 Anthropic Claude-3-5-Sonnet-20240620 (scratchpad) 0.179 0.177 0.077 0.087 0.178 0.133 [0.127, 0.139] <0.001 46% 0%
15 Meta Llama-3-70b-Chat-Hf (zero shot with freeze values) 0.199 0.116 0.064 0.069 0.157 0.134 [0.128, 0.14] <0.001 38% 0%
16 Google Gemini-1.5-Pro (scratchpad with news) 0.172 0.167 0.089 0.097 0.169 0.134 [0.128, 0.14] <0.001 39% 0%
17 Google Gemini-1.5-Pro (zero shot with freeze values) 0.219 0.084 0.048 0.052 0.151 0.135 [0.129, 0.141] <0.001 44% 17%
18 Anthropic Claude-3-5-Sonnet-20240620 (zero shot with freeze values) 0.208 0.130 0.056 0.063 0.169 0.136 [0.129, 0.142] <0.001 51% 0%
19 ForecastBench Imputed Forecaster 0.250 0.043 0.019 0.021 0.146 0.136 [0.133, 0.138] <0.001 43% 100%
20 ForecastBench LLM Crowd (gpt-4o, claude-3.5-sonnet, gemini-1.5-pro) with news 0.234 0.087 0.034 0.039 0.160 0.137 [0.134, 0.14] <0.001 37% 71%
21 OpenAI GPT-4-Turbo-2024-04-09 (scratchpad with news with freeze values) 0.213 0.103 0.056 0.061 0.158 0.137 [0.132, 0.142] <0.001 29% 0%
22 Google Gemini-1.5-Flash (scratchpad with freeze values) 0.187 0.138 0.083 0.088 0.163 0.138 [0.131, 0.144] <0.001 40% 0%
23 OpenAI GPT-4o (scratchpad with news) 0.196 0.133 0.077 0.082 0.164 0.139 [0.133, 0.145] <0.001 34% 0%
24 Qwen Qwen1.5-110B-Chat (scratchpad) 0.176 0.149 0.097 0.102 0.162 0.139 [0.134, 0.144] <0.001 30% 0%
25 ForecastBench LLM Crowd (gpt-4o, claude-3.5-sonnet, gemini-1.5-pro) with news 0.236 0.088 0.037 0.042 0.162 0.139 [0.136, 0.142] <0.001 36% 71%
26 ForecastBench LLM Crowd (gpt-4o, claude-3.5-sonnet, gemini-1.5-pro) with news 0.237 0.086 0.037 0.042 0.162 0.139 [0.136, 0.142] <0.001 36% 71%
27 OpenAI GPT-4 (zero shot with freeze values) 0.222 0.100 0.053 0.057 0.161 0.140 [0.134, 0.145] <0.001 38% 0%
28 Mistral AI Mistral-Large-Latest (scratchpad with freeze values) 0.197 0.120 0.078 0.082 0.159 0.140 [0.135, 0.145] <0.001 29% 0%
29 OpenAI GPT-4o (scratchpad) 0.201 0.196 0.069 0.081 0.199 0.141 [0.136, 0.147] <0.001 37% 0%
30 Meta Llama-3-70b-Chat-Hf (scratchpad with freeze values) 0.208 0.130 0.070 0.075 0.169 0.142 [0.137, 0.147] <0.001 29% 0%
31 OpenAI GPT-4-Turbo-2024-04-09 (zero shot) 0.199 0.149 0.079 0.086 0.174 0.142 [0.136, 0.148] <0.001 38% 0%
32 Google Gemini-1.5-Pro (zero shot) 0.219 0.133 0.063 0.069 0.176 0.144 [0.138, 0.15] <0.001 42% 17%
33 Anthropic Claude-3-5-Sonnet-20240620 (scratchpad with news) 0.197 0.138 0.087 0.091 0.168 0.144 [0.139, 0.15] <0.001 40% 0%
34 OpenAI GPT-4 (scratchpad) 0.198 0.136 0.087 0.091 0.167 0.145 [0.14, 0.149] <0.001 29% 0%
35 Anthropic Claude-3-Opus-20240229 (scratchpad with freeze values) 0.210 0.125 0.075 0.080 0.168 0.145 [0.139, 0.151] <0.001 36% 0%
36 Mistral AI Mistral-Large-Latest (zero shot with freeze values) 0.201 0.148 0.085 0.091 0.175 0.146 [0.139, 0.153] <0.001 35% 0%
37 Google Gemini-1.5-Pro (superforecaster with news 3) 0.193 0.172 0.092 0.099 0.182 0.146 [0.14, 0.152] <0.001 34% 0%
38 Google Gemini-1.5-Flash (scratchpad) 0.187 0.163 0.100 0.106 0.175 0.147 [0.141, 0.153] <0.001 37% 0%
39 Google Gemini-1.5-Flash (zero shot with freeze values) 0.225 0.125 0.063 0.068 0.175 0.147 [0.14, 0.154] <0.001 45% 0%
40 Meta Llama-3-70b-Chat-Hf (zero shot) 0.199 0.141 0.090 0.095 0.170 0.147 [0.141, 0.152] <0.001 34% 0%
41 Anthropic Claude-3-5-Sonnet-20240620 (superforecaster with news 3) 0.199 0.142 0.091 0.096 0.171 0.148 [0.142, 0.153] <0.001 38% 1%
42 Anthropic Claude-3-Opus-20240229 (zero shot) 0.192 0.158 0.099 0.104 0.175 0.148 [0.141, 0.155] <0.001 37% 0%
43 Qwen Qwen1.5-110B-Chat (scratchpad with news with freeze values) 0.207 0.118 0.087 0.090 0.163 0.149 [0.143, 0.154] <0.001 27% 0%
44 OpenAI GPT-4-Turbo-2024-04-09 (scratchpad with news) 0.213 0.125 0.080 0.085 0.169 0.149 [0.144, 0.154] <0.001 24% 0%
45 OpenAI GPT-4-Turbo-2024-04-09 (superforecaster with news 3) 0.212 0.127 0.082 0.086 0.169 0.149 [0.144, 0.154] <0.001 27% 5%
46 Mistral AI Mixtral-8x22B-Instruct-V0.1 (scratchpad with freeze values) 0.214 0.141 0.086 0.091 0.178 0.152 [0.146, 0.158] <0.001 28% 0%
47 Anthropic Claude-2.1 (scratchpad) 0.232 0.109 0.069 0.072 0.171 0.152 [0.148, 0.157] <0.001 36% 22%
48 OpenAI GPT-4o (superforecaster with news 3) 0.212 0.125 0.089 0.093 0.169 0.152 [0.147, 0.158] <0.001 28% 3%
49 Anthropic Claude-3-5-Sonnet-20240620 (zero shot) 0.208 0.177 0.092 0.100 0.192 0.154 [0.147, 0.161] <0.001 43% 1%
50 Anthropic Claude-3-5-Sonnet-20240620 (superforecaster with news 1) 0.213 0.138 0.092 0.096 0.175 0.154 [0.149, 0.16] <0.001 34% 11%
51 OpenAI GPT-4-Turbo-2024-04-09 (superforecaster with news 1) 0.216 0.196 0.084 0.095 0.206 0.155 [0.149, 0.161] <0.001 30% 0%
52 OpenAI GPT-4o (superforecaster with news 1) 0.217 0.136 0.089 0.094 0.176 0.155 [0.149, 0.162] <0.001 37% 0%
53 Anthropic Claude-2.1 (scratchpad with freeze values) 0.232 0.148 0.075 0.081 0.190 0.157 [0.151, 0.163] <0.001 34% 17%
54 Mistral AI Mistral-Large-Latest (scratchpad) 0.197 0.184 0.111 0.118 0.191 0.158 [0.152, 0.163] <0.001 28% 0%
55 Qwen Qwen1.5-110B-Chat (zero shot with freeze values) 0.229 0.129 0.082 0.086 0.179 0.158 [0.151, 0.164] <0.001 33% 0%
56 OpenAI GPT-4o (zero shot with freeze values) 0.233 0.117 0.081 0.084 0.175 0.159 [0.151, 0.166] <0.001 41% 0%
57 Google Gemini-1.5-Flash (scratchpad with news with freeze values) 0.218 0.150 0.094 0.099 0.184 0.159 [0.152, 0.165] <0.001 32% 0%
58 Anthropic Claude-3-Opus-20240229 (scratchpad) 0.210 0.153 0.103 0.107 0.182 0.159 [0.153, 0.165] <0.001 34% 0%
59 Qwen Qwen1.5-110B-Chat (scratchpad with news) 0.207 0.148 0.108 0.112 0.177 0.160 [0.154, 0.165] <0.001 27% 0%
60 Meta Llama-3-8b-Chat-Hf (zero shot with freeze values) 0.220 0.163 0.093 0.100 0.191 0.160 [0.153, 0.167] <0.001 43% 0%
61 OpenAI GPT-4o (zero shot) 0.233 0.183 0.079 0.089 0.208 0.161 [0.154, 0.167] <0.001 35% 2%
62 Meta Llama-3-70b-Chat-Hf (scratchpad) 0.208 0.182 0.107 0.114 0.195 0.161 [0.156, 0.166] <0.001 26% 0%
63 Qwen Qwen1.5-110B-Chat (superforecaster with news 1) 0.210 0.231 0.106 0.118 0.221 0.164 [0.158, 0.17] <0.001 32% 11%
64 Mistral AI Mixtral-8x22B-Instruct-V0.1 (scratchpad) 0.214 0.195 0.107 0.115 0.204 0.164 [0.159, 0.17] <0.001 26% 0%
65 Qwen Qwen1.5-110B-Chat (superforecaster with news 3) 0.215 0.164 0.109 0.114 0.190 0.165 [0.16, 0.17] <0.001 26% 2%
66 Mistral AI Mixtral-8x22B-Instruct-V0.1 (zero shot with freeze values) 0.215 0.174 0.110 0.116 0.194 0.165 [0.157, 0.173] <0.001 34% 0%
67 Anthropic Claude-2.1 (scratchpad with news) 0.241 0.179 0.081 0.090 0.210 0.165 [0.16, 0.171] <0.001 33% 17%
68 Mistral AI Mixtral-8x22B-Instruct-V0.1 (zero shot) 0.215 0.153 0.113 0.117 0.184 0.166 [0.159, 0.173] <0.001 31% 0%
69 Mistral AI Mistral-Large-Latest (zero shot) 0.201 0.150 0.131 0.133 0.175 0.167 [0.16, 0.174] <0.001 30% 0%
70 OpenAI GPT-4 (zero shot) 0.222 0.146 0.110 0.113 0.184 0.168 [0.162, 0.174] <0.001 29% 0%
71 Google Gemini-1.5-Flash (superforecaster with news 2) 0.236 0.131 0.097 0.100 0.184 0.168 [0.162, 0.174] <0.001 32% 11%
72 Anthropic Claude-2.1 (scratchpad with news with freeze values) 0.241 0.158 0.089 0.095 0.200 0.168 [0.162, 0.174] <0.001 32% 14%
73 Google Gemini-1.5-Flash (scratchpad with news) 0.218 0.176 0.115 0.121 0.197 0.169 [0.163, 0.176] <0.001 30% 0%
74 Anthropic Claude-3-Opus-20240229 (scratchpad with news with freeze values) 0.231 0.142 0.105 0.108 0.187 0.170 [0.164, 0.176] <0.001 30% 0%
75 Anthropic Claude-3-Opus-20240229 (superforecaster with news 1) 0.218 0.180 0.115 0.121 0.199 0.170 [0.164, 0.176] <0.001 31% 6%
76 Google Gemini-1.5-Pro (superforecaster with news 1) 0.225 0.191 0.106 0.114 0.208 0.170 [0.163, 0.176] <0.001 32% 3%
77 Qwen Qwen1.5-110B-Chat (zero shot) 0.229 0.158 0.113 0.117 0.193 0.173 [0.167, 0.179] <0.001 28% 0%
78 Mistral AI Mixtral-8x22B-Instruct-V0.1 (superforecaster with news 1) 0.230 0.156 0.112 0.116 0.193 0.173 [0.167, 0.179] <0.001 30% 13%
79 Meta Llama-3-8b-Chat-Hf (zero shot) 0.220 0.227 0.119 0.129 0.223 0.175 [0.167, 0.182] <0.001 41% 0%
80 Mistral AI Mixtral-8x22B-Instruct-V0.1 (scratchpad with news with freeze values) 0.233 0.179 0.111 0.117 0.206 0.175 [0.169, 0.181] <0.001 24% 0%
81 Anthropic Claude-3-Opus-20240229 (superforecaster with news 3) 0.222 0.165 0.126 0.130 0.193 0.176 [0.17, 0.182] <0.001 30% 3%
82 Mistral AI Mixtral-8x22B-Instruct-V0.1 (superforecaster with news 3) 0.233 0.176 0.115 0.121 0.204 0.177 [0.172, 0.182] <0.001 24% 7%
83 Anthropic Claude-3-5-Sonnet-20240620 (superforecaster with news 2) 0.225 0.176 0.125 0.130 0.200 0.177 [0.17, 0.184] <0.001 35% 0%
84 Anthropic Claude-3-Opus-20240229 (superforecaster with news 2) 0.229 0.162 0.123 0.127 0.196 0.178 [0.171, 0.185] <0.001 30% 0%
85 Anthropic Claude-3-Opus-20240229 (scratchpad with news) 0.231 0.169 0.120 0.125 0.200 0.178 [0.172, 0.184] <0.001 30% 0%
86 Mistral AI Mixtral-8x22B-Instruct-V0.1 (scratchpad with news) 0.233 0.206 0.116 0.125 0.219 0.179 [0.173, 0.184] <0.001 24% 0%
87 Google Gemini-1.5-Flash (superforecaster with news 3) 0.228 0.203 0.123 0.130 0.215 0.179 [0.173, 0.185] <0.001 27% 6%
88 Anthropic Claude-2.1 (superforecaster with news 2) 0.267 0.164 0.086 0.093 0.215 0.180 [0.173, 0.187] <0.001 38% 22%
89 Mistral AI Mixtral-8x7B-Instruct-V0.1 (scratchpad with freeze values) 0.250 0.143 0.108 0.111 0.197 0.181 [0.174, 0.188] <0.001 30% 10%
90 Mistral AI Mixtral-8x7B-Instruct-V0.1 (zero shot with freeze values) 0.263 0.143 0.095 0.099 0.203 0.181 [0.173, 0.19] <0.001 39% 0%
91 Google Gemini-1.5-Flash (zero shot) 0.225 0.201 0.132 0.138 0.213 0.182 [0.174, 0.189] <0.001 36% 0%
92 Google Gemini-1.5-Pro (superforecaster with news 2) 0.243 0.206 0.112 0.121 0.224 0.182 [0.174, 0.189] <0.001 32% 0%
93 OpenAI GPT-4o (superforecaster with news 2) 0.256 0.161 0.102 0.108 0.208 0.182 [0.175, 0.189] <0.001 31% 2%
94 Mistral AI Mixtral-8x7B-Instruct-V0.1 (scratchpad) 0.250 0.110 0.115 0.115 0.180 0.183 [0.177, 0.189] <0.001 30% 11%
95 Anthropic Claude-2.1 (zero shot with freeze values) 0.243 0.187 0.116 0.123 0.215 0.183 [0.176, 0.191] <0.001 30% 0%
96 Qwen Qwen1.5-110B-Chat (superforecaster with news 2) 0.239 0.193 0.128 0.134 0.216 0.186 [0.18, 0.192] <0.001 26% 2%
97 Mistral AI Mixtral-8x7B-Instruct-V0.1 (superforecaster with news 3) 0.272 0.164 0.096 0.102 0.218 0.187 [0.181, 0.193] <0.001 31% 18%
98 Anthropic Claude-2.1 (superforecaster with news 1) 0.268 0.194 0.102 0.110 0.231 0.189 [0.182, 0.196] <0.001 32% 21%
99 Mistral AI Mixtral-8x22B-Instruct-V0.1 (superforecaster with news 2) 0.248 0.176 0.126 0.131 0.212 0.189 [0.184, 0.195] <0.001 23% 1%
100 OpenAI GPT-4-Turbo-2024-04-09 (superforecaster with news 2) 0.252 0.160 0.123 0.127 0.206 0.190 [0.182, 0.197] <0.001 27% 5%
101 Mistral AI Mistral-Large-Latest (scratchpad with news with freeze values) 0.252 0.198 0.126 0.132 0.225 0.192 [0.186, 0.198] <0.001 24% 0%
102 Mistral AI Mistral-Large-Latest (superforecaster with news 2) 0.233 0.209 0.146 0.152 0.221 0.192 [0.185, 0.2] <0.001 29% 6%
103 Mistral AI Mistral-Large-Latest (superforecaster with news 1) 0.249 0.199 0.131 0.138 0.224 0.193 [0.186, 0.2] <0.001 28% 17%
104 Mistral AI Mistral-Large-Latest (scratchpad with news) 0.252 0.187 0.137 0.142 0.219 0.197 [0.191, 0.203] <0.001 24% 0%
105 Anthropic Claude-2.1 (zero shot) 0.243 0.239 0.147 0.155 0.241 0.199 [0.193, 0.205] <0.001 24% 0%
106 Mistral AI Mixtral-8x7B-Instruct-V0.1 (scratchpad with news) 0.287 0.168 0.111 0.117 0.227 0.202 [0.195, 0.208] <0.001 30% 15%
107 Mistral AI Mixtral-8x7B-Instruct-V0.1 (zero shot) 0.263 0.168 0.138 0.141 0.215 0.202 [0.193, 0.211] <0.001 34% 0%
108 Anthropic Claude-2.1 (superforecaster with news 3) 0.263 0.238 0.131 0.141 0.250 0.202 [0.195, 0.209] <0.001 26% 9%
109 Mistral AI Mixtral-8x7B-Instruct-V0.1 (scratchpad with news with freeze values) 0.287 0.175 0.112 0.118 0.231 0.202 [0.196, 0.209] <0.001 29% 15%
110 Mistral AI Mixtral-8x7B-Instruct-V0.1 (superforecaster with news 1) 0.302 0.163 0.099 0.105 0.232 0.203 [0.197, 0.21] <0.001 33% 18%
111 Meta Llama-2-70b-Chat-Hf (scratchpad with freeze values) 0.262 0.214 0.139 0.146 0.238 0.204 [0.198, 0.209] <0.001 22% 0%
112 Meta Llama-3-8b-Chat-Hf (scratchpad with freeze values) 0.270 0.198 0.132 0.138 0.234 0.204 [0.198, 0.21] <0.001 21% 0%
113 Google Gemini-1.5-Flash (superforecaster with news 1) 0.247 0.251 0.153 0.162 0.249 0.205 [0.197, 0.212] <0.001 26% 12%
114 ForecastBench Always 0.5 0.250 0.250 0.151 0.160 0.250 0.205 [0.202, 0.208] <0.001 22% 0%
115 Meta Llama-2-70b-Chat-Hf (scratchpad) 0.262 0.226 0.143 0.150 0.244 0.206 [0.201, 0.211] <0.001 22% 0%
116 Anthropic Claude-3-Haiku-20240307 (zero shot with freeze values) 0.281 0.152 0.131 0.133 0.216 0.207 [0.2, 0.213] <0.001 21% 0%
117 Mistral AI Mistral-Large-Latest (superforecaster with news 3) 0.273 0.213 0.137 0.144 0.243 0.209 [0.203, 0.215] <0.001 21% 2%
118 Anthropic Claude-3-Haiku-20240307 (scratchpad with freeze values) 0.276 0.187 0.145 0.149 0.231 0.212 [0.206, 0.218] <0.001 20% 0%
119 Mistral AI Mixtral-8x7B-Instruct-V0.1 (superforecaster with news 2) 0.294 0.213 0.124 0.132 0.254 0.213 [0.205, 0.22] <0.001 34% 18%
120 Anthropic Claude-3-Haiku-20240307 (superforecaster with news 2) 0.265 0.206 0.166 0.170 0.236 0.218 [0.211, 0.224] <0.001 21% 4%
121 Meta Llama-3-8b-Chat-Hf (scratchpad) 0.270 0.227 0.161 0.167 0.248 0.218 [0.212, 0.224] <0.001 20% 0%
122 Anthropic Claude-3-Haiku-20240307 (zero shot) 0.281 0.204 0.151 0.156 0.243 0.219 [0.213, 0.224] <0.001 20% 0%
123 ForecastBench Always 0 0.271 0.282 0.163 0.174 0.277 0.223 [0.21, 0.235] <0.001 58% 0%
124 Anthropic Claude-3-Haiku-20240307 (scratchpad) 0.276 0.276 0.162 0.173 0.276 0.224 [0.219, 0.229] <0.001 20% 0%
125 Anthropic Claude-3-Haiku-20240307 (scratchpad with news with freeze values) 0.302 0.191 0.156 0.159 0.247 0.231 [0.225, 0.236] <0.001 19% 0%
126 Anthropic Claude-3-Haiku-20240307 (scratchpad with news) 0.302 0.214 0.168 0.173 0.258 0.237 [0.232, 0.243] <0.001 19% 1%
127 Anthropic Claude-3-Haiku-20240307 (superforecaster with news 3) 0.289 0.257 0.181 0.188 0.273 0.238 [0.233, 0.244] <0.001 21% 12%
128 OpenAI GPT-3.5-Turbo-0125 (scratchpad with freeze values) 0.297 0.270 0.175 0.184 0.283 0.240 [0.235, 0.246] <0.001 19% 0%
129 OpenAI GPT-3.5-Turbo-0125 (scratchpad) 0.297 0.271 0.181 0.189 0.284 0.243 [0.237, 0.249] <0.001 19% 0%
130 Meta Llama-2-70b-Chat-Hf (zero shot with freeze values) 0.302 0.284 0.185 0.194 0.293 0.248 [0.239, 0.257] <0.001 26% 0%
131 Anthropic Claude-3-Haiku-20240307 (superforecaster with news 1) 0.299 0.287 0.210 0.218 0.293 0.258 [0.251, 0.266] <0.001 21% 4%
132 Meta Llama-2-70b-Chat-Hf (zero shot) 0.302 0.308 0.212 0.221 0.305 0.261 [0.254, 0.269] <0.001 24% 1%
133 ForecastBench Random Uniform 0.335 0.369 0.221 0.235 0.352 0.285 [0.275, 0.295] <0.001 29% 0%
134 OpenAI GPT-3.5-Turbo-0125 (zero shot with freeze values) 0.457 0.172 0.172 0.172 0.315 0.315 [0.305, 0.325] <0.001 26% 0%
135 OpenAI GPT-3.5-Turbo-0125 (zero shot) 0.457 0.297 0.245 0.250 0.377 0.354 [0.344, 0.363] <0.001 19% 0%
136 ForecastBench Always 1 0.729 0.718 0.639 0.647 0.723 0.688 [0.673, 0.702] <0.001 20% 0%
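
The three comparison columns in the table (Overall Score 95% CI, the bootstrapped pairwise p-value, and Pct. more accurate than No. 1) can be illustrated with the sketch below. The exact resampling scheme used for the leaderboard is not stated here, so this shows one common choice under explicit assumptions: a percentile bootstrap for the confidence interval and a one-sided paired bootstrap on per-question score differences, centered to impose the null hypothesis of no difference. All function names and parameters are hypothetical. For simplicity the sketch works on a single pooled list of per-question Brier scores, whereas the leaderboard's Overall Score averages the dataset and market means, so a closer version would resample each source separately.

```python
import random
from statistics import fmean


def pct_more_accurate(model_scores, best_scores):
    """Percent of questions where the model's Brier score is strictly lower than the best's."""
    wins = sum(m < b for m, b in zip(model_scores, best_scores))
    return 100.0 * wins / len(model_scores)


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(fmean(rng.choices(scores, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def bootstrap_p_value(model_scores, best_scores, n_boot=2000, seed=0):
    """One-sided paired bootstrap p-value for 'model is no worse than the best forecaster'."""
    rng = random.Random(seed)
    diffs = [m - b for m, b in zip(model_scores, best_scores)]
    observed = fmean(diffs)
    # Center the differences so the resampled population has mean zero (the null).
    centered = [d - observed for d in diffs]
    n = len(diffs)
    hits = sum(fmean(rng.choices(centered, k=n)) >= observed for _ in range(n_boot))
    return hits / n_boot
```

Both lists are assumed to hold scores for the same questions in the same order, which is what makes the comparison paired. A small p-value then indicates that a gap as large as the observed one would be unlikely if the two forecasters were equally accurate.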