Experiment 2 — brazil-bench (24 cells)

Generated 2026-04-14T00:18:52+00:00

Stacks: 24
Runs: 24
Completed: 22
Failed: 2
Metrics: 10
Tokens (total): 33,586,214
Cost (total): $29.85

Stack maturity

Tokens, cost, and duration are per-replicate means. Sorting ascending on any of them surfaces the most efficient stacks at a given quality level.

Maturity  Phase      Stack                                             n    code_quality  tokens (mean)  cost (mean)  duration (mean)  turns (mean)  $/quality
0.750     trial      language=go, model=opus, tooling=none             1/1  1.000         1,098,817      $1.3872      273.5s           24.0          $1.3872
0.750     trial      language=go, model=opus, tooling=beads            1/1  1.000         683,860        $1.2267      268.7s           28.0          $1.2267
0.750     trial      language=go, model=sonnet, tooling=none           1/1  1.000         1,540,294      $1.1782      425.7s           30.0          $1.1782
0.750     trial      language=java, model=opus, tooling=none           1/1  1.000         965,345        $1.2560      218.0s           24.0          $1.2560
0.750     trial      language=java, model=opus, tooling=beads          1/1  1.000         1,669,932      $1.7526      340.6s           39.0          $1.7526
0.750     trial      language=java, model=sonnet, tooling=beads        1/1  1.000         2,779,597      $1.8355      674.0s           57.0          $1.8355
0.708     trial      language=rust, model=opus, tooling=none           1/1  0.833         593,895        $0.8660      174.7s           16.0          $1.0392
0.708     trial      language=clojure, model=opus, tooling=none        1/1  0.833         630,222        $0.8090      178.2s           17.0          $0.9709
0.708     trial      language=rust, model=opus, tooling=beads          1/1  0.833         1,108,771      $1.5749      350.3s           32.0          $1.8899
0.708     trial      language=rust, model=sonnet, tooling=none         1/1  0.833         209,825        $1.1439      471.0s           9.0           $1.3727
0.708     trial      language=rust, model=sonnet, tooling=beads        1/1  0.833         491,969        $1.1087      532.5s           17.0          $1.3304
0.708     trial      language=clojure, model=opus, tooling=beads       1/1  0.833         1,391,321      $1.3912      342.5s           34.0          $1.6694
0.708     trial      language=clojure, model=sonnet, tooling=beads     1/1  0.833         1,811,932      $1.0288      410.3s           49.0          $1.2346
0.708     trial      language=clojure, model=sonnet, tooling=none      1/1  0.833         1,920,625      $1.1249      436.6s           45.0          $1.3498
0.683     trial      language=typescript, model=sonnet, tooling=beads  1/1  0.733         1,556,688      $0.9246      361.9s           43.0          $1.2609
0.683     trial      language=typescript, model=opus, tooling=none     1/1  0.733         919,663        $1.0130      187.5s           23.0          $1.3814
0.683     trial      language=typescript, model=opus, tooling=beads    1/1  0.733         1,022,265      $1.0675      204.8s           27.0          $1.4556
0.683     trial      language=typescript, model=sonnet, tooling=none   1/1  0.733         797,512        $0.7092      274.5s           24.0          $0.9671
0.667     trial      language=python, model=sonnet, tooling=none       1/1  0.667         879,497        $0.7158      328.9s                         $1.0737
0.667     trial      language=python, model=opus, tooling=none         1/1  0.667         580,884        $0.7256      149.3s           16.0          $1.0884
0.667     trial      language=python, model=opus, tooling=beads        1/1  0.667         1,625,376      $1.7259      348.8s           44.0          $2.5888
0.667     trial      language=python, model=sonnet, tooling=beads      1/1  0.667         2,113,900      $1.2469      482.8s           49.0          $1.8704
0.150     candidate  language=java, model=sonnet, tooling=none         0/1  n/a           4,013,637      $2.3117      779.7s           61.0
0.150     candidate  language=go, model=sonnet, tooling=beads          0/1  n/a           3,180,387      $1.7240      506.8s           61.0
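
The sorting advice and the $/quality column can be sketched in a few lines of Python. The rows below are copied from the table; the tuple layout and helper names are hypothetical, and reading $/quality as cost (mean) divided by code_quality is an assumption inferred from the displayed values. They agree to within display rounding when 0.833 is treated as a rounded 5/6.

```python
# Rows copied from the stack table: (stack, code_quality, mean cost in USD).
# The tuple layout is a hypothetical convenience, not retort's export schema.
rows = [
    ("go/opus/none", 1.0, 1.3872),
    ("go/opus/beads", 1.0, 1.2267),
    ("go/sonnet/none", 1.0, 1.1782),
    ("java/opus/none", 1.0, 1.2560),
    ("rust/opus/none", 5 / 6, 0.8660),     # shown as 0.833; 5/6 is an assumption
    ("clojure/opus/none", 5 / 6, 0.8090),  # shown as 0.833; 5/6 is an assumption
]


def cost_per_quality(cost, quality):
    """Assumed definition of $/quality: mean cost divided by code_quality."""
    return cost / quality


def cheapest_at_quality(rows, quality, tol=1e-9):
    """Return the rows at one quality level, sorted by ascending cost."""
    matching = [r for r in rows if abs(r[1] - quality) < tol]
    return sorted(matching, key=lambda r: r[2])


best = cheapest_at_quality(rows, 1.0)[0]
print(best[0], round(cost_per_quality(best[2], best[1]), 4))
print(round(cost_per_quality(0.8660, 5 / 6), 4))
```

Filtering on quality before sorting keeps a cheap but lower-scoring stack from being ranked ahead of a more expensive stack that scored higher.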

retort · maturity = 0.30·agreement + 0.30·completion + 0.25·score + 0.15·coverage
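
As a minimal sketch, the maturity formula is a plain weighted sum. The weights come from the line above; the function name, the assumption that each component lies in [0, 1], and the sample inputs are all hypothetical, since the report does not list per-stack component values.

```python
# Weights copied from the maturity formula in the report.
WEIGHTS = {"agreement": 0.30, "completion": 0.30, "score": 0.25, "coverage": 0.15}


def maturity(agreement, completion, score, coverage):
    """Weighted sum of the four components (each assumed to lie in [0, 1])."""
    parts = {
        "agreement": agreement,
        "completion": completion,
        "score": score,
        "coverage": coverage,
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


# Hypothetical inputs, not values taken from the report.
print(round(maturity(agreement=0.5, completion=1.0, score=0.8, coverage=0.4), 3))
```

The maturity gap between the stacks with code_quality 1.000 and those with 0.833 (0.750 versus 0.708) is about 0.25 × 1/6, which would be consistent with score tracking code_quality, although the report does not state this.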

ANOVA

Output of retort analyze run on the exported CSV. Factors significant at p < 0.1 are listed at the bottom of each response section.

============================================================
Response: code_quality    transform: log10(y)
R² = 1.0000  Adj R² = 1.0000
============================================================
                   sum_sq    df             F         PR(>F)
C(language)  8.535518e-02   5.0  9.957970e+29  2.474325e-206
C(model)     2.831117e-32   1.0  1.651463e+00   2.196141e-01
C(tooling)   1.966053e-34   1.0  1.146849e-02   9.162363e-01
Residual     2.400032e-31  14.0           NaN            NaN

Significant (p < 0.1): language
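
The R² of 1.0000 and the F value near 1e30 look like a degenerate fit rather than an unusually strong effect. In the stack table every completed run in a given language has the same code_quality, so language accounts for all of the variance and the residual sum of squares (2.4e-31) is at floating-point noise level. A small check, with values copied from the table and a hypothetical data layout:

```python
from collections import defaultdict

# (language, code_quality) for the 22 completed runs, copied from the stack table.
runs = (
    [("go", 1.000)] * 3
    + [("java", 1.000)] * 3
    + [("rust", 0.833)] * 4
    + [("clojure", 0.833)] * 4
    + [("typescript", 0.733)] * 4
    + [("python", 0.667)] * 4
)

# Collect the distinct code_quality values seen for each language.
by_language = defaultdict(set)
for language, quality in runs:
    by_language[language].add(quality)

# True when every language maps to exactly one code_quality value, i.e. there is
# no within-language variation left for model or tooling to explain.
constant_within_language = all(len(values) == 1 for values in by_language.values())
print(constant_within_language)
```

With no within-language variation, the model and tooling rows in this section are probably best read as uninformative rather than as evidence of no effect.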

============================================================
Response: _tokens    transform: log10(y)
R² = 0.6020  Adj R² = 0.4029
============================================================
               sum_sq    df         F    PR(>F)
C(language)  0.536723   5.0  2.739070  0.062773
C(model)     0.046457   1.0  1.185415  0.294644
C(tooling)   0.230278   1.0  5.875925  0.029479
Residual     0.548662  14.0       NaN       NaN

Significant (p < 0.1): language, tooling
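
The F column can be recomputed from the sum_sq and df columns as the factor mean square divided by the residual mean square. The sketch below uses the _tokens values above; the function and variable names are hypothetical. The residual df of 14 is consistent with 22 completed runs minus one intercept and 5 + 1 + 1 factor degrees of freedom.

```python
# (sum_sq, df) per factor, copied from the _tokens ANOVA table.
table = {
    "language": (0.536723, 5),
    "model": (0.046457, 1),
    "tooling": (0.230278, 1),
}
residual_ss, residual_df = 0.548662, 14


def f_statistic(sum_sq, df, residual_ss, residual_df):
    """F = (factor sum_sq / factor df) / (residual sum_sq / residual df)."""
    return (sum_sq / df) / (residual_ss / residual_df)


for factor, (sum_sq, df) in table.items():
    print(factor, round(f_statistic(sum_sq, df, residual_ss, residual_df), 3))
```

The recomputed values agree with the printed F column to about three decimal places; the small remaining differences come from the rounded inputs.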

============================================================
Response: _cost_usd    transform: log10(y)
R² = 0.6531  Adj R² = 0.4796
============================================================
               sum_sq    df          F    PR(>F)
C(language)  0.097462   5.0   2.451931  0.085321
C(model)     0.004787   1.0   0.602110  0.450685
C(tooling)   0.089329   1.0  11.236655  0.004744
Residual     0.111298  14.0        NaN       NaN

Significant (p < 0.1): language, tooling

============================================================
Response: _duration_seconds    transform: log10(y)
R² = 0.8547  Adj R² = 0.7820
============================================================
               sum_sq    df          F    PR(>F)
C(language)  0.087041   5.0   2.611892  0.071824
C(model)     0.357773   1.0  53.679806  0.000004
C(tooling)   0.123530   1.0  18.534238  0.000726
Residual     0.093309  14.0        NaN       NaN

Significant (p < 0.1): language, model, tooling

============================================================
Response: _turns    transform: log10(y)
R² = 0.7036  Adj R² = 0.5439
============================================================
               sum_sq    df          F    PR(>F)
C(language)  0.267959   5.0   2.807060  0.062101
C(model)     0.043085   1.0   2.256755  0.156929
C(tooling)   0.242839   1.0  12.719571  0.003446
Residual     0.248193  13.0        NaN       NaN

Significant (p < 0.1): language, tooling

============================================================
Response: test_coverage    transform: log10(y+1)
R² = 0.6104  Adj R² = 0.4157
============================================================
               sum_sq    df         F    PR(>F)
C(language)  0.220742   5.0  4.120960  0.016429
C(model)     0.011791   1.0  1.100608  0.311907
C(tooling)   0.002332   1.0  0.217711  0.647968
Residual     0.149984  14.0       NaN       NaN

Significant (p < 0.1): language