Experiment 1 — Final Results (6 languages)

Generated 2026-04-13T21:50:10+00:00

Stacks: 24
Runs: 73
Completed: 67
Failed: 6
Metrics: 4
Tokens (total): 25,753,820
Cost (total): $25.07

Stack maturity

Tokens, cost, and duration are per-replicate means; sort ascending on any of them to find the most efficient stacks at a given quality level.

Maturity  Phase       Stack                                             n    code_quality  tokens (mean)  cost (mean)  duration (mean)  turns (mean)  $/quality
1.000     production  language=go, model=sonnet, tooling=beads          3/3  1.000         476,955        $0.3110      146.6s           -             $0.3110
1.000     production  language=java, model=opus, tooling=none           3/3  1.000         217,163        $0.4364      131.1s           -             $0.4364
1.000     production  language=java, model=opus, tooling=beads          3/3  1.000         325,112        $0.5525      149.6s           -             $0.5525
1.000     production  language=java, model=sonnet, tooling=none         3/3  1.000         494,116        $0.3259      151.6s           -             $0.3259
1.000     production  language=java, model=sonnet, tooling=beads        3/3  1.000         611,395        $0.3646      181.6s           -             $0.3646
0.989     production  language=go, model=sonnet, tooling=none           3/3  0.956         435,374        $0.3029      123.4s           -             $0.3169
0.970     production  language=go, model=opus, tooling=beads            3/3  0.985         346,216        $0.4908      117.0s           -             $0.4981
0.958     production  language=rust, model=opus, tooling=none           3/3  0.833         150,702        $0.3314      106.1s           -             $0.3977
0.958     production  language=rust, model=opus, tooling=beads          3/3  0.833         355,100        $0.4808      142.6s           -             $0.5770
0.958     production  language=rust, model=sonnet, tooling=none         3/3  0.833         395,257        $0.3551      194.1s           -             $0.4261
0.958     production  language=clojure, model=opus, tooling=none        3/3  0.833         409,366        $0.5790      178.7s           -             $0.6948
0.933     production  language=typescript, model=opus, tooling=none     3/3  0.733         168,703        $0.3187      181.2s           -             $0.4346
0.933     production  language=typescript, model=opus, tooling=beads    3/3  0.733         454,220        $0.5119      219.1s           -             $0.6980
0.924     production  language=go, model=opus, tooling=none             3/3  0.963         230,498        $0.3611      93.5s            -             $0.3750
0.869     production  language=python, model=sonnet, tooling=none       3/3  0.637         332,390        $0.2257      73.8s            -             $0.3543
0.858     production  language=typescript, model=sonnet, tooling=beads  3/4  0.733         637,683        $0.3812      167.7s           -             $0.5198
0.808     trial       language=rust, model=sonnet, tooling=beads        2/3  0.833         643,793        $0.4141      207.6s           -             $0.4969
0.808     trial       language=clojure, model=opus, tooling=beads       2/3  0.833         723,724        $0.7618      201.4s           -             $0.9142
0.808     trial       language=clojure, model=sonnet, tooling=beads     2/3  0.833         722,940        $0.5204      259.1s           -             $0.6245
0.808     trial       language=clojure, model=sonnet, tooling=none      2/3  0.833         665,636        $0.5747      310.1s           -             $0.6896
0.791     trial       language=python, model=sonnet, tooling=beads      3/3  0.696         436,754        $0.2617      110.2s           -             $0.3758
0.789     trial       language=python, model=opus, tooling=beads        3/3  0.672         280,360        $0.3734      79.1s            -             $0.5554
0.783     trial       language=typescript, model=sonnet, tooling=none   2/3  0.733         835,319        $0.5314      281.1s           -             $0.7246
0.736     trial       language=python, model=opus, tooling=none         3/3  0.789         91,698         $0.2034      44.0s            -             $0.2579
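
The report does not define the $/quality column, but the rows are consistent with mean cost divided by code_quality (for example, $0.3029 / 0.956 is about $0.317). The sketch below assumes that definition and shows one way to rank stacks by it above a quality threshold. The function names and the shortened stack labels are hypothetical; the numbers are copied from four rows of the table.

```python
# Rows copied from the stack maturity table: (stack label, code_quality, mean cost in USD).
# The short labels (language/model/tooling) are an illustrative shorthand.
ROWS = [
    ("go/sonnet/beads", 1.000, 0.3110),
    ("go/sonnet/none", 0.956, 0.3029),
    ("python/sonnet/none", 0.637, 0.2257),
    ("java/opus/none", 1.000, 0.4364),
]


def cost_per_quality(cost_usd, code_quality):
    """Assumed definition of $/quality: mean cost divided by code_quality."""
    if code_quality <= 0:
        raise ValueError("code_quality must be positive")
    return cost_usd / code_quality


def cheapest_at_quality(rows, min_quality):
    """Keep rows with code_quality >= min_quality, cheapest $/quality first."""
    eligible = [row for row in rows if row[1] >= min_quality]
    return sorted(eligible, key=lambda row: cost_per_quality(row[2], row[1]))


if __name__ == "__main__":
    for stack, quality, cost in cheapest_at_quality(ROWS, min_quality=0.95):
        print(f"{stack}: ${cost_per_quality(cost, quality):.4f}")
```

Because the table shows rounded inputs, values recomputed this way can differ from the printed $/quality in the last digit.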

retort · maturity = 0.30·agreement + 0.30·completion + 0.25·score + 0.15·coverage
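
The maturity formula is a weighted sum whose weights add up to 1.0. As a minimal sketch, assuming each of the four components is a fraction in [0, 1] (the report does not state their ranges or how they are measured), it could be computed as follows. The function name and the example inputs are hypothetical.

```python
# Weights taken from the maturity formula in the report.
WEIGHTS = {
    "agreement": 0.30,
    "completion": 0.30,
    "score": 0.25,
    "coverage": 0.15,
}


def maturity(agreement, completion, score, coverage):
    """Weighted sum of the four components; each is assumed to lie in [0, 1]."""
    components = {
        "agreement": agreement,
        "completion": completion,
        "score": score,
        "coverage": coverage,
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * value for name, value in components.items())


if __name__ == "__main__":
    # Hypothetical inputs, not taken from any row of the table.
    print(f"{maturity(1.0, 2 / 3, 0.8, 1.0):.3f}")
```

Under this assumption a stack reaches 1.000 only when all four components are 1.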

ANOVA

Output of retort analyze on the exported CSV. Factors significant at p < 0.1 are listed at the end of each response section.

============================================================
Response: code_quality    transform: log10(y)
R² = 0.8454  Adj R² = 0.8270
============================================================
                   sum_sq    df          F        PR(>F)
C(language)  2.433865e-01   5.0  64.428644  1.260303e-22
C(model)     6.148193e-04   1.0   0.813767  3.706769e-01
C(tooling)   7.135823e-07   1.0   0.000944  9.755866e-01
Residual     4.457583e-02  59.0        NaN           NaN

Significant (p < 0.1): language
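
As a quick consistency check, each F value in a table like the one above can be recomputed from its sum_sq and df columns: F is the factor's mean square divided by the residual mean square. The sketch below does this for the code_quality table using the printed values. The function name is hypothetical, and p-values are left out because they need an F-distribution CDF that the Python standard library does not provide.

```python
def f_statistic(factor_sum_sq, factor_df, resid_sum_sq, resid_df):
    """F = (factor_sum_sq / factor_df) / (resid_sum_sq / resid_df)."""
    if factor_df <= 0 or resid_df <= 0:
        raise ValueError("degrees of freedom must be positive")
    return (factor_sum_sq / factor_df) / (resid_sum_sq / resid_df)


# sum_sq and df values copied from the code_quality ANOVA table.
RESIDUAL = (4.457583e-02, 59.0)
FACTORS = {
    "language": (2.433865e-01, 5.0),
    "model": (6.148193e-04, 1.0),
    "tooling": (7.135823e-07, 1.0),
}

if __name__ == "__main__":
    for name, (sum_sq, df) in FACTORS.items():
        print(f"{name}: F = {f_statistic(sum_sq, df, *RESIDUAL):.6f}")
```

The recomputed values should agree with the F column up to the rounding of the printed sums of squares.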

============================================================
Response: _tokens    transform: log10(y)
R² = 0.7425  Adj R² = 0.7050
============================================================
               sum_sq    df          F        PR(>F)
C(language)  0.585702   5.0   7.218186  4.155562e-05
C(model)     1.099493   1.0  67.750717  9.877011e-11
C(tooling)   0.541440   1.0  33.363505  5.500229e-07
Residual     0.778968  48.0        NaN           NaN

Significant (p < 0.1): language, model, tooling

============================================================
Response: _cost_usd    transform: log10(y)
R² = 0.7031  Adj R² = 0.6598
============================================================
               sum_sq    df          F        PR(>F)
C(language)  0.512915   5.0  16.710178  1.550257e-09
C(model)     0.089848   1.0  14.635766  3.765549e-04
C(tooling)   0.098682   1.0  16.074650  2.116749e-04
Residual     0.294670  48.0        NaN           NaN

Significant (p < 0.1): language, model, tooling

============================================================
Response: _duration_seconds    transform: log10(y)
R² = 0.7673  Adj R² = 0.7333
============================================================
               sum_sq    df          F        PR(>F)
C(language)  1.208637   5.0  26.687311  8.496877e-13
C(model)     0.204070   1.0  22.529828  1.900173e-05
C(tooling)   0.036614   1.0   4.042288  5.000990e-02
Residual     0.434773  48.0        NaN           NaN

Significant (p < 0.1): language, model, tooling