2026 - 05

Goal: Find the second converged + profitable strategy. Close the backtest gap. | Phase: 1 (Validate)

Month headline — two things happened, one durable, one clarifying:

  1. Capability (durable): backtesting now runs entirely inside the strategy repo — make backtest, no engine checkout, edit→re-run→see-result. Combined with realistic fills + OOS discipline, the loop is finally fast and trustworthy. This is the engine/strategy separation paying off, and it's permanent.
  2. Clarity (point-in-time): using that loop, two structural verdicts — passive MM on BTC/ETH spot is unviable, and price-derived trend/breakout/reversal signals on BTC/ETH spot are measurably dead. Both first-principles diagnoses, not tuning-quit. MeanReversion is the exception — not dead. The kline backtest is structurally blind to MR's 2-second intra-bar entries, so "offline shows no edge" is uninformative for it; live is the only ground truth, and live is uniformly positive on small n. MR is held for live evidence.

No proven strategy yet, but one live candidate (MR) and a tool that screens the rest cheaply. The June question is whether MR's small-n signal holds at larger n, and which new axis to point the loop at for everything else.

Strategy: Second Converged + Profitable Strategy — 🔶 (MR held live; MM + TF verdicted dead)

The plan was to find a second convergent strategy. Rigorous backtesting (now finally trustworthy — see Backtest workstream) killed two candidate families with first-principles diagnoses, and isolated MeanReversion as the one thread still alive:

  • MeanReversion: NOT dead, held for live evidence. All 4 v19 variants are live-green: +26 to +41 bps, 100% win, on a small sample. The ~25–30 min holds make it an intermediate-horizon reversion. The kline backtest is structurally blind to MR's 2-second intra-bar entries, so "offline shows no edge" is uninformative here, not a verdict. Caveat: baseline + tight_sr fire on near-identical setups, so 6 fills ≈ 3 independent events × 2 symbols. Promising and perfectly positive, but still small n. Live is the only ground truth; held.
  • TrendFollowing: iterated v15 → v18 (breakout, er_capped, S&R variants). Measurably dead — net-negative across 4 regime windows even at zero fees; loss is structural inverted R:R (losers ~2× winners), backtest-able, live-corroborated. The only positive TF cells are er_capped at n=1 that "win" by not trading.
  • MarketMaking: iterated v9 → v15 (vol-gate, volume-spike, inverse-gate, trade-tape). Structurally unviable: per fill ≈ +0.75 bps rebate − 1.7 bps adverse selection = −0.95 bps net. Adverse selection exceeds our OKX maker rebate (a maker fee of −0.75 bps) ~2:1, on a 0.01-0.05 bps spread with no queue priority. Caveat: the v9-v15 cohort modeled maker_fee_bps=0 (ignored the rebate) → quoted ~2 bps off TV. A rebate-aware config (~1 bps quote) is an untested operating point; it probably doesn't flip the verdict (still far from the touch, no queue priority), but it was never measured.
  • AdaptiveDirectional: dead by transitivity on the TF arm. It routes/sizes a dead edge (TF) and an as-yet unproven one (MR). ML arms never beat the regime-luck control. The machinery (regime routing, ML-vol sizing) is reusable if a real edge appears.
  • No P1 incidents — one minor state-drift incident, $0 cost.
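
The per-fill arithmetic in the MM bullet and the inverted R:R in the TF bullet can be checked with a minimal sketch. The function names are hypothetical, and the break-even formula is the standard zero-expectancy condition before fees, not necessarily the exact method used in the research memos.

```python
def maker_net_bps(rebate_bps: float, adverse_bps: float) -> float:
    """Net expected P&L per passive fill in bps: rebate earned minus adverse selection paid."""
    return rebate_bps - adverse_bps


def breakeven_win_rate(avg_win: float, avg_loss: float) -> float:
    """Win rate p where expectancy is zero before fees: p*W - (1-p)*L = 0, so p = L / (W + L)."""
    if avg_win <= 0 or avg_loss <= 0:
        raise ValueError("avg_win and avg_loss must be positive magnitudes")
    return avg_loss / (avg_win + avg_loss)


if __name__ == "__main__":
    # Figures quoted in the MM bullet: +0.75 bps rebate, 1.7 bps adverse selection.
    print(f"MM net per fill: {maker_net_bps(0.75, 1.7):+.2f} bps")
    # TF bullet: losers ~2x winners, so W = 1 and L = 2 (units cancel).
    print(f"TF break-even win rate: {breakeven_win_rate(1.0, 2.0):.1%}")
```

With the quoted figures this gives −0.95 bps per MM fill, and a TF win rate of about 66.7% is needed just to break even at zero fees when losers are twice the size of winners.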

Two versioned research memos document the full evidence chains: MM viability, spot directional viability.

Live P&L: May-created cohort netted -$52 across 84 strategies — $43 of it MarketMaking (the v6→v15 churn), which is the empirical price of the MM-unviability verdict, not random loss. Non-MM families were ~-$9, in line with April. Active strategies near-flat: MM v15 ETH variants marginally positive (invgate +$0.15 / 494 trades / +0.06 bps; tape +$0.14 / 1809 trades) but BTC variants negative + disabled — the ETH "positive" is within noise of the structural ceiling, not edge. MR v19 is the bright spot: all 4 variants green at +26 to +41 bps, 100% win — but on ~3 independent events, far too small to call. No clean profitable month; one live candidate worth extending.
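
Why ~3 independent events is "far too small to call" can be made concrete with a short sketch. It assumes the events are independent and uses the exact one-sided (Clopper-Pearson) lower bound, which has a closed form when every trial is a win; the function names are hypothetical.

```python
def prob_all_wins(n_independent: int, win_prob: float = 0.5) -> float:
    """Chance of n wins in n independent trades if the true win probability is win_prob."""
    if n_independent < 1:
        raise ValueError("n_independent must be at least 1")
    return win_prob ** n_independent


def lower_bound_win_rate(n_independent: int, alpha: float = 0.05) -> float:
    """One-sided exact lower bound on the win rate after n wins in n trials.

    With zero losses, the Clopper-Pearson bound reduces to alpha ** (1 / n).
    """
    if n_independent < 1:
        raise ValueError("n_independent must be at least 1")
    return alpha ** (1.0 / n_independent)


if __name__ == "__main__":
    for n in (3, 6, 30):
        print(n, round(prob_all_wins(n), 4), round(lower_bound_win_rate(n), 3))
```

At n=3 a no-edge coin flip produces a perfect record 12.5% of the time, and the 95% lower bound on the true win rate is only about 0.37. The same perfect record at n=30 would push that bound above 0.90, which is the sense in which the June question is whether the signal holds at larger n.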

Cohort aggregate method + family breakdown: 2026-05-strategy-metrics.md

Detailed strategy metrics: 2026-05-strategy-metrics.md

Backtest: Runs In The Strategy Repo Now — ✅ (the milestone)

The April plan framed this as "close the silent-strategy gap via SkipReason." The real work was bigger, and it's the quiet win of the month: backtesting now runs entirely inside the strategy repo — no engine checkout — and the results are trustworthy enough to kill strategies with confidence. That combination (fast loop + trustworthy results) is what made May's rigorous validation possible at all.

  • make backtest runs in the strategy repo's own venv (#347, phases P0→P4). The backtest ships as a wheel (trading-engine-backtest), vendored into shared_dist/ by a backtest-update workflow and installed by poetry install. The trading-backtest console script runs against the repo's working tree as the live strategy source. Edit a file → re-run make backtest → see the result. No engine clone, no full-engine CI cycle. Supports config overrides (--set), offline CSV replay (--csv-dir), aggTrade tick path (--agg-trade-dir).
  • Trustworthy fills — aggTrade-path spread for faithful TF/MR taker fills; OKX kline + aggTrade replay dumpers for live-venue replay; out-of-process execution to protect the live loop.
  • Research harness — feature-screen IC harness (#353), per-window OOS validation, IC→bps translation. This is what turned "iterate and hope" into "screen for signal, then build."
  • Real OKX fee tier in backtest (#352) — open. Retail fallback (~10× our real rate) was used; changed no verdict (TF dead at zero fees) but blocks trustworthy maker-heavy analysis.
  • Backtest↔live distribution comparator (#348) — open, the formal fidelity loop.
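
The per-window screening idea behind the research harness can be sketched as follows. This is an illustration only, not the harness from #353: it assumes the IC is a Spearman rank correlation between a feature and the forward return, computed on consecutive non-overlapping windows, and every name in it is hypothetical.

```python
from statistics import mean


def _ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_ic(feature, fwd_return):
    """Spearman rank correlation between a feature and the forward return."""
    rx, ry = _ranks(feature), _ranks(fwd_return)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def ic_by_window(feature, fwd_return, window):
    """Rank IC on consecutive non-overlapping windows; a trailing partial window is dropped."""
    n = min(len(feature), len(fwd_return))
    return [rank_ic(feature[s:s + window], fwd_return[s:s + window])
            for s in range(0, n - window + 1, window)]


def sign_flips(ics):
    """True if the per-window ICs do not all share one sign (exact zeros are ignored)."""
    return len({ic > 0 for ic in ics if ic != 0}) > 1
```

A feature whose IC is strongly positive in one window and negative in the next is flagged by `sign_flips`, which is the pattern a pooled IC over the whole history would hide.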

Why this matters more than the verdicts: the verdicts are point-in-time facts about BTC/ETH spot. This loop is a durable capability — every future strategy idea, on any axis we pick for June, gets screened cheaply before it ever touches paper or live. It's the engine/strategy separation finally paying off in iteration speed, and it's also the tool Vicky picks up when she onboards.

ML: VSR Regime Classifier (the build) + Direction Ruled Out (the conclusion) — ✅

The ML side produced both a concrete deliverable and a structural conclusion this month.

  • VSR (Vol-Structure Regime) breakthrough: MCC 0.11 → 0.40, a roughly 3.6× jump to production-grade. Direction-family regime labels failed (MCC 0.076-0.109, never predicts TRENDING_DOWN). Pivoting to a vol-clustering label (TRENDING / RANGING / UNKNOWN, anchored to regime-stationary vol structure) unlocked it: test MCC 0.404, accuracy 0.61, ROC AUC 0.78. CatBoost beat XGBoost + LightGBM (Optuna-tuned, 100 trials each); the result is bar-size invariant and uses 6 selected features. This is a deployable regime gate: the "vol/range is predictable" thesis made real.
  • Direction ruled out, three ways. Range/vol predictions have signal; direction does not — confirmed independently by 4-class regime, signed Donchian (R² 0.197 unsigned → −0.0005 signed), and fwd_ER (dir-acc 0.479, below random). Not a model weakness — OHLCV features are non-stationary (8/10 feature ICs sign-flip across the 2022 bear → 2024 bull) and predict magnitude, not direction.
  • Regime-conditional routing (E4-E7): VSR gate helps regime detection, but within a regime UP/DOWN precision is still ~14pp below the ~52% break-even for 20 bps fees. Confirms the direction ceiling from another angle — bottleneck is that vol-family features are wrong-purpose for direction. Next: direction-tuned features (return lags, RSI, MACD, funding) — which crosses into the non-price/multivariate axis the directional memo flagged as untested.
  • Disciplined rule-out cleanup (the evidence base for the directional verdict): forward_ER deleted, regime_intensity deleted, direction-family replaced, signed Donchian abandoned, OB asymmetry ruled out (4-round diagnostic), OB→price abandoned, order flow too weak. Seven dead targets documented "do not re-attempt."
  • ML infra: Binance futures depth + OKX OB ingestion, model-agnostic Optuna tuning, rolling-mean vol/range baselines, async I/O hardening (a blocking pipeline call had crashed stag via ECS health-probe timeout).
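
For readers less familiar with the headline metric, here is a minimal sketch of the multiclass Matthews correlation coefficient, written from the standard confusion-count formula. It is a reference illustration, not the evaluation code used for VSR, and the labels in the example are made up.

```python
from collections import Counter
from math import sqrt


def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from label sequences.

    Uses MCC = (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2)),
    where c = correct predictions, s = sample count, and p_k / t_k are the
    predicted / true counts for class k. Returns 0.0 when the denominator is zero.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts, p_counts = Counter(y_true), Counter(y_pred)
    sum_pt = sum(p_counts[k] * t_counts[k] for k in set(t_counts) | set(p_counts))
    denom = sqrt((s * s - sum(v * v for v in p_counts.values()))
                 * (s * s - sum(v * v for v in t_counts.values())))
    return 0.0 if denom == 0 else (c * s - sum_pt) / denom


if __name__ == "__main__":
    # A classifier that never predicts the minority class: decent accuracy, zero MCC.
    y_true = ["UP"] * 8 + ["DOWN"] * 2
    y_pred = ["UP"] * 10
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy, mcc(y_true, y_pred))
```

The example shows why MCC is the reported metric: a model that never predicts one class can still post 80% accuracy on imbalanced labels while scoring an MCC of 0, which matches the failure mode noted for the direction-family labels.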

Where VSR fits the verdict: it's a vol-structure signal, not a direction signal — so it doesn't resurrect the directional family. Its home is exactly what the memo names: a routing/gating support layer (TRENDING → momentum, RANGING → mean-reversion) and position sizing. It strengthens the "vol/range predictable" side without contradicting "direction isn't."
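
The routing/gating role described above could look like the following sketch. The three regime labels and the TRENDING → momentum, RANGING → mean-reversion mapping come from the text; the `route` function, the confidence input, and the 0.6 threshold are hypothetical illustrations, not the production interface.

```python
from enum import Enum
from typing import Optional


class Regime(Enum):
    TRENDING = "TRENDING"
    RANGING = "RANGING"
    UNKNOWN = "UNKNOWN"


# Mapping named in the text; the family identifiers are placeholders.
ROUTES = {
    Regime.TRENDING: "momentum",
    Regime.RANGING: "mean_reversion",
}


def route(regime: Regime, confidence: float, min_confidence: float = 0.6) -> Optional[str]:
    """Return the strategy family to enable, or None to stand aside.

    The gate only selects which family may trade. It carries no directional view,
    so it cannot create an edge the routed strategy does not already have.
    """
    if regime is Regime.UNKNOWN or confidence < min_confidence:
        return None
    return ROUTES[regime]
```

For example, `route(Regime.RANGING, 0.8)` returns `"mean_reversion"`, while an UNKNOWN regime or a low-confidence prediction returns `None`, so the default is to do nothing.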

Frontend: Strategy Area Redesign — 🔶

  • Strategy area restructured: Overview → Current · Console · History split (#176, #181); Registry on its own page with wheel-introspection fields (#175); Order & Fill history stat strips reworked for actionable signals (#174)
  • Consumed 17 new read-only engine endpoints (operational state, accounting, market data) + registry/status
  • Original 6 UI issues from the May plan partially absorbed into the redesign; not all closed individually

Ops: Operator Runbook — ✅ (4-month carry, finally shipped)

  • docs/reliability/operator-runbook.md shipped — signal → decision → action index. Status matrix, triage flow per signal class, safe procedures. The Feb-planned, 4× carried deliverable is done.

Incident: OKX Silent Balance Drift (May 17)

Impact: $0 realized. Detected within one verification cycle (~60 min).

What happened: A balance discrepancy surfaced via verification without an obvious triggering event. State reconciled in ~33 min; a Layer-1 code fix to prevent recurrence is pending. Low severity — caught by the hardened verification rules shipped after April's incidents, which is the system working as intended.


Team

MJ (~280 commits — engine 128, strategies 102, frontend 31, inf-trading 15, ml 5)

  • Strategy iteration: MM v9→v15, MR v17→v19, TF v15→v18, StatsArb v17→v19, AdaptiveDirectional v1→v2.
  • Backtest package (#347): extracted standalone trading-engine-backtest, published as wheel, realistic fills, OKX replay, out-of-process execution. The infra that made the month's verdicts trustworthy.
  • Two viability diagnoses: MM (adverse-selection economics) and spot-directional (feature/horizon screen across TF/MR/breakout/AD). First-principles, data-backed, versioned as research memos.
  • Composition refactor (#43, ~15 PRs): deleted BaseStrategy, migrated all strategy families to direct composition — extract PositionGate / EntrySizer / FeeAwareRouter / OutputBuilder / close-logic orchestrator.
  • Operator runbook shipped. Public-trade data flow (gateway + MDS + interface contracts).

Vicky (ML, 4 commits; most ML work is experiments, not committed code)

  • VSR regime classifier — the month's ML headline. MCC 0.11 → 0.40 by pivoting from direction-family labels to a vol-clustering label. Production-grade (test MCC 0.404 / acc 0.61 / AUC 0.78), CatBoost over XGB/LGBM, Optuna-tuned, bar-size invariant. A deployable regime gate.
  • Regime-conditional routing (E4-E7) — built specialist routing on the VSR gate; surfaced that direction precision stays ~14pp under break-even, isolating direction-tuned features as the next lever.
  • Disciplined target rule-out — 7 dead targets documented + deleted (forward_ER, regime_intensity, direction-family, signed Donchian, OB asymmetry 4-round, OB→price, order flow). This rigor is what makes the directional verdict credible.
  • ML infra: Binance futures depth + OKX OB ingestion, Optuna tuning framework, rolling-mean baselines, async I/O fix (blocking call had crashed stag via ECS health-probe timeout).
  • Strategy repo onboarding (#55) — reading + framework familiarization done (local dev on MDM Mac, read AD v1/v2, MR v18, TF v17, MMTV v13, StatsArb v19). AdaptiveDirectional v3 is the first contribution. Role expanding from ML research to owning ML-driven strategies. HR (school/salary) pending.

Gaddafi (11 commits — frontend)

  • Strategy area redesign: Current / Console / History split, dedicated Registry page, reworked history stat strips
  • Consumed new operational endpoints; nav restructure to st.Page registration

Phase Assessment

Still in Phase 1 (Validate). This month changed what Phase 1 means.

Phase 1 as written (2 convergent profitable strategies at min qty on BTC/ETH spot) got sharper this month, not just delayed. Two of the three families we've iterated since February are now verdicted dead here: passive MM can't beat adverse selection, which runs about twice our maker rebate, and trend/breakout/reversal direction doesn't clear cost at any horizon. MeanReversion is the survivor: live-positive on small n, structurally unfalsifiable in our kline backtest, held for live evidence. So we're not at zero candidates; we're at one promising-but-unproven candidate plus a clean diagnosis of what to stop iterating.

What's actually strong now:

  • Infra: backtest is a trustworthy, standalone, fast iteration tool. The moat is real and stress-tested.
  • Methodology: we screen for signal before building, validate against regime-luck, translate IC to bps. We will not burn another four months on a dead path without knowing it early.
  • ML: a deployable, production-grade regime gate (VSR, MCC 0.40) — usable now as a routing/sizing layer for whatever strategy survives.
  • Risk: hardened, one trivial incident, operator runbook shipped.

What's missing is the one thing that matters: a proven edge. MR is the closest we have — two of the three prop-spot theses (passive MM, price-trend direction) are wrong on this setup, while MR is genuinely unproven rather than disproven.

So the position entering June is: one live candidate to confirm or kill (MR), a set of untested axes to probe (carry/funding, wider-spread MM, non-price/ML direction), and — crucially — the tools and discipline to evaluate them cheaply. The concrete plan and priority for that is the June doc; the takeaway here is that we exit May with a sharp map, not a strategy.

Learning

Hunt edges you can measure live, not ones you have to predict. Every dead price signal this year shared one trap: a signal competed to zero is invisible — you can't distinguish "edge gone" from "bad luck," which is exactly what produced the regime-luck mirages. The Jun-1 carry exploration surfaced the epistemic upgrade: a carry edge is directly observable — you read the funding rate off the screen; competed-to-zero is just funding ≈ 0, visible in real time. After a year of inferred price signals, the highest-value reframe isn't a new symbol, it's a new category: prefer edges you can measure over edges you must forecast.
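
To illustrate what "read the funding rate off the screen" means in numbers, here is a small sketch. It assumes an 8-hour funding interval (common on major perpetual venues, but it should be checked per instrument) and simple, non-compounded annualization; the 0.01% rate in the example is illustrative and not a figure from the carry exploration.

```python
def annualized_funding(rate_per_interval: float, interval_hours: float = 8.0) -> float:
    """Simple (non-compounded) annualization of a per-interval funding rate."""
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    return rate_per_interval * (365.0 * 24.0 / interval_hours)


def net_carry(rate_per_interval: float, intervals_held: int, round_trip_cost: float) -> float:
    """Funding collected over the holding period minus entry/exit cost, as a fraction of notional.

    Assumes the rate stays constant over the hold, which is the part that must be
    monitored live rather than forecast.
    """
    return rate_per_interval * intervals_held - round_trip_cost


if __name__ == "__main__":
    # Illustrative only: 0.01% per 8h interval.
    print(f"{annualized_funding(0.0001):.2%} per year")
```

When the edge is competed away, `rate_per_interval` sits near zero and both functions return roughly zero (or negative after costs), so the state of the edge is visible without any inference.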

Every in-sample edge was guilty until validated — and almost all were guilty. A signal looks great on a 90-360d slice, then shrinks to zero or negative under full multi-regime history + non-overlapping samples + realistic cost + per-quarter breakdown. Regime-luck inflation showed up ≥3 times (4h reversal Sharpe +1.31 → −0.42 over 2.5yr the sharpest). The 6-step validation that kills it is now standard — and it cuts both ways: it also flagged that MR is unfalsifiable in our kline backtest (blind to its 2s entries), so "offline shows no edge" was wrongly read as "dead." Not concluding is itself a discipline.
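
The per-window breakdown that exposes regime-luck can be sketched as below. This covers only one piece of the validation (non-overlapping windows with a per-window Sharpe), with hypothetical names; `periods_per_year` depends on the bar size and is an input the caller must supply.

```python
from math import sqrt
from statistics import mean, stdev


def sharpe(returns, periods_per_year: float) -> float:
    """Annualized Sharpe ratio of per-period returns (zero risk-free rate assumed)."""
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    sd = stdev(returns)
    if sd == 0:
        return 0.0
    return mean(returns) / sd * sqrt(periods_per_year)


def sharpe_by_window(returns, window: int, periods_per_year: float):
    """Sharpe on consecutive non-overlapping windows; a trailing partial window is dropped."""
    return [sharpe(returns[s:s + window], periods_per_year)
            for s in range(0, len(returns) - window + 1, window)]


def is_regime_dependent(per_window) -> bool:
    """True if the per-window Sharpe values do not all share one sign."""
    return any(x > 0 for x in per_window) and any(x < 0 for x in per_window)
```

A strategy tuned on one favorable slice will show a large positive Sharpe in that window and flat or negative values elsewhere; reporting the list from `sharpe_by_window` alongside the pooled number makes that visible before anything goes live.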

The backtest earned trust, and trust is what makes a backtest useful. Its job was never to find edge — it's to reject non-edge cheaply and let us believe the rejection. Standalone package + realistic fills + in-repo loop converted it from "results we don't trust" (Mar/Apr) to "results we kill strategies on" (May). Highest-ROI infra of the year.

ML predicts magnitude, not direction — which points off spot entirely. Vol/range is forecastable (VSR hit MCC 0.40); direction isn't, by market structure. Magnitude can't create a spot-directional edge, only modulate one — so the ML edge points at derivatives (where vol is the product) or a routing/sizing support layer. Knowing where a tool does not apply is as valuable as the tool.

Four months bought a map of what doesn't work — plus one live thread and the tools to search fast. Two of three prop-spot theses (passive MM, price-trend) are disproven on BTC/ETH at our fees; MR survives, unproven-not-disproven. The infra, methodology, and risk core are strong. The lesson isn't "we failed" — it's that a year of iteration without this validation rigor would have kept all three theses alive on small-sample hope. We can now kill a bad idea in days, which is the precondition for searching a new playing field efficiently.