2026-04
Goal: First clean profitable month. Stay in Phase 1. | Phase: 1 (Validate)
Strategy: Prove Strategy Edge — Clean P&L — ❌
Active v10 strategies modestly positive but unproven. Need a full month of intentional profit, no incident windfalls.
- Run live for full month without interruption — broken by Apr 24 + Apr 30 incidents
- MeanReversion: now showing real promise — multiple v11–v13 variants reached convergent win rates (60–70%) and modest positive P&L despite incident churn. Previously only TF showed any promise; March's "MR became best" call is now corroborated across more variants
- TrendFollowing: diagnose regression — iterated v11→v14, current v14 modest (+$0.90 net)
- MarketMaking: iterate or park — chose iterate. v1 → v2 (TVMM) → v3 (MMTV). Volume scaling working
| Metric | Value |
|---|---|
| Active strategies P&L | +$1.78 |
| April-created total | ~-$8.87 |
| All-time cumulative | ~+$52 (includes incident windfalls of +$113 in Mar and +$23 in Apr) |
| Excluding incidents | ~-$84 all-time |
| Best active | MR_v13_ETH: +$1.41, 30 trades, 70% WR |
| MM volume (v3) | $63K across 1,320 trades |
What worked: MarketMaking got real volume — MMTV_v3 traded $63K in 2 days at near-flat P&L, validating the
quoting infra. MeanReversion still profitable in live (carried from March). High iteration count this month: 14
strategy versions across 5 types (StatsArb v9→v15, TrendFollowing v11→v14, MR v11→v13, MMTV v1→v3, TVMM v1).
What didn't: Two new incidents broke the "clean month" goal. April-created strategies net -$9 — broad iteration
without backtest filter is expensive. StatsArb still net negative across all April versions. MMTV v2 hit -$4.22
before fix shipped.
Detailed strategy metrics: 2026-04-strategy-metrics.md
Backtest: Accuracy — Unlock Efficient Iteration — 🔶
March iterated v3→v10 via live + paper only. Each iteration costs real money (live) or time (paper wait). Backtest infrastructure is fast (~69x speedup) but results don't match live — can't be trusted for strategy decisions yet.
Making backtest accurate is the highest-leverage investment: it converts strategy iteration from days + dollars to minutes + zero cost. Without it, every parameter change requires deploying to paper/live and waiting.
- OOM crash fixed — RSS bounded by retention cap. Latest backtests complete cleanly
- Discovery phase: 6 backtest-vs-live bugs fixed across 8 PRs (#199–#211). ROI now computes, POST_ONLY supported, ATR no longer wiped by look-ahead guard, ml_vol no longer falls back to ATR. SkipReason telemetry + comparator tool shipped
- Investigate paper vs live divergence — partially done as part of discovery. Remaining gap: 6/9 strategies still produce zero actionable decisions in backtest where they trade actively in live. Skip-reason data not yet analyzed
- Calibrate backtest against known live results — premature until silent-strategy gap closed
- Define "backtest-validated" gate for paper/live promotion — needs threshold decision (trade count ±X%, PnL ±Y%, directional accuracy?)
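One possible shape for the promotion gate in the last bullet is sketched below. It is a sketch under stated assumptions, not the project's implementation: `RunStats`, `passes_backtest_gate`, and both tolerance defaults are hypothetical placeholders for the X/Y thresholds that are still undecided.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Summary of one strategy run (hypothetical shape)."""
    trade_count: int
    pnl_usd: float


def passes_backtest_gate(
    backtest: RunStats,
    live: RunStats,
    trade_count_tol: float = 0.25,  # placeholder for the undecided "±X%"
    pnl_tol: float = 0.50,          # placeholder for the undecided "±Y%"
) -> bool:
    """Return True when the backtest tracks live closely enough to trust."""
    # Without live trades there is nothing to calibrate against.
    if live.trade_count == 0:
        return False
    # A strategy that is silent in backtest but active in live fails outright.
    if backtest.trade_count == 0:
        return False

    trade_gap = abs(backtest.trade_count - live.trade_count) / live.trade_count
    # Guard against dividing by a near-zero live P&L.
    pnl_scale = max(abs(live.pnl_usd), 1e-9)
    pnl_gap = abs(backtest.pnl_usd - live.pnl_usd) / pnl_scale
    # Directional agreement: both runs profitable or both unprofitable.
    same_sign = (backtest.pnl_usd >= 0) == (live.pnl_usd >= 0)

    return trade_gap <= trade_count_tol and pnl_gap <= pnl_tol and same_sign
```

Under this shape, the 6/9 silent strategies would fail the gate on the zero-trade check before any P&L comparison runs, which matches the ordering of the remaining work above.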
Latest snapshot (backtest 2245, v1.80.0, just completed): ROI -0.33%, sim_end 2026-05-01,
traded 3/9 strategies.
ROI computes correctly. POST_ONLY supported. PR #211 plumbing in place but skip-reason distributions not yet read.
Remaining gap: 6/9 silent strategies — root cause unclear until skip-reason data is analyzed.
April iteration was still done via live + paper because backtest accuracy hadn't crossed the trust threshold. Accuracy moved much closer to that bar this month, though.
ML: Signal — Beyond the Ceiling — ✅
Direction prediction capped at ~50% with current features. Shift focus.
- Range regression breakthrough: R² 0.37, MAE 19.23 bps, direction accuracy 77.3% — exceeds 70% profitability threshold for the first time
- Volatility regression improved: R² 0.28 (up from 0.17), direction acc 68.5% (up from 63%)
- Architecture comparison resolved: TimesNet removed. CatBoost wins. VOL_ADJUSTED/QUANTILE variants pruned
- Pipeline overhaul: walk-forward CV, purged CV + embargo (no leakage), 3-way split, mRMR with feature group caps, rolling stability audit (sign flip rate, regime consistency), CV stability penalty
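To illustrate the purged CV + embargo item, here is a minimal sketch of how such splits can be generated for time-ordered rows. The function name `purged_kfold_indices` and its parameters are hypothetical; the actual pipeline may define folds, the purge window, and the embargo differently.

```python
from typing import Iterator, List, Tuple


def purged_kfold_indices(
    n_samples: int,
    n_splits: int,
    label_horizon: int,
    embargo: int,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for contiguous, time-ordered folds.

    Training rows within `label_horizon` before the test block are purged
    (their labels would overlap the test window), and rows within `embargo`
    after the test block are dropped as an extra buffer.
    """
    fold_size = n_samples // n_splits
    for k in range(n_splits):
        test_start = k * fold_size
        # The last fold absorbs any remainder rows.
        test_end = n_samples if k == n_splits - 1 else test_start + fold_size
        excluded_start = max(0, test_start - label_horizon)
        excluded_end = min(n_samples, test_end + embargo)
        train = [
            i for i in range(n_samples)
            if i < excluded_start or i >= excluded_end
        ]
        test = list(range(test_start, test_end))
        yield train, test
```

The purge and embargo shrink the training set slightly on each fold; that is the cost of keeping label windows from leaking across the train/test boundary.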
Key finding: Proper feature selection + leakage control + R²-aligned scoring (0.7·R² + 0.3·Corr, was 0.4/0.6)
unlocked the gain. Range and volatility are distinct targets requiring independent models. New targets added:
funding_rate_direction, funding_rate_regression, kline_range_regression_bps. Time/calendar features
(hour_sin/cos, day_of_week, funding settlement timing) added.
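The scoring change and the CV stability penalty can be written down in a few lines. This is a sketch assuming the per-fold R² and correlation are already computed; the function names, the default `alpha`, and the use of population standard deviation are assumptions, since the report only gives the weights and the form mean − α·std.

```python
from statistics import mean, pstdev
from typing import Sequence


def fold_score(r2: float, corr: float,
               w_r2: float = 0.7, w_corr: float = 0.3) -> float:
    """Blend R² and correlation for one fold (April weights by default)."""
    return w_r2 * r2 + w_corr * corr


def stability_penalized_score(fold_scores: Sequence[float],
                              alpha: float = 0.5) -> float:
    """Mean fold score minus alpha times its spread across folds.

    `alpha` is a hypothetical default. Population std is used here; the
    real pipeline may use the sample std instead.
    """
    return mean(fold_scores) - alpha * pstdev(fold_scores)
```

Passing `w_r2=0.4, w_corr=0.6` reproduces the earlier weighting, which makes it easy to compare how the two schemes rank the same set of candidate models.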
Strategic implication: We now have a model with profitability-grade direction signal. May should put it into a strategy and validate live.
AI Agent: Operator Role — 🔶
Monitor agent exists. Make it genuinely useful.
- Architecture pivot: server-side scheduled agent → local on-demand agent — recurring LLM API spend on scheduled server runs was uneconomic. Local agent runs on operator demand or cron task.
- Incident detection → suggested remediation (not just alerting)
- Avoid false negatives and provide better context so suggestions are useful
Ops: Risk & Operational Maturity — 🔶
- Incident write-ups + ledger: 3 incidents documented this month (`docs/reliability/incidents/`), `incident_ledger.csv` updated. Reactive documentation working
- Operator runbook at `docs/reliability/operator-runbook.md`: status matrix (what each strategy status means and the appropriate operator action), triage flow per signal class, safe procedures (change risk limit, pause strategy, run `reconcile_positions`), incident-class quick reference cross-linking per-incident docs
13 reactive remediations shipped this month from the Apr 24 and Apr 30 incidents. Hardening keeps pace with incidents, but proactive operator-facing docs (the runbook) are still missing, so live operations depend on tribal knowledge.
Incidents
Apr 24 — MarketMaking quotes drifted away from market
Impact: P2 — $0 lost. The exchange rejected 45 of our orders before any traded.
What happened: A new MarketMaking strategy started quoting prices that drifted further and further above the
actual market — about $12 above by the time the exchange (OKX) refused them for being too far outside the allowed
price band. Two related issues amplified the problem: when the strategy hit an error and was supposed to stay paused,
the system kept auto-resuming it without an operator's say-so; and a safety counter that should have throttled the
strategy didn't reset between pause/resume cycles.
Fixes shipped: Quotes are now clamped to a safe distance from the live market price. Strategies that error out because of risk-block or rejection now stay paused until an operator clears them — only transient errors auto-resume.
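The two fixes can be sketched as follows. This is an illustration only: `clamp_quote`, `ErrorReason`, `may_auto_resume`, and the `max_distance_frac` parameter are hypothetical names, and the real clamp may be defined against the venue's price band rather than a fixed fraction of the market price.

```python
from enum import Enum


def clamp_quote(quote_price: float, market_price: float,
                max_distance_frac: float) -> float:
    """Pull a quote back inside a band around the live market price."""
    lower = market_price * (1.0 - max_distance_frac)
    upper = market_price * (1.0 + max_distance_frac)
    return min(max(quote_price, lower), upper)


class ErrorReason(Enum):
    """Why a strategy entered the error state (hypothetical categories)."""
    TRANSIENT = "transient"
    RISK_BLOCK = "risk_block"
    ORDER_REJECTED = "order_rejected"


def may_auto_resume(reason: ErrorReason) -> bool:
    """Only transient errors resume without an operator clearing them."""
    return reason is ErrorReason.TRANSIENT
```

Keeping the resume decision keyed on the error reason means a risk-block or rejection pause cannot be silently undone by a generic retry loop.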
Apr 30 — Dust Auto-Promotion Phantom Closes (old strategies traded by accident)
Impact: P1 — +$23 (luck — could easily have been -$3,000). Roughly $756 of unintended trades across
15 strategies in 14 seconds, including strategies that had been retired for weeks.
What happened: A startup process meant to clean up small leftover position fragments ("dust") had a bug that treated those fragments as real open positions and fired close orders on them. Because BTC and ETH happened to be moving the right way, the unintended trades came out slightly profitable — not skill, just luck.
The deeper finding: those leftover fragments had been silently accumulating for weeks (~$800 worth, dating back
~11 weeks). Stopping a strategy doesn't automatically flatten its positions, so retired strategies kept invisible
exposure on the venue.
Fixes shipped: The buggy startup process was reverted. Three new health checks now catch this pattern (positions not matching their fill history, retired strategies that aren't actually flat, costs that look unreasonable). Two new operator tools were added: a dust report (audit what's stranded) and a close-dust endpoint (clean it up safely). Past-strategy auto-close was hardened to handle leftover fragments and alert on failure.
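A minimal sketch of the first two health checks (position versus fill history, and retired strategies that are not flat) is shown below. The `Fill` shape, function names, and tolerance are hypothetical; the shipped verification rules may be structured differently, and the cost-sanity check is not covered here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Fill:
    strategy_id: str
    signed_qty: float  # positive for buys, negative for sells


def positions_from_fills(fills: Iterable[Fill]) -> Dict[str, float]:
    """Rebuild each strategy's net position by summing its signed fills."""
    net: Dict[str, float] = defaultdict(float)
    for fill in fills:
        net[fill.strategy_id] += fill.signed_qty
    return dict(net)


def find_violations(
    fills: Iterable[Fill],
    reported: Dict[str, float],
    retired: Set[str],
    tolerance: float = 1e-9,
) -> List[str]:
    """Return human-readable findings for the two position checks."""
    rebuilt = positions_from_fills(fills)
    findings: List[str] = []
    for strategy_id in sorted(set(rebuilt) | set(reported)):
        expected = rebuilt.get(strategy_id, 0.0)
        actual = reported.get(strategy_id, 0.0)
        if abs(expected - actual) > tolerance:
            findings.append(f"{strategy_id}: position does not match fill history")
        if strategy_id in retired and abs(actual) > tolerance:
            findings.append(f"{strategy_id}: retired but not flat")
    return findings
```

Comparing the full rebuilt position against the reported one, rather than only a counter, is what lets a check like this surface small stranded fragments before they accumulate.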
Apr 30 — OKX brief connectivity outages (no impact, recorded for awareness)
Impact: $0. Two short OKX connectivity blips ~2 hours apart. The system recovered automatically.
Why filed: During the outages, the system retried order cancellations very aggressively against an already-struggling exchange. Also, the engine treats "real execution failure" and "we lost the confirmation message but the order was fine" as the same kind of error — a few coincident OKX outages could falsely escalate a healthy strategy to KILL. No fix this month; captured as backlog for cancellation backoff and error-counter splitting.
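The two backlog items could look roughly like this. Nothing here has shipped: the function and enum names, the delay constants, and the kill threshold are all hypothetical, and exponential backoff with full jitter is one common choice rather than the team's stated design.

```python
import random
from collections import Counter
from enum import Enum
from typing import List, Optional


def cancel_backoff_delays(
    max_attempts: int,
    base_delay_s: float = 0.5,
    max_delay_s: float = 30.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Delays for cancel retries: exponential growth, capped, full jitter."""
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(max_attempts):
        cap = min(max_delay_s, base_delay_s * 2 ** attempt)
        # Full jitter spreads retries out so they do not arrive in bursts.
        delays.append(rng.uniform(0.0, cap))
    return delays


class ErrorKind(Enum):
    EXECUTION_FAILURE = "execution_failure"  # the order really failed
    CONFIRMATION_LOST = "confirmation_lost"  # the order may have been fine


def should_kill(error_counts: Counter, kill_threshold: int = 3) -> bool:
    """Escalate to KILL on real execution failures only."""
    return error_counts[ErrorKind.EXECUTION_FAILURE] >= kill_threshold
```

With separate counters, a run of lost confirmations during a venue outage no longer pushes a healthy strategy toward KILL, while genuine execution failures still do.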
Team
Gaddafi (29 commits — frontend)
- Dashboard redesign: unified dark sections, deep-blue headings, new strategy detail layout, signal icon system
- Strategy comparison: side-by-side tools, iteration tracking, unified orders table
- Charting: lightweight charts library, live OHLC overlay, interactive tooltips, marker clustering
- Data freshness indicators across Overview / Compare / Trading views
- Architecture: centralized session state, modular component packages (strategy detail + market data)
- Reliability: narrowed broad exception handling, replaced assertions with explicit exceptions, fixed ML API error handling, navigation/refresh fixes
- Next month: AI agent operator role (LLM-powered monitoring, automation, analysis)
Vicky (13 commits — ML)
- Range regression breakthrough: R² 0.37, direction accuracy 77.3% — first model exceeding profitability threshold
- Volatility regression: R² 0.28, direction acc 68.5% (up from 0.17/63% in March)
- Tuning framework overhaul: walk-forward CV (expanding + rolling), purged CV + embargo, 3-way internal split, CV stability penalty (mean − α·std), catastrophic pruning, fixed column-ordering instability
- Feature selection: mRMR with group caps (volatility/range/volume/OI/funding/return/calendar/MA), rolling Spearman + mutual info, sign flip rate, regime consistency
- New targets: funding_rate_direction, funding_rate_regression, kline_range_regression_bps
- New features: hour_sin/cos, day_of_week_sin/cos, funding settlement timing, rolling intra-bar range
- Codebase simplification: removed TimesNet, VOL_ADJUSTED/QUANTILE variants
- Scoring fix: 0.7·R² + 0.3·Corr (was 0.4/0.6) — better PnL alignment
MJ (~164 commits — engine 97, ML 11, frontend 44, inf-trading 12)
- Backtest accuracy: closed catastrophic gaps that prevented any meaningful comparison vs live — OOM crash, ROI computation, POST_ONLY order support, ATR look-ahead filtering, ml_vol fallback. Added SkipReason telemetry so silent strategies leave a diagnostic trail. Built backtest-vs-live comparator tool
- Risk hardening: error-reason-aware auto-resume (preserve operator-decision quarantines), velocity counter cleanup on ERROR transitions, MMTV quote price clamp inside venue price band, class-level config override fix
- State integrity: position reconcile operator endpoint, three new verification rules covering vector consistency and retired-strategy invariants (closing the "counter-only checks miss vector corruption" blind spot), dust audit + close-dust operator endpoints, auto-close past-strategy dust with Slack alerts on failure
- Strategy iteration: StatsArb v9→v15, TrendFollowing v11→v14, MeanReversion v11→v13, MMTV v1→v3
- Frontend: dashboard improvements (collab w/ Gaddafi)
Phase Assessment
Still in Phase 1 (Validate). Exit criteria not met.
The Phase 1 design is intentional: minimum order quantity per strategy, 2 symbols at high/mid frequency, diversified portfolio of philosophies and types. The goal is a solid infra + robust risk management at the core that runs a diverse portfolio with rapid iteration based on live data. The ideal exit is 2 strategies with converged performance metrics that are positive in both P&L and profit margin.
Against that bar:
- Infra: stress-tested at MM scale ($63K in 2 days), survived two incidents, 21 remediations across Mar+Apr. Solid and getting more so.
- Risk management: extensive hardening, but operator-facing documentation is still tribal.
- Diverse portfolio + iteration: 14 versions across 5 types this month — working as intended.
- 2 converged + profitable strategies: MeanReversion is the closest candidate (v11/v12/v13 converged to similar win rates and modest positive P&L). One candidate, not two. Profit margin not yet positive at min order quantity given fees.
Bigger picture shifts that affect Phase 2 readiness:
- MarketMaking quoting infra validated at scale: MMTV_v3 quoted $63K of volume in 2 days at near-flat P&L. Quoting stack works. Strategy edge is a separate problem, and more quoting iterations won't solve it.
- ML signal crossed profitability threshold: range regression at 77.3% direction accuracy. First model worth integrating into a real strategy.
- Operational debt growing: 13 reactive remediations this month. Documentation gap is now the bottleneck for adding teammates or scaling without MJ in the loop.
What's needed to exit Phase 1: A second strategy type (TrendFollowing or an ML-driven variant) showing the same cross-variant convergence MeanReversion has, both with positive profit margin at min order quantity. Backtest accurate enough to use for parameter selection. Operator runbook documented.
Learning
Infra is the hard part; strategy iteration is the cheap surface. MMTV's $63K in 2 days at flat P&L isn't a
strategy result, it's a multi-year infra result finally being stress-tested at scale — multi-venue execution,
state reconciliation, risk controls, lifecycle management. That's the moat. Strategies sit on top: each variant is
small, has a large sample size in days not months, and shows convergence vs noise quickly when you compare across
variants. April demonstrates the asymmetry — MeanReversion converged across v11/v12/v13 to similar win rates,
StatsArb didn't. That's a signal you can't get from a single variant in a single environment.
This makes strategy iteration the right surface for AI-agent automation: spin up parameter sweeps, deploy variants, compare metrics across them, surface the ones that converge. The expensive thing is the platform underneath, and that part is mostly built.
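As a sketch of what "surface the ones that converge" could mean in an automated sweep, one simple definition is that win rates across a strategy's variants stay within a narrow spread. The function name, the spread tolerance, and the minimum variant count below are hypothetical, not an agreed criterion; a fuller check would likely also look at P&L and trade counts.

```python
from typing import Mapping


def has_converged(
    win_rates: Mapping[str, float],
    max_spread: float = 0.10,  # hypothetical tolerance on win-rate spread
    min_variants: int = 3,     # hypothetical minimum number of variants
) -> bool:
    """True when enough variants exist and their win rates cluster tightly."""
    if len(win_rates) < min_variants:
        return False
    spread = max(win_rates.values()) - min(win_rates.values())
    return spread <= max_spread
```

A rule this simple is cheap enough for an agent to evaluate after every sweep, which is the point: the comparison across variants carries the signal, not any single run.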
The strategy iteration loop is too slow and too expensive. 14 strategy versions shipped this month via live + paper
at ~-$9 net cost — and most of those losses were tuition, not capital allocation we'd defend. Backtest accuracy
isn't a tooling improvement, it's the binding constraint on how fast strategy work can move. The 8-PR April push
went from "backtest is mysteriously wrong" to "backtest is wrong in 6 specific known ways" — that visibility is
the actual progress, more important than the bug count.
Where edge comes from is changing. A year of work assumed direction prediction at 1-hour horizons would unlock
strategy profit. That hit a hard ceiling at ~50% in March. Vicky pivoted to predicting range (how much price will
move, not which way) and got 77% directional accuracy in one month — by fixing the evaluation methodology, not
the model. The lesson is broader than the result: when a research direction stalls for months, the answer is rarely
"more model" — it's usually "wrong target" or "wrong measurement". We should be quicker to question framing.
Single-operator mode is hiding a brittleness. 21 incident remediations shipped across March and April. None required a second operator. None are documented in a way a second operator could use. The system reliability is real but it lives in MJ's head, not in the codebase or the docs. This is fine until it isn't — the day a teammate needs to handle an incident or the day MJ is unavailable for 24 hours. The runbook keeps slipping because it has no forcing function; we should either give it one (e.g., make Gaddafi the on-call backup so he needs the doc) or stop pretending it's a goal.
Phase 1 is being defined the wrong way. "First profitable month" has been the bar for three months and we keep hitting it with incident-luck asterisks. Single-month total P&L is a noisy proxy — incident windfalls or one lucky trade can satisfy it without proving anything. The actual Phase 1 design — minimum order quantity, 2 symbols, diversified portfolio of strategy types — defines the harder, less gameable bar: 2 strategies with cross-variant convergence, positive P&L and positive profit margin. Margin is the part fees can eat; that's where the real edge test is.