2026 - 03

Goal: First profitable live month. Exit Phase 1. | Phase: 1 (Validate)

Strategy: Net Positive Live P&L — 🔶

Focus on what's working. Cut what isn't. Validate before deploying.

  • Iterate SMATrendFollowing params: renamed to TrendFollowing (dropped SMA prefix), iterated v3→v10
  • Create market-making strategy: SingleVenueMarketMaking v1 and v2 deployed
  • Disable losing strategies: aggressive pruning, 30+ strategies disabled across versions
Metric | Value
Active strategies P&L | +$2.87
All-time cumulative | +$42.43 (but +$113 is incident luck)
Incident P&L (Mar 22-23) | +$113 (accidental; see incident report)
Excluding incident | ~-$71 all-time
Best active (Sharpe) | MeanReversion_v10_BTC: +$1.24, 89% WR, 0.61 Sharpe

What worked: MeanReversion flipped from worst (Feb) to best performer. Massive iteration velocity (v3→v10 in one month). Active v10 strategies are stable and modestly profitable.

What didn't: MarketMaking net negative (-$3.78). TrendFollowing regressed — profitable in paper but flat/negative in live (possible market regime sensitivity). Paper MeanReversion losing while live is winning — divergence needs investigation.

Detailed strategy metrics: 2026-03-strategy-metrics.md

ML: Improve ML Signal — 🔶

Direction classification on kline-only features has ~50% accuracy — not actionable.

  • Volatility prediction: CatBoost best (R² > 0.17, direction accuracy > 0.63). Tested XGBoost, alt loss functions, log targets, cross-asset features. Near performance ceiling under current feature space
  • New data sources: funding rate, futures metrics, open interest, long/short ratios, order book (Binance futures + OKX spot/futures). External features didn't improve volatility prediction. Order book improved direction MCC 0.107 → 0.157, an early signal worth exploring
  • Architecture comparison: CatBoost vs TimesNet on same data split — not completed
  • Strategy integration: 4 VolScaledDirectional variants running in paper (mean_rev, ml_vol, mean_rev_ml_vol, BTC). Not yet profitable (-$4.25 combined) — not deployed to live
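
For readers less familiar with MCC (Matthews correlation coefficient), here is a minimal sketch of how it is computed for binary up/down labels. MCC ranges from -1 to 1, with 0 meaning no better than chance, which is why 0.107 → 0.157 is a weak but nonzero signal. The labels below are made-up illustration data, not values from the team's evaluation set.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (1 = up, 0 = down)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: MCC is 0 when any confusion-matrix margin is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom

# Illustrative labels only (hypothetical, not from the report).
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(matthews_corrcoef(y_true, y_pred))  # 0.5
```

Unlike raw accuracy, MCC uses all four confusion-matrix cells, so a model cannot score well simply by predicting the majority class.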

Key finding: Direction prediction precision is capped at ~48-50% (precision-recall analysis), below the ~70% threshold needed for profitability after trading costs. Return prediction (direction + bps) is near-random, with R² ≈ 0. Market regime classification reaches ~50% accuracy with significant distribution drift. The current feature space may be structurally insufficient for profitable direction trading at a 1h horizon.
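
The report does not state how the ~70% threshold was derived. As a rough illustration of how such a break-even precision can arise, the sketch below assumes a simple model: each trade wins `avg_win_bps` with probability equal to the precision, loses `avg_loss_bps` otherwise, and pays a fixed round-trip cost. The 10 bps win/loss and 4 bps cost figures are hypothetical inputs chosen only to show the mechanics.

```python
def breakeven_precision(avg_win_bps, avg_loss_bps, cost_bps):
    """Precision p at which p*win - (1-p)*loss - cost == 0."""
    total = avg_win_bps + avg_loss_bps
    if total <= 0:
        raise ValueError("avg_win_bps + avg_loss_bps must be positive")
    return (avg_loss_bps + cost_bps) / total

def expected_edge_bps(precision, avg_win_bps, avg_loss_bps, cost_bps):
    """Expected P&L per trade in bps under the same simple model."""
    return precision * avg_win_bps - (1 - precision) * avg_loss_bps - cost_bps

# Hypothetical inputs: symmetric 10 bps moves, 4 bps round-trip cost.
print(breakeven_precision(10, 10, 4))                # 0.7
print(round(expected_edge_bps(0.49, 10, 10, 4), 2))  # -4.2
```

Under these assumptions a ~49% precision loses about 4 bps per trade. The break-even point falls as costs shrink relative to the average move.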

AI Agent: Domain Integration — 🔶

Prototype deployed in February (Streamlit chat + role selector + OpenRouter LLM). Currently surface-level. March focus: deep domain integration so each role provides value beyond the dashboard.

  • Developer: run backtests from chat, analyze results with metric comparison, suggest parameter variations
  • Operator: monitor agent deployed — queries backend API, sends Slack alerts on anomalies. Role-specific deep integration (parameter adjustment, incident guidance, kill switch) not started
  • Advisor: performance attribution, cross-strategy risk analysis, event explanation — not started
  • Infra: domain context injection, write operation guardrails, multi-turn conversation memory — not started

Team bandwidth absorbed by strategy iteration and incident response.

Ops: Risk Hardening — ✅

  • Audit fail-closed vs fail-open decisions in risk/verification code
  • Ensure killed strategies wind down positions (not abandon)
  • Add alerting for state drift
  • Document risk management runbook — not done

8 remediations shipped from Mar 22-23 incident: separate WS fill channel, fill health monitoring, circuit breakers (rejection auto-pause, position-flip detection, freshness check), kill switch crash fix, duplicate close guard, stale-state guards at 3 layers.
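
To make the fail-closed idea concrete, here is a minimal sketch of a pre-execution freshness check. It is not the engine's actual code: `PositionSnapshot`, `may_submit_order`, and the 5-second `max_age_s` default are hypothetical names and values. Any doubt about the data (missing, too old, or timestamped in the future) blocks the order instead of letting it through.

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class PositionSnapshot:
    quantity: float
    updated_at: float  # unix timestamp (seconds) of the last confirmed update

def may_submit_order(snapshot: Optional[PositionSnapshot],
                     max_age_s: float = 5.0,
                     now: Optional[float] = None) -> bool:
    """Fail-closed gate: return True only if position data is provably fresh."""
    now = time.time() if now is None else now
    if snapshot is None:
        return False  # no data at all: block
    age = now - snapshot.updated_at
    if age < 0 or age > max_age_s:
        return False  # clock skew or stale data: block
    return True

print(may_submit_order(PositionSnapshot(1.0, updated_at=98.0), now=100.0))  # True
print(may_submit_order(PositionSnapshot(1.0, updated_at=90.0), now=100.0))  # False
```

A fail-open variant would return True on missing data. That lets strategies keep trading on stale positions, which is the behavior the incident exposed.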


Incident: OKX Fill Pipeline Failure (Mar 22-23)

Impact: P1 Critical — 20 hours, all 7 strategies, +$113 (luck).

What happened: OKX WS delivered order status but not fill records. The engine marked orders as handled and removed them from the polling fallback, so strategies traded on stale position data. Manual reconciliation fixed the data but not the bug, and the failure recurred immediately on restart. $3,170 in unintended positions accumulated. 228 orders were rejected with no auto-pause. The kill switch crashed in an infinite retry loop.

Key lesson: The +$113 profit masks a systemic failure. A price move the other way could have meant a loss of $3,000+. Reconciliation fixes state, not the mechanism that broke it. Before restarting after any incident, verify that the pipeline itself is healthy, not just that verification passes.

Remediations (8 shipped): separate WS fill channel, fill health monitoring (60s), duplicate close guard, position-flip detection, pre-execution freshness check, consecutive rejection auto-pause, kill switch crash fix, strategy version reset.
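
As an illustration of the consecutive rejection auto-pause, the sketch below shows one way such a circuit breaker could work. The class name `RejectionBreaker` and the threshold are hypothetical, and the shipped implementation may differ. The breaker latches: once tripped, an accepted order does not clear it, only an explicit reset does.

```python
class RejectionBreaker:
    """Pauses a strategy after too many consecutive order rejections."""

    def __init__(self, max_consecutive: int = 5) -> None:
        if max_consecutive < 1:
            raise ValueError("max_consecutive must be >= 1")
        self.max_consecutive = max_consecutive
        self.consecutive_rejections = 0
        self.paused = False

    def record(self, accepted: bool) -> None:
        """Record the venue's response to one order submission."""
        if accepted:
            # An accepted order resets the streak but never un-pauses.
            self.consecutive_rejections = 0
            return
        self.consecutive_rejections += 1
        if self.consecutive_rejections >= self.max_consecutive:
            self.paused = True

    def reset(self) -> None:
        """Manual operator action after the root cause is understood."""
        self.consecutive_rejections = 0
        self.paused = False

breaker = RejectionBreaker(max_consecutive=3)
for accepted in (False, False, False):
    breaker.record(accepted)
print(breaker.paused)  # True
```

With a threshold in the single digits, a run like the incident's 228 rejections would have paused the strategy within the first few orders.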


Team

Gaddafi (4 commits — frontend)

  • Alert monitor agent: LLM-powered monitoring service with Slack integration, interval-based alerts (collab w/ MJ)
  • Multi-chart venue support, URL-based tab navigation (?tab= deep-linking)
  • Pre-commit tooling (linting, import sorting, Black formatting)
  • Ground-truth API implementation + UI improvements
  • Bug fixes: strategy detail view, codebase cleanup
  • Learning: investigated OKX gateway incident — reinforced importance of defensive programming in live systems

Vicky (7 commits — ML)

  • Volatility modeling: CatBoost regression for BTC, ETH, portfolio volatility (R² > 0.17, direction acc > 0.63). Tested XGBoost, alt loss functions, log targets. Cross-asset features hit ceiling
  • Direction prediction upper bound: precision-recall analysis shows ~48-50% precision cap — below ~70% profitability threshold. Return prediction near-random (R² ≈ 0). Regime classification ~50% with distribution drift
  • Feature expansion: funding rate, futures metrics, OI, long/short ratios, order book (Binance + OKX). Order book improved direction MCC 0.107 → 0.157
  • Infrastructure: fixed R² instability (expanded eval to 2,278 samples), refactored tuning pipeline to config-driven multi-model/multi-target framework

MJ (~201 commits — engine 128, ML 19, frontend 54)

  • Backtest: rebuilt as service (per-run isolation, SQLite, VWAP fills, ~69x speedup). Results still inaccurate vs live, so not used for March iteration. Accuracy work intentionally deferred to prioritize e2e live validation
  • Strategy: v3→v10 iteration via live + paper (backtest not accurate enough). New: SingleVenueMarketMaking
  • Risk: SETTLING status, OKX WS fills, alert system (Slack), verification hardening
  • Engine: order reconciliation, strategy lifecycle, past strategy management, CI/deploy automation
  • Frontend: strategy config UI, backtest UI, accounting improvements, monitor agent

Phase Assessment

Still in Phase 1 (Validate). Exit criteria not met.

All-time P&L is nominally +$42, but +$113 came from the Mar 22-23 incident: accidental profit from unmanaged positions during a favorable price move. Excluding incident luck, all-time is ~-$71. Active v10 strategies are stable at +$2.87, but it is too early and the sample is too small to declare an edge.

What's needed to exit: Sustain positive P&L from intentional strategy execution over a full month with no incident windfalls. Current active strategies (MeanReversion, StatsArb) are the best candidates.

Learning

Iteration velocity matters but is expensive. v3→v10 in one month via live + paper. MeanReversion went from worst (Feb) to best (Mar). But each iteration costs real money (live) or days (paper). Backtest infrastructure improved this month (per-run SQLite isolation, VWAP fills, ~69x speedup), but results don't match live behavior closely enough to use for iteration, so iteration was done entirely via live and paper. An accurate backtest would cut each iteration to minutes at zero cost; backtest accuracy is the prerequisite for cost-efficient strategy development.

Incident profit is not strategy profit. +$113 from unmanaged positions during a pipeline failure must be separated from strategy P&L in attribution. Luck-dependent outcomes mask systemic risk. The Mar 22-23 incident showed that reconciliation restores state but doesn't fix the bug — restarting without confirming pipeline health caused immediate recurrence and $3,170 in unintended positions. Fix systems, not symptoms.
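
The attribution arithmetic is simple enough to spell out. The sketch below uses the two figures from this report (+$42.43 all-time, +$113 incident) and `Decimal` to avoid float rounding. The function name is hypothetical.

```python
from decimal import Decimal

def pnl_excluding_incidents(all_time: Decimal, incident_pnls) -> Decimal:
    """Strategy P&L = all-time P&L minus P&L attributed to incident windows."""
    return all_time - sum(incident_pnls, Decimal("0"))

all_time = Decimal("42.43")      # all-time cumulative P&L from the report
incidents = [Decimal("113")]     # Mar 22-23 incident P&L from the report
strategy_pnl = pnl_excluding_incidents(all_time, incidents)
print(strategy_pnl)  # -70.57
```

This is the ~-$71 figure quoted in the summary.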

Defense-in-depth requires independence. The polling fallback was disabled by the WS handler it was supposed to back up. Each defense layer must trigger on its own conditions, not be preemptible by the path it backs up.
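
One way to express this independence in code is sketched below. It assumes a hypothetical `OrderTracker` (not the engine's real class) in which a WS status update can never remove an order from the polling set on its own. An order leaves polling only when it is terminal and the recorded fills account for the filled size the status update claimed.

```python
from dataclasses import dataclass
from typing import Dict, List

TERMINAL = {"filled", "canceled"}

@dataclass
class TrackedOrder:
    order_id: str
    status: str = "open"
    filled_reported: float = 0.0  # cumulative size claimed by status updates
    fills_recorded: float = 0.0   # cumulative size from actual fill records

    @property
    def reconciled(self) -> bool:
        return (self.status in TERMINAL
                and self.fills_recorded >= self.filled_reported - 1e-12)

class OrderTracker:
    def __init__(self) -> None:
        self._orders: Dict[str, TrackedOrder] = {}

    def add(self, order_id: str) -> None:
        self._orders[order_id] = TrackedOrder(order_id)

    def on_ws_status(self, order_id: str, status: str, filled_reported: float) -> None:
        # Status updates change state but never short-circuit polling.
        order = self._orders[order_id]
        order.status = status
        order.filled_reported = filled_reported

    def on_fill(self, order_id: str, size: float) -> None:
        # Called for fill records from any source (WS fill channel or polling).
        self._orders[order_id].fills_recorded += size

    def needs_polling(self) -> List[str]:
        return [o.order_id for o in self._orders.values() if not o.reconciled]

tracker = OrderTracker()
tracker.add("a")
tracker.on_ws_status("a", "filled", filled_reported=1.0)
print(tracker.needs_polling())  # ['a']  (status arrived, fill record did not)
tracker.on_fill("a", 1.0)
print(tracker.needs_polling())  # []
```

In the failure mode described above, the first `needs_polling()` call would have returned an empty list, leaving the missing fill undetected.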

Paper vs live divergence is real. MeanReversion profitable live, losing paper. TrendFollowing opposite. Same parameters, different results. Don't trust either in isolation. MarketMaking remains hard — two versions net negative, high volume but thin margins eaten by fees and slippage.

Direction prediction has a ceiling. Precision is capped at ~48-50% with current features at a 1h horizon, below the ~70% needed for profitability after fees. Next avenue: order book microstructure at shorter horizons (5-15min).