Reinforcement Learning in the Wild: Building AI for Financial Markets

From academic RL papers to a live system — the engineering and research challenges of deploying reinforcement learning where the environment reacts to your actions.

Why RL for Finance?

Financial decision-making has the structure of a sequential decision problem: you observe a state (market conditions, portfolio positions, risk exposure), take an action (buy, sell, hold, size), receive a reward (P&L, risk-adjusted return), and transition to a new state. This maps naturally onto reinforcement learning. The Markov assumption is only approximate, but it is defensible for certain horizons.

The appeal is not just formal. RL does not require labelled data in the supervised sense. It learns from interaction with the environment. In markets, where the “correct” action is never observed — only the outcome — this is genuinely attractive.

The Seductive Simplicity of the Formulation

A naive RL formulation for trading looks clean on paper:

State: price history, volume, portfolio position, time features
Action: discrete (buy/sell/hold) or continuous (position sizing)
Reward: portfolio return, Sharpe ratio, or a risk-penalized variant
Environment: historical market data (backtesting) or a live exchange feed
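
The four pieces above can be sketched as a toy environment. This is a minimal illustration, assuming a single asset, a series of per-step returns as input, three discrete actions, and a proportional trading cost. The class name `ToyTradingEnv` and all of its parameters are hypothetical and are not the system described later in this post.

```python
import numpy as np


class ToyTradingEnv:
    """Toy single-asset environment (hypothetical, for illustration only).

    State:  the last `window` returns plus the current position.
    Action: 0 = short, 1 = flat, 2 = long (mapped to positions -1, 0, +1).
    Reward: position * next return, minus a proportional cost on position changes.
    """

    def __init__(self, returns, window=5, cost=0.001):
        self.returns = np.asarray(returns, dtype=float)
        self.window = window
        self.cost = cost
        self.reset()

    def reset(self):
        self.t = self.window      # index of the next return to be realised
        self.position = 0.0
        return self._observe()

    def _observe(self):
        # Only returns strictly before time t are visible to the agent.
        past = self.returns[self.t - self.window:self.t]
        return np.append(past, self.position)

    def step(self, action):
        new_position = float(action) - 1.0
        trade_cost = self.cost * abs(new_position - self.position)
        reward = new_position * self.returns[self.t] - trade_cost
        self.position = new_position
        self.t += 1
        done = self.t >= len(self.returns)
        obs = None if done else self._observe()
        return obs, reward, done
```

Note that the observation is built only from returns the agent could already have seen at decision time. Getting that indexing wrong is one of the easiest ways to produce a backtest that looks far better than it should.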

You can implement this in a weekend. The first backtest looks incredible. The agent learns to “time” the market perfectly on historical data. You get excited. Then you deploy it, and it falls apart.

Where It Breaks Down

Several things make markets fundamentally harder than the environments RL was benchmarked on:

Non-stationarity. The data distribution shifts continuously. A regime that held in 2019 may no longer exist in 2022, and an agent trained on a single regime will overfit to it.

Partial observability. The true state of the market is not observable. Order flow, institutional positioning, macro sentiment — most of the signal that drives price is invisible to a standard agent.

Market impact. If your actions are large enough, the market reacts to them. You cannot treat the environment as fixed. This is especially severe in less liquid instruments.

Reward sparsity and delay. A position held for days generates a single reward signal after the fact. Assigning credit across the decision steps that led there is a hard temporal credit assignment problem.

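Survivorship and lookahead bias. Backtesting environments are deceptively optimistic: a universe built from instruments that still trade today omits the ones that were delisted, and features computed with data that was not yet available at decision time leak the future into the state. It takes discipline to construct a clean evaluation.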
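
One common piece of that discipline is walk-forward evaluation, where every test window lies strictly after the data used to train on it. The sketch below is a generic illustration under that assumption. The function name `walk_forward_splits` and its parameters are hypothetical, and it is not necessarily the evaluation scheme used in the system described here.

```python
def walk_forward_splits(n_samples, train_size, test_size, gap=0):
    """Yield (train_indices, test_indices) with test strictly after train.

    `gap` drops samples between the two windows, which helps when labels or
    features span several steps and would otherwise leak across the boundary.
    """
    start = 0
    while start + train_size + gap + test_size <= n_samples:
        train_end = start + train_size
        test_start = train_end + gap
        yield (range(start, train_end), range(test_start, test_start + test_size))
        start += test_size  # roll the window forward by one test block
```

For example, `list(walk_forward_splits(10, 4, 2, gap=1))` gives two folds: train on samples 0 to 3 and test on 5 to 6, then train on 2 to 5 and test on 7 to 8.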

What We Built

Our system had three main components:

1. A feature extraction layer. Raw price/volume data is not a good state representation. We built a pipeline that extracts multi-scale momentum features, microstructure signals, and cross-asset correlations. The representation was fixed and domain-driven, not learned end-to-end — we found this more stable.

2. A policy network trained with PPO. We used Proximal Policy Optimization with a reward function based on risk-adjusted return, with explicit penalty terms for drawdown and transaction costs. Action space was continuous position sizing over a set of instruments.

3. A regime detection module. Rather than training a single agent, we trained a small classifier to identify the current market regime (trending, mean-reverting, high-volatility) and routed to a regime-specific policy. This was the single biggest improvement in out-of-sample performance.
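
To make the reward in component 2 concrete, here is one possible shape for a per-step reward with drawdown and transaction-cost penalties. This is a sketch under stated assumptions: the exact functional form, the drawdown definition, the coefficient values, and the name `shaped_reward` are all hypothetical, since the form actually used is not given here.

```python
import numpy as np


def shaped_reward(equity_prev, equity_now, peak_equity,
                  old_weights, new_weights,
                  cost_rate=0.0005, drawdown_penalty=0.1):
    """Step reward = return - transaction cost - drawdown penalty.

    Equity values are assumed to be gross of trading costs.
    All coefficients are illustrative placeholders, not tuned values.
    """
    step_return = equity_now / equity_prev - 1.0
    # Turnover: total absolute change in position weights across instruments.
    turnover = np.abs(np.asarray(new_weights) - np.asarray(old_weights)).sum()
    cost = cost_rate * turnover
    # Drawdown: fractional distance below the running equity peak (0 at a new high).
    peak = max(peak_equity, equity_now)
    drawdown = (peak - equity_now) / peak
    return step_return - cost - drawdown_penalty * drawdown
```

Charging for turnover inside the reward, rather than only in the backtest accounting, gives the policy a direct incentive to avoid churning positions. The drawdown term penalizes time spent below the previous equity peak, not just volatility.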

The Non-Stationarity Problem

The hardest problem we faced was non-stationarity. Markets change character over time. An agent trained on 2018–2021 data faces a very different distribution in 2022, with rising rates and inflation.

Our partial solution was rolling retraining with exponential decay of older samples, combined with a “staleness detector” that triggered retraining when the feature distribution drifted beyond a threshold. This helped but did not fully solve the problem — it is an open research question.
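
Both ideas can be sketched in a few lines. The drift statistic behind the staleness detector is not specified here, so the example below assumes a simple per-feature mean-shift test as a stand-in. The function names, the half-life parameterization, and the threshold are hypothetical.

```python
import numpy as np


def decay_weights(n_samples, half_life):
    """Sample weights for oldest-to-newest data; the newest sample gets weight 1."""
    age = np.arange(n_samples - 1, -1, -1)   # newest sample has age 0
    return 0.5 ** (age / half_life)


def is_stale(reference, recent, threshold=3.0):
    """Flag drift when any feature's recent mean moves more than `threshold`
    standard errors away from its mean in the reference window."""
    reference = np.asarray(reference, dtype=float)
    recent = np.asarray(recent, dtype=float)
    ref_mean = reference.mean(axis=0)
    ref_std = reference.std(axis=0, ddof=1) + 1e-12   # avoid division by zero
    std_err = ref_std / np.sqrt(len(recent))
    z = np.abs(recent.mean(axis=0) - ref_mean) / std_err
    return bool(np.any(z > threshold))
```

A mean-shift test only catches changes in level. Drift in variance or in cross-feature correlations would need a richer statistic, which is part of why this remains only a partial solution.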

Lessons

What worked: Domain-driven feature engineering, conservative reward shaping with explicit risk penalties, regime-aware routing, tight evaluation discipline.

What did not: End-to-end learning from raw price data, complex recurrent architectures without strong regularization, overfitting to backtest Sharpe.

The honest conclusion: RL in finance works, but not for the reasons the papers suggest. The contribution of the RL component per se is modest compared to the quality of the feature representation and the sophistication of the evaluation pipeline. The framework is useful for handling multi-step optimization under constraints — but it is not a free lunch.

The experience was a forcing function for rigor: when real money is on the line, you find the gaps in your methodology quickly.
