Training a Deep Q-Network for Trading
I’ve been working on a trading algorithm that uses reinforcement learning to make buy, hold, and sell decisions. It’s not a toy project but a real system I’m training on Lambda Cloud, with actual capital eventually on the line. The core is a Deep Q-Network, and getting it to train stably has been an education in all the things that can go wrong with RL.
This post covers how DQNs work and the techniques I’m using to train ours. I’m not going to detail the specific features or trading strategy, but I do want to explain the modeling approach because there’s a lot of hard-won knowledge in making these things actually converge.
What’s a Deep Q-Network?
A DQN learns to estimate the value of taking different actions in different states. The “Q” in Q-learning stands for quality: Q(state, action) tells you how good it is to take a particular action when you’re in a particular state.
The classic Q-learning algorithm maintains a table of Q-values and updates them based on the Bellman equation: the value of an action should equal the immediate reward plus the discounted value of the best action you can take from the next state. This works great when your state space is small enough to fit in a table.
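In symbols, with learning rate α and discount factor γ, the tabular update is:

Q(s, a) ← Q(s, a) + α * [r + γ * max_a' Q(s', a') − Q(s, a)]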
For trading, our state space is continuous and high-dimensional. We have dozens of features describing market conditions, and no two states are exactly alike. So we use a neural network to approximate the Q-function instead of a table. That’s the “deep” part of Deep Q-Network.
The network takes a state as input and outputs Q-values for each possible action. During training, we pick actions mostly greedily (take the action with highest Q-value) but with some random exploration mixed in. We observe the reward and next state, then update the network to make its Q-value predictions more accurate.
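As a rough sketch of that action-selection step in PyTorch (the feature count, layer sizes, and names here are illustrative, not the actual model):

import random
import torch
import torch.nn as nn

# Illustrative Q-network: a state vector in, one Q-value per action (buy/hold/sell) out.
q_net = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 3))

def select_action(state, epsilon):
    # state: 1-D tensor of features. Explore with probability epsilon;
    # otherwise act greedily on the predicted Q-values.
    if random.random() < epsilon:
        return random.randrange(3)
    with torch.no_grad():
        return q_net(state.unsqueeze(0)).argmax(dim=1).item()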
The Instability Problem
Naive DQNs are notoriously unstable. The problem is that you’re chasing a moving target. Every time you update your network, the Q-values change, which changes what the “correct” Q-values should be for your training data, which means the targets you just trained toward are already stale.
It’s like trying to hit a target that moves every time you swing. Sometimes you converge. Sometimes you oscillate forever. Sometimes you diverge completely.
The field has developed several techniques to address this, and we use most of them.
Target Networks
The first big fix is using a separate target network. You have two copies of your network: the policy network that you’re actively training, and the target network that you use to compute the “correct” Q-values for training.
The target network stays frozen while you train the policy network. Periodically, you update the target network to match the policy network. This gives you a stable target to train against, at least for a while.
We use soft updates rather than hard copies. Instead of completely replacing the target network every N steps, we slowly blend it toward the policy network:
θ_target ← τ * θ_policy + (1-τ) * θ_target
With τ = 0.005, the target network drifts slowly toward the policy network. This is more stable than periodic hard updates because there’s no sudden discontinuity in what you’re training toward.
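In PyTorch, the soft update is only a few lines (a minimal sketch using the τ above):

TAU = 0.005

def soft_update(policy_net, target_net, tau=TAU):
    # theta_target <- tau * theta_policy + (1 - tau) * theta_target, parameter by parameter
    for t_param, p_param in zip(target_net.parameters(), policy_net.parameters()):
        t_param.data.copy_(tau * p_param.data + (1.0 - tau) * t_param.data)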
Double DQN
Standard DQN has an overestimation problem. When you compute the target Q-value, you take the max over all actions. But your Q-value estimates are noisy, and taking the max of noisy estimates tends to overestimate the true value. Over time, this bias compounds and your Q-values drift upward.
Double DQN fixes this by decoupling action selection from action evaluation. The policy network picks which action is best, but the target network evaluates how good that action actually is:
# The policy network picks the action; the target network evaluates it.
best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
This simple change significantly reduces overestimation and leads to more stable training.
Prioritized Experience Replay
Standard experience replay stores transitions in a buffer and samples uniformly at random for training. But not all transitions are equally useful. A transition where your prediction was way off has more to teach you than one where you were already accurate.
Prioritized experience replay samples transitions proportionally to their TD error, the difference between your predicted Q-value and the target Q-value. High-error transitions get sampled more often.
The catch is that this introduces bias. You’re no longer sampling uniformly, so you need importance sampling weights to correct for it. Each sampled transition gets a weight inversely proportional to its sampling probability, which counters the bias from non-uniform sampling.
We anneal the importance sampling correction over training, starting with partial correction and moving toward full correction as training progresses. Early in training, we care more about learning from surprising transitions. Late in training, we care more about unbiased estimates.
We also clip TD errors aggressively. Trading environments can produce extreme rewards, and without clipping, a few outlier transitions would dominate the priority distribution. We cap errors to keep the priorities bounded.
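Here’s a minimal sketch of the sampling side, assuming a flat list buffer and a NumPy priority array; a real implementation would use a sum-tree for efficiency, and the α and clipping values below are placeholders:

import numpy as np

ALPHA = 0.6      # how strongly TD error shapes sampling (assumed value)
TD_CLIP = 1.0    # cap on absolute TD error (assumed value)

def sample(buffer, priorities, batch_size, beta):
    # Sampling probability proportional to priority^alpha.
    probs = priorities ** ALPHA
    probs = probs / probs.sum()
    idx = np.random.choice(len(buffer), batch_size, p=probs)
    # Importance-sampling weights correct for the non-uniform sampling;
    # beta is annealed from partial correction toward 1.0 over training.
    weights = (len(buffer) * probs[idx]) ** (-beta)
    weights = weights / weights.max()
    return [buffer[i] for i in idx], idx, weights

def update_priorities(priorities, idx, td_errors):
    # Clip TD errors so a few outlier trades can't dominate the distribution.
    priorities[idx] = np.minimum(np.abs(td_errors), TD_CLIP) + 1e-6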
Reward Shaping
The raw reward signal in trading is sparse and noisy. You take an action, wait for the trade to resolve, and get a profit or loss. But that P&L depends on market noise as much as decision quality. And most of the time, the right action is to hold, which gives you nothing to learn from.
We shape rewards to give the network more signal. Holding when there’s no opportunity gets a small positive reward. Holding when there’s a big gap to trade gets a small negative reward. Winning trades get scaled rewards based on magnitude, with a bonus for predicting direction correctly.
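The exact thresholds and scales are part of the strategy I’m not sharing, so here’s only the shape of the idea, with placeholder numbers throughout:

def shaped_reward(action, gap, pnl, predicted_direction, actual_direction):
    # All constants below are placeholders, not the tuned values.
    if action == "hold":
        # Small reward for patience when there's nothing to trade,
        # small penalty for sitting out a large gap.
        return 0.01 if abs(gap) < 0.5 else -0.01
    reward = 0.1 * pnl                       # scale the raw P&L
    if predicted_direction == actual_direction:
        reward += 0.05                       # bonus for calling the direction correctly
    return reward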
The goal is to give credit for good process, not just good outcomes. A trade that loses money but was the right decision given the information available should still get partial credit. This is the same logic behind Annie Duke’s “Thinking in Bets”: separate decision quality from outcome quality.
Architecture Evolution
My network architecture has evolved through four major versions, each addressing problems discovered in the previous one.
The first version used a simple feedforward network with shared layers feeding into separate heads for the action (buy/hold/sell) and the exit parameters (stop-loss and take-profit levels). This worked but had trouble learning which features mattered.
The current version uses an attention mechanism over feature groups. Instead of treating all inputs as one flat vector, it organizes features into semantic groups: overnight indicators, volume patterns, market regime, and so on. Each group gets projected to an embedding, and the network then applies multi-head self-attention across the groups.
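Roughly, the encoder looks something like this (the group sizes, embedding dimension, and head count are illustrative, not the real configuration):

import torch
import torch.nn as nn

class GroupAttentionEncoder(nn.Module):
    # Illustrative: four feature groups (overnight, volume, regime, ...), each with its own size.
    def __init__(self, group_sizes=(8, 6, 10, 8), embed_dim=64, heads=4):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(size, embed_dim) for size in group_sizes)
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)

    def forward(self, groups):
        # groups: one tensor per feature group, each of shape [batch, group_size]
        tokens = torch.stack([p(g) for p, g in zip(self.proj, groups)], dim=1)  # [batch, n_groups, embed_dim]
        out, attn_weights = self.attn(tokens, tokens, tokens)  # self-attention across groups
        return out.flatten(1), attn_weights  # the weights show which groups got attention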
This has two benefits. First, the network can learn which feature groups are relevant for different decisions. Sometimes volume matters more; sometimes overnight action matters more. Attention lets the network route information dynamically.
Second, it gives us interpretability. I can look at the attention weights to see which feature groups the network focused on for a particular decision. When the model makes a trade I don’t understand, I can at least see what it was paying attention to.
Continuous Outputs for Risk Management
A pure DQN would output discrete actions: buy, hold, or sell. But in trading, the entry decision is only half the battle. You also need to decide where to set your stop-loss and take-profit levels.
We handle this with a multi-task architecture. The network has one head that outputs Q-values for discrete actions, and another head that outputs a continuous value for the exit percentage. The exit output goes through a sigmoid and gets scaled to a reasonable range.
The loss function is a weighted combination of the Q-learning loss (for action selection) and an MSE loss (for exit prediction). This lets us learn both aspects jointly rather than training separate models.
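Here’s a sketch of the two heads and the joint loss; the hidden size, exit range, and loss weight are placeholders:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHead(nn.Module):
    def __init__(self, in_dim=256, n_actions=3, exit_min=0.01, exit_max=0.10):
        super().__init__()
        self.q_head = nn.Linear(in_dim, n_actions)   # Q-values for buy/hold/sell
        self.exit_head = nn.Linear(in_dim, 1)        # continuous exit percentage
        self.exit_min, self.exit_max = exit_min, exit_max

    def forward(self, features):
        q_values = self.q_head(features)
        # Sigmoid squashes to (0, 1), then we scale into a plausible exit range.
        exit_pct = self.exit_min + torch.sigmoid(self.exit_head(features)) * (self.exit_max - self.exit_min)
        return q_values, exit_pct

def combined_loss(q_pred, q_target, exit_pred, exit_target, exit_weight=0.5):
    # Huber for the Q-learning term, MSE for the exit term; the 0.5 weight is assumed.
    return F.smooth_l1_loss(q_pred, q_target) + exit_weight * F.mse_loss(exit_pred, exit_target)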
So far this has only somewhat worked: the model does predict stop-loss and take-profit levels, but they always fall within a very tight range regardless of the underlying stock. More work is needed here.
Training Stability Techniques
Beyond the algorithmic innovations, there’s a lot of basic engineering that matters for stability.
Gradient clipping prevents exploding gradients. I clip the gradient norm to 0.5, which is aggressive but necessary given how noisy trading rewards can be.
Learning rate scheduling reduces the learning rate when progress stalls. I use ReduceLROnPlateau, which halves the learning rate after 50 steps without improvement.
Batch normalization helps with internal covariate shift and makes training less sensitive to initialization.
Huber loss (SmoothL1) instead of MSE for the Q-learning loss. Huber loss is less sensitive to outliers, which matters when occasional trades produce extreme P&L.
Slow exploration decay. I start with high exploration (epsilon = 1.0, meaning random actions) and decay slowly toward low exploration (epsilon = 0.05). The decay rate of 0.998 per episode means exploration stays high for a long time. Premature exploitation is a common failure mode in RL.
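Wired together, those knobs look roughly like this. Here policy_net is the Q-network, and compute_q_and_targets is a hypothetical helper standing in for the Double DQN target computation above:

import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-4)  # starting learning rate is an assumption
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=50)
loss_fn = nn.SmoothL1Loss()   # Huber loss instead of MSE

def train_step(batch):
    q_pred, q_target = compute_q_and_targets(batch)   # hypothetical helper
    loss = loss_fn(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    clip_grad_norm_(policy_net.parameters(), 0.5)      # clip the gradient norm to 0.5
    optimizer.step()
    return loss.item()

# After each evaluation: scheduler.step(metric), and decay exploration slowly:
# epsilon = max(0.05, epsilon * 0.998)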
Infrastructure
I train on Lambda Cloud because the iteration cycle matters. The model can train on my MacBook, and I’ve done that, but it takes days; on an A100 it trains in a few hours for a few dollars. Being able to spin up GPU instances on demand, run experiments in parallel, and tear them down when done makes the whole process tractable.
Checkpointing is critical. I save the full training state every 25 episodes: network weights, optimizer state, replay buffer statistics, and training history. Training can crash or get preempted, and being able to resume exactly where you left off saves days of compute over time.
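The checkpoint itself is just a dictionary passed to torch.save (the field names are illustrative):

import torch

def save_checkpoint(path, episode, policy_net, target_net, optimizer, replay_stats, history):
    # Everything needed to resume exactly where training stopped.
    torch.save({
        "episode": episode,
        "policy_state": policy_net.state_dict(),
        "target_state": target_net.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "replay_stats": replay_stats,
        "history": history,
    }, path)

# Called every 25 episodes from the training loop; torch.load plus load_state_dict restores it.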
What’s Actually Hard
The techniques I’ve described are all documented in papers and implemented in libraries. The hard part isn’t knowing about them. It’s knowing when something is wrong and which technique might help.
Training instability can look like many things: loss spiking, Q-values diverging, the network collapsing to always predicting the same action, performance improving then crashing. Each symptom has multiple possible causes, and the causes interact.
Is your network diverging because the learning rate is too high, or because your rewards are too extreme, or because your target network is updating too fast, or because you have a bug in your environment? Usually you try all of the above, in some order, while slowly going insane.
The attention mechanism helps with interpretability, but it’s not a silver bullet. You can see what the network is paying attention to, but you can’t see why it’s making the decisions it’s making. The Q-values are just numbers. They don’t come with explanations.
Where We Are
The current system trains stably and learns to do better than random. I’m able to automate its trading through Alpaca fairly easily, and I run it in paper trading mode there periodically to see how it’s doing.
Overall the performance is OK, but the training environment needs more realistic modeling of slippage.
Reinforcement learning for trading is hard because the signal-to-noise ratio is terrible. Markets are noisy, rewards are delayed, and the optimal policy might just be “don’t trade.”
But that’s what makes it interesting.