Implementing DQN with Python and PyTorch

sonnh

Introduction to DQN

Reinforcement learning is known to be unstable, or even to diverge, when a nonlinear function approximator such as a neural network is used to represent the action-value function (also known as the [imath]Q[/imath] function).
This instability has several causes: the correlations present in the sequence of observations (obs); the fact that small updates to [imath]Q[/imath] may significantly change the policy and therefore the data distribution; and the correlations between the action-values ( [imath]Q[/imath] ) and the target values [imath]r + \gamma \max_{a'} Q(s',a')[/imath] .

DQN addresses these instabilities with a variant of Q-learning that relies on two ideas: a replay buffer and a fixed Q-target.

Uniform random sampling from the experience replay memory

The reinforcement learning agent stores its experiences in a buffer in the order they occur, so adjacent stored transitions [imath](s, a, r, s')[/imath] are very likely to be correlated.
To remove this correlation, the agent samples uniformly at random from the pool of stored transitions,
[imath] ((s,a,r,s') \sim U(D)) [/imath] , where U(D) is the uniform distribution over the buffer D.
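To make the difference between consecutive and uniform mini-batches concrete, here is a minimal sketch. It assumes a toy buffer whose transitions are identified only by their time index; the names `buffer_size`, `rng` and the seed are illustrative and not part of the article's code.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed (assumption)

buffer_size = 1000  # hypothetical number of stored transitions
batch_size = 32

# Consecutive mini-batch: 32 neighbouring time steps, strongly correlated.
start = rng.integers(0, buffer_size - batch_size)
consecutive_idxs = np.arange(start, start + batch_size)

# Uniform mini-batch: every stored transition is equally likely,
# i.e. (s, a, r, s') ~ U(D).
uniform_idxs = rng.choice(buffer_size, size=batch_size, replace=False)

# Range of time steps covered by each mini-batch.
consecutive_span = int(consecutive_idxs.max() - consecutive_idxs.min())
uniform_span = int(uniform_idxs.max() - uniform_idxs.min())
print("consecutive span:", consecutive_span)
print("uniform span:    ", uniform_span)
```

The consecutive batch always covers exactly `batch_size - 1` time steps, while the uniform batch is typically spread over most of the buffer, which is what breaks the temporal correlation.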

See the sample_batch method of the ReplayBuffer class for more details.

Fixed Q-target

DQN uses an iterative update that adjusts the action-values ( [imath]Q[/imath] ) towards target values that are only updated periodically, which reduces the correlations with the target.
Without this, training is easily thrown off course because the target keeps moving. The Q-learning update at iteration [imath]i[/imath] uses the following loss function:
[math] L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)} \big[ \big( r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s, a; \theta_i) \big)^2 \big] [/math]

where [imath]\gamma[/imath] is the discount factor, [imath]\theta_i[/imath] are the parameters of the [imath]Q[/imath]-network at iteration [imath]i[/imath] , and [imath]\theta_i^-[/imath] are the network parameters used to compute the target at iteration [imath]i[/imath] . The target network parameters [imath]\theta_i^-[/imath] are only updated with the Q-network parameters ( [imath]\theta_i[/imath] ) every [imath]C[/imath] steps and are held fixed between individual updates ( [imath]C = 200[/imath] for CartPole-v0).
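The periodic hard update can be sketched as follows. This is only an illustration under stated assumptions: two small `nn.Linear` layers stand in for the Q-network and the target network, the loss is a dummy whose only purpose is to move the online weights, and `C = 3` is chosen so the effect is visible within a few steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

C = 3  # hypothetical update period (assumption, kept tiny for the demo)

online = nn.Linear(4, 2)  # stands in for the Q-network (theta)
target = nn.Linear(4, 2)  # stands in for the target network (theta^-)
target.load_state_dict(online.state_dict())

optimizer = torch.optim.SGD(online.parameters(), lr=0.1)
synced = []  # whether target == online after each step

for step in range(1, 7):
    x = torch.randn(8, 4)
    loss = online(x).pow(2).mean()  # dummy loss, not the DQN loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Hard update: copy theta into theta^- every C steps only.
    if step % C == 0:
        target.load_state_dict(online.state_dict())

    synced.append(torch.equal(online.weight, target.weight))

print(synced)
```

The two networks should only match right after steps 3 and 6; between those steps the target parameters stay frozen while the online parameters keep changing.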

For more stability: gradient clipping

It is helpful to clip the error term of the update, [imath]r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)[/imath] , to lie between -1 and 1. Because the absolute value loss function [imath] | x | [/imath] has a derivative of -1 for all negative values of [imath]x[/imath] and a derivative of 1 for all positive values of [imath]x[/imath] , clipping the squared error to be between -1 and 1 corresponds to using an absolute value loss function for errors outside the (-1,1) interval. This form of error clipping further improves the stability of the algorithm.
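One way to check this correspondence numerically is to compare the gradient of the Huber loss (`F.smooth_l1_loss` in PyTorch, with its default threshold of 1) against the clipped error. The TD errors below are made-up values chosen to fall both inside and outside (-1, 1).

```python
import torch
import torch.nn.functional as F

# Hypothetical TD errors (assumption): some inside, some outside (-1, 1).
td_error = torch.tensor([-3.0, -0.5, 0.0, 0.25, 2.0], requires_grad=True)

# Huber / smooth L1 loss against a zero target. reduction="sum" keeps the
# per-element gradients unscaled by the batch size.
loss = F.smooth_l1_loss(td_error, torch.zeros_like(td_error), reduction="sum")
loss.backward()

# Error clipped to [-1, 1], as described in the text.
clipped = td_error.detach().clamp(-1.0, 1.0)

print("gradient:     ", td_error.grad)
print("clipped error:", clipped)
```

Inside the interval the loss is quadratic, so its gradient equals the error itself; outside the interval the loss is linear, so the gradient saturates at -1 or 1.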

Installing the required libraries

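pip install torch
# the code below uses the pre-0.26 gym API (env.seed, 4-value env.step)
pip install "gym<0.26"
pip install matplotlib ipython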

import os
from typing import Dict, List, Tuple

import gym
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from IPython.display import clear_output  # used by DQNAgent._plot

Replay buffer

Typically, people implement replay buffers with one of the following three data structures:

collections.deque
list
numpy.ndarray

deque is very easy to work with once you initialize it with a maximum length (e.g. deque(maxlen=buffer_size)). However, indexing into a deque becomes slow as it grows, because internally it is a doubly linked list: access is [imath]O(1)[/imath] at both ends but [imath]O(n)[/imath] in the middle. A list, on the other hand, is an array, so it is relatively faster than a deque when you sample a batch at every step; its Get item cost is [imath]O(1)[/imath] .
Last but not least, consider numpy.ndarray. numpy.ndarray is even faster than list because it is a homogeneous array of fixed-size items stored contiguously in memory.

Here, we will implement the replay buffer using numpy.ndarray.
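The comparison between the three containers can be tried with a small micro-benchmark such as the one below. It is only a sketch: the buffer size, the 4-dimensional observations and the helper names (`sample_deque`, `sample_list`, `sample_array`) are assumptions for illustration, and the timings depend on the machine, so no expected numbers are given.

```python
import timeit
from collections import deque

import numpy as np

buffer_size = 100_000  # hypothetical buffer size
batch_size = 32
obs_dim = 4            # hypothetical observation size

observations = np.random.rand(buffer_size, obs_dim).astype(np.float32)

as_array = observations                        # one contiguous 2-D array
as_list = [row for row in observations]        # list of 1-D arrays
as_deque = deque(as_list, maxlen=buffer_size)  # deque of 1-D arrays

idxs = np.random.choice(buffer_size, size=batch_size, replace=False)

def sample_deque() -> np.ndarray:
    # Each lookup may walk through the deque's internal blocks.
    return np.stack([as_deque[i] for i in idxs])

def sample_list() -> np.ndarray:
    # O(1) lookups, but the batch still has to be stacked into an array.
    return np.stack([as_list[i] for i in idxs])

def sample_array() -> np.ndarray:
    # A single fancy-indexing call returns the whole batch.
    return as_array[idxs]

for name, fn in [("deque", sample_deque),
                 ("list", sample_list),
                 ("ndarray", sample_array)]:
    seconds = timeit.timeit(fn, number=200)
    print(f"{name:8s} {seconds:.4f} s for 200 batches")
```

All three functions return the same batch; only the cost of assembling it differs.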

class ReplayBuffer:
    """A simple numpy replay buffer."""

    def __init__(self, obs_dim: int, size: int, batch_size: int = 32):
        # Preallocate one fixed-size array per transition field.
        self.obs_buf = np.zeros([size, obs_dim], dtype=np.float32)
        self.next_obs_buf = np.zeros([size, obs_dim], dtype=np.float32)
        self.acts_buf = np.zeros([size], dtype=np.float32)
        self.rews_buf = np.zeros([size], dtype=np.float32)
        self.done_buf = np.zeros(size, dtype=np.float32)
        self.max_size, self.batch_size = size, batch_size
        # ptr: next write position; size: number of stored transitions.
        self.ptr, self.size = 0, 0

    def store(
        self,
        obs: np.ndarray,
        act: np.ndarray, 
        rew: float, 
        next_obs: np.ndarray, 
        done: bool,
    ):
        self.obs_buf[self.ptr] = obs
        self.next_obs_buf[self.ptr] = next_obs
        self.acts_buf[self.ptr] = act
        self.rews_buf[self.ptr] = rew
        self.done_buf[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self) -> Dict[str, np.ndarray]:
        idxs = np.random.choice(self.size, size=self.batch_size, replace=False)
        return dict(obs=self.obs_buf[idxs],
                    next_obs=self.next_obs_buf[idxs],
                    acts=self.acts_buf[idxs],
                    rews=self.rews_buf[idxs],
                    done=self.done_buf[idxs])

    def __len__(self) -> int:
        return self.size
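The pointer arithmetic in store makes the buffer circular: once it is full, new transitions overwrite the oldest ones. The following stand-alone sketch reproduces only that logic on a single reward array, with a deliberately tiny capacity of 3 (an assumption for readability).

```python
import numpy as np

max_size = 3  # tiny hypothetical capacity
rews_buf = np.zeros(max_size, dtype=np.float32)
ptr, size = 0, 0

for rew in [1.0, 2.0, 3.0, 4.0, 5.0]:
    rews_buf[ptr] = rew             # write at the current position
    ptr = (ptr + 1) % max_size      # wrap around at the end
    size = min(size + 1, max_size)  # size saturates at capacity

print(rews_buf, ptr, size)  # [4. 5. 3.] 2 3
```

Rewards 1.0 and 2.0 were overwritten by 4.0 and 5.0. Note also that sample_batch draws with replace=False, so np.random.choice raises a ValueError if fewer than batch_size transitions are stored; the training loop avoids this by checking len(self.memory) >= self.batch_size before updating.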

Network

We will use a simple network architecture with three fully connected layers and two nonlinear activation functions (ReLU).

class Network(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        """Initialization."""
        super(Network, self).__init__()

        self.layers = nn.Sequential(
            nn.Linear(in_dim, 128), 
            nn.ReLU(),
            nn.Linear(128, 128), 
            nn.ReLU(), 
            nn.Linear(128, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward method implementation."""
        return self.layers(x)
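A quick shape check helps confirm what the network produces. The sketch below rebuilds the same layer stack stand-alone and assumes CartPole's sizes (4 observation values, 2 actions); the batch of zeros is a placeholder input.

```python
import torch
import torch.nn as nn

obs_dim, action_dim = 4, 2  # CartPole sizes (assumption for this sketch)

# Same layer stack as Network.layers above.
layers = nn.Sequential(
    nn.Linear(obs_dim, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, action_dim),
)

batch = torch.zeros(32, obs_dim)         # placeholder batch of observations
q_values = layers(batch)                 # one Q-value per action
greedy_actions = q_values.argmax(dim=1)  # greedy action for each observation

print(q_values.shape, greedy_actions.shape)
```

The network maps each observation to a vector of Q-values, one per action, so the greedy action is simply the argmax over the last dimension.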

DQN Agent

Methods of the DQNAgent class:

select_action: select an action from the input state.
step: take an action and return the response of the env.
_compute_dqn_loss: return the DQN loss.
update_model: update the model by gradient descent.
_target_hard_update: hard update from the local model to the target model.
train: train the agent.
test: test the agent (1 episode).
_plot: plot the training progress.

class DQNAgent:
    """DQN Agent interacting with environment.
    
    Attribute:
        env (gym.Env): openAI Gym environment
        memory (ReplayBuffer): replay memory to store transitions
        batch_size (int): batch size for sampling
        epsilon (float): parameter for epsilon greedy policy
        epsilon_decay (float): step size to decrease epsilon
        max_epsilon (float): max value of epsilon
        min_epsilon (float): min value of epsilon
        target_update (int): period for target model's hard update
        gamma (float): discount factor
        dqn (Network): model to train and select actions
        dqn_target (Network): target model to update
        optimizer (torch.optim): optimizer for training dqn
        transition (list): transition information including 
                           state, action, reward, next_state, done
    """

    def __init__(
        self, 
        env: gym.Env,
        memory_size: int,
        batch_size: int,
        target_update: int,
        epsilon_decay: float,
        max_epsilon: float = 1.0,
        min_epsilon: float = 0.1,
        gamma: float = 0.99,
    ):
        """Initialization.
        
        Args:
            env (gym.Env): openAI Gym environment
            memory_size (int): length of memory
            batch_size (int): batch size for sampling
            target_update (int): period for target model's hard update
            epsilon_decay (float): step size to decrease epsilon
            max_epsilon (float): max value of epsilon
            min_epsilon (float): min value of epsilon
            gamma (float): discount factor
        """
        obs_dim = env.observation_space.shape[0]
        action_dim = env.action_space.n
        
        self.env = env
        self.memory = ReplayBuffer(obs_dim, memory_size, batch_size)
        self.batch_size = batch_size
        self.epsilon = max_epsilon
        self.epsilon_decay = epsilon_decay
        self.max_epsilon = max_epsilon
        self.min_epsilon = min_epsilon
        self.target_update = target_update
        self.gamma = gamma
        
        # device: cpu / gpu
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"
        )
        print(self.device)

        # networks: dqn, dqn_target
        self.dqn = Network(obs_dim, action_dim).to(self.device)
        self.dqn_target = Network(obs_dim, action_dim).to(self.device)
        self.dqn_target.load_state_dict(self.dqn.state_dict())
        self.dqn_target.eval()
        
        # optimizer
        self.optimizer = optim.Adam(self.dqn.parameters())

        # transition to store in memory
        self.transition = list()
        
        # mode: train / test
        self.is_test = False

    def select_action(self, state: np.ndarray) -> np.ndarray:
        """Select an action from the input state."""
        # epsilon greedy policy
        if self.epsilon > np.random.random():
            selected_action = self.env.action_space.sample()
        else:
            selected_action = self.dqn(
                torch.FloatTensor(state).to(self.device)
            ).argmax()
            selected_action = selected_action.detach().cpu().numpy()
        
        if not self.is_test:
            self.transition = [state, selected_action]
        
        return selected_action

    def step(self, action: np.ndarray) -> Tuple[np.ndarray, np.float64, bool]:
        """Take an action and return the response of the env."""
        next_state, reward, done, _ = self.env.step(action)

        if not self.is_test:
            self.transition += [reward, next_state, done]
            self.memory.store(*self.transition)
    
        return next_state, reward, done

    def update_model(self) -> float:
        """Update the model by gradient descent."""
        samples = self.memory.sample_batch()

        loss = self._compute_dqn_loss(samples)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()
        
    def train(self, num_frames: int, plotting_interval: int = 200):
        """Train the agent."""
        self.is_test = False
        
        state = self.env.reset()
        update_cnt = 0
        epsilons = []
        losses = []
        scores = []
        score = 0

        for frame_idx in range(1, num_frames + 1):
            action = self.select_action(state)
            next_state, reward, done = self.step(action)

            state = next_state
            score += reward

            # if episode ends
            if done:
                state = self.env.reset()
                scores.append(score)
                score = 0

            # if training is ready
            if len(self.memory) >= self.batch_size:
                loss = self.update_model()
                losses.append(loss)
                update_cnt += 1
                
                # linearly decrease epsilon
                self.epsilon = max(
                    self.min_epsilon, self.epsilon - (
                        self.max_epsilon - self.min_epsilon
                    ) * self.epsilon_decay
                )
                epsilons.append(self.epsilon)
                
                # if hard update is needed
                if update_cnt % self.target_update == 0:
                    self._target_hard_update()

            # plotting
            if frame_idx % plotting_interval == 0:
                self._plot(frame_idx, scores, losses, epsilons)
                
        self.env.close()
                
    def test(self, video_folder: str) -> None:
        """Test the agent."""
        self.is_test = True
        
        # for recording a video
        naive_env = self.env
        self.env = gym.wrappers.RecordVideo(self.env, video_folder=video_folder)
        
        state = self.env.reset()
        done = False
        score = 0
        
        while not done:
            action = self.select_action(state)
            next_state, reward, done = self.step(action)

            state = next_state
            score += reward
        
        print("score: ", score)
        self.env.close()
        
        # reset
        self.env = naive_env

    def _compute_dqn_loss(self, samples: Dict[str, np.ndarray]) -> torch.Tensor:
        """Return dqn loss."""
        device = self.device  # for shortening the following lines
        state = torch.FloatTensor(samples["obs"]).to(device)
        next_state = torch.FloatTensor(samples["next_obs"]).to(device)
        action = torch.LongTensor(samples["acts"].reshape(-1, 1)).to(device)
        reward = torch.FloatTensor(samples["rews"].reshape(-1, 1)).to(device)
        done = torch.FloatTensor(samples["done"].reshape(-1, 1)).to(device)

        # G_t   = r + gamma * v(s_{t+1})  if state != Terminal
        #       = r                       otherwise
        curr_q_value = self.dqn(state).gather(1, action)
        next_q_value = self.dqn_target(
            next_state
        ).max(dim=1, keepdim=True)[0].detach()
        mask = 1 - done
        target = (reward + self.gamma * next_q_value * mask).to(self.device)

        # calculate dqn loss
        loss = F.smooth_l1_loss(curr_q_value, target)

        return loss

    def _target_hard_update(self):
        """Hard update: target <- local."""
        self.dqn_target.load_state_dict(self.dqn.state_dict())
                
    def _plot(
        self, 
        frame_idx: int, 
        scores: List[float], 
        losses: List[float], 
        epsilons: List[float],
    ):
        """Plot the training progresses."""
        clear_output(True)
        plt.figure(figsize=(20, 5))
        plt.subplot(131)
        plt.title('frame %s. score: %s' % (frame_idx, np.mean(scores[-10:])))
        plt.plot(scores)
        plt.subplot(132)
        plt.title('loss')
        plt.plot(losses)
        plt.subplot(133)
        plt.title('epsilons')
        plt.plot(epsilons)
        plt.show()
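The tensor operations inside _compute_dqn_loss (gather, max and the done mask) are easier to follow on a tiny hand-made batch. All numbers below are invented for illustration; the two Q-value tables stand in for the outputs of the online and target networks.

```python
import torch

gamma = 0.99

# Hypothetical network outputs for a batch of 3 transitions and 2 actions.
q_online = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])       # Q(s, .; theta)
q_target_next = torch.tensor([[0.5, 1.5], [2.0, 1.0], [9.0, 7.0]])  # Q(s', .; theta^-)

action = torch.tensor([[0], [1], [0]])         # actions actually taken
reward = torch.tensor([[1.0], [1.0], [1.0]])
done = torch.tensor([[0.0], [0.0], [1.0]])     # last transition is terminal

# Q(s, a): pick the Q-value of the action that was taken.
curr_q_value = q_online.gather(1, action)

# max_a' Q(s', a'; theta^-): best next action according to the target net.
next_q_value = q_target_next.max(dim=1, keepdim=True)[0]

# Terminal transitions do not bootstrap: the mask zeroes the second term.
mask = 1 - done
target = reward + gamma * next_q_value * mask

print(curr_q_value.squeeze(1))
print(target.squeeze(1))
```

For the terminal transition the target reduces to the reward alone, even though the target network predicts a large next value (9.0).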

Environment


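# environment (CartPole-v0, the task referred to earlier in this article)
env = gym.make("CartPole-v0")

seed = 777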

def seed_torch(seed):
    torch.manual_seed(seed)
    if torch.backends.cudnn.enabled:
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True

np.random.seed(seed)
seed_torch(seed)
env.seed(seed)

Initialize

# parameters
num_frames = 10000
memory_size = 1000
batch_size = 32
target_update = 100
epsilon_decay = 1 / 2000

agent = DQNAgent(env, memory_size, batch_size, target_update, epsilon_decay)
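To see what epsilon_decay = 1 / 2000 means in practice, the linear schedule from the train method can be replayed on its own. The sketch assumes the default max_epsilon = 1.0 and min_epsilon = 0.1 and counts model updates, not frames.

```python
max_epsilon, min_epsilon = 1.0, 0.1  # defaults of DQNAgent
epsilon_decay = 1 / 2000

epsilon = max_epsilon
history = []
for update in range(2500):
    # Same rule as in DQNAgent.train: a fixed step, floored at min_epsilon.
    epsilon = max(
        min_epsilon,
        epsilon - (max_epsilon - min_epsilon) * epsilon_decay,
    )
    history.append(epsilon)

print(history[0], history[999], history[-1])
```

Each update lowers epsilon by 0.9 / 2000 = 0.00045, so it reaches its floor of 0.1 after about 2000 updates and stays there for the rest of training.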

Train

agent.train(num_frames)

Test

video_folder = "videos/dqn"
agent.test(video_folder=video_folder)

Summary

You have now implemented DQN from scratch with an environment from OpenAI Gym.