Take a look at the files in the directory to better understand the structure of the project.

- `task.py`: Define your task (environment) in this file.
- `agents/`: Folder containing reinforcement learning agents.
  - `policy_search.py`: A sample agent has been provided here.
  - `agent.py`: Develop your agent here.
- `physics_sim.py`: This file contains the simulator for the quadcopter. DO NOT MODIFY THIS FILE.

For this project, you will define your own task in `task.py`. Although we have provided an example task to get you started, you are encouraged to change it. Later in this notebook, you will learn more about how to amend this file.
You will also design a reinforcement learning agent in `agent.py` to complete your chosen task.

You are welcome to create any additional files to help you to organize your code. For instance, you may find it useful to define a `model.py` file defining any needed neural network architectures.

We provide a sample agent in the code cell below to show you how to use the sim to control the quadcopter. This agent is even simpler than the sample agent that you'll examine (in `agents/policy_search.py`) later in this notebook!
The agent controls the quadcopter by setting the revolutions per second on each of its four rotors. The provided agent in the `Basic_Agent` class below always selects a random action for each of the four rotors. These four speeds are returned by the `act` method as a list of four floating-point numbers.

For this project, the agent that you will implement in `agents/agent.py` will have a far more intelligent method for selecting actions!
# In[]: Import all libraries here
import os
print(os.getcwd())

import numpy as np
from physics_sim import PhysicsSim
import copy
import random
import gym
import pandas as pd
import matplotlib.pyplot as plt
import time
import sys
import csv
import itertools

class Basic_Agent():
    def __init__(self, task):
        self.task = task

    def act(self):
        new_thrust = random.gauss(450., 25.)
        return [new_thrust + random.gauss(0., 1.) for x in range(4)]
Run the code cell below to have the agent select actions to control the quadcopter.
Feel free to change the provided values of `runtime`, `init_pose`, `init_velocities`, and `init_angle_velocities` below to change the starting conditions of the quadcopter.

The `labels` list below annotates statistics that are saved while running the simulation. All of this information is saved in a text file `data.txt` and stored in the dictionary `results`.
%load_ext autoreload
%autoreload 2

# Observe the vanilla task first.
from task import Task

# Modify the values below to give the quadcopter a different starting position.
runtime = 5.                                     # time limit of the episode
init_pose = np.array([0., 0., 10., 0., 0., 0.])  # initial pose
init_velocities = np.array([0., 0., 0.])         # initial velocities
init_angle_velocities = np.array([0., 0., 0.])   # initial angle velocities
file_output = 'data.txt'                         # file name for saved results

# Setup
task = Task(init_pose, init_velocities, init_angle_velocities, runtime)
agent = Basic_Agent(task)
done = False
labels = ['time', 'x', 'y', 'z', 'phi', 'theta', 'psi', 'x_velocity',
          'y_velocity', 'z_velocity', 'phi_velocity', 'theta_velocity',
          'psi_velocity', 'rotor_speed1', 'rotor_speed2', 'rotor_speed3', 'rotor_speed4']
results = {x: [] for x in labels}

# Run the simulation, and save the results.
with open(file_output, 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(labels)
    while True:
        rotor_speeds = agent.act()
        _, _, done = task.step(rotor_speeds)
        to_write = [task.sim.time] + list(task.sim.pose) + list(task.sim.v) + list(task.sim.angular_v) + list(rotor_speeds)
        for ii in range(len(labels)):
            results[labels[ii]].append(to_write[ii])
        writer.writerow(to_write)
        if done:
            break
Run the code cell below to visualize how the position of the quadcopter evolved during the simulation.
%matplotlib inline
plt.plot(results['time'], results['x'], label='x')
plt.plot(results['time'], results['y'], label='y')
plt.plot(results['time'], results['z'], label='z')
plt.legend()
_ = plt.ylim()
The next code cell visualizes the velocity of the quadcopter.
plt.plot(results['time'], results['x_velocity'], label='x_hat')
plt.plot(results['time'], results['y_velocity'], label='y_hat')
plt.plot(results['time'], results['z_velocity'], label='z_hat')
plt.legend()
_ = plt.ylim()
Next, you can plot the Euler angles (the rotation of the quadcopter over the $x$-, $y$-, and $z$-axes),
plt.plot(results['time'], results['phi'], label='phi')
plt.plot(results['time'], results['theta'], label='theta')
plt.plot(results['time'], results['psi'], label='psi')
plt.legend()
_ = plt.ylim()
before plotting the velocities (in radians per second) corresponding to each of the Euler angles.
plt.plot(results['time'], results['phi_velocity'], label='phi_velocity')
plt.plot(results['time'], results['theta_velocity'], label='theta_velocity')
plt.plot(results['time'], results['psi_velocity'], label='psi_velocity')
plt.legend()
_ = plt.ylim()
Finally, you can use the code cell below to print the agent's choice of actions.
plt.plot(results['time'], results['rotor_speed1'], label='Rotor 1 revolutions / second')
plt.plot(results['time'], results['rotor_speed2'], label='Rotor 2 revolutions / second')
plt.plot(results['time'], results['rotor_speed3'], label='Rotor 3 revolutions / second')
plt.plot(results['time'], results['rotor_speed4'], label='Rotor 4 revolutions / second')
plt.legend()
_ = plt.ylim()
When specifying a task, you will derive the environment state from the simulator. Run the code cell below to print the values of the following variables at the end of the simulation:
- `task.sim.pose` (the position of the quadcopter in ($x,y,z$) dimensions and the Euler angles),
- `task.sim.v` (the velocity of the quadcopter in ($x,y,z$) dimensions), and
- `task.sim.angular_v` (radians/second for each of the three Euler angles).

# the pose, velocity, and angular velocity of the quadcopter at the end of the episode
print(task.sim.pose)
print(task.sim.v)
print(task.sim.angular_v)
In the sample task in `task.py`, we use the 6-dimensional pose of the quadcopter to construct the state of the environment at each timestep. However, when amending the task for your purposes, you are welcome to expand the size of the state vector by including the velocity information. You can use any combination of the pose, velocity, and angular velocity - feel free to tinker here, and construct the state to suit your task.
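For example, a richer state could be assembled by concatenating the pose, velocity, and angular velocity from the simulator. A minimal sketch, assuming the same `task.sim` attributes printed above and the action-repeat convention used by the sample task:

# Illustrative only: build a 12-dimensional state from pose (6), velocity (3),
# and angular velocity (3), then tile it once per action repeat.
def build_state(sim, action_repeat=3):
    single_state = np.concatenate([sim.pose, sim.v, sim.angular_v])  # shape (12,)
    return np.tile(single_state, action_repeat)                      # shape (12 * action_repeat,)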
A sample task has been provided for you in `task.py`. Open this file in a new window now.
The `__init__()` method is used to initialize several variables that are needed to specify the task.

- The simulator is initialized as an instance of the `PhysicsSim` class (from `physics_sim.py`).
- For each timestep of the agent, we step the simulation `action_repeats` timesteps. If you are not familiar with action repeats, please read the Results section in the DDPG paper.
- To set the size of the state (`state_size`), we must take action repeats into account.
- The environment has a 4-dimensional action space, with one entry for each rotor (`action_size=4`). You can set the minimum (`action_low`) and maximum (`action_high`) values of each entry here.

The `reset()` method resets the simulator. The agent should call this method every time the episode ends. You can see an example of this in the code cell below.

The `step()` method is perhaps the most important. It accepts the agent's choice of action `rotor_speeds`, which is used to prepare the next state to pass on to the agent. Then, the reward is computed from `get_reward()`. The episode is considered done if the time limit has been exceeded, or the quadcopter has travelled outside of the bounds of the simulation.
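Putting these pieces together, a minimal sketch of a task with this interface could look like the following (the `PhysicsSim` method names such as `next_timestep()`, and the simple distance-based reward, are assumptions for illustration, not a copy of the provided `task.py`):

class SketchTask:
    """Illustrative task skeleton following the structure described above."""
    def __init__(self, init_pose=None, init_velocities=None,
                 init_angle_velocities=None, runtime=5., target_pos=None):
        self.sim = PhysicsSim(init_pose, init_velocities, init_angle_velocities, runtime)
        self.action_repeat = 3                      # step the sim this many times per agent step
        self.state_size = self.action_repeat * 6    # 6-dimensional pose, repeated
        self.action_low = 0                         # minimum rotor speed
        self.action_high = 900                      # maximum rotor speed
        self.action_size = 4                        # one entry per rotor
        self.target_pos = target_pos if target_pos is not None else np.array([0., 0., 10.])

    def get_reward(self):
        # Simple continuous reward: penalize distance from the target position.
        return 1. - .3 * (abs(self.sim.pose[:3] - self.target_pos)).sum()

    def step(self, rotor_speeds):
        reward = 0
        pose_all = []
        for _ in range(self.action_repeat):
            done = self.sim.next_timestep(rotor_speeds)   # assumed PhysicsSim API
            reward += self.get_reward()
            pose_all.append(self.sim.pose)
        next_state = np.concatenate(pose_all)
        return next_state, reward, done

    def reset(self):
        self.sim.reset()                                  # assumed PhysicsSim API
        return np.concatenate([self.sim.pose] * self.action_repeat)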
In the next section, you will learn how to test the performance of an agent on this task.
The sample agent given in `agents/policy_search.py` uses a very simplistic linear policy to directly compute the action vector as a dot product of the state vector and a matrix of weights. Then, it randomly perturbs the parameters by adding some Gaussian noise, to produce a different policy. Based on the average reward obtained in each episode (`score`), it keeps track of the best set of parameters found so far, how the score is changing, and accordingly tweaks a scaling factor to widen or tighten the noise.
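The core idea can be sketched roughly as follows (a simplified illustration, not the exact contents of `agents/policy_search.py`):

# Hill-climbing sketch: a linear policy whose weights are perturbed with Gaussian noise.
class LinearPolicySketch:
    def __init__(self, task):
        self.task = task
        self.best_w, self.best_score = None, -np.inf
        self.noise_scale = 0.1
        self.w = np.random.normal(
            size=(task.state_size, task.action_size),
            scale=(task.action_high - task.action_low) / (2 * task.state_size))

    def act(self, state):
        # Action vector is a dot product of the state vector and the weight matrix.
        return np.dot(state, self.w)

    def update(self, score):
        # Keep the best parameters found so far; tighten the noise after an
        # improvement, widen it otherwise, then perturb the best weights again.
        if score > self.best_score:
            self.best_w, self.best_score = self.w, score
            self.noise_scale = max(0.5 * self.noise_scale, 0.01)
        else:
            self.noise_scale = min(2.0 * self.noise_scale, 3.2)
        self.w = self.best_w + self.noise_scale * np.random.normal(size=self.w.shape)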
Run the code cell below to see how the agent performs on the sample task.
from agents.policy_search import PolicySearch_Agent
from task import Task

num_episodes = 1000
target_pos = np.array([0., 0., 10.])
task = Task(target_pos=target_pos)
agent = PolicySearch_Agent(task)

for i_episode in range(1, num_episodes+1):
    state = agent.reset_episode()  # start a new episode
    while True:
        action = agent.act(state)
        next_state, reward, done = task.step(action)
        agent.step(reward, done)
        state = next_state
        if done:
            print("\rEpisode = {:4d}, score = {:7.3f} (best = {:7.3f}), noise_scale = {}".format(
                i_episode, agent.score, agent.best_score, agent.noise_scale), end="")  # [debug]
            break
    sys.stdout.flush()
This agent should perform very poorly on this task. And that's where you come in!
Amend `task.py` to specify a task of your choosing. If you're unsure what kind of task to specify, you may like to teach your quadcopter to takeoff, hover in place, land softly, or reach a target pose.

After specifying your task, use the sample agent in `agents/policy_search.py` as a template to define your own agent in `agents/agent.py`. You can borrow whatever you need from the sample agent, including ideas on how you might modularize your code (using helper methods like `act()`, `learn()`, `reset_episode()`, etc.).
Note that it is highly unlikely that the first agent and task that you specify will learn well. You will likely have to tweak various hyperparameters and the reward function for your task until you arrive at reasonably good behavior.
As you develop your agent, it's important to keep an eye on how it's performing. Use the code above as inspiration to build in a mechanism to log/save the total rewards obtained in each episode to file. If the episode rewards are gradually increasing, this is an indication that your agent is learning.
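One lightweight way to do this, sketched along the same lines as the CSV logging above (the file name and helper function are arbitrary choices):

# Sketch: append one row per episode so learning progress can be plotted later.
rewards_file = 'episode_rewards.csv'   # arbitrary file name
with open(rewards_file, 'w') as f:
    csv.writer(f).writerow(['episode', 'total_reward'])

def log_episode(i_episode, total_reward, filename=rewards_file):
    # Re-opening in append mode keeps the log intact if training is interrupted.
    with open(filename, 'a') as f:
        csv.writer(f).writerow([i_episode, total_reward])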
## TODO: Train your agent here.

# Load the needed libraries first.
from collections import namedtuple, deque
from keras import layers, models, optimizers, activations, initializers, regularizers
from keras import backend as K

# Class imports.
from agents.agent import DDPG
from agents.agent import ReplayBuffer
from agents.agent import OUNoise
from agents.agent import Actor
from agents.agent import Critic

# My production task.
from task_prod import Task

# Specify initial and target positions.
# Maybe starting from a slightly off position would perform better than from the ground?
init_pose = np.array([10., 10., 10., 0., 0., 0.])  # initial pose
init_velocities = np.array([0., 0., 0.])           # initial velocities
init_angle_velocities = np.array([0., 0., 0.])     # initial angle velocities
target_pos = np.array([10., 10., 100.])            # goal is to go vertically up a little bit

task = Task(target_pos=target_pos)  # note: initial conditions are not passed here - relying on Task / physics_sim defaults?

labels = ['time', 'x', 'y', 'z', 'phi', 'theta', 'psi', 'x_velocity',
          'y_velocity', 'z_velocity', 'phi_velocity', 'theta_velocity',
          'psi_velocity', 'rotor_speed1', 'rotor_speed2', 'rotor_speed3', 'rotor_speed4']
labels2 = labels + ['score']  # for scores

worst_score = 1000000
best_score = -1000000.

# Output data for review.
reward_log = "reward.txt"
results_log = "results.txt"
reward_labels = ['episode', 'reward']
reward_results = {x: [] for x in reward_labels}

# Instantiate the agent.
# My agent is highly parameterized for manual grid search;
# it also prints the Keras summary for the networks,
# but I did not bother with TensorBoard this time.
# The full pipeline is in quad_project_full_code.py.
v_actor_layer1_units = 400
v_actor_layer2_units = 300
v_actor_activation = 'relu'
v_actor_raw_action = 'tanh'
v_critic_layer1_units = 400
v_critic_layer2_units = 300
v_critic_activation = 'relu'
v_buffer_size = 100000
v_batch_size = 64
v_gamma = 0.99
v_tau = 0.001
v_lrate = 0.001
v_dropout_rate = 0.02
v_kernel_l2_reg = 1e-6

agent = DDPG(task, gym=False,
             actor_layer1_units=v_actor_layer1_units,
             actor_layer2_units=v_actor_layer2_units,
             actor_activation=v_actor_activation,
             actor_raw_action=v_actor_raw_action,
             critic_layer1_units=v_critic_layer1_units,
             critic_layer2_units=v_critic_layer2_units,
             critic_activation=v_critic_activation,
             buffer_size=v_buffer_size,
             batch_size=v_batch_size,
             gamma=v_gamma,
             tau=v_tau,
             lrate=v_lrate,
             dropout_rate=v_dropout_rate,
             kernel_l2_reg=v_kernel_l2_reg)
# In[]:
num_episodes = 500
results_train = []

for i_episode in range(1, num_episodes+1):
    state = agent.reset_episode()  # start a new episode
    score = 0
    while True:
        action = agent.act(state)
        next_state, reward, done = task.step(action)
        agent.step(action, reward, next_state, done)
        state = next_state
        score += reward
        best_score = max(best_score, score)
        worst_score = min(worst_score, score)
        # Record stats from each timestep so they can be reviewed later.
        to_write = [task.sim.time] + list(task.sim.pose) + list(task.sim.v) + list(task.sim.angular_v) + list(action)
        to_write.append(score)
        results_train.append(to_write)
        if done:
            print("\rEpisode = {:4d}, score = {:7.3f} (best = {:7.3f} , worst = {:7.3f})".format(
                i_episode, score, best_score, worst_score), end="")
            break
    reward_results['episode'].append(i_episode)
    reward_results['reward'].append(score)
    sys.stdout.flush()
Once you are satisfied with your performance, plot the episode rewards, either from a single run, or averaged over multiple runs.
#plot in another cell
plt.plot(reward_results['episode'], reward_results['reward'], label='reward/episode')
plt.legend()
_ = plt.ylim()
# In[]: Plot position
output_df = pd.DataFrame(results_train, columns=labels2)
output_df.to_csv(results_log)
plt.plot(output_df.index, output_df['x'], label='x')
plt.plot(output_df.index, output_df['y'], label='y')
plt.plot(output_df.index, output_df['z'], label='z')
plt.legend()
_ = plt.ylim()
# In[]: Plot scores (for each time step)
plt.plot(output_df.index, output_df['score'], label='score')
plt.legend()
_ = plt.ylim()
# output the rewards
output_df = pd.DataFrame(reward_results)
output_df.to_csv(reward_log)
# In[]: Season finale performance
print("Average reward over the last 10 episodes: {}".format(np.sum(reward_results['reward'][-10:])/10))
Question 1: Describe the task that you specified in `task.py`. How did you design the reward function?
Answer:
I specified the simple task of moving to a target position - specifically, gaining altitude by increasing the Z-coordinate. I tried several continuous reward functions based on reaching the target position.
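As an illustration of the general shape (not necessarily the exact function used in task_prod.py), such a reward penalizes the distance from the target position:

# Illustrative sketch of a continuous, distance-based reward
# (not the exact function used in task_prod.py).
def reward_sketch(sim_pose, target_pos):
    distance = np.linalg.norm(sim_pose[:3] - target_pos)
    reward = 1.0 - 0.3 * distance           # baseline minus a distance penalty
    return np.clip(reward, -1.0, 1.0)       # clipping keeps the reward scale bounded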
Question 2: Discuss your agent briefly, using the following questions as a guide:
Answer:
I kept the recommended DDPG algorithm and spent most of my time building the actor/critic neural networks and fine-tuning them via a manual grid-search pipeline, which I shared in quad_project_full_code.py. In order for such a grid search to work, I had to parameterize as much as possible. The full list of hyperparameters used is shown above. Given more time, more fine-tuning would perhaps have improved performance.
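The kind of manual sweep this parameterization enables can be sketched as follows (illustrative only; `train_and_score` is a hypothetical helper, and the actual pipeline in quad_project_full_code.py may differ):

# Illustrative manual grid search over a few hyperparameters.
param_grid = {
    'lrate': [1e-3, 1e-4],
    'tau': [1e-2, 1e-3],
    'batch_size': [64, 128],
}

best_params, best_avg_reward = None, -np.inf
for combo in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), combo))
    avg_reward = train_and_score(task, **params)   # hypothetical helper: trains one agent and
                                                   # returns its average reward over the last episodes
    if avg_reward > best_avg_reward:
        best_params, best_avg_reward = params, avg_reward
print(best_params, best_avg_reward)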
The networks that I created using Keras are printed via its summary() function above. Here is a brief overview.

Actor network: two dense layers with L2-regularization, 'relu' activation, batch normalization, and dropout. The last layer uses a 'tanh' activation.

Critic network: for states - two dense layers with L2-regularization and 'relu' activation, but only one of them followed by batch normalization and dropout. For actions - a single dense layer with L2-regularization and 'relu' activation. The two streams are then joined, activated, and passed through a final dense layer to produce the Q-values.
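A rough Keras sketch of that two-stream critic (an approximation of what the summary() output above shows; layer sizes are taken from the hyperparameters listed earlier, and the placement of batch norm + dropout after the first state layer is an assumption):

# Rough sketch of the critic's two-stream structure described above.
from keras import layers, models, regularizers

def build_critic_sketch(state_size, action_size, l2=1e-6):
    states = layers.Input(shape=(state_size,), name='states')
    actions = layers.Input(shape=(action_size,), name='actions')

    # State stream: two dense layers; one of them followed by batch norm + dropout.
    net_s = layers.Dense(400, activation='relu', kernel_regularizer=regularizers.l2(l2))(states)
    net_s = layers.BatchNormalization()(net_s)
    net_s = layers.Dropout(0.02)(net_s)
    net_s = layers.Dense(300, activation='relu', kernel_regularizer=regularizers.l2(l2))(net_s)

    # Action stream: a single dense layer.
    net_a = layers.Dense(300, activation='relu', kernel_regularizer=regularizers.l2(l2))(actions)

    # Join the two streams, activate, and map to a single Q-value.
    net = layers.Add()([net_s, net_a])
    net = layers.Activation('relu')(net)
    q_values = layers.Dense(1, name='q_values')(net)

    return models.Model(inputs=[states, actions], outputs=q_values)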
Surprisingly, I ended up having a whole lot more units than in the vanilla code: 300-400 units for both Actor and Critic. In the future it might be worthwhile for me to test if I can get good performance with fewer units.
Question 3: Using the episode rewards plot, discuss how the agent learned over time.
Answer:
I found it hard to train the agent as desired. Perhaps it would have been easier to start with a smaller state and action space. My agent seemed to be learning, but not in the smooth, consistent way I would have preferred.
As you can see from the plot above, performance was volatile for roughly the first 300 episodes, after which there is a noticeable improvement, albeit with some poorly performing outliers.
The agent's average performance over the last 10 episodes was roughly 300 on a scale of (100, 550), which is slightly above the midpoint and would have been better without the outliers.
Question 4: Briefly summarize your experience working on this project. You can use the following prompts for ideas.
Answer:
This module proved to be the hardest project for me in the MLND, due to my lack of familiarity with the topic and my inability to find one good, all-encompassing textbook I could peck at every day. The Udacity videos and academic articles helped me understand the general concepts, but translating the algorithms into Python was challenging.
For this project, it helped me to study the parts of the algorithm separately first to get comfortable, and then put everything together. I put my exploration of the auxiliary functions in these scripts: aux_ou_noise.py and aux_replay_buffer.py.
In the end, I did enjoy working on the Quadcopter project. If I work on an RL project in the future, though, I will definitely take the time to perfect the simplest tasks first and then increase complexity.
The End