hub / github.com/lazyprogrammer/machine_learning_examples / ddpg

Function ddpg

rl3/ddpg.py:87–296 · view source on GitHub ↗

(
    env_fn,
    ac_kwargs=dict(),
    seed=0,
    save_folder=None,
    num_train_episodes=100,
    test_agent_every=25,
    replay_size=int(1e6),
    gamma=0.99, 
    decay=0.995,
    mu_lr=1e-3,
    q_lr=1e-3,
    batch_size=100,
    start_steps=10000, 
    action_noise=0.1,
    max_episode_length=1000)

Source from the content-addressed store, hash-verified

85
86	### Implement the DDPG algorithm ###
87	def ddpg(
88	env_fn,
89	ac_kwargs=dict(),
90	seed=0,
91	save_folder=None,
92	num_train_episodes=100,
93	test_agent_every=25,
94	replay_size=int(1e6),
95	gamma=0.99,
96	decay=0.995,
97	mu_lr=1e-3,
98	q_lr=1e-3,
99	batch_size=100,
100	start_steps=10000,
101	action_noise=0.1,
102	max_episode_length=1000):
103
104	tf.set_random_seed(seed)
105	np.random.seed(seed)
106
107	env, test_env = env_fn(), env_fn()
108
109	# comment out this line if you don't want to record a video of the agent
110	if save_folder is not None:
111	test_env = gym.wrappers.Monitor(test_env, save_folder)
112
113	# get size of state space and action space
114	num_states = env.observation_space.shape[0]
115	num_actions = env.action_space.shape[0]
116
117	# Maximum value of action
118	# Assumes both low and high values are the same
119	# Assumes all actions have the same bounds
120	# May NOT be the case for all environments
121	action_max = env.action_space.high[0]
122
123	# Create Tensorflow placeholders (neural network inputs)
124	X = tf.placeholder(dtype=tf.float32, shape=(None, num_states)) # state
125	A = tf.placeholder(dtype=tf.float32, shape=(None, num_actions)) # action
126	X2 = tf.placeholder(dtype=tf.float32, shape=(None, num_states)) # next state
127	R = tf.placeholder(dtype=tf.float32, shape=(None,)) # reward
128	D = tf.placeholder(dtype=tf.float32, shape=(None,)) # done
129
130	# Main network outputs
131	with tf.variable_scope('main'):
132	mu, q, q_mu = CreateNetworks(X, A, num_actions, action_max, **ac_kwargs)
133
134	# Target networks
135	with tf.variable_scope('target'):
136	# We don't need the Q network output with arbitrary input action A
137	# because that's not actually used in our loss functions
138	# NOTE 1: The state input is X2, NOT X
139	# We only care about max_a{ Q(s', a) }
140	# Where this is equal to Q(s', mu(s'))
141	# This is because it's used in the target calculation: r + gamma * max_a{ Q(s',a) }
142	# Where s' = X2
143	# NOTE 2: We ignore the first 2 networks for the same reason
144	_, _, q_mu_targ = CreateNetworks(X2, A, num_actions, action_max, **ac_kwargs)

Callers 1

ddpg.pyFile · 0.85

Calls 11

storeMethod · 0.95

sample_batchMethod · 0.95

CreateNetworksFunction · 0.85

get_varsFunction · 0.85

ReplayBufferClass · 0.70

get_actionFunction · 0.70

test_agentFunction · 0.70

runMethod · 0.45

resetMethod · 0.45

sampleMethod · 0.45

stepMethod · 0.45

Tested by

no test coverage detected