Value Iteration
What does MDP stand for?
What is the reward hypothesis?
Which of the following is true regarding an MDP? \ 1. The environment is fully observable. \ 2. The future depends on both the present and the past states. \ 3. The probability of reaching the successor state depends only on the current state.
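As a reminder when working through statements 2 and 3, statement 3 describes the Markov property, which can be written as:

$$
P(S_{t+1} = s' \mid S_t, S_{t-1}, \dots, S_0) = P(S_{t+1} = s' \mid S_t)
$$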
An agent receives a representation of the environment at time step 't' and performs an action 'a'. The agent then receives a reward 'r' at time step ____ for the action 'a'.
Imagine an agent in a maze-like gridworld. You would like the agent to find the goal as quickly as possible, so you want its path to be short. You give the agent a reward of +1 when it reaches the goal, and the discount rate is 1.0 because this is an episodic task. When you run the agent, it finds the goal but does not seem to care how long it takes to complete each episode. How could you fix this?
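One way to reason about this question: with a discount rate of 1.0 and a reward given only at the goal, every path that eventually reaches the goal earns the same return, so the agent has no incentive to be quick. The minimal sketch below (the helper episode_return is hypothetical, written only for illustration) compares the return of a short and a long episode under the original reward design and under one possible fix, a per-step reward of -1.

```python
def episode_return(num_steps, step_reward, goal_reward, gamma=1.0):
    """Sum of discounted rewards for an episode lasting `num_steps` steps.

    The goal reward is received on the final step; every earlier step
    receives `step_reward`.
    """
    total = 0.0
    for t in range(num_steps):
        r = goal_reward if t == num_steps - 1 else step_reward
        total += (gamma ** t) * r
    return total

# Original design: 0 per step, +1 at the goal -> short and long paths
# earn the same return, so episode length does not matter to the agent.
print(episode_return(5, 0.0, 1.0))    # 1.0
print(episode_return(50, 0.0, 1.0))   # 1.0

# Possible fix: -1 per step (a discount rate below 1 has a similar effect),
# so shorter paths now earn strictly higher returns.
print(episode_return(5, -1.0, 1.0))   # -3.0
print(episode_return(50, -1.0, 1.0))  # -48.0
```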