In this section you can find information about the projects, including the project specification document and FAQs.

Project 3 - MCTS

During last week's laboratory classes I addressed several problems you will face when applying MCTS to the game you're using for the project.

I forgot to mention a very important one.

Dealing with a flat reward landscape

Some of you who implemented the first version of MCTS might have found it odd that MCTS is returning actions at random. If you take a closer look you will also see that the Q value for all initial actions is 0. This happens precisely because of the very flat reward landscape associated with this game. Let me show you with an illustrative example of what the reward for this game looks like:

In the overwhelming majority of playouts, the reward function will just return 0 because Sir Uthgard lost the game. It doesn't matter if he died immediately or if he was about to grab the final chest but ran out of time; the reward will be 0. Given the stochastic nature of MCTS, and the gigantic number of possible states (approximated by 25!), MCTS will have a lot of trouble finding out which actions lead to a win.

Now that the problem is identified, how do we deal with it?

a) One of the solutions, as pointed out in the slides, is to use Biased Playouts with good heuristics to help guide the playouts towards the few victory states. However, this might still be problematic if your heuristic is not good enough, or because there is still a stochastic component in the Biased Playout.

b) Change the reward function so that the game stops having a flat reward landscape. Your reward landscape should now look like this:


This reward landscape is much easier for MCTS to handle. 

How do you implement a reward function in this way? Well, think about how you can differentiate between a game where Sir Uthgard lost immediately and a game where Sir Uthgard almost won. In the case where he lost, how long did he survive? How many enemies did he kill, how many chests did he get? In the case where he won, did he barely win with a few seconds remaining, or did he win easily and with plenty of time left?
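For illustration only, here is a minimal sketch of what such a reward function could look like. The PlayoutResult fields and the weights are hypothetical; replace them with whatever your world model actually tracks and with weights that make sense for your implementation.

// Hypothetical snapshot of the state at the end of a playout.
public class PlayoutResult
{
    public int ChestsCollected;
    public int TotalChests;
    public int EnemiesKilled;
    public float TimeSurvived;   // seconds Sir Uthgard stayed alive
    public float TimeRemaining;  // seconds left on the clock when the game ended
    public float MaxTime;        // total time limit
    public bool Won;
}

public static class RewardShaping
{
    // Returns a raw (unnormalized) score: the closer Sir Uthgard got to winning,
    // the higher the value.
    public static float ShapedReward(PlayoutResult r)
    {
        if (r.Won)
        {
            // Any victory beats any defeat; finishing with time to spare is worth a bit more.
            return 10.0f + 2.0f * (r.TimeRemaining / r.MaxTime);
        }

        // Defeat: reward partial progress instead of returning a flat 0.
        float progress = (float)r.ChestsCollected / r.TotalChests;
        float combat = 0.1f * r.EnemiesKilled;
        float survival = 0.5f * (r.TimeSurvived / r.MaxTime);
        return 5.0f * progress + combat + survival;
    }
}

The important property is that any victory scores higher than any defeat, and defeats that were close to a victory score higher than immediate ones.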

However, be careful with one aspect. The constant factor C (usually the square root of 2) used in the UCT formula assumes that the reward function ranges between 0 and 1, or between -1 and 1. If your revised reward function does not fall within such an interval, then you need to either normalize it or scale the C factor proportionally.
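Continuing the sketch above (again, just an assumption rather than the required approach), one possibility is to clamp the raw score into [0, 1] before backing it up, using an estimated upper bound on the score:

// Minimal sketch: squash a raw score into [0, 1] so that the usual C = sqrt(2)
// keeps working. maxPossibleScore is a hypothetical upper bound you would
// estimate for your own scoring scheme.
public static float NormalizeReward(float rawScore, float maxPossibleScore)
{
    float value = rawScore / maxPossibleScore;
    if (value < 0.0f) value = 0.0f;
    if (value > 1.0f) value = 1.0f;
    return value;
}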

Also take into consideration that using options a) and b) together is a good idea. And some of the things you use for this revised reward function might also be good for the heuristic used in the biased playout.
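To make the idea of a biased playout concrete, here is one possible shape for the action choice inside the playout: actions are drawn with probability proportional to a softmax over a heuristic. The heuristic function, the generic types, and the softmax weighting are assumptions of this sketch, not requirements.

using System;
using System.Collections.Generic;
using System.Linq;

public static class BiasedPlayout
{
    // Picks an action during the playout, favoring actions the heuristic likes
    // while keeping some randomness.
    public static TAction BiasedPick<TState, TAction>(
        TState state, IList<TAction> actions,
        Func<TState, TAction, float> heuristic, Random rng)
    {
        // Softmax over heuristic values (for very large heuristic values you may
        // want to subtract the maximum before exponentiating to avoid overflow).
        float[] weights = actions.Select(a => (float)Math.Exp(heuristic(state, a))).ToArray();
        float total = weights.Sum();

        // Roulette-wheel selection proportional to the weights.
        float roll = (float)(rng.NextDouble() * total);
        for (int i = 0; i < actions.Count; i++)
        {
            roll -= weights[i];
            if (roll <= 0.0f) return actions[i];
        }
        return actions[actions.Count - 1]; // numerical safety net
    }
}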

Dealing with stochastic action effects

In the laboratory classes I also talked about how to deal with stochastic action effects. However, in two of the laboratories I forgot to mention one additional alternative that is very easy to implement.

For example, in the same time it takes to run 1 MCTS search with 1000 iterations, you can run 5 searches with 200 iterations each, or 10 with 100 iterations each, and then, when selecting the best initial action to perform, you can use the average value of each root child node over the n runs. This approach makes it easier to deal with random effects for the nodes stored in the search tree during the selection phase (the playout is always easy to deal with).
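A minimal sketch of this multi-run idea, assuming a hypothetical runOneSearch delegate that wraps your own MCTS implementation:

using System;
using System.Collections.Generic;
using System.Linq;

public static class MultiRunMCTS
{
    // runOneSearch stands in for your own MCTS: given an iteration budget, it
    // runs one full search from the current game state and returns, for each
    // action applicable in the initial state (keyed here by an action index),
    // the average Q value of the corresponding root child.
    public static int BestActionOverRuns(Func<int, Dictionary<int, float>> runOneSearch,
                                         int runs, int iterationsPerRun)
    {
        var totals = new Dictionary<int, float>();
        for (int r = 0; r < runs; r++)
        {
            foreach (var kv in runOneSearch(iterationsPerRun))
                totals[kv.Key] = totals.TryGetValue(kv.Key, out var sum) ? sum + kv.Value : kv.Value;
        }
        // Since every action appears once per run, the action with the largest
        // total is also the one with the largest average over the n runs.
        return totals.OrderByDescending(kv => kv.Value).First().Key;
    }
}

For instance, calling BestActionOverRuns(RunMyMcts, 5, 200), with RunMyMcts being your own wrapper, spends roughly the same budget as a single 1000-iteration search.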

 

Project 2 - Goal Bounding

The description for the Goal Bounding technique is not in the Book. You can find a generic description of GoalBounding in the following paper: http://www.cs.du.edu/~sturtevant/papers/goalpruning.pdf

Algorithm 1 in the paper describes, in a very generic way, the algorithm that calculates the GoalBounds. There are, however, a couple of aspects I would like to point out:
  • The initial part of the algorithm, which corresponds to the initialization (lines 2-10), is already done by me in the IAJMenuItem.cs file (you will find this class under the Editor folder).
  • Line 11 (the initial for loop) is also already done. You just need to implement the Dijkstra Floodfill and the calculation of the bounding boxes (lines 13-22).
  • The calculation/update of the BoundingBoxes can be performed with two distinct approaches:
    • After Dijkstra Floodfill finishes (as seen in Algorithm 1).
    • While Dijkstra Floodfill is running, as soon as a node is placed in the closed set.
  • Due to Dijkstra's properties, both approaches are equivalent. I think the second is simpler to implement (see the sketch after this list), but it is up to you.
  • Lines 16-20 might seem confusing, but they are just updating the bounding box for all defined axes. In our case we are using two axes, x and z. Anyway, this update is already implemented in the UpdateBounds method of the Bounds class. Feel free to use it.
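As a reference for the second approach, here is a minimal sketch with simplified, hypothetical types; in the project you would use the provided Bounds class and its UpdateBounds method rather than the GrowToInclude helper shown here.

// Hypothetical axis-aligned box over the two axes we care about (x and z).
public class GoalBoundingBox
{
    public float MinX = float.MaxValue, MaxX = float.MinValue;
    public float MinZ = float.MaxValue, MaxZ = float.MinValue;

    // Enlarge the box just enough to contain the point (x, z).
    public void GrowToInclude(float x, float z)
    {
        if (x < MinX) MinX = x;
        if (x > MaxX) MaxX = x;
        if (z < MinZ) MinZ = z;
        if (z > MaxZ) MaxZ = z;
    }
}

// Inside the Dijkstra Floodfill main loop, the moment a record is closed:
//
//     closedSet.Add(current);
//     boundsPerColor[current.Color].GrowToInclude(current.PositionX, current.PositionZ);
//
// where boundsPerColor keeps one box per start edge (color). The field names
// on "current" are hypothetical; use whatever your node record exposes.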

I also got a question asking me what exactly Dijkstra Floodfill is. Let me explain.

Dijkstra Floodfill

It is a standard implementation of the Dijkstra algorithm, but with an important addition. Let's assume that we create a faucet at each of the exits of the initial node where Dijkstra starts. Each of these faucets will have a different color (for illustration purposes). In addition to the normal properties in a node record (parent, g), each node record will now have an additional field representing the corresponding color, or faucet, for that node record. This color represents the closest faucet to a particular node. The next figure illustrates how the node records would look after Dijkstra Floodfill finishes.


The implementation of this is rather straightforward: whenever a new node record is generated or updated, its color is determined by the color of its parent, except for the initial red/blue/green node records, whose parent does not have a color (the initial node is colorless).
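A minimal sketch of that rule, with a hypothetical, simplified node record:

// Hypothetical simplification of the project's node record.
public class NodeRecord
{
    public NodeRecord Parent;
    public float G;        // cost so far
    public int Color = -1; // -1 means "no color yet" (the start node stays colorless)
}

public static class FloodfillColoring
{
    // Called whenever a child record is generated or updated during the floodfill.
    // parentEdgeColor identifies the start edge (faucet) used to generate the child,
    // and only matters when the parent is the colorless start node.
    public static void AssignColor(NodeRecord child, NodeRecord parent, int parentEdgeColor)
    {
        // Children of the start node take the color of the edge that generated them;
        // every other node simply inherits its parent's color.
        child.Color = parent.Color == -1 ? parentEdgeColor : parent.Color;
    }
}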


Project 1 Errata: 

In the slides regarding the RVO Movement Algorithm, the lines:

if(tc > 0) //future collision
  timePenalty = w/time

should be

if(tc > 0) //future collision
  timePenalty = w/tc

and the line:
penalty = distancePenalty + timePenalty
should be:
penalty = distancePenalty + maximumTimePenalty

The slides have already been updated with these changes.

Also, you can find an implementation of the TimeToCollision method in the MathHelper class.
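To see how the corrected lines fit together, here is a minimal sketch of the penalty computation for a single candidate velocity. Only the two corrected formulas come from the slides; the method signature, the precomputed timesToCollision list (one value per nearby character, obtained with MathHelper's TimeToCollision), and the weight w are assumptions of this sketch.

using System.Collections.Generic;
using UnityEngine;

public static class RVOErrataSketch
{
    public static float SamplePenalty(Vector3 candidate, Vector3 desiredVelocity,
                                      List<float> timesToCollision, float w)
    {
        // How far the candidate velocity deviates from the one we actually want.
        float distancePenalty = (desiredVelocity - candidate).magnitude;

        // Keep only the worst (largest) time penalty over all nearby characters.
        float maximumTimePenalty = 0.0f;
        foreach (float tc in timesToCollision)
        {
            if (tc > 0) // future collision
            {
                float timePenalty = w / tc;            // corrected: divide by tc
                if (timePenalty > maximumTimePenalty)
                    maximumTimePenalty = timePenalty;
            }
        }

        return distancePenalty + maximumTimePenalty;   // corrected: use the maximum
    }
}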

