Dev Log

Week 5 : More Experiments

As the title suggests, this week involved a lot of experiments. The bad news is that the AI is still not close to performing optimally. We have not been able to achieve convergence in any of our experimental setups yet. However, on the positive side, we are getting closer to having a graphical interface for playing the game (one boss fight of Slay the Spire). Once this is up, we can send this out to others to test and it can help us visualize how the AI is playing our game. We are adding a replay and recording functionality into the graphical interface to achieve this. Additionally, we identified a few bugs in our game and dedicated some time to fix them as well.

Experiments with the AI

Reward Calculation Modifications

The week started with an important change to the AI: we updated the reward function to manually calculate rewards for certain cards that involve buffs (Flex, Double Tap, Disarm, Clothesline) and a few block cards (Defend, Shrug It Off, Iron Wave). There is a base positive reward for every point of damage done. Since these cards do not deal damage directly, it makes sense to calculate their effectiveness by backtracking from what happens after they are played.

  • Flex – Playing the flex card gains reward equal to 50% of damage done by attacks played after it in the same turn. This provides an incentive to play it as early as possible. ‘50%’ sure seems high but keep in mind that Flex is a zero cost card hence the value obtained from this card is quite high (thinking about value in terms of damage/energy use).
  • Double Tap – Gains reward equal to 50% of the damage dealt by the next attack played.
  • Disarm – Gains reward proportional to the instances of boss damage (throughout the remainder of the game) after this card was played.
  • Clothesline – Gains reward proportional to the amount of boss damage inflicted on the player in the next two turns after this card has been played.
  • Defend / Shrug It Off / Iron Wave – Gains reward proportional to the amount of block used (calculated by looking at the next boss turn after playing this card). Block used is the actual block value that mitigated damage from the boss.
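The Flex, Double Tap and block rules above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Play` record, the function names and the damage numbers are hypothetical and are not taken from our training code.

```python
from dataclasses import dataclass

@dataclass
class Play:
    card: str    # card name, e.g. "Flex" or "Strike" (hypothetical record)
    damage: int  # damage this play dealt to the boss (0 for non-attacks)

def shaped_bonuses(turn, flex_share=0.5, double_tap_share=0.5):
    """Return one bonus reward per play within a single turn."""
    bonuses = [0.0] * len(turn)
    for i, play in enumerate(turn):
        # Look ahead at the attacks played later in the same turn.
        later_attacks = [p.damage for p in turn[i + 1:] if p.damage > 0]
        if play.card == "Flex":
            bonuses[i] = flex_share * sum(later_attacks)
        elif play.card == "Double Tap" and later_attacks:
            bonuses[i] = double_tap_share * later_attacks[0]
    return bonuses

def block_used(block_gained, boss_damage):
    """Block that actually mitigated damage on the next boss turn."""
    return min(block_gained, boss_damage)

turn = [Play("Flex", 0), Play("Strike", 8), Play("Double Tap", 0), Play("Bash", 10)]
print(shaped_bonuses(turn))  # [9.0, 0.0, 5.0, 0.0]
print(block_used(5, 12))     # 5
```

Because Flex is credited with every later attack in the turn, playing it first always yields the largest bonus, which matches the incentive described above.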

These reward modifications did not yield any favorable results at first glance, which led us to drop them completely in subsequent experiments. However, we do think that the code used to do these reward modifications can be useful sometime in the future since it is a good way for us to calculate a card’s effectiveness.

Choosing Action Functionality

The next big thing that we tried this week involved modifying the way the actions were being chosen out of the array of q-values generated by the AI prediction process.

Earlier, we chose the best action by looking at the predicted q-values for each card and picking the card with the highest q-value among those that were playable. One reason this could be a problem is that the neural network is unaware of this filtering step, so it makes predictions as if all cards were available to play. Logically there is an obvious flaw here. The effect is even more pronounced when considering the Q-Learning rule for estimating reward ( q(s,a) = reward + max over a’ of q(s’,a’) ). In words, the equation states that the expected reward from taking action a (playing a card) in state s (the current game state) is equal to the immediate reward from playing the card added to the maximum q-value possible for the next state. Hence, when the AI tries to predict q-values, it assumes that even in the next state, all cards are available to play. This has an inflating effect on the q-values and we have seen q-values accelerating to very high numbers.
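To make the inflation concrete, here is a small sketch of the one-step target computed with and without restricting the max to playable cards. The function name, the `gamma` parameter (set to 1.0 to match the rule as written above) and the numbers are assumptions for illustration only.

```python
def q_target(reward, next_q, playable=None, gamma=1.0):
    """One-step Q-learning target: reward + gamma * max over next actions."""
    if playable is None:
        candidates = list(next_q)  # assumes every card can be played
    else:
        candidates = [q for q, ok in zip(next_q, playable) if ok]
    best_next = max(candidates) if candidates else 0.0
    return reward + gamma * best_next

next_q = [4.0, 9.0, 2.5]         # predicted q-values in the next state
playable = [True, False, True]   # the 9.0 card cannot actually be played

unmasked = q_target(3.0, next_q)          # 12.0, uses the unplayable card
masked = q_target(3.0, next_q, playable)  # 7.0, uses only playable cards
print(unmasked, masked)
```

The unmasked target is larger whenever the best-looking card is unplayable, and that overestimate is fed back into training on every update.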

To fix this, we now force the neural network to choose an action that is playable in the given state. We do this by giving a negative reward when the neural network attributes the highest q-value to a card that is unplayable in the given state. By doing this, the neural network is forced to pick a card that is playable. This has its own set of challenges but for now we are experimenting more in this direction.
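A minimal sketch of this idea, assuming a fixed penalty value and a boolean playability list (both hypothetical, as is the function name):

```python
def pick_with_penalty(q_values, playable, invalid_penalty=-1.0):
    """Take the raw argmax and penalize it if that card cannot be played."""
    choice = max(range(len(q_values)), key=lambda i: q_values[i])
    if playable[choice]:
        return choice, None         # reward comes from the game as usual
    return choice, invalid_penalty  # card is not played; this is the training reward

print(pick_with_penalty([0.2, 0.9, 0.5], [True, False, True]))  # (1, -1.0)
print(pick_with_penalty([0.2, 0.1, 0.5], [True, False, True]))  # (2, None)
```

Unlike filtering the q-values after prediction, this puts the playability information into the training signal itself, so the network can learn it.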

Curse of Dimensionality (High Number Of States)

Although we are breaking down the game state into a few variables (around 100) with discrete values, the complexity of this task has a lot to do with the high number of possible states. Right now we know two things for sure:

  • AI Agent is unable to find the best strategy to play the game
  • AI Agent is unable to optimize the card play order in each turn

One of the reasons for this could be that it will take a high number of games for the AI to visit enough different states to converge. Right now, it takes about an hour to play 1000 games (each game has multiple turns and each turn has multiple card play steps).

One of the things that we are looking at right now is feature engineering. Is there a clever way to extract important information from the game state and represent it in a fewer number of states? The following is what we are looking at:

  • Generalize card states by replacing individual cards with card genres (card name → card genre): Instead of checking which specific cards are in the player’s hand, we only check how many attack cards, block cards and buff cards the hand contains.
  • Player/Boss Buffs: Maybe this does not impact the game state too much and we can remove this altogether.
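The first idea could look something like the sketch below. The genre assigned to each card follows the grouping used earlier in this post for the reward changes; "Strike" and "Bash" and the function name are assumptions added for illustration.

```python
from collections import Counter

# Hypothetical mapping from card name to genre.
CARD_GENRE = {
    "Strike": "attack", "Bash": "attack",
    "Defend": "block", "Shrug It Off": "block", "Iron Wave": "block",
    "Flex": "buff", "Double Tap": "buff", "Disarm": "buff", "Clothesline": "buff",
}

def hand_features(hand):
    """Collapse a hand of named cards into three genre counts."""
    counts = Counter(CARD_GENRE[card] for card in hand)
    return [counts["attack"], counts["block"], counts["buff"]]

print(hand_features(["Strike", "Strike", "Defend", "Flex", "Iron Wave"]))  # [2, 2, 1]
```

This shrinks the hand from one input per card to three inputs, at the cost of no longer distinguishing between cards of the same genre.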

Unity Front End

Last week we built a Unity front-end system which could only render a static scene. Since animation is so important for players to know what is happening, we added animations to our front end this week!


In a normal Unity game, adding animation is easy since we can easily get references to all the GameObject instances we want to animate. But since our game is running in Python, all the object instances are stored in the Python runtime. An object instance sharing mechanism is hard to fit into our current request/response architecture between C# and Python.


We extended our “markup-language” to support animation. All the markups in one “game sequence markup file” share the same ID space, so they can find instances by ID. The flow of this is shown below.

Playback System in Unity

As mentioned in the introduction above, we need a playback system in Unity in order to watch replays of games played by the AI (or humans). This tool can be incredibly useful to us because:

  • Store valuable information permanently : Previously we didn’t have efficient ways to store how the AI/players play the game, but with this system we can store records as long as we version them and keep backward compatibility.
  • Clear graphical visualization : Even a single minute of the AI’s gameplay generates hundreds of lines of logs which are hard to read. With a graphical interface, evaluating what the AI is doing becomes much easier.

Playback Tool based on Current Architecture

This week we implemented a playback tool for the reasons cited above that is completely compatible with our current architecture. This includes:

  • Front-End Rendering System which completely relies on the game data in markup files.
  • Mechanism to Encode Gameplay Data into markup files.
  • We store the data generated during runtime, and use the same Unity based rendering system to replay it.

Refactoring the Data Management Module

As part of integrating the python gameplay system with Unity rendering and the replay mechanism, we also implemented a database module to handle game data drawn from markup files. The following are some reasons why we needed to do this:

  • Easier for developers to configure and modify data. This is in preparation for the design tools that we are looking to build.
  • Decoupling data from gameplay is one of the most important aspects of our original design.
  • Preparation for game application builds

The following design supports multiple game versions such that each game version’s data is isolated from the others. Each game supports multiple decks, and has its own rule set and card set.

The following is a brief description of what the database module handbook looks like:

Bug Fixes

We identified a few bugs this week during manual playtests and AI training. In the previous version, the boss transformed from offensive mode to defensive mode only when it took more than 30 damage in a single round. In the real Slay the Spire, the damage that counts towards the transformation accumulates from the last transformation onwards.
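The corrected behavior can be sketched as a counter that is reset only when the boss transforms, not at the end of every round. The class and attribute names are hypothetical, and the sketch only covers the offensive-to-defensive direction described above.

```python
class ModeShiftTracker:
    """Tracks damage taken by the boss since its last transformation."""

    def __init__(self, threshold=30):
        self.threshold = threshold
        self.accumulated = 0
        self.mode = "offensive"

    def take_damage(self, amount):
        """Record damage; return True if the boss transforms as a result."""
        self.accumulated += amount
        if self.mode == "offensive" and self.accumulated > self.threshold:
            self.mode = "defensive"
            self.accumulated = 0  # reset only on transformation
            return True
        return False

boss = ModeShiftTracker()
# Damage dealt in three separate rounds; no per-round reset happens in between.
shifted = [boss.take_damage(d) for d in (12, 10, 9)]
print(shifted, boss.mode)  # [False, False, True] defensive
```

With the old per-round reset, none of these three rounds would have triggered the transformation on its own.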

In order to fix this, we had to take another look at our game flow and make some changes to it. The following is a representation of what our (cleaned up) current game loop looks like:

Although the AI training is still not yielding favorable results, we are learning a lot about the application of reinforcement learning to strategy games. It will surely be fun to watch the AI play in the coming week. We can then judge to see how smart it is!

Week 4 : Dawn Of The AI

This week involved many things. We had a chance to interact with many ETC faculty during the quarter walkarounds. This was a great chance for us to showcase what we had accomplished so far and talk about our direction. We ran AI training multiple times with different model structures and hyperparameter values. Unfortunately, we could never achieve a win rate of more than 15.7%, which means that we still have a lot of work to do on the machine learning aspect of the project. Meanwhile, we also came up with a Unity frontend system for visualizing the gameplay and created a Django (Python) website on a local server that gives designers a simple GUI to edit card values and even create new ones.

Feedback from Quarter Walkarounds

Here’s a list of some of the most important feedback we received:

  • We need to come up with one or two metrics that we ‘target’. By targeting a metric, we want to measure the impact of making game balance updates on these metrics. For example, one of the metrics that we are leaning towards is win rate. By focusing on win rate, we can put game balance changes in perspective to judge what’s good and what’s not.
  • We need to reach out to people in the reinforcement learning space who can help us out with the AI part. Since this space is technically challenging and still quite unexplored, it would be a good idea to get in touch with someone more experienced in the space.
  • We need to create an organized system that can help game designers interact with the AI. This can be to help with inference from general statistics by creating data visualizations or even an interface to train the AI after making game design updates.
  • We need to write a paper as our final deliverable. This makes sense because what we are doing is highly exploratory. A good way to extract value from the work that we have done is to document it all for people who may want to work in the same space after us.

AI Training Experiments

This week involved a lot of AI training experiments. In all honesty, none of our experiments were successful. The highest win rate that we ever got was 15.7% and that is not very good. However, it is still a little better than a random bot. There were two models that we implemented this week:

Model 1 : Single Model With All Input Data

This is a single model that takes in all of the state input at the same time. This results in 98 input neurons that look like the following:

The output consists of estimated Q-values for each card. The neural network is set up to estimate the expected reward values from playing each card given a particular state. The AI agent will then go ahead and play the playable card with the highest Q-value. This does not always mean that the card with the highest Q-value gets played, since it might not be possible to play that card (player energy, card not in hand, etc).
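The selection step described above amounts to an argmax restricted to playable cards. A minimal sketch, with a hypothetical function name and made-up Q-values:

```python
def best_playable(q_values, playable):
    """Index of the playable card with the highest Q-value, or None."""
    candidates = [i for i, ok in enumerate(playable) if ok]
    if not candidates:
        return None  # nothing can be played, e.g. no energy left
    return max(candidates, key=lambda i: q_values[i])

q_values = [0.2, 0.9, 0.5]
playable = [True, False, True]  # card 1 costs too much or is not in hand
print(best_playable(q_values, playable))  # 2
```

Here card 1 has the highest Q-value overall, but card 2 is the one that gets played.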

There were several issues with this model that we had to iron out as we ran training. For one, the estimated Q-values were getting very large and would eventually result in a NaN error. This was rectified by changing the activation function of the middle layers from ReLU to sigmoid. We also tried regularization as a way to handle this but that resulted in the estimated Q-values being too small.

All in all, we got the model to work but it did not show any improvement in the average reward over the duration of the training. Perhaps, it is difficult for the AI to infer insights from the data because too much of it is being presented at once inconsistently.

Model 2 : Multiple Small Models Each Predicting Independently

The other model we implemented involved getting rid of a single big model but instead making multiple small models. For the sake of experimentation, we created three smaller models as follows:

  • Buff Model – Consists of 7 input neurons, each indicating whether a certain buff is present on the player (or boss).
  • Cards Model – Consists of 13 input neurons indicating which cards are in the player’s hand, plus one additional neuron to indicate the player’s energy level.
  • Boss Intent Model – Consists of 7 input neurons indicating the boss’s intent.

Out of the 98 values that indicate the game’s current state in model 1, here we are only taking 7 + 13 + 7 of them. By filtering out some of the information, we are losing some of the AI’s knowledge of the state space, but we want to try this out to see if the AI does any better.

By taking a weighted average of the independent Q-values predicted by these three smaller models, we were able to get a win rate of 15.7% which is higher than the random bot. This is good news, because we know that we are heading in the right direction.
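Combining the three predictions can be sketched as below. The weights and Q-values are made-up numbers, and the function name is hypothetical; the post does not state which weights were actually used.

```python
def combine_q(per_model_q, weights):
    """Weighted average of per-card Q-values predicted by several models."""
    total = sum(weights)
    n_actions = len(per_model_q[0])
    return [
        sum(w * q[a] for w, q in zip(weights, per_model_q)) / total
        for a in range(n_actions)
    ]

buff_q = [1.0, 2.0]    # Q-values from the buff model (two cards for brevity)
cards_q = [3.0, 0.0]   # Q-values from the cards model
intent_q = [2.0, 4.0]  # Q-values from the boss intent model

combined = combine_q([buff_q, cards_q, intent_q], weights=[1, 2, 1])
print(combined)  # [2.25, 1.5]
```

The agent would then pick its action from the combined values in the same way as with the single model.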

However, with that being said, the performance is still quite bad. We are now looking to implement a more complex and fine-tuned reward function that could potentially help improve training. If that too does not prove to be helpful, we shall look towards policy gradient algorithms to try and solve this problem with a fresh approach.

Unity Visualization

We made some progress on our quest to create a Unity visualization of the game. Last week (Week 3) we implemented a Python version of the game ‘Slay the Spire’ which involved a single boss fight against the level 1 boss ‘The Guardian’. For starters, we worked towards implementing a UI for playing the game, which is shown below.

The following are some key features of the Unity system:

  • Decoupled Front End : Front-end includes all animations, UI, and graphics, but does not have any control over the gameplay logic.
  • Data Driven Rendering : Rendering of the game is based on data, just like browsers render HTML. So when we change game logic in Python, the rendering part would change according to the data sent back from Python. This ensures that we don’t need to modify the Python code too much.
  • Art Resources Hot Update : Art-side resources shouldn’t be embedded in the application. All of them are configurable and can be hot updated. Art assets can also be provided as tools (for the game designer) and results can be immediately seen without restarting the application.
  • Playback System and Debugger : Support for a playback system based on log data. This is to see how AI plays the game by visualizing its moves.

The implementation of the Unity Visualization system follows a front-end/back-end architecture. This is largely inspired by the modern web stack (HTML/Browser/CSS/JS) because we have a similar situation. Instead of being a game application, Unity is simply a renderer along with some input management. It renders the information from the Python module, and gathers user input to send back to Python. It is very similar to the relationship between browser and server. The following is a diagram of how this architecture looks:

Web GUI for Card Modifications

An important piece of our project is to make our system more friendly for game designers. To that end, we began work towards a web GUI for editing the JSON card files. The idea behind this system is to let users modify existing cards or create new cards. This system uses a Django backend that is hosted on a local server for editing the card files locally. The following is a screenshot of how this looks right now:

The fields in the image are from the card JSON files. Previously, all of these values had to be edited by opening up the JSON file directly. This system can now be used to make the same changes.

In conclusion, this week involved progress on multiple fronts. However, the biggest concern right now is that the AI is not performing well. In the coming week, more of our efforts will likely be to push towards success in this area.

Week 3 : The Card Game

The biggest highlight this week was the pivot. Instead of designing our own game, we decided to go with the popular card game ‘Slay the Spire’. There were several reasons for doing this. During a recently held Brainstorming Workshop that involved a lot of second year students, we got feedback that echoed much of what we had been hearing from our faculty instructors and others to whom we had described our project. Since we were implementing the AI and designing the game at the same time, it would have been easy for us to change the game’s design in order to make it more suitable for AI training. This has never been the purpose of our project. We don’t want to be changing the game in order to suit the AI. We want the game and its design to drive the AI instead of the other way round. This is because we want to work towards something that can be extended to other games and other settings. And although we are not looking towards developing an AI that can play multiple card games (this is an extremely difficult task), we need to be able to develop AI for a game not designed by us to convince anyone that our work has merit.

The Card Game – Slay the Spire

The game that we chose is Slay the Spire. There are several reasons for choosing this game:

  1. Simple and Well Designed : Slay the Spire is a well designed game that is a lot of fun to play. The buff system is uniform and consistent which makes it a good setting for AI training. There aren’t a lot of special cases to program, which makes the implementation easier.
  2. PvE : The game is not player versus player which makes the gameplay more deterministic. With the game concept we had earlier, we were heading towards adversarial AI agents that compete with each other to win. Instead, here we have the AI agent playing against a scripted boss. This makes the training process less complex because the environment is much less stochastic.
  3. Data Driven Design : Each card is basically a data structure with specific values for damage, block and buffs. Thus each card can be stored as its own JSON object and it is simple to modify. It also makes it easier for us to add new cards.
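As an illustration of the data driven point, a card stored as JSON can be loaded into a small data structure and rebalanced by changing a single value. The field names and numbers below are assumptions for the sketch, not our actual card schema.

```python
import json
from dataclasses import dataclass, field, replace

@dataclass
class Card:
    name: str
    cost: int
    damage: int = 0
    block: int = 0
    buffs: dict = field(default_factory=dict)  # buff name -> amount

# Hypothetical JSON for one card.
CARD_JSON = '{"name": "Iron Wave", "cost": 1, "damage": 5, "block": 5}'

card = Card(**json.loads(CARD_JSON))
stronger = replace(card, damage=7)  # a balance tweak touches only one field

print(card)
print(stronger.damage)  # 7
```

Adding a new card then means adding a new JSON object rather than writing new game logic.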

That being said, we are not looking to build an AI system for the entire game. We are only looking at boss fights with a predetermined deck of cards. As a result, we are only looking to train the AI to play out boss fights. If the AI can learn how to play the boss fight, we can get an understanding of how easy or difficult it is to beat the boss when the damage, block or buff values are altered.

Game Jam – Implementation

After deciding that we wanted an already popular card game and then choosing ‘Slay the Spire’, it was time to move towards the implementation. We had already worked towards the implementation of the previous game and as a result we didn’t like the fact that we were back to square one. To overcome this, we decided to do a game jam (Wednesday to Friday) where all three programmers drop everything else and work together to quickly implement a playable version of Slay the Spire. The following are some of the requirements we came up with to direct this rapid implementation:

  • Friendly for AI to play, and can also be played by humans using the terminal
  • Clear protocol for the deck and cards for future extensibility (ties into the Data Driven Design which we wanted to preserve)
  • Flexible and agile. It is a rapidly made prototype, but the code will serve as the foundation for what we do in the future

Calling this our own little game jam was a great idea. We quickly divided up the work and started while staying connected on Discord to answer each other’s questions. After three long days, we had a terminal-playable prototype of the card game on our hands. It consisted of one boss combat (The Guardian) and twelve cards for the player to choose from. The cards that we implemented are as follows:

Currently this game is human playable in the terminal. This will change in the future and the game will be playable through a UI. However, for now the play experience is a little more tedious with the player having to choose every card to play one by one. The following is a screenshot of how that looks.

Here is a list of different game systems that have been implemented so far:

  • GameManager for managing the game flow, providing API for AI and input management
  • Deck management, which provides the ability to import cards and deck information from JSON files
  • Behavior system for entities such as player and enemy
  • Implementation of a decision tree for the boss, which has different modes and various tactics.

The next steps include getting back to developing an AI that can play this game. We are simultaneously getting ready for quarters next week, which will give us a great chance to get feedback about our plan from ETC faculty.

Week 2 : First Version

Our goal for the week was to start building the first version of our card game. The components of our system architecture include the game kernel, the AI agent and visualization of the game states in Unity. The other important aspect is the design of the card game which is really what everything is for.

Card Game Design

As discussed in last week’s post, our strategy is to start with a simple card game and progressively introduce more complexity. This week, we came up with the first iteration of our game and here are the rules.

  • Each player will have the same health amount and attack damage.
  • Only two kinds of cards inside the deck: Heal and Damage.
  • In the first round each player will have 5 cards. From the second round onwards, each player will draw 2 cards per round.
  • Players will take turns, and they will use all of their cards in hand.
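A rough simulation of these rules is sketched below. The rules leave several details open, so the sketch assumes specific health, damage and heal values, cards drawn uniformly at random from an unlimited deck, healing capped at starting health, and a round limit. All of these, and the function name, are assumptions.

```python
import random

def simulate(seed, health=20, damage=3, heal=2, max_rounds=100):
    """Play one game; return the winning player index (0 or 1), or None."""
    rng = random.Random(seed)
    hp = [health, health]  # both players start with the same health
    for round_no in range(1, max_rounds + 1):
        for player in (0, 1):
            # 5 cards in the first round, 2 cards drawn in every later round.
            n_cards = 5 if round_no == 1 else 2
            hand = [rng.choice(("Heal", "Damage")) for _ in range(n_cards)]
            for card in hand:  # players use every card in their hand
                if card == "Heal":
                    hp[player] = min(health, hp[player] + heal)
                else:
                    hp[1 - player] -= damage
                if hp[1 - player] <= 0:
                    return player
    return None  # no winner within the round limit

results = [simulate(seed) for seed in range(200)]
print("player 0 wins:", results.count(0), "player 1 wins:", results.count(1))
```

Since players have no choices to make under these rules, a batch of simulated games like this mostly measures how much the turn order matters.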

Game Kernel

The following is a list of files that we have implemented and what each of them does:

  • Allows us to simulate the game multiple times and view results.
  • Kernel class that instantiates a game state object to ‘play’ the card game.
  • Defines game state related classes in order to have a concise interface for Unity and AI agent training.
  • A bot that randomly picks one card from its hand to play out in each round.
  • Defines constant values used in our project.

AI Agent

Since we cannot use a tabular approach for our game (as the state-action space is very large), we need to use a neural network. The input layer to the neural network will take in the values of the variables that define the state. We are still working on a comprehensive list of such variables. The output layer will then give us a probability distribution over the action space. Using this, we can sample an action from it or simply take the action that has the highest probability (we plan to try both).
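The two ways of picking an action from the output layer can be sketched as follows. The softmax step and the function names are assumptions about how raw network outputs would be turned into probabilities.

```python
import math
import random

def softmax(logits):
    """Turn raw network outputs into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_action(probs, greedy=False, rng=random):
    """Either take the most probable action or sample one."""
    if greedy:
        return max(range(len(probs)), key=lambda i: probs[i])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = softmax([0.5, 2.0, 1.0])
print(pick_action(probs, greedy=True))              # 1
print(pick_action(probs, rng=random.Random(0)))     # a sampled action index
```

Sampling keeps some exploration during training, while the greedy choice is the natural option when evaluating a trained agent.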

The design follows the Policy Gradient RL method. The algorithm that we will begin with is the Advantage Actor Critic. The reason for beginning with policy based methods (over value based methods) is that they perform better in stochastic environments. This applies to our project, because the same actions can result in different future states based on the moves made by the opponent. During the course of our project, it is likely that we shall try several different approaches to train our AI. We are open to trying value based methods if the Advantage Actor Critic algorithm does not perform well in our setting.

Another important aspect to consider is the number of states that we pass into the input layer during training. Since playing a particular card could lead to victory multiple turns later, the reward received after taking a particular action must include discounted future rewards as well. This means that we follow an N-step reward where N is the number of future steps to consider for the reward.
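A minimal sketch of an N-step discounted reward is shown below. It assumes a plain truncated sum without a bootstrapped value estimate at the end of the window, and the discount factor and reward values are made up for illustration.

```python
def n_step_returns(rewards, n, gamma=0.99):
    """For each step t, sum the next n rewards, discounted by gamma per step."""
    returns = []
    for t in range(len(rewards)):
        window = rewards[t:t + n]  # shorter near the end of the episode
        returns.append(sum(gamma ** k * r for k, r in enumerate(window)))
    return returns

# A card played at step 1 only pays off at step 3 (the reward of 8).
rewards = [1.0, 0.0, 0.0, 8.0]
print(n_step_returns(rewards, n=3, gamma=0.5))  # [1.0, 2.0, 4.0, 8.0]
```

With N = 3, step 1 is credited with part of the later reward, while step 0 is too far away to see it.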

Visualization and Input Handling in Unity

There are several parts to the Unity system that we are building alongside our Python game.

1. Networking System

For playtesting reasons, we want to allow people to play this game remotely. To that end, we are building a networked Unity system that can allow people to provide inputs remotely through the client. We are using Photon to do this.

The reasons for using Photon:

  • Provides free server usage within a limit.
  • Provides easy-to-use API for building room/lobby-based game service.
  • Great support and tutorial resources in Unity.

Using Photon, we have built a basic matching system that works for us. It currently includes functionality for:

  • Connecting players to the service. Manage the connection and disconnection.
  • Logic of ‘Ready’, ‘Start Game’, and player chat.

Below are some screenshots of how this looks currently.

2. Rules of communication between Python and Unity (C#)

We have agreed on rules for how the AI, the Unity logic and the Python gameplay code will communicate (to ensure less work when we connect everything).

  • Definition and format of important classes such as gamestate/user input is predefined and stored in a python file (
  • C#/Python communication: 
    • Only C# accesses Python, not the other way around
    • C# can only do the following:
      • Call functions, share data structures, and get the snapshot of object instances in python. 
    • C# and Python will share the definition of classes which are commonly referenced by both.
    • Const values and settings will be stored in separate files (XML, JSON, etc.) and will be dynamically loaded when required.

3. C# Coding Standard

We added a coding style guide for Unity/C# to our GitHub repository’s readme file, including rules for class layout, nomenclature, declaration and brace style.

Week 1 : Getting Started

The objective of our project is to create an AI that can playtest card games and help balance them. We expect that the AI will identify dominant strategies to win the game, and this can lead us to make game design updates. To do this, we are going to design our own card game because we want flexibility over the rules of the game. We want to test our AI with different game mechanics to see what problems (in the context of card game mechanics) it can solve easily and where it struggles.

We plan to use Python to build our game kernel and have some communication with Unity for visualization of game play.

For our project to be successful, we have divided it up into multiple aspects – the core game kernel, the AI, communication with Unity and game design for the card game. We need to work towards implementing each of these and ensure that they work well together.

Game-AI-Input System

The following is a simple representation of our system that implements the above.

Here are the requirements from this system:

  1. All the gameplay logic is executed in a module called “gameplayKernel”.
  2. Other modules are only able to provide userInput and get back the resulting gameState and the gameEvents triggered by that input. Exactly how game logic is executed is 100% invisible to them.
  3. All the game data is separated into a structure called game state.
  4. The game kernel is 100% decoupled from graphics/UI and AI training.

In the first week, our objective was to come up with a playable Rock-Paper-Scissor-Dragon card game and then train an AI agent to play it. RPSD is our take on the classic (and balanced) Rock-Paper-Scissor game with the exception that there is a Dragon action that defeats rock, paper and scissors but draws to another Dragon. The idea behind doing this is to prove that in the simplest of settings, the AI agent can identify the obviously broken part of the game.
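Putting the requirements and the RPSD rules together, a kernel along these lines could look like the sketch below. The class names, the `step` method and the event strings are hypothetical and only illustrate the input-in, state-and-events-out contract.

```python
from dataclasses import dataclass, field

# Which actions each action defeats; Dragon beats everything except Dragon.
BEATS = {
    "rock": {"scissors"},
    "paper": {"rock"},
    "scissors": {"paper"},
    "dragon": {"rock", "paper", "scissors"},
}

@dataclass
class GameState:
    rounds_played: int = 0
    scores: list = field(default_factory=lambda: [0, 0])

class GameplayKernel:
    """Owns all game logic; callers only see inputs, state and events."""

    def __init__(self):
        self._state = GameState()

    def step(self, user_input):
        a, b = user_input  # one action per player
        if b in BEATS[a]:
            self._state.scores[0] += 1
            events = ["player_0_wins"]
        elif a in BEATS[b]:
            self._state.scores[1] += 1
            events = ["player_1_wins"]
        else:
            events = ["draw"]
        self._state.rounds_played += 1
        # Return a copy so callers cannot mutate the kernel's state.
        snapshot = GameState(self._state.rounds_played, list(self._state.scores))
        return snapshot, events

kernel = GameplayKernel()
state, events = kernel.step(("dragon", "rock"))
print(state.scores, events)  # [1, 0] ['player_0_wins']
```

Both the Unity front end and the AI trainer could drive the game through the same `step` call without knowing how the rules are implemented.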

Unity – Python Connection

Research on the ways to achieve this:

  1. Host a Python interpreter in the .NET environment ( . The downside is that this is not stable, and it has trouble with external Python packages like numpy.
  2. Use the command line to execute the Python program, and redirect standard input/output from cmd to the C#/Unity environment. This is doable, but communicating over standard input/output is inconvenient, and it requires the player to install a Python environment.
  3. Run the Python program in one process and Unity in another, and use a socket to transfer information between the two languages. This is what we finally chose, but it still needs more work. We need to think about how to call a function and share object instances.
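The socket option could be sketched on the Python side as below. The newline-delimited JSON message format, the `call`/`args` fields and the function names are assumptions, and a local socket pair stands in for the real Unity connection so that the example runs on its own.

```python
import json
import socket

def send_message(sock, payload):
    """Send one JSON message terminated by a newline."""
    sock.sendall(json.dumps(payload).encode("utf-8") + b"\n")

def recv_message(sock):
    """Read one newline-terminated JSON message (one message at a time)."""
    buffer = b""
    while not buffer.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buffer += chunk
    return json.loads(buffer.decode("utf-8"))

def handle_request(request):
    """Hypothetical dispatcher for function calls coming from Unity."""
    if request["call"] == "play_card":
        return {"ok": True, "played": request["args"]["card"]}
    return {"ok": False, "error": "unknown call"}

# Stand-in for the two processes: one end plays Unity, the other Python.
unity_side, python_side = socket.socketpair()
send_message(unity_side, {"call": "play_card", "args": {"card": "Strike"}})
send_message(python_side, handle_request(recv_message(python_side)))
reply = recv_message(unity_side)
unity_side.close()
python_side.close()
print(reply)  # {'ok': True, 'played': 'Strike'}
```

A request/response format like this covers calling functions by name; sharing live object instances would need something more, such as passing object IDs instead of the objects themselves.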

AI Agent

Since the AI problem is so simple, we followed a simple tabular reinforcement learning approach. To elaborate, we maintain a table of the expected reward from taking each of the four actions (rock, paper, scissor or dragon). We update these expected reward values based on the rewards from each game. The agent uses an epsilon-greedy approach with a constant epsilon value of 0.1.
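A compact sketch of such an agent is shown below. It assumes an opponent that picks uniformly at random, rewards of +1 / 0 / -1 for win / draw / loss, and a running-average update; the opponent model, the reward scale and the function names are assumptions.

```python
import random

ACTIONS = ["rock", "paper", "scissors", "dragon"]
BEATS = {
    "rock": {"scissors"},
    "paper": {"rock"},
    "scissors": {"paper"},
    "dragon": {"rock", "paper", "scissors"},
}

def outcome(mine, theirs):
    """+1 for a win, -1 for a loss, 0 for a draw."""
    if theirs in BEATS[mine]:
        return 1
    if mine in BEATS[theirs]:
        return -1
    return 0

def train(steps=20000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {a: 0.0 for a in ACTIONS}     # table of expected rewards
    counts = {a: 0 for a in ACTIONS}  # how often each action was picked
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)      # explore
        else:
            action = max(ACTIONS, key=q.get)  # exploit the current best
        reward = outcome(action, rng.choice(ACTIONS))
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]  # running average
    return q, counts

q, counts = train()
print(max(counts, key=counts.get))  # dragon
```

Against a uniformly random opponent, dragon is the only action with a positive expected reward, so the greedy choice settles on it and the remaining picks come from exploration.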

Since the dragon naturally has the highest chance of winning any game, over a period of time the expected reward for a dragon is the highest. Here are some charts of the training results.

Each plot shows the number of times the AI agent picked each action. We see that after 100,000 steps, the AI agent predominantly picks the dragon action. The minority of actions that are not dragon can be attributed to the epsilon probability value.

Although this worked, we are well aware that this problem is trivial compared to building an AI to play a complex card game. Since the action and state space can be very large, a tabular reinforcement learning approach can no longer work. In this situation, we need to use a neural network to estimate expected reward. The input layer can be a vector of integer values that represent the current state. The output layer is more tricky because it involves a fairly complex action selection.

Game Design

Last but not least, we arrive at the game design component of our project. We plan to have a game that starts with simple rules and few mechanics but eventually evolves into something more complex. This can help us progressively increase the reinforcement learning model complexity.

The first version of the game only involves two systems:

  • Hero system
  • Card system

The hero system consists of two sets of actions to choose from. The player has to decide the positioning of each hero in each round, because normally a hero can only attack the enemy that is in front of it. Additionally, each hero will have two extra skills that require a particular resource to unlock, which is the second set of actions that needs decision making.

The card system also requires players to choose between short-term and long-term rewards. The player may choose to deal maximum damage each turn, or choose to gain resources to level up and fight back. Ideally, we want a variety of different strategies to be viable.

The game, of course, requires large amounts of balancing. For example, the HP and attack damage of each hero, the amount of resources that are required to level up a skill, and the drawing probabilities of each card are all crucial to this game being balanced. This is where we hope the AI can help us out.

In conclusion, we believe we have had a strong start to the project. Our objectives that we need to achieve are clear. We want to explore relatively unexplored territory that is technically very challenging. All we know for certain is that we are excited!