Consistency and Data Compatibility

The first versions of the Record, MatchRecord and MatchRecorder classes are implemented, and their current output format looks like this:

out << MatchRecord:

First record at: 2880
Record step size: 360
Revision number: 1
Records: 84
Map: (2)Astral Balance.scm
Bot race: Terran
Opponent race: Terran
Victorious: 1
Kill Score: 23800
Building Score: 5865
Razing Score: 7800
Unit Score: 15150

out << Record:

Record ID: 11
Frame: 3960
Own Bases: 1
Own Units: 
Opponent Bases: 1
Opponent Units: 
Dead units: 0
Killed units: 0
Minerals: 52
Minerals (spent): 1350
Minerals (gathered): 1402
Gas: 84
Gas (spent): 100
Gas (gathered): 184
Supply (total): 36
Supply (used): 36

I selected the captured data based on what I would use myself to make decisions about my strategy.

This format is just for testing purposes. I am thinking of switching the default output directly to JSON because of its great readability and cross-language support. The actual feature vector is represented by a separate class, and its output will be much less formatted.
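As a rough idea of what JSON output for a record could look like, here is a minimal sketch. RecordSketch and toJson are hypothetical names, not the bot's actual classes, and only a handful of the fields from the output above are included. Since all values are integers, the sketch gets by without a JSON library or string escaping.

```cpp
#include <sstream>
#include <string>

// Hypothetical, reduced stand-in for the bot's Record class. Only a few of
// the fields from the text output above are included.
struct RecordSketch {
    int id = 0;
    int frame = 0;
    int minerals = 0;
    int gas = 0;
    int supplyTotal = 0;
    int supplyUsed = 0;
};

// Serialize the record as a single compact JSON object. All values are
// integers, so no string escaping is needed in this sketch.
std::string toJson(const RecordSketch& r) {
    std::ostringstream out;
    out << "{"
        << "\"id\":" << r.id << ","
        << "\"frame\":" << r.frame << ","
        << "\"minerals\":" << r.minerals << ","
        << "\"gas\":" << r.gas << ","
        << "\"supplyTotal\":" << r.supplyTotal << ","
        << "\"supplyUsed\":" << r.supplyUsed
        << "}";
    return out.str();
}
```

Filled with the sample record shown earlier (ID 11 at frame 3960), toJson would produce one compact line per record, which should be easy to read back from a training script in another language.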

The Problem With Consistency Between Model and Data

While implementing these ideas I stumbled upon a problem. If I make changes to the contents of my records, or the way I capture them, the already captured data becomes incompatible. The additional FeatureVector class in particular is a critical interface between the machine learning model and the data I capture with the bot. This also affects the other end of the learning chain, the expected output of the model, which will be used for decision making.

Another interesting fact about the data is that it is highly dependent on all other components of the bot. The learning algorithm implicitly takes the bot’s capabilities in other areas, like micromanagement, into account. The generated strategy and its optimization become invalid as soon as changes are made to code that affects the playstyle. While I plan to assume compatibility across small optimizations and non-critical bug fixes, the model has to be retrained with new data after a major change.

Changes to the input and output of the model should break it entirely and require a completely new model. With enough data, however, the model should be able to adapt to implicit changes from other parts of the bot.

A Solution: Revision Numbers

In order to keep track of the generated, exported data and the current version of the code I’ll use revision numbers as a first simple alternative to data version control.

The data, and therefore the revision number, depend on the following:

Changes to the source code, and therefore to the overall playstyle of the bot, including bug fixes and optimizations, are simply indicated by the bot’s version number. Only its major and minor parts form the first part of the revision number; other changes are assumed not to break compatibility.

The MatchRecorder is configured with a specific start frame and a frame distance between records. For training a machine learning model, this is critical implicit information that can break compatibility. These two numbers will be part of the revision number.

To track the state of the feature vector, it receives its own revision number, which indicates the interface version for integrating model scripts. This is the third and last part of the revision number.
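The three parts above could be combined roughly like this. It is only a sketch: DataRevision, its field names and the dash-separated string layout are assumptions made for illustration, not the format the bot currently writes.

```cpp
#include <sstream>
#include <string>

// Hypothetical layout for the composite revision described above.
// None of these names come from the bot's actual code.
struct DataRevision {
    int botMajor = 0;          // part 1: major version of the bot
    int botMinor = 0;          // part 1: minor version of the bot
    int firstRecordFrame = 0;  // part 2: frame of the first record
    int recordStepSize = 0;    // part 2: frames between two records
    int vectorRevision = 0;    // part 3: feature vector interface version
};

// Recorded data is treated as compatible only if every part matches.
bool operator==(const DataRevision& a, const DataRevision& b) {
    return a.botMajor == b.botMajor
        && a.botMinor == b.botMinor
        && a.firstRecordFrame == b.firstRecordFrame
        && a.recordStepSize == b.recordStepSize
        && a.vectorRevision == b.vectorRevision;
}

// Render the revision as "major.minor-firstFrame-stepSize-vectorRevision".
std::string toString(const DataRevision& r) {
    std::ostringstream out;
    out << r.botMajor << "." << r.botMinor
        << "-" << r.firstRecordFrame
        << "-" << r.recordStepSize
        << "-" << r.vectorRevision;
    return out.str();
}
```

With the recorder settings from the sample output (first record at 2880, step size 360), an assumed bot version of 1.0 and vector revision 1, this would yield "1.0-2880-360-1". A training script could then compare the revision stored with each match against the one it expects and skip any data that does not match.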

Data Version Control

As an alternative that tackles this problem from a technical point of view, there are data version control tools that couple your data directly to the version of the source code. The data is kept in external storage (e.g. AWS S3, GCP) while being associated with the code through the source control software in use.

If possible, I would prefer to use data version control instead, due to its simplicity. I have to see if I’m able to integrate it fast enough.

Any thoughts of your own?

Feel free to start a discussion with me on Mastodon or drop me an email.