Notes from cory.li

Keyboard Science with Cherry MLs

2016-05-11T16:00:00-07:00

I was recently inspired by a post / rallying call to “blog little things.” So after failing to find detailed information on the Cherry ML pushbutton switch, I’ve decided to publish all my measurement notes in the hopes that it will be helpful to someone else.

The Switch

With mechanical keyboards being all the rage, most enthusiasts are familiar with the Cherry MX series of pushbutton switches. Less known, however, is Cherry’s ML series of low-profile switches, meant for laptops or other space-constrained designs. At 6.9mm total board height, it’s less than half the height of a 15.6mm MX switch.

Travel Distance: 3mm
Force: 45 cN
Bounce Time: 5ms
Current Rating: 10mA

The ML switch is technically a “family” of switches, but there’s really only one ML part readily available, the ML1A-11JW. Fortunately, this variant is widely available in single quantities, making it decent for hobbyist use.

The best description I’ve heard for the tactile “feel” of an ML switch is that it’s like a Cherry MX Blue, but “scratchier.” Switch preference is largely a matter of personal taste, but I think these switches feel great in my application: a fully mechanical controller (if you’re interested, follow me on twitter for updates!).

Unfortunately, Cherry’s official datasheets are frustratingly bare and underannotated, especially with regard to mechanical features, which makes the switch difficult to design around. I will try to elucidate some of the missing information below.

Schematic

The J in the part number stands for Jumper, meaning that two of the pins are simply shorted to each other. However, the datasheet doesn’t make it clear which two of the four pins those are. I’ve redrawn the footprint with an actual schematic symbol to make it clear that pins 2 and 4 are the actual switch, while 1 and 3 serve as a jumper wire.

Mechanical Dimensions

While the MX series has the mounting post on the switch, the ML switch instead has slots that the keycaps insert into. The exact dimensions and spacing of the hole are unclear, so I have filled them out here. These are the dimensions to the best of my knowledge (and the tolerance of my calipers).

The actuator is directly centered over the bottom mounting post, with two 3.1mm x 1.1mm slots spaced 5.3mm apart. It is important to note that the large bottom mounting post is not in the center of the switch; it sits only 5mm along the full 11.4mm length. I’ve exported a very blocky 3D STEP file for visualization use while working in your favorite EDA package.

Keycap Test

To check my measurements, I designed a small button cap with 3 x 1mm mounting posts as a test.

The button yielded a reasonable fit after printing it out on an SLA 3D-printer.

It may be possible to increase the size of the mounting post a bit to make the fit tighter, but so far I’m pretty happy with how it turned out.

Final Thoughts

I’ll try to make this a “living” blog post in that I’ll periodically update it with more accurate measurements and models as I make them – feel free to check back for the latest information. At some point I plan to take the keycap apart and optically scan and measure each piece, but for now the above is sufficient for my purposes.

Also, if there’s anyone out there with better measurements or models, please feel free to reach out to me either through twitter or email so I can correct my drawings!

Changelog

5/11/2016: Initial Posting

Fun with solving puzzles (and dragons)

2016-03-26T15:00:00-07:00

From the end of 2012 to early 2014, I found myself enchanted (read: entrapped) by the mobile game known as Puzzles and Dragons. It’s a deceptively simple and charming game, and this post is a collection of my musings on its gameplay and design. It’s also some closure for myself so that I can finally say that I have “beaten” the game and put it to rest.

Anyways, there are really four separate posts contained within this mind dump; feel free to skip to the parts of interest to you:

Introduction to PAD and gameplay overview
Some PAD maths and algorithms
PAD hacking/automation
Thoughts on PAD’s game design

Core Gameplay

Puzzles and Dragons (or PAD for short) is a match-three puzzler from the Japanese studio GungHo entertainment. Featuring Pokemon-like collection and progression elements, it ranks among the most profitable apps in the world. It still pulls in around 3 million dollars daily and was the first mobile app ever to hit $1 billion in revenue.¹

The goal of the game is very simple: Eliminate 3 or more orbs in a row.

Matched combos on the bottom half of the game board build attack power which you use to launch attacks against cute enemy monsters shown on the top half.

But notice that there is a subtle difference in game design which sets PAD apart from other match-threes, like Candy Crush or Bejeweled. Instead of your typical swap-two-elements, a single piece in PAD can be moved an arbitrary length, displacing other pieces as it travels.

The ingenuity in this design is that it’s actually a strict superset of Bejeweled’s gameplay. It makes the game incredibly beginner-friendly, since you can still play in a very simplistic swap-two manner:

But the more you play, the more it becomes clear that the skill ceiling is actually incredibly high, as the player learns to massage the board into their desired configuration with lengthy combos:

With this simple core mechanic, PAD is able to create one of the best gameplay skill-progression tracks I have seen in any mobile game. Without the need for preprogrammed experience bars or player buff handouts, there is still an invisible but very prominent feeling of “leveling-up”. The player picks up new advanced techniques and combo setups, and becomes more dexterous at manipulating the orbs, all on their own.

And most impressive of all, this invisible progress track guides the player all the way from being a casual Bejeweled player to tackling PAD’s version of ruthless World of Warcraft-like endgame raiding.

As an example, study the following hypothetical 4x3 board:

A beginner might go for one of the two easy double-combos:

A more experienced player should immediately see the path for the full four vertical combo:

Try it for yourself if you’re having trouble understanding the path. And of course these pattern identifications are much more impressive in the context of the full 5x6 board.

This incredibly high skill ceiling actually makes PAD quite entertaining to watch, as skill is so visibly demonstrable. If you don’t believe me, check out one of my favorite PAD videos, or even watch some of the official AppBank streams. A spectator may mentally plan out her own solution, only to be enlightened when a master player steps up to move the orbs. To the untrained eye, it looks as if orbs are being magically expelled from the board under a ruthless finger with machine-like precision².

Solvable Gameplay

So after one dropped orb too many, it occurred to me that I should just program the computer to play PAD for me.

The way orb manipulation works makes it very similar to the classic 15-puzzle, in which you slide around numbered tiles in a grid to rearrange them into numerical order.

Roughly speaking, PAD is an MxN generalization of 15-puzzle. The only difference is that in PAD there is no explicit “hole”; the hole is instead the tile that you are currently dragging under your finger.

Using this as a bit of scaffolding, we can break the plan of attack for “solving” PAD into 2 parts: calculate the board with the maximal score, and then calculate the shortest path to get from our current state to our desired state.

An easy way to produce the maximal scoring board is to sort all the orbs by color, and then pack groups of three starting from the bottom³.

With some handwaving, you can show that you can do no better than this configuration (i.e. breaking up a group to produce a falling combo does not increase your total combo count, so there is no advantage to not packing tight adjacent groups of three). Note of course that this doesn’t take into account the mechanic of skyfalls, that is, the additional combos scored serendipitously from orbs refilling the board. To maximize this, you’ll want to simultaneously pack in the largest number of cascades.

Because the game refills orbs from the top each time it clears away a matched combo, having this cascade of combos means that the falling orbs are permuted several times on the way down for a statistically higher chance of matches.
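As a rough illustration of the packing argument above, the no-skyfall combo bound is simply the number of complete groups of three available per color. This is a minimal sketch; the class name, the method name, and the one-letter color encoding are made up for this example and do not come from any solver.

```java
import java.util.HashMap;
import java.util.Map;

final class ComboBound {
    // Upper bound on combos when skyfalls are ignored: every color
    // contributes floor(count / 3) disjoint groups of three.
    static int maxCombos(String orbs) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : orbs.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        int combos = 0;
        for (int n : counts.values()) {
            combos += n / 3;
        }
        return combos;
    }

    public static void main(String[] args) {
        // Hypothetical 30-orb board, one letter per orb (encoding is an assumption):
        // R=fire, B=water, G=wood, L=light, D=dark, H=heart.
        String board = "RRRRRRBBBBBGGGGGLLLLLDDDDDHHHH";
        System.out.println("combo bound: " + maxCombos(board));
    }
}
```

Leftover orbs (count mod 3 of each color) cannot form a match on their own, which is why they drop out of the bound.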

A tricky question is whether we can actually get to this desired configuration. Going back to our scaffolding, consider the fact that in 15-puzzle not every board state is actually reachable! There’s a neat little theorem showing that any move in 15-puzzle preserves the parity of inversions⁴ – that is, the number of times a higher-numbered tile precedes a lower-numbered tile. This fact partitions the space of possible board states into two disconnected graphs: those of even parity and those of odd parity. From any board, you can reach every other board of the same parity, but never one of the other parity! This is why if you physically pull out and swap two consecutive pieces on a 15-puzzle board, the puzzle is no longer solvable.

Unlike 15-puzzle, where there are 15 unique pieces, PAD has only 7 unique orb colors (if you also count poison as a color). With 30 orbs on the board and only 7 colors, every board must have a duplicate orb somewhere, and the existence of that duplicate means you can always swap the two duplicates to “change” the parity without actually changing the state of the board. Therefore, we can show that in PAD, it is possible to achieve any desired board state – the only limitation is your skill (and time to manipulate the orbs).
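To make the parity argument concrete, here is a small sketch of the check for the classic 4x4 15-puzzle. It uses the invariant "inversions plus the row of the empty square", with rows counted from 1 at the top; the class and method names are hypothetical.

```java
final class FifteenParity {
    // tiles: row-major 4x4 board, 0 marks the empty square.
    static int inversions(int[] tiles) {
        int count = 0;
        for (int i = 0; i < tiles.length; i++) {
            for (int j = i + 1; j < tiles.length; j++) {
                // A higher-numbered tile preceding a lower-numbered one.
                if (tiles[i] != 0 && tiles[j] != 0 && tiles[i] > tiles[j]) {
                    count++;
                }
            }
        }
        return count;
    }

    // True when the board is in the same parity class as the solved board
    // (solved: 0 inversions, empty square in row 4, so the sum is even).
    static boolean reachableFromSolved(int[] tiles) {
        int blankRow = 0;
        for (int i = 0; i < tiles.length; i++) {
            if (tiles[i] == 0) {
                blankRow = i / 4 + 1;
            }
        }
        return (inversions(tiles) + blankRow) % 2 == 0;
    }

    public static void main(String[] args) {
        int[] swapped = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 14, 0};
        System.out.println("14/15 swapped reachable? " + reachableFromSolved(swapped));
    }
}
```

In PAD the same check is moot: exchanging two same-colored orbs flips the inversion parity while leaving the visible board unchanged.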

So, given that we know the reachable maximal-scoring board, we just need to write a solver to get there!

Turns out this is somewhat challenging, as finding the shortest solution for the generalized NxN version of 15-puzzle is NP-hard. Likewise, while figuring out the highest-scoring board in PAD is reasonably easy, finding the shortest path to achieve the highest-scoring board in the allotted time is non-trivial.

Fortunately for us, we can really only do so many moves in the allotted four seconds, so a non-exhaustive depth-first search is “good enough” for all intents and purposes. Pndopt is one such app⁵, which lets you weight certain colors for any given situation. Like a lot of F2P games these days, the game time-gates you on the number of plays you can do in a day, so for players who are running hard dungeons, it is not unusual to input every move through pndopt to maximize chances of success – something of which I am quite guilty.

To PAD’s credit, using a computer to solve the puzzles surprisingly doesn’t ruin everything – it just removes the puzzler cornerstone and transforms the game into more of an RPG team management simulation.

Complete Computer Control

Given that most people are using computer solvers, why not just have the computer play the game entirely? Back when I was still ~~addicted~~ playing, I hacked up a proof of concept solver & runner and threw it up on github just for myself.

Here’s how it all works:

Screen capture is accomplished with idevicescreenshot on iOS and adb shell screencap on Android.
Once the image is on the computer, the location of the 6x5 grid is calculated from the screenshot aspect ratio and then divided into 30 individual images.
The average hue of each individual image determines the orb color.
SIFT is run against a grayscale version of each image to give a list of key points, which are then matched against a list of possible orb modifiers (e.g. the plus modifiers, which give a 1.05x bonus to matches).
Candidate combo paths are obtained via an extremely lazy DFS written in Python, which runs a “multicore” solver by spawning a bunch of PyPy instances for different regions of the board.
Solutions with a score above a certain threshold are presented to the user along with the required path. The user can then sort through the solutions by relevant parameters such as damage done or health healed.
On Android only, the chosen path can then be executed on the device via Android’s monkeyrunner tool. (I wasn’t able to figure out a way to programmatically simulate touches on iOS.)
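The grid-splitting and hue steps of the pipeline above can be sketched in a few lines. This is not the code from the actual runner. It assumes the board region has already been cropped out of the screenshot, it averages RGB before converting to hue (to sidestep hue wraparound near red), and the reference hues are placeholders.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

final class OrbGrid {
    static final int COLS = 6;
    static final int ROWS = 5;

    // Hue (0..1) of the average color inside one cell of the cropped board image.
    static float cellHue(BufferedImage board, int row, int col) {
        int w = board.getWidth() / COLS;
        int h = board.getHeight() / ROWS;
        long r = 0, g = 0, b = 0;
        for (int y = row * h; y < (row + 1) * h; y++) {
            for (int x = col * w; x < (col + 1) * w; x++) {
                int rgb = board.getRGB(x, y);
                r += (rgb >> 16) & 0xff;
                g += (rgb >> 8) & 0xff;
                b += rgb & 0xff;
            }
        }
        long n = (long) w * h;
        return Color.RGBtoHSB((int) (r / n), (int) (g / n), (int) (b / n), null)[0];
    }

    // Index of the reference hue closest to `hue`, measured around the color wheel.
    static int classify(float hue, float[] referenceHues) {
        int best = 0;
        float bestDist = Float.MAX_VALUE;
        for (int i = 0; i < referenceHues.length; i++) {
            float d = Math.abs(hue - referenceHues[i]);
            d = Math.min(d, 1f - d); // hue is circular
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Synthetic 60x50 "board" whose top-left cell is painted pure blue.
        BufferedImage img = new BufferedImage(60, 50, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 10; y++) {
            for (int x = 0; x < 10; x++) {
                img.setRGB(x, y, 0x0000ff);
            }
        }
        float[] placeholderHues = {0f, 1f / 3f, 2f / 3f}; // red, green, blue
        System.out.println(classify(cellHue(img, 0, 0), placeholderHues));
    }
}
```

Real orb art is not a flat color, so in practice the reference hues would have to be calibrated against actual screenshots.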

Straight Up Cheating

Of course, this is an absurd amount of work just to play a game that is entirely client-side. It turns out the PAD mothership doesn’t even care about the state of the game. Sniffing the traffic shows that only three requests are made per dungeon:

sneak_dungeon - Client makes this request in order to enter a dungeon. Server responds with the dungeon encounters and loot table.
sneak_dungeon_ack - Client responds that the dungeon layout has been received and that the player is now playing. This is done in case of connectivity issues.
clear_dungeon - Client reports that the dungeon is cleared. Server acknowledges, confirms the received loot, and updates the player’s account.

Note that what most people would call the “core game” is actually entirely client-side. This includes the board state, monster attacks, monster damage, player health, etc. The entirety of the player’s efforts boils down to a single HTTP request – a success request nets them the entire loot table, while a failure request leaves them with nothing.

One nice/convenient aspect of this design is that you can actually “queue” dungeons before losing connectivity. I’d often load a dungeon before entering the subway, play through it during my 10 minute commute downtown, then re-sync once I surfaced at the destination station.

Some other fun notes from packet sniffing:

PAD to me is the best testament of the “just ship it” mentality as it appears the whole thing was written in PHP (i.e. the request is made to sneak_dungeon.php), showing that a fancy stack isn’t necessary to build a billion dollar game.
The API endpoint to enter the dungeon, sneak_dungeon, is probably an amusing mistranslation of sorts - probably originally along the lines of “to enter the dungeon discreetly / carefully.”
Somewhere around the 5.X series patches, they started encrypting the JSON payload so that it wasn’t easily over-the-wire sniff-able. Clever players were checking the loot table ahead of time to determine whether a dungeon was even worth running. It’s now sent as a base64-encoded binary blob – seems like a fun and reasonably straightforward reverse engineering project for someone’s weekend.
Monsters are actually referred to as cards in all the API calls. Maybe early prototypes of the game were meant to feel more like a collectible-card game?

Design Thoughts

If you’re willing to ignore how easy it is to cheat and just play the game as it’s meant to be played, it’s actually quite an enjoyable experience. There are a lot of minor annoyances in PAD, but I think the designers nailed two high-level mechanics:

Resource Management

I’m not sure if the monster fusion mechanic was invented by PAD, but I find it to be a very clever bit of design. The basic gist is this: monsters are the primary form of “currency” in PAD.

You use teams of monsters to clear dungeons
Clearing dungeons sometimes rewards you with additional monsters
Excess monsters can be used as a source of experience points for other monsters by “feeding” them

Often, you’ll want to save the strongest monsters or put them on your team while feeding the weaker ones away. When feeding, feeding five fodder monsters at a time is slightly more efficient than feeding one at a time. So overall, the player is encouraged to hoard monsters.

Even getting duplicate monsters is exciting as fusing duplicates together not only provides experience, but also levels up the monster’s powerful “active skill”.⁶

Countering the natural hoarding tendency is the concept of “Box Space,” or the total number of monsters you’re allowed to hoard at a time. Exceeding the allocated box space prevents the player from being able to enter new dungeons, forcing them to make decisions about consolidating powerful creatures together, or to spend on IAP box space expansions.

I find this single monster resource system to be quite elegant⁷, as it both reduces the number of resources in the game and provides interesting decisions that players can think about in the downtime between dungeons: given a limited amount of box space, which monsters should I keep, and which ones should I feed away for experience?

Raiding

Another mechanic I really like is PAD’s treatment of the end-game.

The designers were either really clever, or got really lucky, in their design of the limited-time event system. In the game, there is a list of unique “special dungeons” called “descends” which rotates every 24 hours. Each special dungeon guarantees the drop of a unique monster only if you are able to clear it on the hardest difficulty. However, specific special dungeons only come around about once a month. So if you want a particular monster, you have to train your team and plan to be ready by the dungeon date.

The whole preparation and timing feels very much like “gearing up” for a raid, as is common in other MMOs. Players will often only have two to three shots at the dungeon due to the time-gating, so they will often spend the days leading up to the descend training their monsters, reading up on the boss mechanics, and browsing the community to find friends with monsters who can help tackle the level⁸.

Due to the power spike granted by the newly acquired monsters from beating a descend, there’s also a natural progression of descends, just as there is often a natural raiding progression in MMOs. Often, the first real descend new players tackle is Hera,⁹ which provides an ability called Gravity, dealing an unconditional 30% damage to enemy monsters. Using Hera, they work their way through harder and harder stages, like Valkyrie and Goemon, eventually building a team that can tackle Zeus, Satan, and the other end-game descends.

This feeling of end-game progression complements the skill progression well, making me unsurprised that the game is still doing well after four years.

Closing

Anyways, with over 500 days logged into the game, I think it’s time to put this to rest. Here’s a screenshot of my core team.

Farewell Karin!

I recently started using twitter more; feel free to follow me @cixelyn if you enjoy my writing. Also, special thanks to Ruwen Liu, Haitao Mao, Sam Powers, and YP Chen for reading drafts of this post.

And as a self-congratulatory note, I am proud to say that I managed to spend less than $100 on IAP, making this one of the best time/money sources of entertainment I have ever played. Pyrrhic victory I suppose. ↩
To be completely fair, there is a part of PAD that can be quite inscrutable to uninitiated viewers: intentional board stalling. The idea is that the player does not actually want to trigger a big combo because it would prematurely advance them to the next part of the level before all their special abilities are charged. So they make a calculated (and often short) move that makes only a single match, while still manipulating the overall board layout to trigger a big combo later. ↩
This is of course for the very basic case where you have a rainbow-colored team with each member of equal power. The analysis is much more nuanced if you care about a non-uniform team (i.e. you’re stuck with an integer linear programming problem). For those that care, the basic damage formula is $$ \left(1+\frac{\mathrm{combos}-1}{4}\right) \cdot \sum_{n=1}^{\mathrm{combos}} \mathrm{attack}(n)\cdot\left(1 +\frac{\mathrm{orbs}(n)-3}{4}\right) $$ where combos is the total number of combos, attack(n) is the total monster attack power of the n-th combo, and orbs(n) is the total number of orbs in the n-th combo. Throw in board modifiers, monster multipliers, and a whole host of other powerups, and the calculation becomes really messy. ↩
For the general MxN puzzle with an even number of columns, any legal move will preserve the invariant K mod 2, where K is the number of inversions plus the row number of the empty square; with an odd number of columns, the parity of the inversion count alone is preserved. For a more thorough treatment, see the excellent resource at Interactive Mathematics Miscellany and Puzzles. ↩
If you do use Pndopt, I find their default MAX_SOLUTIONS_COUNT a bit too low. Open the console and bump the variable to something reasonable like 20,000. ↩
Each monster may have up to one active skill, which is a player-activated ability that provides some sort of positive benefit during battles. “Orb Changers” are the most sought-after active skill as they typically convert all orbs of one color to another, serving as play-makers for difficult board situations. Leveling up a skill reduces the skill’s cooldown timer, allowing the player to use it more frequently in battle. ↩
To be completely fair, PAD actually has a secondary resource called “gold” which I find quite inelegant. Except very early on in the game, you never run out of gold, making it a non-resource. I think the designers realized this mistake and started adding gold sinks in the form of purchasable dungeons around the 6.0 patch. ↩
When battling a dungeon, you provide one leader and four team members. You also have the option of using a friend’s monster, who serves as the team’s second leader. Leader monsters give huge team buffs, so having a roster of strong friends is paramount to fielding an overall strong team. Many higher-level players will often lend their monsters to beginners during the big descend days. Common places to look for specific friends include /r/puzzleanddragons and puzzledragonx’s friend finder. ↩
I know this isn’t entirely the case as Hera was bumped to a normal dungeon now, but it was true throughout over half of PAD’s life and the entirety of my PAD career. ↩

Java bytecode hacking for fun and profit

2014-01-06T12:59:00-08:00

With the 2014 season of Battlecode starting tomorrow, I figured now would be as good a time as any to finally write up my notes on bytecode hacking. If you’re unfamiliar with Battlecode, a good introduction is my previous post (tldr: it’s an intense open-to-all programming competition where teams write AIs for virtual robot armies).

You might be wondering what bytecodes have to do with battlecode. Well, one of the most intriguing parts of the battlecode engine is the cost model applied to each team’s AI. In order to hard limit each team’s total computation, yet guarantee equal computation resources to each team, each team is given a bytecode limit, and their code is instrumented and allowed to run only up to that limit before it is halted. This is pretty counter-intuitive for people who are used to more traditional time-based computational limits.

For those unfamiliar with bytecodes, they are the atomic instructions that run on the JVM – your Java source compiles down to them, similar to assembly. The tricky part is that Battlecode keeps this bytecode limit low – typically in the 6-10k range. To give a rough sense of scale, an A* search through a small 8x8 grid can easily blow through the whole computational budget; Battlecode maps, however, can be anywhere from 20x20 to 60x60 tiles in size.

This bytecode limit, then, is actually quite interesting, as it forces teams to come up with novel and creative ways to solve problems rather than just implementing well-known algorithms. Unfortunately, it also serves as one of the major contributors to a relatively steep learning curve. My goal with this post is to elucidate just what is happening under the hood, as well as provide some tips and tricks for teams to squeeze every last drop of performance out of their AIs.

As a disclaimer, these optimizations should be performed last, once the majority of your AI framework is built; writing good code is better than optimizing incorrect or algorithmically poor code. But that being said, when you’re tight for bytecodes, any small optimization can very well mean the difference between victory and defeat.

When searching for resources on the net, it becomes apparent that bytecode optimization is something of a lost art – nearly every article on the topic dates from before 2000, when the HotSpot JIT compiler became the default in Java 1.3. With JIT compilation and modern obfuscation engines like ProGuard, there hasn’t been much reason to pay attention to things like emitted bytecode or total class file size. Battlecode is somewhat unique as contestants are required to turn in their source, rather than compiled code (as students can take it as a course and count it for university credits). Thus, we must turn to these old techniques to control emitted bytecodes from high-level source.

There are some who may scoff at bytecode optimization, reasoning that it’s a worthless skill for modern computer science, especially those working in high-level languages. Understanding what the compiler emits however is a skill still very much alive and well in embedded programming, FPGA programming, and other performance-oriented disciplines. In FPGA programming, one must have a mental model of what hardware will be synthesized before writing the code. In embedded programming, the frequency of software-based signal generation is limited by instruction count in the loop body.

Honestly, most of the reward of bytecode optimization comes from being able to play with the battle-tested JVM architecture in a particularly novel way. It’s incredibly fun, especially when it gives you the edge against rival teams. What more justification does an interested hacker need?

JVM Bytecode Basics

To understand how to work around a bytecode limit, we must first understand the JVM’s execution model. The inner workings of the JVM are well documented elsewhere on the net – feel free to skip this section if you’re already familiar, but for those who aren’t, here’s a brief overview of the important parts.

The JVM is a relatively simple stack-based architecture with a fairly comprehensive instruction set allowing for manipulation of both primitives and full objects. An atomic instruction is called a bytecode, roughly equivalent to a single assembly instruction in native code. These bytecodes are stored as a stream within the compiled Java .class file, and are executed within the context of a stack frame.

Each stack frame contains:

The current operand stack
An array of local variables
A reference to the constant pool of the class of the current method

Bytecodes perform computation by pushing values onto, and popping values off of, the current frame’s operand stack. If a method is invoked, a brand new frame is created and pushed on top of the execution stack. Upon method completion, the frame is destroyed and the return value is passed to the previous frame.

The easiest way to understand bytecode execution is to see an example. Given the following Java code:

public int sumSquares(int a, int b) {
  int rv = a*a + b*b;
  return rv;
}

Let’s disassemble it and see how it works. The standard Java SDK conveniently comes with the javap disassembler. Running javap -c Main will give you the bytecode stream for Main.class in the same directory, which I’ve (overly) annotated¹ to explain how it works:

public int sumSquares(int, int);
  Code:
     0: iload_1   // push local variable 1 (int a) onto the stack
     1: iload_1   // push int a onto the stack again
     2: imul      // pop two ints, multiply them, then push the result onto the stack
     3: iload_2   // push local variable 2 (int b) onto the stack
     4: iload_2   // push int b onto the stack again
     5: imul      // pop two ints, multiply, and push back the result
     6: iadd      // pop the two results, add them, and push back the result
     7: istore_3  // store the result to local variable 3 (rv)
     8: iload_3   // load local variable 3
     9: ireturn   // return what's on the stack (rv)

Note how the assignment of local variables to array positions is determined at compile time and baked directly into the byte code stream. The two parameters are passed in as positions 1 and 2 on the locals array while rv has been assigned position 3 on the array. In fact, the bytecode output from the Java source was fairly predictable – we’ll use this fact to our advantage later on.
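To see the operand stack and locals array in motion, the ten instructions above can be replayed on a toy frame. This is only a sketch of the execution model, not how a real JVM is implemented; the class name TinyFrame and its methods are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class TinyFrame {
    // Executes a list of mnemonics against one frame: an operand stack
    // plus a locals array. Only the opcodes used by sumSquares are handled.
    static int run(String[] code, int[] locals) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (String op : code) {
            switch (op) {
                case "iload_1":  stack.push(locals[1]); break;
                case "iload_2":  stack.push(locals[2]); break;
                case "iload_3":  stack.push(locals[3]); break;
                case "istore_3": locals[3] = stack.pop(); break;
                case "imul":     stack.push(stack.pop() * stack.pop()); break;
                case "iadd":     stack.push(stack.pop() + stack.pop()); break;
                case "ireturn":  return stack.pop();
                default: throw new IllegalArgumentException("unsupported: " + op);
            }
        }
        throw new IllegalStateException("fell off the end without ireturn");
    }

    static int sumSquares(int a, int b) {
        String[] code = {
            "iload_1", "iload_1", "imul",
            "iload_2", "iload_2", "imul",
            "iadd", "istore_3", "iload_3", "ireturn"
        };
        // Slot 0 would hold `this` for an instance method; it is unused here.
        int[] locals = {0, a, b, 0};
        return run(code, locals);
    }

    public static void main(String[] args) {
        System.out.println(sumSquares(3, 4));
    }
}
```

Each loop iteration corresponds to one bytecode, which is exactly the unit Battlecode charges for.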

Here’s another simple example that contains branching:

public int sign(int a) {
  if(a<0) return -1;
  else if (a>0) return 1;
  else return 0;
}

public int sign(int);
  Code:
     0: iload_1
     1: ifge          6
     4: iconst_m1
     5: ireturn
     6: iload_1
     7: ifle          12
    10: iconst_1
    11: ireturn
    12: iconst_0
    13: ireturn

As you may have noticed, the numbers on the left are not instruction counts, but rather each instruction’s byte offset from the beginning of the stream. The jump targets for ifge and ifle are specified in terms of these offsets.

With these basics in mind, we can now take a look at how to optimize algorithms from within the Battlecode engine.

Bytecode Counting

The current generation of Battlecode’s instrumentation engine uses the OW2 ASM framework for bytecode counting. Before a team’s code is executed, the engine walks through the generated program tree, and computes the bytecode cost of each basic block. At each block’s exit, a checkpoint is injected with the block’s total cost. During live execution, these checkpoints increment the AI’s internal total bytecode counter. If at any checkpoint the running tally exceeds GameConstants.BYTECODE_LIMIT, the AI’s execution is halted and execution of the next robot’s AI begins. This essentially means that the executing robot’s turn is skipped – preventing it from moving or firing its weapons if it hadn’t already done so that turn, which can be devastating in combat.
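The checkpoint scheme can be modeled without any instrumentation library. The sketch below is a simplified stand-in for what the engine injects, not its actual code: the class BytecodeBudget, its methods, and the example costs are all hypothetical.

```java
final class BytecodeBudget {
    private final int limit;
    private int used = 0;

    BytecodeBudget(int limit) {
        this.limit = limit;
    }

    // Stands in for the injected checkpoint at a basic block's exit: adds the
    // block's precomputed cost and reports whether the AI may keep running.
    boolean checkpoint(int blockCost) {
        used += blockCost;
        return used <= limit;
    }

    int used() {
        return used;
    }

    // Models a loop whose body is a single basic block and returns how many
    // iterations finish before the running tally exceeds the limit.
    static int iterationsBeforeHalt(int limit, int costPerIteration, int wanted) {
        BytecodeBudget budget = new BytecodeBudget(limit);
        int done = 0;
        while (done < wanted && budget.checkpoint(costPerIteration)) {
            done++;
        }
        return done;
    }

    public static void main(String[] args) {
        // Assumed numbers: a 10,000-bytecode limit and a 12-bytecode loop body.
        System.out.println(iterationsBeforeHalt(10_000, 12, 1_000));
    }
}
```

Because the tally only moves at block exits, the engine pays for one addition per basic block rather than one per instruction.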

The system’s design allows the engine to simulate hundreds of AIs efficiently, with only moderate overhead. Earlier versions of Battlecode ran on a custom JVM implementation written in Java; while it could instrument on a per-instruction basis, it was a lot slower.

The biggest takeaway is that you are penalized only for the total number of bytecodes you use.

A common mistake when reading disassembled code is to assume that the encoded size of an instruction matters. It does not: iload_0 (0x1a), a one-byte compact instruction for loading the integer from local variable 0, costs the same as the two-byte iload #5 (the iload opcode 0x15, followed by the argument 0x05):

0: iload_0          0: iload #5
1: return           2: return

When checking output from javap or other disassemblers, you must remember to renumber from byte-offset to instruction-offset in order to know your total cost.
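Renumbering by hand gets tedious, so here is a small helper that counts instruction lines in a javap -c listing instead of trusting the byte offsets. It is a sketch that assumes the listing layout shown in this post; JavapCount is a made-up name, and it counts static instructions rather than how many actually execute.

```java
import java.util.regex.Pattern;

final class JavapCount {
    // Matches lines such as "     7: ifle          12": a byte offset, a colon,
    // then a mnemonic starting with a letter. Switch-table rows like "2: 40"
    // and "default: 65" are deliberately not matched.
    private static final Pattern INSTRUCTION =
            Pattern.compile("^\\s*\\d+:\\s+[a-z][a-z_0-9]*.*$");

    static int countInstructions(String javapListing) {
        int count = 0;
        for (String line : javapListing.split("\n")) {
            if (INSTRUCTION.matcher(line).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String listing = String.join("\n",
                "public int sign(int);",
                "  Code:",
                "     0: iload_1",
                "     1: ifge          6",
                "     4: iconst_m1",
                "     5: ireturn",
                "     6: iload_1",
                "     7: ifle          12",
                "    10: iconst_1",
                "    11: ireturn",
                "    12: iconst_0",
                "    13: ireturn");
        // The last byte offset is 13, but the method has fewer instructions than that.
        System.out.println(countInstructions(listing));
    }
}
```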

The complexity of the bytecode instruction doesn’t matter either. As an example, here are two equivalent statements that emit two different bytecodes:

if(choice == 1) methodA();
else if(choice == 2) methodB();
else if(choice == 3) methodC();
else if(choice == 4) methodD();

which compiles to²:

 0: aload_0
 1: getfield      #45                 // Field choice:I
 4: iconst_1
 5: if_icmpne     15
 8: aload_0
 9: invokevirtual #47                 // Method methodA:()V
12: goto          57
15: aload_0
16: getfield      #45                 // Field choice:I
19: iconst_2
20: if_icmpne     30
23: aload_0
24: invokevirtual #49                 // Method methodB:()V
27: goto          57
30: aload_0
31: getfield      #45                 // Field choice:I
34: iconst_3
35: if_icmpne     45
38: aload_0
39: invokevirtual #51                 // Method methodC:()V
42: goto          57
45: aload_0
46: getfield      #45                 // Field choice:I
49: iconst_4
50: if_icmpne     57
53: aload_0
54: invokevirtual #53                 // Method methodD:()V

and the same expression written as a switch statement:

switch(choice) {
case 1:
  methodA();
  break;
case 2:
  methodB();
  break;
case 3:
  methodC();
  break;
case 4:
  methodD();
  break;
}

 0: aload_0
 1: getfield      #45                 // Field choice:I
 4: tableswitch   { // 2 to 6
               2: 40
               3: 47
               4: 54
               5: 65
               6: 61
         default: 65
    }
40: aload_0
41: invokevirtual #47                 // Method methodA:()V
44: goto          65
47: aload_0
48: invokevirtual #49                 // Method methodB:()V
51: goto          65
54: aload_0
55: invokevirtual #51                 // Method methodC:()V
58: goto          65
61: aload_0
62: invokevirtual #53                 // Method methodD:()V

Notice how the if-else statements emit sequential if_icmpne instructions which must be evaluated in order, while the switch statement emits a single tableswitch³ instruction that will jump directly to the correct block. It is to your advantage to use complex instructions.
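To put numbers on the difference, here is a small worked example. It is only a sketch: the per-branch counts are read off the two listings above, it assumes the engine charges one unit per executed instruction as described earlier, and it ignores whatever methodA through methodD cost internally. The class BranchCost is a hypothetical helper.

```java
class BranchCost {
    static final int CASES = 4;

    /** Instructions executed by the if-else chain when the k-th branch (1-based) matches. */
    static int ifElseCost(int k) {
        int tests = 4 * k;               // aload_0, getfield, iconst_k, if_icmpne per test
        int call = 2;                    // aload_0, invokevirtual
        int jump = (k < CASES) ? 1 : 0;  // trailing goto, absent after the last branch
        return tests + call + jump;
    }

    /** Instructions executed by the switch version when the k-th case (1-based) matches. */
    static int switchCost(int k) {
        int dispatch = 3;                // aload_0, getfield, tableswitch
        int call = 2;                    // aload_0, invokevirtual
        int jump = (k < CASES) ? 1 : 0;  // trailing goto, absent after the last case
        return dispatch + call + jump;
    }

    public static void main(String[] args) {
        for (int k = 1; k <= CASES; k++) {
            System.out.println("branch " + k + ": if-else=" + ifElseCost(k)
                + " switch=" + switchCost(k));
        }
    }
}
```

Under these assumptions the if-else chain grows by four instructions for every branch it has to skip (7 for the first branch, 18 for the fourth), while the switch stays at 5 or 6 regardless of which case is taken.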

Loop Optimizations

With these general ideas in mind, we can begin to explore more advanced optimization techniques. When optimizing bytecodes, our primary goal is to reduce the total instruction count to a bare minimum. We’re lucky in that we don’t have to benchmark to determine performance – we only have to count the total number of instructions⁴. The easiest way to illustrate optimization is to walk through a complete example of optimizing a tight loop.

Let’s begin with a hypothetical controller class that encapsulates an array of objects that we care about, say enemy_robots. We want to build a method called scanAll that will iterate through all the enemy robots one by one and call the scan method on each.

public class Controller {
  public RobotInfo[] enemy_robots;
  public void scanAll() {
    /* code to iterate through enemy_robots and scan them */
  }
}

Since Java 5, there has been an easy way to write these for-each loops, which will do nicely for our first pass:

for(RobotInfo rinfo : enemy_robots) {
  rinfo.scan();
}

In the disassembly below, I’ve heavily annotated each opcode and renumbered the indices that javap reports from byte offsets to instruction indices (as the instruction count is what we are penalized for).

 1: aload_0                  // Variable 0 (the "this" object reference)
 2: getfield      #14        // Field enemy_robots:[Lbytecodetests/RobotInfo;
 3: dup                      // Duplicates the last object on the stack (enemy_robots)
 4: astore        4          // locals[4] = enemy_robots
 5: arraylength
 6: istore_3                 // locals[3] = enemy_robots.length
 7: iconst_0                 // loads the value 0 onto the stack
 8: istore_2                 // locals[2] = 0 (or the loop index)
 9: goto          17         // LOOP BEGINS HERE:
10: aload         4
11: iload_2
12: aaload                   // loads index (locals[2]) of enemy_robots
13: astore_1                 // locals[1] = enemy_robots[index]
14: aload_1
15: invokevirtual #21        // Method bytecodetests/RobotInfo.scan:()V
16: iinc          2, 1       // locals[2]++
17: iload_2
18: iload_3
19: if_icmplt     10         // if index < enemy_robots.length, jump to instruction 10

In the above disassembly, the compiler has assigned the following variables into the locals array as such:

locals[1]: the variable rinfo
locals[2]: the implicit loop index
locals[3]: enemy_robots.length
locals[4]: enemy_robots

The zeroth position is special – it’s almost always this, that is, the current enclosing object. We’ll see later why that is important.

Our main loop body is from instruction 10 to instruction 19, a total size of 10 bytecodes. So our total bytecode count for this routine is the overhead (12) plus the loop body (10) times the number of iterations. Assuming that the number of iterations is large, how can we reduce this cost? One way is to write the loop in a more old-fashioned way:

int length = enemy_robots.length;
for(int i=0; i<length; i++) {
  enemy_robots[i].scan();
}

Note that we pre-compute length so that we don’t incur the cost of computing enemy_robots.length every loop iteration (doing so would cost an extra aload, getfield and arraylength per iteration instead of a single iload). The emitted bytecode is below, again annotated and re-indexed:

 1: aload_0
 2: getfield      #16     // Field enemy_robots:[Lbytecodetests/RobotInfo;
 3: arraylength
 4: istore_1
 5: iconst_0
 6: istore_2
 7: goto          14      // LOOP BEGINS HERE:
 8: aload_0
 9: getfield      #16     // Field enemy_robots:[Lbytecodetests/RobotInfo;
10: iload_2
11: aaload
12: invokevirtual #23     // Method bytecodetests/RobotInfo.scan:()V
13: iinc          2, 1
14: iload_2
15: iload_1
16: if_icmplt     8

In this example, the compiler has actually only assigned three local variables:

locals[0]: the this reference
locals[1]: the precomputed array-length length
locals[2]: the loop index i

The total loop comes out to 9 bytecodes per iteration. We saved exactly 1 bytecode for increased code complexity. Sadistic, isn’t it? We save two instructions by not having to astore and aload the extra local variable rinfo that was required in the for-each example, but we lose one by having to getfield the class-level variable enemy_robots, for a net saving of one. Let’s try to recover it.

Pulling things into local scope

In our above example, because enemy_robots is actually a class-level variable, in order to reference enemy_robots, the implicit this must be pushed onto the stack first.

Each access thus requires an aload_0 followed by a getfield. If we instead assign enemy_robots to a local variable, it becomes a single bytecode aload #x, grabbed directly from the local variable array. So let’s pay the overhead cost to bring enemy_robots down into local scope and rewrite the loop:

RobotInfo[] local_enemy_robots = enemy_robots;
int length = local_enemy_robots.length;
for(int i=0; i<length; i++) {
  local_enemy_robots[i].scan();
}

Again, for additional complexity, we save another bytecode!⁵ Can we do even better?

Comparisons against zero

It turns out we can actually save one additional bytecode if we rewrite the entire loop structure as follows:

for(int i=local_enemy_robots.length; --i >= 0;) {
  local_enemy_robots[i].scan();
}

 1: aload_1
 2: arraylength
 3: istore_2
 4: goto          9
 5: aload_1
 6: iload_2
 7: aaload
 8: invokevirtual #32      // Method bytecodetests/RobotInfo.scan:()V
 9: iinc          2, -1
10: iload_2
11: ifge          5

Because the loop now decrements from the array length to zero, our loop termination conditional is a check against zero, which Java has a special bytecode for: ifge. This means that we only have to push one number onto the stack instead of the two required for if_icmplt, cutting out an iload. This now brings us down to 7 bytecodes per iteration!
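For anyone who wants to run javap on these loops themselves, here is a self-contained sketch that puts all four variants side by side. The RobotInfo class below is a hypothetical stand-in that only counts scans (the real Battlecode class is different), and the method names are mine. One behavioural difference is worth keeping in mind: the reversed loop visits the array from the last element to the first, so it is only a drop-in replacement when the order of the scans doesn’t matter.

```java
class LoopVariants {
    /** Hypothetical stand-in for the article's RobotInfo; it only counts scans. */
    static final class RobotInfo {
        int scans = 0;
        void scan() { scans++; }
    }

    RobotInfo[] enemy_robots;

    LoopVariants(int n) {
        enemy_robots = new RobotInfo[n];
        for (int i = 0; i < n; i++) enemy_robots[i] = new RobotInfo();
    }

    // First pass: for-each loop.
    void scanForEach() {
        for (RobotInfo rinfo : enemy_robots) rinfo.scan();
    }

    // Indexed loop with a precomputed length.
    void scanIndexed() {
        int length = enemy_robots.length;
        for (int i = 0; i < length; i++) enemy_robots[i].scan();
    }

    // Same, but with the field copied into a local first.
    void scanWithLocals() {
        RobotInfo[] local_enemy_robots = enemy_robots;
        int length = local_enemy_robots.length;
        for (int i = 0; i < length; i++) local_enemy_robots[i].scan();
    }

    // Counting down so the exit test compares against zero. Visits in reverse order.
    void scanReversed() {
        RobotInfo[] local_enemy_robots = enemy_robots;
        for (int i = local_enemy_robots.length; --i >= 0;) local_enemy_robots[i].scan();
    }

    /** True if every robot has been scanned exactly `expected` times. */
    boolean allScanned(int expected) {
        for (RobotInfo r : enemy_robots) {
            if (r.scans != expected) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        LoopVariants c = new LoopVariants(5);
        c.scanForEach();
        c.scanIndexed();
        c.scanWithLocals();
        c.scanReversed();
        System.out.println(c.allScanned(4)); // true: each variant touched every robot once
    }
}
```

Compiling this and running javap -c on the class should let you compare the loop bodies directly, though the exact constant-pool indices and local slots will depend on your compiler.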

Putting it all together

The following table shows the loop bodies of each step of our optimization with the extraneous instruction bolded:

	for-each	for-index	with-locals	reversed
1	aload	aload_0	aload_1	aload_1
2	iload_2	getfield	iload_2	iload_2
3	aaload	iload_2	aaload	aaload
4	astore_1	aaload	invokevirtual	invokevirtual
5	aload_1	invokevirtual	iinc	iinc
6	invokevirtual	iinc	iload_2	iload_2
7	iinc	iload_2	iload_1	ifge
8	iload_2	iload_1	if_icmplt
9	iload_3	if_icmplt
10	if_icmplt

To recap, just by tweaking and reorganizing the structure of the loop itself, we managed to reduce overhead instruction count by 30%. And in tight loops that run hundreds/thousands of times, we bank appreciable bytecode savings that we then can spend on more critical code paths – like running a pathfinding algorithm several steps further, or processing a few more enemies in a weapons targeting system.
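The arithmetic behind that 30% figure is simple enough to sketch. The per-iteration counts below are taken from the table above; the class LoopSavings and its methods are hypothetical, and the fixed setup cost of each loop is deliberately left out since it is paid only once.

```java
class LoopSavings {
    // Per-iteration instruction counts from the table above.
    static final int FOR_EACH = 10;
    static final int FOR_INDEX = 9;
    static final int WITH_LOCALS = 8;
    static final int REVERSED = 7;

    /** Bytecodes saved over `iterations` passes by switching loop styles. */
    static int saved(int baseline, int optimized, int iterations) {
        return (baseline - optimized) * iterations;
    }

    /** Relative per-iteration saving, in percent. */
    static double percentSaved(int baseline, int optimized) {
        return 100.0 * (baseline - optimized) / baseline;
    }

    public static void main(String[] args) {
        System.out.println(percentSaved(FOR_EACH, REVERSED)); // 30.0
        System.out.println(saved(FOR_EACH, REVERSED, 1000));  // 3000
    }
}
```

So a loop that runs a thousand times per turn frees up about three thousand bytecodes, before counting anything the loop body itself does.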

Generating GOTOs

Any discussion of loop optimization wouldn’t be complete without a brief discussion of loop termination. In Java, similarly to most languages, you can early terminate a loop with the break keyword, or skip a loop iteration with the continue keyword. It’s the closest thing we have in Java to a general purpose goto statement.

while(true) {
  if(condition_one) continue;
  if(condition_two) break;
  /* code */
}

By putting those two conditional checks early, we can force the compiler to generate a goto instruction and prevent wasteful execution of the loop body.

In Java, you can also break out of two nested loops by labeling the first loop and using a labeled break statement as follows:

outerloop: while(first_condition) {
  while(second_condition) {
    if(third_condition) break outerloop;
  }
}

This neat trick can help prevent the need for a sentinel value in the outer loop. A corresponding trick that most people don’t know is that you can actually break out of arbitrary labeled blocks. This gives you an ugly but capable forward-jumping “goto” statement:

label1: {
  label2: {
    /* code */
    if(first_conditional) break label2;
    if(second_conditional) break label1;
  }
  /* more code */
}

As a disclaimer, I would never, ever, ever use this in normal day-to-day code, but if you need to squeeze some extra bytecodes in a pinch, it’ll do.
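If the control flow of the labeled-block trick is hard to picture, here is a small runnable sketch. The class LabeledBlocks and the trace string are hypothetical, added only to make the jumps visible; the block structure mirrors the snippet above.

```java
class LabeledBlocks {
    /** Returns a trace of which sections ran for the given conditions. */
    static String run(boolean first_conditional, boolean second_conditional) {
        StringBuilder trace = new StringBuilder();
        label1: {
            label2: {
                trace.append("A");                      // the "code" section
                if (first_conditional) break label2;    // jump to just after label2's block
                if (second_conditional) break label1;   // jump to just after label1's block
                trace.append("B");                      // rest of the inner block
            }
            trace.append("C");                          // the "more code" section
        }
        trace.append("D");                              // code following both blocks
        return trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(false, false)); // ABCD
        System.out.println(run(true, false));  // ACD
        System.out.println(run(false, true));  // AD
    }
}
```

Breaking out of label2 skips only the remainder of the inner block, while breaking out of label1 also skips the "more code" section, which is what makes this behave like a forward-only goto.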

Closing Thoughts

My hope is that this post has given you some insight into how the Java compiler emits bytecodes and how you can use that to your advantage to reduce your total instruction count.

In the 2012 Battlecode competition, we used the above techniques extensively in writing what we called the hibernation system, a very tight loop that consumed only 69 bytecodes per turn. In that year’s game spec, an AI’s unused bytecodes could be directly refunded for energy at the end of the turn, so the hibernation system was effectively a low-power state for our AIs that allowed us to stockpile energy. This, in turn, allowed our army to sustain roughly 2x more robots than normally possible with typical (1000-2000) bytecode usage, giving us the edge in combat.

It can’t be stressed enough, however, that many of these optimizations should be done only after all other avenues have been exhausted. There are a large number of algorithmic and data structure tricks to perform which may yield even greater savings, some of which are discussed in our winning 2012 strategy report. I hope to write up these as a stand-alone post at some point, as they are somewhat out of the scope of this article. But for the curious, our strategy report, combined with Steve Arcangeli’s 2011 code snippet notes, should provide a reasonable background to the topic.

To the teams competing in the 2014 competition: best of luck, and don’t forget to have fun!

When looking through Java disassembly, it’s helpful to have a quick reference. Wikipedia has a page on Java bytecodes and their operations which is useful for at-a-glance lookup. The more detailed (and official) description of the bytecode operations can be found in Oracle’s JVM reference. ↩
When reading the output of javap, the comments following an invokevirtual command denote the signature of the method being invoked. The format is the list of argument types in parentheses followed by the return value type. The types are shortened to their one-letter codes to remain compact. ↩

Source Declaration	Method Descriptor
void m(int i, float f)	(IF)V
int[] m(int i, String s)	(ILjava/lang/String;)[I
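If you’d rather not decode descriptors by hand, the standard library can produce them. This is a minimal sketch using java.lang.invoke.MethodType; the wrapper class Descriptors and its helper method are hypothetical names of my own.

```java
import java.lang.invoke.MethodType;

class Descriptors {
    /** Builds the JVM method descriptor for a return type and parameter types. */
    static String descriptor(Class<?> returnType, Class<?>... parameterTypes) {
        return MethodType.methodType(returnType, parameterTypes).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // The two rows of the table above.
        System.out.println(descriptor(void.class, int.class, float.class));    // (IF)V
        System.out.println(descriptor(int[].class, int.class, String.class));  // (ILjava/lang/String;)[I
        // The scan() method from the disassembly: no arguments, returns void.
        System.out.println(descriptor(void.class));                            // ()V
    }
}
```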
As an aside, for speed, the bytecode format actually has two types of table-based jumps: lookupswitch and tableswitch. If the indices are roughly sequential, the compiler will pack them into a tableswitch table in which the parameter into the switch statement is the table offset, giving an O(1) lookup of the jump address. If the indices are far apart / non-sequential however, packing them into a fixed-interval table would be very wasteful, and so the lookupswitch table stores both the case value and the jump offset, allowing the JVM to binary search through the possible case statements for the correct jump vector. ↩

The best thing of course is that despite tableswitch having O(1) complexity and lookupswitch having O(log n) complexity, in Battlecode they’re equivalent because we’re only counting the bytecodes, and not the true computational cost! So you don’t have to worry about creating compact case statements! :D
That’s not to say benchmarking isn’t important, as it’s often one of the fastest ways to profile new routines or survey where your biggest bytecode expenses are. The engine conveniently provides Clock.getBytecodeNum() to check your usage for the current turn. ↩

If you’re benchmarking large routines or your entire AI framework, you’ll want to make sure you account for bytecode overage due to turn-skipping. In our main framework, one of the first things we wrote was a method to get true bytecode count so we could accurately gauge performance. The following formula should give the correct count:
```
int byteCount =
    (GameConstants.BYTECODE_LIMIT-executeStartByte) +
    (currRound-executeStartTime-1) * GameConstants.BYTECODE_LIMIT +
    Clock.getBytecodeNum();
```
In this example, we only access enemy_robots a single time in the loop body – in a real-world example, this technique has the potential to realize even greater savings, especially for member variables that are accessed often. ↩


Battlecode: MIT's longest-running hardcore programming competition

2013-01-07T00:00:00-08:00

As MIT’s Independent Activities Period draws near, I’ve received quite a volume of inquiries from underclassmen about “what should I take?” or “which of these competitions is better?”

It would be nice if I could point all future questions to a single place. So here are my thoughts on (read: pitch for) Battlecode – MIT’s longest-running and most badass programming competition.

What is it?

On a cold Boston January night on a dark stage in front of hundreds of cheering students, two teams will command massive armies and engage in deadly warfare – all within virtual reality. By this point, both teams will have worked tirelessly for four whole weeks to program and polish the best possible AI.

The battlefield itself is rendered in 3D on the center screen of MIT’s main auditorium, with two side screens providing a map overview as well as match statistics. The crowd cheers as the AIs are initialized and the two sides begin constructing robots, capturing objectives, and annihilating each other to the sound of lasers and explosions.

Eventually one team will be violently eliminated in a display of tactical genius on the part of the other – the eliminated team was unable to design an effective counter in the last week leading up to the finals. Sadly, they walk off the stage, a little disheartened, but happy to have made it so far in the finals for their first time. For the remaining team, however, the dream of glory and thousands of dollars in cash prizes lives on.

Overview

Battlecode is a programming competition (course number: 6.370), in which groups of up to four students program AIs to wage war in a Real-Time-Strategy game-like simulator. For those of you who are unfamiliar with the competition, here’s a small collection of matches from previous years:

For a brief overview:

Teams of MIT students submit code that will run on their team’s robots within the simulator.
The AIs are decentralized, meaning that robots do not share global information, and each robot must make its own decision based on the local environment and information nearby robots have radioed in.
The game ends when one side captures the objective (changes from year to year), or eliminates the opposing enemy team.

The finals are held in MIT’s largest auditorium where hundreds of students gather to watch and cheer on as the two AIs duke it out live on stage, while the teams provide a running commentary on their design and strategy. Winners have the chance to dip their toes in prize pools of $40k+, there’s usually a ton of free food, and the whole spectacle is a ton of fun.

(If that doesn’t give you nerd chills, I don’t know what will.)

Why should I care about Battlecode?

If you’re a coder, then no doubt there is some confusion as to the merits of the various competitions on campus as the number increases every single year. Understandably, the web and mobile tracks have seen increased uptake, both due to the general startup craze and because, in the paraphrased words of fellow alumnus @cyen, working with web/mobile technologies represents “real-world” vocational training that some students feel they lack.

While it’s true that the AI you write for Battlecode is not going to be the next big webapp or the cool new business entered into MIT’s $100k competition, the experience of hacking on a difficult problem with short deadlines will definitely improve your technical chops – probably more so than building a pretty CRUD app to win the web-app competition.

From past winner Albert Ni:

Battlecode is such a valuable experience because it’s basically the one software engineering experience you have at MIT where you’re forced to make a ton of tradeoff decisions, but the wrong choice in the wrong place can literally be the difference between victory and defeat. Surprisingly, to me Battlecode turned out to be excellent preparation for the startup world because there you also don’t have enough time to get every last detail right and thus have to decide where to focus your limited energy, but focusing on the wrong things can doom your company.

Ironically you don’t actually get this engineering experience as much in the webapp or mobile app competitions. In both, contestants are expected to quickly cobble together a simple 30-day prototype, but the “rubber never really meets the road” as teams aren’t judged on relevant tangible metrics (and therefore not under pressure to make do-or-die decisions) such as week-over-week growth rate or profits.

As an example, in our team’s bot, I can point to a single commit that was literally the difference between winning the whole competition and being a runner-up: this untested 35 line hack was made one hour before the submission deadline as a last ditch effort to counter the one team we knew we would lose to in the finals (on the last day we consistently lost to them in unranked scrimmage matches). There’s nothing quite like standing on stage in front of hundreds of people hoping that your untested hack deploys correctly.

No one writes bug-free (or even pretty) code in Battlecode, but contestants quickly learn how to avoid the catastrophic bugs that single-handedly lose games. They learn to allocate time efficiently, quickly prototype new strategies, and test often (you can sit and theorycraft all you want, but at the end of the day, only wins on the scrimmage server and tournaments count). And all this is done working closely with two or three other classmates under immense time pressures – a recipe for long nights, heated arguments, and massive merge conflicts.

These experiences and skills learned through Battlecode are hugely relevant in software engineering, startups, and life in general.

We have awesome alumni too!

If you’re still not convinced that Battlecode is the best IAP competition, allow me to point you to its rich history of alumni who later went on to do some amazing things:

David Greenspan and Aaron Iba, two of the original directors of Battlecode, cofounded AppJet together (acq. by Google for $10MM), which launched etherpad, still one of the downright most useful web apps I have ever used. Aaron later went on to become a partner at the well-known startup incubator Y Combinator.
Drew Houston (top 8 Battlecode finalist) and Arash Ferdowsi (Battlecode director) cofounded Dropbox together. A significant percentage of Dropbox’s core engineering team are Battlecode alumni too, including Albert Ni, KMod, Zviad M., and Steve Bartel.

And if you’re not a programmer at all, you should still come to watch the matches and listen to the teams discuss strategy. It’s definitely a ton more exciting than watching an animated Excel graph of numbers.

But… JAVA (whine)

There seems to be an increasingly allergic reaction to Java in the hacker community for its verbosity, ceremony, and corporate stooginess (see this poem). The JVM-based nature of Battlecode has certainly drawn much criticism in this regard - especially when MIT’s entire computer science curriculum is Python-based. Java’s certainly not the coolest language these days.

You quickly find, however, that the Java necessary to perform well in competition is nothing like anything you’ve ever written before. The AI of every individual robot is encapsulated within its own virtual execution environment, and the computational currency is not execution time, or even memory, but rather JVM bytecodes.

What? Bytecodes? I can already hear some of you say. Hear me out – this strange limitation has some wonderful merits. Within these confines, contestants are forced to throw away everything they know about computer science and build their own data structures, their own architecture, and their own algorithms from scratch to deal with such constraints. It forces code to be lean; it forces contestants to be creative and continually innovate new heuristics that run in linear or sublinear time; most of all, these things all have to be self-discovered or invented, as there’s no immediate book or guide to turn to.

Being creative and arbitraging the bytecode counters is well within the spirit of the game, leading to some of the most remarkable hacks I’ve ever seen. A small oversight in the specs counted java.util.regex as a fixed cost. Team Little’s 2007 entry exploited this with an impressive regular-expression encoding of Dijkstra’s algorithm, allowing them to perform an O(n log n) algorithm in just O(1) time. The hole was patched the following year after Little won recognition for the clever hack. Nevertheless, the ongoing cat-and-mouse game between the contestants and the devs has led to an ever-evolving game. For more examples, Steve Arcangeli has a very detailed post explaining some of the clever data structures they were able to exploit during the 2011 competition.

Despite being severely bytecode-limited, you still have to program an AI for a game that is insanely complex. Consider that not only do you have to write good heuristics for pathfinding and attack code, but you also need to share information between robots by broadcasting data packets (as each robot is its own autonomous entity). Thus, in addition to the actual attacking / firing of lasers, there is a second unseen battleground of information warfare. This leads to all sorts of shenanigans. In 2009, teams thought they were being clever by exploiting the fixed cost of Arrays.hashCode() to hash messages cheaply in O(1) time. But Greg Little went even one step further, looked up the OpenJDK hashCode implementation, and designed a messaging attack that mutated the contents of the array without changing the resulting hash. His bot was able to completely disrupt enemy communication, and in some cases, cause them to overload and stop moving via erroneous messages. Badass.
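To see why a hash like this is so easy to fool, here is a minimal sketch of a collision. It is not the actual 2009 attack, which isn’t described here; it assumes messages are plain int[] arrays and relies only on the documented behaviour of Arrays.hashCode(int[]), which folds each element in as result = 31 * result + element. The class and method names are hypothetical.

```java
import java.util.Arrays;

class HashCollision {
    /**
     * Returns a copy of `message` that differs at positions `index` and
     * `index + 1` but has the same Arrays.hashCode value. Requires
     * index + 1 < message.length.
     */
    static int[] mutateKeepingHash(int[] message, int index) {
        int[] forged = message.clone();
        // An element's weight in the hash is 31 times the weight of the element
        // after it, so +1 here and -31 on the next element cancel out exactly
        // (even under int overflow, since the arithmetic wraps consistently).
        forged[index] += 1;
        forged[index + 1] -= 31;
        return forged;
    }

    public static void main(String[] args) {
        int[] original = {7, 42, 1000, -5};
        int[] forged = mutateKeepingHash(original, 1);
        System.out.println(Arrays.toString(forged));                                  // [7, 43, 969, -5]
        System.out.println(Arrays.hashCode(original) == Arrays.hashCode(forged));     // true
        System.out.println(Arrays.equals(original, forged));                          // false
    }
}
```

A receiver that trusts the hash alone would accept the forged array as authentic, which is the kind of weakness the story above turns on.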

So, despite the Java heritage, Battlecode still offers something for everyone – the hacker dreams up neat ways of executing computations within the confines of the engine; the computer scientist drafts the algorithm to compute on the platform; and the low-level performance engineer optimizes it to make it bytecode efficient. All the past winners of Battlecode have been amazingly talented hackers that I seriously look up to and respect for their ability.

To conclude

Having participated 3 consecutive years in Battlecode, I can definitely say that, at least for me, it was the best use of my time during IAP. There’s really nothing like four friends moving together into a single dorm room for one month with a bunch of monitors, a ton of ramen, and a battleplan for victory (In that sense, it’s probably more “startup-y” than doing 6.470/6.570). And despite being a competition, the top teams have always been amazingly generous, offering to share code secrets, tips, and techniques even when I was first a beginner just getting off the ground. Learning from and interacting with some of the best hackers MIT has to offer is quite a humbling experience.

To any MIT students or prospectives considering Battlecode: find a few friends and dive in! You definitely won’t regret it.

To read further

The brand new 2013 game spec will be released January 7th late afternoon. Meanwhile you can check out the old 2012 spec to get a feel for the game details.

In the spirit of Steve Arcangeli’s postmortem on the 2011 winning bot, my next few posts will be on our 2012 experiences as well as tips, tricks, and techniques relevant to the current Battlecode engine. We were fortunate enough to utilize a few of Team Gunface’s tricks in the design of our own bot and would love to continue the tradition by writing up a few of our own. (By the way, if you’re interested in participating in Battlecode, both of Steve’s posts are required reading. They are among the only resources out there that really detail the thought process behind some of the design while also swapping old war stories.)

I’m also going to kick the generosity up a notch: in addition to writing up code snippets, and with the permission of my teams, I’ve open sourced all of my previous Battlecode bots. In order, they are:

Year	ID	Team Name	Members	Repo
2010	#161	“lazer guns pew pew”	Stephen C., Cory L., Kevin L., Sajith W.	Bitbucket
2011	#068	“In the Rear with the Gear”	Cory L., Max N., Justin V.	Bitbucket
2012	#047	“fun gamers”	Yanping C., Cory L., Haitao M., Justin V.	Bitbucket

Feel free to browse through the code, though the commenting density is somewhat inconsistent, and a lot of last-minute changes are undocumented and/or inconsistent with the provided documentation. Also, don’t read through the commit log if you’re easily offended.

Lastly, if you’re interested in sponsoring future battlecode competitions, check out their sponsorship info.

Edit: Check out comments from past competitors and devs on the Hacker News thread