stripe ctf 3 thoughts

30 January 2014

Stripe runs a yearly programming contest made up of a sequence of increasingly difficult challenges. This year the focus was on distributed systems. I was able to finish all the levels in the first day, but had to spend more time to maintain a place on the leaderboard for level4. I didn't spend any time optimizing the earlier levels, so most of my thoughts are on level4.

level0: a spell checker that was too slow

The baseline implementation reads a dictionary into a list then searches the list for every word in the input, marking the word as misspelled if it doesn't appear in the dictionary. I just loaded the dictionary into a set instead of a list so the lookup complexity for each word in the input was O(1) instead of O(n). Good enough to pass but there are much faster ways to do this, particularly if you do some pre-processing on the dictionary.
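
A rough sketch of the idea, written in Go here rather than the skeleton's language, with a made-up dictionary path and output format:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Load the dictionary into a set (a map with empty struct values) so
	// each lookup is O(1) instead of a linear scan over a list.
	dict := make(map[string]struct{})
	f, err := os.Open("/usr/share/dict/words") // dictionary path is a placeholder
	if err != nil {
		panic(err)
	}
	defer f.Close()
	words := bufio.NewScanner(f)
	for words.Scan() {
		dict[strings.ToLower(words.Text())] = struct{}{}
	}

	// Check every word of the input against the set, flagging anything
	// that isn't in the dictionary as misspelled.
	in := bufio.NewScanner(os.Stdin)
	in.Split(bufio.ScanWords)
	for in.Scan() {
		word := in.Text()
		if _, ok := dict[strings.ToLower(word)]; ok {
			fmt.Println(word)
		} else {
			fmt.Printf("<%s>\n", word)
		}
	}
}
```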

level1: The great SHA war

This level boiled down to calculating SHA hashes faster than the test bot could. The baseline implementation is a bash script that uses git commands directly to calculate a commit hash. The trick was to look up the format git uses to calculate the commit hash and compute the hash directly without going through a git command. I wrote a simple ruby script to do this. My first implementation managed ~165k hashes/sec on a 2012 i5 MacBook Air. I didn't think this would be enough, but I left it running while I worked on parallelizing the code to use multiple cores. To my surprise the single threaded implementation succeeded before I finished the parallel version.
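
The format itself is simple: git hashes the string "commit <body length>\0" followed by the raw commit text. A minimal Go sketch of a miner built on that fact (the tree, parent, author and difficulty values are placeholders; my actual solution was the ruby equivalent):

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"strings"
)

// commitHash returns the SHA-1 git would assign to a commit with this body:
// git hashes "commit <length>\x00" followed by the raw commit text.
func commitHash(body string) string {
	store := fmt.Sprintf("commit %d\x00%s", len(body), body)
	return fmt.Sprintf("%x", sha1.Sum([]byte(store)))
}

func main() {
	// Placeholder commit fields; a real miner would read the current tree
	// and parent from the repository and keep the rest fixed.
	base := strings.Join([]string{
		"tree 3c4e9cd789d88d8d89c1073707c3585e41b0e614",
		"parent 0000000000000000000000000000000000000000",
		"author CTF Player <player@example.com> 1391000000 +0000",
		"committer CTF Player <player@example.com> 1391000000 +0000",
		"",
		"mined commit",
		"",
	}, "\n")

	const target = "000001" // difficulty: the hash must sort below this prefix

	for nonce := 0; ; nonce++ {
		body := base + fmt.Sprintf("nonce %d\n", nonce)
		if h := commitHash(body); h < target {
			fmt.Printf("found %s at nonce %d\n", h, nonce)
			return
		}
	}
}
```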

level2: DDoS filtering with node.js

This level was a node.js proxy that needed to filter out a DDoS attack that was overloading the backend service. The baseline implementation did no filtering and routed all requests to a single backend server even though there were 3 backend servers available. I round-robined requests across the backend nodes and stored a request count for each source address, dropping requests once a source had made 15 requests (a limit chosen by fair dice roll). This worked because the test case was only 20 seconds long, i.e. there was no need to track requests per second or anything like that to pass such a short test.
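
A sketch of the approach, in Go rather than the level's node.js, with made-up backend addresses and the limits pulled from the description above:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
	"sync"
)

func main() {
	// Backend addresses are placeholders; the real level supplied them in a
	// config file.
	backends := []*url.URL{
		mustParse("http://localhost:9001"),
		mustParse("http://localhost:9002"),
		mustParse("http://localhost:9003"),
	}

	var (
		mu     sync.Mutex
		next   int
		counts = map[string]int{}
	)

	// Round-robin each forwarded request across the backends.
	proxy := &httputil.ReverseProxy{Director: func(r *http.Request) {
		mu.Lock()
		target := backends[next%len(backends)]
		next++
		mu.Unlock()
		r.URL.Scheme = target.Scheme
		r.URL.Host = target.Host
	}}

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip := strings.Split(r.RemoteAddr, ":")[0] // good enough for IPv4
		mu.Lock()
		counts[ip]++
		over := counts[ip] > 15 // drop a source after 15 requests
		mu.Unlock()
		if over {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", handler))
}

func mustParse(s string) *url.URL {
	u, err := url.Parse(s)
	if err != nil {
		panic(err)
	}
	return u
}
```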

level3: distributed text search engine in Scala

The search cluster had 1 master node and 3 worker nodes, each with a 500MB memory limit. The input was a directory tree full of text files which you had a few minutes to pre-process before the test harness would start sending queries. The test harness sent a query word and expected the file name and line number for any occurrence of the word. Matching was 'contains style' within a word, but not across whitespace.

My first cut at this was to build a tri-gram index and just round-robin to the worker nodes. This scored really well on the local tests and was surprisingly quick to implement given that I'd never written anything in Scala before. Unfortunately, it failed miserably on the remote tests. The remote tests contained much more data to index than the local tests and that blew my memory usage way beyond the 500MB limit. I optimized for memory usage and sharded the index across the worker nodes, using the master to merge the results, but I was still way over the memory limit. In the process I also realized the test harness sends requests sequentially, so round-robining provides no performance benefit over a single node.
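
The skeleton was Scala, but the index idea looks roughly like this Go sketch: keep a posting list per trigram and answer a substring query by intersecting the posting lists of the query's trigrams. Candidates still need to be verified against the original line, it ignores the word-boundary detail above, and storing a posting per line is exactly the kind of thing that blew the memory budget:

```go
package search

// posting identifies one line of one file.
type posting struct {
	file string
	line int
}

// index maps every 3-byte substring to the lines that contain it.
type index map[string][]posting

func trigrams(s string) []string {
	var out []string
	for i := 0; i+3 <= len(s); i++ {
		out = append(out, s[i:i+3])
	}
	return out
}

// Add indexes a single line of a file.
func (idx index) Add(file string, line int, text string) {
	seen := map[string]bool{}
	for _, t := range trigrams(text) {
		if !seen[t] {
			seen[t] = true
			idx[t] = append(idx[t], posting{file, line})
		}
	}
}

// Search returns candidate lines for a substring query by intersecting the
// posting lists of the query's trigrams. Each candidate still needs to be
// checked against the original line to rule out false positives.
func (idx index) Search(query string) []posting {
	ts := trigrams(query)
	if len(ts) == 0 {
		return nil // queries shorter than 3 chars need a different path
	}
	counts := map[posting]int{}
	for _, t := range ts {
		for _, p := range idx[t] {
			counts[p]++
		}
	}
	var hits []posting
	for p, c := range counts {
		if c == len(ts) {
			hits = append(hits, p)
		}
	}
	return hits
}
```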

Working on the memory issue was a painful process. Iteration times were horrible due to Scala's slow compilation times and the JVM's out of memory behavior, which is to freeze up as hard as 3 mile deep arctic ice. Eventually I caved and just wrote a simple inverted index search. It wasn't great but good enough to pass. This could have been a fun problem, but I'd had enough Scala for the day and was trying to stick to using the languages the skeletons were written in for the sake of variety.

level4: Distributed database in Go

I really enjoyed this level. I've been meaning to learn Go and the concepts were interesting.

The setup was a 5 node SQL cluster. The cluster had to accept queries on all nodes and maintain consistency while the test harness randomly caused network issues between the nodes. Stripe wrote a library called Octopus to handle all the network fiddling.

The core issue for this level is fault tolerant consensus. To deal with this I tracked down a library that implemented the raft protocol in Go. Unfortunately the library had no documentation to speak of. Luckily there was a reference implementation which looked a lot like the skeleton code Stripe provided, bingo.

Integrating goraft wasn't as bad as I thought it would be. Mostly it was just fumbling around learning Go and dealing with the way the test harness used unix sockets to connect nodes. Luckily, the real fun begins once goraft is in place.

proxying

The first challenge is that raft is designed for only the cluster leader to handle requests but the test harness sends requests to all nodes. The obvious solution here is to proxy requests to the leader, wait for a response, then return that response to the client. Enter the first edge case.
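
Sketched in Go with hypothetical isLeader/leaderURL/execute hooks standing in for the goraft state and the skeleton's SQL layer (the real nodes talked over unix sockets, which this glosses over):

```go
package level4

import (
	"bytes"
	"io"
	"net/http"
)

// node is a stand-in for the skeleton's server type; isLeader, leaderURL and
// execute are hypothetical hooks, not the real goraft or skeleton API.
type node struct {
	isLeader  func() bool
	leaderURL func() string
	execute   func(sql string) (string, error)
	client    *http.Client
}

// handleSQL serves /sql. Followers forward the raw query to the current
// leader and relay its response; the leader runs the query itself.
func (n *node) handleSQL(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	if !n.isLeader() {
		resp, err := n.client.Post(n.leaderURL()+"/sql", "text/plain", bytes.NewReader(body))
		if err != nil {
			// This is where the first edge case shows up: the connection to
			// the leader died, but the command may still have been applied.
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer resp.Body.Close()
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
		return
	}

	out, err := n.execute(string(body))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	io.WriteString(w, out)
}
```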

the first edge case

  1. Client sends request to a follower
  2. Follower proxies the request to the leader
  3. Before the leader returns a response, the connection between the leader and the follower is cut
  4. Follower sends a 500 to the client due to the connection error, but the request was actually processed (sometimes)
  5. Client is now stuck waiting for a sequence number

To deal with this I tried two approaches; both relied on making note of the request SQL as it arrived, then watching for a raft command with the same SQL to be executed. This works only because the test doesn't contain any duplicate queries. A better solution would generate a unique ID for each request, but generating a guaranteed unique ID in a cluster can be a challenge in itself.
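
A rough sketch of that bookkeeping (a hypothetical helper, not the actual code): record the SQL when a request arrives, and have the state machine signal the waiting request when a command carrying the same SQL is applied, regardless of which node committed it:

```go
package level4

import "sync"

// pending matches incoming SQL text to the eventual result of the raft
// command that carries the same SQL. This only works because the test
// harness never sends the same query twice.
type pending struct {
	mu      sync.Mutex
	waiters map[string]chan string
}

func newPending() *pending {
	return &pending{waiters: make(map[string]chan string)}
}

// wait registers interest in a query and returns a channel that will receive
// the result once the command is applied, no matter which node committed it.
func (p *pending) wait(sql string) <-chan string {
	p.mu.Lock()
	defer p.mu.Unlock()
	ch := make(chan string, 1)
	p.waiters[sql] = ch
	return ch
}

// applied is called from the state machine after a command executes; if a
// client is waiting on this SQL, it gets the result even if the original
// proxied request appeared to fail.
func (p *pending) applied(sql, result string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if ch, ok := p.waiters[sql]; ok {
		ch <- result
		delete(p.waiters, sql)
	}
}
```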

asynchronous

My first attempt used the fact that if you return a non-200 response to the client, it will wait 100ms and try again. I used this as a polling mechanism, treating every request as asynchronous. This worked quite well, and was fast, but seemed to have a high likelihood of triggering the second edge case (described below). The other problem with this approach is that while the cluster processes a lot of queries, it gets way ahead of sending responses to the clients; you're limited by the 100ms client retry timeout.

synchronous

Instead of relying on the client to retry a failed request, I used the fact that the client <-> node connections are perfect to hold the connection open until I saw a result on the backend. This slowed the cluster down but made it more resilient to the second edge case. It also kept the cluster's query execution much more in sync with client responses; no longer was the cluster 1000 queries ahead of what had been sent to the clients.
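
Continuing the sketch above, the synchronous handler just parks on the pending channel (the timeout and the forward hook are made up) instead of bouncing the client:

```go
package level4

import (
	"io"
	"net/http"
	"time"
)

// handleSQLSync holds the client connection open until the command is seen
// being applied, rather than returning a non-200 and leaning on the client's
// 100ms retry loop. forward is whatever sends the query toward the leader.
func handleSQLSync(w http.ResponseWriter, sql string, p *pending, forward func(string) error) {
	ch := p.wait(sql)

	if err := forward(sql); err != nil {
		// Don't report the failure yet: the command may still have been
		// committed (edge case one), so keep waiting for the apply signal.
	}

	select {
	case result := <-ch:
		io.WriteString(w, result)
	case <-time.After(5 * time.Second): // timeout value is arbitrary
		http.Error(w, "query never applied", http.StatusInternalServerError)
	}
}
```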

replacing the SQL implementation

The default skeleton called out to the SQLite binary for every request. I timed this and was actually surprised by how fast it was, ~20ms on average. But it had highly variable latency, sometimes taking 150ms to execute a query. I replaced this with an in-memory, in-process SQLite implementation, which cut query times down to <1ms. This change greatly improved performance: on clean runs I was getting 1200-1400 queries completed. If I hit the edge case, I'd get stuck at only 100 or so.
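
Something along these lines; the driver choice and schema here are assumptions, not necessarily what the skeleton or my solution used:

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/mattn/go-sqlite3" // driver choice is an assumption
)

func main() {
	// An in-memory, in-process database avoids forking the sqlite3 binary
	// for every query, which is where the ~20ms (and occasional 150ms)
	// per-query cost was going.
	db, err := sql.Open("sqlite3", ":memory:")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Placeholder schema and data, just to show the round trip.
	if _, err := db.Exec(`CREATE TABLE ctf3 (name TEXT, count INT)`); err != nil {
		panic(err)
	}
	if _, err := db.Exec(`INSERT INTO ctf3 VALUES ('queries', 1)`); err != nil {
		panic(err)
	}

	var n int
	if err := db.QueryRow(`SELECT count FROM ctf3`).Scan(&n); err != nil {
		panic(err)
	}
	fmt.Println(n)
}
```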

the second edge case

This one is a bit wonky, but I believe this is what was happening:

  1. Client sends a request to a follower node
  2. Follower node proxies the request to the leader
  3. Leader sends out the raft command but it doesn't make it to enough nodes to be committed
  4. An election is triggered by network failures, but the old leader still has its connection to the proxy node
  5. Upon losing leadership, the old leader returns an error to the follower node indicating that the command failed to be committed
  6. Proxy receives this error and sends a 500 to the client
  7. A node which did receive the command gets elected to be the new leader
  8. New leader broadcasts the command and succeeds in committing it
  9. Client is now stuck waiting for a sequence number since it believes the command failed

I never came up with a reasonable way to deal with this edge case. I did see legitimate cases where there were commands that failed so I couldn't just assume the error was a false failure. Any approach I came up with would have just made the error less likely and slowed down the cluster so much that the score would be garbage anyway. It didn't happen all that often so I just hoped to avoid it.

scoring

Scoring was highly variable due to a couple of issues. The largest was that Stripe's reference implementation bombed a few of the test cases very badly. Since scoring was handled by comparing your score to the reference implementation's score, you could get very high scores on those test cases. My guess is that the reference implementation also didn't find a way to handle the second edge case, causing its occasional very low scores.

There are also a couple of bugs in goraft which made scoring difficult. The most annoying is a race condition that causes a crash. Octopus seemed to be very good at triggering this bug; I'd say it got triggered somewhere around 20-33% of the time. Avoiding it seemed to be pure luck, even across multiple runs of the same test case.

Due to these issues, competing on the leaderboard required you to loop git push and hope to land on a test case that the reference scored poorly on and that didn't trigger the goraft race condition. I had runs where I completed 1300 queries but scored only ~500, and runs where I completed only 700 queries but scored a few thousand; just luck of the draw.

keys punched by Mark Bell

