Foursquare’s Swarm app allows users to “check in” to a venue to share with their friends what they are up to, and to keep a record of places they have been. At the time I was working on this project, Foursquare was reporting up to 8 million check-ins per day, and around 12 billion check-ins in total. Foursquare used MongoDB as its primary data store, and the Checkins collection was one of the largest, most heavily queried collections.
Foursquare had a write-once, distributed, ordered key-value store we called HFile Service. Generally, one would write a Hadoop or Spark task that output some kind of key-value data, and then schedule this task to run once every day / week / month to update the data in HFile Service. This was great for data that could be a little bit stale, because serving from HFile Service was faster and cheaper than serving from MongoDB, and deploying updates to HFile Service was easier than backfilling a live MongoDB collection.
We naturally wondered if we could expand the benefits of HFile Service to some of our more dynamic data, and thus was born Delta Service. The idea behind Delta Service was that we would move long-lived records out of MongoDB and into HFile Service. Record creation, updates, and deletions would all be written to a MongoDB collection, as normal, but the collection would expire records after a certain period of time, generally 14 days (we later extended this to 30 days for checkins). Each day, an offline process would generate a new HFile collection that folded in any records still present in MongoDB. At query time, we would query both the MongoDB collection and the HFile Service collection, and then merge the results. That way a collection could keep most of its data in the cheaper, faster HFile Service, but still support real-time updates and additions. This tended to work especially well for collections that had a lot of record creation, but very few updates or deletions.
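As a sketch of what that merge-on-read might look like (the names here, Record and merge, are illustrative, not Foursquare's actual API), the key rule is that MongoDB holds the recent window of creations, updates, and deletions, so its copy of a record always supersedes the older HFile snapshot:

```scala
// Hypothetical sketch of Delta Service's merge-on-read. Assumption: each
// record carries an id and a deleted flag; these names are illustrative.
case class Record(id: String, payload: String, deleted: Boolean = false)

def merge(mongoResults: Seq[Record], hfileResults: Seq[Record]): Seq[Record] = {
  // Any record present in MongoDB is fresher than its HFile copy.
  val fresh = mongoResults.map(r => r.id -> r).toMap
  val stale = hfileResults.filterNot(r => fresh.contains(r.id))
  // Drop tombstones: records deleted in the recent window must not be served.
  (mongoResults ++ stale).filterNot(_.deleted)
}
```

For example, merging a MongoDB result containing an updated "a" and a deleted "b" with an HFile result containing the old "a" and an untouched "c" yields the updated "a" and "c".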
I was asked to migrate Foursquare’s Checkins collection to Delta Service.
My standard recipe for a migration looks like this:
Creating a new Delta Service collection was straightforward, requiring only a bit of boilerplate in a handful of files. But in order to support the full range of queries that we performed against the Checkins collection, I had to FIXME.
In order to perform this migration, I had to ensure that all queries to the Checkins collection went through a code path that I controlled. There were hundreds of existing queries, and engineers were always adding more. Foursquare used its own DSL, called Rogue, for writing queries against MongoDB. A typical query would look like services.mongo.fetch(Q(Checkins).where(_.id eqs checkinId)…). I wanted to ensure that all existing Checkins queries were migrated to look like services.checkins.fetch(Q(Checkins).where(_.id eqs checkinId)…), and also that no one could write a new query directly against MongoDB once the migration was underway. A reasonable way to handle this would be to add code to the Rogue query driver, services.mongo, that would detect a query against the Checkins collection at runtime and raise a warning or exception. However, Scala’s type system allows us to detect this at compile time instead.
To start with, our Rogue query driver looks something like this:
class MongoQuery {
def fetch[R <: RecordType](
query: Query[R],
): Seq[R] = ...
}
To cause queries against the Checkins collection to fail at compile time, first we will create a sentinel type, class MongoDisallowed {}. We will then update our fetch method to include an implicit evidence parameter that ensures the query’s record type is not a subclass of MongoDisallowed.
class MongoQuery {
def fetch[R <: RecordType](
query: Query[R],
)(implicit ev: R !<:< MongoDisallowed): Seq[R] = ...
}
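Scala's standard library ships a <:< subtyping evidence but no negative counterpart, so !<:< here is presumably defined in-house. The post doesn't show that definition, but a common encoding (it mirrors the well-known shapeless <:!< trick) uses deliberately ambiguous implicits. A sketch:

```scala
// Sketch of one way to define "A is NOT a subtype of B" evidence in Scala,
// via ambiguous implicits. Assumption: this is not Foursquare's actual
// definition, just the standard encoding of the technique.
trait !<:<[A, B]
object !<:< {
  // Evidence is available for any pair of types by default...
  implicit def notSub[A, B]: A !<:< B = new (A !<:< B) {}
  // ...but when A <: B, these two equally specific implicits also apply,
  // making resolution ambiguous, so the call site fails to compile.
  implicit def notSubAmbig1[A, B >: A]: A !<:< B = sys.error("unreachable")
  implicit def notSubAmbig2[A, B >: A]: A !<:< B = sys.error("unreachable")
}

class RecordType
class MongoDisallowed extends RecordType
class Venues extends RecordType

// Compiles: Venues is not a subclass of MongoDisallowed.
val ok: Venues !<:< MongoDisallowed = implicitly
// Would NOT compile (ambiguous implicits):
// implicitly[MongoDisallowed !<:< MongoDisallowed]
```

With this in place, marking the record class with `extends MongoDisallowed` is all it takes to make every services.mongo query against it a compile error.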
In addition to ensuring that other engineers do not create new Checkins queries directly against MongoDB, this technique also gives us a convenient way to audit that we have migrated all existing queries to our new code path: our code will not compile until we have done so.
There existed several hundred queries against the Checkins collection.
Challenges
What queries were hard?