Whether to distribute event notifications or to support data streaming across a plethora of devices, publish-subscribe is the go-to paradigm for asynchronous message exchange. Messages are sent through the system publishers to subscribers over message channels known as topics, usually guaranteeing delivery even when some nodes within the system have failed. Designing and implementing such a pub/sub service for extensive scalability and fault-tolerance is a challenging task. First, the system should maintain availability even under a number of node failures, regardless of the roles of these nodes. Second, the system should scale gracefully and independently in four dimensions: the number of subscribers, publishers, topics, and active topic subscriptions.
Third, the throughput and latency of the system should not be bottlenecked by disk. Our paper details the design and implementation of a decentralized publish-subscribe system, Firehose, that accommodates future distributed applications by achieving two primary goals: scalability, encompassing the number of topics, the volume of requests, the number of clients, and sophisticated fault-tolerance, guaranteeing that subscribers receive all messages sent on a topic after they join it.
Internally, Firehose meets these goals by integrating low-overhead state-machine replication directly into the dissemination protocols, implementing staggered topic sharding to maximize availability and making failures more graceful, and by performing buffered logging to disks.