Stories
Slash Boxes
Comments

SoylentNews is people

posted by Fnord666 on Monday November 12 2018, @12:25PM   Printer-friendly
from the high-performance-sharding dept.

Submitted via IRC for Bytram

Cross shard transactions at 10 million requests per second

Dropbox stores petabytes of metadata to support user-facing features and to power our production infrastructure. The primary system we use to store this metadata is named Edgestore and is described in a previous blog post, (Re)Introducing Edgestore. In simple terms, Edgestore is a service and abstraction over thousands of MySQL nodes that provides users with strongly consistent, transactional reads and writes at low latency.

Edgestore hides details of physical sharding from the application layer to allow developers to scale out their metadata storage needs without thinking about complexities of data placement and distribution. Central to building a distributed database on top of individual MySQL shards in Edgestore is the ability to collocate related data items together on the same shard. Developers express logical collocation of data via the concept of a colo, indicating that two pieces of data are typically accessed together. In turn, Edgestore provides low-latency, transactional guarantees for reads and writes within a given colo (by placing them on the same physical MySQL shard), but only best-effort support across colos.

While the product use-cases at Dropbox are usually a good fit for collocation, over time we found that certain ones just aren’t easily partitionable. As a simple example, an association between a user and the content they share with another user is unlikely to be collocated, since the users likely live on different shards. Even if we were to attempt to reorganize physical storage such that related colos land on the same physical shards, we would never get a perfect cut of data.

For data that was not easily collocatable, developers were forced to implement application-level primitives to mask over a lack of cross-shard transactionality, slowing down application development and incurring an unnecessary technical burden. This blog post focuses on our recent deployment of cross shard transactions, which addressed this deficiency in Edgestore’s API, allowing atomic transactions across colos. What follows is a description of our design, potential pitfalls one may encounter along the way, and how we safely validated and deployed this new feature to a live application serving more than ten million requests per second.

The standard protocol for executing a transaction across database shards is two-phase commit, which has existed since at least the 1970s. We applied a modified version of this protocol to support cross-shard transactions in Edgestore.

[...] Two-phase commit is a relatively simple protocol in theory, but unfortunately there are a lot of practical barriers to implementing it. One key problem is read and write amplification: an increase in the number of reads and writes in the protocol path. Write amplification is inherent in the fact that not only do you need to write a transaction record, but also you need to durably stage a commit, which incurs at least one additional write per participant. The extra writes increase the critical section of the transaction, which can cause lock contention and application instability. Moreover, on every read, the database also needs to perform filtering to ensure that the read doesn’t observe any state that is dependent on a pending cross-shard transaction, which affects all reads in the system, even non-transactional ones.

Therefore, in order to translate two-phase commit to Edgestore, we needed a design that answered three questions:

  1. How to efficiently determine transaction state.
  2. How to mitigate the performance penalty of staging and applying commits.
  3. How to minimize the filtering penalty for reads.

[...] Of course, it’s one thing to have a design for a system you believe is correct but entirely another to prove it. Since Edgestore has been running for years and backs almost every request to Dropbox, the design itself had to accommodate validating our assumptions about consistency, correctness, and performance while providing a path for safe rollout.

[...] Although two-phase commit was a fairly natural fit for Edgestore’s existing workload, it is not a silver bullet for those looking to improve their consistency guarantees. Edgestore data was already well-collocated, which meant that cross-shard transactions ended up being fairly rare in practice—only 5-10% of Edgestore transactions involve multiple shards. Had this not been the case, upstream applications might not have been able to handle the increased latency and lock contention that comes with two-phase commit. Moreover, in many cases cross shard transactions replaced more expensive, application-level protocols, which meant the change was a net win for performance in addition to simplifying developer logic.

[...] For a strongly consistent, distributed metadata store such as Edgestore—serving 10 million requests per second and storing multiple petabytes of metadata—writes spanning multiple physical storage nodes are an inevitability. Although our initial “best-effort” approach to multi-shard writes worked well for most use cases, over time the balance of complexity shifted too heavily on developers. Therefore, we decided to tackle the problem of building a scalable primitive for multi-shard writes and implemented cross-shard transactions.

Although the basic protocol underlying our implementation has been known for a long time, actually retrofitting it into an existing system presented many challenges and required a creative approach to both implementation and validation. In the end, the up-front diligence paid dividends and enabled us to make a fundamental change to an existing system while maintaining our standards of trust and the safety of our users’ data.


Original Submission

 
This discussion has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 0) by Anonymous Coward on Monday November 12 2018, @05:29PM

    by Anonymous Coward on Monday November 12 2018, @05:29PM (#760974)

    This is an ad, not an article.