Real-World Concurrency(翻译)
第一次翻译,发现很费劲~~
Know your cold paths from your hot paths. If there is one piece of advice to dispense to those who must develop parallel systems, it is to know which paths through your code you want to be able to execute in parallel (the hot paths) versus which paths can execute sequentially without affecting performance (the cold paths). In our experience, much of the software we write is bone-cold in terms of concurrent execution: it is executed only when initializing, in administrative paths, when unloading, etc. Not only is it a waste of time to make such cold paths execute with a high degree of parallelism, but it is also dangerous: these paths are often among the most difficult and error-prone to parallelize. In cold paths, keep the locking as coarse-grained as possible. Don’t hesitate to have one lock that covers a wide range of rare activity in your subsystem. Conversely, in hot paths—those that must execute concurrently to deliver highest throughput—you must be much more careful: locking strategies must be simple and fine-grained, and you must be careful to avoid activity that can become a bottleneck. And what if you don’t know if a given body of code will be the hot path in the system? In the absence of data, err on the side of assuming that your code is in a cold path and adopt a correspondingly coarse-grained locking strategy—but be prepared to be proven wrong by the data.
(译)开发并发系统的时候,最重要的一个建议是区分热路径与冷路径。我们写的代码的一个部分都是冷路径并发执行的代码,这些代码只有在初始化,管理配置,卸载的时候才会用到,这不仅仅是浪费时间,最重要的是这样写代码是非常危险的,这些路径在多线程下是最容易出错的。 回顾自己接触过的代码,例如设备的增删改等等,是绝对的冷路径,对于性能考虑可以适当减少。 (译)在冷路径代码中可以用粗粒度的锁,相反,在那些必须并行的热路径中,要投入大量的精力来使得锁的粒度小,加锁机制简洁,而且要避免写出有瓶颈的代码。如果不能够区分冷热路径,把热路径当作冷路径的方式处理,那么这个错误将在有相应数据的时候暴露出来。
Know when—and when not—to break up a lock. Global locks can naturally become scalability inhibitors, and when gathered data indicates a single hot lock, it is reasonable to want to break up the lock into per-CPU locks, a hash table of locks, per-structure locks, etc. This might ultimately be the right course of action, but before blindly proceeding down that (complicated) path, carefully examine the work done under the lock: breaking up a lock is not the only way to reduce contention, and contention can be (and often is) more easily reduced by decreasing the hold time of the lock. This can be done by algorithmic improvements (many scalability improvements have been achieved by reducing execution under the lock from quadratic time to linear time!) or by finding activity that is needlessly protected by the lock. Here’s a classic example of this latter case: if data indicates that you are spending time (say) deallocating elements from a shared data structure, you could dequeue and gather the data that needs to be freed with the lock held and defer the actual deallocation of the data until after the lock is dropped. Because the data has been removed from the shared data structure under the lock, there is no data race (other threads see the removal of the data as atomic), and lock hold time has been decreased with only a modest increase in implementation complexity.
(译)清楚情况下把粗粒度锁拆成细粒度锁。全局的锁经常成为性能的瓶颈,当通过数据验证了一个热锁影响到性能的时候,把它拆成单cpu锁,hash-table锁或者每个数据结构自己持有的小锁看起来很合理。这样做看起来是正确的,不要盲目的这样处理,在拆分锁之前要检查下在锁的范围内所做的工作:拆分锁并不是减少锁竞争的唯一方法,锁竞争能够通过减少持有锁的时间来大幅的减少。这可以通过提高算法效率的方式或者找到不需要持有锁的操作来实现。这里有个典型的例子:如果数据表明从共享的数据结构中释放子元素的时候花费了很多的时间,那么在释放内存的时候可以在持有锁的情况下先把要需要释放的数据从共享数据队列出队,出队之后就可以释放锁了,这样通过增加一定量的编码复杂度,锁持有的时间就短了。 java有垃圾回收机制,有单独的垃圾回收线程,可以说了先天的优势在这个地方,但是也增加了不可控的一个因素。
Be wary of readers/writer locks. If there is a novice error when trying to break up a lock, it is this: seeing that a data structure is frequently accessed for reads and infrequently accessed for writes, one may be tempted to replace a mutex guarding the structure with a readers/writer lock to allow for concurrent readers. This seems reasonable, but unless the hold time for the lock is long, this solution will scale no better (and indeed, may scale worse) than having a single lock. Why? Because the state associated with the readers/writer lock must itself be updated atomically, and in the absence of a more sophisticated (and less space-efficient) synchronization primitive, a readers/writer lock will use a single word of memory to store the number of readers. Because the number of readers must be updated atomically, acquiring the lock as a reader requires the same bus transaction—a read-to-own—as acquiring a mutex, and contention on that line can hurt every bit as much. There are still many situations where long hold times (e.g., performing I/O under a lock as reader) more than pay for any memory contention, but one should be sure to gather data to make sure that it is having the desired effect on scalability. Even in those situations where a readers/writer lock is appropriate, an additional note of caution is warranted around blocking semantics. If, for example, the lock implementation blocks new readers when a writer is blocked (a common paradigm to avoid writer starvation), one cannot recursively acquire a lock as reader: if a writer blocks between the initial acquisition as reader and the recursive acquisition as reader, deadlock will result when the recursive acquisition is blocked. All of this is not to say that readers/writer locks shouldn’t be used—just that they shouldn’t be romanticized.
这两段说的是要慎用读写锁,具体的解释与例子见:慎用读写锁
Consider per-CPU locking. Per-CPU locking (that is, acquiring a lock based on the current CPU identifier) can be a convenient technique for diffracting contention, as a per-CPU lock is not likely to be contended (a CPU can run only one thread at a time). If one has short hold times and operating modes that have different coherence requirements, one can have threads acquire a per-CPU lock in the common (noncoherent) case, and then force the uncommon case to grab all the per-CPU locks to construct coherent state. Consider this concrete (if trivial) example: if one were implementing a global counter that is frequently updated but infrequently read, one could implement a per-CPU counter protected by its own lock. Updates to the counter would update only the per-CPU copy, and in the uncommon case in which one wanted to read the counter, all per-CPU locks could be acquired and their corresponding values summed. Two notes on this technique: first, it should be employed only when the data indicates that it’s necessary, as it clearly introduces substantial complexity into the implementation; second, be sure to have a single order for acquiring all locks in the cold path: if one case acquires the per-CPU locks from lowest to highest and another acquires them from highest to lowest, deadlock will (naturally) result.
每个cpu使用一个锁,待学习。
Know when to broadcast—and when to signal. Virtually all condition variable implementations allow threads waiting on the variable to be awakened either via a signal (in which case one thread sleeping on the variable is awakened) or via a broadcast (in which case all threads sleeping on the variable are awakened). These constructs have subtly different semantics: because a broadcast will awaken all waiting threads, it should generally be used to indicate state change rather than resource availability. If a condition broadcast is used when a condition signal would have been more appropriate, the result will be a thundering herd: all waiting threads will wake up, fight over the lock protecting the condition variable, and (assuming that the first thread to acquire the lock also consumes the available resource) sleep once again when they discover that the resource has been consumed. This needless scheduling and locking activity can have a serious effect on performance, especially in Java-based systems, where notifyAll() (i.e., broadcast) seems to have entrenched itself as a preferred paradigm; changing these calls to notify() (i.e., signal) has been known to result in substantial performance gains.7
知道什么时候用broadcast什么时候用signal.几乎所有的条件变量实现都允许线程通过singnal唤醒一个或者使用broadcast唤醒所有阻塞在条件变量的上的线程。这两个方法有着不同的语义,因为broadcast将唤醒所有的阻塞的线程,它一般是用来指示状态变化而不是用来表示资源可用性。当应该使用signal的场景使用了broadcast,这将导致惊群现象,所有等待的线程将被唤醒,都去竞争保护条件变量的锁(假设第一个唤醒的线程消耗了可用资源)而且会在此进入sleep状态当发现资源被占用后。这个无用的调度和锁活动会对性能上造成严重的影响,特别是基于java的系统,在使用java中,似乎notifyAll(broadcast)方法成了优先的选择,改成notiyf(signal)会提高性能。
Learn to debug postmortem. Among some Cassandras of concurrency, a deadlock seems to be a particular bogeyman of sorts, having become the embodiment of all that is difficult in lock-based multithreaded programming. This fear is somewhat peculiar, because deadlocks are actually among the simplest pathologies in software: because (by definition) the threads involved in a deadlock cease to make forward progress, they do the implementer the service of effectively freezing the system with all state intact. To debug a deadlock, one need have only a list of threads, their corresponding stack backtraces, and some knowledge of the system. This information is contained in a snapshot of state so essential to software development that its very name reflects its origins at the dawn of computing: it is a core dump.
学习调试崩溃的程序。在一些并发难题中,死锁应该是基于锁的并发程序中最难应付的问题。死锁问题之所以难发现是因为在软件定义中当死锁发生时会阻塞一个进程的运行,死锁会冻结系统中所有的状态。在调试死锁的时候,需要线程的列表和他们对应的堆栈回溯。这些信息是非常必要的,可以在core dump文件中取得。
Debugging from a core dump—postmortem debugging—is an essential skill for those who implement parallel systems: problems in highly parallel systems are not necessarily reproducible, and a single core dump is often one’s only chance to debug them. Most debuggers support postmortem debugging, and many allow user-defined extensions.8 We encourage practitioners to understand their debugger’s support for postmortem debugging (especially of parallel programs) and to develop extensions specific to debugging their systems.
调试core dump文件对于开发并行系统的人来说是一个必要的技能:在一个高度并发的系统中,并不是所有的问题都可以容易的复现,有的时候dump文件是唯一能能够调试找到问题的途径。大多数调试器有调试core dump文件的功能。
Design your systems to be composable. Among the more galling claims of the detractors of lock-based systems is the notion that they are somehow uncomposable: “Locks and condition variables do not support modular programming,” reads one typically brazen claim, “building large programs by gluing together smaller programs[:] locks make this impossible.”9 The claim, of course, is incorrect. For evidence one need only point at the composition of lock-based systems such as databases and operating systems into larger systems that remain entirely unaware of lower-level locking.
把系统设计的可以组合。很多人抱怨锁和条件变量使得系统不能灵活组装,这个显然是错误的,数据库与操作系统都是很大型的系统,然而他们都是都是lower-level locking.
There are two ways to make lock-based systems completely composable, and each has its own place. First (and most obviously), one can make locking entirely internal to the subsystem. For example, in concurrent operating systems, control never returns to user level with in-kernel locks held; the locks used to implement the system itself are entirely behind the system call interface that constitutes the interface to the system. More generally, this model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the caller with locks held, the subsystem will remain composable.
有两种可以让系统更灵活组装的方式。第一种也是最明显的一种是让锁存在于子系统的内部。例如,在并行的操作系统中,控制流程从来不给用户提供内核级别的锁,这些锁用来实现系统内部的系统调用。
Second (and perhaps counterintuitively), one can achieve concurrency and composability by having no locks whatsoever. In this case, there must be no global subsystem state—subsystem state must be captured in per-instance state, and it must be up to consumers of the subsystem to assure that they do not access their instance in parallel. By leaving locking up to the client of the subsystem, the subsystem itself can be used concurrently by different subsystems and in different contexts. A concrete example of this is the AVL tree implementation used extensively in the Solaris kernel. As with any balanced binary tree, the implementation is sufficiently complex to merit componentization, but by not having any global state, the implementation may be used concurrently by disjoint subsystems—the only constraint is that manipulation of a single AVL tree instance must be serialized. Don’t use a semaphore where a mutex would suffice. A semaphore is a generic synchronization primitive originally described by Dijkstra that can be used to effect a wide range of behavior. It may be tempting to use semaphores in lieu of mutexes to protect critical sections, but there is an important difference between the two constructs: unlike a semaphore, a mutex has a notion of ownership—the lock is either owned or not, and if it is owned, it has a known owner. By contrast, a semaphore (and its kin, the condition variable) has no notion of ownership: when sleeping on a semaphore, one has no way of knowing which thread one is blocking upon.
在mutex可以使用的情况下不要使用semaphore.信号量Dijkstra描述是一类能够影响大范围的同步原语。使用semaphore替代metex看起来是方便诱人的,但是两者之间有非常大的不同:与semaphore不同,metex有被持有的概念,一个锁要么被持有要不不是,当锁被持有的时候他有一个持有者,相反一个信号量(和他的家族条件变量)没有所属者的概念:当一sleep在一个信号量上面的时候,没有方法知道哪个线程阻塞了它。
The lack of ownership presents several problems when used to protect critical sections. First, there is no way of propagating the blocking thread’s scheduling priority to the thread that is in the critical section. This ability to propagate scheduling priority—priority inheritance—is critical in a realtime system, and in the absence of other protocols, semaphore-based systems will always be vulnerable to priority inversions. A second problem with the lack of ownership is that it deprives the system of the ability to make assertions about itself. For example, when ownership is tracked, the machinery that implements thread blocking can detect pathologies such as deadlocks and recursive lock acquisitions, inducing fatal failure (and that all-important core dump) upon detection. Finally, the lack of ownership makes debugging much more onerous. A common pathology in a multithreaded system is a lock not being dropped in some errant return path. When ownership is tracked, one at least has the smoking gun of the past (faulty) owner—and, thus, clues as to the code path by which the lock was not correctly dropped. Without ownership, one is left clueless and reduced to debugging by staring at code/the ceiling/into space.
缺少所有权在保护临界区的时候会表现出一些问题。首先,没有方法把阻塞线程的调度优先级传递给临界区上的线程,这种传递调度优先级的继承对于实时操作系统来说是非常的关键的,在缺少其他协议的支持下,基于信号量的系统将非常容易的产生优先级反转的问题。缺少所有权导致的第二个问题是它剥夺了操作系统断言的功能。例如,当所有权被跟踪的时候,操作系统能够检测到死锁,递归锁获取和一些致命的错误。最后缺少所有权会使得调试更加困难。在并行系统中一个常见的问题就是在错误处理路径中没有释放锁,当所有权被跟踪的时候,会有确切的证据表明过去发生的错误的所有权问题,没有所有权的话就没有任何调试的线索。
All of this is not to say that semaphores shouldn’t be used (indeed, some problems are uniquely suited to a semaphore’s semantics), just that they shouldn’t be used when mutexes would suffice.
上面这些并不是说信号量不应该被使用(在一些情况下必须使用信号量),只是在能够使用metex的时候不应该使用信号量。
Consider memory retiring to implement per-chain hash-table locks. Hash tables are common data structures in performance-critical systems software, and sometimes they must be accessed in parallel. In this case, adding a lock to each hash chain, with the per-chain lock held while readers or writers iterate over the chain, seems straightforward. The problem, however, is resizing the table: dynamically resizing a hash table is central to its efficient operation, and the resize means changing the memory that contains the table. That is, in a resize the pointer to the hash table must change—but we do not wish to require hash lookups to acquire a global lock to determine the current hash table! This problem has several solutions, but a (relatively) straightforward one is to retire memory associated with old hash tables instead of freeing it. On a resize, all per-chain locks are acquired (using a well-defined order to prevent deadlock), and a new table is then allocated, with the contents of the old hash table being rehashed into the new table. After this operation, the old table is not deallocated but rather placed in a queue of old hash tables. Hash lookups then require a slight modification to operate correctly: after acquiring the per-chain lock, the lookup must check the hash-table pointer and compare it with the hash-table pointer that was used to determine the hash chain. If the hash table has changed (that is, if a hash resize has occurred), it must drop the lock and repeat the lookup (which will acquire the correct chain lock in the new table). There are some delicate issues in implementing this—the hash-table pointer must be declared volatile, and the size of the hash table must be contained in the table itself—but the implementation complexity is modest given the alternatives, and (assuming hash tables are doubled when they are resized) the cost in terms of memory is only a factor of two. For an example of this in production code, the reader is directed to the file descriptor locking in Solaris, the source code for which can be found by searching the Internet for “flist_grow.”
考虑用内存“退休”法来实现哈希表的按桶加锁。
Be aware of false sharing. There are a variety of different protocols for keeping memory coherent in caching multiprocessor systems. Typically, these protocols dictate that only a single cache may have a given line of memory in a dirty state. If a different cache wishes to write to the dirty line, the new cache must first read-to-own the dirty line from the owning cache. The size of the line used for coherence (the coherence granularity) has an important ramification for parallel software: because only one cache may own a line a given time, one wishes to avoid a situation where two (or more) small, disjoint data structures are both contained within a single line and accessed in parallel by disjoint caches. This situation—called false sharing—can induce suboptimal scalability in otherwise scalable software. This most frequently arises in practice when one attempts to defract contention with an array of locks: the size of a lock structure is typically no more than the size of a pointer or two and is usually quite a bit less than the coherence granularity (which is typically on the order of 64 bytes). Disjoint CPUs acquiring different locks can therefore potentially contend for the same cache line.
知道什么是伪共享。在多核的操作系统上有着各种不同的协议来保证内存的一致性。典型的,这些协议都指出:只有一个高速缓存可能导致一部分内存处于dirty状态。如果另外一个cache想写入到dirty line的时候,这个cache必须先把这部分的dirty line读到自己的cache中。line的大小为了保证一致性对于不同的软件并不一样:因为在同一个时间一个cache只能拥有一个line,必须避免一种两个很小的数据结构包含在同一个line中而被不同的cahche所读取到,这种情况就称为伪共享,这会导致一些软件的扩展性变差。在实践中,这种情况在使用数组锁减少竞争的时候会遇见:一个锁的结构的大小不会大于一个指针或者两个指针的大小,通常情况下会小于一致性粒度(经常是64byte)。不同的cpu获取不同的锁可能竞争同一个cache line。
False sharing is excruciating to detect dynamically: it requires not only a bus analyzer, but also a way of translating from the physical addresses of the bus to the virtual addresses that make sense to software, and then from there to the actual structures that are inducing the false sharing. (This process is so arduous and error-prone that we have experimented—with some success—with static mechanisms to detect false sharing.10) Fortunately, false sharing is rarely the single greatest scalability inhibitor in a system, and it can be expected to be even less of an issue on a multicore system (where caches are more likely to be shared among CPUs). Nonetheless, this remains an issue that the practitioner should be aware of, especially when creating arrays that are designed to be accessed in parallel. (In this situation, array elements should be padded out to be a multiple of the coherence granularity.)
动态的检测伪共享是非常烦人的:它不仅仅要求对于总线进行分析,而且要找到一种从软件使用的虚拟地址到物理地址翻译的方法,然后与真实的结构比较能够得出找到伪共享的地方。(这个过程太复杂而且容易出错以至于我们尝试使用一些静态的机制来找到伪共享)。幸运的是,伪共享在系统中被大范围的抑制住了,而且在多核的操作系统中(cpu经常会共享caches)更是不多见了。尽管这样,这个问题应该被从业者所了解,特别是创建允许被并行访问的数组的时候(在这种情况下,数据元素大小应该扩大至conherence 粒度的倍数)。
Consider using nonblocking synchronization routines to monitor contention. Many synchronization primitives have different entry points to specify different behavior if the primitive is unavailable: the default entry point will typically block, whereas an alternative entry point will return an error code instead of blocking. This second variant has a number of uses, but a particularly interesting one is the monitoring of one’s own contention: when an attempt to acquire a synchronization primitive fails, the subsystem can know that there is contention. This can be especially useful if a subsystem has a way of dynamically reducing its contention. For example, the Solaris kernel memory allocator has per-CPU caches of memory buffers. When a CPU exhausts its per-CPU caches, it must obtain a new series of buffers from a global pool. Instead of simply acquiring a lock in this case, the code attempts to acquire the lock, incrementing a counter when this fails (and then acquiring the lock through the blocking entry point). If the counter reaches a predefined threshold, the size of the per-CPU caches is increased, thereby dynamically reducing contention.
考虑使用非阻塞的加锁来观察线程争用。许多同步原语有着不同的出发点来规定不同的行为当原语是不可用的时候:默认的出发点是阻塞,还可以选择返回一个错误号而不是阻塞住。第二个变化有很多的应用,但是一个特别的有趣的例子是可以监控自己的竞争状态:当尝试获取一个同步原语失败后,子系统能够知道有竞争发生。这个特性可以实现动态的减少竞争,例如,Solaris的内核分配器每个cpu都有自己的memory buffers,当cpu消耗完他的缓存后,它必须获取一系列新的buffer从全局的pool中。获取的时候不是简单的分配下内存,而是尝试首先获取一个锁,增加一个计数器,如果计数器到达一个预先给定的数据后,cpu 的缓存才增长,从而可以实现动态的减少竞争。
When reacquiring locks, consider using generation counts to detect state change. When lock ordering becomes complicated, at times one will need to drop one lock, acquire another, and then reacquire the first. This can be tricky, as state protected by the first lock may have changed during the time that the lock was dropped—and reverifying this state may be exhausting, inefficient, or even impossible. In these cases, consider associating a generation count with the data structure; when a change is made to the data structure, a generation count is bumped. The logic that drops and reacquires the lock must cache the generation before dropping the lock, and then check the generation upon reacquisition: if the counts are the same, the data structure is as it was when the lock was dropped and the logic may proceed; if the count is different, the state has changed and the logic may react accordingly (for example, by reattempting the larger operation).
在重新加锁时,考虑使用版本号来检测状态变更。当锁的顺序比较复杂的时候,可能出现先释放一个锁,获取另外一个锁,然后再获取释放掉的锁的情况。这样做的时候需要特别注意,在一开始释放锁的时候,被这个锁所保护的状态数据可能改变在第二次获取这个锁的时候,这样做是低效甚至是不可能的。在这种情况下,考虑给数据结构增加一个版本号,当数据改变的时候版本号增加。这样释放在获取锁的时候在释放前保存版本号,再次获取后比较版本号,如果版本号一直,那么数据结构说明没有改变。如果版本号变化了逻辑可能要相应调整(例如可以尝试更大范围的锁操作)。
Use wait- and lock-free structures only if you absolutely must. Over our careers, we have each implemented wait- and lock-free data structures in production code, but we did this only in contexts in which locks could not be acquired for reasons of correctness. Examples include the implementation of the locking system itself,11 the subsystems that span interrupt levels, and dynamic instrumentation facilities.12 These constrained contexts are the exception, not the rule; in normal contexts, wait- and lock-free data structures are to be avoided as their failure modes are brutal (livelock is much nastier to debug than deadlock), their effect on complexity and the maintenance burden is significant, and their benefit in terms of performance is usually nil.
除非万不得已不要使用无锁编程。(无锁编程可参考:http://blog.csdn.net/zzulp/article/details/6259866)。使用无锁编程时候复杂性和维护比有锁大很多,而获得的性能收益是十分微小的。
Prepare for the thrill of victory—and the agony of defeat. Making a system scale can be a frustrating pursuit: the system will not scale until all impediments to scalability have been removed, but it is often impossible to know if the current impediment to scalability is the last one. Removing that last impediment is incredibly gratifying: with that change, throughput finally gushes through the system as if through an open sluice. Conversely, it can be disheartening to work on a complicated lock breakup only to discover that while it was the impediment to scalability, it was merely hiding another impediment, and removing it improves performance very little—or perhaps not at all. As discouraging as it may be, you must return to the system to gather data: does the system not scale because the impediment was misunderstood, or does it not scale because a new impediment has been encountered? If the latter is the case, you can take solace in knowing that your work is necessary—though not sufficient—to achieve scalability, and that the glory of one day flooding the system with throughput still awaits you. The Concurrency Buffet There is universal agreement that writing multithreaded code is difficult: although we have attempted to elucidate some of the lessons learned over the years, it nonetheless remains, in a word, hard. Some have become fixated on this difficulty, viewing the coming of multicore computing as cataclysmic for software. This fear is unfounded, for it ignores the fact that relatively few software engineers actually need to write multithreaded code: for most, concurrency can be achieved by standing on the shoulders of those subsystems that already are highly parallel in implementation. Those practitioners who are implementing a database or an operating system or a virtual machine will continue to need to sweat the details of writing multithreaded code, but for everyone else, the challenge is not how to implement those components but rather how best to use them to deliver a scalable system. While lunch might not be exactly free, it is practically all-you-can-eat—and the buffet is open!
准备迎接胜利的喜悦和失败的打击。