Publications

Multi- and Manycore Systems

Sane Semantics of Best-effort Hardware Transactional Memory

Stephan Diestelhorst, Michael Hohmuth, Martin Pohlack. 2nd Workshop on the Theory of Transactional Memory (WTTM), September 2010. Boston, MA

Transactional memory’s (TM) biggest promise is that of making it easier to devise scalable multi-core programs. Arguably the biggest simplification of reasoning about parallel code with TM comes from atomicity: transactions either take effect instantaneously, or not at all. TM frees the programmer from reasoning how this atomicity is achieved and asks only where it should be employed. Commercial proposals and implementations of hardware TM, such as Sun’s Rock and AMD’s ASF, face a number of limitations and also propose extensions to the all-or-nothing semantics of TM, essentially permitting a set of visible side-effects on various levels. In this abstract (and talk), we will outline several spots of weakened semantics, discuss implications for applications through some examples, and provide a solution within our ASF framework. Because this is work in progress, we would like to discuss whether our specification is useful and whether it is sufficiently detailed and clear.

Paper (Extended Abstract): PDF
Talk: PDF

Evaluation of AMD's advanced synchronization facility within a complete transactional memory stack

Dave Christie, Jae-Woong Chung, Stephan Diestelhorst, Michael Hohmuth, Martin Pohlack, Christof Fetzer, Martin Nowack, Torvald Riegel, Pascal Felber, Patrick Marlier, Etienne Rivière. In the proceedings of the 5th European conference on Computer systems (EuroSys '10), April 2010. Paris, France

In this paper, we report our experiences using ASF for implementing transactional memory. We have extended a C/C++ compiler to support language-level transactions and generate code that takes advantage of ASF. We use a software fallback mechanism for transactions that cannot be committed within ASF (e.g., because of hardware capacity limitations). Our evaluation uses a cycle-accurate x86 simulator that we have extended with ASF support. Building a complete ASF-based software stack allows us to evaluate the performance gains that a user-level program can obtain from ASF. Our measurements on a wide range of benchmarks indicate that the overheads traditionally associated with software transactional memories can be significantly reduced with the help of ASF.

Paper (preprint): PDF

Compilation of Thoughts about AMD Advanced Synchronization Facility and First-Generation Hardware Transactional Memory Support

Jaewoong Chung, David Christie, Martin Pohlack, Stephan Diestelhorst, Michael Hohmuth, Luke Yen. 5th ACM SIGPLAN Workshop on Transactional Computing: Transact 2010, April 2010. Paris, France

After we had released the ASF specification to the public, we contacted various transactional memory (TM) experts in academia and industry to get their opinions on ASF and suggestions for improvements. We found their feedback invaluable in understanding what the first-generation TM hardware support should look like and how to improve ASF. In this paper, we present the summary of their likes, dislikes, and concerns about ASF and explain our opinions on their suggestions. By sharing the reviews, we hope to encourage further involvement of TM experts in defining a desirable set of requirements for the first-generation TM hardware support. We believe that this will greatly help to bring out a better TM support sooner in commercial processors.

Paper: PDF
Talk: PDF

Implementing AMD's Advanced Synchronization Facility in an out-of-order x86 core

Stephan Diestelhorst, Martin Pohlack, Michael Hohmuth, Dave Christie, Jae-Woong Chung, Luke Yen. 5th ACM SIGPLAN Workshop on Transactional Computing: Transact 2010, April 2010. Paris, France

We report our experiences implementing ASF in an out-of-order (OoO) CPU core simulator and our lessons learned for a future real (silicon) implementation of ASF. Specifically, we describe how we integrated ASF into the pipeline of the simulated OoO core and how we handle the intricacies caused by the inherently asynchronous multiprocessor memory-coherence protocol that can cause transaction aborts in any CPU state. We present our ASF implementation’s answers for four of ASF’s key requirements: providing an architectural interface, rather than exposing microarchitecture directly; providing sequential memory access semantics; early abort semantics; and capacity guarantees. We find relatively lightweight solutions for all of these requirements, but the OoO nature of the core necessitates many small changes to several CPU data structures to provide complete tracking of protected memory locations and timely reactions to conflicting memory accesses.

Paper: PDF
Talk: PDF

Hardware acceleration for lock-free data structures and software transactional memory

Stephan Diestelhorst, Michael Hohmuth. In the proceedings of the Workshop on Exploiting Parallelism with Transactional Memory and other Hardware Assisted Methods (EPHAM), April 2008. Boston, MA

In this paper, we report on a new CPU-architecture extension proposal, named Advanced Synchronization Facility (ASF), which is geared toward accelerating and easing lock-free programming and software transactional memory (STM). We present an initial performance simulation and usability study of ASF’s application to a lock-free data structure (a singly linked list) and to accelerating a state-of-the-art STM system, TinySTM. Our results indicate that ASF can significantly increase the throughput and scaling behavior of both workloads: The lock-free implementation has doubled single-threaded performance and maintains a 66 % increase for eight CPUs, while application-transparent enhancement of the STM increases single-thread performance by up to 15 %, and the factor of scaling to eight CPUs by up to 20 %.

Paper: PDF
Talk: PDF

Hardware acceleration for software transactional memory

Stephan Diestelhorst. Diploma thesis, Technische Universität Dresden, January 2008. Dresden, Germany

Stephan's diploma thesis originated during his internship at AMD's OSRC in 2007.

Thesis: PDF

 

Virtualization

How to Deal with Lock-Holder Preemption

Thomas Friebel. Presentation at the Xen Summit North America, July 2008. Boston, MA

Lock-holder preemption is the preemption of a virtual CPU (VCPU) holding a spinlock. Other VCPUs of the same guest that try to acquire the same lock will have to wait until the lock-holder is scheduled again and releases the lock. On a multi-core machine, lock-holder preemption can cause Xen guests to waste about 7% of their time waiting for spinlocks. In this presentation we will show the effects of lock-holder preemption, show two ways to counteract it, and analyze one approach in detail. We will give a short overview of our modifications to the Xen scheduler, and show how we regained the lost performance.

Extended abstract: PDF
Talk: PDF, PDF with comments, Video

Nested paging hardware and software

Benjamin Serebrin, Joerg Roedel. Presentation at the KVM Forum, June 2008. Napa, CA

This presentation covers the ASPLOS paper 'Accelerating two-dimensional page walks for virtualized systems', and implementation details and performance of nested paging support for KVM.

Talk: PDF

Accelerating two-dimensional page walks for virtualized systems

Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, Srilatha Manne. In the proceedings of the 13th international conference on Architectural support for programming languages and operating systems (ASPLOS), March 2008. Seattle, WA

Nested paging is a hardware solution for alleviating the software memory management overhead imposed by system virtualization. Nested paging complements existing page walk hardware to form a two-dimensional (2D) page walk, which reduces the need for hypervisor intervention in guest page table management. However, the extra dimension also increases the maximum number of architecturally-required page table references.

This paper presents an in-depth examination of the 2D page table walk overhead and options for decreasing it. These options include using the AMD Opteron processor's page walk cache to exploit the strong reuse of page entry references. For a mix of server and SPEC benchmarks, the presented results show a 15%-38% improvement in guest performance by extending the existing page walk cache to also store the nested dimension of the 2D page walk. Caching nested page table translations and skipping multiple page entry references produce an additional 3%-7% improvement.

Much of the remaining 2D page walk overhead is due to low-locality nested page entry references, which result in additional memory hierarchy misses. By using large pages, the hypervisor can eliminate many of these long-latency accesses and further improve the guest performance by 3%-22%.

Paper: PDF

Partitioning the physical TLB with SVM ASIDs

Sebastian Biemueller. Presentation at Xen Summit, April 2007. Yorktown Heights, NY

Slide deck used at the 2007 Xen Summit.

Talk: PDF

Nested paging support in Xen

Wei Huang. Presentation at Xen Summit, April 2007. Yorktown Heights, NY

Slide deck used at the 2007 Xen Summit including an introduction to the AMD Barcelona technology by Elsie Wahlig.

Talk: PDF

 

Miscellaneous

Myths and facts about 64-bit Linux

Andreas Herrmann, Andre Przywara. Presentation at Chemnitzer Linux-Tage, March 2008. Chemnitz, Germany

Since the dawn of 64-bit Linux on PCs, a number of myths about the 64-bit topic have been circulating. These slides provide technical details to establish the facts. An overview of the hardware changes of the x86-64 architecture is followed by a short report on the necessary changes to Linux and the GCC toolchain. One focus is compatibility with 32-bit, detailing both the hardware parts and the Linux implementation. Some real-life experiences and traps are shown, as well as some hints for porting old 32-bit programs to 64-bit. A range of benchmark results concludes the presentation, providing a view of the actual performance of 64-bit applications.

Talk: PDF