Abstract
Advanced bit manipulation operations are not efficiently supported by commodity word-oriented microprocessors. Programming tricks are typically devised to shorten the long sequence of instructions needed to emulate these complicated operations. As these bit manipulation operations are relevant to applications that are becoming increasingly important, we propose direct support for them in microprocessors. In particular, we propose fast bit gather (or parallel extract), bit scatter (or parallel deposit) and bit matrix multiply instructions, building on previous work which focused solely on instructions for accelerating general bit permutations.
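As an illustration of the long instruction sequences a word-oriented processor needs to emulate these operations, the following is a minimal software sketch of bit gather (parallel extract) and bit scatter (parallel deposit). The function names `pex` and `pdep`, and the convention that selected bits are packed toward the least significant end, are assumptions made for this example; they are not taken from the proposed instruction definitions.

```python
def pex(x: int, mask: int) -> int:
    """Bit gather: collect the bits of x selected by mask, packed at the low end."""
    result = 0
    out_pos = 0
    while mask:
        low = mask & -mask        # isolate the lowest set bit of the mask
        if x & low:
            result |= 1 << out_pos
        out_pos += 1
        mask &= mask - 1          # clear the lowest set bit
    return result


def pdep(x: int, mask: int) -> int:
    """Bit scatter: place the low-order bits of x at the positions selected by mask."""
    result = 0
    in_pos = 0
    while mask:
        low = mask & -mask        # isolate the lowest set bit of the mask
        if (x >> in_pos) & 1:
            result |= low
        in_pos += 1
        mask &= mask - 1          # clear the lowest set bit
    return result
```

Each loop iterates once per set bit in the mask, so the emulation cost grows with the number of selected bits, whereas a dedicated instruction would perform the whole operation at once.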
We show that the bit gather and bit scatter instructions can be implemented efficiently using the fast butterfly and inverse butterfly network datapaths. We define static, dynamic and loop-invariant versions of the instructions, with static versions utilizing a much simpler functional unit than dynamic or loop-invariant versions. We show how a hardware decoder can be implemented for the dynamic and loop-invariant versions to generate, dynamically, the control signals for the butterfly and inverse butterfly datapaths. We propose a new advanced bit manipulation functional unit to support bit gather, bit scatter and bit permutation instructions and then show how this functional unit can be extended to subsume the functionality of the standard shifter unit. This new unit represents an evolution in the design of shifters.
We also consider the bit matrix multiply instruction. This instruction multiplies two n × n bit matrices; it is a powerful bit manipulation primitive that can, for example, accelerate parity computation. Bit matrix multiply is currently supported only by supercomputers, and we investigate simpler
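To make the link between bit matrix multiply and parity concrete, here is a minimal software sketch. It assumes the product is taken over GF(2) (AND as multiplication, XOR as addition) and that each matrix is stored as a list of n integers, one per row, with bit j of a row holding column j. The function name `bmm` and this layout are assumptions for illustration only.

```python
def bmm(a: list[int], b: list[int]) -> list[int]:
    """Multiply two n x n bit matrices over GF(2); rows are stored as integers."""
    n = len(a)
    assert len(b) == n, "both matrices must have n rows"
    product = []
    for row in a:
        acc = 0
        for k in range(n):
            if (row >> k) & 1:    # a[i][k] == 1, so add (XOR) row k of b
                acc ^= b[k]
        product.append(acc)
    return product


# Parity as a special case: if every row of b is 1 (only column 0 set),
# bit 0 of each result row is the XOR of all bits in the matching row of a.
rows = [0b1011, 0b0110, 0b0000, 0b1111]
parities = bmm(rows, [1, 1, 1, 1])
```

With this choice of second operand, a single bit matrix multiply yields the parity of all n rows at once.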
Additionally, we analyze a variety of application kernels taken from domains including binary compression, image manipulation, communications, random number generation, bioinformatics, integer compression and cryptology. We show that usage of our proposed instructions yields significant speedups over a basic RISC architecture: parallel extract and parallel deposit speed up applications 2.4× on average, while applications that benefit from





