An Evolutionary Normalization Algorithm for Signed Floating-Point Multiply-Accumulate Operation

2022-08-24 12:57 Rajkumar Sarma, Cherry Bhargava and Ketan Kotecha
Computers, Materials & Continua, 2022, Issue 7

Rajkumar Sarma, Cherry Bhargava and Ketan Kotecha

1Department of Electrical & Electronics Engineering, Faculty of Engineering & Technology, Jain (Deemed-to-be-University), Ramanagar, 562112, Karnataka, India

2Symbiosis Institute of Technology, Symbiosis International (Deemed University), Lavale, Pune, 412115, India

3Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Lavale, Pune, 412115, India

Abstract: In digital signal processing applications such as graphics and computation systems, multiplication-accumulation is one of the prime operations. A MAC unit is a vital component of digital systems that implement Fast Fourier Transform (FFT) algorithms, convolution, image processing algorithms, and so on. Normalization architectures are widely used in the digital signal processing domain. The main objective of normalization is to perform comparison and shift operations. In this research paper, an evolutionary approach for designing an optimized normalization algorithm is proposed using basic logical blocks such as multiplexers and adders. The proposed normalization algorithm is then used to design an 8×8 bit Signed Floating-Point Multiply-Accumulate (SFMAC) architecture. Since the SFMAC accepts an 8-bit significand and a 3-bit exponent, the input to the architecture can range from $-(7.96875)_{10}$ to $+(7.96875)_{10}$. The proposed architecture is designed and implemented in Cadence Virtuoso using 90 nm and 130 nm technologies (Generic Process Design Kit (GPDK) and Taiwan Semiconductor Manufacturing Company (TSMC), respectively). To reduce the power consumption of the proposed normalization architecture, techniques such as “block enabling” and “clock gating” are used extensively. According to the analysis done in Cadence, the proposed architecture uses the least amount of power compared to its current predecessors.

Keywords: Data normalization; cadence virtuoso; signed-floating-point MAC; evolutionary optimized algorithm; block enabling; clock gating

1 Introduction to Multiply & Accumulate (MAC) Architecture

In digital signal processing, the MAC operation is considered a significant and critical operation. Digital Signal Processing (DSP) algorithms execute many mathematical calculations repeatedly and rapidly on various data sets. DSP algorithms can be executed effectively by the majority of operating systems and general-purpose microprocessors. Unfortunately, DSP algorithms have energy efficiency issues when running on portable devices such as Personal Digital Assistants (PDAs) and mobile phones. With respect to delay and power optimization, the exponential growth of portable electronics has posed a major challenge to Very Large-Scale Integration (VLSI) design engineers. A MAC unit is a vital component of any digital system that implements FFT algorithms, convolution, etc. The MAC block is not limited to the fixed-point number system: for audio and image processing applications, a floating-point MAC architecture is much needed. The basic MAC operation multiplies two variables (Xi and Yi) and adds the product to the previous cycle’s output. Therefore, the MAC architecture includes the key operational blocks of a multiplier, an adder, and a register/accumulator [1-14]. The multiplier multiplies the two input operands; the adder adds the multiplier’s output to the previous cycle’s result; and the register or accumulator stores the final addition output. Fig. 1 shows the generalized block diagram of an N×N bit MAC.

The popularity of portable devices and the need to limit power consumption (and therefore heat dissipation) in very dense VLSI chips have resulted in rapid advances in low-power design over the past few years. Mobile applications that require low power dissipation and high throughput, such as notebook Personal Computers (PCs), mobile communication devices, and PDAs, are the driving forces behind these innovations. In most cases, low power consumption requirements need to be met along with equally challenging targets of high chip density and high speed. Therefore, low-power IC design has emerged as a beneficial and fast-developing area of Complementary Metal Oxide Semiconductor (CMOS) circuit design. Usually, the restricted battery life places very stringent demands on a portable system’s overall power requirements. New types of rechargeable batteries, such as “Nickel-Metal Hydride (NiMH)”, are being produced with better energy storage capacity than the traditional “Nickel-Cadmium (NiCd)” batteries. Still, there is no prospect of a significant increase in energy capacity in the foreseeable future. The energy density (the energy stored per unit weight) provided by newer technologies such as NiMH is approximately 30 Watt-hour/pound, which is quite low considering the growing applications of portable systems. Scaling down the energy dissipation of Integrated Circuits (ICs) while improving functionality is, therefore, a significant task in developing portable devices.

In high-performance digital systems, such as microprocessors, microcontrollers, and DSPs, the need for low-power circuit development is also becoming a significant concern. Targeting higher chip density and higher processing speed leads to high clock rates in very complex circuits. If the chip’s clock speed rises, the chip’s energy dissipation rises as well, thereby increasing the temperature linearly. As the dissipated heat has to be removed efficiently to keep the chip’s temperature at an optimum level, packaging cost, cooling, and heat extraction become important aspects. A few high-end microchips designed in the mid-1990s (such as the Intel Pentium, Digital Equipment Corporation (DEC) Alpha, and PowerPC) operate at frequencies of 100-300 MHz with a total average power of 20-50 W. VLSI reliability is one more critical factor for design engineers, as it reinforces the demand for energy-efficient design. There is a close connection between the maximum power dissipation of an electronic circuit and reliability concerns such as electromigration and system degradation caused by the carriers. Additionally, the thermal stress caused by chip heat dissipation is a significant reliability issue. As a consequence, reducing power consumption is also critical for improving reliability. The procedures used in digital systems to achieve low power consumption vary from the device and technology level to the algorithm level. Standard device features (such as threshold voltage), device dimensions, and interconnection properties are essential factors in reducing power consumption. Circuit-level approaches such as a careful selection of the circuit design logic family, a reduction in the total number of voltage transitions, and clocking approaches can be used to minimize transistor-level energy dissipation. Measures at the architecture level include intelligent power management of different system components, the use of pipelining and concurrency, and bus layout design.

In recent years, researchers have published several related works [2-3,5-21]. Reference [22] (2007) proposes a high-throughput MAC architecture that promises an optimized area; to maximize speed, it employs 4:2 compressor circuits. Reference [23] (2012) suggests a novel multiplier architecture. Reference [12] (2013) proposes a novel 64-bit compatible architecture based on a transformed “Wallace tree multiplier”. Reference [24] (2013) uses an updated Braun multiplier to create a MAC unit, implemented with NCSim and RTL Compiler. In 2014, reference [9] proposes a pipelined “low-power Baugh-Wooley multiplier-based MAC” unit. Reference [25] (2009) explains a split MAC architecture; to increase the speed of operation further, it proposes a strategy to compact the “partial product using interleaved adders” and a “modified hybrid partial product reduction tree (PPRT)” scheme. A double carry-save addition algorithm is proposed in [26], and its prototype is verified on a six-input Look-up Table (LUT) based Field Programmable Gate Array (FPGA). In 2016, an “embedded logic full adder (PRO-FA)” was presented in [14], which offers improvements on the basic design constraints. In 2019, a “low-complexity asynchronous pipelined adder” that promises significant savings in energy and latency was proposed [27]. At the same time, a Pro-LA architecture that targets error-tolerant applications was proposed in [28]. In a separate line of work, reference [29] proposes an optimization approach for a “gripper mechanism” using appropriate bi-algorithms. An optimization technique for a “dragonfly-inspired compliant joint” is proposed in [30], whereas reference [31] proposes an optimization technique for a “linear compliant mechanism of nanoindentation tester”.

As shown in Fig. 1, the multiplier block takes two N-bit inputs and multiplies them to produce a 2N-bit output, which is passed on to the register/accumulator unit. The register-cum-accumulator temporarily stores the data and sends it to the adder as an input. The adder sums the register unit output and the accumulated value resulting from the previous cycle. Thus, the MAC unit’s overall output is taken from the accumulator register output. Hence, the MAC architecture consists of an “N-bit multiplier”, a “2N-bit register”, a “(2N+1)-bit adder”, and two “(2N+1)-bit accumulators/registers” (one for storing the output value and the other for reading the previous output). The conventional MAC architecture shown in Fig. 1 can perform MAC operations on unsigned fixed-point numbers only, whereas today’s digital systems demand signed floating-point operation. In floating-point arithmetic, the conventional adder/subtractor or multiplier algorithms cannot be applied directly because of the presence of the decimal point in the inputs. Therefore, normalization operations are essential to standardize the floating-point inputs. Normalization means standardization in which the decimal point location of the mantissa is fixed and the exponent value is varied within a particular range based on the shifting of the decimal point. This paper proposes a multiplexer-based normalization architecture that enables MAC operations on signed floating-point inputs. For this purpose, a unique input data format is created that accepts 9-bit binary data and a 4-bit exponent input. As a result, the new input data format is 13 bits wide (including the MSBs reserved as sign bits for the mantissa and the exponent). The Exponent-Comparator-Circuit (ECC) and the Exponent-Shifter-Circuit (ESC) are the two main components of the proposed normalization architecture.
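The conventional unsigned fixed-point MAC described above can be modelled behaviourally in a few lines. The following is only a minimal sketch: the function names `mac_step` and `mac` are hypothetical, and wrapping the accumulator to (2N+1) bits on overflow is an assumption rather than something the text specifies.

```python
def mac_step(x: int, y: int, acc: int, n: int = 8) -> int:
    """One cycle of an unsigned N x N MAC: accumulate x * y into acc."""
    if not (0 <= x < 2 ** n and 0 <= y < 2 ** n):
        raise ValueError("operands must fit in n unsigned bits")
    product = x * y                    # 2N-bit multiplier output
    total = acc + product              # adder: product plus previous result
    # Assumption: the accumulator register keeps only (2N+1) bits.
    return total & ((1 << (2 * n + 1)) - 1)


def mac(xs, ys, n: int = 8) -> int:
    """Run the MAC over two operand sequences, starting from a cleared accumulator."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac_step(x, y, acc, n)
    return acc


if __name__ == "__main__":
    print(mac([1, 2, 3], [4, 5, 6]))   # 1*4 + 2*5 + 3*6
```

This model has no notion of sign or exponent, which is the limitation the normalization architecture is meant to address.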

This manuscript is organized as follows: Section 2 explains the Exponent-Comparator-Circuit (ECC) and its operation. Section 3 describes the Exponent-Shifter-Circuit (ESC) and its operation. Section 4 describes the proposed SFMAC architecture built from the ECC and ESC blocks and compares it with existing architectures. Finally, the conclusions are presented in Section 5.

2 ECC Block

The product of the input exponents and the previous cycle’s output exponent are used as inputs to the ECC (Exponent-Comparator-Circuit). The most important point is that the difference between the two ECC inputs is calculated as an arithmetic difference if both inputs have the same sign. On the other hand, if the inputs have different signs, the difference between the two is equal to the arithmetic sum of the two input magnitudes. Fig. 2 shows the flowchart of the ECC block.

Figure 2: ECC flowchart

Multiplexers are used in the architecture to compare the inputs. The ECC operation generates a 5-bit output that is used to execute binary shifts. The MUX-based architecture of the ECC block is shown in Fig. 3 and operates as follows:

Figure 3: MUX based ECC architecture

i) The ECC’s inputs are expressed in 2’s complement form depending on the input sign bits.

ii) The operation of the ECC is further divided into two cases based on the sign bits of the inputs:

a. If the sign bits are different, add the inputs of the ECC to produce a 4-bit output (i.e., discard the carry bit). Set the 5th bit to ‘1’ if the product of the exponents of the inputs is negative and the previous exponent is positive; otherwise, set the 5th bit to ‘0’.

b. If the sign bits of the inputs to the ECC are the same, determine which of the two inputs is higher and find the difference between them as follows:

• To find the higher number, compare the two numbers bit by bit, from MSB to LSB, as shown in Fig. 4.

Figure 4: MUX based ECC with same sign bit

• To find the difference, use the 2’s complement approach. The difference is a 4-bit output (i.e., discard the borrow bit). Set the 5th bit to ‘0’ if the product of the exponents of the inputs is higher than the previous cycle’s exponent; otherwise, set the 5th bit to ‘1’.

• In this architecture, multiplexers are used to compare the inputs.

iii) This method yields a 5-bit output that is used to perform binary shifts in the ESC block.
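The ECC rules above can be sketched as a behavioural model. This is not the authors’ multiplexer netlist: the names `ecc` and `ecc_word` are hypothetical, the two inputs are modelled as signed Python integers instead of sign bit plus 2’s complement fields, and treating zero as positive is an assumption.

```python
from typing import Tuple


def ecc(exp_product: int, exp_previous: int) -> Tuple[int, int]:
    """Return (flag, shift): the 5th bit and the 4-bit shift amount."""
    product_negative = exp_product < 0      # assumption: zero counts as positive
    previous_negative = exp_previous < 0
    if product_negative != previous_negative:
        # Different signs: add the magnitudes and discard the carry bit.
        shift = (abs(exp_product) + abs(exp_previous)) & 0xF
        flag = 1 if (product_negative and not previous_negative) else 0
    else:
        # Same sign: take the difference and discard the borrow bit.
        shift = abs(exp_product - exp_previous) & 0xF
        flag = 0 if exp_product > exp_previous else 1
    return flag, shift


def ecc_word(exp_product: int, exp_previous: int) -> int:
    """Pack flag and shift into a single 5-bit word (flag is the MSB)."""
    flag, shift = ecc(exp_product, exp_previous)
    return (flag << 4) | shift


if __name__ == "__main__":
    print(ecc(-2, 3), ecc(3, 1), format(ecc_word(-2, 3), "05b"))
```

In this reading, the flag marks which operand has the smaller exponent (1 for the product, 0 for the previous output), and the lower four bits give the distance between the two exponents.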

3 ESC Block

The ESC (Exponent-Shifter-Circuit) block is in charge of shifting the smaller number by the difference between the exponent of the product of the 8-bit inputs and the exponent of the previous cycle’s MAC output (the preceding output). The ESC block’s inputs are the ECC block’s 5-bit output, the 16-bit product of the inputs, and the previous cycle’s 16-bit output. The multiplexer-based design of the ESC block is shown in Fig. 5. The step-by-step procedure is as follows:

Figure 5: MUX based ESC architecture

1. The smaller number is identified from the 5-bit ECC result. If the MSB of the ECC block output is 1, the product of the inputs is shifted to the right by the decimal value of the remaining 4 bits of the ECC block output. If the MSB of the ECC block output is 0, the preceding output is shifted to the right by the decimal value of the remaining 4 bits.

2. The MSB of the ECC block output also identifies the ESC input that does not need shifting. If the MSB of the ECC block output is 1, the preceding output is retained (not shifted). If, on the other hand, the MSB is 0, the product of the inputs is passed through unchanged (not shifted).
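The two ESC steps can be illustrated with a short behavioural sketch. The function name `esc` is hypothetical, and the operands are treated as plain 16-bit unsigned words with a logical right shift; how the actual circuit handles the sign of 2’s complement operands during the shift is not stated in the text, so that part is an assumption.

```python
MASK16 = 0xFFFF


def esc(ecc_out: int, product: int, previous: int):
    """Align two 16-bit operands using the 5-bit ECC output.

    Returns (aligned_product, aligned_previous).
    """
    flag = (ecc_out >> 4) & 1      # MSB of the ECC output selects the operand
    shift = ecc_out & 0xF          # remaining 4 bits give the shift amount
    product &= MASK16
    previous &= MASK16
    if flag == 1:
        # Product is the smaller number: shift it, keep the previous output.
        return product >> shift, previous
    # Previous output is the smaller number: shift it, pass the product through.
    return product, previous >> shift


if __name__ == "__main__":
    aligned = esc(0b10011, 0x8000, 0x1234)
    print([hex(v) for v in aligned])
```

After this alignment both operands refer to the same exponent, so a fixed-point adder can combine them.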

4 SFMAC Architecture

To represent positive and negative numbers, the architecture employs sign-magnitude and 2’s complement representations. Sign-magnitude form is used for the SFMAC inputs and outputs, but the inputs are converted to 2’s complement form for the internal calculations. The proposed MAC architecture’s final output (MAC output) has 17 bits, including one sign bit.

The SFMAC’s inputs are two 8-bit binary numbers formatted as shown in Fig. 6. Each SFMAC input is 13 bits long, with two bits set aside as the sign bits of the number and of the exponent. Depending on whether the value is positive or negative, the sign bit is 0 or 1. The remaining eleven bits hold an 8-bit binary significand and a 3-bit binary exponent. Note that the 3rd bit of the exponent in binary representation is set to 0 by default, since a 2-bit binary number takes 3 bits to be represented in 2’s complement form.

Figure 6: Input format representation of SFMAC

As a result, the exponent term in this architecture varies from ‘-4’ to ‘+3’. The input numbers range from $-(0.11111111)_2 \times 2^{+3}$ to $+(0.11111111)_2 \times 2^{+3}$, and hence the SFMAC architecture’s inputs range from $-(7.96875)_{10}$ to $+(7.96875)_{10}$. Furthermore, the SFMAC architecture’s inputs can only be entered as fractions. For instance, the numbers $(001)_2$ and $(010)_2$ should be entered as $(0.00100000)_2 \times 2^{+3}$ and $(0.01000000)_2 \times 2^{+3}$, respectively. Similarly, $(101)_2$ and $(10)_2$ should be represented as $(0.10100000)_2 \times 2^{+3}$ and $(0.10000000)_2 \times 2^{+2}$, respectively, to be processed by the SFMAC. Besides the Exponent Comparator Circuit (ECC) and Exponent Shifter Circuit (ESC) explained earlier, the main building blocks of the SFMAC architecture are the 8-bit multiplier, 16-bit register, 16-bit adder, 2:1/4:1 multiplexers of various sizes, and the Exponential Adder. The SFMAC’s overall architecture is depicted in Fig. 7.

Figure 7: SFMAC architecture using ECC & ESC blocks
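The fractional input format and its range can be checked numerically with a small decoder. This is a sketch under assumptions: the function name `sfmac_value` is hypothetical, the exponent is passed as a signed integer in the range -4 to +3 instead of as separate sign and magnitude bits, and a sign bit of 1 is taken to mean a negative number.

```python
def sfmac_value(sign: int, fraction_bits: int, exponent: int) -> float:
    """Decode (sign, 8-bit fraction, exponent) as +/- 0.f * 2**exponent."""
    if sign not in (0, 1):
        raise ValueError("sign must be 0 or 1")
    if not 0 <= fraction_bits <= 0xFF:
        raise ValueError("fraction must fit in 8 bits")
    if not -4 <= exponent <= 3:
        raise ValueError("exponent must lie in -4..+3")
    # The 8 fraction bits sit to the right of the binary point: value = bits / 2**8.
    magnitude = (fraction_bits / 256.0) * 2.0 ** exponent
    return -magnitude if sign else magnitude


if __name__ == "__main__":
    print(sfmac_value(0, 0b10100000, 3))   # (0.10100000)b * 2^3
    print(sfmac_value(0, 0b11111111, 3))   # largest positive input
    print(sfmac_value(1, 0b11111111, 3))   # most negative input
```

Under this reading, the largest magnitude is 255/256 × 8 = 7.96875.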

CMOS technologies are used to develop and implement the overall SFMAC architecture, and a thorough study is carried out using Cadence Virtuoso. To limit the power consumption, the architecture employs a “clock gating scheme” and a pipeline mechanism. The clock-pulse pipelining is achieved by triggering successive blocks after a predetermined period.

The SFMAC architecture is implemented in 90 nm and 130 nm CMOS technology (GPDK and TSMC, respectively). Tab. 1 compares the performance of the SFMAC architecture in the two CMOS technologies for a particular input vector. The Cadence Spectre tool is used to measure the power consumption of the implemented designs. The average power ($P_{Average}$) is calculated over a simulation time ($T_{sim}$) of 40 ns at a clock frequency ($f_{clk}$) of 83.33 MHz, while the static power is evaluated for a 2 V supply voltage ($V_{DD}$). Since the transistor sizing is larger in 130 nm technology, which affects the load capacitance $C_{load}$, the average power consumption in 130 nm (TSMC) is higher than in 90 nm (GPDK). In the same way, device geometry affects static power consumption, so a circuit with larger device dimensions can consume more static power. If $\alpha_T$ is the activity factor, the CMOS dynamic power is calculated as in Eq. (1):

$$P_{dynamic} = \alpha_T \, C_{load} \, V_{DD}^{2} \, f_{clk} \tag{1}$$
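To make the dependence on load capacitance, supply voltage, and clock frequency concrete, the sketch below evaluates the usual CMOS switching-power relation, activity factor × load capacitance × supply voltage squared × clock frequency. The function name `dynamic_power` is hypothetical, and the activity factor and capacitance values are purely illustrative assumptions, not measurements from this work; only the 2 V supply and the 83.33 MHz clock come from the text.

```python
def dynamic_power(alpha_t: float, c_load: float, v_dd: float, f_clk: float) -> float:
    """Switching power in watts: alpha_T * C_load * V_DD**2 * f_clk."""
    return alpha_t * c_load * v_dd ** 2 * f_clk


if __name__ == "__main__":
    alpha_t = 0.5        # hypothetical activity factor
    c_load = 1e-12       # hypothetical load capacitance in farads
    f_clk = 83.33e6      # clock frequency quoted in the text, in hertz

    p_2v = dynamic_power(alpha_t, c_load, 2.0, f_clk)
    p_1v = dynamic_power(alpha_t, c_load, 1.0, f_clk)
    print(p_2v, p_1v, p_2v / p_1v)   # halving V_DD cuts dynamic power by about 4x
```

Because the supply voltage enters quadratically, designs evaluated at different supply voltages are not directly comparable on power alone.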

Tab. 2 compares the proposed SFMAC architecture with existing MAC architectures in terms of power consumption. Most of the architectures available in the literature use an HDL-based approach, whereas the proposed architecture is implemented in Cadence Virtuoso in 90 nm and 130 nm technologies, so a direct comparison is difficult. Furthermore, almost none of the architectures described in the literature support signed operations or floating-point designs.

Table 1: Performance of SFMAC at 90 and 130 nm CMOS technologies (GPDK and TSMC respectively)

Although some architectures use clock signals just for data accumulation (in the register or accumulator), most of the architectures in the literature do not use any clocking signals. Asynchronous circuits do not have real-time applicability; as a result, the functional applicability of such architectures must be further investigated. The architecture shown in [32] is designed for signed floating-point operation, whereas most of the reported architectures discussed in Tab. 2 are dedicated to unsigned fixed-point Multiply-Accumulate operation.


Tab. 2 reveals that the architectures in [12,33,34] consume considerably higher static and average power (in mW) than the proposed SFMAC architecture. The architectures in [35,36] are examined for 16-bit operations at 1 V and 8-bit operations at 1.8 V in 90 nm and 180 nm technologies. The existing designs in [35,36] require less power than the proposed SFMAC, but their performance analysis is done with a supply voltage below 2 V (the SFMAC uses 2 V), and they can only execute MAC operations on unsigned fixed-point numbers. As a result, the MAC architectures in [35,36] have a restricted scope. Although the architecture defined in [37] is implemented in 180 nm technology with a 1.8 V supply voltage for 16-bit MAC operation, it consumes substantially more power than the SFMAC architecture. The architecture in [38] implements a 1-bit unsigned fixed-point MAC operation in 32 nm CMOS and CNTFET technology, so a comparison with an 8-bit SFMAC is not meaningful. Although the architecture described in [32] is the only existing MAC architecture capable of signed floating-point operation, a comparative study shows that the proposed SFMAC is considerably more power-efficient.

Table 2: Proposed SFMAC vs. already reported architectures


5 Conclusion

A novel approach for performing normalization is explained in this paper. The proposed normalization operation is divided into the Exponent Comparator Circuit (ECC) and the Exponent Shifter Circuit (ESC). The ECC block compares the exponents, while the ESC block shifts the smaller number by the difference between the exponents of the inputs. Further, a signed floating-point MAC architecture is proposed using the novel normalization architecture. For design and implementation, the Cadence Spectre tool is used with GPDK 90 nm and TSMC 130 nm CMOS technologies. The results show that the proposed SFMAC architecture consumes less power than its recent counterparts and is therefore applicable to low-power DSP architectures.

Funding Statement: This work was supported by the Research Support Fund (RSF) of Symbiosis International (Deemed University), Pune, India.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.