CMOS Logic

A summary, by Yusdi Saliman, of chapter 2 of the book ASICs.
The original text is available at http://www-ee.eng.hawaii.edu/~msmith/ASICs/HTML/Book/CH02/CH02.htm

A MOS transistor (or device) has four terminals: gate, source, drain, and bulk. A MOS transistor is a switch.

We turn a transistor on or off using the gate terminal. There are two kinds of MOS transistors: n-channel transistors (nMOS) and p-channel transistors (pMOS). An nMOS requires a logic '1' on the gate to make the switch conducting (to turn the transistor on). A pMOS requires a logic '0' on the gate to make the switch conducting; a logic '1' on the gate of a pMOS turns the transistor off.

2.1. CMOS Transistor

The region between source and drain in a MOS transistor is normally nonconducting. To make an nMOS conducting, we must apply a positive gate-to-source voltage V_GS that is greater than the nMOS threshold voltage, V_tn (a typical value is 0.5 V), and a positive voltage V_DS to the drain with respect to the source. We must also connect the bulk to the most negative potential, GND or VSS, to reverse bias the bulk-to-drain and bulk-to-source pn-diodes.

The drain-to-source current, I_DSn, is

I_DSn = k'_n (W/L) [ (V_GS – V_tn) – 0.5 V_DS ] V_DS

where

W = transistor width (the width of the gate, modeled as a parallel-plate capacitor)
L = transistor length (the length of the gate, modeled as a parallel-plate capacitor)
k'_n = process transconductance parameter (or intrinsic transconductance)
V_GS = gate-to-source voltage
V_tn = nMOS threshold voltage
V_DS = drain-to-source voltage

The process transconductance parameter k'_n is

k'_n = μ_n C_ox

where μ_n is the electron mobility and C_ox is the gate capacitance per unit area.

The transistor gain factor, β_n, is

β_n = k'_n (W/L)

The equation for I_DSn above describes the linear region (or triode region) of operation. It is valid until V_DS = V_GS – V_tn; beyond that point it would predict that I_DS decreases with increasing V_DS.

For V_DS > V_GS – V_tn (the saturation region or pentode region) the drain current I_DS remains approximately constant at the saturation current, I_DSn(sat), where

I_DSn(sat) = (β_n/2) (V_GS – V_tn)² ; V_GS > V_tn
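
As a rough numerical sketch of the linear and saturation equations above, the following Python function evaluates the drain current in both regions. The default values (k'_n = 200 μA/V², W/L = 10, V_tn = 0.65 V) and the function name are assumptions chosen for illustration only.

```python
def nmos_ids(v_gs, v_ds, k_n=200e-6, w_over_l=10.0, v_tn=0.65):
    """Long-channel nMOS drain current in amperes (illustrative defaults)."""
    beta_n = k_n * w_over_l            # transistor gain factor
    v_ov = v_gs - v_tn                 # how far the gate is above threshold
    if v_ov <= 0:
        return 0.0                     # below threshold: treated as off
    if v_ds < v_ov:
        # linear (triode) region
        return beta_n * (v_ov - 0.5 * v_ds) * v_ds
    # saturation region: current no longer depends on V_DS
    return 0.5 * beta_n * v_ov ** 2


for v_ds in (0.5, 1.0, 2.35, 3.0):
    print(f"V_GS = 3.0 V, V_DS = {v_ds:.2f} V -> I_DS = {nmos_ids(3.0, v_ds) * 1e3:.3f} mA")
```

The two branches give the same current at V_DS = V_GS – V_tn, so the model is continuous at the boundary between the regions.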

2.1.1. P-Channel Transistors

In a pMOS the threshold voltage V_tp is normally negative, and the pMOS equations follow from the nMOS equations by using negative signs in a consistent fashion:

I_DSp = –k'_p (W/L) [ (V_GS – V_tp) – 0.5 V_DS ] V_DS ; V_DS > V_GS – V_tp
I_DSp(sat) = –(β_p/2) (V_GS – V_tp)² ; V_DS < V_GS – V_tp

In these two equations V_tp is negative, and the terminal voltages V_DS and V_GS are also normally negative (note that –3 V < –2 V, for example). The current I_DSp is then negative, corresponding to conventional current flowing from source to drain of a pMOS.
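
The sign conventions are easy to get wrong, so here is a small Python sketch of the two pMOS equations. The values β_p = 1 mA/V² and V_tp = –1 V are assumptions picked to keep the arithmetic simple.

```python
def pmos_ids(v_gs, v_ds, beta_p=1e-3, v_tp=-1.0):
    """pMOS drain current in amperes; V_GS, V_DS and V_tp are normally negative."""
    if v_gs >= v_tp:
        return 0.0                     # gate not negative enough: treated as off
    if v_ds > v_gs - v_tp:
        # linear region: V_DS is less negative than V_GS - V_tp
        return -beta_p * ((v_gs - v_tp) - 0.5 * v_ds) * v_ds
    # saturation region
    return -0.5 * beta_p * (v_gs - v_tp) ** 2


print(pmos_ids(-3.0, -1.0))   # linear region, negative current
print(pmos_ids(-3.0, -3.0))   # saturation region, negative current
```

Both results are negative, which matches conventional current flowing from source to drain.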

2.1.2. Velocity Saturation

The electrons cannot move any faster than about v_maxn = 10⁵ m s^–1 when the electric field is above 10⁶ V m^–1 (reached when 1 V is dropped across 1 μm); the electrons become velocity saturated. In this case the time of flight across the channel is t_f = L_eff / v_maxn, the drain-source saturation current is independent of the transistor length, and I_DSn(sat) becomes

I_DSn(sat) = W v_maxn C_ox (V_GS – V_tn) ; V_DS > V_DS(sat) (velocity saturated).
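
A minimal Python sketch of the velocity-saturated current follows. The transistor width (6 μm), the oxide thickness (10 nm), and the permittivity of silicon dioxide (3.9 times that of free space) are assumptions used only to get a concrete number.

```python
EPS_OX = 3.9 * 8.854e-12     # F/m, permittivity of SiO2 (assumed)


def ids_velocity_saturated(w, t_ox, v_gs, v_tn, v_max=1e5):
    """Velocity-saturated nMOS drain current in amperes (SI units throughout)."""
    c_ox = EPS_OX / t_ox                     # gate capacitance per unit area, F/m^2
    return w * v_max * c_ox * (v_gs - v_tn)  # note: no dependence on length L


i_sat = ids_velocity_saturated(w=6e-6, t_ox=10e-9, v_gs=3.0, v_tn=0.65)
print(f"I_DSn(sat) = {i_sat * 1e3:.2f} mA")
```

The current scales linearly with W and with the gate overdrive, rather than with the square of the overdrive as in the long-channel saturation equation.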

2.1.3. SPICE Models

The simulation program SPICE (which stands for Simulation Program with Integrated Circuit Emphasis) is often used to characterize logic cells. Below is a typical set of model parameters for a generic 0.5 μm process (G5):

.MODEL CMOSN NMOS LEVEL=3 PHI=0.7 TOX=10E-09 XJ=0.2U TPG=1 VTO=0.65 DELTA=0.7
+ LD=5E-08 KP=2E-04 UO=550 THETA=0.27 RSH=2 GAMMA=0.6 NSUB=1.4E+17 NFS=6E+11
+ VMAX=2E+05 ETA=3.7E-02 KAPPA=2.9E-02 CGDO=3.0E-10 CGSO=3.0E-10 CGBO=4.0E-10
+ CJ=5.6E-04 MJ=0.56 CJSW=5E-11 MJSW=0.52 PB=1
.MODEL CMOSP PMOS LEVEL=3 PHI=0.7 TOX=10E-09 XJ=0.2U TPG=-1 VTO=-0.92 DELTA=0.29
+ LD=3.5E-08 KP=4.9E-05 UO=135 THETA=0.18 RSH=2 GAMMA=0.47 NSUB=8.5E+16 NFS=6.5E+11
+ VMAX=2.5E+05 ETA=2.45E-02 KAPPA=7.96 CGDO=2.4E-10 CGSO=2.4E-10 CGBO=3.8E-10
+ CJ=9.3E-04 MJ=0.47 CJSW=2.9E-10 MJSW=0.505 PB=1
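
As a sanity check, the KP values in the models above can be compared with k' = μ C_ox computed from UO and TOX. This sketch assumes the usual SPICE units (UO in cm²/V·s, TOX in m, KP in A/V²) and a relative permittivity of 3.9 for the gate oxide; the function name is made up for this example.

```python
EPS_0 = 8.854e-12            # F/m, permittivity of free space
EPS_OX = 3.9 * EPS_0         # F/m, SiO2 (assumed relative permittivity of 3.9)


def process_transconductance(uo_cm2_per_vs, tox_m):
    """Return k' = mu * C_ox in A/V^2 from SPICE UO and TOX."""
    mu = uo_cm2_per_vs * 1e-4        # cm^2/(V s) -> m^2/(V s)
    c_ox = EPS_OX / tox_m            # F/m^2
    return mu * c_ox


kp_n = process_transconductance(550, 10e-9)
kp_p = process_transconductance(135, 10e-9)
print(f"nMOS: computed k' = {kp_n:.3g} A/V^2, model KP = 2e-04")
print(f"pMOS: computed k' = {kp_p:.3g} A/V^2, model KP = 4.9e-05")
```

The computed values land within a few percent of the KP entries, which suggests the model parameters are mutually consistent.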

2.1.4. Logic Levels

The two types of MOS transistor have the following logic-level characteristics:

An nMOS provides a strong '0', but a weak '1'.
A pMOS transistor provides a strong '1', but a weak '0'.

Sometimes we refer to the weak versions of '0' and '1' as degraded logic levels. In CMOS technology we can use both types of transistor together to produce strong '0' logic levels as well as strong '1' logic levels.

2.2. The CMOS Process

Steps in an IC fabrication process: grow crystalline silicon (1); make a wafer (2–3); grow a silicon dioxide (oxide) layer in a furnace (4); apply liquid photoresist (resist) (5); mask exposure (6); a cross-section through a wafer showing the developed resist (7); etch the oxide layer (8); ion implantation (9–10); strip the resist (11); strip the oxide (12). Steps similar to 4–12 are repeated for each layer (typically 12–20 times for a CMOS process).

The CMOS process layers are shown in the table below:

2.3. CMOS Design Rules

The figure below defines the design rules for a CMOS process using pictures. Arrows between objects denote a minimum spacing, and arrows showing the size of an object denote a minimum width. Each of the rule numbers may have different values for different manufacturers—there are no standards for design rules.

2.4. Combinational Logic Cells

The AND-OR-INVERT (AOI) and the OR-AND-INVERT (OAI) logic cells are particularly efficient in CMOS.

(Figure: an example of AOI and OAI cells, and the naming convention.)

We can express the function of the AOI221 cell in (a) as Z = (A · B + C · D + E)' or Z = AOI221(A, B, C, D, E).

2.4.1. Pushing Bubbles

Here are the steps to construct any single-stage combinational CMOS logic cell:

Draw a schematic icon with an inversion (bubble) on the last cell (the bubble-out schematic). Use de Morgan's theorems to push the output bubble back to the inputs (the bubble-in schematic).
Form the n-channel stack working from the inputs on the bubble-out schematic: OR translates to a parallel connection, AND translates to a series connection.
Form the p-channel stack using the bubble-in schematic (ignore the inversions at the inputs; the bubbles on the gate terminals of the p-channel transistors take care of these).
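
The steps above can be checked on the AOI221 function Z = (A · B + C · D + E)'. The Python sketch below models each stack as a Boolean conduction function (series = AND, parallel = OR) and confirms that exactly one stack conducts for every input combination. The function names are illustrative and not taken from the book.

```python
from itertools import product


def n_stack_conducts(a, b, c, d, e):
    """Pull-down path to GND: (A and B in series) || (C and D in series) || E."""
    return bool((a and b) or (c and d) or e)


def p_stack_conducts(a, b, c, d, e):
    """Pull-up path to VDD (the dual network); each pMOS conducts on a '0' gate."""
    return bool(((not a) or (not b)) and ((not c) or (not d)) and (not e))


def aoi221(a, b, c, d, e):
    """Output of the cell: '1' when the p-stack conducts, '0' when the n-stack does."""
    up = p_stack_conducts(a, b, c, d, e)
    down = n_stack_conducts(a, b, c, d, e)
    assert up != down, "stacks must be complementary (no short, no floating output)"
    return 1 if up else 0


for inputs in product((0, 1), repeat=5):
    a, b, c, d, e = inputs
    assert aoi221(*inputs) == 1 - ((a & b) | (c & d) | e)
print("AOI221 stacks are complementary for all 32 input combinations")
```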

2.4.2. Drive Strength

We can size a logic cell using these basic rules:

Any string of transistors connected between a power supply and the output in a cell with 1X drive should have the same resistance as the n-channel transistor in a 1X inverter.
A transistor with shape factor W1/L1 has a resistance proportional to L1/W1 (so the larger W1 is, the smaller the resistance).
Two transistors in parallel with shape factors W1/L1 and W2/L2 are equivalent to a single transistor (W1/L1 + W2/L2)/1. For example, a 2/1 in parallel with a 3/1 is a 5/1.
Two transistors, with shape factors W1/L1 and W2/L2, in series are equivalent to a single 1/(L1/W1 + L2/W2) transistor. For example, two 2/1 transistors in series are equivalent to a 1/1.
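
The parallel and series rules above amount to adding conductances and adding resistances. A minimal Python sketch, treating each transistor as a single number W/L (the helper names are hypothetical):

```python
def parallel(*shapes):
    """Equivalent shape factor W/L of transistors in parallel: shape factors add."""
    return sum(shapes)


def series(*shapes):
    """Equivalent shape factor W/L of transistors in series: L/W values add."""
    return 1.0 / sum(1.0 / s for s in shapes)


print(parallel(2.0, 3.0))   # 2/1 in parallel with 3/1
print(series(2.0, 2.0))     # two 2/1 transistors in series
```

For instance, if the n-channel transistor in a 1X inverter is assumed to be 1/1, a stack of two series n-channel transistors needs 2/1 devices to present the same resistance.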

2.4.3. Transmission Gates

A CMOS transmission gate (TG, TX gate, pass gate, coupler) is implemented by connecting a pMOS (to transmit a strong '1') in parallel with an nMOS (to transmit a strong '0'). We can express the function of a TG as Z = TG(A, S), where A is the input and S is the select signal.

We can use two TGs to form a multiplexer (MUX). The MUX function for two data inputs, A and B, with a select signal S, is Z = TG(A, S') + TG(B, S).

2.4.4. Exclusive-OR Cell

The two-input exclusive-OR (XOR, EXOR, not-equivalence, ring-OR) function is A1 ⊕ A2 = XOR(A1, A2) = A1 · A2' + A1' · A2.

We can implement a two-input XOR using a MUX and an inverter as follows (2 gates):

XOR(A1,A2) = MUX[NOT(A1),A1,A2]

where

MUX(A, B, S) = A · S + B · S'

We can use inverter buffers (3.5 gates total) or an inverting MUX so that the XOR cell does not have any external connections to source/drain diffusions as follows (3 gates total):

XOR(A1,A2) = NOT[MUX(NOT[NOT(A1)],NOT(A1),A2)]

We can also implement a two-input XOR using an AOI21 (and a NOR cell), since

XOR(A1,A2) = AOI21[A1, A2, NOR(A1, A2)]

Similarly we can implement an exclusive-NOR (XNOR, equivalence) logic cell using an inverting MUX (and two inverters, total 3.5 gates) or an OAI21 logic cell (and a NAND cell, total 2.5 gates) as follows:

XNOR(A1, A2) = OAI21[A1,A2,NAND(A1,A2)]
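
The XOR and XNOR identities in this section can be confirmed with a short truth-table check. The Python sketch below assumes the conventions AOI21(A, B, C) = (A · B + C)' and OAI21(A, B, C) = ((A + B) · C)', together with the MUX definition given above.

```python
from itertools import product


def NOT(a): return 1 - a
def NOR(a, b): return NOT(a | b)
def NAND(a, b): return NOT(a & b)
def MUX(a, b, s): return (a & s) | (b & NOT(s))    # MUX(A, B, S) = A.S + B.S'
def AOI21(a, b, c): return NOT((a & b) | c)        # assumed pin order
def OAI21(a, b, c): return NOT((a | b) & c)        # assumed pin order


def check_identities():
    """Return True if every identity holds for all four input combinations."""
    for a1, a2 in product((0, 1), repeat=2):
        xor = a1 ^ a2
        assert MUX(NOT(a1), a1, a2) == xor
        assert NOT(MUX(NOT(NOT(a1)), NOT(a1), a2)) == xor
        assert AOI21(a1, a2, NOR(a1, a2)) == xor
        assert OAI21(a1, a2, NAND(a1, a2)) == NOT(xor)
    return True


print(check_identities())
```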

2.5. Sequential Logic Cells

2.5.1. Latch

The first sequential logic cell is the latch. A latch has the following characteristics:

A positive-enable latch can be built using transmission gates without output buffering; the enable (clock) signal is buffered inside the latch.
A positive-enable latch is transparent while the enable is high.
The latch stores the last value at D when the enable goes low.
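
The behavior described above can be captured in a few lines. This is a purely behavioral Python sketch of a positive-enable latch (it ignores the transmission gates and all timing); the class name is illustrative.

```python
class PositiveEnableLatch:
    """Behavioral model: transparent while enable is high, holds while low."""

    def __init__(self, q=0):
        self.q = q                 # stored value

    def update(self, d, en):
        if en:
            self.q = d             # transparent: output follows D
        return self.q              # enable low: last value is held


latch = PositiveEnableLatch()
print(latch.update(d=1, en=1))     # transparent, output follows D
print(latch.update(d=0, en=0))     # enable low, previous value held
```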

2.5.2. Flip-flop

A flip-flop can be constructed from two D latches: a master latch (the first one) and a slave latch. This flip-flop contains a total of nine inverters and four TGs, or 6.5 gates. In this flip-flop design the storage node S is buffered and the clock-to-Q delay will be one inverter delay less than the clock-to-QN delay.
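
A behavioral Python sketch of the master-slave arrangement is shown below. It assumes a positive-edge-triggered polarity (master transparent while the clock is low, slave transparent while it is high), which the summary does not state, and it assumes one gate equivalent is four transistors when reproducing the 6.5-gate figure.

```python
class MasterSlaveFlipFlop:
    """Behavioral model of a D flip-flop built from two latches (assumed polarity)."""

    def __init__(self):
        self.master = 0
        self.slave = 0

    def update(self, d, clk):
        if clk == 0:
            self.master = d              # master transparent, slave holds
        else:
            self.slave = self.master     # master holds, slave transparent
        return self.slave, 1 - self.slave    # (Q, QN)


def gate_equivalents(inverters, tgs, transistors_per_gate=4):
    """Each inverter and each TG uses two transistors."""
    return (2 * inverters + 2 * tgs) / transistors_per_gate


ff = MasterSlaveFlipFlop()
ff.update(d=1, clk=0)                    # D is sampled by the master
print(ff.update(d=1, clk=1))             # rising edge: Q takes the sampled value
print(gate_equivalents(inverters=9, tgs=4))
```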

2.5.3. Clocked Inverter

We can derive the structure of a clocked inverter from the series combination of an inverter and a TG. If we wish to build a flip-flop with a fast clock-to-QN delay it may be better to build it using clocked inverters and use inverters with TGs for a flip-flop with a fast clock-to-Q delay. In fact, since we do not always use both Q and QN outputs of a flip-flop, some libraries include Q-only or QN-only flip-flops that are slightly smaller than those with both polarity outputs. It is slightly easier to lay out clocked inverters than an inverter plus a TG, so flip-flops in commercial libraries include a mixture of clocked-inverter and TG implementations.

2.6. Datapath Logic Cells

The layout of buswide logic that operates on data signals in this fashion is called a datapath. The module ADD is a datapath cell or datapath element. Just as we do for standard cells we make all the datapath cells in a library the same height so we can abut other datapath cells on either side of the adder to create a more complex datapath. When people talk about a datapath they always assume that it is oriented so that increasing the size in bits makes the datapath grow in height, upwards in the vertical direction, and adding different datapath elements to increase the function makes the datapath grow in width, in the horizontal direction. We can, however, rotate and position a completed datapath in any direction we want on a chip.

Datapath layout automatically takes care of most of the interconnect between the cells with the following advantages:

Regular layout produces predictable and equal delay for each bit.
Interconnect between cells can be built into each cell.

There are some disadvantages of using a datapath:

The overhead (buffering and routing the control signals, for example) can make a narrow (small number of bits) datapath larger and slower than a standard-cell (or even gate-array) implementation.
Datapath cells have to be predesigned (otherwise we are using full-custom design) for use in a wide range of datapath sizes. Datapath cell design can be harder than designing gate-array macros or standard cells.
Software to assemble a datapath is more complex and not as widely used as software for assembling standard cells or gate arrays.

2.6.1. Datapath Elements

For a bus, A[31:0] denotes a 32-bit bus with A[31] as the leftmost or most-significant bit or MSB, and A[0] as the least-significant bit or LSB. Sometimes we shall use A[MSB] or A[LSB] to refer to these bits. Notice that if we have an n-bit bus and LSB = 0, then MSB = n – 1. Also, for example, A[4] is the fifth bit on the bus (from the LSB). We use a 'Σ' or 'ADD' inside the symbol to denote an adder instead of '+', so we can attach '–' or '+/–' to the inputs for a subtracter or adder/subtracter.

2.6.2. Adders

Some types of adders:

Ripple-carry adder
Carry-save adder
Carry-propagate adder
Carry-bypass adder
Carry-skip adder
Carry-lookahead adder
Carry-select adder
Conditional-sum adder
Serial adder
Parallel adder
Carry-completion adder

2.6.3. A Simple Example

The following Verilog code describes an 8-bit conditional-sum adder:

module m8bitCSum (C0, a, b, s, C8); // Verilog conditional-sum adder for an FPGA
input C0; input [7:0] a, b; output [7:0] s; output C8; // C0 is a 1-bit carry-in
wire A7,A6,A5,A4,A3,A2,A1,A0,B7,B6,B5,B4,B3,B2,B1,B0,S7,S6,S5,S4,S3,S2,S1,S0;
// X_j_k: signal X computed from bit j upward, assuming a carry of k into bit j
wire C2, C4_2_0, C4_2_1, S5_4_0, S5_4_1, C6, C6_4_0, C6_4_1;
assign {A7,A6,A5,A4,A3,A2,A1,A0} = a; assign {B7,B6,B5,B4,B3,B2,B1,B0} = b;
assign s = { S7,S6,S5,S4,S3,S2,S1,S0 };
// start of level 1: & = AND, ^ = XOR, | = OR, ! = NOT
assign S0 = A0^B0^C0 ;
assign S1 = A1^B1^(A0&B0|(A0|B0)&C0) ;
assign C2 = A1&B1|(A1|B1)&(A0&B0|(A0|B0)&C0) ;
assign C4_2_0 = A3&B3|(A3|B3)&(A2&B2) ; assign C4_2_1 = A3&B3|(A3|B3)&(A2|B2) ;
assign S5_4_0 = A5^B5^(A4&B4) ; assign S5_4_1 = A5^B5^(A4|B4) ;
assign C6_4_0 = A5&B5|(A5|B5)&(A4&B4) ; assign C6_4_1 = A5&B5|(A5|B5)&(A4|B4) ;
// start of level 2: select between the conditional results using C2
assign S2 = A2^B2^C2 ;
assign S3 = A3^B3^(A2&B2|(A2|B2)&C2) ;
assign S4 = A4^B4^(C4_2_0|C4_2_1&C2) ;
assign S5 = S5_4_0& !(C4_2_0|C4_2_1&C2)|S5_4_1&(C4_2_0|C4_2_1&C2) ;
assign C6 = C6_4_0|C6_4_1&(C4_2_0|C4_2_1&C2) ;
// start of level 3
assign S6 = A6^B6^C6 ;
assign S7 = A7^B7^(A6&B6|(A6|B6)&C6) ;
assign C8 = A7&B7|(A7|B7)&(A6&B6|(A6|B6)&C6) ;
endmodule
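
One way to gain confidence in the equations of the Verilog module above is to transcribe them into a software reference model and compare against ordinary integer addition for every input. The Python sketch below does this; it is a check of the logic equations only, not a substitute for simulating the Verilog, and the function names are made up for this example.

```python
def csum8(a, b, c0):
    """Conditional-sum adder equations for 8-bit a, b and 1-bit c0; returns (s, c8)."""
    A = [(a >> i) & 1 for i in range(8)]
    B = [(b >> i) & 1 for i in range(8)]
    S = [0] * 8
    # level 1: low bits directly, higher groups for both possible carry-ins
    c1 = A[0] & B[0] | (A[0] | B[0]) & c0
    S[0] = A[0] ^ B[0] ^ c0
    S[1] = A[1] ^ B[1] ^ c1
    c2 = A[1] & B[1] | (A[1] | B[1]) & c1
    c4_2_0 = A[3] & B[3] | (A[3] | B[3]) & (A[2] & B[2])
    c4_2_1 = A[3] & B[3] | (A[3] | B[3]) & (A[2] | B[2])
    s5_4_0 = A[5] ^ B[5] ^ (A[4] & B[4])
    s5_4_1 = A[5] ^ B[5] ^ (A[4] | B[4])
    c6_4_0 = A[5] & B[5] | (A[5] | B[5]) & (A[4] & B[4])
    c6_4_1 = A[5] & B[5] | (A[5] | B[5]) & (A[4] | B[4])
    # level 2: the real carry c2 selects between the conditional results
    c4 = c4_2_0 | c4_2_1 & c2
    S[2] = A[2] ^ B[2] ^ c2
    S[3] = A[3] ^ B[3] ^ (A[2] & B[2] | (A[2] | B[2]) & c2)
    S[4] = A[4] ^ B[4] ^ c4
    S[5] = s5_4_0 & (1 - c4) | s5_4_1 & c4
    c6 = c6_4_0 | c6_4_1 & c4
    # level 3
    c7 = A[6] & B[6] | (A[6] | B[6]) & c6
    S[6] = A[6] ^ B[6] ^ c6
    S[7] = A[7] ^ B[7] ^ c7
    c8 = A[7] & B[7] | (A[7] | B[7]) & c7
    return sum(bit << i for i, bit in enumerate(S)), c8


def verify_csum8():
    """Exhaustively compare csum8 with integer addition (2^17 cases)."""
    for a in range(256):
        for b in range(256):
            for c0 in (0, 1):
                total = a + b + c0
                if csum8(a, b, c0) != (total & 0xFF, total >> 8):
                    return False
    return True
```

Calling verify_csum8() walks all 131,072 input combinations and returns True only if every sum and carry-out matches.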

2.6.4. Multipliers

There are several issues in deciding between parallel multiplier architectures:

Since it is easier to fold triangles rather than trapezoids into squares, a Wallace-tree multiplier is more suited to full-custom layout, but is slightly larger, than a Dadda multiplier—both are less regular than an array multiplier. For cell-based ASICs, a Dadda multiplier is smaller than a Wallace-tree multiplier.
The overall multiplier speed does depend on the size and architecture of the final carry-propagate adder (CPA), but this may be optimized independently of the carry-save adder (CSA) array. This means a Dadda multiplier is always at least as fast as the Wallace-tree version.
The low-order bits of any parallel multiplier settle first and can be added in the CPA before the remaining bits settle. This allows multiplication and the final addition to be overlapped in time.
Any of the parallel multiplier architectures may be pipelined. We may also use a variably pipelined approach that tailors the register locations to the size of the multiplier.
Using (4, 2), (5, 3), (7, 3), or (15, 4) counters increases the stage compression and permits the size of the stages to be tuned. Some ASIC cell libraries contain a (7, 3) counter, a 2-bit full adder. A (15, 4) counter is a 3-bit full adder. There is a trade-off in using these counters between the speed and size of the logic cells and the delay as well as area of the interconnect.
Power dissipation is reduced by the tree-based structures. The simplified carry-save logic produces fewer signal transitions and the tree structures produce fewer glitches than a chain.
None of the multiplier structures we have discussed take into account the possibility of staggered arrival times for different bits of the multiplicand or the multiplier. Optimization then requires a logic-synthesis tool.
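
To make the counter terminology above concrete: a (7, 3) counter takes seven inputs of equal weight and outputs the number of ones as a 3-bit binary value, and an ordinary full adder is a (3, 2) counter. The Python sketch below is a behavioral illustration only; it covers the true counters (3 inputs, 7 inputs, 15 inputs) and the function name is hypothetical.

```python
def counter(bits):
    """Count the ones among the inputs and return the count MSB-first.

    With 3, 7, or 15 inputs this behaves as a (3, 2), (7, 3), or (15, 4) counter.
    """
    count = sum(bits)
    width = len(bits).bit_length()       # 3 -> 2, 7 -> 3, 15 -> 4 output bits
    return [(count >> i) & 1 for i in reversed(range(width))]


print(counter([1, 1, 0]))                 # full adder: [carry, sum]
print(counter([1, 1, 1, 0, 1, 0, 1]))     # five ones among seven inputs
```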

2.6.5. Other Arithmetic Systems

There are other schemes for addition and multiplication that are useful in special circumstances. Addition of numbers using redundant binary encoding avoids carry propagation and is thus potentially very fast. Redundant binary addition of binary, redundant binary, or canonical signed-digit (CSD) vectors does not result in a unique sum, and addition of two CSD vectors does not result in a CSD vector. Each n-bit redundant binary number requires a rather wasteful 2n-bit binary number for storage. Thus 101̄ (where 1̄ denotes –1) is represented as 010010, for example (using sign magnitude). The other disadvantage of redundant binary arithmetic is the need to convert to and from binary representation.

2.6.6. Other Datapath Operators

A subtracter is similar to an adder, except in a full subtracter we have a borrow-in signal, BIN; a borrow-out signal, BOUT; and a difference signal, DIFF. We can build a ripple-borrow subtracter (a type of borrow-propagate subtracter), a borrow-save subtracter, and a borrow-select subtracter in the same way we built these adder architectures.

A barrel shifter rotates or shifts an input bus by a specified amount. A leading-one detector is used with a normalizing (left-shift) barrel shifter to align mantissas in floating-point numbers. The output of a priority encoder is the binary-encoded position of the leading one in an input. An accumulator is an adder/subtracter and a register. A decrementer subtracts 1 from the input bus. A register file (or scratchpad memory) is a bank of flip-flops arranged across the bus.
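
The barrel shifter and leading-one detector described above can be sketched behaviorally in Python. The staged rotate mirrors the usual logarithmic structure (one stage per bit of the shift amount); the 8-bit width and all function names are assumptions for illustration.

```python
def barrel_rotate_left(value, amount, width=8):
    """Rotate left by 'amount' (0 <= amount < width) using power-of-two stages."""
    mask = (1 << width) - 1
    stage = 1
    while stage < width:
        if amount & stage:       # this stage is enabled by one bit of 'amount'
            value = ((value << stage) | (value >> (width - stage))) & mask
        stage <<= 1
    return value


def leading_one_position(value):
    """Priority encoder: bit index of the leading one, or -1 if value is zero."""
    return value.bit_length() - 1


def normalize(value, width=8):
    """Left-shift so the leading one reaches the MSB; returns (shifted, shift)."""
    if value == 0:
        return 0, 0
    shift = width - 1 - leading_one_position(value)
    return (value << shift) & ((1 << width) - 1), shift


print(bin(barrel_rotate_left(0b10010001, 3)))
print(normalize(0b00010110))
```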

2.7. I/O Cells

In a three-state bidirectional output buffer, when the output enable (OE) signal is high, the circuit functions as a noninverting buffer driving the value of DATAin onto the I/O pad. When OE is low, the output transistors or drivers, M1 and M2, are disconnected. This allows multiple drivers to be connected on a bus. It is up to the designer to make sure that a bus never has two drivers enabled at the same time, a problem known as contention.
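
A small Python sketch of how a shared bus resolves with three-state drivers, and of what contention means, is given below. Representing the high-impedance state as the string 'Z' and raising an exception on contention are modeling choices made for this example.

```python
def resolve_bus(drivers):
    """Resolve a bus from (oe, data) pairs; returns 0, 1, or 'Z' (undriven).

    Raises ValueError if more than one driver is enabled (contention).
    """
    active = [data for oe, data in drivers if oe]
    if len(active) > 1:
        raise ValueError("bus contention: more than one driver enabled")
    return active[0] if active else "Z"


print(resolve_bus([(0, 1), (1, 0), (0, 1)]))   # only the second driver is enabled
print(resolve_bus([(0, 1), (0, 0)]))           # no driver enabled: bus floats
```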

The three-state buffer allows us to employ the same pad for input and output: bidirectional I/O. When we want to use the pad as an input, we set OE low and take the data from DATAin. We can also use many of these output cell features for input cells that have to drive large on-chip loads (a clock pad cell, for example). Some gate arrays simply turn an output buffer around to drive a grid of interconnect that supplies a clock signal internally.

2.8. Cell Compilers

The process of hand crafting circuits and layout for a full-custom IC is a tedious, time-consuming, and error-prone task. There are two types of automated layout assembly tools, often known as silicon compilers. The first type produces a specific kind of circuit, a RAM compiler or multiplier compiler, for example. The second type of compiler is more flexible, usually providing a programming language that assembles or tiles layout from an input command file, but this is full-custom IC design. In addition to producing layout we also need a model compiler so that we can verify the circuit at the behavioral level, and we need a netlist from a netlist compiler so that we can simulate the circuit and verify that it works correctly at the structural level.
