FPGA design practices and optimization

FPGA design practices and
Gyula Istvan Nagy
Avoid clock-gating for FPGA design
› Clock-gating results in lower performance for FPGA
› Use clock-enable
Page 2
Avoid data for clock
› Using a derived data signal for clock results in low
› Data propagation delay and skew decreases maximal clock
› Use derived signal as clock enable
Page 3
Avoid latch usage (1)
› Latches read input data until
Gate de-asserts
› Fed-back networks must have
higher inherent data
propagation delay than Gate
› Constrains minimal data
propagation delay
› Hard to comply for each route
without excessive data
› FPGA internal structure is
optimized for fastest data
Page 4
Avoid latch usage (2)
› Master-slave latch cancels minimal data propagation delay
› Also reduces maximal allowed data propagation delay that
can lead to unnecesary tight timing requirements
Page 5
Avoid latch usage (3)
› D-type Flip-Flop consists of three RS bistables
Page 6
Avoid latch usage (4)
› Design with Flip-Flops to make timing closure possible
Page 7
Output driving (1)
› Tri-state drivers are only in the output buffers for external
› No internal tri-state signal definition possible
› For synthesized combinational functions control tri-state
drivers directly with Flip-Flop output to avoid hazard
Page 8
Output driving (2)
› Use only Flip-Flop sourced output ports in top-level and
subsequent blocks
› Use I/O block Flip-Flops for better timing match
› Comply maximal SSO count
› Use differential driver nodes for most accurate phase
relationship of a signal pair if required
› Implement parallel bus clock and data with DDR output
drivers for best timing match
› Sort pins by pad-to-pin propagation of used package and
silicon for timing-sensitive peripherals
› Take internal PLL phase swing into count when transmitting
data to synchronous external device
Page 9
Design goals (1)
› Design synchronous network with registered outputs and
registered inputs if possible
› Balance combinational delay between Flip-Flops
› Design low fanout networks
› Keep the design on small area to be able to operate with
low-skew clock network
› Use multiple Flip-Flops to reduce fan-outs if necessary
› Apply pipelining if necessary
› Consider different realizations
› Use only Flip-Flop sourced output ports in Top-Level and
subsequent blocks
Page 10
Design goals (2)
› Use as low resources as necessary for the required
› Implement common reset signal in a clock-domain to avoid
combinational logic on reset signals
Page 11
Consider different realizations (1)
› Example on a single-phase FIR filter
› Direct structure has a lot of adders at output that creates
large combinational delay although it can be pipelined
Page 12
Consider different realizations (2)
› Transposed structure is optimal for one sample per clock
Page 13
Consider different realizations (3)
› MAC structure is optimal for lower data rates because of
less resource consuption
Page 14
Pipelining (1)
› Pipelining is inserting registers into a combinational
network in order to distribute the combinational delay
between many registers and allow higher clock frequency
operation as long as side effects make it possible
› Raises the design Flip-Flop consumption
› Data throughput is increased by the achieved higher clock
› Input-to-output group delay is increased by the number of
inserted FF stages
› Overall propagation time is not necessary higher because
of the higher clock frequency
› Pipelining of fed-back systems requires functional redesign
Page 15
Pipelining (2)
› Example pipelining on an integrator
› Integrator is implemented as a fed-back adder
› Full adders are fed-back bitwise, carry bits are connected
to adjacent adders
› Carry propagation is only fed-forward and can be pipelined
› Network is divided into many smaller networks during
pipelining that are separated with registers in the carrychain
› Input and output data flow requires group-delay balancing
Page 16
Pipelining (3)
› 100-bit adder direct structure
› Can run at 100 MHz in a Spartan 3 -4 speed grade FPGA
› Can run at 243 MHz in a Virtex 5 -3 speed grade FPGA
Page 17
Pipelining (4)
› 1-stage pipelined structure
› Can run at 128 MHz in a Spartan 3 -4 speed grade FPGA
› Can run at 400 MHz in a Virtex 5 -3 speed grade FPGA
Page 18
Pipelining (5)
› 2-stage pipeline structure has two registers in the carry
chain and two stages of group-delay balancing
› Can run at 147 MHz in a Spartan 3 -4 speed grade FPGA
› Can run at 344 MHz in a Virtex 5 -3 speed grade FPGA
› As a side effect design area is increasing during pipelining
› Performance is reduced for Virtex-5 because of higher
clock-skew is comparable to data delay
› Spartan-3 performance is still increased because data
propagation dominates over the increased clock skew
Page 19
Pipelining (6)
Page 20
Fan-out reduction (1)
› Flip-flop multiplication in a parallel manner shares fan-outs
of the single Flip-Flop
› Raises the design Flip-Flop consumption
› Decreased data propagation allows higher clock frequency
resulting in higher throughput
› Can be automated in the design flow
Page 21
Fan-out reduction (2)
Page 22
Signal processing (1)
› Signal truncation reduces signal-to-noise ratio in the overall
› FIR system coefficient truncation reduces filter rejection as
quantization noise spectrum is added to the filter
› IIR system coefficient truncation re-places pole-zero map
and can lead settling to an attractor
› Always separate parts with pole or zero at z=1 and
implement standalone with direct feedback or feedforward
› Take quantization noise accumulation into count
during partial sum calculations
Page 23
Signal processing (2)
› Use spare bits for quantization noise storage during
› Round to half-LSB for minimal quantization noise mean value
› Also consider time-domain amplitudes for implementation
Page 24
General development process
Development time
› Verification phase is at about
three-times longer than source
code development
› By default verification phase
determines the critical path to
project goals
› Process phase series prioritically
affect product quality
Specification recognition
Architecture design and
block definition
Prioritically affecting quality
› Most of development time is
consumed by source code
design and verification
25 %
RTL source code design
simulation and verification
90 %
75 %
Post-synthesis verification
Hardware verification
Page 25
Reusable design (1)
› High-quality system architecture creates clear data
flow definition and block-level functional separation
› Required for module reusability and effective system
Page 26
Reusable design (2)
› Always consider the simplest and easiest developable
solution for the task
› Define blocks for separate able parts of the overall task
such a way that leads to the fewest, most obvious and
reusable block-level port signals which individually reflect
the block functionality
› Separate only different tasks into different blocks
› Avoid the operation-related communication between
blocks, keep the communication on task-related level,
avoid the communication between block FSMs
› Plan the complete functionality of each block before
starting the HDL coding in order to avoid further patching
Page 27
Reusable design (3)
› Consider the most difficult conditions first and plan the
blocks’ operation be able to handle all of them
› Write tidy HDL code with signal names reflect the signal
behavior rather than the operation it’s produced from, the
operation can be in comments
› Patching can lead to side effects affecting the system-level
functionality. Be aware of the causal connections that
changes during patching can result in
› Keep the simplest hierarchy and most clear design during
Page 28
Naming convention (1)
› Helps indentify signal functionality and physical
› Faster development possible
› Improves reusability
Page 29
Naming convention (2)
› Generic parameter:
„g_” initial
› Input port:
› Output port:
› Bidirectional port:
„i_” inital
„o_” initial
„io_” initial
„const_” initial
„t_” initial
„q_” initial
„c_” initial
„w_” initial
„_n” ending
„proc_” initial
„func_” initial
„gen_” initial
„inst_” initial
Combinational logic:
Negated signal:
Generate construct:
Page 30