Who hasn't wanted to squeeze a few extra MHz of performance out of their FPGA? Here's how I do it. I'm going to explain what it takes to produce a design that meets its timing constraints. The contents of these pages are, obviously, my opinion. Please feel free to give me feedback.
Let's look at some FPGA design guidelines - what you should and shouldn't do. Some of these are deliberately general in nature, but to get FPGA performance you need to look at each and every aspect of your design.
What to do and what not to do.
Do ...
Properly specify your FPGA design - make sure you know what you, and more importantly your colleagues and/or customer, want.
Use as small a number of clocks as possible.
Synchronize FPGA resets to the appropriate clocks.
Simulate the whole FPGA design (and, if possible, the whole board or system); block-level simulation isn't enough.
Synchronize transfers across clock domains.
Make use of the embedded FPGA-specific features, e.g. SRLs (shift register LUTs).
Always do an FPGA test design with the pinout before committing to board layout! Prove that there are no banking or clocking limitations. It doesn't matter what the FPGA test design does (I use a group of shift registers with inputs looping to outputs) - just make sure that none of the logic is optimized away.
Have some spare FPGA I/O with external pull-ups - you can wire to these later if the I/O needs modifying.
Use high speed serial I/O rather than high speed parallel I/O.
As a rule of thumb, allow 5% on top of your required clock speed to account for temperature, clock jitter and noise fluctuations within the FPGA.
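The 5% rule of thumb is simple arithmetic, sketched here in Python. The function names and the 100 MHz figure are my own illustration, not part of any vendor tool, and the sketch assumes the margin is applied by targeting a slightly faster clock than the one you actually need.

```python
def target_mhz(required_mhz, margin=0.05):
    """Clock frequency to aim for: the required speed plus a safety margin."""
    return required_mhz * (1.0 + margin)


def target_period_ns(required_mhz, margin=0.05):
    """The same target expressed as a clock period in nanoseconds."""
    return 1000.0 / target_mhz(required_mhz, margin)


# Illustrative requirement only: a 100 MHz system clock.
required = 100.0
print(f"aim for {target_mhz(required):.1f} MHz, "
      f"i.e. a period of {target_period_ns(required):.3f} ns")
```

Keep the margin modest. Piling on much more than this starts to run into the "don't over-constrain" advice.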
Don't ...
Use any more clocks than are necessary.
Use asynchronous logic.
Use latches.
Over-constrain your design.
Write woolly HDL when you want high performance from the FPGA; spell it out to the synthesis tool.
Make assumptions; know what the effects of your code are.
Expect IP blocks to out-perform your code. Just because a block comes from a so-called expert doesn't mean you can't do something better, more efficient, or more specific to your goals.
Other Hints:
Designing for Speed
For time-critical blocks, keep the code simple - by this I mean keep the levels of logic down to the number that can be fitted in a single LE/CLB immediately before the destination register. Any time the path needs two LEs/CLBs, you can forget it.
Don't be afraid to lock logic to an area of the FPGA, or some critical registers to specific locations in the FPGA. I hear the argument "the design will change". So what? If it does, it's better to alter logic that meets timing than logic that doesn't.
Counters - replace them with LFSRs when the count length is known, even for big counters, but especially for small ones (less than 32 bits, say).
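To make the LFSR hint concrete, here is a small Python model (a sketch, not HDL) of using an LFSR as a fixed-length counter. The idea is that the next state needs only a shift and an XOR rather than a carry chain, and that for a known count length you can precompute the state the LFSR reaches after that many clocks and compare against that constant in hardware. The function names, the seed, and the 4-bit example with taps at bits 4 and 3 are my own choices for illustration; for real widths, take the taps from a published table of maximal-length LFSRs.

```python
def lfsr_step(state, width, taps):
    """Advance a Fibonacci LFSR by one clock.

    taps are 1-indexed bit positions whose XOR forms the bit shifted in.
    """
    feedback = 0
    for tap in taps:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | feedback) & ((1 << width) - 1)


def lfsr_period(width, taps, seed=1):
    """Number of clocks before the LFSR returns to its seed."""
    state = lfsr_step(seed, width, taps)
    clocks = 1
    while state != seed:
        state = lfsr_step(state, width, taps)
        clocks += 1
    return clocks


def terminal_state(count, width, taps, seed=1):
    """State reached after `count` clocks; the hardware compares against this."""
    if not 0 < count < (1 << width) - 1:
        raise ValueError("count must be between 1 and 2**width - 2")
    state = seed
    for _ in range(count):
        state = lfsr_step(state, width, taps)
    return state


# Illustrative 4-bit example: taps (4, 3) give the full period of 15 states.
print(lfsr_period(4, (4, 3)))                 # 15
print(format(terminal_state(10, 4, (4, 3)), "04b"))  # 0111
```

So a "count to 10" counter becomes a 4-bit LFSR seeded with 0001 plus a comparison against 0111. The all-zeros state must be avoided with XOR feedback, because the LFSR would stay stuck there.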