The article is 3 weeks old, but I just read it: Beyond3D published a very good analysis of the Fermi architecture. It is based on many homemade tests they developed to benchmark individual parts of the GF100 chip. Based on these analyses, they made interesting discoveries and speculations about the GF100 architecture.

In this article, I also discovered "Pomegranate", a parallel hardware architecture for polygon rendering developed at Stanford [Eldridge et al., 2000] that seems to be very close to the way Fermi handles parallel work distribution across the different stages of the graphics pipeline.

Discussions are on the Beyond3D forum.

Here are some interesting statements:

PS: To understand the following statements, note that they call the GF100 architecture "Slimer". I did not really get the joke... anyway.
  • Fermi Load/Store architecture and shared memory:
    "Moreover, Fermi makes a further step towards RISC-ism, being a proper load/store architecture, with all operands having to be moved into/out of registers, an example being shared memory: older architectures could use shared memory operands directly, whereas Slimer uses register load/store."
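To make this concrete, here is a minimal CUDA sketch (a hypothetical kernel, not from the article). The CUDA source is identical on both generations; the difference only shows up in the generated machine code, as described in the comments:

```cuda
// Hypothetical kernel illustrating Fermi's load/store handling of
// shared memory. Launch with blockDim.x <= 256.
__global__ void scale(float *out, const float *in, float k)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = in[i];
    __syncthreads();
    // On pre-Fermi parts the multiply below could encode s[threadIdx.x]
    // directly as a shared-memory operand in the machine code; on Fermi,
    // a load/store architecture, the compiler must first issue an
    // explicit shared-memory load into a register, then operate
    // register-to-register.
    out[i] = s[threadIdx.x] * k;
}
```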

  • SFUs doing triangle attribute interpolation, and interpolation precision:
    "An SFU can compute either transcendental functions or planar attribute interpolations.  For transcendental approximation it uses quadratic interpolation based on enhanced minimax approximations.  Three lookup tables holding coefficients for interpolation are used, [...], with accuracy for the resulting approximation ranging from 22 to 24 good bits."
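As I understand the scheme, it amounts to splitting the range-reduced argument into a table index (high bits) and a residual (low bits), then evaluating a per-segment quadratic from the three coefficient tables. A very simplified sketch, where the table size, the use of floating point instead of the hardware's fixed-point datapath, and the range reduction are all my assumptions:

```cuda
// Simplified sketch of table-based quadratic interpolation for a
// transcendental function. Tables c0/c1/c2 hold per-segment minimax
// coefficients; their size (2^7 entries here) is an assumption.
__device__ float approx(float y,        // range-reduced input in [0,1)
                        const float *c0, const float *c1, const float *c2)
{
    const int TABLE_BITS = 7;
    float scaled = y * (1 << TABLE_BITS);
    int   idx = (int)scaled;            // table index from the high bits
    float x   = scaled - idx;           // residual from the low bits
    // Quadratic interpolation within the segment:
    return c0[idx] + c1[idx] * x + c2[idx] * x * x;
}
```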

  • Details on Fermi distributed rasterisation:
    "The first thing that must be done is distribute them (triangles) to the rasteriser, which is done using each triangle's bounding box (in this case practically a rectangle, we're in a 2D space after all). Triangles that cross multiple tiles get distributed to owners of crossed tiles, and work gets replicated. Once this GTE-controlled distribution of triangles is performed, these get buffered and re-ordered at their destination GPCs, prior to rasterisation, to return to API ordering. Once this is done, rasterisation can proceed, and no further sorts are needed."
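A rough host-side sketch of that bounding-box distribution. The tile size and the tile-to-GPC ownership mapping below are assumptions (the article does not disclose them); only the GPC count matches GF100:

```cuda
// Hypothetical sketch of bounding-box based triangle distribution.
struct BBox { int x0, y0, x1, y1; };    // triangle bounds in pixels

const int TILE = 16;                    // assumed screen-tile size
const int NUM_GPCS = 4;                 // GF100 has four GPCs

// Assumed checkerboard mapping from tile coordinates to owning GPC.
int tile_owner(int tx, int ty) { return (tx + ty) % NUM_GPCS; }

// Replicate a triangle to every GPC that owns a tile its bbox crosses;
// this is the work replication the quote mentions.
void distribute(const BBox &b, bool send_to_gpc[NUM_GPCS])
{
    for (int g = 0; g < NUM_GPCS; ++g) send_to_gpc[g] = false;
    for (int ty = b.y0 / TILE; ty <= b.y1 / TILE; ++ty)
        for (int tx = b.x0 / TILE; tx <= b.x1 / TILE; ++tx)
            send_to_gpc[tile_owner(tx, ty)] = true;
}
```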

  • The L2 cache usage in the graphics pipeline:
    "Data is kept on chip as much as possible, after the initial vertex fetch, with the lowest level in the memory hierarchy that gets hit being the L2, for data marshaling between pipeline stages, and the post geometry processing pre-rasterisation re-ordering. It is our humble opinion that having the L2 was the key to making the parallel approach to geometry tasks feasible, and the rest is mostly peanuts by comparison."

  • Global atomic operations in ROPs:
    "The ROPs are atomic units for at least atomics performed on memory addresses that map to global memory. Since raster rate has no impact here, it means that atomics can benefit from the extra ROPs.  Speaking of atomics, another interesting aspect is that Fermi also adds support for doing atomic ADD or XCH with FP operands (INT atomic units are cheap, FP units not so much). Finally, we believe that writes to the L2 portion that's allocated as ROP cache are serialized between GPCs, so as to prevent conflicts/contention, with each GPC writing at most 128-bytes to it in a round-robin fashion."
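The FP atomic support they mention is exposed in CUDA: atomicAdd() on a float in global memory is available from sm_20 (Fermi) onwards. A trivial, hypothetical reduction using it:

```cuda
// FP32 atomic add on global memory, the operation the quote says is
// serviced by the ROP/L2 path. Requires sm_20 (Fermi) or later.
__global__ void reduce_sum(float *sum, const float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, data[i]);
}
```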

  • They suspect parallel triangle setup is capped on GeForce:
    "Getting back to the main course, the question remains: why does Slimer need tessellation to expose its parallel setup capability? [...] In fact, we struggled with many potential theories, until a fortuitous encounter with a Quadro made the truth painfully obvious: product differentiation [...]  The Quadro, in spite of being pretty much the same hardware (this is a signal to all those that believe there's magical hardware in the Quadro because it's more expensive – engage rant mode!), is quite happy doing full speed setup on the untessellated plebs. [...] Capping is done in drivers, by inducing artificial delays during the post viewport transform reordering (mind you this hasn't yet been confirmed by NVIDIA, but our own educated conclusion)."

  • Atomic operations: slow in shared memory, fast in global memory
    Shared Memory Atomics:
    "Hoping that everyone who fainted was exposed to smelling salts, let's underline that the above is correct: in our considerable experience, Cypress is ~12 times faster here. "
    => Their theory (but I don't believe it):
    "[...] we think that it can only perform atomics using the ROPs on operands that are in the L2; as such, when doing atomic ops on shared memory operands, what actually happens is that the bank holding the offending value gets locked and its data is written out to the L2, the operation is performed at the ROP and the result written back, with the bank being unlocked afterwards."

    Global Memory Atomics: 
    "[...] roughly 3 times faster [than Cypress]"  
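If these numbers hold, the practical upshot for CUDA code on Fermi is counter-intuitive: the textbook optimisation of privatising atomics into shared memory may actually be the slow path. A hypothetical histogram kernel showing both paths:

```cuda
// Hypothetical 256-bin histogram; launch with 256 threads per block.
// Per Beyond3D's measurements, on Fermi the shared-memory atomicAdd in
// the loop is the slow path, and the final global atomicAdd the fast one.
__global__ void histogram(unsigned int *global_bins,
                          const unsigned char *data, int n)
{
    __shared__ unsigned int local_bins[256];
    local_bins[threadIdx.x] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_bins[data[i]], 1u);  // shared-memory atomic

    __syncthreads();
    atomicAdd(&global_bins[threadIdx.x],      // global-memory atomic
              local_bins[threadIdx.x]);
}
```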

  • DX Append buffer and counters: slower than Cypress
    "The reason why performance with Counters or Append/Consume buffers in D3D looks comparatively bad is tied to this as well: ATI has some in-hardware tweaks for those usage scenarios, making extensive use of the GDS, which also has hardware for atomics, mind you, which is perky since counter management is also pretty much an atomic op, and some dedicated pathways. That's in contrast to NVIDIA, who seem to have opted for a fully generic path."
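The "fully generic path" presumably looks like emulating an append buffer with a plain global atomic counter, something like this hypothetical CUDA stream-compaction kernel (in D3D the runtime/driver would do the equivalent under the hood):

```cuda
// Hypothetical append-buffer emulation: a global counter bumped with an
// atomic, exactly the "counter management is pretty much an atomic op"
// the quote describes. Output order across threads is unspecified.
__global__ void append_if_positive(float *out_buf, unsigned int *out_count,
                                   const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {
        unsigned int slot = atomicAdd(out_count, 1u); // reserve a slot
        out_buf[slot] = in[i];                        // append the element
    }
}
```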