The Cell is a whole different type of nightmare, with various interesting properties:
1. Bitshifts by variable amounts take ~7 clocks each on the main PPU.
2. The SPUs have no cache whatsoever. Each one only sees its own 256 KB local store, so all caching has to be done explicitly, and all access to main memory is DMA'd.
3. The SPUs do scalar math no faster than SIMD (in fact, from what I know, scalar math is just a SIMD instruction applied to a single value).
4. There is no instruction reordering on SPUs, and everything has to be scheduled exactly for max performance (there are two pipelines, "even" and "odd"; each instruction can only go down one of them, and a pair only dual-issues if it is lined up correctly).
5. The SPUs are not Altivec chips. The SPU SIMD instruction set is similar to Altivec, but more versatile. Of course, this means you can't just run existing Altivec code on them.
6. Overall, the integer SIMD on SPUs is much slower than that on modern Intel processors. Not sure about float, as I have no experience in that arena.
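To make point 2 a bit more concrete, here is a rough sketch of what "explicit caching" tends to look like: the big buffer stays in main memory, and the code only ever touches two small local buffers, fetching the next chunk while it works on the current one (double buffering). This is plain portable C, not real SPU code. `fake_dma_get`, `sum_streamed` and `CHUNK` are names made up for illustration, and the `memcpy` is only a stand-in for a DMA transfer. On an actual SPU you would presumably issue an asynchronous `mfc_get` from `spu_mfcio.h` and wait on its tag before reading the buffer, which is what makes the overlap worthwhile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 1024  /* bytes per transfer; hypothetical value for illustration */

/* Stand-in for a DMA "get" from main memory into local store.
   Assumption: on a real SPU this would be asynchronous, not a memcpy. */
static void fake_dma_get(void *local, const uint8_t *main_mem, size_t n)
{
    memcpy(local, main_mem, n);
}

/* Sum a large buffer that lives in "main memory" while only ever
   reading from two small "local store" buffers. */
uint64_t sum_streamed(const uint8_t *main_mem, size_t total)
{
    static uint8_t local[2][CHUNK];  /* the two halves of the double buffer */
    uint64_t sum = 0;
    size_t offset = 0;
    int cur = 0;

    assert(main_mem != NULL || total == 0);
    if (total == 0)
        return 0;

    /* Prime the pipeline: fetch the first chunk before the loop starts. */
    size_t cur_len = total < CHUNK ? total : CHUNK;
    fake_dma_get(local[cur], main_mem, cur_len);

    while (cur_len > 0) {
        /* Kick off the fetch of the next chunk into the other buffer. */
        size_t next_off = offset + cur_len;
        size_t remaining = total - next_off;
        size_t next_len = remaining < CHUNK ? remaining : CHUNK;
        if (next_len > 0)
            fake_dma_get(local[cur ^ 1], main_mem + next_off, next_len);

        /* Work on the chunk that is already in local memory. */
        for (size_t i = 0; i < cur_len; i++)
            sum += local[cur][i];

        /* Swap buffers and move on. */
        offset = next_off;
        cur_len = next_len;
        cur ^= 1;
    }
    return sum;
}
```

Calling `sum_streamed(buf, len)` gives the same result as a plain loop over `buf`; the only difference is that the compute loop never dereferences main memory directly, which is the constraint the SPUs force on you.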