By Jean-Francois Panisset -
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/40696.pdf
The following paper is a bit wordy, but has some interesting insights:
Basically, it says that although in most cases you want threads to access memory directly attached to the CPU socket they run on (about 20% faster for some of the apps they tested), in some applications cache bandwidth contention can flip that around: you can get better performance by spreading threads across all sockets, maximizing aggregate cache bandwidth, even though that means accessing remote memory.
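The two placement strategies the paper contrasts can be sketched with Linux CPU affinity (`os.sched_setaffinity`). This is just an illustrative toy, not the paper's method: Python can't easily query the real socket layout, so the "topology" below is a made-up stand-in that splits the available CPUs in half and pretends each half is a socket; a real tool would read the map from /sys/devices/system/cpu or libnuma, and would also need to control memory placement (e.g. via numactl/mbind), which plain affinity doesn't do.

```python
import os
import threading

# Stand-in "topology": split the CPUs this process may run on into two
# halves and pretend each half is a socket. A real implementation would
# read the actual socket map from /sys or libnuma instead.
_cpus = sorted(os.sched_getaffinity(0))
_half = max(1, len(_cpus) // 2)
SOCKETS = [set(_cpus[:_half]), set(_cpus[_half:]) or {_cpus[0]}]

def run_pinned(worker, cores):
    """Run worker() in a thread restricted to the given CPU cores."""
    def body():
        os.sched_setaffinity(0, cores)  # pid 0 = the calling thread
        worker()
    t = threading.Thread(target=body)
    t.start()
    return t

def packed(worker, n_threads):
    """All threads on one 'socket': memory stays local, caches shared."""
    for t in [run_pinned(worker, SOCKETS[0]) for _ in range(n_threads)]:
        t.join()

def spread(worker, n_threads):
    """Round-robin across 'sockets': more aggregate cache bandwidth,
    at the cost of some remote-memory accesses."""
    threads = [run_pinned(worker, SOCKETS[i % len(SOCKETS)])
               for i in range(n_threads)]
    for t in threads:
        t.join()
```

Measuring the paper's effect would then come down to timing the same memory-bound worker under `packed` versus `spread` on a real multi-socket box.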
Would be interesting to try to measure this effect for industry-standard renderers, and to see whether any of them attempt NUMA-aware memory placement.
I wish I had the time to do fun stuff like that...
JF