interesting conversation with AMD CTO.

Marzipan

Well-known member
Joined
Nov 21, 2007
Messages
12,077
Location
Prince Rupert, British Columbia, Canuckistan

The focus is on how most software can multi-thread, so they see more cores supplanting higher clock speeds. But the two nifty takeaways, for me anyhow, were the mention of SMT4, where a single core could run up to 4 threads, and the tidbit about x86 vs ARM: the demand for ARM-based servers wasn't because demand for x86 was dropping, it was because everyone wanted to get away from an Intel solution... which gives AMD a real opportunity to capture market share in that department, as their CPUs are drop-in x86 solutions and don't require the recompiling and such that porting over to the ARM platform would.

very cool article!
 

Izerous

Well-known member
Folding Team
Joined
Feb 7, 2019
Messages
3,658
Location
Edmonton
Was an interesting read, and a move to SMT4 could be very interesting. Even on the low end of things, a super cheap dual core running 8 threads in an HTPC or in NAS devices like Synology / QNAP could be a very appealing price/performance argument, depending on how well it works.
 

Marzipan

It would need to maintain a high clock speed to have the headroom to run 4 threads, though. A lower core count allows for higher clocks, so it's doable.
 

Entz

Well-known member
Joined
Jul 17, 2011
Messages
1,878
Location
Kelowna
I agree with him. SMT4 is a niche of a niche: there are few "consumer" workloads that can keep the execution units perfectly balanced with 2 threads, let alone 4. If you get it wrong, your performance will tank. It likely won't be a consumer thing, ever.

It has already been shown that at higher core counts (>4) you can actually get better gaming performance when you disable HT. Imagine a game trying to run 3 "fake" threads on shared execution units... yikes.
 

gingerbee

Well-known member
Joined
Jan 22, 2009
Messages
10,053
Location
Orillia, Ontario
Yes, but is the performance hit caused by the software just not being up to handling the threads available? I.e., not enough software is written to handle more than 4 threads. So if the software caught up, would the performance be that much better? Not saying I know this, just wondering about the question myself.

Plus, IBM had a few chips a couple of years ago with more than 2 threads per core, and I think those were built for particular software. But it does work, and if the software is the only reason each core is leaving performance on the table, we could see this happen in the future. I don't know if it will be on Zen 3, but it may be where CPU design and software coding are going.
 

Entz

It's not really that simple. A great many tasks just cannot be made to thread well. Not all problems can be broken up into isolated concurrent passes, and oftentimes you end up with massive amounts of thread synchronization to access shared resources (memory/IO/state) to make it happen. Forcing it can have unintended consequences.
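
One rough way to put numbers on this is Amdahl's law, which nobody in the thread mentions by name. The sketch below is only an illustration: the function name and the example fractions are assumptions, not measurements from any real game or application.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Ideal speedup when only part of the work can run in parallel.

    parallel_fraction: share of the runtime that threads perfectly (0.0 to 1.0).
    threads: number of hardware threads, assumed equally fast (optimistic for SMT).
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # Hypothetical workloads: half parallel vs. 90% parallel.
    for fraction in (0.5, 0.9):
        for threads in (4, 8, 32):
            speedup = amdahl_speedup(fraction, threads)
            print(f"parallel={fraction:.0%} threads={threads:2d} speedup={speedup:.2f}x")
```

With only half the work parallelizable, the speedup can never reach 2x no matter how many threads are thrown at it, and that is before any locking cost is counted.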

This is very true for things like games. Even if your AI / sound / video / physics are all on separate threads, at some point they all need to synchronize their state, which requires locking, which can take up a massive amount of CPU time. It is just not possible to have your physics run but not update anything; it cannot be isolated. Your AI could go talk to itself in the corner, but nothing will ever move because it can't update the actor state. That's just not how it works.
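
A minimal sketch of that shared-state problem, in Python purely for illustration. A real engine would not look like this, and the Actor class, the two worker functions and the step counts are all made up.

```python
import threading


class Actor:
    """Hypothetical game object whose state both systems must update."""

    def __init__(self):
        self.x = 0.0
        self.lock = threading.Lock()


def physics(actor, steps):
    for _ in range(steps):
        with actor.lock:  # waits here whenever the AI thread holds the lock
            actor.x += 1.0


def ai(actor, steps):
    for _ in range(steps):
        with actor.lock:  # waits here whenever the physics thread holds the lock
            actor.x -= 0.5


def run(steps=10_000):
    actor = Actor()
    workers = [
        threading.Thread(target=physics, args=(actor, steps)),
        threading.Thread(target=ai, args=(actor, steps)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return actor.x


if __name__ == "__main__":
    print(run())
```

Every update has to take the same lock, so the two threads spend part of their time waiting on each other instead of running side by side. Adding more threads that touch the same actor only adds more waiting.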

HT/SMT is at best a 30% gain in most workloads, and it doesn't take much for that to fall down. SMT4 may hit 50%, but you are greatly increasing the number of locks that can happen. If things are not in the cache, fetching from 4 locations in memory at once (per core) needs a butt load of bandwidth.

Edit: Removed some company information. I should prob be careful what I say, even though it wasn't bad lol.
 

gingerbee

Thanks, Entz, that answers a lot of my questions. So if we do see it, it will only be on server chips that are handling butt loads of different requests for loads of different tasks all at the same time.
 

Marzipan

@Entz I'm not really seeing your point... especially now that we have 16-core / 32-thread consumer CPUs. How is SMT4 on 8 cores to achieve 32 threads any different?
 

Bond007

Well-known member
Joined
Jun 24, 2009
Messages
7,989
Location
Nova Scotia
This semi-relates to the SMT4 discussion. IBM has had more heavily threaded cores before, and this is what they say they gain by going up in threads from 1 to 8.

[Attached image: SMT_performanceIBM.png]
 

Entz

That was in response to software being the issue.

There is a huge difference between a physical core and a virtual one. There are only a certain number of physical stages in a CPU pipeline. If one thread is tying one up, you don't get any performance benefit if a second thread wants it. Now imagine 4 threads trying for the same bit. Want to add floating point numbers on 4 threads when you can only do 1 at a time? They will have to be interleaved. Sure, you can load from memory on one thread and run another while that stalls, or do FP and int at the same time, but you have to be extremely careful about how things are processed, and SMT4 only makes that worse. Memory bandwidth also becomes an issue.
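
To make the "4 threads, 1 FP unit" case concrete, here is a toy model. It is a big simplification and not how any real core schedules work: it assumes one unit per op type, one op per unit per cycle, and in-order issue per thread. The function name and the op labels are invented for the example.

```python
from collections import deque


def cycles_to_finish(threads):
    """Count cycles for a toy SMT core to drain every thread's op list.

    threads: list of op lists, e.g. [["fp", "fp"], ["int", "int"]].
    Each distinct op label stands for a single shared execution unit.
    """
    queues = [deque(ops) for ops in threads]
    count = len(queues)
    start = 0  # rotate which thread gets first pick so none is starved
    cycles = 0
    while any(queues):
        busy = set()  # units already claimed this cycle
        for offset in range(count):
            queue = queues[(start + offset) % count]
            if queue and queue[0] not in busy:
                busy.add(queue.popleft())  # issue the op, mark its unit busy
        start = (start + 1) % count
        cycles += 1
    return cycles


if __name__ == "__main__":
    one_fp = [["fp"] * 10]
    four_fp = [["fp"] * 10 for _ in range(4)]
    fp_plus_int = [["fp"] * 10, ["int"] * 10]
    print("1 thread, all FP:  ", cycles_to_finish(one_fp))
    print("4 threads, all FP: ", cycles_to_finish(four_fp))
    print("FP thread + int thread:", cycles_to_finish(fp_plus_int))
```

In this model four all-FP threads take four times as long as one, so the extra threads buy nothing, while an FP thread paired with an int thread finishes in the same time as either alone. That is the "extremely careful about how things are processed" part.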

Not to mention you need bigger register files and caches, and a whole host of other things that increase die space. Space you prob could just use for another core. Schedulers would need to be tweaked too.

This is why HT is nowhere close to double the performance. It's why Windows prefers physical cores over virtual ones, and why, if you have a VM server under load, getting stuck on virtual cores destroys your performance.

As he said, sometimes you get better performance by turning off HT (or pinning your processes to physical cores). Lots of factors play into it.

Your 8C SMT4 will be slower in general than a 16C SMT2; I would guess even a 10C SMT2 would beat it. Unless you have perfectly designed, hand-crafted software that is not general purpose, or a workload that causes a lot of stalls (memory intensive).
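
Plugging the rough figures from earlier in the thread into a back-of-envelope comparison (about a 30% uplift for SMT2, maybe 50% for SMT4) gives the same ordering. This assumes identical cores and clocks and perfect scaling across cores, which real chips won't deliver, and the table and function names are just for illustration.

```python
# Assumed best-case throughput per core relative to one thread per core.
# 1.3 and 1.5 are the ballpark guesses from the thread, not benchmarks.
SMT_UPLIFT = {1: 1.0, 2: 1.3, 4: 1.5}


def core_equivalents(cores, smt_ways):
    """Estimate throughput in units of 'one single-threaded core'."""
    return cores * SMT_UPLIFT[smt_ways]


if __name__ == "__main__":
    configs = [
        ("2C SMT4 (8 threads)", 2, 4),
        ("8C SMT4 (32 threads)", 8, 4),
        ("10C SMT2 (20 threads)", 10, 2),
        ("16C SMT2 (32 threads)", 16, 2),
    ]
    for label, cores, smt_ways in configs:
        print(f"{label}: ~{core_equivalents(cores, smt_ways):.1f} core-equivalents")
```

On those assumptions 8C SMT4 comes out around 12 core-equivalents, 10C SMT2 around 13, and 16C SMT2 around 20.8, so the thread count on the box says a lot less than the physical core count.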
 