I have always been an Xbox fan, ever since the original Xbox. I very much enjoyed having the superior technology when it came to OG Xbox versus the PS2, Xbox 360 versus the PS3.
But unfortunately, with the Xbox One, Microsoft made some miscalculations: it was an underpowered console compared to the PS4 – until the Xbox One X came along.
The Xbox One X is truly a marvel of engineering: Microsoft ran extensive computer modelling to understand how to improve the behaviour of both the CPU and GPU – especially when it came to registers and caching – to achieve a 6TFLOP (TF) system that acted like one with more than 6TFs. Even Mark Cerny famously said that he believed 6TFs would NOT be enough to run games at full 4K. He was both right and wrong. He was right because if Microsoft had simply scaled the GCN-based GPU from the original Xbox One to create a 6TF GPU, those 6TFs would not have been enough. But Microsoft did very extensive modelling to understand why a simply scaled CPU and 6TF GPU could stall and how to avoid those stalls. As part of this work, they tweaked and scaled registers and caches to remove the bottlenecks and allow the CPU and GPU to operate closer to their theoretical maximums. The Xbox One X was one hell of an engineering feat!
So it is interesting, then, that Microsoft might not have applied the lessons learnt by the Xbox One X engineering team when it came to the Xbox Series X, while Sony seems to have taken note.
After all, the engineering team that worked on the Xbox One X might not have joined the Xbox Series X development team until they were done with the Xbox One X, which could have been as late as halfway through the Series X's development.
So how come the PS5 with its 10TF GPU is able to compete with Xbox Series X’s 12TF GPU? Let’s examine some of the possible issues. It is likely not one or the other, however, but a mixture of all these issues and challenges.
HARDWARE – CPU
Firstly, let’s have a look at the hardware starting with the CPU. We know from released documents that the Xbox Series X CPU has two clusters of 4 cores – 8 cores in total – with each cluster utilising a separate block of L3 cache.
In the case of the PS5, there are strong indications that the L3 cache is unified on the CPU. Why is this important? Well, this is one of the main boosts to achieving a higher IPC (Instructions Per Clock) when moving from AMD’s Zen2 to Zen3 architecture due to a lower level of latency when cores need to share data.
Additionally, with a unified cache, you do not need to duplicate data between the cache blocks if two cores need to work on the same data, therefore achieving better cache utilisation. While the PS5 can utilise all of its cache for unique data, the Xbox Series X has less overall cache as there will be some duplication.
This can lead to the CPU stalling less and therefore achieving a higher execution efficiency at the same clock speed, that is, a higher IPC.
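The duplication argument can be put into rough numbers. The sketch below is only an illustration: the 8 MB total and the 1 MB shared working set are assumed values, not figures from either console, and `unique_l3_mb` is a hypothetical helper.

```python
def unique_l3_mb(total_mb: float, clusters: int, shared_mb: float) -> float:
    """Cache capacity holding distinct data when every cluster needs `shared_mb`.

    A unified cache (clusters=1) stores the shared data once. Split caches
    keep one copy per cluster, so (clusters - 1) copies are pure overhead.
    """
    duplicated = (clusters - 1) * shared_mb
    return total_mb - duplicated


TOTAL_L3_MB = 8.0  # assumed total L3 size, for illustration only
SHARED_MB = 1.0    # assumed data that cores in both clusters work on

unified = unique_l3_mb(TOTAL_L3_MB, clusters=1, shared_mb=SHARED_MB)
split = unique_l3_mb(TOTAL_L3_MB, clusters=2, shared_mb=SHARED_MB)
print(f"unified: {unified} MB of unique data, split: {split} MB of unique data")
```

The more data the two clusters share, the larger the gap becomes, which is why the effect depends heavily on how a game distributes its work across cores.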
So in practical terms, what advantage could this give to the PS5?
- The CPU would be able to feed the GPU faster when it comes to traditional rasterisation – this might be less important when mesh shading (whether through RDNA2 mesh shaders or PS5’s Geometry Engine) becomes more commonplace, but until then it may result in higher frame rates.
- The CPU would be able to feed the GPU more consistently as it is less likely to stall or it would have more consistent spikes in wait times. With a split cache, some waits will be longer than others, therefore resulting in less consistency in maintaining the frame rate or simulation rate.
- The same would apply to other types of execution such as world simulation, AI, etc. They would be able to be done with more consistency and less delay, therefore again resulting in more consistent or higher frame rates, especially if the world simulation is closely tied to frame rate – some engines will be more sensitive to this than others.
Can Xbox do anything to combat this?
Well, developers are able to organise the work on the different CPU cores – and threads – in a way that requires as little data sharing between the CPU clusters as possible. However, there is a bit of an issue with expecting developers to do this: it requires a lot more foresight and more optimisation. When the dominant platform – which PlayStation is due to its install base – does not require developers to do this, multiplatform games can suffer in performance when code is ported. In fact, these performance issues can also happen when porting PC code.
To resolve this fully, Microsoft may need to either:
- Allow developers to see where this data sharing occurs and allow them to very easily move execution of code between CPU clusters – even dynamically using certain conditions. If they make this tedious, it will require more work from developers or will be less successful.
- Get the GDK to do this automatically when it recognises that major and continuous data sharing is occurring between threads from the two clusters – dynamically shifting the load between CPU clusters to prevent a stall. This would be a difficult thing to do but it would allow code to be shifted between the two platforms much easier.
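To make the idea of placing threads so that little data crosses between clusters more concrete, here is a minimal sketch. It is not how the GDK works: the thread names and sharing weights are invented, and `cross_cluster_traffic` and `best_split` are hypothetical helpers that brute-force every way of putting four of eight threads on the first cluster.

```python
from itertools import combinations


def cross_cluster_traffic(sharing, cluster_a):
    """Sum the sharing weights of thread pairs that sit in different clusters."""
    a = set(cluster_a)
    return sum(w for (t1, t2), w in sharing.items() if (t1 in a) != (t2 in a))


def best_split(threads, sharing, cluster_size=4):
    """Try every choice of `cluster_size` threads and keep the cheapest split."""
    return min(combinations(threads, cluster_size),
               key=lambda c: cross_cluster_traffic(sharing, c))


threads = ["render", "physics", "ai", "audio", "anim", "stream", "net", "ui"]
sharing = {  # hypothetical amount of data shared per frame between two threads
    ("render", "anim"): 9, ("render", "ui"): 4, ("physics", "ai"): 7,
    ("physics", "anim"): 6, ("ai", "net"): 3, ("audio", "stream"): 5,
    ("stream", "net"): 2, ("render", "stream"): 1,
}

cluster_a = best_split(threads, sharing)
cluster_b = [t for t in threads if t not in cluster_a]
print("cluster A:", cluster_a)
print("cluster B:", cluster_b)
print("cross-cluster traffic:", cross_cluster_traffic(sharing, cluster_a))
```

With eight threads there are only 70 splits to test, so brute force is fine here. A real tool would have to measure the sharing at runtime, which is the hard part.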
I think Microsoft will probably optimise this, and the fact that they are able to enable multi-threading on the CPU is going to aid them in closing the gap. But this certainly will need work.
HARDWARE – GPU
So how come a 10TF GPU seems to be performing like a 12TF GPU? Well, just like Microsoft when engineering the Xbox One X, Sony has put a LOT of focus on reducing latency of data access within the GPU and went beyond cache management of RDNA2 by introducing the cache coherency engines (cache scrubbers) – and possibly other tweaks we will never know.
Whether it is a 10TF or a 12TF GPU it will not achieve its full TF compute performance if there are stalls in the CUs (Compute Units) and it is waiting for information to be fetched from main memory or the SSD. Make no mistake, there are stalls even in the most well-engineered piece of silicon. The question is this:
- How can you ensure you have as few stalls as possible
- How can you ensure that when stalls happen, they will be as short as possible
These two things contribute greatly to the IPC performance of the CUs and literally all strategies to improve IPC chip designers employ revolve around this: cache sizes, cache coherency, execution pipeline prediction and so on.
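A very crude way to see how stalls eat into theoretical compute is to scale the peak figure by the share of cycles that do useful work. The sketch below uses the article's rounded 10TF and 12TF figures. The 10% stall share is an assumption picked purely for illustration, and both function names are hypothetical.

```python
def effective_tflops(peak_tflops: float, stall_fraction: float) -> float:
    """Peak throughput scaled by the share of cycles not lost to stalls."""
    return peak_tflops * (1.0 - stall_fraction)


def break_even_stall(peak_a: float, stall_a: float, peak_b: float) -> float:
    """Stall fraction at which GPU B merely matches GPU A's effective throughput."""
    return 1.0 - effective_tflops(peak_a, stall_a) / peak_b


# Rounded peaks from the article; the 10% stall share is an assumed value.
stall_b = break_even_stall(peak_a=10.0, stall_a=0.10, peak_b=12.0)
print(f"A 12TF GPU only matches the 10TF GPU once it stalls {stall_b:.0%} of the time")
```

Under these assumptions, a 12TF part that stalls a quarter of the time delivers the same 9TF as a 10TF part that stalls a tenth of the time.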
I do wonder if Microsoft forgot this core lesson from their Xbox One X engineering team and simply trusted AMD to design something amazing here. I also do wonder if Sony was paying attention and outsmarted Microsoft engineers at their own game when it came to cache coherency by implementing cache scrubbers that allow better cache utilisation. It was a gamble which seems to be paying off majorly.
As it stands, it very much looks like Sony’s GPU is stalling less and performs much better than its 10TFs of performance would imply. It also looks like the Xbox Series X performs like 2x Xbox One Xs which of course checks out – going from 6TFs to 12TFs. So is it possible that the Xbox One X learnings with regards to registers and caches WERE implemented, but not appropriately scaled or those ideas not taken further?
I think there’s no better reminder than the Xbox One’s ESRAM that certain strategies provide a massive competitive advantage in one generation while becoming a liability in the next. After all, the Xbox 360’s high-speed on-GPU eDRAM provided a great competitive advantage when paired with a unified high-speed memory interface, while the Xbox One’s ESRAM did exactly the opposite when paired with main RAM speeds that were not state of the art compared to the competition. We can conclude then that context – both within the HW itself and of the competition – is critical.
Cache Coherency – What can Xbox do to resolve this?
Well, luckily, cache acts differently in a GPU than it does in a CPU, in the sense that both developers and Microsoft have more control over how it gets filled and with what. A CPU doesn’t really allow such control – apart from changing the microcode I guess, if that.
Again there are two ways:
- Allow developers lots of control of caches through the GDK. However, again this results in having to spend lots of time on optimisation and trying to get the best performance out of the GPU, not to mention it could pose technical complexity with forward compatibility. Remember that Microsoft tried to do this with the ESRAM in the Xbox One as a way to fix the horrible performance of the console and achieve resolution parity with the PS4 – among other things. It did give developers who were willing to spend the time required on optimisation better performance and resolution parity with the PS4. Those developers who couldn’t be bothered or didn’t get the hang of the ESRAM were less likely to achieve that.
- Do the hard work the team did during Xbox One X and dust off those computer models. See how different types of code execution require different caching strategies (cache policies) for best performance, then implement those in the GDK. Allow the developers to switch the GPU into the different cache policies during code execution, or detect the code change automatically and switch on the fly. Yes, this is hard work to develop, but it may very well allow much easier porting of code and achieve excellent performance.
Clock Speed and CUs
The clock speed of the PS5 GPU is higher than that of the Xbox Series X by roughly 20%. However, the Xbox Series X has roughly 45% more Compute Units.
PS5: 36 CUs running at 2.23GHz (max speed)
Xbox Series X: 52 CUs running at 1.825GHz
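For reference, the headline TF figures follow directly from these two numbers. The sketch below uses the usual peak calculation for AMD's RDNA-style compute units (64 shader lanes per CU, two floating point operations per lane per clock) and takes 1.825GHz as the publicly quoted Series X GPU clock. The function name `peak_tflops` is just a label for this example.

```python
LANES_PER_CU = 64    # shader lanes per compute unit
FLOPS_PER_LANE = 2   # one fused multiply-add counts as two operations


def peak_tflops(cus: int, clock_ghz: float) -> float:
    """Theoretical FP32 peak in TFLOPS for a given CU count and clock."""
    return cus * LANES_PER_CU * FLOPS_PER_LANE * clock_ghz / 1000.0


ps5 = peak_tflops(36, 2.23)
xsx = peak_tflops(52, 1.825)

print(f"PS5: {ps5:.2f} TF, Xbox Series X: {xsx:.2f} TF")
print(f"PS5 clock advantage:  {2.23 / 1.825 - 1:.0%}")
print(f"Series X CU advantage: {52 / 36 - 1:.0%}")
print(f"Series X peak advantage: {xsx / ps5 - 1:.0%}")
```

These are peaks only. They say nothing about how often either GPU actually reaches them, which is the whole point of the discussion around stalls.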
This presents two challenges: all the other functions of the chip run at a higher frequency on the PS5 so some steps in the pipeline will be faster due to the higher frequency. However, a stall will affect the PS5 worse because main memory – or the SSD as it were – is more cycles away. This is why Sony has focused so much on cache coherency and ensuring the GPU doesn’t need to wait around.
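The "more cycles away" point is simple arithmetic: a fixed latency in nanoseconds costs more clock cycles on the faster-clocked chip. In the sketch below the 300ns figure is an assumed placeholder, not a measured latency for either console, and `stall_cycles` is a hypothetical helper.

```python
def stall_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Clock cycles that pass while waiting `latency_ns` (1 GHz = 1 cycle per ns)."""
    return latency_ns * clock_ghz


MEMORY_LATENCY_NS = 300.0  # assumed latency, taken as identical on both consoles

ps5_cycles = stall_cycles(MEMORY_LATENCY_NS, 2.23)
xsx_cycles = stall_cycles(MEMORY_LATENCY_NS, 1.825)
print(f"PS5 waits {ps5_cycles:.0f} cycles, Xbox Series X waits {xsx_cycles:.0f} cycles")
```

The wall-clock wait is the same, but the higher-clocked GPU gives up more potential work per miss, so avoiding the miss is worth more to it.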
The difference in CU numbers also has an effect: it is more difficult to keep 52 CUs occupied at all times than it is to do the same with 36 CUs. What is more, code that is heavily optimised for 36 CUs might well run pretty badly on a 52 CU system:
36 * 2 / 52 = 1.38
In very simplistic terms, just to demonstrate the issue: if you were to prepare batches of work sized for 36 CUs and then run two of them on 52 CUs, that is only 1.38 passes' worth of work for the larger GPU. Without any optimisation lower in the stack, such as in the drivers or the code itself, the 52 CU GPU would run at roughly 70% of its full speed: it would run one full cycle with all CUs utilised, then a second cycle with only about 40% of the CUs busy (20 of 52).
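The same back-of-the-envelope numbers can be reproduced with a few lines of code. This is a deliberately naive model: it assumes one work item per CU per pass, every item taking the same time, and no scheduler smart enough to fill the gaps. `wave_utilisation` is a hypothetical helper.

```python
import math


def wave_utilisation(work_items: int, cus: int):
    """Return (passes, occupancy of each pass, overall utilisation)."""
    passes = math.ceil(work_items / cus)
    occupancy = []
    remaining = work_items
    for _ in range(passes):
        busy = min(remaining, cus)   # CUs that have something to do this pass
        occupancy.append(busy / cus)
        remaining -= busy
    return passes, occupancy, work_items / (passes * cus)


# Two batches sized for a 36 CU GPU, executed on a 52 CU GPU.
passes, occupancy, overall = wave_utilisation(2 * 36, 52)
print(f"passes: {passes}")
print("occupancy per pass:", [f"{o:.0%}" for o in occupancy])
print(f"overall utilisation: {overall:.0%}")
```

Running the same 72 items on 36 CUs gives two fully occupied passes, which is exactly why code tuned for one CU count can leave the other partly idle.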
Now of course in practice, this may not happen due to optimisation lower in the software stack or the code itself but the question is, was code ported from PS5 to Xbox Series X with such batching / CU utilisation issues? If so, that can seriously hurt performance.
Clock Speed and CUs – What can Xbox do to resolve this?
I am certain that Microsoft already catches some of the batching issues. However, it is very possible that those code paths are not fully performant.
Another option is that Microsoft leaves it to developers fully to resolve such batching / CU utilisation issues in which case the question is:
- Do the tools support the developer to easily but fully untangle such issues?
- Is it easy to resolve such issues in each case?
- Can Microsoft do more lower in the stack to optimise for this?
Such issues may well disappear as developers get the hang of developing for the Xbox Series X and code isn’t ported between the two platforms with such little time allowed for performance optimisation.
GDK – The Tools
Microsoft re-wrote the XDK (Xbox Development Kit) for the Xbox One and it took a few years to improve its tools and performance to a point where the Xbox One wasn’t at a major disadvantage to the PS4. Reading this article – or simply looking at the XDK change history – is quite enlightening I think.
Now Microsoft decided to unify PC and Xbox development under the same umbrella and called the SDK and related toolkit the GDK, the Game Development Kit.
However, to understand how performance is impacted on the Xbox platform, we need to look at the full software stack which is – in a simplified way – below:
- Firmware and microcode (we don’t strictly call this software but it is code that does get updated and impacts performance)
- Hypervisor (as both the game and the dashboard are virtualised)
- Operating System or OS (the dashboard which is constant and the game which loads its own instance of a cut-down OS) – we COULD argue that the game OS is part of the GDK.
- Graphics Drivers (this is bundled per game on console for stability but is system-wide for PC) – we COULD argue this is part of the GDK on console but not on PC
- Various APIs (GDK)
- Development tools (GDK)
Considering Microsoft did not have the chip taped out in its final form until sometime in 2020 and the full stack needed to be re-written for the new platform, we can very easily conclude that the stack is not yet fully mature and optimised.
- Firstly the original Xbox One needed years to fully mature with less new complex technologies within the chips.
- AMD has just released their RDNA2 based drivers for their own cards. Graphics drivers usually improve greatly in both stability and performance during a generation of cards – for both Nvidia and AMD.
- Microsoft didn’t only need to adapt AMD’s graphics drivers to the Xbox Platform, they needed to re-write / adapt the whole software stack above. Ouch!
So what kind of issues can this throw up?
- Inefficiencies and errors in the code causing stalls, performance issues, crashes, etc. We saw this early on and while I think major issues have been fixed, performance optimisation of a new software stack takes time: not a few months, but years, as there are so many new technologies to integrate and optimise.
- Developers have to move their code to a new SDK, and re-tooling takes time and is error-prone. To add insult to injury, the GDK is not compatible with earlier versions of Visual Studio, so if a developer has not yet made the move to the new tools, it adds extra complexity and possible compile issues. However, the performance discrepancy cannot be explained by this point alone. This is why I wanted to write this article: blaming the tools alone is a gross over-simplification of the situation, to put it mildly.
- A fixed console is NOT a PC. The main advantage of a console is that it is fixed hardware and it will behave exactly the same across all its instances and across multiple runs of code. It is extremely consistent. When a game is running on PC, it needs to expect multiple performance profiles, environments, etc. What this means is that developer tools for a fixed platform can implement very specific codepaths and optimisations in a way that maximises platform performance. Since Microsoft only completed the GDK in a stable form in June 2020, I wonder whether they have had time to implement any Xbox-specific optimisations or whether they will treat it like a PC.
The last point is a crucial one, as I think this could be why developers are saying that the new GDK is not as easy to develop with. Microsoft is a software company and has a tendency to enable flexibility with its tools. Unfortunately, flexibility oftentimes means they leave it up to the developer to figure out the performance profile of the platform and let them optimise to their heart’s content.
However, GENERALISING the GDK is a dangerous path to take when Sony has doubled down on SPECIALISING their SDK for the PS5’s performance profile. It can mean that getting max performance from the PS5 requires less work, rather than more, partly because of the hardware cache coherency and partly because of SDK specialisation.
I seriously think – as I said earlier – that Microsoft needs to get back to computer modelling and start developing specialisation of the Xbox Series X|S performance profiles such as semi-automatic thread management and updated caching policies on both the CPU and GPU to improve their IPC performance.
I fear that focusing on cross-platform development may have hurt the Xbox in the short term, although I have no doubt it will become an advantage in the long-term, should Microsoft develop the Xbox Series X|S code paths for maximum performance and do that with as little intervention and laborious optimisation necessary on the developer’s behalf as possible.
GPU – RDNA2 Advantage?
Now let’s look at the GPU features of each console one by one and see what advantage next-generation games could see in each.
Mesh Shaders versus Geometry Engine
We don’t have a lot of information about relative performance between the two implementations. However, we can speculate.
At worst the Geometry Engine in PS5 is simply another name for mesh shading and it is simply a rebranding. At best, Sony tweaked the mesh shaders in RDNA2 just like they did caching and they will achieve the same or higher throughput than Xbox Series X.
We can be sure of one thing however, the cache coherency they implemented will also aid mesh shading / the Geometry Engine so we will likely not see a massive performance differential for this feature.
However, if Sony tweaked the Geometry Engine ON TOP OF improving caching performance, then Microsoft will have to work extra hard to maintain the performance lead when mesh shading is fully utilised.
I think you can all see where this is going…
Variable Rate Shading (VRS)
As far as we know, PS5 does not have VRS level 2 feature. However, VRS can be implemented without dedicated hardware that aids it. I would refer you to this interview with Ori developer, Moon Studios, where they describe how they implemented VRS on all platforms without using the specific VRS API calls. Granted, their use case was relatively easy due to how they sliced the image up, but similar techniques can be implemented in software in other game engines.
VRS Level 2 on Xbox Series X can improve performance by between 5% and 30% depending on the use case. Some scenes in a game will benefit more than others. However, in MOST instances it will result in SOME image quality degradation. The aim is to restrict this degradation to parts of the image where it is not visible – dark, fast-moving, blurry parts. It is, however, like using lossy compression on an image.
Now, for example in Dirt 5, the developer has applied it to parts of the image that are darker AND that are moving so fast that detail is difficult to see. However, even with VRS enabled, and on title update 2.0, the PS5 is neck and neck in performance and has slightly better image clarity (while the Xbox is a few frames ahead, so it’s pretty much a wash).
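To illustrate the principle, here is a small sketch of how a per-tile shading rate might be chosen and what it saves. It is not the DirectX VRS API and not any shipped game's heuristic. The thresholds, the tile data and the function names are all assumptions made up for this example.

```python
def pick_shading_rate(luminance: float, motion_px: float) -> int:
    """Return the coarse pixel edge length: 1 = full rate, 2 = one shade per 2x2 block."""
    dark = luminance < 0.15   # assumed threshold on a 0..1 luminance scale
    fast = motion_px > 8.0    # assumed per-frame motion threshold in pixels
    return 2 if (dark and fast) else 1


def shading_cost(tiles) -> float:
    """Fraction of full-rate pixel shader work left after coarsening."""
    cost = sum(1.0 / pick_shading_rate(lum, motion) ** 2 for lum, motion in tiles)
    return cost / len(tiles)


# Hypothetical screen tiles as (luminance, motion in pixels per frame).
tiles = [(0.05, 12.0), (0.10, 20.0), (0.60, 15.0), (0.08, 1.0)]
print(f"pixel shader work remaining: {shading_cost(tiles):.1%}")
```

Only the two tiles that are both dark and fast-moving are coarsened here, so the saving depends entirely on how much of a given scene qualifies. That matches the wide 5% to 30% range quoted above.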
So it does look like VRS was used on this title to offset the performance disadvantage of the game code running on the Xbox Series X, because of the possible compounding issues detailed previously.
If we assume for a second that Xbox could get the CPU and GPU closer to its performance targets without serious optimisation work on behalf of the developer, and that the developer is not going to bother implementing VRS in software on PS5, then yes, VRS has the possibility to provide a 5-30% performance boost over PS5. At the moment, it seems like it is used to offset the performance disparity – at least in Dirt 5.
Sampler Feedback Streaming (SFS)
Well, Sampler Feedback Streaming is an interesting one because we don’t know if Sony has an equivalent OR if they are simply relying on brute force between the SSD and memory to try and stream textures in on the fly.
A few things are for certain:
- SFS in “lab demos” can reduce memory usage by up to 10x (so think of it as a 10x memory multiplier compared to a solution cobbled together in software). Sony most certainly does not have an SSD interface that is 10x as fast as Microsoft’s. Even if both were quoting sustained speed – which they are not – Sony’s interface is only about 2x faster. However, in practice it looks to be only 1.5x the raw speed, judging from load times on equivalent software. Now the question is, can a skilled developer reduce the memory footprint advantage to around 2x, as opposed to 10x? (See UE5 below for a possible answer.)
- Keep in mind both companies implemented hardware compression and the algorithms – no matter what fancy name they give them – are pretty close. Even a 50% improvement on state of the art compression maths takes 5+ years to achieve and both use state of the art. Anyone saying one has a multiplier of 2x over the other with regards to data compression does not understand data compression. Keep in mind, the hardware may well allow the developers a lot of flexibility in updating the compression algorithms they use over the generation as learnings from previous generations show that it is advantageous to allow this flexibility.
- Not having to wait for textures to stream in allows the GPU to continue working and it can be the difference between a stall or no stall. However, all modern game engines account for this and have a lower res texture available ready in memory. The only difference here is that the transition is smoothed over using a new algorithm and the texture pop-in is minimised to a frame as opposed to multiple frames. Useful? Sure! Revolutionary? We shall see.
- UE5 was shown to be running on the PS5 and it was using ultra-high-resolution textures streaming in at 30fps without a hitch! So whether Sony has an SFS solution or not, it is obvious that Epic has figured out a way to do this consistently using the PS5 hardware. Now does this mean that UE5 cannot benefit from SFS even more, and would it reduce memory footprint and therefore allow for higher detail / resolution of textures, etc? Well, it is all possible, but rather theoretical at this point, and maybe even Microsoft won’t know the answer until they see the games running. After all, would we be hitting other limits within the GPU that would bottleneck this at reasonable frame rates?
So I think SFS is likely the odd one out. It is not a proven technology and we don’t know how performant the best software implementation is versus hardware assist. Also, it’s new technology so will take a while to figure out.
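The core idea behind the memory saving is easy to sketch: record which parts of a texture were actually sampled, then keep only those parts resident. The code below is a toy model of that idea, not Microsoft's implementation. The 64KB tile size, the 16x16 tile grid, the sample positions and both helper names are assumptions for illustration.

```python
TILE_KB = 64  # assumed size of one texture tile, for illustration


def tiles_touched(samples, tiles_per_side):
    """Map UV samples in [0, 1) to the set of (x, y) tiles they land in."""
    return {(int(u * tiles_per_side), int(v * tiles_per_side)) for u, v in samples}


def resident_kb(tile_count: int) -> int:
    """Memory needed to keep `tile_count` tiles resident."""
    return tile_count * TILE_KB


TILES_PER_SIDE = 16  # hypothetical texture made of 16 x 16 tiles
samples = [(0.10, 0.10), (0.12, 0.11), (0.50, 0.50), (0.51, 0.49)]

touched = tiles_touched(samples, TILES_PER_SIDE)
full = resident_kb(TILES_PER_SIDE * TILES_PER_SIDE)
streamed = resident_kb(len(touched))
print(f"whole texture: {full} KB, sampled tiles only: {streamed} KB")
```

How close a real game gets to the lab figure depends on how much of each texture ends up on screen, which is exactly the part nobody outside Microsoft can judge yet.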
Additionally, it looks like Sony has implemented a completely new chip for the SSD controller, which might well house similar technology – or uses brute force to achieve the same.
You can read more about SFS here from Microsoft if you are interested.
Ray Tracing
Ray Tracing hardware seems to be matched between the two, with one Ray Accelerator per CU. However, since the Xbox Series X has more CUs, it also has more Ray Accelerators, albeit running at a lower clock speed. From tests it would seem that, although cache coherency has some effect on ray tracing, the effect is not as large as with the rest of the pipeline, so ray tracing will likely be more performant on Xbox Series X.
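A rough way to compare the two is to multiply the number of Ray Accelerators by the clock speed. This assumes throughput scales linearly with both, which ignores caching and memory effects entirely. The clocks used are the publicly quoted 1.825GHz and 2.23GHz (max), and `relative_rt_rate` is a hypothetical helper.

```python
def relative_rt_rate(cus: int, clock_ghz: float, accelerators_per_cu: int = 1) -> float:
    """Unitless ray tracing throughput proxy: accelerators multiplied by clock."""
    return cus * accelerators_per_cu * clock_ghz


xsx_rt = relative_rt_rate(52, 1.825)
ps5_rt = relative_rt_rate(36, 2.23)
advantage = xsx_rt / ps5_rt - 1
print(f"Series X theoretical ray tracing advantage: {advantage:.0%}")
```

On paper this gives the Series X a lead in the high teens, in line with its overall compute advantage.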
Additionally, AI/ML can be utilised for faster and better de-noising required for ray-tracing so if Microsoft is able to implement a technique that runs on INT8 or INT4, then they would have an additional speed advantage here. See more on this under AI/ML below.
GPU – MACHINE LEARNING & AI (AI/ML)
So now we come to the Xbox Series X’s trump card and in fact the only major customisation beyond RDNA2 Microsoft did to the GPU that we know of.
A consumer GPU executes shader instructions in single precision floating point format, called FP32. AI/ML operations, however, normally use FP16 (half precision) or more often integer operations (INT8 and INT4).
The PS5 is able to do FP32 and FP16. However, Microsoft customised the Xbox Series X silicon to also perform INT8 and INT4 operations, giving the following speeds:
- FP32: 12 TFLOPS
- FP16: 12 x 2 = 24 TFLOPS
- INT8: 12 x 4 = 48 TOPS
- INT4: 12 x 8 = 96 TOPS
So the Xbox is able to perform AI/ML operations up to 8x as fast as FP32 and up to 4x as fast as FP16 (and therefore the PS5). The reason this is significant is that it opens the door to technologies like Nvidia’s Deep Learning Super Sampling (a type of Super Resolution algorithm), which means the Xbox can render a scene at a lower resolution and then upscale it to a higher resolution (e.g. 1080p to 4K). This could be used for backwards compatibility but also for games to run with full ray tracing at 60 FPS without cutting back on ray tracing quality.
This is also significant because it could allow Microsoft not to have to release a mid-generation refresh at the same time as Sony and still stay competitive. Even if Sony were to introduce this feature in a PS5 Pro, for example, they would need at least a year to ship games utilising such a feature – by which time Microsoft can make its own move. If Microsoft invests money here, and plays its cards right, this can be an ace up their sleeve.
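Both sets of numbers in this section are simple multiples, as the short sketch below shows. It uses the article's rounded 12TF figure and the packing factors listed above. The variable names are just labels for this example.

```python
FP32_TFLOPS = 12.0  # rounded FP32 peak used in this article

# Operations per lane relative to FP32, as listed above.
SPEEDUP = {"FP32": 1, "FP16": 2, "INT8": 4, "INT4": 8}
throughput = {fmt: FP32_TFLOPS * factor for fmt, factor in SPEEDUP.items()}

for fmt, rate in throughput.items():
    unit = "TFLOPS" if fmt.startswith("FP") else "TOPS"
    print(f"{fmt}: {rate:.0f} {unit}")

# Rendering at 1080p and upscaling to 4K shades a quarter of the pixels.
pixels_4k = 3840 * 2160
pixels_1080p = 1920 * 1080
upscale_ratio = pixels_4k / pixels_1080p
print(f"4K has {upscale_ratio:.0f}x the pixels of 1080p")
```

The 4x gap in pixel count is what makes upscaling attractive: the cost of running the upscaling network has to be smaller than the rendering work it saves.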
Of course, the whole mid-generation refresh is very much up in the air for this generation of consoles, while also not nearly as necessary, but that is for another article.
AI/ML can be used to enable a whole host of other applications from better enemy AI, reduction of texture data both on disk and in memory, better de-noising algorithms for ray tracing to increase its quality and many other applications we haven’t even thought of. The sky is really the limit here. The question is how much Microsoft is willing to invest into Research & Development for AI/ML.
So is the Xbox Series X the world’s most powerful console? In theory yes, in practice not right now. Sony’s clever hardware design means that the PS5 is ever so slightly more performant for current titles, with less developer optimisation necessary than on Xbox Series X. However, no two consoles have ever been so close in performance terms.
While the Xbox Series X could become the more powerful console because more of its theoretical performance is still untapped – it currently achieves less of its potential IPC – that is going to depend on how much investment Microsoft is willing to make into not only optimising the full software stack but also going beyond that to enable developers to easily and fully utilise the hardware under the hood.
As I said, context is important. In a world where the Xbox Series X is the more performant console, the GDK with its cross PC / Xbox development made a lot of sense. It made a lot of sense because the Xbox Series X had a lot of performance margin to play with: up to 20% more if we look at it theoretically.
However, in a world where the Xbox Series X is pretty much on par with the PS5, and in fact has work to do to achieve on-par performance consistently, a GDK that is more generalised as opposed to specialised can become a liability if not handled correctly.
For the GDK not to become a liability, Microsoft might need to go beyond traditional optimisations, and also needs to ask whether contributing first-party optimisations to game engines like UE5 is worthwhile to allow third-party developers to reach their performance targets more easily.
It also raises another question: now that Microsoft has so many first-party studios, is it worth having each studio re-develop the same code and optimisations again and again for each game engine and title, or should Microsoft actually start their own unified game engine development, just like EA did with Frostbite? This would allow their first-party studios to focus on creating content as opposed to re-inventing the wheel again and again, especially as the wheel is becoming so crazily complex. After all, they now own some of the best game engines in the world and could easily compete with UE5 and Frostbite: id Tech, ForzaTech, and Halo’s Slipspace Engine, developed by some of the same engineers who wrote the DirectX graphics API (the Xbox name comes from this API).
One strategy could be to have the Coalition continue to work on Unreal Engine, but heavily contributing code back to the code base for third party developers to use, while the other studios start contributing code to a common game engine runtime. This could give Microsoft a leg up on Sony and could allow content to take centre-stage, as opposed to the (continuous re-invention of) technology.
Although Sony’s hardware is not anywhere near fully tapped out, especially with regards to next-gen features, its IPC is definitely a lot closer to its ceiling due to a more performant software stack and very clever hardware optimisations. There is only so much Sony can do from here with regards to IPC, but I very much doubt Sony is going to stop here either.
Whatever the case may be, this will be one hell of a generation to watch and both consoles will surely impress as the generation heats up.