Do the MMX registers always exist in modern processors?

When I look at diagrams and overviews of recent processors[1], I never see mention of the MMX registers MM0 - MM7. But from the specs, it seems like they still exist. Can one depend on them being present in all processors that support SSE? Do they conflict with anything other than the even older FPU stack? Are they the same physical registers as the general 64-bit ones?

While XMM and YMM are much better for vectors, I occasionally want to use the MMX registers for stashing values that would otherwise spill to the stack. Speedwise this looks a little better, and also there are times when I want to avoid additional stores and loads.

[1] http://www.realworldtech.com/haswell-cpu/

Nathan Kurz asked Jun 07 '13 09:06


2 Answers

SSE1 implies MMX, so yes. Supporting x86-64 also guarantees MMX, because SSE2 is baseline for x86-64.

They alias the low 64 bits (the significand) of the 80-bit x87 regs, not the general-purpose integer registers! Long mode doesn't change anything about how MMX works.

All modern CPUs are 64-bit capable and thus have MMX available in all modes. Even 32-bit only embedded AMD Geode CPUs have MMX (but not SSE).


It's pretty rare that MMX is worth using when you have 16x XMM regs + 16x 64-bit GP regs. Store/reload is not terrible, especially if the reload can use a memory source operand.

Extra ALU uops to move data to/from MMX regs are usually not worth it vs. store/reload. Reload can often be micro-fused as a memory source operand, and the ALU execution port pressure can easily be a problem.

If you were doing something special with cache disabled then sure, but normally store-forwarding makes store/reload efficient if you can keep it off the critical path. (It does have ~5 cycle latency).

If you do want to move data between XMM and GP regs, though, typically movd / movq or pinsrd / pextrd are a good choice, not store/reload. I'm saying that a spill/reload of a GP or XMM reg in an outer loop is usually better than 2x movq or movq2dq xmm0, mm0.

In fact on Skylake, one movq2dq costs 2 uops. Same for movdq2q. (movq to/from GP regs is still only 1 uop, though, with the same port 0 or port 5 limitation as transfers between XMM and GP regs).
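
As a rough illustration of the movq transfer between GP and XMM registers mentioned above, here is a minimal C sketch using SSE2 intrinsics. The helper name xmm_roundtrip is hypothetical, and the compiler is free to choose registers or drop the round trip entirely when optimizing, so treat it as a way to see which instructions get generated, not as a guaranteed register stash.

```c
#include <stdint.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: move a 64-bit integer into an XMM register and back. */
static int64_t xmm_roundtrip(int64_t value)
{
#if defined(__x86_64__) && defined(__SSE2__)
    __m128i v = _mm_cvtsi64_si128(value);  /* movq xmm, r64 */
    return _mm_cvtsi128_si64(v);           /* movq r64, xmm */
#else
    return value;  /* not an x86-64 SSE2 target: nothing to demonstrate */
#endif
}
```

Calling xmm_roundtrip(42) returns 42; inspecting the compiler's assembly output at low optimization levels shows the two movq instructions.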


Plus, using MMX in a function costs you an emms instruction at the end of it (or before any function call if you want to be ABI compliant). The MMX regs are all call-clobbered in normal calling conventions (and in fact the FPU has to be in x87 state instead of MMX state).
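
To make the emms cost concrete, here is a minimal C sketch of parking a value in an MMX register with intrinsics. The helper name stash_roundtrip is hypothetical, and it assumes an x86-64 compiler that ships mmintrin.h (GCC and Clang do). The compiler may pick any MMX register or optimize the round trip away, so this only shows the shape of the code, including the mandatory cleanup.

```c
#include <stdint.h>

#if defined(__x86_64__) && defined(__MMX__)
#include <mmintrin.h>   /* MMX intrinsics: __m64, _mm_empty */
#endif

/* Hypothetical helper: park a 64-bit value in an MMX register, read it back,
   then leave MMX state so later x87 code sees an empty register stack. */
static int64_t stash_roundtrip(int64_t value)
{
#if defined(__x86_64__) && defined(__MMX__)
    __m64 parked = _mm_cvtsi64_m64(value);       /* movq mm, r64 */
    int64_t restored = _mm_cvtm64_si64(parked);  /* movq r64, mm */
    _mm_empty();                                 /* emms */
    return restored;
#else
    return value;  /* no MMX on this target: nothing to demonstrate */
#endif
}
```

The _mm_empty() call is the extra instruction this paragraph is talking about: every path out of a function that touched MMX state needs it before any x87 code or ABI-compliant call.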


MMX is definitely not as efficient as XMM on modern CPUs. Actually using it for anything other than storage is usually worse than SSE2 (with movq loads/stores and ignoring the high bytes of XMM regs, if you want to work in 64-bit chunks).

For example, on Intel/AMD CPUs with mov-elimination for movaps xmm,xmm, MMX register-copy with movq mm1, mm0 still costs an ALU uop and still has 1 cycle of latency. (Both still cost a uop for the front-end; mov-elimination only removes the latency and back-end cost other than the ROB entry.)

Also, Skylake has better throughput for the XMM version of some instructions than for the MMX version. e.g. paddb/w/d/q mm,mm runs on p05, but paddb/w/d/q xmm,xmm runs on p015. Many other operations, like pavg*, pmadd*, and shifts, can run on p01 for XMM regs, but only port 0 for MMX regs. (https://agner.org/optimize/)

So like the x87 FPU, it's still supported for legacy code, but fewer execution units support it. It's not terrible yet, so software like x264 and FFmpeg that still has significant amounts of MMX code for stuff that naturally works in 64-bit chunks doesn't suffer too badly.

128-bit AVX versions of integer instructions would probably be the best bet in many cases to avoid register-copy mov instructions.
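
As a sketch of the SSE2 approach described above (movq loads/stores, ignoring the high half of the XMM register), the following C function adds eight bytes at a time. The function name add_bytes_8 is hypothetical, and the scalar fallback is only there so the example builds on targets without SSE2.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for 8 bytes, wrapping on overflow. */
static void add_bytes_8(uint8_t dst[8], const uint8_t a[8], const uint8_t b[8])
{
#if defined(__SSE2__)
    __m128i va  = _mm_loadl_epi64((const __m128i *)a);  /* movq xmm, [mem]; high half zeroed */
    __m128i vb  = _mm_loadl_epi64((const __m128i *)b);
    __m128i sum = _mm_add_epi8(va, vb);                 /* paddb xmm, xmm */
    _mm_storel_epi64((__m128i *)dst, sum);              /* movq [mem], xmm; low 8 bytes only */
#else
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
#endif
}
```

No emms is needed here because no MMX state is touched. When built with AVX enabled, compilers emit the VEX-encoded 3-operand forms of the same instructions.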

Peter Cordes answered Oct 07 '22 00:10


The best "diagrams and overviews" to look at are always in the manual. In this case you'll find lots of information on MMX technology and the subsequent SSE (Streaming SIMD Extensions) starting in Section 5.4 of the Intel Manual; that's p.122 in the 4-volume set's PDF. To get deeper into programming with MMX, you'll want to start in Section 9.2 (p.228). Personally I really like Intel's "C++ Compiler for Linux* Intrinsics Reference" for learning more than you may ever need to know about MMX. Here's a copy: https://www.cs.fsu.edu/~engelen/courses/HPC-adv/intref_cls.pdf

Can one depend on them being present in all processors that support SSE?

Yes. SSE means MMX is present. As mentioned in the comments, you'll want to use the CPUID intrinsic to check:

CPUID.01H:EDX.MMX[bit 23] = 1

Or just keep in mind that MMX came out in 1997, and this question was posted in 2013 (edited in 2014), so...
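
A minimal sketch of that CPUID check in C, assuming a GCC or Clang toolchain that provides the cpuid.h header and its __get_cpuid helper. The function names cpuid1_edx_bit, cpu_has_mmx, cpu_has_sse and cpu_has_sse2 are hypothetical; the bit positions (MMX = 23, SSE = 25, SSE2 = 26 in CPUID.01H:EDX) are the architectural ones.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (toolchain assumption) */
#endif

/* Returns 1 if the given bit of CPUID.01H:EDX is set, 0 if it is clear,
   and -1 if leaf 1 cannot be queried (for example on a non-x86 build). */
static int cpuid1_edx_bit(unsigned bit)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return (int)((edx >> bit) & 1u);
#else
    (void)bit;
    return -1;
#endif
}

static int cpu_has_mmx(void)  { return cpuid1_edx_bit(23); }
static int cpu_has_sse(void)  { return cpuid1_edx_bit(25); }
static int cpu_has_sse2(void) { return cpuid1_edx_bit(26); }
```

On any x86-64 machine all three should return 1, since SSE2 is part of the x86-64 baseline.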

Do they conflict with anything other than the even older FPU stack?

No, but that is strange, isn't it? The MMX state is aliased to the x87 FPU state. The reason is to avoid compatibility problems with the context-switch mechanisms in existing operating systems. Unlike the FPU registers, which operate as a stack, the MMX registers are directly addressable, so maybe that's why you are drawn to them. Plus, they were designed to work on packed data types! However, this aliasing makes it difficult to work on floating-point and MMX SIMD data in the same application.

Are they the same physical registers as the general 64-bit ones?

This question was a little confusing. When you say the general 64-bit ones, do you mean the 16 general-purpose registers in an x64 machine, or the eight 80-bit FPU data registers, which operate like a stack? The MMX registers are separate from the general-purpose registers, but they are NOT separate from the x87 FPU data register stack. The Intel Manual acknowledges how misleading these MMX registers are by saying:

Although MMX registers are defined in the IA-32 architecture as separate registers, they are aliased to the registers in the FPU data register stack (R0 through R7)
-Section 9.2.2, p.229

There are 8 MMX registers (64 bits each), so as you can tell there are a lot of registers for you to use! The confusing part is that instructions that save and restore the x87 state also handle the MMX state.

When an MMX instruction (other than the EMMS instruction) is executed, the processor changes the x87 FPU state as follows:

• The TOS (top of stack) value of the x87 FPU status word is set to 0.

• The entire x87 FPU tag word is set to the valid state (00B in all tag fields).

• When an MMX instruction writes to an MMX register, it writes ones (11B) to the exponent part of the corresponding floating-point register (bits 64 through 79).

-Section 9.6.2, p.235 Intel Manual.

It may be worth noting that when a value is loaded into the x87 data registers, it is automatically converted to double extended-precision floating-point format (p.194 Intel Manual). Just know that after MMX instructions run, every tag is marked valid and the exponent bits of any written register are all ones, so x87 floating-point instructions can behave strangely until EMMS is executed.

Robert Houghton answered Oct 07 '22 02:10