x86 Instruction Set

Post by **Neo** » Thu Aug 20, 2009 9:57 pm

Original 8086/8088 instructions
Instruction	Meaning	Notes
AAA	ASCII adjust AL after addition	used with unpacked binary coded decimal
AAD	ASCII adjust AX before division	buggy in the original instruction set, but "fixed" in the NEC V20, causing a number of incompatibilities
AAM	ASCII adjust AX after multiplication
AAS	ASCII adjust AL after subtraction
ADC	Add with carry
ADD	Add
AND	Logical AND
CALL	Call procedure
CBW	Convert byte to word
CLC	Clear carry flag
CLD	Clear direction flag
CLI	Clear interrupt flag
CMC	Complement carry flag
CMP	Compare operands
CMPSB	Compare bytes in memory
CMPSW	Compare words
CWD	Convert word to doubleword
DAA	Decimal adjust AL after addition	(used with packed binary coded decimal)
DAS	Decimal adjust AL after subtraction
DEC	Decrement by 1
DIV	Unsigned divide
ESC	Used with floating-point unit
HLT	Enter halt state
IDIV	Signed divide
IMUL	Signed multiply
IN	Input from port
INC	Increment by 1
INT	Call to interrupt
INTO	Call to interrupt if overflow
IRET	Return from interrupt
Jxx	Jump if condition	(JA, JAE, JB, JBE, JC, JCXZ, JE, JG, JGE, JL, JLE, JNA, JNAE, JNB, JNBE, JNC, JNE, JNG, JNGE, JNL, JNLE, JNO, JNP, JNS, JNZ, JO, JP, JPE, JPO, JS, JZ)
JMP	Jump
LAHF	Load flags into AH register
LDS	Load pointer using DS
LEA	Load Effective Address
LES	Load ES with pointer
LOCK	Assert BUS LOCK# signal	(for multiprocessing)
LODSB	Load byte
LODSW	Load word
LOOP/LOOPx	Loop control	(LOOPE, LOOPNE, LOOPNZ, LOOPZ)
MOV	Move
MOVSB	Move byte from string to string
MOVSW	Move word from string to string
MUL	Unsigned multiply
NEG	Two's complement negation
NOP	No operation	opcode 0x90, the same encoding as XCHG AX, AX (XCHG EAX, EAX in 32-bit code)
NOT	Negate the operand, logical NOT
OR	Logical OR
OUT	Output to port
POP	Pop data from stack	(POP CS, opcode 0x0F, works only on 8086/8088)
POPF	Pop data into flags register
PUSH	Push data onto stack
PUSHF	Push flags onto stack
RCL	Rotate left (with carry)
RCR	Rotate right (with carry)
REPxx	Repeat CMPS/MOVS/SCAS/STOS	(REP, REPE, REPNE, REPNZ, REPZ)
RET	Return from procedure
RETN	Return from near procedure
RETF	Return from far procedure
ROL	Rotate left
ROR	Rotate right
SAHF	Store AH into flags
SAL	Shift Arithmetically left (signed shift left)
SAR	Shift Arithmetically right (signed shift right)
SBB	Subtraction with borrow
SCASB	Compare byte string
SCASW	Compare word string
SHL	Shift left (unsigned shift left)
SHR	Shift right (unsigned shift right)
STC	Set carry flag
STD	Set direction flag
STI	Set interrupt flag
STOSB	Store byte in string
STOSW	Store word in string
SUB	Subtraction
TEST	Logical compare (AND)
WAIT	Wait until not busy	Waits until BUSY# pin is inactive (used with floating-point unit)
XCHG	Exchange data
XLAT	Table look-up translation
XOR	Exclusive OR
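
The DAA note above (packed binary coded decimal) is easier to follow with a small model. This is only a sketch in portable C of the documented ADD + DAA flag behaviour, not code from the original post; the type and function names (bcd_state, add_al, daa, bcd_add) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: AL plus the two flags that DAA consults. */
typedef struct {
    uint8_t al; /* accumulator low byte */
    int cf;     /* carry flag */
    int af;     /* auxiliary carry flag (carry out of bit 3) */
} bcd_state;

/* ADD AL, operand: plain binary add that records CF and AF. */
static bcd_state add_al(uint8_t al, uint8_t operand) {
    bcd_state s;
    unsigned sum = (unsigned)al + operand;
    s.al = (uint8_t)sum;
    s.cf = sum > 0xFF;
    s.af = ((al & 0x0F) + (operand & 0x0F)) > 0x0F;
    return s;
}

/* DAA: adjust AL so that it holds two valid BCD digits again. */
static bcd_state daa(bcd_state s) {
    uint8_t old_al = s.al;
    int old_cf = s.cf;
    if ((s.al & 0x0F) > 9 || s.af) {
        s.al = (uint8_t)(s.al + 0x06); /* fix the low digit */
        s.af = 1;
    } else {
        s.af = 0;
    }
    if (old_al > 0x99 || old_cf) {
        s.al = (uint8_t)(s.al + 0x60); /* fix the high digit */
        s.cf = 1;                      /* decimal carry out */
    } else {
        s.cf = 0;
    }
    return s;
}

/* ADD followed by DAA on two packed BCD bytes (two decimal digits each). */
static bcd_state bcd_add(uint8_t a, uint8_t b) {
    assert((a & 0x0F) <= 9 && (a >> 4) <= 9);
    assert((b & 0x0F) <= 9 && (b >> 4) <= 9);
    return daa(add_al(a, b));
}
```

With this model, bcd_add(0x19, 0x28) leaves AL = 0x47 with CF clear (19 + 28 = 47), and bcd_add(0x99, 0x01) leaves AL = 0x00 with CF set (99 + 1 = 100).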

Added with 80186/80188
Instruction	Meaning
BOUND	Check array index against bounds
ENTER	Enter stack frame
INS	Input from port to string
LEAVE	Leave stack frame
OUTS	Output string to port
POPA	Pop all general purpose registers from stack
PUSHA	Push all general purpose registers onto stack

Added with 80286
Instruction	Meaning
ARPL	Adjust RPL field of selector
CLTS	Clear task-switched flag in register CR0
LAR	Load access rights byte
LGDT	Load global descriptor table
LIDT	Load interrupt descriptor table
LLDT	Load local descriptor table
LMSW	Load machine status word
LOADALL	Load all CPU registers, including internal ones such as GDTR
LSL	Load segment limit
LTR	Load task register
SGDT	Store global descriptor table
SIDT	Store interrupt descriptor table
SLDT	Store local descriptor table
SMSW	Store machine status word
STR	Store task register
VERR	Verify a segment for reading
VERW	Verify a segment for writing

Added with 80386
Instruction	Meaning	Notes
BSF	Bit scan forward
BSR	Bit scan reverse
BT	Bit test
BTC	Bit test and complement
BTR	Bit test and reset
BTS	Bit test and set
CDQ	Convert double-word to quad-word	Sign-extends EAX into EDX, forming the quad-word EDX:EAX. Since IDIV takes EDX:EAX as its dividend, execute CDQ after loading EAX and before IDIV unless EDX is set explicitly (as in a 64/32 division). For unsigned DIV, zero EDX instead.
CMPSD	Compare string double-word	Compares ES:[(E)DI] with DS:[(E)SI]
CWDE	Convert word to double-word	Unlike CWD, CWDE sign-extends AX to EAX instead of AX to DX:AX
INSB, INSW, INSD	Input from port to string with explicit size	same as INS
IRETx	Interrupt return; D suffix means 32-bit return, F suffix means do not generate epilogue code (i.e. LEAVE instruction)	Use IRETD rather than IRET in 32-bit situations
JCXZ, JECXZ	Jump if register (E)CX is zero
LFS, LGS	Load far pointer
LSS	Load stack segment
LODSW, LODSD	Load string	can be prefixed with REP
LOOPW, LOOPD	Loop	counter register is CX (LOOPW) or ECX (LOOPD)
LOOPEW, LOOPED	Loop while equal
LOOPZW, LOOPZD	Loop while zero
LOOPNEW, LOOPNED	Loop while not equal
LOOPNZW, LOOPNZD	Loop while not zero
MOVSW, MOVSD	Move data from string to string
MOVSX	Move with sign-extend
MOVZX	Move with zero-extend
POPAD	Pop all double-word (32-bit) registers from stack	Does not pop register ESP off of stack
POPFD	Pop data into EFLAGS register
PUSHAD	Push all double-word (32-bit) registers onto stack
PUSHFD	Push EFLAGS register onto stack
SCASD	Scan string data double-word
SETA, SETAE, SETB, SETBE, SETC, SETE, SETG, SETGE, SETL, SETLE, SETNA, SETNAE, SETNB, SETNBE, SETNC, SETNE, SETNG, SETNGE, SETNL, SETNLE, SETNO, SETNP, SETNS, SETNZ, SETO, SETP, SETPE, SETPO, SETS, SETZ	Set byte to one if the condition holds, to zero otherwise
SHLD	Shift left double-word
SHRD	Shift right double-word
STOSx	Store string
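
To make the CDQ note concrete, here is a rough C model of CDQ, CWDE and a 32-bit IDIV. It is a sketch only: reg_pair, cdq, cwde and idiv32 are hypothetical names, and the model assumes the usual two's complement conversion between uint64_t and int64_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the EDX:EAX register pair. */
typedef struct {
    uint32_t eax;
    uint32_t edx;
} reg_pair;

/* CDQ: copy the sign bit of EAX into every bit of EDX. */
static reg_pair cdq(uint32_t eax) {
    reg_pair r;
    r.eax = eax;
    r.edx = (eax & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    return r;
}

/* CWDE: sign-extend AX into EAX (DX is not touched, unlike CWD). */
static uint32_t cwde(uint16_t ax) {
    return (ax & 0x8000u) ? (0xFFFF0000u | ax) : ax;
}

/* IDIV r/m32: signed EDX:EAX divided by divisor.
 * Quotient goes to EAX, remainder to EDX. A real CPU raises #DE for a
 * zero divisor or a quotient that does not fit in 32 bits; this model
 * only checks the zero case and leaves overflow to the caller. */
static reg_pair idiv32(reg_pair in, int32_t divisor) {
    assert(divisor != 0);
    int64_t dividend = (int64_t)(((uint64_t)in.edx << 32) | in.eax);
    reg_pair out;
    out.eax = (uint32_t)(dividend / divisor); /* truncates toward zero */
    out.edx = (uint32_t)(dividend % divisor);
    return out;
}
```

Dividing -7 by 2: with CDQ first, EDX:EAX is 0xFFFFFFFF:0xFFFFFFF9 and the model gives EAX = 0xFFFFFFFD (-3), EDX = 0xFFFFFFFF (-1). If EDX is left at zero instead, the same bits in EAX are read as 4294967289 and the quotient comes out as 0x7FFFFFFC, which is the mistake the note warns about.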

Added with 80486
Instruction	Meaning	Notes
BSWAP	Byte Swap	Only works for 32 bit registers
CMPXCHG	CoMPare and eXCHanGe
INVD	Invalidate Internal Caches
INVLPG	Invalidate TLB Entry
WBINVD	Write Back and Invalidate Cache
XADD	Exchange and Add
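
For reference, BSWAP and CMPXCHG are simple enough to model in a few lines of C. This is an illustrative sketch with made-up function names (bswap32, cmpxchg32); the CMPXCHG model is not atomic, whereas the real instruction with a LOCK prefix is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* BSWAP r32: reverse the byte order of a 32-bit value
 * (little-endian <-> big-endian conversion). */
static uint32_t bswap32(uint32_t x) {
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* CMPXCHG r/m32, r32 (non-atomic model).
 * Compares *eax with *dest. If they are equal, src is stored to *dest
 * and ZF = 1; otherwise *dest is loaded into *eax and ZF = 0.
 * Returns the resulting ZF. */
static int cmpxchg32(uint32_t *dest, uint32_t *eax, uint32_t src) {
    assert(dest != NULL && eax != NULL);
    if (*dest == *eax) {
        *dest = src;
        return 1;
    }
    *eax = *dest;
    return 0;
}
```

bswap32(0x12345678) gives 0x78563412. A typical CMPXCHG retry loop reloads the expected value from EAX whenever the function reports ZF = 0.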

Added with Pentium
Instruction	Meaning	Notes
CPUID	CPU IDentification	*See note below
CMPXCHG8B	CoMPare and eXCHanGe 8 bytes
RDMSR	ReaD from Model-Specific Register
RDTSC	ReaD Time Stamp Counter
WRMSR	WRite to Model-Specific Register
RSM	Resume operation of interrupted program	SMM [System Management Mode]

*The CPUID instruction was fully introduced with the Pentium processor. It was also added to later 80486 processors.

Added with Pentium MMX
Instruction	Meaning	Notes
RDPMC	Read the PMC [Performance Monitoring Counter]	Specified in the ECX register into registers EDX:EAX

Added with Pentium Pro
Conditional MOV: CMOVA, CMOVAE, CMOVB, CMOVBE, CMOVC, CMOVE, CMOVG, CMOVGE, CMOVL, CMOVLE, CMOVNA, CMOVNAE, CMOVNB, CMOVNBE, CMOVNC, CMOVNE, CMOVNG, CMOVNGE, CMOVNL, CMOVNLE, CMOVNO, CMOVNP, CMOVNS, CMOVNZ, CMOVO, CMOVP, CMOVPE, CMOVPO, CMOVS, CMOVZ, SYSENTER (SYStem call ENTER), SYSEXIT (SYStem call EXIT), RDPMC*, UD2
*RDPMC was introduced with the Pentium Pro processor and the Pentium processor with MMX technology

Added with AMD K6-2
SYSCALL, SYSRET (functionally equivalent to SYSENTER and SYSEXIT)

Added with SSE
MASKMOVQ, MOVNTPS, MOVNTQ, PREFETCHT0, PREFETCHT1, PREFETCHT2, PREFETCHNTA, SFENCE (for Cacheability and Memory Ordering)

Added with SSE2
CLFLUSH, LFENCE, MASKMOVDQU, MFENCE, MOVNTDQ, MOVNTI, MOVNTPD, PAUSE (for Cacheability)

Added with SSE3
LDDQU (for Video Encoding)
MONITOR, MWAIT (for thread synchronization; only on processors supporting Hyper-threading and some dual-core processors like Core 2, Phenom and others)

Added with Intel VT
VMPTRLD, VMPTRST, VMCLEAR, VMREAD, VMWRITE, VMCALL, VMLAUNCH, VMRESUME, VMXOFF, VMXON

Added with AMD-V
CLGI, SKINIT, STGI, VMLOAD, VMMCALL, VMRUN, VMSAVE (SVM instructions of AMD-V)

Added with x86-64
CMPXCHG16B (CoMPare and eXCHanGe 16 bytes), RDTSCP (ReaD Time Stamp Counter and Processor ID)

Added with SSE4a
LZCNT, POPCNT (POPulation CouNT) - advanced bit manipulation

x87 floating-point instructions

Original 8087 instructions

F2XM1, FABS, FADD, FADDP, FBLD, FBSTP, FCHS, FCLEX, FCOM, FCOMP, FCOMPP, FDECSTP, FDISI, FDIV, FDIVP, FDIVR, FDIVRP, FENI, FFREE, FIADD, FICOM, FICOMP, FIDIV, FIDIVR, FILD, FIMUL, FINCSTP, FINIT, FIST, FISTP, FISUB, FISUBR, FLD, FLD1, FLDCW, FLDENV, FLDENVW, FLDL2E, FLDL2T, FLDLG2, FLDLN2, FLDPI, FLDZ, FMUL, FMULP, FNCLEX, FNDISI, FNENI, FNINIT, FNOP, FNSAVE, FNSAVEW, FNSTCW, FNSTENV, FNSTENVW, FNSTSW, FPATAN, FPREM, FPTAN, FRNDINT, FRSTOR, FRSTORW, FSAVE, FSAVEW, FSCALE, FSQRT, FST, FSTCW, FSTENV, FSTENVW, FSTP, FSTSW, FSUB, FSUBP, FSUBR, FSUBRP, FTST, FWAIT, FXAM, FXCH, FXTRACT, FYL2X, FYL2XP1

Added in specific processors

Added with 80287
FSETPM

Added with 80387
FCOS, FLDENVD, FNSAVED, FNSTENVD, FPREM1, FRSTORD, FSAVED, FSIN, FSINCOS, FSTENVD, FUCOM, FUCOMP, FUCOMPP

Added with Pentium Pro
FCMOV variants: FCMOVB, FCMOVBE, FCMOVE, FCMOVNB, FCMOVNBE, FCMOVNE, FCMOVNU, FCMOVU
FCOMI variants: FCOMI, FCOMIP, FUCOMI, FUCOMIP

Added with SSE
FXRSTOR*, FXSAVE*
*Also supported on later Pentium IIs, which do not support SSE

Added with SSE3
FISTTP (x87 to integer conversion)

Undocumented instructions
FFREEP performs FFREE ST(i) and pop stack

SIMD instructions

MMX instructions (added with Pentium MMX)
EMMS, MOVD, MOVQ, PACKSSDW, PACKSSWB, PACKUSWB, PADDB, PADDD, PADDSB, PADDSW, PADDUSB, PADDUSW, PADDW, PAND, PANDN, PCMPEQB, PCMPEQD, PCMPEQW, PCMPGTB, PCMPGTD, PCMPGTW, PMADDWD, PMULHW, PMULLW, POR, PSLLD, PSLLQ, PSLLW, PSRAD, PSRAW, PSRLD, PSRLQ, PSRLW, PSUBB, PSUBD, PSUBSB, PSUBSW, PSUBUSB, PSUBUSW, PSUBW, PUNPCKHBW, PUNPCKHDQ, PUNPCKHWD, PUNPCKLBW, PUNPCKLDQ, PUNPCKLWD, PXOR

MMX+ instructions

added with Athlon

Same as the SSE SIMD Integer Instructions that operate on MMX registers.

EMMX instructions

added with 6x86MX from Cyrix, deprecated now

PAVEB, PADDSIW, PMAGW, PDISTIB, PSUBSIW, PMVZB, PMULHRW, PMVNZB, PMVLZB, PMVGEZB, PMULHRIW, PMACHRIW

3DNow! instructions

added with K6-2

FEMMS, PAVGUSB, PF2ID, PFACC, PFADD, PFCMPEQ, PFCMPGE, PFCMPGT, PFMAX, PFMIN, PFMUL, PFRCP, PFRCPIT1, PFRCPIT2, PFRSQIT1, PFRSQRT, PFSUB, PFSUBR, PI2FD, PMULHRW, PREFETCH, PREFETCHW

3DNow!+ instructions

added with Athlon

PF2IW, PFNACC, PFPNACC, PI2FW, PSWAPD

added with Geode GX

PFRSQRTV, PFRCPV

SSE instructions

added with Pentium III; also see the integer instructions added with Pentium III

SSE SIMD Floating-Point Instructions

ADDPS, ADDSS, CMPPS, CMPSS, COMISS, CVTPI2PS, CVTPS2PI, CVTSI2SS, CVTSS2SI, CVTTPS2PI, CVTTSS2SI, DIVPS, DIVSS, LDMXCSR, MAXPS, MAXSS, MINPS, MINSS, MOVAPS, MOVHLPS, MOVHPS, MOVLHPS, MOVLPS, MOVMSKPS, MOVNTPS, MOVSS, MOVUPS, MULPS, MULSS, RCPPS, RCPSS, RSQRTPS, RSQRTSS, SHUFPS, SQRTPS, SQRTSS, STMXCSR, SUBPS, SUBSS, UCOMISS, UNPCKHPS, UNPCKLPS

SSE SIMD Integer Instructions

ANDNPS, ANDPS, ORPS, PAVGB, PAVGW, PEXTRW, PINSRW, PMAXSW, PMAXUB, PMINSW, PMINUB, PMOVMSKB, PMULHUW, PSADBW, PSHUFW, XORPS

Instruction	Opcode	Meaning
MOVUPS xmm1, xmm2/m128	0F 10 /r	Move Unaligned Packed Single-Precision Floating-Point Values
MOVSS xmm1, xmm2/m32	F3 0F 10 /r	Move Scalar Single-Precision Floating-Point Values
MOVUPS xmm2/m128, xmm1	0F 11 /r	Move Unaligned Packed Single-Precision Floating-Point Values
MOVSS xmm2/m32, xmm1	F3 0F 11 /r	Move Scalar Single-Precision Floating-Point Values
MOVLPS xmm, m64	0F 12 /r	Move Low Packed Single-Precision Floating-Point Values
MOVHLPS xmm1, xmm2	0F 12 /r	Move Packed Single-Precision Floating-Point Values High to Low
MOVLPS m64, xmm	0F 13 /r	Move Low Packed Single-Precision Floating-Point Values
UNPCKLPS xmm1, xmm2/m128	0F 14 /r	Unpack and Interleave Low Packed Single-Precision Floating-Point Values
UNPCKHPS xmm1, xmm2/m128	0F 15 /r	Unpack and Interleave High Packed Single-Precision Floating-Point Values
MOVHPS xmm, m64	0F 16 /r	Move High Packed Single-Precision Floating-Point Values
MOVLHPS xmm1, xmm2	0F 16 /r	Move Packed Single-Precision Floating-Point Values Low to High
MOVHPS m64, xmm	0F 17 /r	Move High Packed Single-Precision Floating-Point Values
PREFETCHNTA	0F 18 /0	Prefetch Data Into Caches (non-temporal data with respect to all cache levels)
PREFETCHT0	0F 18 /1	Prefetch Data Into Caches (temporal data)
PREFETCHT1	0F 18 /2	Prefetch Data Into Caches (temporal data with respect to first level cache)
PREFETCHT2	0F 18 /3	Prefetch Data Into Caches (temporal data with respect to second level cache)
NOP	0F 1F /0	No Operation
MOVAPS xmm1, xmm2/m128	0F 28 /r	Move Aligned Packed Single-Precision Floating-Point Values
MOVAPS xmm2/m128, xmm1	0F 29 /r	Move Aligned Packed Single-Precision Floating-Point Values
CVTPI2PS xmm, mm/m64	0F 2A /r	Convert Packed Dword Integers to Packed Single-Precision FP Values
CVTSI2SS xmm, r/m32	F3 0F 2A /r	Convert Dword Integer to Scalar Single-Precision FP Value
MOVNTPS m128, xmm	0F 2B /r	Store Packed Single-Precision Floating-Point Values Using Non-Temporal Hint
CVTTPS2PI mm, xmm/m64	0F 2C /r	Convert with Truncation Packed Single-Precision FP Values to Packed Dword Integers
CVTTSS2SI r32, xmm/m32	F3 0F 2C /r	Convert with Truncation Scalar Single-Precision FP Value to Dword Integer
CVTPS2PI mm, xmm/m64	0F 2D /r	Convert Packed Single-Precision FP Values to Packed Dword Integers
CVTSS2SI r32, xmm/m32	F3 0F 2D /r	Convert Scalar Single-Precision FP Value to Dword Integer
UCOMISS xmm1, xmm2/m32	0F 2E /r	Unordered Compare Scalar Single-Precision Floating-Point Values and Set EFLAGS
COMISS xmm1, xmm2/m32	0F 2F /r	Compare Scalar Ordered Single-Precision Floating-Point Values and Set EFLAGS
SQRTPS xmm1, xmm2/m128	0F 51 /r	Compute Square Roots of Packed Single-Precision Floating-Point Values
SQRTSS xmm1, xmm2/m32	F3 0F 51 /r	Compute Square Root of Scalar Single-Precision Floating-Point Value
RSQRTPS xmm1, xmm2/m128	0F 52 /r	Compute Reciprocal of Square Root of Packed Single-Precision Floating-Point Value
RSQRTSS xmm1, xmm2/m32	F3 0F 52 /r	Compute Reciprocal of Square Root of Scalar Single-Precision Floating-Point Value
RCPPS xmm1, xmm2/m128	0F 53 /r	Compute Reciprocal of Packed Single-Precision Floating-Point Values
RCPSS xmm1, xmm2/m32	F3 0F 53 /r	Compute Reciprocal of Scalar Single-Precision Floating-Point Values
ANDPS xmm1, xmm2/m128	0F 54 /r	Bitwise Logical AND of Packed Single-Precision Floating-Point Values
ANDNPS xmm1, xmm2/m128	0F 55 /r	Bitwise Logical AND NOT of Packed Single-Precision Floating-Point Values
ORPS xmm1, xmm2/m128	0F 56 /r	Bitwise Logical OR of Single-Precision Floating-Point Values
XORPS xmm1, xmm2/m128	0F 57 /r	Bitwise Logical XOR for Single-Precision Floating-Point Values
ADDPS xmm1, xmm2/m128	0F 58 /r	Add Packed Single-Precision Floating-Point Values
ADDSS xmm1, xmm2/m32	F3 0F 58 /r	Add Scalar Single-Precision Floating-Point Values
MULPS xmm1, xmm2/m128	0F 59 /r	Multiply Packed Single-Precision Floating-Point Values
MULSS xmm1, xmm2/m32	F3 0F 59 /r	Multiply Scalar Single-Precision Floating-Point Values
SUBPS xmm1, xmm2/m128	0F 5C /r	Subtract Packed Single-Precision Floating-Point Values
SUBSS xmm1, xmm2/m32	F3 0F 5C /r	Subtract Scalar Single-Precision Floating-Point Values
MINPS xmm1, xmm2/m128	0F 5D /r	Return Minimum Packed Single-Precision Floating-Point Values
MINSS xmm1, xmm2/m32	F3 0F 5D /r	Return Minimum Scalar Single-Precision Floating-Point Values
DIVPS xmm1, xmm2/m128	0F 5E /r	Divide Packed Single-Precision Floating-Point Values
DIVSS xmm1, xmm2/m32	F3 0F 5E /r	Divide Scalar Single-Precision Floating-Point Values
MAXPS xmm1, xmm2/m128	0F 5F /r	Return Maximum Packed Single-Precision Floating-Point Values
MAXSS xmm1, xmm2/m32	F3 0F 5F /r	Return Maximum Scalar Single-Precision Floating-Point Values
PSHUFW mm1, mm2/m64, imm8	0F 70 /r ib	Shuffle Packed Words
LDMXCSR m32	0F AE /2	Load MXCSR Register State
STMXCSR m32	0F AE /3	Store MXCSR Register State
SFENCE	0F AE /7	Store Fence
CMPPS xmm1, xmm2/m128, imm8	0F C2 /r ib	Compare Packed Single-Precision Floating-Point Values
CMPSS xmm1, xmm2/m32, imm8	F3 0F C2 /r ib	Compare Scalar Single-Precision Floating-Point Values
PINSRW mm, r32/m16, imm8	0F C4 /r ib	Insert Word
PEXTRW r32, mm, imm8	0F C5 /r ib	Extract Word
SHUFPS xmm1, xmm2/m128, imm8	0F C6 /r ib	Shuffle Packed Single-Precision Floating-Point Values
PMOVMSKB r32, mm	0F D7 /r	Move Byte Mask
PMINUB mm1, mm2/m64	0F DA /r	Minimum of Packed Unsigned Byte Integers
PMAXUB mm1, mm2/m64	0F DE /r	Maximum of Packed Unsigned Byte Integers
PAVGB mm1, mm2/m64	0F E0 /r	Average Packed Integers
PAVGW mm1, mm2/m64	0F E3 /r	Average Packed Integers
PMULHUW mm1, mm2/m64	0F E4 /r	Multiply Packed Unsigned Integers and Store High Result
MOVNTQ m64, mm	0F E7 /r	Store of Quadword Using Non-Temporal Hint
PMINSW mm1, mm2/m64	0F EA /r	Minimum of Packed Signed Word Integers
PMAXSW mm1, mm2/m64	0F EE /r	Maximum of Packed Signed Word Integers
PSADBW mm1, mm2/m64	0F F6 /r	Compute Sum of Absolute Differences
MASKMOVQ mm1, mm2	0F F7 /r	Store Selected Bytes of Quadword
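
The opcode column above uses Intel's notation: "/r" means a ModRM byte follows the opcode, and F3 is a mandatory prefix that selects the scalar form. As a rough illustration, the C sketch below encodes only the register-to-register case of rows written as "OP xmm1, xmm2/m..." (the first operand goes in the ModRM reg field). encode_sse_rr is a hypothetical helper, limited to XMM0..XMM7 and with no memory operands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode "OP xmm_dst, xmm_src" for a two-byte 0F xx opcode.
 * prefix is 0 for none, or 0xF3 for the scalar (SS) forms.
 * Writes at most 4 bytes to out and returns the number written. */
static size_t encode_sse_rr(uint8_t prefix, uint8_t opcode,
                            unsigned xmm_dst, unsigned xmm_src,
                            uint8_t *out) {
    size_t n = 0;
    assert(out != NULL);
    assert(xmm_dst < 8 && xmm_src < 8); /* XMM8+ would need a REX prefix */
    if (prefix != 0) {
        out[n++] = prefix;
    }
    out[n++] = 0x0F;   /* two-byte opcode escape */
    out[n++] = opcode;
    /* ModRM: mod = 11 (register direct), reg = dst, rm = src */
    out[n++] = (uint8_t)(0xC0u | (xmm_dst << 3) | xmm_src);
    return n;
}
```

For example, ADDPS xmm1, xmm2 (0F 58 /r) comes out as 0F 58 CA, and SQRTSS xmm0, xmm3 (F3 0F 51 /r) as F3 0F 51 C3.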

SSE2 instructions

added with Pentium 4; also see the integer instructions added with Pentium 4

SSE2 SIMD Floating-Point Instructions
ADDPD, ADDSD, ANDNPD, ANDPD, CMPPD, CMPSD*, COMISD, CVTDQ2PD, CVTDQ2PS, CVTPD2DQ, CVTPD2PI, CVTPD2PS, CVTPI2PD, CVTPS2DQ, CVTPS2PD, CVTSD2SI, CVTSD2SS, CVTSI2SD, CVTSS2SD, CVTTPD2DQ, CVTTPD2PI, CVTTPS2DQ, CVTTSD2SI, DIVPD, DIVSD, MAXPD, MAXSD, MINPD, MINSD, MOVAPD, MOVHPD, MOVLPD, MOVMSKPD, MOVSD*, MOVUPD, MULPD, MULSD, ORPD, SHUFPD, SQRTPD, SQRTSD, SUBPD, SUBSD, UCOMISD, UNPCKHPD, UNPCKLPD, XORPD
*CMPSD and MOVSD share their names with the string instruction mnemonics CMPSD (CMPS) and MOVSD (MOVS); the former refer to scalar double-precision floating-point values, whereas the latter refer to doubleword strings.

SSE2 SIMD Integer Instructions
MOVDQ2Q, MOVDQA, MOVDQU, MOVQ2DQ, PADDQ, PSUBQ, PMULUDQ, PSHUFHW, PSHUFLW, PSHUFD, PSLLDQ, PSRLDQ, PUNPCKHQDQ, PUNPCKLQDQ

SSE3 instructions
added with Pentium 4 supporting SSE3; also see the integer and floating-point instructions added with Pentium 4 SSE3

SSE3 SIMD Floating-Point Instructions
ADDSUBPD, ADDSUBPS (for Complex Arithmetic)
HADDPD, HADDPS, HSUBPD, HSUBPS (for Graphics)
MOVDDUP, MOVSHDUP, MOVSLDUP (for Complex Arithmetic)
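
The "for Complex Arithmetic" remark refers to a well-known pattern: with complex numbers stored as interleaved (real, imaginary) floats, MOVSLDUP, MOVSHDUP and ADDSUBPS give a complex multiply in a handful of instructions. Below is a scalar C model of that pattern, as a sketch only; vec4 and the function names are invented here, and swap_pairs stands in for SHUFPS with immediate 0xB1.

```c
#include <assert.h>

/* Hypothetical stand-in for one XMM register holding four floats. */
typedef struct { float f[4]; } vec4;

/* MOVSLDUP: duplicate the even-indexed elements. */
static vec4 movsldup(vec4 a) {
    vec4 r = {{ a.f[0], a.f[0], a.f[2], a.f[2] }};
    return r;
}

/* MOVSHDUP: duplicate the odd-indexed elements. */
static vec4 movshdup(vec4 a) {
    vec4 r = {{ a.f[1], a.f[1], a.f[3], a.f[3] }};
    return r;
}

/* SHUFPS a, a, 0xB1: swap the two floats in each pair. */
static vec4 swap_pairs(vec4 a) {
    vec4 r = {{ a.f[1], a.f[0], a.f[3], a.f[2] }};
    return r;
}

/* MULPS: element-wise multiply. */
static vec4 mulps(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] * b.f[i];
    return r;
}

/* ADDSUBPS: subtract in even lanes, add in odd lanes. */
static vec4 addsubps(vec4 a, vec4 b) {
    vec4 r = {{ a.f[0] - b.f[0], a.f[1] + b.f[1],
                a.f[2] - b.f[2], a.f[3] + b.f[3] }};
    return r;
}

/* Multiply two pairs of complex numbers stored as {re0, im0, re1, im1}.
 * (a+bi)(c+di) = (ac - bd) + (ad + bc)i */
static vec4 complex_mul(vec4 x, vec4 y) {
    vec4 t1 = mulps(movsldup(x), y);             /* {ac, ad, ...} */
    vec4 t2 = mulps(movshdup(x), swap_pairs(y)); /* {bd, bc, ...} */
    return addsubps(t1, t2);                     /* {ac-bd, ad+bc, ...} */
}
```

For x = {1, 2, 0, 1} and y = {3, 4, 0, 1}, i.e. (1+2i)(3+4i) and i*i, complex_mul returns {-5, 10, -1, 0}.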

SSSE3 instructions

added with Xeon 5100 series and initial Core 2
PSIGNW, PSIGND, PSIGNB
PSHUFB
PMULHRSW, PMADDUBSW
PHSUBW, PHSUBSW, PHSUBD
PHADDW, PHADDSW, PHADDD
PALIGNR
PABSW, PABSD, PABSB

SSE4 instructions

SSE4.1
Instruction	Description
MPSADBW	Compute eight offset sums of absolute differences (i.e. |x0-y0|+|x1-y1|+|x2-y2|+|x3-y3|, |x0-y1|+|x1-y2|+|x2-y3|+|x3-y4|, ...); this operation is extremely important for modern HDTV codecs, and (see [4]) allows an 8x8 block difference to be computed in fewer than seven cycles. One bit of a three-bit immediate operand indicates whether y0 .. y10 or y4 .. y14 should be used from the destination operand, the other two whether x0..x3, x4..x7, x8..x11 or x12..x15 should be used from the source.
PHMINPOSUW	Sets the bottom unsigned 16-bit word of the destination to the smallest unsigned 16-bit word in the source, and the next-from-bottom to the index of that word in the source.
PMULDQ	Packed signed multiplication on two sets of 2 out of 4 packed integers, the 1st and 3rd per packed 4, giving 2 packed 64-bit results.
PMULLD	Packed signed multiplication, 4 packed sets of 32-bit integers multiplied to give 4 packed 32-bit results.
DPPS, DPPD	Dot product for AOS (Array of Structs) data. This takes an immediate operand consisting of four (or two for DPPD) bits to select which of the entries in the input to multiply and accumulate, and another four (or two for DPPD) to select whether to put 0 or the dot-product in the appropriate field of the output.
BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW	Conditional copying of elements in one location with another, based (for non-V form) on the bits in an immediate operand, and (for V form) on the bits in register XMM0.
PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD	Packed minimum/maximum for different integer operand types
ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD	Round values in a floating-point register to integers, using one of four rounding modes specified by an immediate operand
INSERTPS, PINSRB, PINSRD/PINSRQ, EXTRACTPS, PEXTRB, PEXTRW, PEXTRD/PEXTRQ	The INSERTPS and PINSR instructions read 8, 16 or 32 bits from an x86 register or memory location and insert them into a field in the destination register given by an immediate operand. EXTRACTPS and PEXTR read a field from the source register and store it in an x86 register or memory location. For example, PEXTRD eax, xmm0, 1; EXTRACTPS [addr+4*eax], xmm1, 1 stores field 1 of xmm1 (counting from 0) at the address indexed by field 1 of xmm0.
PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXBQ, PMOVZXBQ, PMOVSXWD, PMOVZXWD, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ	Packed sign/zero extension to wider types
PTEST	Similar to the TEST instruction in that it sets flags from a bitwise AND of its operands without storing the result: ZF is set if (destination AND source) is all zeros, and CF is set if ((NOT destination) AND source) is all zeros.
PCMPEQQ	Quadword (64 bits) compare for equality
PACKUSDW	Convert signed DWORDs into unsigned WORDs with saturation.
MOVNTDQA	Efficient read from write-combining memory area into SSE register; this is useful for retrieving results from peripherals attached to the memory bus.
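
Since the PTEST flag rules are easy to mix up, here is a small C model of them as documented by Intel: ZF is set when (dest AND src) is all zeros, CF is set when ((NOT dest) AND src) is all zeros. The types xmm128 and ptest_flags and the function ptest are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit XMM register. */
typedef struct { uint64_t lo, hi; } xmm128;

/* The two flags PTEST produces. */
typedef struct { int zf, cf; } ptest_flags;

/* PTEST dest, src: neither operand is modified, only flags are set. */
static ptest_flags ptest(xmm128 dest, xmm128 src) {
    ptest_flags f;
    /* ZF: the operands have no set bit in common. */
    f.zf = ((dest.lo & src.lo) | (dest.hi & src.hi)) == 0;
    /* CF: every bit set in src is also set in dest. */
    f.cf = ((~dest.lo & src.lo) | (~dest.hi & src.hi)) == 0;
    return f;
}
```

A common use is PTEST xmm0, xmm0 followed by JZ to check whether a register is all zeros; in the model that is ptest(x, x).zf.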

SSE4.2
Instruction	Description
CRC32	Accumulate CRC32C value using the polynomial 0x11EDC6F41 (or, without the high order bit, 0x1EDC6F41).
PCMPESTRI	Packed Compare Explicit Length Strings, Return Index
PCMPESTRM	Packed Compare Explicit Length Strings, Return Mask
PCMPISTRI	Packed Compare Implicit Length Strings, Return Index
PCMPISTRM	Packed Compare Implicit Length Strings, Return Mask
PCMPGTQ	Compare Packed Signed 64-bit data For Greater Than
POPCNT	Population count (count number of bits set to 1). POPCNT instruction may also be implemented in some processors that do not support the other SSE4 instructions and a separate bit can be tested to confirm POPCNT presence.
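
For anyone who needs a fallback on CPUs without SSE4.2, the CRC32 instruction can be modelled with a plain bitwise loop. This is a sketch, not an optimized implementation: crc32c_u8 mirrors the byte form of the instruction (which applies no initial value or final inversion on its own), and crc32c wraps it in the usual CRC-32C convention. The names are made up; 0x82F63B78 is the polynomial 0x1EDC6F41 with its bits reversed, as needed for LSB-first processing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 0x1EDC6F41 bit-reversed, for processing the least significant bit first. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Model of CRC32 r32, r/m8: fold one byte into the accumulator. */
static uint32_t crc32c_u8(uint32_t crc, uint8_t byte) {
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++) {
        crc = (crc >> 1) ^ ((crc & 1u) ? CRC32C_POLY_REFLECTED : 0u);
    }
    return crc;
}

/* Conventional CRC-32C of a buffer: start from all ones and invert the
 * result. The instruction itself does neither, so callers add both. */
static uint32_t crc32c(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc = crc32c_u8(crc, p[i]);
    }
    return ~crc;
}
```

The standard check value applies: crc32c("123456789", 9) returns 0xE3069283.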

SSE4a
Instruction	Description
LZCNT	Leading Zero Count - bit manipulation. LZCNT instruction may also be implemented in some processors that do not support the other SSE4 instructions and a separate bit can be tested to confirm LZCNT presence.
POPCNT	Population count (count number of bits set to 1). POPCNT instruction may also be implemented in some processors that do not support the other SSE4 instructions and a separate bit can be tested to confirm POPCNT presence.
EXTRQ/INSERTQ	Combined mask-shift instructions.
MOVNTSD/MOVNTSS	Scalar streaming store instructions.
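
As a reference for what LZCNT and POPCNT compute, here is a portable C sketch (function names are made up). Note that LZCNT returns the operand size for a zero input, whereas BSR leaves its destination undefined in that case. The "separate bit" mentioned above is, to my knowledge, CPUID leaf 1 ECX bit 23 for POPCNT and CPUID leaf 0x80000001 ECX bit 5 for LZCNT; the two has_* helpers only decode a register value that the caller obtained from CPUID.

```c
#include <assert.h>
#include <stdint.h>

/* POPCNT r32: number of bits set to 1. */
static unsigned popcnt32(uint32_t x) {
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* LZCNT r32: number of leading zero bits; 32 when the input is zero. */
static unsigned lzcnt32(uint32_t x) {
    unsigned n = 0;
    if (x == 0) {
        return 32;
    }
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Feature bits, given ECX as returned by CPUID (caller runs CPUID). */
static int has_popcnt(uint32_t cpuid_1_ecx) {
    return (cpuid_1_ecx >> 23) & 1u;
}

static int has_lzcnt(uint32_t cpuid_80000001_ecx) {
    return (cpuid_80000001_ecx >> 5) & 1u;
}
```

For example, popcnt32(0xF0F0) is 8 and lzcnt32(0x00010000) is 15.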

3DNow!

3DNow! floating-point instructions
Instruction	Description
PAVGUSB	Packed 8-bit unsigned integer averaging
PI2FD	Packed 32-bit integer to floating-point conversion
PF2ID	Packed floating-point to 32-bit integer conversion
PFCMPGE	Packed floating-point comparison, greater or equal
PFCMPGT	Packed floating-point comparison, greater
PFCMPEQ	Packed floating-point comparison, equal
PFACC	Packed floating-point accumulate
PFADD	Packed floating-point addition
PFSUB	Packed floating-point subtraction
PFSUBR	Packed floating-point reverse subtraction
PFMIN	Packed floating-point minimum
PFMAX	Packed floating-point maximum
PFMUL	Packed floating-point multiplication
PFRCP	Packed floating-point reciprocal approximation
PFRSQRT	Packed floating-point reciprocal square root approximation
PFRCPIT1	Packed floating-point reciprocal, first iteration step
PFRSQIT1	Packed floating-point reciprocal square root, first iteration step
PFRCPIT2	Packed floating-point reciprocal/reciprocal square root, second iteration step
PMULHRW	Packed 16-bit integer multiply with rounding

3DNow! performance-enhancement instructions
Instruction	Description
FEMMS	Faster entry/exit of the MMX or floating-point state
PREFETCH/PREFETCHW	Prefetch at least a 32-byte line into L1 data cache

3DNow! extension DSP instructions
Instruction	Description
PF2IW	Packed floating-point to integer word conversion with sign extend
PI2FW	Packed integer word to floating-point conversion
PFNACC	Packed floating-point negative accumulate
PFPNACC	Packed floating-point mixed positive-negative accumulate
PSWAPD	Packed swap doubleword

MMX extension instructions (Integer SSE)
Instruction	Description
MASKMOVQ	Streaming (cache bypass) store using byte mask
MOVNTQ	Streaming (cache bypass) store
PAVGB	Packed average of unsigned byte
PAVGW	Packed average of unsigned word
PMAXSW	Packed maximum signed word
PMAXUB	Packed maximum unsigned byte
PMINSW	Packed minimum signed word
PMINUB	Packed minimum unsigned byte
PMULHUW	Packed multiply high unsigned word
PSADBW	Packed sum of absolute byte differences
PSHUFW	Packed shuffle word
PEXTRW	Extract word into integer register
PINSRW	Insert word from integer register
PMOVMSKB	Move byte mask to integer register
PREFETCHNTA	Prefetch using the NTA reference
PREFETCHT0	Prefetch using the T0 reference
PREFETCHT1	Prefetch using the T1 reference
PREFETCHT2	Prefetch using the T2 reference
SFENCE	Store fence
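
Two of the entries above, PAVGB and PSADBW, are worth spelling out because the rounding and the result layout are easy to get wrong. The following is a scalar C sketch of the 64-bit MMX forms, with 8-byte arrays standing in for MMX registers; the function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One lane of PAVGB: unsigned average, rounded up on ties. */
static uint8_t pavgb_lane(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + b + 1u) >> 1);
}

/* PAVGB mm1, mm2: average eight unsigned bytes lane by lane. */
static void pavgb(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
    assert(a != NULL && b != NULL && out != NULL);
    for (int i = 0; i < 8; i++) {
        out[i] = pavgb_lane(a[i], b[i]);
    }
}

/* PSADBW mm1, mm2: sum of the eight absolute byte differences.
 * The instruction stores this sum in the low word of the destination
 * and clears the remaining bits; the maximum is 8 * 255 = 2040. */
static uint16_t psadbw(const uint8_t a[8], const uint8_t b[8]) {
    unsigned sum = 0;
    assert(a != NULL && b != NULL);
    for (int i = 0; i < 8; i++) {
        sum += (a[i] > b[i]) ? (unsigned)(a[i] - b[i])
                             : (unsigned)(b[i] - a[i]);
    }
    return (uint16_t)sum;
}
```

With a = {0,1,2,3,4,5,6,7} and b = {7,6,5,4,3,2,1,0}, psadbw returns 32 and pavgb yields 4 in every lane.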

3DNow! Professional instructions unique to the Geode GX/LX
Instruction	Description
PFRSQRTV	Reciprocal square root approximation for a pair of 32-bit floats
PFRCPV	Reciprocal approximation for a pair of 32-bit floats