4. Assembly Language & Computer Architecture

MIT OpenCourseWare
MIT 6.172 Performance Engineering of Software Systems, Fall 2018 Instructor: Charles Leiserson View ...
Video Transcript:
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

So today we're going to talk about assembly language and computer architecture. It's interesting: these days most software courses don't bother to talk about these things, and the reason is that, as much as possible, people writing software have been insulated from performance considerations. But if you want to write fast code, you have to know what is going on underneath, so you can exploit the strengths of the architecture, and the best interface we have to that is the assembly language. So that's what we're going to talk about today.

When you take a particular piece of code, like fib here, you compile it by running it through clang, as I'm sure you're familiar with at this point. What it produces is binary machine language that the computer's hardware is programmed to interpret and execute: it looks at the bits as instructions, as opposed to as data, and executes them. That process is not one step; there are actually four stages of compilation: preprocessing, compiling (sorry for the redundancy, that's a bad name conflict, but that's what they call it), assembling, and linking. I want to take us through those stages.

The first thing your code goes through is a preprocessing stage, and you can invoke that with clang manually: if you run clang -E, that will run the preprocessor and nothing else, and you can take a look at the output and see how all your macros got expanded before the compilation proper runs. Then you compile it, and that produces assembly code. Assembly is a mnemonic structure of the machine code that makes it more human-readable than the machine code itself would be; once again, you can produce the assembly yourself with clang -S. Then, penultimately maybe, you can assemble that assembly-language code to produce an object file; and since we like separate compilation, you don't have to compile everything as one big monolithic hunk. Finally, there's typically a linking stage to produce the final executable, and for that we're using ld; for the most part we're actually using the gold linker, but ld is the command that calls it.

So let's go through each of those steps and see what's going on. First, the preprocessing is really straightforward, just a textual substitution, so I'm not going to dwell on it.
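If you want to see each stage by itself, you can drive them one at a time. A minimal sketch (the file name fib.c and the intermediate file names are illustrative; the flags are the standard clang ones named above):

```sh
clang -E fib.c -o fib.i    # preprocess only: expand macros and #includes
clang -S fib.i -o fib.s    # compile: produce human-readable assembly
clang -c fib.s -o fib.o    # assemble: produce a binary object file
clang fib.o -o fib         # link (clang invokes ld/gold) to make the executable
```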
The next stage takes the source code to assembly code. When we do clang -S, we get a symbolic representation that looks something like this: we have some labels on the side, we have some operations, we may have some directives, and then we have a lot of gibberish, which won't seem like so much gibberish after you've played with it a little bit, but to begin with it looks kind of like gibberish.

From there we assemble that assembly code, and that produces the binary. Once again, you can invoke it just by running clang: clang will recognize that it doesn't have a C file or a C++ file, it says, oh goodness, I've got an assembly-language file, and it produces the binary. The other thing that turns out to be the case is that assembly and machine code are really very similar in structure. Things like the opcodes, which are the things here in blue (or purple, whatever that color is), correspond to specific bit patterns over here in the machine code; and the addresses and registers that we're operating on, the operands, correspond to other bit codes over there. It's not exactly one-to-one, but it's pretty close to one-to-one; compared to C, where if you look at the binary it's way, way different.

So one of the things you can do is take the machine code, especially machine code produced with so-called debug symbols (that is, compiled with -g), and use the program objdump, which will produce a disassembly of the machine code: it tells you, here's the more human-readable mnemonic code, the assembly, recovered from the binary. That's really useful, especially if you're trying to figure things out. Which brings us to: why do we bother looking at the assembly? Why would you want to look at the assembly of your program? Does anybody have some ideas?
Yes: you can see whether certain optimizations were made or not. Everybody's going to say that one. So here are some reasons. First, the assembly reveals what the compiler did and did not do, because you can see exactly what the assembly is that will be executed as machine code. The second reason, which turns out to happen more often than you would think: hey, guess what, the compiler is a piece of software, and it has bugs. Your code isn't operating correctly, oh goodness, what's going wrong? Maybe the compiler made an error. We have certainly found that, especially when you start using some of the less frequently used features of a compiler; you may discover they're actually not that well broken in. A bug may only have an effect when compiling at -O3, but if you compile at -O0 or -O1 everything works out just fine; then you say, gee, somewhere in the optimizations they did an optimization wrong. One of the first principles of optimization is do it right, and then the second is make it fast; sometimes the compiler doesn't.

It's also the case that sometimes you cannot write code that produces the assembly that you want, and in that case you can actually write the assembly by hand. It used to be, many, many years ago, that a lot of software was written in assembly. In fact, at my first job out of college I spent about half the time programming in assembly language, and it's not as bad as you would think; but it certainly is easier to have high-level languages, that's for sure. You get a lot more done a lot quicker.

And the last reason is reverse engineering: you can figure out what a program does when you only have access to its binary, not its source. For example, take the matrix-multiplication example that I gave on day one. We had the overall outer structure, but in the inner loop we could not match the Intel Math Kernel Library code. So what did we do? We didn't have the source for it, so we looked to see what it was doing, we said, oh, is that what they're doing, and then we were able to do it ourselves, without having to get the source from them. We reverse engineered what they did. So all those are good reasons.

Now, in this class we have some expectations. One thing is, assembly is complicated, and you needn't memorize the manual. In fact, the manual has over a thousand pages.
But here's what we do expect of you. You should understand how a compiler implements various C linguistic constructs with x86 instructions; that's what we'll see in the next lecture. You should be able to read x86 assembly language with the aid of an architecture manual; on a quiz, for example, we would give you snippets, and explain what the opcodes are in case they're unfamiliar, but you should have enough understanding to see what's actually happening. You should understand the high-level performance implications of common assembly patterns: what it means, in terms of performance, to do things in a particular way. Some of them are quite obvious: vector operations tend to be faster than doing the same thing with a bunch of scalar operations. If you do write assembly, typically what we use is a bunch of compiler intrinsic functions, so-called built-ins, that allow you to use the assembly-language instructions from C. And after we've done all this, you should be able to write assembly code from scratch if the situation demands it at some time in the future. We won't do that in this class, but we expect you'll reach a level of mastery where that would not be impossible for you; you'd be able to do it with a reasonable amount of effort.

The rest of the lecture: I'm going to start by talking about the instruction set architecture of the x86-64, which is the one we're using for the cloud machines in this class; then I'm going to talk about floating-point and vector hardware; and then I'll do an overview of computer architecture. Now, this is a software class, right? Software performance engineering. The reason we're doing all this is so you can write code that better matches the hardware, and therefore runs better. To do that, I could present things at a high level, but my experience is that if you really want to understand something, you want to understand it to the level that's necessary, and then one level below that. It's not that you'll necessarily use that lower level, but it gives you insight as to why that layer is what it is and what's really going on. So we're going to do a dive that takes us one level beyond what you'll probably need in this class, so that you have a robust foundation for understanding. That's part of my learning philosophy: go one step beyond, and then you can come back.
So, the ISA primer. The ISA describes the syntax and semantics of assembly, and there are four important concepts in the instruction set architecture: the notion of registers, the notion of instructions, the data types, and the memory addressing modes. We're going to go through those one by one.

Let's start with the registers. The registers are where the processor stores things, and there are a bunch of x86 registers, so many that you don't need to know most of them. The ones that are important are these. First, there are the general-purpose registers, which typically have width 64, and there are many of those. There is a so-called flags register, RFLAGS, which keeps track of things like whether there was an overflow, whether the last arithmetic operation resulted in a zero, whether there was a carry out of the word, or what have you. The next one is the instruction pointer: the assembly language is organized as a sequence of instructions, and the hardware marches linearly through that sequence, one after the other, unless it encounters a conditional or unconditional jump, in which case it branches to whatever the target location is; but for the most part it's just running straight through memory. Then there are some registers that were added quite late, namely the SSE registers and the AVX registers, which are vector registers. The xmm registers, from when they first did vectorization, are 128 bits; for AVX there are the ymm registers, which are 256 bits; and on the most recent processors, which we're not using this term, there's another level of AVX, AVX-512, that gives you 512-bit registers. Maybe we'll use that for the final game-playing project, because it's a little more power, but for most of what you'll be doing we'll just be keeping to the C4 instances on AWS that you've been using.

Now, the x86-64 didn't start out as x86-64. It started out as x86, used in particular for the 8086, which had a 16-bit word.
Really short. How many things can you index with a 16-bit word? About 65,000: 65,536 bytes you can address, since this is byte addressing. So that's 64K bytes you can address. How could they possibly use that for machines? Well, the answer is, that's how much memory was on the machine; you didn't have gigabytes. As Moore's law marched along and we got more and more memory, the words had to become wider to be able to index it all.

But here's the thing: if memory is too expensive and you can't get memory that's big enough anyway, then if you build a wider word, say 32 bits, your processor just costs twice as much as the next guy's processor. So instead they went along with 16 bits as long as that was the common size, then had some growth pains and went to 32, and from there had some more growth pains and went to 64. And in fact they did some really weird stuff along the way. When they made these longer registers, they made registers that are aliased to exactly the same thing for the lower bits, so they could be addressed a byte at a time: in each register you can access the lower and upper halves of the 16-bit word, or the 32-bit word, or the full 64-bit word. If you were doing this today, you wouldn't do that; you wouldn't have all these registers that alias each other. But that's what they did, because this is history, not design: when they did it, they were not designing for the long term. Now, are we going to go to 128-bit addressing? Probably not. 64 bits of address is a spectacular amount of stuff; 2 to the 64th is a lot of gazillions, so we're probably not going to have to go beyond 64.

So here are the general-purpose registers, and as I mentioned, they have different names that refer to the same thing: if you change %eax, for example, that also changes %rax. They originally all had functional purposes; now they're all pretty much interchangeable, but the names have stuck because of history. Instead of calling them register 0, register 1, and so on, they all have these funny names.
Some of them are still used for a particular purpose: %rsp is used as the stack pointer, and %rbp is used to point to the base of the frame, for those who remember their 6.004. Anyway, there are a whole bunch of them, with different names depending on which part of the register you're accessing.

Now, the format of an x86-64 instruction is an opcode and then an operand list. The operand list is typically 0, 1, 2, or, rarely, 3 operands, separated by commas. Typically all operands are sources, and one operand may also be the destination. For example, take a look at this add instruction: the operation is an add, and the operand list is these two registers, one %edi and the other %ecx, with the destination being the second one. When you add, in this case, it takes the value in %ecx, adds the value in %edi into it, and leaves the result in %ecx.

So what does "op A, B" mean? Funny you should ask. It turns out, naturally, that the literature is inconsistent about how it refers to operations, and there are two major conventions in use: the AT&T syntax and the Intel syntax. In the AT&T syntax, the last operand is the destination; in the Intel syntax, the first operand is the destination. Is that confusing? Almost all the tools we're going to use use the AT&T syntax, but you will read documentation, Intel's documentation, that uses the other syntax. Don't get confused. I can't help it; this is the way the state of the world is. And this is not a complicated translation: I'll bet if you just google it, you can find in two seconds something that translates from one to the other.
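To make the difference concrete, here is the same addition written both ways (a small illustration of mine, not a slide from the lecture):

```asm
# AT&T syntax (what clang and objdump emit by default): destination last.
addl %edi, %ecx        # %ecx <- %ecx + %edi

# Intel syntax for the very same instruction: destination first.
#   add ecx, edi
```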
Now, here are some very common x86 opcodes; let me mention a few, because these are ones you'll often see in code. What do you think mov does? Yes, it puts something from one register into another register. Of course, this is the computer-science move, not a real move: when I move my belongings from my old house to my new house, they're no longer in the old place, but in computer science, for some reason, when we move things we leave a copy behind. So they call it move, but why don't they call it copy? You got me.

Then there's the conditional move, cmov: a move based on a condition, like move if a flag is equal to zero, and so forth; basically, a move that doesn't always happen. Then you can extend the sign. Suppose you're moving from a 32-bit register into a 64-bit register; the question is what happens to the higher-order bits. There are two basic mechanisms: either they can be filled with zeros, or, remembering from our lecture on binary that the leftmost bit is the sign bit, that sign bit can be extended through the high-order part of the word, so the whole number stays negative if it was negative, and is filled with zeros if it was positive. Does that make sense?

Then there are things like push and pop, to do stacks. There's a lot of integer arithmetic: addition, subtraction, multiplication, division, shifts, address calculation, rotations, incrementing, decrementing, negating, etc. There's also a lot of binary logic: and, or, xor, not, which all do bitwise operations. Then there's boolean logic, like testing whether some value has a given property, or comparing. There's the unconditional jump, and there are conditional jumps, jump with a condition. Then there are instructions for subroutines, and a bunch more that the manual covers and that will undoubtedly show up, for example the whole set of vector operations we'll talk about a little later.
Now, the opcodes may be augmented with a suffix that describes the data type of the operation, or with a condition code. An opcode for data movement, arithmetic, or logic uses a single-character suffix to indicate the data type, and if the suffix is missing it can usually be inferred. Take a look at this example: a mov with a q at the end. What do you think q stands for? Quad word. How many bytes in a quad word? Eight, because x86 originally started out with a 16-bit word, so they said a quad word is four of those 16-bit words: eight bytes. You get the idea. But let me tell you, this kind of thing is all over the x86 instruction set, all these historical mnemonics, and if you don't understand what they really mean, you can get very confused. In this case we're moving a 64-bit integer, because a quad word has eight bytes, or 64 bits.
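Here is the same move with each of the four width suffixes (an illustrative sketch; the constant 172 is just the lecture's running example constant):

```asm
movb $172, %al     # b = byte: 1 byte, into the low byte of %rax
movw $172, %ax     # w = word: 2 bytes, the original 16-bit word
movl $172, %eax    # l = long / double word: 4 bytes
movq $172, %rax    # q = quad word: 8 bytes, i.e., a 64-bit integer
```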
This is one of those places where, whenever I prepare this lecture, I just go into spasms of laughter as I look and say, oh my god, they really did that. For example, on the last page, the sub operator: as a two-argument operator, it subtracts the first operand from the second; it basically takes the second one minus the first one and puts the result in the second one. There is no way of subtracting the other way around. If you wanted it the other way around to save yourself a cycle, too bad, you can't do it that way. And all this stuff the compiler has to understand.

So here are the x86-64 data types. The way I've laid it out is to show you the correspondence between C and x86-64: here are the declarations in C (char, short, int, unsigned int, long, etc.), here's an example of a C constant of that type, and here's the size in bytes you get when you declare it.
Then the assembly suffix is one of these: b or w for a byte or a word, l or d for a double word, q for a quad word (i.e., 8 bytes), and then single precision, double precision, and extended precision for floating point. Sign extension uses two data-type suffixes, and here's an example. Now you see I can't read this without my cheat sheet. What is this saying? We're going to do a move with a zero-extend, where the first operand is a byte and the second operand is a long; if I'm wrong, I've got to look at the chart too, and of course we don't hold you to reading these cold. The z there says extend with zeros, and an s says preserve the sign.
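Spelled out (the register choices are mine, for illustration):

```asm
movzbl %al, %edx   # z, b, l: byte -> long, upper bits filled with zeros
movsbl %al, %edx   # s, b, l: byte -> long, sign bit copied into the upper bits
```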
That would all be well and good, except for what they then did with 32-bit operations: if an operation writes a 32-bit result into a register, it implicitly zero-extends into the upper 32 bits of the 64-bit register, whereas if you do it with smaller values, it simply overwrites the low bits and doesn't touch the higher bits. When they did the 32-to-64-bit extension of the instruction set, they decided they wouldn't do what had been done in the past, and instead would zero-extend unless there was something explicit to the contrary. You got me.

I have a friend who worked at Intel, and he had a joke about why the Intel instruction set is so complicated. He said, here's the idea of the Intel instruction set: to become an Intel Fellow, you need to have an instruction in the Intel instruction set, an instruction you invented that's now used in Intel processors. Nobody becomes an Intel Fellow for removing instructions. So it just grows and grows and gets more and more complicated.

Once again, for extension you can sign-extend, and here are two examples: in one case we move an 8-bit integer to a 32-bit integer and zero-extend it, versus preserving the sign. Conditional jumps and conditional moves also use suffixes, in their case to indicate the condition code. Here, for example, the ne indicates the jump should be taken only if the arguments of the previous comparison are not equal: ne is "not equal". You do a comparison, which sets a flag in the RFLAGS register; then the jump looks at that flag and decides whether to jump, or just continue the sequential execution of the code.
There are a bunch of things you can jump on, the status flags, and you can see the names here. There's carry; there's parity, which is the XOR of all the bits in the word; there's adjust, and I don't even know what that's for; there's the zero flag, which tells whether the result was zero; there's a sign flag, whether it's positive or negative; there's a trap flag, interrupt enable, direction, and overflow. So anyway, there are a whole bunch of these. For example, this code decrements %rbx, which sets the zero flag if the result equals zero, and then the conditional jump jumps to the label if the ZF flag is not set, in this case. It makes sense after a fashion; it doesn't make rational sense, but it does make sense.

Here are the main ones you're going to need. The carry flag is whether you got a carry or a borrow out of the most significant bit; the zero flag is whether the ALU operation produced zero; the sign flag is whether the last ALU operation had the sign bit set; and the overflow flag says the operation resulted in arithmetic overflow. The condition codes: if you put one of these condition codes on your conditional jump or whatever, this table tells you exactly which flags are checked. The easy ones are things like "equal", but there are some less obvious ones. For example, why do the condition codes e and ne check the zero flag? The answer is that, rather than baking a comparison into the branch, they've separated the branch from the comparison itself; and it needn't even be a compare instruction: it could be that the result of the last arithmetic operation was zero, in which case you can branch without having to do an explicit comparison with zero. So, for example, if you have a loop where you're decrementing a counter until it gets to zero, it's actually faster, by one instruction, to test whether the loop index hit zero than to have the loop count up to n and compare with n every time before you branch. These days that optimization doesn't mean much, because, as we'll talk about in a little bit, these machines are so powerful that one extra integer operation probably has no bearing on the overall cost. And yes, the conditional jump takes no explicit arguments; it just looks at the flags.
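The counting-down idiom looks something like this (a minimal sketch; the trip count and the summing body are mine):

```asm
    movq $10, %rbx       # hypothetical trip count
    xorl %eax, %eax      # running sum = 0 (a 32-bit write zeroes all of %rax)
.Ltop:
    addq %rbx, %rax      # some loop body
    decq %rbx            # dec sets ZF when %rbx reaches zero...
    jnz  .Ltop           # ...so the branch needs no separate cmpq against n
```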
Now, the next aspect of this is that operands can be registers, but you can also address memory. There are three direct addressing modes and three indirect addressing modes, and at most one operand may specify a memory address.

Here are the direct addressing modes. For an immediate, you give it a constant, like 172, some random constant, to store into the register; that's called an immediate, and if you look at the machine language, the 172 is right there in the instruction, encoded as part of it. Register says move the value from a register, in this case %rcx, and the index of that register is encoded in that part of the instruction. And direct memory says use a particular memory location: you can give a hex value, and it's going to use the value stored at that place in memory. Indicating that memory location on a 64-bit machine takes eight bytes, whereas for the movq of the immediate 172, the constant fits in one byte: I spend a lot less storage, I can get the value directly from the instruction stream, and I avoid an access to memory, which is very expensive.

Speaking of which: how many cycles does it take these days to fetch something from memory, if the value is not in cache or in a register? Yeah, a couple hundred or more. A couple hundred cycles to fetch something from memory. It's not that memory is so slow; it's that the processors are so fast. Clearly, if you can get things into registers, which you can mostly access in a single cycle, you win. So we want to move things close to the processor, operate on them, shove them back, and while we're pulling things from memory, we want other work going on. The hardware is all organized to do that. And of course, since we spend so much time fetching from memory, that's one reason we use caching. Caching is really important, and we're going to spend a bunch of time on how to get the best out of your cache.

There's also indirect addressing: instead of giving a location directly, you say, go to some other place, for example a register, and the address you want is stored in that location. Register indirect says, in this case, move the contents of the location whose address is in %rax into %rdi: if %rax held 172, it would take whatever is in location 172 and put it in %rdi. Register indexed says do the same thing, but while you're at it, add an offset: once again, if %rax had 172 and the offset is 172, this particular instruction would go to location 344 to fetch the value. And then there's instruction-pointer relative, where instead of indexing off a general-purpose register, you index off the instruction pointer. You usually see that used with control: for example, you can jump to your current place in the code plus four instructions, jumping down some number of instructions. But sometimes the compiler will put data in the instruction stream, and the code can then index off the instruction pointer to get those values without having to use up another register.

The most general form is base indexed scale displacement addressing. Wow. This is a move whose address is a constant plus three terms, and it's the most complicated addressing mode that's supported. The address is the base, plus the index times the scale, plus the displacement: the base is a general-purpose register, in this case %rdi; the index times the scale is added to it, where the scale is 1, 2, 4, or 8; and the displacement is that number on the front. This gives you very general indexing of things off a base pointer. You'll often see this kind of addressing when you're accessing stack memory, because you can say: here's the base of my frame on the stack, and anything I want is up a certain amount, scaled by a certain amount. Once again, you'll become familiar with these with the manual; you don't have to memorize them all, but you do have to understand that there are a lot of these complex addressing modes.
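Side by side, the modes look like this (an illustrative summary, with registers chosen by me):

```asm
movq $172, %rdi              # immediate: constant encoded in the instruction
movq %rcx, %rdi              # register: take the value held in %rcx
movq 0x172, %rdi             # direct memory: load from absolute address 0x172
movq (%rax), %rdi            # register indirect: load from the address in %rax
movq 172(%rax), %rdi         # register indexed: load from %rax + 172
movq label(%rip), %rdi       # instruction-pointer relative (label is hypothetical)
movq 172(%rdi,%rsi,8), %rax  # base + index*scale + displacement: %rdi + 8*%rsi + 172
```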
The jump instructions take a label as their operand, which identifies a location in the code. The labels can be symbols: you can say, here's a symbol I want to jump to, which might be the beginning of a function, or a label generated at the top of a loop. They can be exact addresses: go to this place in the code. Or they can be relative addresses: jump to some place indexed off the instruction pointer, as I mentioned. An indirect jump takes as its operand an indirect address (the slide has a typo there); basically you can say, jump to whatever is pointed to by that register, using whatever addressing mode you want. So that's kind of an overview of the assembly language.

Now let's take a look at some idioms. The xor opcode computes the bitwise XOR of A and B; we saw the other day that XOR is a great trick for swapping numbers, for example. But often in code you will see something like "xor %rax, %rax". What does that do? It zeros the register. Why? It takes %rax XOR %rax, and when you XOR something with itself you get zero, and it stores that back. So that's actually how you zero things: whenever you see that, they're zeroing the register. It's quicker and easier than putting a zero constant into the instruction; it saves a byte, because this ends up being a very short instruction (I don't remember exactly how many bytes).

Here's another one. The test opcode, test A, B, computes the bitwise AND of A and B and discards the result, preserving the RFLAGS register. So what do these test instructions do? The first one takes the bitwise AND of %rcx with itself, which is just %rcx, and sets the flags: the AND is nonzero if any of the bits are set, so the zero flag gets set exactly when %rcx holds the value 0, and the je jumps to the label only in that case. In all other cases the zero flag isn't set, and it falls through. The second one is a conditional move. Both of them are basically checking whether a register is zero, and then doing something if it is or isn't. These are just idioms that you sort of have to look at to see how they accomplish a particular thing.
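In assembly, the two idioms look something like this (a sketch; the label and the non-zero path are mine):

```asm
    xorq  %rax, %rax     # idiom: zero %rax; shorter encoding than movq $0, %rax
    testq %rcx, %rcx     # AND %rcx with itself; result discarded, flags kept
    je    .Lzero         # ZF was set exactly when %rcx == 0, so jump in that case
    incq  %rax           # hypothetical non-zero path
.Lzero:
```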
One more: the ISA includes several no-op (no operation) instructions, including nop, nop with an argument, and data16, which sets aside two bytes of a no-op. Here's a line of assembly that we found in some of our code: data16 data16 data16 nopw, where the nopw takes an argument with a whole address calculation in it. What do you think this is doing? What's the effect? By the way, they're all no-ops, so the effect is nothing; it does essentially nothing. So why would the compiler generate assembly with these idioms? That seems crazy, right? It's typically doing alignment optimization, or managing code size. It may want to start the next instruction at the beginning of a cache line, and in fact there's a directive for that, if you want all your functions to start at the beginning of a cache line; then it wants to make sure that if execution reaches that point, it just proceeds through memory doing nothing. So mainly it's there to optimize alignment, and when you see those things, you just have
to realize, oh, that's the compiler generating some no-ops. So that's our brief excursion over x86 assembly language. Now I want to dive into floating-point and vector hardware, which will be the main part, and then, if there's any time at the end, I'll show the slides on how branch prediction works and a variety of other machine topics. If we don't get to them, no problem; you can take a look at the slides, and there's also the architecture manual.

So, floating-point instruction sets. Mostly, the scalar floating-point operations are accessed via a couple of different instruction sets. The history of floating point is interesting, because originally the 8086 did not have a floating-point unit: floating point was done in software. Then they made a companion chip that would do floating point, and then they started integrating it and so forth, as miniaturization took hold. The SSE and AVX instructions do both single- and double-precision scalar floating point, i.e., floats or doubles. The x87 instructions (that's from the 8087, the companion chip that was attached to the 8086, which is where they got the name) support single-, double-, and extended-precision scalar floating-point arithmetic, including float, double, and long double: you can actually get a great big result out of a multiply if you use the x87 instruction set. It also includes vector instructions, so you can multiply or add there as well: all these places on the chip where you can decide to do one thing or another.

Compilers generally prefer the SSE instructions over the x87 instructions, because they're simpler to compile for and to optimize. The SSE opcodes are similar to the normal x86 opcodes, and they use the xmm registers and floating-point types, so you'll see stuff like this: movsd and so forth. The suffix says what the data type is; in this case, sd says it's a double-precision floating-point value, i.e., a double. Once again they're using suffixes: the first letter says whether it's s, a scalar operation, or p, a packed (i.e., vector) operation; and the second letter says whether it's single or double precision. So when you see one of these operations, you can decode whether it's operating on a 64-bit or a 32-bit floating-point value, or on a vector of those values.
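Decoding the suffixes on one opcode family (illustrative; these are the standard SSE add variants):

```asm
addss %xmm1, %xmm0   # s,s: scalar single precision, one float
addsd %xmm1, %xmm0   # s,d: scalar double precision, one double
addps %xmm1, %xmm0   # p,s: packed single, four floats at once in 128 bits
addpd %xmm1, %xmm0   # p,d: packed double, two doubles at once
```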
Now, what about these vectors? When you start using the packed representation, you have to understand a little bit about the vector units that are on these machines. The way a vector unit works is that the processor issues an instruction, and it issues that instruction to all of the vector units. For example, you might have a vector unit with four lanes (l-a-n-e); k, the number of lanes, is the vector width. When the instruction is given, it goes to all of the lanes, and each operates on its own local piece of the vector register: you can think of the register as a very wide thing broken into several words, and when I say add two vectors together, it adds k words together and stores the result back into another vector register. In the example I just gave, k was four. The lanes are the things that each contain the integer or floating-point arithmetic, but the important thing is that they all operate in lockstep: it's not that one does one thing and another does something else; they all have to do exactly the same thing. The basic idea is that for the price of one instruction, I can command a bunch of operations to be done.

Generally, vector instructions operate in an elementwise fashion: you take the i-th element of one vector and operate on it with the i-th element of another vector, and all the lanes perform exactly the same operation. Depending on the architecture, the operands may need to be aligned, that is, the vectors have to start at a multiple of the vector length in memory. On other architectures the vectors can be shifted in memory, and usually there's a performance difference between the two cases. Some machines will not support unaligned vector operations at all, so if the compiler can't figure out that the data is aligned, sorry, your code will end up being executed in a scalar fashion. On machines that do support unaligned vector operations, unaligned is usually slower than aligned, though some machines now actually have good performance on both. So it
really depends on the machine. Also, some architectures support cross-lane operations, such as inserting or extracting subsets of vector elements, permuting, shuffling, and scatter/gather types of operations.

x86 supports several vector instruction sets. As I mentioned, there's SSE; there's AVX; there's AVX2; and now there's AVX-512, sometimes called AVX3, which is not available on the Haswell machines we'll be using. Generally, AVX and AVX2 extend the SSE instruction set by using the wider 256-bit ymm registers; and where SSE operates on at most two operands, the AVX instructions can have three operands, not just two. So you can say, add a to b and store it in c, as opposed to add a to b and store it back in b. Most of the AVX instructions are similar to traditional opcodes, with minor differences: an SSE instruction basically looks just like the traditional name, like add in this case, but then you can say do a packed add, a vector add with packed data. The v prefix says it's AVX: if you see the v, you go to the part of the manual that covers AVX; if you see the p for packed data without the v, you go to SSE. And the p prefix distinguishes an integer vector instruction. You got me; I've tried to think why p marks integer. There's just no good mnemonic letter for integer, right?
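For example (an illustrative contrast; these are standard SSE/AVX add instructions):

```asm
# SSE: two operands; the destination is also a source.
addpd  %xmm1, %xmm0          # xmm0 <- xmm0 + xmm1 (two doubles)

# AVX: v prefix, three operands, and the wider 256-bit ymm registers.
vaddpd %ymm1, %ymm2, %ymm0   # ymm0 <- ymm2 + ymm1 (four doubles)

# Packed-integer instructions carry the p prefix on the operation name.
vpaddd %ymm1, %ymm2, %ymm0   # ymm0 <- ymm2 + ymm1, eight 32-bit integers
```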
Then, in addition, they do this aliasing trick again, where the ymm registers alias the xmm registers in their low bits, so you can use both kinds of operations, but you've got to be careful about what's going on, because they just extended the registers. And of course, with AVX-512 they did another extension, to 512 bits. That's the vector stuff. You can use those instructions explicitly, but the compiler will also vectorize for you, and the homework this week takes you through some vectorization exercises. It's actually a lot of fun; we were just going over it in the staff meeting, and I think it's a really fun exercise. We introduced it last year, or maybe two years ago, but in any case it's a fun one, for my definition of fun, which I hope is your definition of fun.

Okay. Now I want to talk generally about computer architecture. I'm not going to get through all of these slides, as I said, but I want to get started on them and give you a sense of the other things
going on in the processor that you should be aware of. In 6.004 you probably talked about a five-stage processor; does anybody remember that? Five stages: there's an instruction fetch, there's an instruction decode, there's an execute, then there's a memory stage, and then you write back the values. And this is done as a pipeline. You could do all of this in one go, but then you'd have a long clock cycle and you'd only be able to do one instruction at a time; instead, they overlap them. Here's a block diagram of the five-stage processor. We read the instruction from memory in the instruction-fetch stage; then we decode it, which takes a look at the opcode, the addressing modes, and so on, and figures out what the instruction actually has to do; then we perform the ALU operations; then we read and write the data memory; and then we write back the results into the registers. That's typically how these things go for a five-stage processor. By the way, this is vastly oversimplified; you can take 6.823 if you want to learn the truth. I'm going to tell you nothing but white lies in this lecture.

Now, if you look at the Intel Haswell, the machine we're using, it actually has between 14 and 19 pipeline stages. The 14-to-19 reflects the fact that there are different paths through the processor that take different amounts of time; I think it also reflects a little bit that nobody has published Intel's internal details, so maybe we're not sure whether it's exactly 14 to 19, but it's somewhere in that range, and I believe the spread really is because different paths take different lengths of time.

So, you've seen the five-stage pipeline. What I want to do is talk about the difference between that and a modern processor by looking at several design features. We already talked about vector hardware; now I want to talk about superscalar processing, out-of-order execution, and branch prediction, a little bit. On out-of-order execution, I'm going to skip a bunch of the material, because it has to do with scoreboarding, which is
really interesting and fun, but also time-consuming; it's really interesting and fun, and it's what you learn in 6.823.

Historically, there are two ways people have made processors go faster: by exploiting parallelism and by exploiting locality. For parallelism, we already did word-level parallelism, in the bit-tricks lecture, but there's also instruction-level parallelism, so-called ILP, plus vectorization and multicore. For locality, the main mechanism is caching; I would say the fact that the design has registers also reflects locality, because the way the processor wants to operate is to fetch stuff from memory rather than operate on it in memory, which is very expensive: bring things into registers, get enough of them there, do a whole bunch of calculations, and then put the results back out. This lecture, we're talking about ILP and vectorization.

So let me talk about instruction-level parallelism. When you have, say, a five-stage pipeline, you're interested in finding opportunities to execute multiple instructions simultaneously. Instruction 1 does its instruction fetch, then its decode, and so on, taking five cycles to complete. Ideally, you'd like to start instruction 2 on cycle 2, instruction 3 on cycle 3, and so forth, so that once you're in the steady state you have five instructions executing at all times, one in each stage. That would be ideal, and it improves throughput: even though any one instruction may take a long time to complete,
I can have many instructions in the pipeline at once, each stage executing a different instruction. However, in practice this isn't what happens. In practice, you discover there are what are called pipeline stalls: when it comes time to execute an instruction, for some correctness reason it cannot execute, and it has to wait. That's a pipeline stall. It's what you want to avoid, and the compiler tries to produce code that avoids stalls. Why do stalls happen? Because of what are called hazards. There are actually two notions of hazard, and this is one of them: the dependency hazard; the other is the race-condition hazard. People call them both hazards, just like they call the second stage of compilation "compiling"; they just make up these words.

Three types of hazard can prevent an instruction from executing. First, there's what's called a structural hazard: two instructions attempt to use the same functional unit at the same time. If, for example, there's only one floating-point multiplier and two instructions try to use it at once, one has to wait. In modern processors there's a bunch of each of those units, but if you have k functional units and k+1 instructions want one, you're out of luck: one of them is going to have to wait. The second is a data hazard: an instruction depends on the result of a prior instruction in the pipeline. One instruction is computing a value that it's going to stick into, say, %rcx, and another instruction, which comes later, has to read the value from %rcx; that later instruction has to wait until the value is written before it can read it. That's a data hazard. And a control hazard is where you decide you need to make a jump, and you can't execute the next instruction because you don't know which way the jump will go: if you have a conditional jump, well, what's the next instruction after that jump? I don't know, so I have
to wait; I can't go ahead and execute the instruction after the jump, because I don't know what happened with the previous one. We're going to mostly talk about data hazards.

An instruction i can create a data hazard for a later instruction j due to a dependence between i and j. The first type is called a true dependence, or read-after-write (RAW) dependence: as in this example, one instruction adds something, storing into %rax, and the next instruction wants to read from %rax. The second instruction can't get going, it may stall, until the result of the previous one is known. Another kind is an anti-dependence, a write-after-read (WAR) dependence: this is where I want to write into a location, but I have to wait until the previous instruction has read the value, because otherwise I'd clobber the value before it gets read. And the final one is an output dependence, a write-after-write (WAW) dependence, where both instructions are trying to move something to %rax.

Why would two instructions want to write to the same location? After all, one of the values is going to be lost, so why not just drop that instruction? Well, maybe the first instruction also wants to set some flags, in addition to moving its output to that location; that's one reason. And there's one other reason... I know: there may be aliasing going on. Maybe an intervening instruction uses one of the values, so the earlier write is still consumed by something in between. That's still a good pair of quiz questions: give me two reasons.
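In assembly, the three dependence types look like this (the registers are chosen by me for illustration):

```asm
# True (RAW) dependence: the subq reads %rax, which the addq writes.
addq %rbx, %rax
subq %rax, %rcx

# Anti (WAR) dependence: the movq must not write %rax before the subq reads it.
subq %rax, %rcx
movq %rbx, %rax

# Output (WAW) dependence: both write %rax; the writes must land in order.
movq %rbx, %rax
movq %rcx, %rax
```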
Now, some arithmetic operations are complex to implement in hardware and have long latencies. Here are some sample opcodes and their latencies in cycles; they differ. Integer division is actually variable; a multiply takes about three times as long as most of the basic integer operations; floating-point multiply is about five; and then there's FMA. What's FMA? Fused multiply-add: doing both a multiply and an add as one operation. And why do we care about fused multiply-add? Because it's a floating-point multiply and add together, and that's called linear algebra: when you do matrix multiplication, you're doing dot products, which are multiplies and adds. That's where you do a lot of those.
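For instance, on machines with the FMA3 extension, which includes Haswell, a scalar double-precision fused multiply-add looks like this (a sketch, with my own register choices):

```asm
# xmm0 <- (xmm1 * xmm0) + xmm2, as one fused, singly rounded operation
vfmadd213sd %xmm2, %xmm1, %xmm0
```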
So how does the hardware accommodate these complex operations? The strategy much hardware tends to use is to have separate functional units for complex operations such as floating-point arithmetic. There may in fact be separate registers, for example the xmm registers, that only work with the floating-point units. So you have your basic five-stage pipeline, and you have another pipeline off on the side that may take multiple cycles and may be pipelined to a different depth; you basically separate out these operations, and they may be pipelined fully, partially, or not at all. Now I have a whole bunch of different functional units, and different paths that I can take through the data path of the processor. Haswell has integer, vector, and floating-point units distributed among eight different ports.

Given that, things get really complicated. If we go back to our simple diagram: suppose we have all these additional functional units. How can I exploit more instruction-level parallelism? Right now we can start one operation at a time. What might I do to get more parallelism out of the hardware I've got? What do you think the computer architects did? Even simpler than that, and implied by what you're saying: you can just fetch and issue multiple instructions per cycle. Rather than doing one per cycle, as in the basic pipelined processor, fetch several that use different parts of the pipeline, because they're not going to interfere, and keep everything busy. That's what's called a superscalar processor: it's not executing one thing at a time, it's executing multiple things at a time.

Haswell, in fact, breaks the instructions up into simpler operations called micro-ops, and it can issue four micro-ops per cycle to the rest of the pipeline. The fetch and decode stages implement optimizations on micro-op processing, including special cases for common patterns: for example, if it sees the xor of %rax and %rax, it knows %rax is being set to zero, and it doesn't even use a functional unit for that; it just does it, and it's done. There's special logic that observes that, because zeroing is such a common thing. So your processor can execute a lot of things at one time, and that's why I said earlier that if you save one add instruction, it probably makes no difference on today's processors: there's probably an idle adder lying around. If you look here, you can discover there are actually a bunch of ALUs capable of doing an add; they're all over the map in Haswell.

Still, so far we're insisting that the processor execute things in order, and that's kind of the
next stage: how do you free yourself from the tyranny of executing one instruction after the other?

The first strategy is called bypassing. Suppose you have an instruction writing into %rax, and then you're going to read %rax. Why bother waiting for the value to be stored into the register file and then pulled back out for the second instruction? Instead, let's have a bypass, a special circuit that identifies that kind of situation and feeds the result directly to the next instruction, without requiring that it go into the register file and back out. That's called bypassing; there are lots of places where things are bypassed, and we'll talk about it more. Normally you would stall, waiting for the value to be written back; with the bypass path, the second instruction can get going earlier.

What else can we do? Let's take a larger code example. Given the amount of time, what I'm going to do is basically say: you can go through and figure out where the read-after-write dependences and the write-after-read dependences are, and they're all over the place. If you look at the dependences that I just flashed through, you discover, oh, each instruction right now has to wait for the previous one before it can get started. Some of that is just issue order: if you issue in order, you can't start the second instruction until you've started the first, or at least finished its first stage. But there's also, for example, a data dependence between the second and third instructions, which are both using %xmm2, and that blocks us. So one question is: why not do a little better by viewing this as a graph, and figuring out the best way through the graph? There are a bunch of tricks you can use there, which I'll run through very quickly; you can take a look at the slides. You can discover that some of these dependences are not real dependences, and as long as you're willing to execute things out of order, and keep track of what you did, it's perfectly fine: if an instruction isn't actually dependent, just go ahead and execute it, and you can advance things.

The other trick is what's called register renaming. If I want to write to something, but I have to wait for something else to read from it first, a write-after-read dependence, then what I can do is just rename the register, so that I have a fresh place to write to that stands for the same architectural register. There's a fairly complex mechanism called scoreboarding that does that. So anyway, you can take a look at all of these tricks.
That's the part I was going to skip over, and indeed I don't have time for it; you don't have to know the details of that renaming and reordering material, but it's in the slides if you're interested. The last thing I want to mention is branch prediction. When you come to a branch, you can have a hazard because the outcome is known too late. In that case, what processors do is what's called speculative execution, which you've probably heard of. Basically, that says: I'm going to guess the outcome of the branch and execute on that assumption; for instance, when a branch is encountered, you assume it's taken, and otherwise you execute normally. If you're right, everything is hunky-dory; if you're wrong, it costs you, because you have to undo that speculative computation, and the effect is sort of like stalling. You don't want that to happen. A mispredicted branch on Haswell costs about 15 to 20 cycles. Most machines use a branch predictor to decide which way a branch will go; there's a little bit of material here about how you can tell whether a branch is going to be predicted well, and you can take a look at that on your own.

Sorry to rush a little at the end. I knew I wasn't going to get through all of this, but it's in the slides, which we'll put up, and it's really kind of interesting stuff. Once again, remember that I'm dealing with this one level below what you really need; but it's really helpful to understand that layer, so you have a deep understanding of why certain software optimizations work and don't work. Sound good? Okay. Good luck on finishing your Project 1s.