4.2

(i) RAW -- data flow dependence; the compiler can reschedule the instructions by changing the offset.
(ii) No dependence; the compiler can reschedule.
(iii) Possible dependence between the store addresses; in general the compiler cannot reschedule them. However, if someone mentioned that by doing pointer analysis the compiler might find the dependence and hence reschedule the instructions, that was also correct.
(iv) The store is control dependent on the branch; the compiler cannot reschedule since execution is given to be non-speculative.

4.18

The predicated code is

         DSUB    R1, R13, R14
         CMP.NE  pT, pF = R1, R0
    (pF) DADDI   R2, R2, #1
    (pF) SD      0(R7), R2
    (pT) MUL.D   F0, F0, F2
    (pT) ADD.D   F0, F4
    (pT) S.D     0(R8), F0

The difference in dependences between the two codes is that the control dependences in the first part are converted to data dependences in the second part. The advantages of predication are that it may help improve performance by avoiding flushes on branch misprediction and by providing more instruction scheduling opportunities.

5.2

time to load data and instructions from memory to L2 cache
  = 60ns                            // access time
  + 64/16 cycles @ 133 MHz          // block size of L2 / bus width * bus speed
  + 0.5 * (60 + 64/16 @ 133 MHz)    // 50% of the blocks replaced are dirty
  = 135ns

miss penalty for L2 = time to load data from memory to L2 = 135ns

time to load instructions from L2 to L1-I in case of a hit in L2 (hit time L2 for instructions)
  = 15ns                            // access time
  + 32/16 cycles @ 266 MHz          // block size of L1 / bus width * bus speed
  = 22.5ns

time to load data from L2 to L1-D in case of a hit in L2 (hit time L2 for data)
  = 15ns                            // access time
  + 16/16 cycles @ 266 MHz          // block size of L1 / bus width * bus speed
  = 18.75ns

miss penalty L1-I
  = 22.5ns                          // hit time L2 for instructions
  + 0.2 * 135ns                     // miss ratio L2 * miss penalty L2
  = 49.5ns

miss penalty L1-D
  = 18.75ns                         // hit time L2 for data
  + 0.2 * 135ns                     // miss ratio L2 * miss penalty L2
  = 45.75ns

(a)
avg Icache access time
  = 0                               // no penalty on hits
  + 0.02 * 49.5ns                   // miss ratio L1 * miss penalty L1
  = 0.99ns

(b)
avg Dcache read time                // similar to part (a)
  = 0
  + 0.05 * 45.75ns                  // miss ratio L1-D * miss penalty L1-D
  = 2.2875ns

(c)
avg Dcache write time
  = 95% * 0                         // write buffer eliminates stalls on 95% of the writes
  + 0.05 * (15 + 3.76)              // since L1 is a write-through cache, all the writes go to L2 whether a hit or a miss in L1
  + 0.05 * 0.2 * 135                // memory access required for those that miss in L2
  = 2.288ns

(d)
overall CPI
  = 0.7
  + 0.99 * 1.1                      // instructions
  + 0.2 * 2.2875 * 1.1              // loads
  + 0.05 * 2.288 * 1.1              // stores
  = 2.418

(e)
CPI for 2.1 GHz machine (calculated as in part (d)) = 3.98
Speedup = Time1.1 / Time2.1
        = (CPI1.1 * cycle-time1.1) / (CPI2.1 * cycle-time2.1)
        = 1.16
So the speedup is 16%.

5.18

(a) If we assume non-self-modifying code then it is not required; otherwise it is required.
(b) overhead = 64-36 + 8 = 36 bits per tag
For the same cache size, as the block size increases, the number of blocks decreases, but the number of (index + offset) bits remains the same and hence the tag bits per block remain the same. But a larger block size means that there are fewer tag bits per data bit, and hence the overhead decreases.
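
The 5.2 arithmetic above can be cross-checked with a short script. This is only a sketch: the variable and function names are made up for illustration, the bus cycle times are assumed to be 7.5ns (133 MHz) and 3.75ns (266 MHz) as the rounded figures in the solution suggest, and the 5% L1-D miss ratio is inferred from the 2.2875ns result in part (b). Part (c) of the solution rounds the 266 MHz cycle to 3.76ns, so the last digit there can differ slightly.

```python
# Cross-check of the 5.2 numbers. Names are illustrative, not from the exercise.

MEM_ACCESS_NS = 60.0   # main memory access time
MEM_CYCLE_NS = 7.5     # assumed: one 133 MHz bus cycle, rounded
L2_ACCESS_NS = 15.0    # L2 access time
L2_CYCLE_NS = 3.75     # assumed: one 266 MHz bus cycle, rounded
BUS_BYTES = 16         # both buses are 16 bytes wide

L2_BLOCK, L1I_BLOCK, L1D_BLOCK = 64, 32, 16   # block sizes in bytes
L1I_MISS = 0.02        # L1-I miss ratio
L1D_MISS = 0.05        # L1-D miss ratio (inferred from the 2.2875ns result)
L2_MISS = 0.2          # fraction of L2 references that go to memory
DIRTY = 0.5            # fraction of replaced L2 blocks that are dirty
WRITE_STALL = 0.05     # writes not hidden by the write buffer
BASE_CPI = 0.7
LOADS, STORES = 0.2, 0.05   # per-instruction frequencies used in part (d)


def transfer_ns(block_bytes, cycle_ns):
    """Time to move one block over a 16-byte bus."""
    return block_bytes / BUS_BYTES * cycle_ns


# Memory -> L2, including write-back of dirty victims half of the time.
l2_miss_penalty = (1 + DIRTY) * (MEM_ACCESS_NS + transfer_ns(L2_BLOCK, MEM_CYCLE_NS))

# L1 miss penalties: L2 hit time plus the expected cost of an L2 miss.
l1i_miss_penalty = L2_ACCESS_NS + transfer_ns(L1I_BLOCK, L2_CYCLE_NS) + L2_MISS * l2_miss_penalty
l1d_miss_penalty = L2_ACCESS_NS + transfer_ns(L1D_BLOCK, L2_CYCLE_NS) + L2_MISS * l2_miss_penalty

icache_ns = L1I_MISS * l1i_miss_penalty                      # part (a)
dread_ns = L1D_MISS * l1d_miss_penalty                       # part (b)
dwrite_ns = WRITE_STALL * (L2_ACCESS_NS + L2_CYCLE_NS        # part (c)
                           + L2_MISS * l2_miss_penalty)


def cpi(clock_ghz):
    """Overall CPI; stall times in ns are converted to cycles via the clock rate."""
    stall_ns = icache_ns + LOADS * dread_ns + STORES * dwrite_ns
    return BASE_CPI + clock_ghz * stall_ns


# Part (e): execution time per instruction is CPI / clock rate.
speedup = (cpi(1.1) / 1.1) / (cpi(2.1) / 2.1)

print(f"L2 miss penalty  = {l2_miss_penalty:.2f} ns")
print(f"avg Icache time  = {icache_ns:.4f} ns")
print(f"avg Dcache read  = {dread_ns:.4f} ns")
print(f"avg Dcache write = {dwrite_ns:.4f} ns")
print(f"CPI @ 1.1 GHz    = {cpi(1.1):.3f}")
print(f"CPI @ 2.1 GHz    = {cpi(2.1):.2f}")
print(f"speedup          = {speedup:.2f}")
```

Multiplying the stall time in ns by the clock rate in GHz gives stall cycles, which is why the 1.1 factor appears in each term of part (d).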