Does the Instruction Queue in Intel CPUs provide static branch prediction?

Tags:

performance

cpu-architecture

branch-prediction

x86

assembly

Volume 3 of the Intel manuals contains the following description of a hardware event counter:

BACLEAR_FORCE_IQ

Counts number of times a BACLEAR was forced by the Instruction Queue. The IQ is also responsible for providing conditional branch prediction direction based on a static scheme and dynamic data provided by the L2 Branch Prediction Unit. If the conditional branch target is not found in the Target Array and the IQ predicts that the branch is taken, then the IQ will force the Branch Address Calculator to issue a BACLEAR. Each BACLEAR asserted by the BAC generates approximately an 8 cycle bubble in the instruction fetch pipeline.

I always thought the Branch Address Calculator performs the static prediction algorithm (when the Branch Target Buffer contains no branch entry)?

Can anybody confirm which of the two is correct? I cannot find anything.


asked Jul 26 '15 23:07

user997112

2 Answers

Yes. Modern Intel processors use at least one static prediction technique and at least one dynamic prediction technique (such as the L2 BPU mentioned in the description of the performance event). Static prediction is discussed in the Intel optimization manual, but the manual does not clearly say where exactly static prediction happens. However, the descriptions of multiple performance events related to branch prediction, such as BACLEAR_FORCE_IQ, indicate that it is implemented in the IQ unit. I think this is the place where static branch prediction makes the most sense.

The BPU first guesses where the branch instructions are most likely to be in the (to be) fetched instruction stream bytes (32 bytes per cycle in Haswell, twice the fetch unit width). Then, based on the virtual addresses of the instructions that are predicted to be control transfer instructions, the BPU consults its buffers (specifically, the "branch target buffer" or the "target array") to make more predictions regarding the predicted branches (direction and target address). However, in some cases the BPU misses in its buffers, or it might mispredict the locations of the branch instructions in the instruction stream bytes, or there could be more branches than the BPU can handle. Whatever the case, whatever predictions it makes all get passed along with the instruction stream bytes to the instruction queue unit. This is the earliest place in the pipeline where it is known where each instruction begins and ends and which of the instructions may transfer control.

The IQ is also responsible for providing conditional branch prediction direction based on a static scheme and dynamic data provided by the L2 Branch Prediction Unit.

This part of the event description should make sense to you now. Note that static branch prediction is mostly only used to predict directions, not target addresses.
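The event description does not say which static scheme the IQ uses. Purely as an illustration of a direction-only static predictor, here is a minimal sketch of the classic backward-taken/forward-not-taken heuristic. The function name and the choice of heuristic are my assumptions, not something stated in the manual text quoted here.

```python
def static_predict_taken(displacement: int) -> bool:
    """Hypothetical static direction predictor (an assumption, not Intel's documented logic).

    Backward-taken / forward-not-taken: a negative relative displacement
    means the branch jumps backwards, which usually indicates a loop
    back-edge, so it is guessed as taken. Only the sign of the displacement
    is needed, so this predicts a direction without computing a target.
    """
    return displacement < 0


if __name__ == "__main__":
    print(static_predict_taken(-16))  # backward branch: guessed taken
    print(static_predict_taken(32))   # forward branch: guessed not taken
```

Note that only the direction is produced here, which matches the point that static prediction is about directions rather than target addresses.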

If the conditional branch target is not found in the Target Array and the IQ predicts that the branch is taken...

The simple static branch predictor is only used when the BPU fails to make a prediction. So the first part of the condition makes sense. The second part, however, basically says that if the IQ predicts not taken, then nothing needs to be done. This indicates that the fetch unit will by default continue fetching code from the fall-through path on a BPU failure.

...then the IQ will force the Branch Address Calculator to issue a BACLEAR

So if the static predictor predicts taken, then it's better to do something about that. One intuitive thing is to flush everything above the IQ and tell the fetch unit to stop fetching bytes. That's what the BACLEAR signal does. This situation is called a frontend resteer. It'd be nice if we could tell the fetch unit where to fetch from as well, but we may not know the branch target address yet. Even if the address is embedded within the instruction (as an immediate operand), the IQ may not be able to just extract it and forward it to the fetch unit. We can just do nothing and wait until the address is calculated, thereby potentially saving power and energy. Or we can provide the BPU with the address (now that we know exactly where the branch instruction is) and let the BPU try again. Perhaps the purpose of the "Branch Address Calculator" is not only to send the BACLEAR signal, but also to try to determine the address as early as possible.

Each BACLEAR asserted by the BAC generates approximately an 8 cycle bubble in the instruction fetch pipeline.

It's not clear to me what the 8 cycle bubble accounts for. One possible interpretation is that the flushing caused by BACLEAR takes about 8 cycles, but the fetch unit might still be idle after that, waiting for the address from which it should fetch. Determining the target address may take more than 8 cycles, depending on how it gets calculated and on the surrounding code. Or it could mean that, on average, it takes only about 8 cycles to fully resteer the front end and begin fetching from the target address. In addition, this 8-cycle penalty may not actually be on the critical path, so it may not impact overall performance.
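If you just want a rough feel for how much this could matter, you can multiply the counter value by the quoted bubble size. The sketch below does that. The function names and the sample numbers are made up for illustration, and the result is at best a rough estimate, because the manual only says "approximately" 8 cycles and the bubble may be hidden behind other work.

```python
def estimated_bubble_cycles(baclear_count: int, bubble_cycles: int = 8) -> int:
    """Rough estimate of fetch-bubble cycles: events times the quoted ~8 cycles each."""
    return baclear_count * bubble_cycles


def bubble_fraction(baclear_count: int, total_cycles: int, bubble_cycles: int = 8) -> float:
    """Estimated bubble cycles as a fraction of all cycles, capped at 1.0.

    This is a ceiling on the impact under the ~8 cycle assumption; the real
    cost is lower whenever the bubble is not on the critical path.
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return min(1.0, estimated_bubble_cycles(baclear_count, bubble_cycles) / total_cycles)


if __name__ == "__main__":
    # Hypothetical counter readings, not measured data.
    print(estimated_bubble_cycles(1000))
    print(bubble_fraction(1000, 80000))
```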

In summary, BACLEAR_FORCE_IQ occurs when a conditional branch (and only conditional branches) misses in the BPU (not any other BPU failure) and the IQ predicts taken.
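That summary can be written down as a simple predicate. This is only a sketch of my reading of the event description; the `Branch` record and its field names are hypothetical and do not correspond to any real hardware interface.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    """Hypothetical view of a branch as seen at the IQ stage."""
    is_conditional: bool      # conditional branch (e.g. jcc) vs. any other control transfer
    target_array_hit: bool    # did the BPU find a target for it in the Target Array?
    iq_predicts_taken: bool   # direction predicted by the IQ


def counts_as_baclear_force_iq(b: Branch) -> bool:
    """True when, per the event description, the IQ would force a BACLEAR.

    All three conditions must hold: the branch is conditional, its target
    was not found in the Target Array, and the IQ predicts taken.
    """
    return b.is_conditional and not b.target_array_hit and b.iq_predicts_taken


if __name__ == "__main__":
    print(counts_as_baclear_force_iq(Branch(True, False, True)))   # counted
    print(counts_as_baclear_force_iq(Branch(True, False, False)))  # IQ says not taken: keep fetching fall-through
    print(counts_as_baclear_force_iq(Branch(True, True, True)))    # BPU already had a target
```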

I think that the BAC is used to handle any branch misprediction situation, not just by the IQ. Other performance events indicate that.

answered Oct 07 '22 15:10

Hadi Brais

If the conditional branch target is not found in the Target Array

How can it not be found? You mask the address with a bit mask to get the index into the table, and read the next branch target from that entry.

Well, if after reading the result you check it and find that the address you looked up does not match the tag on the result, you have a "not taken" result.
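To make the index-plus-tag idea concrete, here is a toy direct-mapped table. The class name, the 4-bit index, and the addresses are all invented for illustration; a real target array is organised differently (set-associative, partial tags, and so on).

```python
class ToyTargetArray:
    """Toy direct-mapped branch target table (illustrative only)."""

    def __init__(self, index_bits: int = 4):
        self.index_bits = index_bits
        self.mask = (1 << index_bits) - 1
        self.entries = [None] * (1 << index_bits)  # each entry: (tag, target) or None

    def _split(self, addr: int):
        # Low bits select the table slot, the remaining high bits form the tag.
        return addr & self.mask, addr >> self.index_bits

    def insert(self, branch_addr: int, target: int) -> None:
        idx, tag = self._split(branch_addr)
        self.entries[idx] = (tag, target)

    def lookup(self, branch_addr: int):
        """Return the stored target on a tag match, or None on a miss."""
        idx, tag = self._split(branch_addr)
        entry = self.entries[idx]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None  # empty slot or tag mismatch: "not found"


if __name__ == "__main__":
    ta = ToyTargetArray()
    ta.insert(0x1234, 0x2000)
    print(ta.lookup(0x1234))  # same address: hit
    print(ta.lookup(0x5234))  # same index, different tag: miss
```

The second lookup lands on the same slot but fails the tag check, which is the "not found in the Target Array" case.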

At this point we get to the second part of the statement.

and the IQ predicts that the branch is taken

So the branch target lookup says "not taken" while the IQ predicts that the branch will be taken: we have a contradiction.

To resolve the contradiction, the IQ wins: the branch target is just "if we jump, we jump here", whereas the IQ predicts whether we jump or not based on a lot more logic.

Hence

then the IQ will force the Branch Address Calculator to issue a BACLEAR. Each BACLEAR asserted by the BAC generates approximately an 8 cycle bubble in the instruction fetch pipeline.

Which is good in a 14-19 stage pipeline. The 8 cycles apply if the IQ can read the actual target address from the instruction (combined with the PC); if the value needs to be read from a register (which has possibly not yet been retired), it could take a bit longer.

answered Oct 07 '22 17:10

Surt
