Herman Code 🚀

Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs

February 20, 2025


Modern software development frequently involves working with massive datasets, pushing the boundaries of hardware performance. A seemingly innocuous change, such as switching from a 32-bit loop counter to a 64-bit one, can sometimes lead to surprising performance impacts, especially when using specialised instructions like _mm_popcnt_u64 on Intel CPUs. This post delves into the surprising performance deviations that can arise from this seemingly simple modification, exploring the underlying causes and offering possible solutions.

The Curious Case of the 64-Bit Counter

Population count, often used in bioinformatics, cryptography, and other fields, counts the number of set bits in a given piece of data. Intel's _mm_popcnt_u64 intrinsic provides an efficient way to perform this operation on 64-bit integers. Nevertheless, benchmarks reveal a perplexing phenomenon: replacing a 32-bit loop counter with its 64-bit counterpart can lead to significant performance regressions in loops using this instruction. This anomaly contradicts the common expectation that a wider counter should, at worst, have a neutral performance impact.
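As a baseline, here is a minimal sketch of what using the intrinsic looks like, independent of the benchmark discussed later (the buffer size and fill value are arbitrary; compile for a popcnt-capable target, e.g. with -march=native or -mpopcnt on GCC/Clang):

#include <cstdint>
#include <iostream>
#include <vector>
#include <x86intrin.h>

int main() {
    // Arbitrary example data: 1024 64-bit words.
    std::vector<uint64_t> buffer(1024, 0x0123456789ABCDEFull);

    uint64_t count = 0;
    for (std::size_t i = 0; i < buffer.size(); ++i)
        count += _mm_popcnt_u64(buffer[i]);   // set bits in one 64-bit word

    std::cout << "total set bits: " << count << std::endl;
}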

This performance deviation isn't universal. It is predominantly observed on specific Intel CPU microarchitectures and may vary depending on the compiler and optimization flags used. Understanding the underlying hardware microarchitecture is crucial to deciphering this performance puzzle.

One possible explanation lies in how memory addresses are accessed and processed within the CPU's cache hierarchy. 64-bit indexing can potentially lead to less efficient cache utilization, particularly when dealing with large arrays or data structures. This cache inefficiency can manifest as increased memory access latency, thereby slowing down overall loop execution.

Unmasking the Performance Bottleneck

Identifying the root cause of performance deviations requires a deep dive into the CPU's microarchitecture. Factors such as cache line size, prefetching behaviour, and memory alignment can all play a role. Tools like perf and VTune Amplifier can be invaluable for profiling the code and pinpointing the bottlenecks.

For instance, perf can reveal cache miss rates and branch mispredictions, while VTune offers detailed insight into instruction-level performance. By analyzing the profiling data, developers can gain a clearer understanding of why the 64-bit counter introduces performance regressions in specific situations.

Consider this scenario: a loop iterates over a large array, performing a population count on each 64-bit element. If the array size exceeds the L3 cache capacity, using a 64-bit counter might cause more frequent cache misses, leading to performance degradation. This happens because the larger address space necessitates touching different cache lines more often.

Mitigation Strategies and Best Practices

Several strategies can help mitigate the performance issues associated with 64-bit counters and _mm_popcnt_u64:

  • Loop chunking: dividing the loop into smaller chunks can improve cache locality and reduce cache misses (see the sketch after this list).
  • Data alignment: ensuring data alignment can optimize memory access patterns and reduce latency.
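A minimal sketch of both ideas follows, assuming C++17 for std::aligned_alloc; the popcount_range helper, the chunk size, and the fill pattern are illustrative rather than tuned values:

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>

// Illustrative helper: popcount over a sub-range of the buffer.
static uint64_t popcount_range(const uint64_t* p, std::size_t n) {
    uint64_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        count += _mm_popcnt_u64(p[i]);
    return count;
}

int main() {
    const std::size_t words = 1 << 20;                  // 8 MiB of data
    // 64-byte alignment matches a typical cache-line size.
    uint64_t* buffer = static_cast<uint64_t*>(
        std::aligned_alloc(64, words * sizeof(uint64_t)));
    for (std::size_t i = 0; i < words; ++i) buffer[i] = i;  // arbitrary fill

    const std::size_t chunk = 32 * 1024;                // 256 KiB chunks (illustrative)
    uint64_t total = 0;
    for (std::size_t off = 0; off < words; off += chunk) {
        const std::size_t n = (off + chunk <= words) ? chunk : (words - off);
        total += popcount_range(buffer + off, n);
    }

    std::free(buffer);
    return total ? 0 : 1;   // keep the result live so it is not optimized away
}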

Compiler flags, such as those enabling vectorization and loop unrolling, can further affect performance. Experimenting with different compiler versions and flags is often necessary to find the optimal combination for a specific hardware and software environment.
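For example, with GCC one might start from something like the following (these are standard GCC flags; popcount_bench.cpp is a placeholder name, and the best combination varies by compiler version and target CPU):

g++ -O3 -march=native -funroll-loops -std=c++17 popcount_bench.cpp -o popcount_bench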

Beyond the Counter: Broader Implications

The performance deviations observed with _mm_popcnt_u64 highlight a broader issue in modern software development: the increasing complexity of hardware-software interactions. Seemingly minor code modifications can have unexpected performance consequences, especially when using specialised instructions or working with large datasets.

Understanding the underlying hardware architecture and using appropriate profiling tools are crucial for optimizing performance in such situations. This case study serves as a reminder that performance optimization is an ongoing process that requires continuous monitoring and adaptation to evolving hardware and software landscapes.

  1. Profile your code using tools like perf or VTune Amplifier (example commands below).
  2. Analyze the profiling data to identify bottlenecks.
  3. Experiment with mitigation strategies such as loop chunking and data alignment.
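A typical starting point with Linux perf might look like this (cycles, instructions, cache-misses, and branch-misses are standard perf events; ./popcount_bench and its argument are placeholders for your own benchmark binary):

perf stat -e cycles,instructions,cache-misses,branch-misses ./popcount_bench 1024
perf record ./popcount_bench 1024
perf report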


Addressing Common Concerns

Q: Does this issue affect all Intel CPUs?

A: No, the performance impact varies depending on the specific microarchitecture and other factors such as compiler optimizations. It is crucial to benchmark your code on the target hardware.

As we've seen, even seemingly straightforward changes like switching to a 64-bit loop counter can have surprising performance implications, especially when using specialised instructions like _mm_popcnt_u64. Careful investigation, profiling, and optimization are vital for maximizing performance in these situations. By understanding the underlying hardware and software interactions, developers can navigate these complexities and build high-performing applications. Explore further optimizations by delving into SIMD instructions and memory management techniques to fully unlock the potential of your code. Consult resources such as Intel's developer manuals and online forums for in-depth information and community support. Don't let hidden performance bottlenecks hold you back. Start optimizing today!


Question & Answer :
I was trying for the quickest manner to popcount ample arrays of information. I encountered a precise bizarre consequence: Altering the loop adaptable from unsigned to uint64_t made the show driblet by 50% connected my Microcomputer.

The Benchmark

#include <iostream>
#include <chrono>
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    free(charbuffer);
}

As you can see, we create a buffer of random data, with the size being x megabytes where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times. We measure the times for the popcount. In the upper case, the inner loop variable is unsigned, in the lower case, the inner loop variable is uint64_t. I thought that this should make no difference, but the opposite is the case.

The (absolutely crazy) results

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB random data):

  • unsigned 41959360000 0.401554 sec 26.113 GB/s
  • uint64_t 41959360000 0.759822 sec 13.8003 GB/s

As you can see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):

clang++ -O3 -march=native -std=c++11 test.cpp -o test

Result: test 1

  • unsigned 41959360000 0.398293 sec 26.3267 GB/s
  • uint64_t 41959360000 0.680954 sec 15.3986 GB/s

So, it is roughly the same result and is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant 1, so I change:

uint64_t size = atol(argv[1]) << 20;

to

uint64_t size = 1 << 20;

Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:

  • unsigned 41959360000 0.509156 sec 20.5944 GB/s
  • uint64_t 41959360000 0.508673 sec 20.6139 GB/s

Now, both versions are equally fast. However, the unsigned version got even slower! It dropped from 26 to 20 GB/s, so replacing a non-constant by a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:

  • unsigned 41959360000 0.677009 sec 15.4884 GB/s
  • uint64_t 41959360000 0.676909 sec 15.4906 GB/s

Wait, what? Now, both versions dropped to the slow figure of 15 GB/s. Thus, replacing a non-constant by a constant value even leads to slow code in both cases for Clang!

I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be Haswell-specific. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.

More insanity, please!

Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

static uint64_t size=atol(argv[1])<<20;

Here are my results with g++:

  • unsigned 41959360000 0.396728 sec 26.4306 GB/s
  • uint64_t 41959360000 0.509484 sec 20.5811 GB/s

Yay, yet another alternative. We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s up to the 20 GB/s version! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++; clang++ does not seem to care about static.

My question

Can you explain these results? Especially:

  • How can there be such a difference between u32 and u64?
  • How can replacing a non-constant by a constant buffer size trigger less optimal code?
  • How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my colleague's machine!

I know that optimization is a tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time and that small factors like a constant buffer size could again mix up the results completely. Of course, I always want to have the version that is able to popcount 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with the most performance?

The Disassembly

Here is the disassembly for the various results:

26 GB/s version from g++ / u32 / non-const bufsize:

0x400af8:
lea     0x1(%rdx),%eax
popcnt  (%rbx,%rax,8),%r9
lea     0x2(%rdx),%edi
popcnt  (%rbx,%rcx,8),%rax
lea     0x3(%rdx),%esi
add     %r9,%rax
popcnt  (%rbx,%rdi,8),%rcx
add     $0x4,%edx
add     %rcx,%rax
popcnt  (%rbx,%rsi,8),%rcx
add     %rcx,%rax
mov     %edx,%ecx
add     %rax,%r14
cmp     %rbp,%rcx
jb      0x400af8

13 GB/s version from g++ / u64 / non-const bufsize:

0x400c00:
popcnt  0x8(%rbx,%rdx,8),%rcx
popcnt  (%rbx,%rdx,8),%rax
add     %rcx,%rax
popcnt  0x10(%rbx,%rdx,8),%rcx
add     %rcx,%rax
popcnt  0x18(%rbx,%rdx,8),%rcx
add     $0x4,%rdx
add     %rcx,%rax
add     %rax,%r12
cmp     %rbp,%rdx
jb      0x400c00

15 GB/s version from clang++ / u64 / non-const bufsize:

0x400e50:
popcnt  (%r15,%rcx,8),%rdx
add     %rbx,%rdx
popcnt  0x8(%r15,%rcx,8),%rsi
add     %rdx,%rsi
popcnt  0x10(%r15,%rcx,8),%rdx
add     %rsi,%rdx
popcnt  0x18(%r15,%rcx,8),%rbx
add     %rdx,%rbx
add     $0x4,%rcx
cmp     %rbp,%rcx
jb      0x400e50

20 GB/s version from g++ / u32&u64 / const bufsize:

0x400a68:
popcnt  (%rbx,%rdx,1),%rax
popcnt  0x8(%rbx,%rdx,1),%rcx
add     %rax,%rcx
popcnt  0x10(%rbx,%rdx,1),%rax
add     %rax,%rcx
popcnt  0x18(%rbx,%rdx,1),%rsi
add     $0x20,%rdx
add     %rsi,%rcx
add     %rcx,%rbp
cmp     $0x100000,%rdx
jne     0x400a68

15 GB/s version from clang++ / u32&u64 / const bufsize:

0x400dd0:
popcnt  (%r14,%rcx,8),%rdx
add     %rbx,%rdx
popcnt  0x8(%r14,%rcx,8),%rsi
add     %rdx,%rsi
popcnt  0x10(%r14,%rcx,8),%rdx
add     %rsi,%rdx
popcnt  0x18(%r14,%rcx,8),%rbx
add     %rdx,%rbx
add     $0x4,%rcx
cmp     $0x20000,%rcx
jb      0x400dd0

Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could come from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and good. Can anybody explain this?

Lessons learned

No matter what the answer to this question will be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any relation to the hot code. I have never thought about what type to use for a loop variable, but as you see such a minor change can make a 100% difference! Even the storage class of a buffer can make a huge difference, as we saw with the insertion of the static keyword in front of the size variable! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.

The interesting thing is also that the performance difference is still so high even though I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.

Culprit: False Data Dependency (and the compiler isn't even aware of it)

On Sandy/Ivy Bridge and Haswell processors, the instruction:

popcnt src, dest 

seems to have a false dependency on the destination register dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing. This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake).

Skylake fixed this for lzcnt and tzcnt.
Cannon Lake (and Ice Lake) fixed this for popcnt.
bsf/bsr have a true output dependency: output unmodified for input=0. (But no way to take advantage of that with intrinsics - only AMD documents it and compilers don't expose it.)

(Yes, these instructions all run on the same execution unit.)


This dependency doesn't just hold up the 4 popcnts from a single loop iteration. It can carry across loop iterations, making it impossible for the processor to parallelize different loop iterations.

The unsigned vs. uint64_t and other tweaks don't directly affect the problem. But they influence the register allocator, which assigns registers to the variables.

In your case, the speeds are a direct result of what is stuck onto the (false) dependency chain, depending on what the register allocator decided to do.

  • 13 GB/s has a chain: popcnt-add-popcnt-popcnt → next iteration
  • 15 GB/s has a chain: popcnt-add-popcnt-add → next iteration
  • 20 GB/s has a chain: popcnt-popcnt → next iteration
  • 26 GB/s has a chain: popcnt-popcnt → next iteration

The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.
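One source-level mitigation that follows from this analysis is to give each popcnt its own accumulator, the same split-count idea the inline-assembly test below uses. A minimal sketch is shown here; popcount_split is a hypothetical name, the loop assumes the word count is a multiple of four, and distinct popcnt destination registers are still up to the register allocator, so only the generated assembly can confirm the chains are actually broken:

#include <cstdint>
#include <x86intrin.h>

// Sketch only: four independent accumulators so that no single register
// has to serialize every popcnt+add of an iteration.
uint64_t popcount_split(const uint64_t* buffer, uint64_t words) {
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (uint64_t i = 0; i < words; i += 4) {   // assumes words % 4 == 0
        c0 += _mm_popcnt_u64(buffer[i]);
        c1 += _mm_popcnt_u64(buffer[i + 1]);
        c2 += _mm_popcnt_u64(buffer[i + 2]);
        c3 += _mm_popcnt_u64(buffer[i + 3]);
    }
    return c0 + c1 + c2 + c3;
}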


To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the count variable to break all other dependencies that might mess with the benchmarks.

Here are the results:

Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom)

  • GCC 4.6.3: g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native
  • Ubuntu 12

Different Registers: 18.6195 GB/s

.L4:
movq    (%rbx,%rax,8), %r8
movq    8(%rbx,%rax,8), %r9
movq    16(%rbx,%rax,8), %r10
movq    24(%rbx,%rax,8), %r11
addq    $4, %rax

popcnt  %r8, %r8
add     %r8, %rdx
popcnt  %r9, %r9
add     %r9, %rcx
popcnt  %r10, %r10
add     %r10, %rdi
popcnt  %r11, %r11
add     %r11, %rsi

cmpq    $131072, %rax
jne     .L4

Same Register: 8.49272 GB/s

.L9:
movq    (%rbx,%rdx,8), %r9
movq    8(%rbx,%rdx,8), %r10
movq    16(%rbx,%rdx,8), %r11
movq    24(%rbx,%rdx,8), %rbp
addq    $4, %rdx

# This time reuse "rax" for all the popcnts.
popcnt  %r9, %rax
add     %rax, %rcx
popcnt  %r10, %rax
add     %rax, %rsi
popcnt  %r11, %rax
add     %rax, %r8
popcnt  %rbp, %rax
add     %rax, %rdi

cmpq    $131072, %rdx
jne     .L9

Same Register with broken chain: 17.8869 GB/s

.L14:
movq    (%rbx,%rdx,8), %r9
movq    8(%rbx,%rdx,8), %r10
movq    16(%rbx,%rdx,8), %r11
movq    24(%rbx,%rdx,8), %rbp
addq    $4, %rdx

# Reuse "rax" for all the popcnts.
xor     %rax, %rax    # Break the cross-iteration dependency by zeroing "rax".
popcnt  %r9, %rax
add     %rax, %rcx
popcnt  %r10, %rax
add     %rax, %rsi
popcnt  %r11, %rax
add     %rax, %r8
popcnt  %rbp, %rax
add     %rax, %rdi

cmpq    $131072, %rdx
jne     .L14

So what went wrong with the compiler?

It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of it.

popcnt isn't exactly the most used instruction. So it's not really a surprise that a major compiler could miss something like this. There also appears to be no documentation anywhere that mentions this problem. If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.

(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)

Why does the CPU have such a false dependency?

We can speculate: it runs on the same execution unit as bsf/bsr, which do have an output dependency. (How is POPCNT implemented in hardware?) For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.

Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but not others.

AMD processors do not appear to have this false dependency.


The full test code is below for reference:

#include <iostream>
#include <chrono>
#include <x86intrin.h>

int main(int argc, char* argv[]) {

   using namespace std;
   uint64_t size=1<<20;

   uint64_t* buffer = new uint64_t[size/8];
   char* charbuffer=reinterpret_cast<char*>(buffer);
   for (unsigned i=0;i<size;++i) charbuffer[i]=rand()%256;

   uint64_t count,duration;
   chrono::time_point<chrono::system_clock> startP,endP;
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "popcnt %4, %4  \n\t"
                "add %4, %0     \n\t"
                "popcnt %5, %5  \n\t"
                "add %5, %1     \n\t"
                "popcnt %6, %6  \n\t"
                "add %6, %2     \n\t"
                "popcnt %7, %7  \n\t"
                "add %7, %3     \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
           << (10000.0*size)/(duration) << " GB/s" << endl;
   }
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "popcnt %4, %%rax   \n\t"
                "add %%rax, %0      \n\t"
                "popcnt %5, %%rax   \n\t"
                "add %%rax, %1      \n\t"
                "popcnt %6, %%rax   \n\t"
                "add %%rax, %2      \n\t"
                "popcnt %7, %%rax   \n\t"
                "add %%rax, %3      \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                : "rax"
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "Chain 4   \t" << count << '\t' << (duration/1.0E9) << " sec \t"
           << (10000.0*size)/(duration) << " GB/s" << endl;
   }
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "xor %%rax, %%rax   \n\t"   // <--- Break the chain.
                "popcnt %4, %%rax   \n\t"
                "add %%rax, %0      \n\t"
                "popcnt %5, %%rax   \n\t"
                "add %%rax, %1      \n\t"
                "popcnt %6, %%rax   \n\t"
                "add %%rax, %2      \n\t"
                "popcnt %7, %%rax   \n\t"
                "add %%rax, %3      \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                : "rax"
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "Broken Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
           << (10000.0*size)/(duration) << " GB/s" << endl;
   }

   free(charbuffer);
}

An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si
This benchmark varies the number of popcnts that are in the (false) dependency chain.

False Chain 0:  41959360000  0.57748 sec    18.1578 GB/s
False Chain 1:  41959360000  0.585398 sec   17.9122 GB/s
False Chain 2:  41959360000  0.645483 sec   16.2448 GB/s
False Chain 3:  41959360000  0.929718 sec   11.2784 GB/s
False Chain 4:  41959360000  1.23572 sec    8.48557 GB/s