Has anyone tried to determine whether the A520 cores have been configured with the “One FPU per two IU” option yet? This is the configuration ARM recommends, but I’m curious whether they opted out of it since this is not a phone chip.
A520 FPU configuration
Would you suggest any simple benchmark to sort this out?
I was kind of hoping you or tkaiser might have an idea. I suppose something FP-heavy that can be parallelized could be run on varying sets of the A520 cores; if throughput increases roughly 4x with all four cores, then each core probably has a full FPU. Otherwise they will bottleneck as they fight for the FPU, slowing everything down.
You mean the cores would share the FPUs between them? That would surprise me quite a bit. I thought “1 FPU per 2 IU” meant “one FP unit per 2 integer units”. Since there are 3 ALUs + one MAC/DIV, I guess these are configurable depending on the expected power savings, and maybe Arm recommends a number of FPUs per ALU to maintain a good balance between performance and efficiency. But I could be wrong, of course.
FYI I just did something very dirty: I changed the cycle-burning code in mhz.c from integer ops to floating point ops. It now gives me a number of subtracts per second instead of int ops per second, shown in the cpu_MHz field of the output. On a single core:
$ taskset -c 1 ./mhz 5
count=59843 us50=19990 us250=99959 diff=79969 cpu_MHz=149.665
count=59843 us50=19991 us250=99959 diff=79968 cpu_MHz=149.667
count=59843 us50=19990 us250=99955 diff=79965 cpu_MHz=149.673
count=59843 us50=19988 us250=99961 diff=79973 cpu_MHz=149.658
count=59843 us50=19989 us250=99960 diff=79971 cpu_MHz=149.662
willy@orion-o6:~/mhz$
That is 150 million floating point instructions per second.
And on the 4 A520 cores in parallel:
$ taskset -c 1 ./mhz 5 & taskset -c 2 ./mhz 5 & taskset -c 3 ./mhz 5 & taskset -c 4 ./mhz 5 &
[1] 32158
[2] 32159
[3] 32160
[4] 32161
count=59768 us50=19985 us250=99941 diff=79956 cpu_MHz=149.502
count=59813 us50=19993 us250=99977 diff=79984 cpu_MHz=149.562
count=59853 us50=19996 us250=99989 diff=79993 cpu_MHz=149.646
count=59843 us50=19988 us250=99977 diff=79989 cpu_MHz=149.628
count=59768 us50=19985 us250=99938 diff=79953 cpu_MHz=149.508
count=59813 us50=19993 us250=99977 diff=79984 cpu_MHz=149.562
count=59843 us50=19991 us250=99977 diff=79986 cpu_MHz=149.634
count=59853 us50=19992 us250=99985 diff=79993 cpu_MHz=149.646
count=59768 us50=19984 us250=99925 diff=79941 cpu_MHz=149.530
count=59813 us50=19992 us250=99977 diff=79985 cpu_MHz=149.561
count=59843 us50=19990 us250=99971 diff=79981 cpu_MHz=149.643
count=59853 us50=19993 us250=99987 diff=79994 cpu_MHz=149.644
count=59768 us50=19983 us250=99931 diff=79948 cpu_MHz=149.517
count=59813 us50=19990 us250=99977 diff=79987 cpu_MHz=149.557
count=59843 us50=19988 us250=99981 diff=79993 cpu_MHz=149.621
count=59853 us50=19991 us250=99983 diff=79992 cpu_MHz=149.647
count=59768 us50=19984 us250=99936 diff=79952 cpu_MHz=149.510
count=59813 us50=19992 us250=99974 diff=79982 cpu_MHz=149.566
count=59843 us50=19990 us250=99978 diff=79988 cpu_MHz=149.630
count=59853 us50=19992 us250=99981 diff=79989 cpu_MHz=149.653
Still 150M/s for each process, which suggests that, at least in this test, the FPU capacity per core stays the same when all four cores are busy.
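For anyone who wants to put a number on that comparison, here is a minimal sketch that pulls the cpu_MHz field out of pasted mhz output and compares the parallel average to the single-core average. The helper name rates() and the variable names are made up for illustration, and only the first four lines of the parallel run are included. Note the assumption: a drop would only be expected if the loop actually keeps the FPU saturated, which this test does not prove.

```python
import re
from statistics import mean

# Matches the cpu_MHz field printed by mhz, e.g. "cpu_MHz=149.665".
CPU_MHZ = re.compile(r"cpu_MHz=([0-9.]+)")

def rates(output: str) -> list[float]:
    """Return every cpu_MHz value found in a block of mhz output."""
    return [float(value) for value in CPU_MHZ.findall(output)]

# Single-core run (taskset -c 1), copied from the output above.
single = """
count=59843 us50=19990 us250=99959 diff=79969 cpu_MHz=149.665
count=59843 us50=19991 us250=99959 diff=79968 cpu_MHz=149.667
count=59843 us50=19990 us250=99955 diff=79965 cpu_MHz=149.673
count=59843 us50=19988 us250=99961 diff=79973 cpu_MHz=149.658
count=59843 us50=19989 us250=99960 diff=79971 cpu_MHz=149.662
"""

# First four lines of the run with cores 1-4 busy at the same time.
parallel = """
count=59768 us50=19985 us250=99941 diff=79956 cpu_MHz=149.502
count=59813 us50=19993 us250=99977 diff=79984 cpu_MHz=149.562
count=59853 us50=19996 us250=99989 diff=79993 cpu_MHz=149.646
count=59843 us50=19988 us250=99977 diff=79989 cpu_MHz=149.628
"""

single_rate = mean(rates(single))
parallel_rate = mean(rates(parallel))

# Fraction of the single-core rate each process keeps under load.
# A value near 1.0 means no visible contention in this test.
retained = parallel_rate / single_rate
print(f"single={single_rate:.3f} parallel={parallel_rate:.3f} retained={retained:.4f}")
```

A retained value close to 1.0 matches the reading above; a value well below 1.0 would be what contention for a shared unit might look like.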
FWIW the same test on cores 0 and 11 gives 325M/s per core. That’s an absolute gain of about 2.17x, or a 50% gain at the same frequency.
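The arithmetic behind those two figures, as a quick sketch. The rates are the rounded ones from this thread, and the clock ratio is only what the 50% per-clock figure implies, not a measured frequency; the variable names are mine.

```python
# Rounded per-core rates quoted in this thread, in millions of subtracts/s.
little_rate = 150.0   # each A520 core
big_rate = 325.0      # cores 0 and 11

# The "50% gain at the same frequency" figure from the post.
per_clock_gain = 1.5

# Absolute gain is simply the ratio of the two rates.
absolute_gain = big_rate / little_rate

# If the per-clock gain is 1.5x, the rest of the gap must come from clock speed.
implied_clock_ratio = absolute_gain / per_clock_gain

print(round(absolute_gain, 3), round(implied_clock_ratio, 3))  # prints: 2.167 1.444
```

So the two numbers are consistent if cores 0 and 11 run at roughly 1.44 times the A520 clock.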