16% more GPU performance for Panthor

There is a problem with the clock configuration of the RK3588 in every known implementation (mainline, BSP, etc.): the GPU/NPU clocks are not set to their nominal maximum values when the CRU is used.

To understand why, it helps to look at how clocks are routed to the individual SoC cores.

The base source frequency is 24 MHz; it feeds everything and is fixed.

Several PLLs take this frequency and synthesize other frequencies from it. The PLLs are generally not changed once they are configured: they are set up when the device boots and that is it.

Individual cores (e.g. the GPU) get their source clock from a PLL output. The cores, however, dynamically derive their own clock by dividing that output, so they can only reduce the PLL frequency to their desired clock, never multiply it.

So, in short, it looks like this:

24 MHz -> PLL (multipliers and dividers [p, m, s, k]) -> GPU/NPU (divider) = dynamic frequency

There are a bunch of PLLs, but for the GPU only CPLL, GPLL, V0PLL, AUPLL, SPLL and NPLL are relevant. The GPU can select one of them and divide it by a value in the range [1-32].

The default configuration for those PLLs is:

AUPLL: 786.432 MHz
CPLL : 1500 MHz
V0PLL: 1188 MHz
NPLL : 850 MHz
GPLL : 1188 MHz

You can verify this with:

sudo cat /sys/kernel/debug/clk/clk_summary | grep pll_

The problem with a frequency divider is that the steps are fine-grained in the lower frequency range, but in the higher range the steps become colossal.

E.g. for CPLL: 1500/1 = 1500 MHz, 1500/2 = 750 MHz, 1500/3 = 500 MHz, 1500/4 = 375 MHz ...

So you can jump from 1500 to 750 and from there to 500, but the frequencies in between cannot be set.

And if you take all the frequencies of the PLLs listed above and divide them, you get the following steps:

... 500, 594, 702, 750, 786, 850, 1188, 1500 MHz. And this is exactly the problem.
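If you want to double check the math, here is a tiny Python sketch (plain arithmetic, nothing from the kernel) that reproduces those steps from the PLL rates listed above and a [1-32] divider; the 702 MHz step presumably comes from SPLL, whose rate is not in the list above:

# default PLL rates in MHz, as listed above (SPLL omitted)
pll_rates_mhz = {"AUPLL": 786.432, "CPLL": 1500, "V0PLL": 1188, "NPLL": 850, "GPLL": 1188}

# every PLL rate divided by every allowed integer divider
steps = sorted({int(rate // div) for rate in pll_rates_mhz.values() for div in range(1, 33)})

# only the upper range is interesting for the GPU
print([f for f in steps if f >= 500])
# -> [500, 594, 750, 786, 850, 1188, 1500]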

When you look at the opp_table of the GPU/NPU, you can see that it already requests ... 500, 600, 700, 800, 900, 1000 MHz, but the PLLs and the core dividers cannot produce such frequencies.

Instead, when you request e.g. 1000 MHz, the clock driver picks the highest available frequency that is not above the requested one. In this case that is 850 MHz. This is the top frequency you can get, and that is the problem.
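That round-down behaviour is easy to illustrate with the step list from above (just a sketch of the idea, not the actual clk framework code):

# achievable GPU frequencies in MHz, from the PLL/divider exercise above
available = (500, 594, 702, 750, 786, 850, 1188, 1500)

def best_effort(requested_mhz):
    # pick the highest achievable frequency that does not exceed the request
    candidates = [f for f in available if f <= requested_mhz]
    return max(candidates) if candidates else min(available)

print(best_effort(1000))  # -> 850
print(best_effort(800))   # -> 786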

You can also verify this yourself. Set the GPU governor to performance:

sudo bash -c "echo performance > /sys/class/devfreq/fb000000.gpu/governor"

(Note: your fb000000.gpu node might have a different name, check it first.)

Then check the actually assigned clock:
sudo cat /sys/kernel/debug/clk/clk_summary | grep gpu

You will see that it is at most 850 MHz, no matter what.

The Solution:

The proper solution is to tune the PLL clocks so that they can provide the frequencies requested in the opp_tables, but this is not as easy as it sounds: the PLLs above also feed a bunch of other cores, and if your new clock does not meet the frequency tolerance of those cores, you will break other hardware blocks.

But there is an easy approach. NPLL is mainly used for the NPU, which has the same opp_table as the GPU, so modifying it has the least impact on other components.

So I simply pumped the NPLL clock from 850 MHz to 4 GHz so that we get a bunch more divided frequencies. With that change, the available frequencies become:

500, 571, 594, 666, 702, 750, 786, 800, 1000, 1188, 1333, 1500 ...

Now we can get a proper 1 GHz or 800 MHz.
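Rerunning the small division exercise from above with NPLL bumped to 4000 MHz shows where the new steps come from:

npll_mhz = 4000  # after the patch
print(sorted({npll_mhz // div for div in range(1, 9)}))
# -> [500, 571, 666, 800, 1000, 1333, 2000, 4000]
# 4000/4 = 1000 and 4000/5 = 800 are the ones we actually want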

Here is the fix https://github.com/hbiyik/linux/commit/e4fd428dd34fe13cbd5fa6ed79e2f787bc7655b0

Now, when the governor is set to performance, you actually get 1 GHz, and you can verify it with:
sudo cat /sys/kernel/debug/clk/clk_summary | grep gpu

I have also benchmarked this with glmark2-wayland -b terrain on weston.

Before the fix the score was 112 fps, with the fix it is 130 fps! So this is how we get the lost 16% performance back.

It is worth noting that setting NPLL to 4 GHz is just a workaround, because step frequencies like 900 MHz are still not available. Also, 4 GHz is way above the 1.5 GHz maximum defined in the BSP sources, which in turn contradicts the TRM statement of 4.5 GHz (the FracPLL Fout part).

Alternatively it can be set to 1 GHz to be more conservative, but then the 4000/5 = 800 MHz option is gone…

Future Plans

It is also possible to bypass the kernel and set the individual CRU registers with the mmm tool that I created. I have already used it to set the GPU frequency to 1.5 GHz, so the tool works, but a second tool is needed to change the regulator voltages over the RK806 to feed the GPU properly at such frequencies, otherwise it will just crash when the voltage is not enough.

A few general tips to get started with mmm:

// to get the actual CRU register status
sudo python mmm.py get -c rk3588 -d CRU
// to get the clocks in the CRU
sudo python mmm.py get -c rk3588 -d CRU -p clock
// to set the PLL source of the GPU to GPLL
sudo python mmm.py set -c rk3588 -d CRU -r GPU_CLKSEL -p sel GPLL
// to set the PLL source of the GPU to SPLL
sudo python mmm.py set -c rk3588 -d CRU -r GPU_CLKSEL -p sel SPLL
// to set the GPU clock divider to 1 (caution: value - 1 must be entered)
sudo python mmm.py set -c rk3588 -d CRU -r GPU_CLKSEL -p div 0
// to set the GPU clock divider to 2 (caution: value - 1 must be entered)
sudo python mmm.py set -c rk3588 -d CRU -r GPU_CLKSEL -p div 1
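If you want to script the source/divider choice instead of picking it by hand, a small Python helper like the one below can do it. This is my own sketch, not part of mmm, and the SPLL rate in it is an assumption since it is not listed in the summary above:

# selectable GPU clock sources and their rates in MHz
plls_mhz = {"GPLL": 1188, "CPLL": 1500, "AUPLL": 786, "NPLL": 850, "SPLL": 702}  # SPLL rate assumed

def pick_source(target_mhz):
    # best (source, divider) pair that stays at or below the target frequency
    name, div = max(((n, d) for n, r in plls_mhz.items() for d in range(1, 33)
                     if r // d <= target_mhz),
                    key=lambda nd: plls_mhz[nd[0]] // nd[1])
    # the div field in GPU_CLKSEL wants the divider minus one
    return name, div - 1

print(pick_source(1000))  # -> ('NPLL', 0) with the stock 850 MHz NPLL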

So the tool is quite capable, but be careful: don't burn your device or break it. You have been warned…
Any developer or tinkerer interest in the tool is appreciated, but again, be careful, it is a scalpel for a surgeon.

The PVTPLL situation is continued here


You just squashed the performance trigger on the chip! Congratulations, great post ma dude! :facepunch: :beer:

Awesome work!
I’ve seen some dts overclocking the rk3588 to higher values, but probably nothing as complete as this.
Hopefully we will get the best values optimized in a future release.
Earlier @tkaiser was able to get the best out of this SoC, maybe he can share his methods? :slight_smile:

I have these values: I mean, how do I know at which frequency the GPU is running?
root@rock5b:/home/rock# cat /sys/kernel/debug/clk/clk_summary | grep GPU

 scmi_clk_gpu                         1        1        0  1000000000          0     0  50000         Y
    clk_gpu_pvtm                      0        0        0    24000000          0     0  50000         N
          clk_gpu_src                 3        3        0   198000000          0     0  50000         Y
             clk_core_gpu_pvtm        0        0        0   198000000          0     0  50000         N
             clk_gpu_stacks           1        3        0   198000000          0     0  50000         Y
             clk_gpu_coregroup        1        3        0   198000000          0     0  50000         Y
             clk_gpu                  1        3        0   198000000          0     0  50000         Y

rock@rock5b:~$ glmark2-es2-wayland -b terrain
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
=======================================================
    glmark2 2021.02
=======================================================
    OpenGL Information
    GL_VENDOR:     ARM
    GL_RENDERER:   Mali-LODX
    GL_VERSION:    OpenGL ES 3.2 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
=======================================================
[terrain] <default>: FPS: 319 FrameTime: 3.135 ms
=======================================================
                                  glmark2 Score: 319 
=======================================================

Nice work! Just out of curiosity, have you checked whether the power draw increases a bit by bumping the PLL to 4 GHz ? I think it should be negligible but possibly observable.

I did not check it, but just pumping the PLL shouldn't increase the consumption, it is just a free-running clock; the consumption comes from a core that is attached to that clock through its own driver. In our case the GPU will use divider 4 to get 1 GHz.

If you are concerned about the 4 GHz, you might as well set it to 1 GHz, it will have the same effect.

That is actually interesting, you are running the blob driver.

When I dump the GPU clock with the blob driver I get the following:

[alarm@alarm mmm]$ sudo python mmm.py get -c rk3588 -d CRU -r GPU_CLKSEL
-c rk3588 -d CRU -r GPU_CLKSEL -p div = 5, (default=0), (values=[0~31])
-c rk3588 -d CRU -r GPU_CLKSEL -p sel = GPLL, (default=GPLL), (values=GPLL,CPLL,AUPLL,NPLL,SPLL)
-c rk3588 -d CRU -r GPU_CLKSEL -p testout_div = 31, (default=0), (values=[0~31])
-c rk3588 -d CRU -r GPU_CLKSEL -p testout_mux = PLL, (default=PLL), (values=PLL,PVTM)
-c rk3588 -d CRU -r GPU_CLKSEL -p mux = PLL, (default=PLL), (values=PLL,PVTM)
-c rk3588 -d CRU -r GPU_CLKSEL -p reserved = 0, (default=0)
-c rk3588 -d CRU -r GPU_CLKSEL -p clock = 198 Mhz

[alarm@alarm mmm]$ sudo python mmm.py get -c rk3588 -d CRU -p clock     
-c rk3588 -d CRU -r V0PLL_CON0 -p clock = 1188 Mhz
-c rk3588 -d CRU -r AUPLL_CON0 -p clock = 786 Mhz
-c rk3588 -d CRU -r CPLL_CON0 -p clock = 1500 Mhz
-c rk3588 -d CRU -r GPLL_CON0 -p clock = 1188 Mhz
-c rk3588 -d CRU -r NPLL_CON0 -p clock = 850 Mhz
-c rk3588 -d CRU -r GPU_CLKSEL -p clock = 198 Mhz

So the GPU clock is set to use the PLL path (not PVTM), the source is GPLL, and the divider is 5+1 = 6, so the clock is 1188/6 = 198 MHz.

But I cannot make sense of the glmark results. They are way too high for 200 MHz.

My only theory is that the blob driver is using SMCCC to set the clocks, so the requested clocks are actually set by BL31. BL31 runs at a different exception level than the normal kernel, so the SoC might somehow expose a different IO base address to BL31, and what the normal registers report would then not be valid. In any case, a weird situation…

That's a very valid but very hard to answer question :slight_smile:

It’s really not a concern, mostly a matter of curiosity. PLLs are free-running clocks controlled on their phase after the divide and at such frequencies they can usually draw a few milliamps. Thanks!


Just thinking, I’ve used opengl a tiny little bit several years ago and found that it was apparently possible to port generic code there, but the communication latency with the host was horrible for me (I really don’t know the right way to do things). Maybe it would be feasible to simply port the mhz utility to the GPU for this, if we find a way to accurately measure the processing time.

That should be somehow possible with the PMU (performance monitoring unit) of the GPU, but the code will not be very portable, I assume.

@boogiepop

I applied your proposed tuning but there is no change in the case of the Mali blob.

root@rock5b:/home/rock# cat /sys/kernel/debug/clk/clk_summary | grep gpu
 scmi_clk_gpu                         1        1        0  1000000000          0     0  50000         Y
    clk_gpu_pvtm                      0        0        0    24000000          0     0  50000         N
          clk_gpu_src                 3        3        0   666666667          0     0  50000         Y
             clk_core_gpu_pvtm        0        0        0   666666667          0     0  50000         N
             clk_gpu_stacks           1        3        0   666666667          0     0  50000         Y
             clk_gpu_coregroup        1        3        0   666666667          0     0  50000         Y
             clk_gpu                  1        3        0   666666667          0     0  50000         Y
root@rock5b:/home/rock# 

rock@rock5b:~$ glmark2-es2-wayland -b terrain
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
=======================================================
    glmark2 2021.02
=======================================================
    OpenGL Information
    GL_VENDOR:     ARM
    GL_RENDERER:   Mali-LODX
    GL_VERSION:    OpenGL ES 3.2 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
=======================================================
[terrain] <default>: FPS: 320 FrameTime: 3.125 ms
=======================================================
                                  glmark2 Score: 320 
=======================================================
rock@rock5b:~$ 

@willy
Unfortunately, porting your mhz utility to a GPU is beyond my knowledge, but I can run it here if someone does.


Yes, the Mali blob won't take advantage of it, because it is using PVTPLL I think. I am also checking this in detail, maybe I should update the title accordingly to be more precise.

It’s beyond my knowledge as well :wink: But it makes an interesting project I should consider. I have no idea where to start to run code on the GPU there, I’m totally ignorant of these things.

So, I would like to clarify a bit more what I have learned about the clock adventures of the rk35xx.

In the previous post I mentioned that the GPU takes its PLL frequency from one of CPLL, GPLL, AUPLL, V0PLL, SPLL. This is not complete. For small cores like i2c, spi, pcie it is correct, but for the bigger cores like CPU, GPU, NPU etc. there is another PLL source called PVTPLL.

PVTPLLs are dedicated to a core and not shared across different cores; sometimes there are even multiple PVTPLLs for a single core (e.g. the CPU has a separate PVTPLL for the little cores and for each big-core cluster).

Unlike normal PLLs, PVTPLLs are meant to be dynamically configured, with a twist: a PVTPLL gives the best possible frequency output for a given voltage, temperature and chip quality.

E.g. you request 1 GHz from a PVTPLL, then set the voltage to your target voltage and let the PVTPLL circuit do its job: it runs a very small hardware benchmark circuit called a ring oscillator and locks the frequency output to the maximum achievable. That can be 999 MHz, 950 MHz, or 1 GHz. The core then gets this frequency and uses it.
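Conceptually (grossly simplified, and only my mental model, not real driver or hardware code) the locking behaviour looks like this:

def pvtpll_lock(target_mhz, ring_oscillator_mhz):
    # the ring oscillator reflects what the silicon can actually sustain at the
    # current voltage and temperature; the PVTPLL never goes above the target
    return min(target_mhz, ring_oscillator_mhz)

print(pvtpll_lock(1000, ring_oscillator_mhz=985))   # -> 985, weaker chip or lower voltage
print(pvtpll_lock(1000, ring_oscillator_mhz=1040))  # -> 1000, target met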

Now comes the complicated part. This is my understanding, someone may correct me if I am wrong, but the PVTPLL is not directly configured by the kernel. Instead it is configured by BL31.

The kernel uses an interface called SMCCC to communicate with BL31 and request the frequency; BL31 then sets up the PVTPLL and configures the core. This whole communication between BL31 and the kernel is sometimes referred to as firmware or SCMI. There are also other transports than SMCCC, but on our rk3588 it is SMCCC.

So the initial problem of the GPU clock not reaching 1 GHz is that, with the Panthor driver, the GPU was using the normal PLLs rather than the PVTPLL. Even though the scmi clock of the GPU is defined in the GPU block of the mainline DTS, it looks to me like devfreq is not taking care of it. I think something needs to be done about this in mainline. When the issue is resolved in mainline, I can hopefully also backport it to the BSP.

When it comes to the Mali blob driver, it is actually using the PVTPLL as a source and can successfully set the frequency to the desired 1 GHz. However, there still seems to be a problem: when you request a frequency from BL31 with PVTPLL, it reports the requested frequency as the set frequency, not what the PVTPLL actually provides. You can see that in the reference TF-A implementation for rk3588.

So how do we know what the actual frequency is?

When I probe the GPU_GRF register with the mmm tool, I get an instant kernel crash. I interpret this as some kind of security mechanism, since direct access to those registers from the kernel or mmapped userspace does not seem to be allowed (theory). So my approach now is to use pysmccc to probe BL31, but use the sip_smc_secure_reg_read and sip_smc_secure_reg_write callbacks to probe the GPU_GRF registers. Normally those callbacks are meant to access the OTP registers, but it is worth a shot. I also don't know whether they are even implemented in BL31.
