Detailed description of the multimedia part of the mobile phone SoC

After setting up the skeleton of a mobile phone chip, the next step is to integrate the multimedia components. "Multimedia" here consists of three main modules: the Graphics Processing Unit (GPU), the display module, and the video module. The display module is responsible for outputting everything to the screen, while the video module handles decoding and encoding, such as video playback and camera recording. The Image Signal Processor (ISP) is not included here and will be discussed later.

The GPU is often what users focus on, especially for gaming and benchmarks. However, when defining the multimedia specifications of a mobile chip, the first parameter to pin down is not GPU performance but the display resolution: 720p, 1080p, 2K, or higher. The resolution determines the minimum requirements for GPU fill rate, system bandwidth, and the memory controller configuration. It also influences CPU and ISP selection, which in turn affects overall power consumption and cost. It is therefore a critical parameter in chip design. A few typical examples:

- Ultra-low end: Spreadtrum SC9832, with a 720p display, Cortex-A7MP4, Mali-400MP2, and 1x LPDDR3.
- Low end: MediaTek MT6739, with a 1440x720 display, Cortex-A53MP4, IMG PowerVR GE8100, and 1x LPDDR3.
- Mid-range: Qualcomm Snapdragon 652, with a 2560x1600 display, Cortex-A72MP4/Cortex-A53MP4, Adreno 510, and 2x LPDDR3.
- High end: HiSilicon Kirin 970, with a 2K display, Cortex-A73MP4/Cortex-A53MP4, Mali-G72MP12, and 4x LPDDR4.

As the display resolution increases, the chip's specifications grow more demanding, but they usually stop at 2K. Let's look at the quantitative reasoning behind these choices.

As shown in the figure above, mobile multimedia must include three core modules: graphics, video, and display. Why do desktop GPUs integrate video and display output, while mobile chips separate them? Put more plainly, all multimedia and image processing is computation, so why not do everything on the CPU? Because of power. Mobile chips must operate under strict power limits: typically less than 2.5 watts sustained and up to 5 watts for short bursts, and that 2.5-watt budget already includes the CPU, the bus, and memory bandwidth.

Each dedicated multimedia module consumes far less power than the CPU doing the same work. For example, a 16 nm hardware video codec handling 4K30 consumes around 60 milliwatts, while the same task in software would need at least four A53 cores running at 2 GHz, consuming over 1.5 watts: a difference of roughly 25 times. For the display module, driving a 2K60 panel takes approximately 50 milliwatts on a 16 nm process; doing the same composition on the GPU would cost roughly 300 milliwatts, and on the CPU close to 1 watt. So merely decoding a 4K video and putting it on screen could already hit the power limit, never mind sustaining more than 10 hours of playback. Hence mobile multimedia must be split into separate GPU, video, and display modules.

In ultra-low-end devices, software decoding may still be used to save die area. This is feasible because such chips may only support 1080p video, and with few CPU cores the power stays manageable. High-end phones, however, do not take this shortcut.
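The figures above lend themselves to a quick back-of-the-envelope check. The sketch below is purely illustrative: the decode and composition numbers are the estimates quoted above, while the 0.5 W allowance for CPU, bus, and DRAM overhead is my own assumption.

```python
# Rough power-budget check for 4K30 video playback on a 16 nm SoC, using the
# estimates quoted above. The 0.5 W "other" allowance (CPU, bus, DRAM) is an
# assumption for illustration only.

POWER_BUDGET_W = 2.5           # sustained power limit for the whole SoC

hw_decode_w   = 0.060          # dedicated video codec, 4K30
cpu_decode_w  = 1.5            # ~4x A53 @ 2 GHz doing the same job in software
hw_display_w  = 0.050          # display pipe driving a 2K60 panel
gpu_compose_w = 0.300          # same composition done on the GPU
cpu_compose_w = 1.0            # same composition done on the CPU

def playback_power(decode_w, compose_w, other_w=0.5):
    """Total playback power: decode + composition/scan-out + everything else."""
    return decode_w + compose_w + other_w

cases = [("dedicated blocks", hw_decode_w, hw_display_w),
         ("GPU composition",  hw_decode_w, gpu_compose_w),
         ("all on the CPU",   cpu_decode_w, cpu_compose_w)]

for name, decode_w, compose_w in cases:
    total = playback_power(decode_w, compose_w)
    verdict = "within budget" if total <= POWER_BUDGET_W else "over budget"
    print(f"{name:16s}: {total:.2f} W  ({verdict})")
```

Only the all-CPU path blows through the 2.5-watt ceiling, which is exactly why the dedicated blocks exist.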
Another question might arise: why does the video path go up to 4K while the display only supports 2K? The answer lies in the limitations of mobile screens. Even high-end devices have not yet adopted 4K panels, and doubling the resolution doubles the panel's power consumption, a serious problem since the screen is already one of the biggest power consumers; lower-power screen technology is needed before this changes. And why must the decoder support a higher resolution than the display? Because the source material dictates the video format: a 4K source cannot be decoded by a decoder that only handles 2K, so any downscaling has to happen after decoding. There is also the question of why video runs at 30 fps while the display refreshes at 60 fps. Experiments show that dropping video to 30 fps makes no noticeable difference, but people are quite sensitive to unnatural motion in the user interface. So 60 fps is used for UI elements such as scrolling backgrounds, while 30 fps suffices for video.

Now let's look at how display resolution drives the GPU, system bandwidth, and memory controller requirements. As shown in the next figure, the display module's job is closely tied to the concept of layers in the operating system's user interface. The final image on screen is a composite of multiple layers, each of which can be rotated, scaled, or otherwise transformed. The display module can take its inputs from decoded video or from the GPU, which may already have done some preliminary composition. On Android there are typically 4 to 8 layers.

Assuming a 1080p60 display, the read bandwidth per layer is 1920 x 1080 x 60 x 4 bytes (RGBA), about 480 MB/s, so 8 layers total roughly 4 GB/s. Adding 4K30 video playback requires another 1.5 GB/s or so. The typical bandwidth of a GPU rendering the UI is as follows: depending on UI complexity, rendering at 60 fps can consume about 1 GB/s with framebuffer compression, or 1.5 to 2 GB/s without it. Other consumers include the CPU running drivers, the apps themselves, and so on; budgeting roughly another 1 GB/s for these brings the total to around 9 GB/s. A single-channel LPDDR4 interface offers about 12.8 GB/s, so at 70% utilization one DDR controller is just about enough. Bandwidth costs roughly 100 milliwatts per GB/s on 16 nm, so the bus and DDR controllers together consume approximately 1 watt. Bandwidth, then, is also very power-hungry, and increasing it raises the cost of the memory controllers, the DDR PHY, and the DRAM itself. The added bus area and power are comparatively small, but managing this complexity is a basic skill in SoC design. So far we can see how the display resolution defines the system bandwidth, power, and cost requirements. From it, the minimum GPU requirements can also be derived.

As shown in the figure, a GPU has three headline parameters: triangle throughput, pixel fill rate, and theoretical floating-point performance. For the UI, the one that matters most is pixel fill rate. What does fill rate mean? For each layer mentioned above, at 1080p a layer needs 1920 x 1080 x 60, or about 120 Mpixels/s. If the GPU draws all 8 layers, a fill rate of about 1 Gpixel/s is required. But recall the compositing, scaling, and rotation features of the display module; these are tasks the GPU can also perform. If the display module cannot accept 8 inputs, the GPU must merge the 8 layers down to 4 before handing them over. Each two-into-one composition is equivalent to redrawing one layer, so 4 extra layers' worth of work is needed, for a total of about 1440 Mpixels/s. Scaling and rotation would push this higher still.
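To make the arithmetic above easy to re-check, here is a small sketch that reproduces the per-layer bandwidth and fill-rate figures. It is purely illustrative, and the aside about the article's 1.5 GB/s video figure is my own reading of where the extra margin comes from.

```python
# Re-checking the bandwidth and fill-rate arithmetic above for a 1080p60 UI.
# First-order estimates only; real systems add compression, overdraw and margin.

W, H, FPS, BPP = 1920, 1080, 60, 4                 # RGBA8888

layer_bw = W * H * FPS * BPP                       # display read per layer, bytes/s
print(f"per-layer display read : {layer_bw/1e9:.2f} GB/s")     # ~0.50 GB/s
print(f"8-layer display read   : {8*layer_bw/1e9:.2f} GB/s")   # ~4 GB/s

# One 4K30 layer scanned out; the article budgets ~1.5 GB/s, presumably
# leaving headroom for decoder reference traffic as well.
video_bw = 3840 * 2160 * 30 * 4
print(f"4K30 video layer       : {video_bw/1e9:.2f} GB/s")     # ~1 GB/s

layer_fill = W * H * FPS                           # pixels/s to draw one layer
print(f"per-layer fill rate    : {layer_fill/1e6:.0f} Mpix/s") # ~124 Mpix/s
print(f"8 layers drawn by GPU  : {8*layer_fill/1e9:.2f} Gpix/s")
# Merging 8 layers down to 4 redraws roughly 4 extra layers' worth of pixels:
print(f"with 8-to-4 composition: {12*layer_fill/1e6:.0f} Mpix/s")  # article rounds to ~1440
```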
In practice the display module will not support 8 layers, because such scenes are rare and the hardware would mostly sit idle; in the extreme cases the GPU does the extra work instead, which adds flexibility and prevents the screen from stuttering when there are too many layers. Of course, because of system latency and bandwidth, pixel fill utilization never reaches 100%. Some systems I have seen achieve only 70%, mainly due to long average latency and insufficient parallelism. Simply adding GPU cores is unwise if system bandwidth is the bottleneck or the scheduling is not optimized. Over the past two years mobile chips have reached around 90% utilization, and even low-end phones leave some headroom for unexpectedly complex scenes. At that point, raising utilization further becomes a way to reduce power. Note also that on many mobile GPUs the pixel fill rate is really bounded by the texel fill rate: since the UI is mostly textures being mapped and blended, it needs little 3D geometry, triangle throughput, or floating-point capability, but the texel fill rate must keep up.

Consider two extreme examples in terms of power and area. A VR chip with a 4K120 binocular display, playing 4K video inside a virtual room, with a display module that accepts 8 inputs, needs a pixel fill rate of about 6.4 Gpixels/s plus one 4K video decode path. Translated into GPU terms that means at least a G72MP8, and with 3D performance in mind perhaps an MP16: roughly 36 square millimeters of GPU area and about 6 watts, a system that cannot run without a fan. At the other end, a low-end chip supporting a 1080p display, 4K30 video playback, and 4-layer scenes can get by with a G72MP1 of about 2 square millimeters and 0.4 watts; adding the video and display modules keeps it under 5 square millimeters, and the power is already low. The CPU cost of the GPU driver is not counted here: at full load a G72MP12 needs an A73 at 2.5 GHz and still struggles to spread the load (an OpenGL ES limitation), whereas the low-end chip manages with a single A53. Big versus little cores and 4-core versus 8-core configurations then cause large differences in area and power.

Raising the display resolution is therefore not just a matter of image quality: doubling it makes system cost and power a serious concern. In short, the display resolution sets the lower bound of a chip.
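As a rough illustration of this sizing logic, the sketch below estimates core count from a required fill rate. The assumed 850 MHz clock and one pixel per clock per core are my own placeholder figures, not Mali datasheet values; they merely reproduce the order of magnitude of the two extreme examples above.

```python
# Rough GPU sizing from the required pixel fill rate. The per-core throughput
# (1 pixel/clock) and the 850 MHz shader clock are assumptions for illustration.
import math

def cores_needed(fill_rate_pix_s, clock_hz, pix_per_clock_per_core=1):
    return math.ceil(fill_rate_pix_s / (clock_hz * pix_per_clock_per_core))

CLOCK_HZ = 850e6

# VR case from the text: ~6.4 Gpix/s of fill rate for a 4K120 binocular setup
print("VR (6.4 Gpix/s):", cores_needed(6.4e9, CLOCK_HZ), "cores")   # ~8, i.e. at least MP8

# Low-end case: 1080p60 with a 4-layer scene drawn by the GPU
low_end_fill = 1920 * 1080 * 60 * 4
print("Low end        :", cores_needed(low_end_fill, CLOCK_HZ), "core")   # 1 core (MP1)
```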
Having introduced the GPU's basic role in the system, the rest of the analysis focuses on designing one. To build a GPU, the market has to be analyzed first. The GPU market has four major segments: desktops and game consoles (under 300 million units), mobile phones and tablets (under 2 billion, of which nearly 1.5 billion are taken by Qualcomm and Apple), TVs and set-top boxes (under 200 million), and car dashboards plus autonomous driving (under 100 million). Of these, desktops/consoles and autonomous driving are not considered here; the remaining segments have the following requirements:

- Mobile phones and tablets: display resolution 1080p to 2K, 4 to 8 layers, 3D performance from weak to strong, power budget 2.5 watts, cost-sensitive.
- TVs and set-top boxes: display resolution 1080p to 8K, 8 layers, weak 3D performance, power budget 2.5 watts, cost-sensitive, image-quality enhancement required.
- Automotive panels: display resolution 1080p to 2K, 4 layers, weak 3D performance, power budget 2.5 watts, cost-sensitive, no automotive safety design needed.

As Vulkan becomes Android's next-generation graphics interface, fixed-function graphics pipelines will eventually be phased out and the general-purpose graphics processor, the GPGPU, becomes inevitable. Differences in resolution can be handled with a configurable multi-core design, and the different needs of UI and games can be addressed with big and little cores; here, big and little mean different compute capability at the same fill rate. Combining the two gives the best energy efficiency and area ratio. Speaking of big and little cores, a natural question arises: should a chip mix big and little GPU cores the way it mixes CPU cores? The answer is no. CPU big.LITTLE is useful because there is a 4 to 5 times gap in energy efficiency and a hard requirement for single-threaded performance. A well-designed GPU, however, whether built from big or little cores, should consume roughly the same energy for the same performance thanks to its naturally parallel workload. The core areas will differ, but even during periods when only the UI is active and the full compute capability is not needed, a little GPU core buys little, except in products where the GPU mainly runs the UI and 3D performance is not pursued.

In terms of rendering mode, there are two main approaches today: immediate-mode rendering and tile-based rendering, a debate that has been running for years. The former works primitive by primitive, rendering the relevant vertices, geometry, and pixels and then compositing the output. The latter works tile by tile, gathering the relevant vertices and triangles, computing coverage relationships, and then compositing the output. At first glance tile-based rendering looks more economical, because it computes pixel coverage and avoids redundant shading. On the other hand, the triangles, vertices, and attributes that tile-based rendering needs are all read from memory; with many triangles they may have to be read repeatedly, and the bandwidth saved may end up being less than with immediate-mode rendering. The key to deciding which approach wins is therefore the ratio of vertices to pixels. For mobile applications, the number of pixels is far larger than the number of triangles or vertices, roughly 30:1 to 50:1, and in that regime tile-based rendering suits embedded devices better. From a compute-density perspective, an immediate-mode GPU can be smaller in area than a tile-based one, but the extra bandwidth raises the cost of the handset; its power consumption, on the other hand, is not necessarily higher than tile-based rendering. A qualitative argument is not enough here, so let's quantify it.

An example: an immediate-mode GPU A with a fill rate of 7200 Mpixels/s scores 25 fps in Manhattan 3.0 at 450 MHz on TSMC 28 nm, consuming 2.3 watts over 13 square millimeters with 12.8 GB/s of bandwidth. The corresponding tile-based GPU B has a fill rate of 1300 Mpixels/s, scores 4.5 fps in Manhattan 3.0 at 650 MHz on the same 28 nm process, consumes 0.55 watts over 4.8 square millimeters, and uses 688 MB/s of bandwidth. The fill-rate ratio matches the Manhattan 3.0 ratio: both differ by about 5.5 times. Normalized to the same performance, GPU A consumes about 30% less power and occupies only half the area, but needs 3.4 times the bandwidth. That extra bandwidth, almost one and a half DDR4 channels, immediately adds about 2 watts of bus and DDR power at 28 nm, wiping out GPU A's own power advantage, never mind the area saving. Conversely, if GPU A is only asked to match GPU B's absolute performance, it needs just 2.3 GB/s of bandwidth, still a lot but well within a single DDR4 channel, and its total power stays under the 2.5-watt ceiling while it saves half the area; why not? The conclusion is that low-end phones can use immediate-mode GPUs, while the mid-range and high end still use tile-based GPUs, and as process nodes advance, immediate-mode GPUs will become viable more widely. This probably surprises many people.
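A small sketch of the normalization just described, using the article's figures for the two GPUs. The 0.22 W per GB/s of bus-plus-DDR power is my own number, back-derived from the "about 2 watts for the extra bandwidth at 28 nm" statement above.

```python
# Normalizing the two example GPUs to the same performance (Manhattan 3.0 score),
# then adding an assumed bus+DDR cost per GB/s of bandwidth.

gpu_a = dict(name="immediate-mode A", fps=25.0, power_w=2.30, area_mm2=13.0, bw_gbs=12.800)
gpu_b = dict(name="tile-based B",     fps=4.5,  power_w=0.55, area_mm2=4.8,  bw_gbs=0.688)

scale = gpu_a["fps"] / gpu_b["fps"]      # ~5.5x
DDR_W_PER_GBS = 0.22                     # assumed 28 nm bus+DDR cost (~2 W per ~9 GB/s delta)

def at_performance(gpu, factor):
    """Scale power, area and bandwidth linearly to a higher performance point."""
    return {k: (v * factor if k in ("power_w", "area_mm2", "bw_gbs") else v)
            for k, v in gpu.items()}

b_scaled = at_performance(gpu_b, scale)  # B scaled up to A's performance level
for g in (gpu_a, b_scaled):
    total_w = g["power_w"] + g["bw_gbs"] * DDR_W_PER_GBS
    print(f'{g["name"]:18s} GPU {g["power_w"]:.2f} W  area {g["area_mm2"]:.1f} mm2  '
          f'bw {g["bw_gbs"]:.1f} GB/s  total with DDR {total_w:.2f} W')

# GPU A wins on its own power and area, but loses once the DDR power for its
# much larger bandwidth is counted -- exactly the trade-off described above.
```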
Let's now look at the graphics rendering flow. Each step will not be explained in detail; what we care about is which parts can run on a general-purpose compute unit and which have to be hardened into fixed-function hardware, because that determines performance, area, and power. We map the flow onto a Mali GPU, as shown below, and then look at the shader in more detail, also shown below.

The vertex and pixel processing stages involve a large amount of computation and algorithms that change frequently, so they run on the general-purpose shader, the Execution Engine in the figure above. The tessellation stage is optional, so Mali's Norr has no dedicated hardware for it; it can be done in software on the general shader units. Depth and stencil testing and blending are pixel post-processing: the operations are simple and the workload scales linearly with the pixel count, so dedicated hardware handles them directly. Texturing needs its own unit because it involves a number of specialized operations, every output pixel requires it, and its workload also scales linearly with the output; its memory traffic is so large that it gets its own memory-access path rather than sharing the shader's. The attributes and varyings used to pass data between vertex and pixel shading must be fetched from the vertex attribute data and interpolated to produce per-pixel values for the pixel shader; this work scales linearly with vertices or pixels and is well suited to a fixed-function unit. From the vertex positions of each primitive, the transformed coordinates and normal vector are computed; anything outside the view volume or facing away contributes nothing to the output and is discarded immediately to save bandwidth. This is back-face culling and clipping, done by a dedicated module working together with the general-purpose compute units. The surviving primitives are used to build per-tile triangle lists for rasterization. Rasterization itself involves depth and color calculations and runs on the generic shader, but it needs a hardware triangle-setup module.
For the triangles that fall within a given tile of pixels, the triangle list built in the previous step is read and the parameters of each triangle's edge equations, plane equation, and barycentric coordinates are computed. Because the algorithm is fixed and its throughput is set by the triangle rate, it too is a good candidate for hardening. On top of this, additional units are needed for system-level work: the memory subsystem, the internal bus, and the modules that manage tasks and threads. The task-management hierarchy in particular can be subdivided into several levels, and this is where the essence of a GPU design lies.

Take Mali as an example. At the top level, all work is divided into three kinds of jobs, corresponding to primitives, vertices, and pixels, and dispatched to the appropriate units, namely the tiler and the vertex/pixel shaders. At the next level, the screen is divided into pixel tiles, each of which becomes a set of threads; at any instant a shader core runs many threads, and a group of such threads is called a warp. The secret to raising compute density is to pack more threads into a single core. At the lowest level, inside each shader's execution engine, each issue is a clause: a branch-free program that, once all its input data has been fetched from memory into registers, runs to completion without interruption. When a clause's inputs are not yet ready, the core switches to another clause that is. All of this job, warp, and clause management needs dedicated hardware to keep pipeline utilization as high as possible. In short, any stage that always appears in the graphics flow, performs a fixed operation, and has a predictable workload can be given a dedicated hardware unit; it will not go to waste, and it saves both power and area. The parts whose computation varies are handed to the general-purpose compute units.

Once the general-purpose and dedicated units are identified, the next step is to optimize the rendering flow and tune the hardware. Start with vertex processing, as shown above. When the triangle lists are generated, some triangles are back-facing; once the normal vector is computed they can be discarded outright, saving their output. Going further, a few simple near/far comparisons on the vertices can prove that some triangles are completely covered, and those can be discarded as well; their pixel shading and blending are then skipped entirely. This is Early-Z. It has a limitation, though: across multiple draw calls issued to the GPU one after another, doing Early-Z between different calls is not easy, because gathering the results of draw calls that touch the same region so they can be optimized together may cost more than it saves. A tile-based GPU, however, can wait until all the draw calls have been submitted before doing this work. Even then, draw calls issued back to front are hard to optimize, whereas a front-to-back order lets occluded work be killed as soon as it is covered. This is called forward kill; it requires the graphics engine to precompute the depth information of objects and pre-sort them when generating the command stream, improving rendering efficiency.
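A toy sketch of the early depth rejection described above, at the level of a single tile: a fragment that fails the depth test is discarded before any shading work is spent on it. It is purely conceptual and does not correspond to any particular GPU's implementation.

```python
# Conceptual Early-Z inside one 16x16 tile: fragments whose depth fails the
# test are thrown away before the (expensive) pixel shader is invoked.

TILE = 16
depth_buffer = [[1.0] * TILE for _ in range(TILE)]   # 1.0 = far plane

def shade(fragment):
    # Stand-in for the expensive part: texturing, lighting, blending inputs...
    return fragment["color"]

def rasterize_fragment(x, y, z, color, framebuffer):
    """Return True if the fragment was shaded and written, False if rejected."""
    if z >= depth_buffer[y][x]:          # Early-Z: reject before shading
        return False
    depth_buffer[y][x] = z
    framebuffer[y][x] = shade({"color": color})
    return True

fb = [[(0, 0, 0)] * TILE for _ in range(TILE)]
print(rasterize_fragment(3, 4, 0.5, (255, 0, 0), fb))   # True: visible, gets shaded
print(rasterize_fragment(3, 4, 0.8, (0, 255, 0), fb))   # False: behind it, no shading spent
```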
The optimizations above still do not catch every case of triangle overlap, so we can go one step further. In the pixel shading stage, once the triangles covering a pixel and their depth information are known, the hidden triangles can be discarded so that color, lighting, and texturing are computed only for the triangle that is actually visible, and the subsequent blending is avoided as well. This is TBDR, tile-based deferred rendering, and it even works across draw calls. If the vertex count is small enough and the triangles overlap enough, Early-Z and TBDR together can cut the pixel shading workload dramatically.

In the output phase there is another optimization: compute a CRC signature over each finished tile as it is written out. For a 16x16 tile the signature is typically only about 1% of the data, around 100 KB for a 1080p screen. On the next frame, the CRC is read back from DDR, or even from an on-chip cache, to check whether the tile's content has changed; if not, the write is simply dropped and the bandwidth saved. The rendering work already done cannot be recovered, however. If it is known in advance that a region of the screen will not change for a while, the GPU can be told up front to exclude that region from pixel rendering altogether, eliminating the earlier computation as well; this has to be coordinated with the display module. There are many such optimizations at the rendering-pipeline level; almost every stage has some.
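Below is a conceptual sketch of this transaction-elimination idea: each 16x16 tile is hashed, and the write-back is skipped when the hash matches the previous frame. Python's zlib.crc32 stands in for the hardware signature, which real GPUs compute with their own logic.

```python
# Skip DDR write-back for tiles whose CRC matches the previous frame.
import zlib

TILE_BYTES = 16 * 16 * 4                      # one RGBA8888 16x16 tile

def write_frame(tiles, prev_crcs):
    """Return (new_crcs, bytes_written): only tiles whose CRC changed are written."""
    crcs = [zlib.crc32(t) for t in tiles]
    written = sum(TILE_BYTES for i, c in enumerate(crcs)
                  if prev_crcs is None or prev_crcs[i] != c)
    return crcs, written

# 1080p is roughly 120 x 68 = 8160 tiles of 16x16
frame0 = [bytes([i % 251]) * TILE_BYTES for i in range(8160)]
crcs, w0 = write_frame(frame0, None)          # first frame: everything is written

frame1 = list(frame0)
frame1[0] = bytes([255]) * TILE_BYTES         # a mostly static UI: one tile changes
_, w1 = write_frame(frame1, crcs)
print(f"frame 0: {w0/1e6:.1f} MB written, frame 1: {w1/1024:.0f} KiB written")
```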
After folding in all these optimizations, we finally have a first-cut GPU and can look at its area breakdown: the execution engines (EE) take 47% and the texture unit (TEX) takes 20%, so clearly these two blocks are where further optimization should focus. This leads to a core goal of a graphics processor: higher compute density. As mentioned earlier, compute density is measured by pixel output rate on a low-end, UI-oriented GPU and by floating-point density at the gaming high end. To raise UI pixel output density, the compute units are not the limiting factor once the blending capacity is fixed; the texture unit has to match the output capability. The pixel-to-texel ratio is generally 1:2 or 1:1. With a 1:2 ratio, two texels can be blended into one output pixel per cycle, which is useful in some UI scenarios and doubles the effective pixel output rate; but when it is not used, the extra texture-unit area is wasted. The right ratio can only be decided against real workloads. Raising floating-point density is conceptually simple: pile on more arithmetic units and then scale the surrounding fixed-function hardware, instruction issue, caches, and bandwidth to match.

Designing the arithmetic unit itself still involves choices. ARM previously used a SIMD+VLIW architecture: one pixel is one thread, and its four RGBA channels form a 4-wide vector of 32-bit data, i.e. one SIMD instruction. The compiler then tries to find six operations that can issue in parallel and packs them together: vector multiply, vector add, scalar multiply, scalar add, branch, and table lookup. This corresponds to the compute-unit design shown below. Since Mali is tile-based, a 16x16 tile holds 256 pixels, which is 256 threads; these threads can belong to different tiles, some computing depth, some computing color. If the thread manager could always find six such operations to feed the different arithmetic units and keep them all busy, utilization would naturally be maximal. Unfortunately such a favorable mix is rare; in many cases only the vector units are busy while the rest sit idle.

So Mali changed style: the scalar multiply and scalar add in the figure above were removed, the VLIW packing was dropped, and the branch unit was pulled out, producing the new processing unit shown below. Here a 128-bit-wide fused multiply-add unit and an equally wide add unit take over the previous scalar and vector multiply-adds, and the table-lookup operation is folded into the add unit. What is completely different from before is that the input is now always a single 128-bit-wide instruction rather than six VLIW slots, which raises the utilization of the compute units. The data arriving each clock can go to either the FMA or the ADD/TBL unit but not both at once, which does cost some effective area utilization. In line with this design, Mali also rearranged the data layout, as shown above: instead of operating on one pixel's RGBA channels as a four-wide vector, the same color channel of four adjacent pixels is extracted and fed into the FMA, running four threads at once. Since in most cases the same channel of the 256 pixels in a tile performs the same computation, high utilization is all but guaranteed; if the four neighboring pixels happen to need different operations per channel, efficiency naturally drops.

Some GPUs use yet another arrangement of arithmetic units: the multiply-accumulate units are placed in a small unit with a higher ratio, the large unit handles multiply-add plus table lookup with a lower ratio, and the branch unit stands alone. This makes the mix of compute units more balanced and the area utilization higher.
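To make the data-layout change concrete, the sketch below contrasts the two execution styles described above: per-pixel RGBA vectors versus feeding the same channel of four adjacent pixels into one wide unit. It is a pure data-layout illustration in plain Python, not shader or microarchitecture code.

```python
# Two ways to organize the same per-pixel multiply-add (out = in * gain + bias):
# (a) AoS / SIMD-per-pixel: each issue handles one pixel's R,G,B,A together.
# (b) SoA / channel-split: each issue handles one channel of four pixels, so all
#     lanes run the identical operation (high utilization when the four pixels
#     follow the same code path).

pixels = [(0.1, 0.2, 0.3, 1.0), (0.4, 0.5, 0.6, 1.0),
          (0.7, 0.8, 0.9, 1.0), (0.2, 0.3, 0.4, 1.0)]
gain, bias = 0.5, 0.1

# (a) per-pixel RGBA vectors
aos = [tuple(c * gain + bias for c in px) for px in pixels]

# (b) transpose to channel-major, run the wide op per channel, transpose back
soa_in  = list(zip(*pixels))                         # [(R0..R3), (G0..G3), ...]
soa_out = [tuple(c * gain + bias for c in lane) for lane in soa_in]
soa     = list(zip(*soa_out))                        # back to per-pixel tuples

print(aos == soa)                                    # True: same math, different layout
```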
Once the capabilities of the compute units are fixed, is there a systematic way to fine-tune the proportions of the supporting modules? Yes: first settle on the benchmark, then break it down into sub-tests, and finally evaluate them on a performance model. The popular standard on mobile today is GFXBench, with three mainstream generations, 2.x, 3.x, and 4.x, each with its own emphasis: 2.x stresses triangle generation, 3.x stresses the compute units and texturing, 4.x stresses compute. AnTuTu is also widely used and currently emphasizes shadowing and triangle generation rates. Sometimes chip and handset companies also profile mainstream games on silicon, simulation platforms, or even models to derive the compute requirements of the next-generation GPU. With the benchmark targets set, the next step is to refine them into smaller goals: triangles, vertices, pixels, texels, depth/stencil, varyings, blending, and bandwidth requirements. These refined requirements are then translated on the model into PPA targets for each small module to see whether there is room for compression. This is the lowest level of optimization.

After all this polishing we have a fairly good GPU. Is there nothing left to improve? Not quite; several application-level requirements still come knocking: DRM, data compression, system-wide hardware coherency, a unified memory address space, AR/VR/AI, and the CPU cost of the driver.

First, copyright protection. When playing protected content, decryption and decoding happen in the secure world, and rendering the UI may still need the GPU. This is covered in the security article and will not be expanded here. Second, all media and texture data can be compressed to save system bandwidth and cost, as discussed above, so it is not repeated here. Third, two-way hardware coherency. For the GPU, CPU, and accelerators to share data without copies or cache flushes, the bus must support bidirectional coherency, such as the CCI-550 discussed in the basics article. The latest OpenCL 2.0 and Vulkan support this feature; as shown in the figure below, when data is exchanged very frequently it can save 30% or even 90% of the run time. Unfortunately, OpenGL ES inherently does not support it, so for most current graphics applications the GPU will not issue any snooping transactions even when a CCI-550 is present. This will improve once Vulkan takes over. Fourth, a unified physical address space with the CPU. On the desktop, CPU and GPU address spaces are completely separate, whereas mobile processors have unified the physical address space from the start, even though the page tables remain separate. As noted in the basics article, the benefit of unification is saved bandwidth and cost, which suits the tile-based GPU's relatively modest bandwidth; the rest is a matter of bus and memory-controller scheduling to keep DRAM bandwidth utilization high. One application of hardware coherency plus a unified address space is heterogeneous computing: CPU, GPU, DSP, and accelerators can share the same physical addresses while the hardware maintains coherency automatically, operating on images or audio. Unfortunately, on a high-end phone, running all of these units at once pushes power well past 2.5 watts, even toward 10 watts; and since GPU software support is limited, bidirectional hardware coherency has not seen wide use. At present it is mostly applied to ISP post-processing.

Fifth, AI, which we touched on in the AI article. If inference runs on the GPU, it must support INT8 or even narrower multiply-accumulate operations, which is not a problem for the GPU. But AI also wants heavy parameter compression, which Mali GPUs do not natively support. As it stands, a quad-core MP4 with nearly two 128-bit AXI interfaces can provide only about 32 GB/s of read bandwidth; without compression that feeds on the order of 64 G INT8 operations, far less than the roughly 1 TOPS of INT8 compute the quad-core GPU can theoretically deliver.
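A quick roofline-style check of that last point, using the text's 32 GB/s and 1 TOPS figures; the assumed two INT8 operations per fetched byte is my own choice, made to reproduce the ~64 G figure quoted above.

```python
# Bandwidth versus compute for uncompressed INT8 inference on the quad-core GPU.
# 32 GB/s and ~1 TOPS are the article's figures; the reuse factor is an assumption.

read_bw_Bps = 32e9            # ~two 128-bit AXI read ports
peak_int8_ops = 1e12          # ~1 TOPS of INT8 multiply-accumulate
ops_per_fetched_byte = 2      # assumption: one multiply and one add per byte streamed in

bw_limited_ops = read_bw_Bps * ops_per_fetched_byte
print(f"bandwidth-limited INT8 throughput : {bw_limited_ops/1e9:.0f} GOPS")        # ~64
print(f"fraction of peak compute          : {bw_limited_ops/peak_int8_ops:.1%}")   # ~6%
print(f"reuse needed per byte to hit peak : {peak_int8_ops/read_bw_Bps:.0f} ops/byte")
```

Without parameter compression or heavy on-chip reuse, the GPU's INT8 math sits mostly idle, which is the point being made above.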
Sixth, VR, which bears considerably on the GPU. Since the left and right eyes must be rendered separately and the resolution needs to be 4K or more, this demands roughly 8 times the GPU performance of 1080p and is also a test of system bandwidth. Broken down, the requirements are as follows (a small decision sketch for the ATW item follows after this list):

- Independent rendering for the left and right eyes: as shown below, the triangle and vertex work of a VR scene is shared between the two eyes, but the rotation and all subsequent pixel processing must be done separately. Since vertex processing accounts for only about 10% of the total workload, the compute saved is limited, but it does reduce some vertex and attribute fetches.
- Distortion correction: easily done by adding one extra matrix multiplication in the vertex stage. Because this operation is inserted between vertex and pixel rendering, an additional API is needed to tell the hardware about it.
- Asynchronous time warp (ATW): when it becomes clear that the next frame cannot be rendered in time, rather than computing it in full, the current frame is reprojected according to the head's motion to synthesize a substitute image. This needs an API to estimate the next frame's rendering time, plus a timer that tracks the time remaining until the display module's vsync; if time runs short, the warped image is used instead. That warp is also computed on the GPU; it is cheap and runs at high priority so that vsync is never missed.
- Multi-view rendering: render the non-focus area of the view at low resolution and the focus area at high resolution. Compared with rendering everything at high resolution, total computation can be roughly halved. In practice, determining the focus area requires eye tracking, which is complicated, so the center region is used instead.
- Front-buffer rendering: in the classic desktop design the display buffer is double-buffered, with front and back buffers output alternately, synchronized by vsync and flipped per frame. Here only a single front buffer is used, hsync serves as the synchronization marker, and output proceeds line by line; the total rendering time of a frame does not change, but the granularity becomes finer. For the GPU this means imposing a row-by-row rendering order, which is not the same as the multi-view case above: multi-view partitions the view into regions but does not dictate rendering order, tiles still finish out of order, and only the final output is well formed. Front-buffer rendering effectively inserts a synchronization point between rows, and if it is scheduled naively in hardware it is likely to hurt performance. It is also possible to leave the GPU's frame buffer as it is and let the display module handle this, which is easier to implement.
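Referring back to the ATW item above, here is a minimal sketch of the render-or-warp decision; the function and parameter names are invented for illustration and do not correspond to any real VR runtime API.

```python
# Minimal render-or-warp decision for asynchronous time warp (ATW):
# if a full frame cannot finish before the next vsync, submit a high-priority
# reprojection of the previous frame instead.
import time

VSYNC_PERIOD_S = 1.0 / 90.0            # typical VR panel refresh

def choose_work(now_s, next_vsync_s, estimated_render_s, margin_s=0.001):
    """Return 'render' if the real frame fits before vsync, else 'warp'."""
    time_left = next_vsync_s - now_s
    if estimated_render_s + margin_s <= time_left:
        return "render"                # draw the real next frame
    return "warp"                      # reproject the last frame using head pose

now = time.monotonic()
next_vsync = now + VSYNC_PERIOD_S / 2  # about half a frame of time remaining
print(choose_work(now, next_vsync, estimated_render_s=0.004))   # render
print(choose_work(now, next_vsync, estimated_render_s=0.009))   # warp
```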
Seventh, AR. As mentioned in the AI article, for AR the GPU is really just rendering; there is nothing special to add unless the recognition work is also handed to the GPU.

The last important point is the load the GPU driver places on the CPU. Under OpenGL ES, limitations of the API force much of the driver's work into a single thread, which means it must run on one CPU core; the more GPU cores there are, the heavier the load on that single core. Before the Mali G71, some 10 to 12 GPU cores at 900 MHz needed an A73 at 2.5 GHz just to run the driver, and the other cores, big or little, could not help. Worse, because a big core is 4 to 5 times less energy-efficient than a little one, having to drive the GPU from a big core costs even more power. And if still more GPU cores are added, the final bottleneck becomes the performance of that single CPU core rather than the GPU.

There are several ways out. One is Vulkan, which took this into account during its design and naturally supports multi-threaded load balancing on the CPU, solving the problem nicely. I have seen a desktop GPU on a 16-core A53 server deliver only 30% of the performance it achieves on x86 servers until the application switched from OpenGL to Vulkan, which moved the bottleneck back to the GPU itself. However, even though Vulkan is already Google's next-generation graphics interface, widespread Vulkan-based graphics applications may still take 3 to 5 years. Another possible approach is to optimize the driver software itself and shift part of its work into hardware, for example hardware management of the memory-management module, or to use an MCU with an embedded cache to translate and process the command stream and jobs in place of the CPU. Such an MCU only needs to run at a few hundred megahertz and consumes a dozen or so milliwatts, far less than the 100-plus milliwatts of a little core, let alone the several hundred milliwatts of a big core.

Finally, the PPA of the resulting GPU has to be compared with Qualcomm's Adreno (Apple's GPU also lags Qualcomm in compute density). Among all the tile-based GPUs in the world, Qualcomm's are the best in energy efficiency and performance density, exceeding the latest Mali GPUs by more than 30%. There are several places where Mali finds it hard to catch up: the system cache, front-end optimization for a small number of configurations (Mali has to support 1 to 32 cores), and a fixed, well-characterized back-end process with its attendant physical optimization. Each of these can contribute 5% to 10%, and together they add up to a significant advantage.

In conclusion, GPU design is a continuous process of refinement: choose the right direction, define the benchmark targets clearly, watch new trends and demands, and master the modeling and verification flow. The rest is simply constant improvement.
