Experiment 2.2 release: Build mode rendering issue and fix

Lskyi · October 15, 2024, 11:02am

Problem Description

The program is compiled with gcc 14.2.0, and the issue occurs only in Release mode. I reproduced it on two Windows computers with similar environments.

Rendering Behavior

First click render
Second click render
Third click render
After modifying the camera light parameters, the third click render looks like two images are stitched together.
Fourth click render

Log Output

At first I thought it was a loading issue. I then added logs with printf (spdlog crashes the program when called at the same location; dandelion seems to use a single-threaded spdlog, while printf appears to be thread-safe). The logs revealed that the fragment stage finished before the rasterizer stage did.

vertex_processed:17568 rasterized: 5856 frag_num:306 dropped:23 [Rasterizer Renderer] [info] rendering (single thread) takes 0.005859 seconds
[Rasterizer Renderer] [info] finished render data load
vertex_processed:17568 rasterized: 5856 frag_num:17382 dropped:8847 [Rasterizer Renderer] [info] rendering (single thread) takes 0.007812 seconds
[Rasterizer Renderer] [info] finished render data load
vertex_processed:17568 frag_num:9908 dropped:923 rasterized: 5856 [Rasterizer Renderer] [info] rendering (single thread) takes 0.003906 seconds
[Rasterizer Renderer] [info] finished render data load
vertex_processed:17568 rasterized: 5856 frag_num:3044 dropped:1263 [Rasterizer Renderer] [info] rendering (single thread) takes 0.007812 seconds
[Rasterizer Renderer] [info] finished render data load

Solution

I changed Context::rasterizer_finish to an atomic variable, both in its declaration in the header file and in its definition in rasterizer_renderer.cpp. The issue was resolved.
Screenshot 2024-10-15 190453

vertex_processed:17568 rasterized: 5856frag_num:7929 dropped:2755 [Rasterizer Renderer] [info] rendering (single thread) takes 0.005859 seconds
[Rasterizer Renderer] [info] finished render data load
vertex_processed:17568 rasterized: 5856frag_num:7929 dropped:2755 [Rasterizer Renderer] [info] rendering (single thread) takes 0.007812 seconds
[Rasterizer Renderer] [info] finished render data load
vertex_processed:17568 rasterized: 5856frag_num:7929 dropped:2755 [Rasterizer Renderer] [info] rendering (single thread) takes 0.005859 seconds
[Rasterizer Renderer] [info] finished render data load
vertex_processed:17568 rasterized: 5856frag_num:7929 dropped:2755 [Rasterizer Renderer] [info] rendering (single thread) takes 0.005859 seconds
[Rasterizer Renderer] [info] finished render data load

I did not modify other threads or queue-related parts of the original project. After reviewing the code repeatedly, I don't see any issue with the original ordering, and I don't understand how, after compiler optimizations, the fragment worker thread could read rasterizer_finish=true before the flag has actually been set. In any case, changing the source code as described solved the problem.
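For reference, here is a minimal sketch of a finish flag shared between two stages, assuming the flag is a std::atomic<bool>. DemoContext, run_two_stage_demo and the value 5856 are made up for illustration and are not taken from the dandelion sources.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the framework's Context flags.
struct DemoContext {
    static std::atomic<bool> rasterizer_finish;
};
std::atomic<bool> DemoContext::rasterizer_finish{false};

// Returns the count the consumer stage observed once the flag was set.
int run_two_stage_demo()
{
    int rasterized = 0;   // plain data, published through the flag
    int seen = -1;
    DemoContext::rasterizer_finish.store(false);

    std::thread fragment_stage([&] {
        // An atomic load must be performed on every iteration,
        // so the compiler cannot hoist it out of the loop.
        while (!DemoContext::rasterizer_finish.load()) {
            std::this_thread::yield();
        }
        seen = rasterized;   // ordered after the producer's write
    });

    std::thread rasterizer_stage([&] {
        rasterized = 5856;                           // write the data first
        DemoContext::rasterizer_finish.store(true);  // then publish the flag
    });

    rasterizer_stage.join();
    fragment_stage.join();
    return seen;
}
```

With the default (sequentially consistent) store and load, the consumer can only leave the loop after the producer's write to rasterized is visible, so the "fragment finished before rasterized" ordering cannot occur in this sketch.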

rouge · October 15, 2024, 3:28pm

Problem Description

Thank you for the question, I have a similar issue as well.
First, regarding:

I haven’t modified any other threads or queue‑related parts of the original project

I also haven’t made any changes.

My problem is similar: the program renders correctly when compiled and run in Debug mode, but not in Release mode. The specific symptom is slightly different, though: my program keeps running but is stuck before rendering (or something is preventing the rendering code from executing).

Analysis & Temporary Workaround

By tracing, I found that the issue occurs in VertexProcessor::worker_thread(). I suspect the cause is the O3 optimization in Release mode, where the compiler optimizes the loop around the first if (vertex_queue.empty()) {continue;}. So I added printf("\n"); on the line before the continue (any simple syscall would do) to keep the compiler from optimizing that continue away. After compiling and running, the program renders correctly in Release mode. (This might serve as a temporary workaround.)

void VertexProcessor::worker_thread()
{
    while (true) {
        VertexShaderPayload payload;
        {
            if (vertex_queue.empty()) {
                printf("\n");  // Insert this line to prevent the compiler from optimizing this continue
                continue;
            }
            // Omitted code below

Verifying the Optimization Bug

So I built dandelion in Release mode for the following two scenarios:

- With printf added at the location described above, to keep the compiler from optimizing the continue.
- Without printf, keeping VertexProcessor::worker_thread() unchanged.

Then I disassembled the built dandelion, as shown in the image (I manually used --- to omit non‑branch instructions to fit the code on one page). The left side shows the disassembly with printf, the right side without printf.

Left side code starting from the first jump:
- cmp... + je d0770 → call printf → jmp d065a → cmp... + je d0770.
- This corresponds to if (vertex_queue.empty()) {continue;}. As in the actual run, everything works fine.
Right side code starting from the first jump (two possibilities):
- empty returns false, execution proceeds: d0656: jmp d0755 → d0755: cmp... + jne d0660 → d0660: call mutex_lock
- empty returns true, execution does not proceed: d0656: jmp d0755 → d0755: cmp... + jne d0660 (not taken) → jmp d0763 → d0763: jmp d0763
- When the program reaches d0763, it does not go back to the empty check but enters an infinite loop: an unconditional jump to itself (jmp IP - 2). I suspect this is why the Release build hangs.

So I patched that part of the dandelion binary directly, replacing the original eb fe (jmp IP - 2) with c3 c3, i.e. ret ret. After saving and running, the function returns immediately and the program exits, which confirms that there was indeed an infinite loop. The hang before rendering described earlier therefore most likely originates here.
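To see why eb fe jumps to itself: on x86, opcode eb is a short jump whose operand is a signed 8-bit offset measured from the end of the two-byte instruction. The helper below is only a worked calculation (short_jmp_target is a made-up name), not part of any tool used above.

```cpp
#include <cassert>
#include <cstdint>

// Target of an x86 short jump (opcode EB followed by a signed 8-bit offset).
// The offset is relative to the address of the next instruction, i.e. ip + 2.
std::uint64_t short_jmp_target(std::uint64_t ip, std::uint8_t rel8)
{
    const auto offset = static_cast<std::int8_t>(rel8);  // 0xFE becomes -2
    return ip + 2 + static_cast<std::int64_t>(offset);
}
```

For the instruction at d0763, short_jmp_target(0xd0763, 0xFE) is 0xd0763 + 2 - 2 = 0xd0763, so the jump lands on its own first byte and never makes progress.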

Lskyi · October 15, 2024, 4:57pm

I disassembled my own .exe and found extremely similar code.
Screenshot 2024-10-16 003813
However, my program has never entered an infinite loop at run time.

I then considered that input_vertices() might run so fast that the worker never takes the empty-queue branch at the start of the loop. So I added std::this_thread::sleep_for(std::chrono::seconds(2)); between launching the workers and starting to input vertices in RasterizerRenderer::render() (the main rendering thread). This guarantees that the vertex worker hits the empty check at least once.
With that change I reproduced the bug you encountered: the program becomes unresponsive.

rouge · October 17, 2024, 4:38am

It might be due to device performance; on my machine it always ends up at eb fe. I may need to test on a higher-performance device.

You're right, it seems that the empty check on the queue in every worker_thread has problems under O3, for example in Rasterizer::worker_thread().

btw, I noticed that dandelion has a branch to address this issue, but more testing is needed:

However, I have found two problems with the new branch so far: 1) Compilation yields a conflicting declaration because in graphics_interface.h the symbols vertex_finish, rasterizer_finish, and fragment_finish are declared as volatile static bool, while in rasterizer_renderer.cpp they are redefined as bool. 2) After I manually unified the definitions in rasterizer_renderer.cpp with the header, the eb fe issue still persists after compilation; other issues have not been tested yet.
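A reduced illustration of the first problem, assuming the flags are static data members of a class (DemoContext and read_vertex_finish are hypothetical names, not the framework's actual declarations):

```cpp
#include <cassert>

// Hypothetical reduction of the header/source mismatch.
struct DemoContext {
    volatile static bool vertex_finish;   // declaration, as in the header
};

// The out-of-class definition must repeat the type and qualifiers of the
// declaration (but not the `static` keyword). Writing
// `bool DemoContext::vertex_finish = false;` here would be rejected
// as a conflicting declaration.
volatile bool DemoContext::vertex_finish = false;

// Small accessor so the flag can be inspected from a caller.
bool read_vertex_finish()
{
    return DemoContext::vertex_finish;
}
```

This only addresses the compile error; as noted above, making the definitions consistent did not by itself remove the eb fe loop.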

Also, could the teaching assistant please tell me how this problem should be evaluated under the current circumstances?

greyishsong · October 17, 2024, 1:30pm

The analyses and experiments from all of you are good; this was indeed an oversight on our part. The issue had in fact been noticed before, but when updating we forgot to port the fix from the dev channel to the release channel, so the released experiment framework does not handle flags such as Context::vertex_finish correctly. Moreover, the outer loops of the vertex thread and the fragment thread should not be while (true) but rather while (!Context::vertex_finish), and so on.
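As a rough sketch of a worker loop with an explicit exit condition instead of while (true): everything here (DemoStage, input_done, finished, run_stage_demo) is hypothetical, and unlike the framework code it takes the lock before checking whether the queue is empty.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <queue>
#include <thread>

struct DemoStage {
    std::queue<int> queue;                // pending work items
    std::mutex queue_mutex;               // guards `queue`
    std::atomic<bool> input_done{false};  // set by the producer
    std::atomic<bool> finished{false};    // set by the worker itself
    int processed = 0;                    // sum of all handled items

    void worker_thread()
    {
        while (!finished.load()) {
            int item = 0;
            {
                std::lock_guard<std::mutex> lock(queue_mutex);
                if (queue.empty()) {
                    // Nothing to do: stop only if the producer is done too.
                    if (input_done.load()) finished.store(true);
                    continue;
                }
                item = queue.front();
                queue.pop();
            }
            processed += item;
        }
    }
};

// Pushes 1..n into the stage and returns the sum the worker computed.
int run_stage_demo(int n)
{
    DemoStage stage;
    std::thread worker(&DemoStage::worker_thread, &stage);
    for (int i = 1; i <= n; ++i) {
        std::lock_guard<std::mutex> lock(stage.queue_mutex);
        stage.queue.push(i);
    }
    stage.input_done.store(true);
    worker.join();
    return stage.processed;
}
```

Because the loop condition reads an atomic, the compiler has to re-evaluate it on each iteration, and the thread exits once the producer is done and the queue has been drained.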

According to our understanding, this kind of problem is caused by compiler optimizations and multicore cache incoherence. The rasterization renderer (serial version) actually involves four threads: the main thread, the vertex thread, the rasterization thread, and the fragment thread. During code generation, the compiler may reorder instructions that are not constrained by a memory barrier, and at run time the CPU may execute instructions that have no data dependencies out of order. Either effect can prevent a flag set in one stage from being read correctly by the thread of the next stage, or even change the order of the flag reads and writes. On modern CPUs, these four compute-intensive threads are usually assigned to different cores, so they cannot share L1/L2 cache, and it is possible that after one thread modifies a flag, other threads still see the old value.

Lskyi’s approach of using atomic variables leverages the read‑write consistency of atomics (the default memory order for C++ atomic types is memory_order_seq_cst, i.e., sequential consistency). Rouge’s addition of printf acts similarly to inserting a memory barrier (memory‑bound; most standard library output implementations include one). Both actions force multicore cache synchronization and prevent the compiler from reordering memory read/write instructions, thereby solving the loop issue.

Lskyi's comment is entirely valid; we do try to minimize mutex wait overhead. Since these threads are all compute-intensive (CPU-bound), using a mutex or sleep to pause a thread is wasteful. It is preferable to avoid locking altogether, and when locking is logically required, atomic variables or spinlocks should replace mutexes. The current solution on the dev channel is to add volatile so that these flags are re-read from memory on every access rather than from a cached copy; the related updates are being re-validated.
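For illustration, a spinlock of the kind mentioned above can be sketched with std::atomic_flag. This is a generic example under my own naming (SpinLock, run_spinlock_demo), not the code used on the dev channel.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Minimal spinlock: busy-waits instead of putting the thread to sleep.
class SpinLock {
public:
    void lock()
    {
        // test_and_set returns the previous value; spin while it was set.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Several threads increment one counter under the spinlock.
int run_spinlock_demo(int threads, int increments)
{
    SpinLock lock;
    int counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                std::lock_guard<SpinLock> guard(lock);  // SpinLock is BasicLockable
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

A spinlock like this only pays off when critical sections are very short, as with flag updates or a single queue operation; for longer waits a mutex is usually the safer choice.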

You should still follow the documentation requirements. However, if you completed the experiment based on version v1.1.1 and are being evaluated this weekend, you may modify any code of the rasterization pipeline (including code beyond the scope of the lab manual) to ensure the rendering process completes correctly; if you completed the experiment based on a later updated version, you may only modify code within the range specified by the lab manual.
