Autumn recruitment season has started; I’m writing some canonical computer notes for easy memorization and review
Friends are welcome to supplement, ask questions, or write down the important canonical knowledge you consider essential in the thread
Shallow Copy, Deep Copy, std::move
-
Shallow copy: copies the object's members byte for byte. This is fine in many scenarios, but when the copied object holds pointers or file descriptors, two kinds of errors can occur:
- During use, the original object and the copy hold pointers to the same address; a change through one affects the other, which usually contradicts the intended logic.
- During destruction, both the original and the copy free the same pointer, leading to a double-free and undefined behavior.
-
Deep copy: Used to fix the errors described above. A deep copy must be implemented manually (assignment operator / copy constructor). Suppose the original object has a pointer; you allocate a new block of memory and copy the contents pointed to by the original pointer into this new block. Thus the original object and the copy operate independently.
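A minimal sketch of a deep copy for a class that owns a raw buffer (the class name and members are illustrative only):
#include <algorithm>
#include <cstddef>
class Buffer {
public:
    explicit Buffer(std::size_t n) : size(n), data(new int[n]()) {}
    // deep copy: allocate new memory and copy the pointed-to contents
    Buffer(const Buffer& other) : size(other.size), data(new int[other.size]) {
        std::copy(other.data, other.data + other.size, data);
    }
    Buffer& operator=(const Buffer& other) {
        if (this != &other) {               // guard against self-assignment
            int* tmp = new int[other.size]; // allocate first, for exception safety
            std::copy(other.data, other.data + other.size, tmp);
            delete[] data;
            data = tmp;
            size = other.size;
        }
        return *this;
    }
    ~Buffer() { delete[] data; }
private:
    std::size_t size;
    int* data;
};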
-
std::move: casts its argument to an rvalue reference (it does not move anything by itself); a simplified version looks roughly like
template<typename T>
typename std::remove_reference<T>::type&& move(T&& t) noexcept {
return static_cast<typename std::remove_reference<T>::type&&>(t);
}
The real std::move differs only in minor template details omitted here. Writing a = std::move(b); enables move semantics, i.e., transfers ownership of b's resources to a. After the move, b is left in a valid but unspecified "moved-from" state (the exact state is determined by the move constructor / move assignment operator, as shown below); it should not be used further except to assign a new value to it or to destroy it.
The following code implements a move constructor and a move assignment operator:
class A{
public:
A() : data(nullptr) {
}
A(int value) : data(new int(value)) {
}
A(A&& other) : data(other.data) {
other.data = nullptr;
}
A& operator=(A&& other) {
if (this != &other) {
delete data;
data = other.data;
other.data = nullptr;
}
return *this;
}
~A() {
delete data;
}
private:
int *data;
};
Lambda Expressions
Usage:
int x = 10;
auto f = [x](int c = 5) { return x + c; };
std::cout << f() << std::endl;
The square brackets contain the capture list, which can be specified in the following ways:
- [ ]: capture nothing
- [=]: capture all variables used in the lambda by value
- [&]: capture all variables used in the lambda by reference
- [x]: capture x by value
- [&x]: capture x by reference
- [this]: capture the enclosing object's this pointer
The parentheses contain the parameter list, used similarly to a function.
A lambda that captures by value cannot modify its copy unless it is declared mutable; to modify the original x, capture it by reference instead (no mutable needed), as shown:
int x = 10;
auto f = [&x](int c = 5) { x = c; };
f();
std::cout << x << std::endl; // prints 5
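For comparison, a by-value capture needs mutable to modify its own copy, and the original variable stays unchanged (a small illustrative sketch):
#include <iostream>
int main() {
    int x = 10;
    auto f = [x](int c) mutable { x = c; return x; };  // mutable lets the lambda modify its captured copy
    std::cout << f(5) << std::endl;  // 5: the copy inside the lambda
    std::cout << x << std::endl;     // 10: the original x is unchanged
}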
Virtual Functions and CRTP Implement Function Polymorphism
Virtual Functions:
class A{
public:
virtual void print() {
puts("A");
}
};
class B: public A{
public:
void print() {
puts("B");
}
};
class C: public A{
public:
void print() {
puts("C");
}
};
int main() {
B x;
A& y = x;   // dynamic dispatch requires a base-class reference or pointer; copying would slice the object
y.print();  // prints "B"
}
Drawbacks:
- Uses an extra vtable pointer in memory compared to CRTP
- At runtime incurs an additional pointer indirection compared to CRTP
- Functions cannot be inlined, causing significant performance impact for short functions
- Code predictability is low, which is detrimental to prefetching and indirect branch prediction
CRTP:
template<typename der>
class A{
public:
void print() {
static_cast<der *>(this)->printName();
}
void printName() {
puts("A");
}
};
class B: public A<B>{
public:
void printName() {
puts("B");
}
};
class C:public A<C>{
public:
void printName() {
puts("C");
}
};
template<typename der>
void CRTP(A<der>& x) {
x.print();
}
int main() {
B x;
CRTP(x);
}
Drawbacks:
- Code logic is more complex, making it harder to read
- Cannot achieve dynamic dispatch
Out-of-Order Execution and Memory Barriers
Out-of-Order Execution:
Many topics were covered in architecture classes (e.g., scoreboard algorithm and Tomasulo algorithm); here’s a pasted excerpt from the web.
Out-of-order processors typically handle instructions in the following steps:
- Instruction fetch
- Instruction is dispatched to the instruction queue
- Instruction waits in the queue until its input operands become available (once operands are ready, the instruction can leave the queue even if earlier instructions have not executed)
- Instruction is assigned to an appropriate functional unit and executed
- Execution result is placed into a queue (instead of being written to the register file immediately)
- Only after the execution results of all earlier instructions have been written to the register file can the result of this instruction be written (reordering results to make execution appear in order)
From the above process, it can be seen that out-of-order execution avoids waiting for unavailable operands (the second step of in-order execution), thus improving efficiency. On modern machines, the processor runs much faster than memory, and an in-order processor spends much of its time waiting for data that could otherwise be used to process many instructions.
Memory barriers are divided into two categories:
- Compiler barriers
Define variables with the volatile keyword; volatile has three implications:
- The access order between volatile variables will not be altered by the compiler
- Each read of a volatile variable accesses memory
- Volatile variables are not subject to aggressive compiler optimizations
Example:
//thread1
volatile bool tag = true;
volatile int x = 0;
while(tag);
work(x);
//thread2
x = 10;
tag = false;
Here, thread1 enters an infinite loop; thread2 sets x to 10 and writes it to memory, then sets tag to false and writes it to memory. Thread1 reads tag from memory, the loop stops, and work(10) is called.
If volatile is not used, compiler optimizations may cause memory reordering, e.g., in thread2 the assignment x = 10 might be executed after tag = false, potentially causing thread1 to execute work(0).
- CPU barriers
- Ensure ordered reads and writes: mb() and smp_mb()
- Ensure ordered writes only: wmb() and smp_wmb()
- Ensure ordered reads only: rmb() and smp_rmb()
These memory barriers are used on multi-core processors; compiler barriers alone are insufficient, as cache coherence is also involved, which is beyond the scope of this discussion.
Cache Coherence and Memory Barriers
MESI cache coherence protocol:

In a dual-core processor (the original figure is omitted here), "memory" can be main memory or the L3 cache (usually the L3 cache).
The two caches are the per-core L1/L2 caches; for simplicity, assume there is only one level of cache.
Consider the following scenario:
There is a variable x = 0, present in memory and in CPU0’s cache
- CPU1 reads x, cache miss, requests the x data from memory and stores it in its cache
- CPU0 writes x = 1, cache hit, directly modifies it in CPU0’s cache; if the cache is not write‑through, the value of x in memory remains 0
- CPU1 reads x, cache hit, reads x = 0, resulting in an error
To prevent the above situation, the MESI protocol is used to maintain cache coherence.
The MESI protocol has four states, as follows:
I: Invalid – the cache line’s data is invalid.
S: Shared – the cache line is up‑to‑date and at least one other core’s cache also holds that cache line.
M: Modified – the cache line is up‑to‑date and differs from main memory; no other core’s cache holds the latest copy.
E: Exclusive – the cache line is up‑to‑date and matches main memory; no other core’s cache holds a newer copy.
Maintaining these states based on various CPU actions forms a four‑state automaton; during transitions, cores must exchange information.
Possible interaction message types:
Read: Sent on a cache miss, to obtain the latest copy from another core’s cache or from main memory; the message contains the address of the data being read.
Read Response: Response to a Read, containing the latest copy of the requested data.
Invalidate: The receiving cache invalidates the corresponding cache line; the message contains the address to be invalidated.
Invalidate Ack: Acknowledgment of an Invalidate, returned after the cache line has been invalidated.
Read Invalidate: Reads the latest data of a cache line and invalidates the corresponding line in the core’s cache. The message contains the address to be read and invalidated. This signal is typically sent on a write cache miss and requires both a read response and an invalidate acknowledgment.
Writeback: Occurs when a cache line in the M state is evicted, writing the dirty block back to main memory. The message contains the data address and the cache line’s contents.
The MESI protocol is correct but not very efficient; some simple optimizations can be added, such as a Store Buffer and an Invalidate Queue.

The figure (omitted here) shows the arrangement on the ARM architecture; the x86 architecture does not have an Invalidate Queue.
Store Buffer
When a CPU experiences a write miss, it must send a Read Invalidate and wait for a response, which can take a considerable amount of time; during this period the CPU cannot do other work, severely impacting performance. A store buffer solves this problem. As shown in the figure, when a write miss occurs, the CPU can send a read‑invalidate signal while storing the data to be written in the store buffer, then continue executing subsequent instructions. Once the other CPU(s) or main memory return the latest data and all invalidate acknowledgments are received, the data is read from the store buffer and written into the cache.
Store Forwarding
Consider the following scenario:
Initially x = 0 and is present in the CPU’s cache.
- The CPU writes x = 1 to the cache; the write is stored in the store buffer and an invalidate operation is issued.
- The CPU reads x; at this point x = 1 is still in the store buffer and has not been committed. Because the cache line hits, it returns x = 0.
Store Forwarding: When the CPU reads data, it first checks whether the data exists in the store buffer; if so, it reads it directly, otherwise it reads from the cache.
Store Buffer and Memory Barriers
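The figure referenced below has not survived; the following is a sketch of the classic foo/bar example it presumably showed (smp_mb() is the Linux kernel macro; a GCC builtin is used here as a stand-in so the sketch compiles):
#include <cassert>
#define smp_mb() __sync_synchronize()   // stand-in full barrier, for illustration only
int a = 0, b = 0;
void foo() {                 // runs on CPU0
    a = 1;
    smp_mb();                // the barrier whose removal is discussed below
    b = 1;
}
void bar() {                 // runs on CPU1
    while (b == 0)
        ;                    // spin until b becomes 1
    assert(a == 1);          // without the barrier in foo(), this can fail
}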

Assume the smp_mb() in foo() is removed from the code above, and consider thread0 executing the foo function on CPU0 while thread1 executes the bar function on CPU1.
Initially, the copy of a resides only in CPU1's cache and the copy of b only in CPU0's cache. Consider the following execution order:
- CPU0 executes a = 1; because a is not in CPU0's cache, CPU0 issues a read-invalidate signal and stores a = 1 in the store buffer.
- CPU1 executes while (b == 0). Since b is not in CPU1's cache, CPU1 sends a read request for the latest copy of b.
- CPU0 executes b = 1; because b is in CPU0's cache, it hits and simply updates the value to 1.
- CPU0 receives the read request from CPU1 and returns the new value of b (1) to CPU1.
- CPU1 receives the new value of b (1), updates its cache line, and because b is now 1, exits the while loop.
- CPU1 executes assert(a == 1), discovers that a is still 0, and the assertion fails.
- CPU1 receives the read-invalidate signal, returns the data to CPU0, and invalidates the cache line containing a. By then it is already too late!
smp_mb() flushes the contents of the Store Buffer to the cache, i.e., it blocks until all stores issued before that point are visible to all CPUs.
Invalidate Queues
A cache’s Store Buffer is usually small; when it becomes full, the CPU must wait for some entry to be flushed before pushing a new store instruction, which can significantly degrade performance. The flush rate of the Store Buffer depends on the response speed of Invalidate Ack messages.
When a CPU receives an Invalidate signal, it does not immediately invalidate; instead it stores all invalidate information in an Invalidate Queue and promptly returns an Invalidate Ack. The Invalidate Queue asynchronously invalidates the cache lines.
Invalidate Queues and Memory Barriers
Consider the following scenario:
CPU1 and CPU2 both hold the latest copy of x = 0 in their caches, with the cache line in the Shared state.
Assume the instructions execute in the following order:
- CPU1 writes x = 1, sends an Invalidate signal, and places the write into its store buffer.
- CPU2 receives the Invalidate signal, puts it into its Invalidate Queue, and immediately returns an Invalidate Ack.
- CPU1's store buffer receives the Invalidate Ack and writes x = 1 into its cache.
- CPU2 reads x; at this point CPU2 still sees the cache line in the Shared state and reads x = 0 (an incorrect result).
- CPU2's cache later changes the state of the cache line containing x to Invalid and pops the corresponding entry from the Invalidate Queue, but it is already too late.
One can still use smp_mb() to clear the contents of the Invalidate Queue, i.e., block until all loads issued before that point are visible to all CPUs.
Thoughts on Memory Barriers
In summary, the full barrier mb() essentially flushes both the Store Buffer and the Invalidate Queue; the write barrier wmb() flushes only the Store Buffer, while the read barrier rmb() correspondingly flushes only the Invalidate Queue.
Preprocessing
- Expand all macro definitions and conditional macros
- Insert the header files included by #include at the specified locations
- Remove comments, add file name and line number (for error reporting)
Compilation
The compilation process can be divided into six steps: lexical analysis, syntax analysis, semantic analysis, source code optimization, code generation, and target code optimization
The first four steps form the compiler's front end and the last two its back end, with the generation of the intermediate representation as the dividing line.
Specifically, lexical analysis splits the code into tokens, syntax analysis uses the tokens to produce an abstract syntax tree (AST), semantic analysis further processes the AST to check whether the program conforms to semantic rules, and source code optimization transforms the entire AST into intermediate code.
Assembly
The assembler translates a.s into machine language instructions and packages these instructions into a relocatable object file (i.e., a .o file).
Linking
Relocatable Object File
Its format and contents are as follows:
--------------------------------
| ELF Head | ELF Head, describing basic information of the object file
--------------------------------
| .text | Machine code
--------------------------------
| .rodata | Read-only data
--------------------------------
| .data | Initialized global and static variables
--------------------------------
| .bss | Uninitialized global and static variables (does not actually occupy space, only needs runtime initialization)
--------------------------------
| .symtab | Symbol table, stores all defined and referenced functions and global variables
--------------------------------
| .rel.text | Code relocation table, instructions that call external functions need relocation
--------------------------------
| .rel.data | Data relocation table, references to external global variables and external functions' global variables need relocation
--------------------------------
| .debug | Debugging information (only present when compiled with -g)
--------------------------------
| .line | Mapping between source line numbers and machine instructions
--------------------------------
| .strtab | String table, holding symbol names and section names
--------------------------------
| Section header table | Describes the location and size of each section in the file
--------------------------------
Static Linking
- The linker obtains the symbol tables of each object file and merges them into a single global symbol table.
- The linker obtains the lengths of each section of each object file, and calculates the length and position of each merged section.
- The start address of each section is defaulted to 0; at load time, fields are accessed via section base address + offset within the section.
-
Symbol Resolution
Each relocatable object file contains three types of symbols:
a. Symbols defined in the current module that cannot be referenced by other modules (static functions and static global variables)
b. Symbols defined in the current module that can be referenced by other modules (non-static functions and global variables)
c. Symbols referenced by the current module but defined in other modules (non-static functions and global variables defined elsewhere)
Mapping each symbol reference to exactly one symbol definition is symbol resolution; the rules are as follows:
A strong symbol is a function or an initialized global variable; a weak symbol is an uninitialized global variable.
a. Multiple strong symbols with the same name are not allowed.
b. Given one strong symbol and multiple weak symbols with the same name, the strong symbol is chosen as the definition.
c. Given multiple weak symbols and no strong symbol, any one of the weak symbols is chosen as the definition.
Relocation
In the object files generated above, the start address of each section defaults to 0; the linker must map symbol definitions to final memory addresses, and this mapping process is relocation.
a. The linker first merges the input sections into combined segments that become the segments of the executable file.
b. It calculates the length and start address of each segment.
c. Based on each segment's start address and each symbol's offset within it, it assigns final runtime addresses and patches the code and data references to those symbols.
Advantages and Disadvantages of Static Linking
Disadvantages: waste of space, because each executable must contain a copy of all required object files; difficult to update, as any change to a library function requires recompiling and relinking.
Advantages: the executable contains everything it needs, resulting in fast execution speed.
Dynamic Linking
The basic idea of dynamic linking is to split the program into relatively independent modules and perform linking at runtime.
Suppose there are two programs, program1.o and program2.o, both using the same library lib.o. When program1 is run first, the system loads program1.o; upon detecting that program1.o uses lib.o (i.e., program1.o depends on lib.o), the system then loads lib.o. If program1.o and lib.o further depend on other object files, they are also loaded into memory. When program2 runs, program2.o is loaded, and it is found that program2.o depends on lib.o, but lib.o is already present in memory, so it is not reloaded; instead, the existing lib.o in memory is mapped into program2’s virtual address space, and linking (similar to static linking) produces an executable.
Runtime Relocation for Dynamic Linking
- Relocate the .so’s text and data to a memory segment.
- Relocate all references to the .so file within the executable.
An interesting detail: when coding a linker in an ICS class, a reference to a symbol in a dynamic library goes through several pointer jumps on its first use to locate the symbol's actual address and record it; subsequent references use the recorded address directly (this is the lazy-binding mechanism behind the PLT/GOT).
Advantages and Disadvantages of Dynamic Linking
Advantages: even if each program depends on the same library, there will not be multiple copies in memory as with static linking; instead, the programs share a single copy at runtime. Updating is easy: just replace the library file without relinking all programs.
Disadvantages: linking is deferred to runtime, incurring performance overhead.
Loading
The loader copies the executable’s code and data from disk into memory, then the PC jumps to the program’s first instruction or entry point.
Atomic Operations and Atomic Types
std::atomic
The atomic library defines the atomic type std::atomic<>, for example you can use std::atomic<int>.
std::atomic<> provides various atomic operations via member functions, such as std::atomic<>::load(), std::atomic<>::store(), std::atomic<>::fetch_add(), std::atomic<>::fetch_and(), etc.
CAS Operation
Implemented via std::atomic<>::compare_exchange_weak/strong(expected, desired). If the atomic value equals expected, it is set to desired and the call returns true; otherwise expected is updated with the atomic's current value and the call returns false (see the sketch after this list).
- weak is cheaper but may fail spuriously, i.e., return false even when the value equals expected, so it is normally used in a loop.
- strong never fails spuriously but can be more expensive.
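A minimal sketch of how a CAS loop is typically used, here a simple spinlock built on std::atomic<bool> (illustrative only, not the standard library's implementation):
#include <atomic>
class SpinLock {
public:
    void lock() {
        bool expected = false;
        // CAS loop: if the flag equals expected (false), set it to true and return true;
        // on failure, expected is overwritten with the current value, so reset it and retry.
        while (!flag.compare_exchange_weak(expected, true,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed)) {
            expected = false;
        }
    }
    void unlock() { flag.store(false, std::memory_order_release); }
private:
    std::atomic<bool> flag{false};
};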
C++ Memory Order
- seq_cst requires all atomic operations to appear to execute in program order; the strictest memory order.
- acquire ensures that all prior writes performed with release are visible to the current thread, and no statement after this one may be executed before this statement completes.
- release ensures that all writes before this statement have been performed and are visible to other threads.
- acq_rel ensures that before any read after this statement executes, all writes before this statement have been performed and are visible to other threads; and before any write after this statement executes, all reads before this statement have been performed.
- relaxed imposes no ordering constraints, the weakest memory order.
- consume: like acquire, but only orders reads that carry a dependency on the loaded value; its use is discouraged in practice (see the memory-order section later in these notes).
pthread (POSIX thread)
Mutexes and Spin Locks
When a thread wants to acquire a lock using a mutex, if the lock is currently held by another thread, the requesting thread will be blocked, a context switch will occur, and the thread will be placed in a waiting queue. The CPU can then be scheduled to do other work, and once the lock becomes available, the thread will resume execution.
When a thread wants to acquire a lock using a spin lock, if the lock cannot be obtained immediately, it will busy‑wait continuously until the lock is acquired.
In practice, mutexes are used when the expected wait time to acquire the lock is relatively long, while spin locks are used when the expected wait time is short.
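A small sketch contrasting the two: std::mutex blocks (the waiting thread is descheduled), while the atomic_flag-based spinlock below busy-waits (illustrative only):
#include <atomic>
#include <mutex>
std::mutex mtx;
std::atomic_flag spin = ATOMIC_FLAG_INIT;
void with_mutex(int& shared) {
    std::lock_guard<std::mutex> lk(mtx);   // sleeps if the lock is held by another thread
    ++shared;
}
void with_spinlock(int& shared) {
    while (spin.test_and_set(std::memory_order_acquire))
        ;                                  // spin until the flag is cleared
    ++shared;
    spin.clear(std::memory_order_release);
}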
Condition Variables
std::condition_variable
Functions used:
- pthread_cond_init / pthread_cond_destroy: similar to the init/destroy calls for mutexes and spin locks.
- pthread_cond_wait
  Function definition:
  int pthread_cond_wait(pthread_cond_t *restrict cond, pthread_mutex_t *restrict mutex);
  cond: pointer to the condition variable.
  mutex: pointer to the mutex.
  Before calling this function, the mutex must be locked. During the wait, the mutex is automatically released, and the calling thread is blocked until the condition variable is signaled and the mutex has been re-acquired.
- pthread_cond_signal / pthread_cond_broadcast
  Used to wake up threads waiting on the condition variable. signal wakes up one waiting thread (if multiple threads are waiting, only one is awakened); broadcast wakes up all waiting threads.
  If no threads are waiting on the condition variable, the call has no effect. (A minimal usage sketch follows.)
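A minimal sketch of the wait/signal pattern described above, assuming a predicate ready protected by the mutex (the names are illustrative):
#include <pthread.h>
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int ready = 0;                            // the condition (predicate) protected by mtx
void* waiter(void*) {
    pthread_mutex_lock(&mtx);
    while (!ready)                        // re-check the predicate: wakeups can be spurious
        pthread_cond_wait(&cond, &mtx);   // releases mtx while waiting, re-acquires it before returning
    pthread_mutex_unlock(&mtx);
    return nullptr;
}
void* notifier(void*) {
    pthread_mutex_lock(&mtx);
    ready = 1;
    pthread_cond_signal(&cond);           // wake one waiter (pthread_cond_broadcast would wake all)
    pthread_mutex_unlock(&mtx);
    return nullptr;
}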
Blocking Circular Queue
Copy some code from the internet:
#include <unistd.h>
#include <cstdlib>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>
static const int kItemRepositorySize = 10; // Item buffer size.
static const int kItemsToProduce = 1000; // How many items we plan to produce.
struct ItemRepository {
int item_buffer[kItemRepositorySize]; // product buffer, used with read_position and write_position to model a circular queue.
size_t read_position; // position where consumer reads products.
size_t write_position; // position where producer writes products.
std::mutex mtx; // mutex protecting the product buffer
std::condition_variable repo_not_full; // condition variable indicating the product buffer is not full.
std::condition_variable repo_not_empty; // condition variable indicating the product buffer is not empty.
} gItemRepository; // global product repository variable, accessed by producer and consumer.
typedef struct ItemRepository ItemRepository;
void ProduceItem(ItemRepository *ir, int item)
{
std::unique_lock<std::mutex> lock(ir->mtx);
while(((ir->write_position + 1) % kItemRepositorySize)
== ir->read_position) { // item buffer is full, just wait here.
std::cout << "Producer is waiting for an empty slot...\n";
(ir->repo_not_full).wait(lock); // Producer waits for the condition that the product buffer is not full.
}
(ir->item_buffer)[ir->write_position] = item; // write product.
(ir->write_position)++; // advance write position.
if (ir->write_position == kItemRepositorySize) // If the write position reaches the end of the queue, reset to the initial position.
ir->write_position = 0;
(ir->repo_not_empty).notify_all(); // Notify consumers that the product buffer is not empty.
lock.unlock(); // unlock.
}
int ConsumeItem(ItemRepository *ir)
{
int data;
std::unique_lock<std::mutex> lock(ir->mtx);
// item buffer is empty, just wait here.
while(ir->write_position == ir->read_position) {
std::cout << "Consumer is waiting for items...\n";
(ir->repo_not_empty).wait(lock); // Consumer waits for the condition that the product buffer is not empty.
}
data = (ir->item_buffer)[ir->read_position]; // read a product
(ir->read_position)++; // advance read position
if (ir->read_position >= kItemRepositorySize) // If the read position moves to the end, reset it.
ir->read_position = 0;
(ir->repo_not_full).notify_all(); // Notify producers that the product buffer is not full.
lock.unlock(); // unlock.
return data; // return product.
}
void ProducerTask() // Producer task
{
for (int i = 1; i <= kItemsToProduce; ++i) {
// sleep(1);
std::cout << "Produce the " << i << "^th item..." << std::endl;
ProduceItem(&gItemRepository, i); // Loop to produce kItemsToProduce products.
}
}
void ConsumerTask() // Consumer task
{
static int cnt = 0;
while(1) {
sleep(1);
int item = ConsumeItem(&gItemRepository); // consume one product.
std::cout << "Consume the " << item << "^th item" << std::endl;
if (++cnt == kItemsToProduce) break; // If the number of consumed products equals kItemsToProduce, then exit.
}
}
void InitItemRepository(ItemRepository *ir)
{
ir->write_position = 0; // initialize product write position.
ir->read_position = 0; // initialize product read position.
}
int main()
{
InitItemRepository(&gItemRepository);
std::thread producer(ProducerTask); // create producer thread.
std::thread consumer(ConsumerTask); // create consumer thread.
producer.join(); // wait for the threads; without join, std::thread's destructor calls std::terminate
consumer.join();
}
The above code implements a single-producer single-consumer model; a multi-producer multi-consumer model only needs two extra counters to track how many items have been produced/consumed so far (one possible shape of that change is sketched below).
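One possible shape of that change, reusing the definitions from the code above (my own extension of the pasted example, with hypothetical counter names; not the original author's code):
#include <mutex>
static std::mutex produced_mtx, consumed_mtx;
static int produced_count = 0, consumed_count = 0;
void MultiProducerTask() {
    while (true) {
        int item;
        {
            std::lock_guard<std::mutex> lk(produced_mtx);
            if (produced_count >= kItemsToProduce) break;  // all items already claimed by some producer
            item = ++produced_count;                       // claim the next item number
        }
        ProduceItem(&gItemRepository, item);
    }
}
void MultiConsumerTask() {
    while (true) {
        {
            std::lock_guard<std::mutex> lk(consumed_mtx);
            if (consumed_count >= kItemsToProduce) break;  // everything has been or will be consumed
            ++consumed_count;
        }
        ConsumeItem(&gItemRepository);
    }
}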
Lock-Free Circular Queue
The kernel implements a lock-free circular queue called kfifo.
Paste the code:
/**
* __kfifo_put - puts some data into the FIFO, no locking version
* @fifo: the fifo to be used
* @buf: the data source
* @len: the length of the data
*
* Returns the number of bytes copied.
*/
static unsigned int __kfifo_put(struct __kfifo *fifo,
const unsigned char *buf, unsigned int len)
{
unsigned int l;
// free space = queue size - (write position - read position); never copy more than that
len = min(len, fifo->size - fifo->in + fifo->out);
/*
* Ensure that we sample the fifo->out index -before-
* we start putting bytes into the kfifo.
* memory barrier (full barrier)
*/
smp_mb();
/* first put the data starting from fifo->in to buffer end */
/* write from the write position (in mod queue size) up to the end of the buffer */
l = min(len, fifo->size - (fifo->in & (fifo->size - 1)));
memcpy(fifo->buffer + (fifo->in & (fifo->size - 1)), buf, l);
/* then put the rest (if any) at the beginning of the buffer */
/* wrap around: write the remaining data at the start of the buffer */
memcpy(fifo->buffer, buf + l, len - l);
/*
* Ensure that we add the bytes to the kfifo -before-
* we update the fifo->in index.
* memory barrier (write barrier)
*/
smp_wmb();
// monotonically increasing write position (the queue size must be a power of two)
fifo->in += len;
return len;
}
/**
* __kfifo_get - gets some data from the FIFO, no locking version
* @fifo: the fifo to be used
* @buf: where to store data
* @len: the length of the data
*
* Returns the number of bytes copied.
*/
static unsigned int __kfifo_get(struct __kfifo *fifo,
unsigned char *buf, unsigned int len)
{
unsigned int l;
// used space = write position - read position; never copy more than that
len = min(len, fifo->in - fifo->out);
/*
* Ensure that we sample the fifo->in index -before-
* we start removing bytes from the kfifo.
* memory barrier (read barrier)
*/
smp_rmb();
/* first get the data from fifo->out until the end of the buffer */
/* read from the read position (out mod queue size) up to the end of the buffer */
l = min(len, fifo->size - (fifo->out & (fifo->size - 1)));
memcpy(buf, fifo->buffer + (fifo->out & (fifo->size - 1)), l);
/* then get the rest (if any) from the beginning of the buffer */
/* wrap around: read the remaining data from the start of the buffer */
memcpy(buf + l, fifo->buffer, len - l);
/*
* Ensure that we remove the bytes from the kfifo -before-
* we update the fifo->out index.
* memory barrier (full barrier)
*/
smp_mb();
// monotonically increasing read position
fifo->out += len;
return len;
}
shared_ptr
- shared_ptr itself is not a thread-safe STL type, so concurrent reads and writes to the underlying memory region are unsafe.
- Assignment involves releasing the original memory, changing the pointer target, and other modifications; the process is not atomic, so concurrent assignment to the same shared_ptr is not thread-safe.
- Concurrent copying of a shared_ptr only reads the data pointer and the control-block pointer and then increments the reference count, which is an atomic operation; hence concurrent copying is thread-safe.
A simplified hand-written reference-counted pointer for illustration:
#include <iostream>
#include <atomic>
class Counter {
public:
Counter() : count(0) {}
void add() { count++; } // increment
void sub() { count--; } // decrement
int get() const { return count; } // get current count
private:
std::atomic<int> count;
};
template <class T>
class Sp {
public:
Sp(); // default constructor
Sp(T *ptr); // parameterized constructor
Sp(const Sp &obj); // copy constructor
~Sp(); // destructor
Sp &operator=(const Sp &obj); // overload =
T *get(); // get the class pointed to by the shared pointer
int getcount(); // get the reference counter
private:
T *my_ptr; // object pointed to by the shared pointer
Counter* counter; // reference counter
void clear(); // cleanup function
};
template <class T>
Sp<T>::Sp()
{
// default constructor, parameter is null, constructs a reference counter
my_ptr = nullptr;
counter = new Counter;
counter->add();
}
template <class T>
Sp<T>::Sp(const Sp &obj)
{
// copy constructor, new shared pointer points to the object pointed by the old shared pointer
// make the pointed object also become the target's pointed object
my_ptr = obj.my_ptr;
// obtain the reference counter so that both shared pointers use the same counter
counter = obj.counter;
// increment this object's reference counter by 1
counter->add();
}
template <class T>
Sp<T> &Sp<T>::operator=(const Sp &obj)
{
// guard against self-assignment: releasing our own resources first would leave obj dangling
if (this != &obj) {
// clean up the currently referenced object and reference counter
clear();
// point to the new object and obtain its reference counter
my_ptr = obj.my_ptr;
counter = obj.counter;
// increment reference counter by 1
counter->add();
}
// return self
return *this;
}
template <class T>
Sp<T>::Sp(T *ptr)
{
// create a shared pointer pointing to the target class, construct a new reference counter
my_ptr = ptr;
counter = new Counter;
counter->add();
}
template <class T>
Sp<T>::~Sp()
{
// destructor, calls cleanup function when going out of scope
clear();
}
template <class T>
void Sp<T>::clear()
{
// cleanup function, decrements the reference counter; if it becomes 0, cleans up the pointed object's memory
// decrement reference counter by 1
counter->sub();
// if reference counter becomes 0, clean up object
if (0 == counter->get()) {
if (my_ptr) {
delete my_ptr;
}
delete counter;
}
}
// number of shared pointers referencing the current object
template <class T>
int Sp<T>::getcount()
{
return counter->get();
}
template <class T>
T *Sp<T>::get()
{
return my_ptr;
}
LRU Replacement Algorithm Implementation
A LeetCode problem that requires O(1) lookup and O(1) eviction of the least recently used entry.
It can be implemented with a hash map plus a doubly linked list; STL's std::list (with splice) is quite convenient.
class LRUCache {
public:
LRUCache(int capacity) {
this->capacity = capacity;
}
int get(int key) {
auto pos = _lruHash.find(key);
if (pos != _lruHash.end()) {
//Cache hit, adjust counter and update the position of the node corresponding to the key
list<pair<int, int>>::iterator ptr = pos->second;
//Transfer node using std::list::splice
//void splice (iterator position, list& x, iterator i);
//Transfers elements from x into the container, inserting them at position.
//Transfer the i‑th node of x to the position
_lruCacheList.splice(_lruCacheList.begin(), _lruCacheList, ptr);
return pos->second->second;
}
return -1;
}
void put(int key, int value) {
auto pos = _lruHash.find(key);
if (pos != _lruHash.end()) {
//Update cache, update the position of the corresponding node
list<pair<int, int>>::iterator ptr = pos->second;
ptr->second = value; //update data
_lruCacheList.splice(_lruCacheList.begin(), _lruCacheList, ptr);
} else {
//Add new cache entry
if (capacity == _lruHash.size()) {
//Cache is full, replace by deleting the data at the tail
pair<int, int>& back = _lruCacheList.back();
_lruHash.erase(back.first);
_lruCacheList.pop_back();
}
_lruCacheList.push_front(make_pair(key, value));
_lruHash[key] = _lruCacheList.begin();
}
}
private:
unordered_map<int, list<pair<int, int>>::iterator> _lruHash; //Convenient for lookup
list<pair<int, int>> _lruCacheList; //Store data
size_t capacity;
};
/**
* Your LRUCache object will be instantiated and called as such:
* LRUCache* obj = new LRUCache(capacity);
* int param_1 = obj->get(key);
* obj->put(key, value);
*/
LRU replacement for page tables is similar.
Virtual Memory and Memory Allocation?
Excerpt from this article: 虚拟内存:解决物理内存问题的策略-CSDN博客
Physical Address
- Memory utilization issues
The way each process uses memory can cause fragmentation. When malloc tries to allocate a very large block, there may be enough free physical memory overall but no sufficiently large contiguous free region, so scattered fragments of memory are wasted.
- Read/write safety issues
Physical memory itself has no access restrictions; any address can be read or written. Modern operating systems need to enforce different access permissions for different pages, such as read-only data.
- Inter-process safety issues
Processes do not have independent address spaces. A process that executes erroneous instructions or malicious code can directly modify another process's data, or even kernel address-space data, which the OS must prevent.
- Memory read/write efficiency issues
When many processes run simultaneously and the total memory required exceeds the available physical memory, some programs must be temporarily swapped to disk and new programs loaded into RAM. Frequent swapping of large amounts of data makes memory usage very inefficient.
Virtual Memory
Memory Segmentation
Virtual memory is divided into segments such as code, data, stack, and heap.
A. How virtual addresses map to physical memory under a segmentation scheme
Segmentation uses a segment selector and an offset within the segment.
Segment selector: stored in a segment register; it contains the segment number, flags, etc. The segment number indexes a segment table, which holds the base address, limit, and privilege level of each segment.
Offset: the physical address is generally segment base address + offset.
B. Problems
Fragmentation: Segmentation allocates on demand, so internal fragmentation does not occur, but external fragmentation can appear—multiple scattered free regions that are hard to reuse. One way to handle external fragmentation is swapping: each virtual segment is swapped to secondary storage (disk) one by one, which is inefficient. If a very large program is swapped, the whole system can become sluggish.
Memory Paging
Paging cuts the entire virtual and physical address spaces into fixed‑size blocks called pages. In Linux, each page is 4 KB.
Paging creates page tables and virtual page numbers.
Generally, the physical address is obtained by taking the virtual page number and the offset within the page, looking up the corresponding page table entry (the page table resides in memory) to get the physical page number, then adding the offset:
physical address = physical page number × page size + page offset; this translation is mainly performed by the MMU (Memory Management Unit).
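A small illustrative calculation of that translation for 4 KB pages (page_table here is a placeholder for the in-memory single-level page table):
#include <cstdint>
constexpr uint32_t PAGE_SHIFT = 12;                  // 4 KB page = 2^12 bytes
constexpr uint32_t PAGE_MASK  = (1u << PAGE_SHIFT) - 1;
uint32_t translate(uint32_t vaddr, const uint32_t* page_table) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;           // virtual page number (upper 20 bits)
    uint32_t offset = vaddr & PAGE_MASK;             // offset within the page (lower 12 bits)
    uint32_t ppn    = page_table[vpn];               // the page table entry gives the physical page number
    return (ppn << PAGE_SHIFT) | offset;             // physical page number x page size + offset
}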
A. Problems
Fragmentation: Paging does not allocate on demand; it allocates whole pages (4 KB each). This leads to internal fragmentation because a page is often not completely filled.
Paging improves swapping efficiency
During swapping, only a few pages are moved between RAM and disk, rather than whole segments. This page‑level swapping reduces the overhead compared with segment‑level swapping.
Page tables consume memory (space overhead)
Because many processes can run concurrently, each needs its own page tables. In a 32‑bit environment, the virtual address space is 4 GB. With a 4 KB page size (2¹²), about 1 M (2²⁰) pages are needed. Each page‑table entry is 4 bytes, so mapping the entire 4 GB requires roughly 4 MB of memory for the page tables. Since each process has its own tables, memory consumption grows with the number of processes.
How to reduce page‑table memory usage
Use multi‑level page tables. Due to locality, most programs use far less than 4 GB, leaving many upper‑level entries unused. If a first‑level entry is not needed, its corresponding second‑level table is not created, saving a large amount of memory.
Benefits of Virtual Memory
- Extends usable memory: processes can use more memory than is physically available. Because program execution follows the principle of locality, the CPU tends to access the same memory repeatedly, so infrequently used pages can be swapped out to disk (e.g., the swap area).
- Isolation between processes: each process has its own page table, so its virtual address space is independent. Processes cannot access other processes' page tables, preventing address conflicts.
- Improved security: page-table entries contain not only the physical address but also bits for attributes such as read/write permissions and presence flags, allowing the OS to enforce stronger memory-access protection.
- Eliminates external physical fragmentation: virtual memory presents a contiguous address space, solving the low utilization caused by fragmented physical memory.
Memory Allocation
A user-space malloc call allocates heap memory; under the hood this goes through the brk() and mmap() system calls. Small allocations (below 128 KB by default) use brk, while large allocations use mmap.
brk
When small blocks are freed, they are not returned to the system immediately but kept in a cache that can be reused via brk. This reduces the frequency of page‑fault interrupts and improves memory‑access efficiency, but it can lead to fragmentation.
mmap
Large blocks are returned to the system directly when freed. Each mmap allocation triggers page faults when its pages are first touched, so frequent large allocations cause many page faults and degrade performance (a rough sketch of both calls follows).
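A rough illustration of the two underlying calls (using sbrk() and an anonymous mmap() directly; a real allocator such as glibc malloc does this internally, and the 128 KB threshold is tunable):
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
int main() {
    // small allocation: grow the program break, as malloc does via brk/sbrk for small blocks
    void* small = sbrk(4096);
    // large allocation: map anonymous pages, as malloc does via mmap for large blocks
    std::size_t big_len = 1 << 20;   // 1 MB, well above the default threshold
    void* big = mmap(nullptr, big_len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    munmap(big, big_len);            // mmap'd memory is returned to the OS immediately on free
    (void)small;                     // memory from sbrk stays with the process and is reused by the allocator
}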
I asked GPT to do it for me, and it’s the same way of writing.
Which one? The LRU one? I just went online, took a quick look and pasted a copy.
Is it possible that CRTP and virtual functions are not on the same level? Virtual functions correspond to dynamic dispatch.
Is there any problem with what I wrote?
Comparing the advantages and disadvantages of CRTP and virtual functions was the original question asked during a Jump interview.
Good catch, thanks for pointing that out.
Oh, when comparing, it should be limited to scenarios where both CRTP and virtual functions can be used, but we can indeed note that CRTP cannot perform dynamic dispatch.
The strengths of each side are written as the other’s weaknesses; just view it from the opposite perspective.
What are the three main forms of RTTI support in C++?
(It feels like this written‑test question is poorly written)
Common: typeid + dynamic_cast
There also seems to be a std::bad_cast
Memory Barriers and Compiler Barriers (C++)
Memory Barriers and Memory Order
Memory barriers are used to prevent the processor from reordering memory operations; they are tools at the hardware or compiler level.
Memory order refers to how multiple threads in a multithreaded program observe each other’s read and write operations; it is a high‑level abstraction provided by C++. It implicitly inserts memory barriers by specifying different memory order models. Programmers do not need to manipulate memory barriers directly, but instead control the order of memory accesses via atomic operations and memory order models.
Both aim to control memory operations, preventing unexpected errors caused by instruction reordering and cache coherence issues.
Memory Order
The discussion concerns how the execution order of instructions within a single thread affects multiple threads. Regardless of the memory order attached, an individual operation only affects its own thread, but several threads can establish indirect ordering relationships through variable synchronization, which are then influenced by their respective memory orders.
Memory Order Tags
The C++ standard library defines six memory orders, ranging from relaxed to strict:
- memory_order_relaxed: relaxed operation; only guarantees the atomicity of the atomic operation itself.
- memory_order_consume: consume barrier; only affects the currently loaded atomic variable and the values that depend on it; otherwise the same as memory_order_acquire. Its use has been discouraged since C++17.
- memory_order_acquire: acquire barrier; no reads or writes in the current thread may be reordered before this operation, and writes to the same atomic variable made before a release in other threads become visible to the current thread. In other words, it applies an "acquire" to loads of the relevant memory location, guaranteeing the latest value is read.
- memory_order_release: release barrier; no reads or writes in the current thread may be reordered after this operation, and the writes before it become visible to other threads that later perform an acquire on the same atomic variable. It applies a "release" to stores of the relevant memory location, ensuring the written values are published.
- memory_order_acq_rel: acquire-release barrier; reads and writes in the current thread cannot be reordered across this operation in either direction.
- memory_order_seq_cst: sequentially consistent barrier; all threads observe these operations in a single, consistent order.
Among them:
- The default memory order tag for all atomic operations is std::memory_order_seq_cst.
- Of the two acquire-style orders, memory_order_consume only affects the loaded variable and the values that depend on it, while memory_order_acquire affects all reads and writes around the operation.
- Of the two strongest orders, memory_order_acq_rel only synchronizes threads through the same atomic variable, while memory_order_seq_cst additionally imposes a single global order observed by all threads.
Cooperative Combinations
Common cooperative combinations:
- Relaxed order: memory_order_relaxed
Only guarantees atomicity, allowing all instruction reordering.
- Release-Acquire sequence: memory_order_release + memory_order_acquire
Writes before the release are guaranteed to be visible; loads with acquire see the latest values.
- Release-Consume sequence: memory_order_release + memory_order_consume
Slightly more relaxed than release-acquire; only synchronizes the atomic variable itself and the values that depend on it.
- Sequentially consistent order: memory_order_seq_cst
Disallows all reordering; operations appear in one order to all threads.
Example:
First define two variables
std::atomic<std::string *> ptr;
int data;
Release‑Acquire sequence
void producer() {
auto *p = new std::string("Hello");
data = 42;
// The compiler may not move the `data = 42` instruction after the store operation
// `memory_order_release` releases the result of the modification (write operation); other threads using `memory_order_acquire` can observe the memory write. This ensures the producer's modification is definitely written.
ptr.store(p, std::memory_order_release);
}
void consumer() {
std::string *p2;
// `memory_order_acquire` reads the latest value; values written before a `memory_order_release` in other threads are guaranteed to be readable. This ensures the consumer reads the value written by the producer.
while (!(p2 = ptr.load(std::memory_order_acquire)))
;
// Reaching this point indicates that `p2` is non‑null, i.e., the `data = 42` instruction has been executed (guaranteed by `memory_order_release`).
// Moreover, `data` must be 42 and `p2` must be "Hello" (guaranteed by `memory_order_acquire`).
assert(*p2 == "Hello");
assert(data == 42);
// No error will occur
}
Release‑Consume sequence
void producer() {
std::string* p = new std::string("Hello");
data = 42;
ptr.store(p, std::memory_order_release);
}
void consumer() {
std::string* p2;
while (!(p2 = ptr.load(std::memory_order_consume)))
;
assert(*p2 == "Hello"); // No error: *p2 carries a dependency from ptr
assert(data == 42); // May error: data does not carry a dependency from ptr
}
Memory Barriers
The C++ standard library provides the following interface
std::atomic_thread_fence: an explicit memory barrier. Used together with memory order tags, it divides threads into stages, where each stage must wait for other threads to complete specific operations before proceeding. It applies to all data, not just atomic variables.
The default memory order tag for all atomic operations is std::memory_order_seq_cst.
Example:
std::atomic<bool> x{false}, y{false};
void write_x_then_y() {
x.store(true, std::memory_order_relaxed); // 1
std::atomic_thread_fence(std::memory_order_release); // 2.a write barrier
y.store(true, std::memory_order_relaxed); // 2.b
}
void read_y_then_x() {
while(!y.load(std::memory_order_relaxed)); // 3.a
std::atomic_thread_fence(std::memory_order_acquire); // 3.b read barrier
assert(x.load(std::memory_order_relaxed)); // 4: never fails, because 1 is guaranteed to happen before 4
}
It can be considered that:
- 2.a write barrier combined with the write operation 2.b elevates 2.b to a release‑store
- 3.b read barrier combined with the read operation 3.a elevates 3.a to an acquire‑load
- The synchronization relationship is similar to the earlier memory order example, ensuring that 1 always occurs before 4
Memory Barriers and Compiler Barriers
Compiler barriers only affect the compiler. They prevent the compiler from reordering instructions, ensuring that the generated assembly preserves the order specified in the source code, but they do not emit actual hardware instructions, so they do not affect CPU or memory access order. They cannot handle cache coherence or memory visibility issues arising from thread concurrency and are typically used on single‑core processors.
Memory barriers affect both the compiler and the CPU. They prevent the compiler from reordering operations and also emit hardware instructions that affect the CPU’s memory access order. They ensure data synchronization between threads and are commonly used on multi‑core processors.
Compiler Barriers
- std::atomic_signal_fence: a compiler barrier that applies to all data. Used to prevent the compiler from reordering instructions in signal handlers or when interacting with hardware.
- The volatile keyword:
  - For a volatile variable, every operation on it must actually read from or write to memory, not just a cached register value.
  - For multiple volatile variables, the access order between them will not be changed by the compiler.
A compiler barrier essentially tells the compiler that the values of these variables may change outside the program’s control and should not be assumed constant. It is commonly used in signal handling, hardware interaction, and embedded system interrupt routines.
// thread1
volatile bool tag = true;
volatile int x = 0;
while(tag);
assert(x == 10);
// thread2
x = 10;
tag = false;
On a single-core processor these two threads exhibit "pseudo-concurrency"; volatile has two effects here:
- Each iteration reads the latest value of tag from memory, preventing thread1 from looping forever.
- The read/write order of x and tag matches the source code, i.e., x = 10 definitely occurs before tag = false.
However, on a multi‑core processor with “true concurrency”, even though volatile forces the compiler to generate code that reads/writes from main memory, the actual execution order is still determined by the CPU. In this case, memory barriers are needed to enforce the required logic, ensuring memory operations occur in the specified order.
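For reference, this is what a pure compiler barrier looks like in practice; the GCC/Clang inline-asm idiom and std::atomic_signal_fence below only constrain the compiler and emit no hardware fence instruction (a sketch with illustrative names, not tied to any particular codebase):
#include <atomic>
#define compiler_barrier() asm volatile("" ::: "memory")    // GCC/Clang: clobbers "memory", emits no CPU instruction
volatile int x2 = 0;
volatile bool tag2 = true;
void thread2_body() {
    x2 = 10;
    compiler_barrier();                                      // keep the store to x2 before the store to tag2 in the generated code
    std::atomic_signal_fence(std::memory_order_release);     // the standard C++ way to express the same compiler-only ordering
    tag2 = false;
}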