Can You Control Relative Addresses to Keep Related Data on the Same Page?
Short answer: Yes, but indirectly.
While you can’t directly control where virtual addresses are assigned (the OS and memory allocator handle that), you can influence memory layout to maximize the chance that related data lands on the same page—just like cache-aware programming optimizes for cache lines. Here’s how:
1. How to Keep Related Data on the Same Page
A. Allocate Contiguous Memory Blocks
- Use arrays or custom allocators instead of scattered `malloc()` calls.
- Example:

```cpp
// Good: Allocates 1024 ints contiguously (likely on the same page or a few pages)
int* buffer = new int[1024];

// Bad: Fragmented allocations (could span many pages)
int* ptr1 = new int;
int* ptr2 = new int; // Unrelated addresses
```
B. Force Alignment to Page Boundaries
- Align large structures or buffers to page size (4KB/2MB).
- Example:
```cpp
// Allocate 8KB aligned to a 4KB page boundary
alignas(4096) char buffer[8192]; // Guaranteed to occupy exactly 2 full pages
```
C. Use Memory Pools
- Pre-allocate a pool of objects in a contiguous region.
- Example:
```cpp
#include <cstdlib>

struct Order { int price; int volume; };

// Reserve 1000 Orders in one chunk (likely on 1-2 pages).
// Note: aligned_alloc requires size to be a multiple of the alignment,
// so round 1000 * sizeof(Order) = 8000 bytes up to 8192.
Order* pool = (Order*)aligned_alloc(4096, 8192);
```
D. Leverage Huge Pages (2MB/1GB)
- Larger pages = higher chance related data stays together.
- Example (Linux):
```cpp
#include <sys/mman.h>

void* buf = mmap(NULL, 2 * 1024 * 1024,
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                 -1, 0);
// One 2MB page holds 32,768 cache lines (vs. 64 in a 4KB page)
```
2. Why This Works
- Virtual memory allocators (like `malloc`) tend to assign contiguous virtual addresses to contiguous allocations.
- By bundling related data (e.g., an order book’s price levels), you increase the odds they share a page.
- Page alignment ensures no straddling (e.g., a struct split across two pages).
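To see this behavior, here is a minimal sketch (assuming 4KB pages; exact allocator behavior varies) that prints the virtual page index of each allocation:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    int* arr = new int[1024]; // one contiguous block (4096 bytes)
    int* a = new int;         // two scattered allocations
    int* b = new int;

    const uintptr_t kPageSize = 4096; // assumed page size
    printf("arr[0]    page: %lu\n", (unsigned long)(uintptr_t(&arr[0]) / kPageSize));
    printf("arr[1023] page: %lu\n", (unsigned long)(uintptr_t(&arr[1023]) / kPageSize));
    printf("a         page: %lu\n", (unsigned long)(uintptr_t(a) / kPageSize));
    printf("b         page: %lu\n", (unsigned long)(uintptr_t(b) / kPageSize));
}
```

Typically the array spans one or two consecutive pages, while the two scalar allocations may land elsewhere.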
3. Edge Cases to Watch For
| Scenario | Risk | Fix |
|---|---|---|
| Heap Fragmentation | Repeated new/delete scatters objects. | Use memory pools. |
| Compiler Padding | Structs may have gaps between fields. | #pragma pack(1) or manual padding. |
| Multi-threaded Allocators | Thread-local allocators may use different regions. | Use a central pool. |
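For the compiler-padding row, a sketch of how field ordering alone changes struct size (reordering shown here as an alternative to `#pragma pack(1)`; exact padding is ABI-dependent, sizes below assume a typical 64-bit ABI):

```cpp
#include <cstdint>

struct Padded {        // small fields interleaved with large ones force gaps
    char     flag;     // 1 byte + 7 bytes padding
    uint64_t id;       // 8 bytes
    char     side;     // 1 byte + 7 bytes tail padding
};                     // sizeof(Padded) == 24

struct Compact {       // same fields, largest first
    uint64_t id;       // 8 bytes
    char     flag;     // 1 byte
    char     side;     // 1 byte + 6 bytes tail padding
};                     // sizeof(Compact) == 16

static_assert(sizeof(Compact) < sizeof(Padded), "reordering removed padding");
```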
4. HFT-Specific Tricks
- Prefault Pages
- Touch all pages after allocation to ensure they’re in RAM:
```cpp
memset(buffer, 0, size); // Forces physical page allocation
mlock(buffer, size);     // Locks pages in RAM (no swapping)
```
- NUMA Binding
- Ensure pages are allocated near the executing CPU core:
```cpp
numa_run_on_node(0);                    // Pin thread to NUMA node 0
void* buf = numa_alloc_onnode(size, 0); // Allocate memory on node 0
```
- Custom Allocators
- Override `new`/`malloc` to group related objects (e.g., all orders for Symbol X); see the sketch below.
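A minimal sketch of that idea, assuming 4KB pages (the `SymbolPool` name and slab size are illustrative, not a production allocator):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

struct Order { int price; int volume; };

// Per-symbol bump allocator: every Order for one symbol comes from the
// same page-aligned slab, so related orders tend to share pages.
class SymbolPool {
    static constexpr size_t kSlabSize = 4 * 4096; // 4 pages per symbol
    char*  slab_;
    size_t offset_ = 0;
public:
    SymbolPool() : slab_((char*)std::aligned_alloc(4096, kSlabSize)) {}
    ~SymbolPool() { std::free(slab_); }

    Order* allocate() {
        if (offset_ + sizeof(Order) > kSlabSize) return nullptr; // slab exhausted
        Order* p = new (slab_ + offset_) Order{};
        offset_ += sizeof(Order);
        return p;
    }
};
```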
5. Verification: How to Check Page Locality
- Linux: Use `pmap -X <PID>` to see memory mappings.
- C++: Log pointer addresses and compute page boundaries:
```cpp
uintptr_t page_mask = ~(4095ULL); // 4KB page mask
bool same_page = (uintptr_t(&obj1) & page_mask) == (uintptr_t(&obj2) & page_mask);
```
Key Takeaway
You can’t directly control virtual addresses, but by:
- Allocating contiguously (arrays/pools).
- Aligning to page boundaries.
- Using huge pages.
you maximize the odds that related data shares a page—just like cache-aware programming optimizes for cache lines. In HFT, this reduces TLB misses and keeps latency predictable.
Want a deep dive into measuring TLB misses? Try:
```bash
perf stat -e dtlb_load_misses.miss_causes_a_walk ./your_program
```
I should clarify that I don't have complete and specific information about all page-aware programming techniques used in modern High-Frequency Trading (HFT) systems, as many of these techniques are proprietary and continuously evolving. However, I can share several common page-aware optimizations that are known to be used in performance-critical HFT systems:
- **Page-aligned memory allocation**
  - Explicitly aligning data structures to page boundaries (typically 4KB)
  - Using `posix_memalign()`, `aligned_alloc()`, or custom allocators
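  For instance, a minimal sketch using `posix_memalign()` (the 16KB size is illustrative):

  ```cpp
  #include <cstdlib>

  void* buf = nullptr;
  // Request 16KB starting exactly on a 4KB page boundary; returns 0 on success.
  if (posix_memalign(&buf, 4096, 16384) != 0) {
      // handle allocation failure
  }
  ```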
- **NUMA (Non-Uniform Memory Access) awareness**
  - Pinning threads to specific CPU cores
  - Allocating memory from the same NUMA node as the CPU running the code
  - Using `numactl` or the `libnuma` API for explicit control
- **Huge pages utilization**
  - Using 2MB or 1GB pages instead of standard 4KB pages
  - Reducing TLB misses and page table overhead
  - Configuring with `madvise()` or `/proc/sys/vm/hugetlb_*` settings
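  As a sketch, requesting transparent huge pages for an existing mapping via `madvise()` (the 64MB region size is illustrative; requires THP enabled):

  ```cpp
  #include <cstddef>
  #include <sys/mman.h>

  const size_t len = 64 * 1024 * 1024; // 64MB region
  void* buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  madvise(buf, len, MADV_HUGEPAGE); // hint: back this range with 2MB pages
  ```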
- **Page coloring**
  - Organizing data structures to avoid cache conflicts
  - Ensuring hot data is on different cache lines
- **Cache line padding**
  - Adding padding to data structures to prevent false sharing
  - Aligning critical data to cache line boundaries (typically 64 bytes)
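  A minimal sketch, assuming 64-byte cache lines: giving each heavily written counter its own line prevents false sharing between threads.

  ```cpp
  #include <atomic>

  struct alignas(64) PaddedCounter {
      std::atomic<long> value{0};
      // alignas(64) rounds the struct up to a full cache line,
      // so adjacent PaddedCounters never share a line.
  };

  PaddedCounter producer_count; // written by one thread
  PaddedCounter consumer_count; // written by another thread
  ```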
- **Memory prefetching**
  - Strategic data layout to enable hardware prefetching
  - Software prefetch instructions for predictable access patterns
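  For example, a sketch using GCC/Clang's `__builtin_prefetch` in a linear scan (the prefetch distance of 16 elements is a tunable guess, not a measured value):

  ```cpp
  #include <cstddef>

  long sum(const int* data, size_t n) {
      long total = 0;
      for (size_t i = 0; i < n; ++i) {
          if (i + 16 < n)
              __builtin_prefetch(&data[i + 16]); // request data before it's needed
          total += data[i];
      }
      return total;
  }
  ```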
- **TLB optimization**
  - Minimizing page table depth with huge pages
  - Optimizing data structures to minimize TLB misses
- **Page fault avoidance**
  - Pre-touching memory during initialization
  - Memory locking with `mlock()` to prevent swapping
  - Disabling copy-on-write with explicit memory copying
- **Zero-copy techniques**
  - Mapping shared memory directly to network buffers
  - Using kernel bypass technologies (DPDK, netmap)
- **Memory mapping optimization**
  - Using the `MAP_POPULATE` flag with `mmap()` to pre-fault pages
  - Careful use of page permissions for security isolation
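  A sketch of pre-faulting with `MAP_POPULATE` (the 4MB region size is illustrative):

  ```cpp
  #include <cstddef>
  #include <sys/mman.h>

  const size_t len = 4 * 1024 * 1024; // 4MB
  void* buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
  // Pages are faulted in before mmap returns, so the hot path
  // never takes a first-touch fault on this region.
  ```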
- **Cache-conscious data organization**
  - Grouping frequently accessed data together
  - Using structures-of-arrays instead of arrays-of-structures
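  For instance, in structure-of-arrays form a scan over prices touches only densely packed price data (the `Order` fields here are illustrative):

  ```cpp
  // Array-of-structures: scanning prices also drags volumes through the cache
  struct OrderAoS { int price; int volume; };
  OrderAoS book_aos[1024];

  // Structure-of-arrays: prices are contiguous, packing more per page and cache line
  struct OrderBookSoA {
      int prices[1024];
      int volumes[1024];
  };
  OrderBookSoA book_soa;
  ```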
- **Lock-free data structures with page considerations**
  - Ensuring atomic operations don't cross page boundaries
  - Considering cache coherence protocol effects
These techniques are often combined and adapted to specific hardware architectures and trading strategies. The effectiveness of each approach depends heavily on the specific workload, system architecture, and trading requirements.
Here are the key tools and commands for profiling page faults and TLB misses on Linux systems:
1. perf (Linux Performance Counters)
```bash
# Profile page faults
perf stat -e page-faults ./your_program

# Profile TLB misses (x86)
perf stat -e dTLB-load-misses,dTLB-store-misses,iTLB-load-misses ./your_program

# For more detailed analysis with call graphs (-g records them)
perf record -g -e page-faults,dTLB-load-misses ./your_program
perf report
```
2. PCM (Intel Performance Counter Monitor)
```bash
# Install: apt-get install pcm, or build from source
pcm.x 1        # Monitor memory accesses
pcm-memory.x   # Detailed memory subsystem stats
```
3. VTune Profiler (Intel)
```bash
# Memory access analysis
vtune -collect memory-access -knob analyze-mem-objects=true ./your_program

# Microarchitecture analysis for TLB stats
vtune -collect uarch-exploration ./your_program
```
4. PAPI (Performance Application Programming Interface)
```bash
# For custom applications instrumented with the PAPI library
papi_avail                       # List available counters
papi_native_avail | grep -i tlb  # Find TLB-related counters
```
5. valgrind/cachegrind
```bash
# Detailed cache simulation (note: cachegrind models caches, not the TLB)
valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 --LL=8388608,16,64 ./your_program
cg_annotate cachegrind.out.*
```
6. numastat
```bash
# NUMA-related statistics for a process
numastat -p PID
```
7. /proc filesystem
```bash
# Check page faults for a running process
cat /proc/PID/stat | awk '{print "Minor faults: "$10", Major faults: "$12}'

# Monitor page faults in real time
while true; do cat /proc/PID/stat | awk '{print "Minor: "$10", Major: "$12}'; sleep 1; done
```
8. bpftrace/BCC
```bash
# Count page faults by process (bpftrace one-liner)
sudo bpftrace -e 'kprobe:handle_mm_fault { @[comm] = count(); }'

# BCC scripts (install the BCC tools first)
sudo /usr/share/bcc/tools/memleak -p PID            # Memory leak analysis
sudo /usr/share/bcc/tools/funclatency do_page_fault # Page fault latency
```
For the most comprehensive analysis, I recommend starting with perf stat to get baseline metrics, then using more specialized tools like VTune or PCM for deeper investigation of specific issues.