Can You Control Relative Addresses to Keep Related Data on the Same Page?
Short answer: Yes, but indirectly.
While you can’t directly control where virtual addresses are assigned (the OS and memory allocator handle that), you can influence memory layout to maximize the chance that related data lands on the same page—just like cache-aware programming optimizes for cache lines. Here’s how:
1. How to Keep Related Data on the Same Page
A. Allocate Contiguous Memory Blocks
- Use arrays or custom allocators instead of scattered `malloc()` calls.
- Example:

```cpp
// Good: Allocates 1024 ints contiguously (likely on the same page or a few pages)
int* buffer = new int[1024];

// Bad: Fragmented allocations (could span many pages)
int* ptr1 = new int;
int* ptr2 = new int; // Unrelated addresses
```
B. Force Alignment to Page Boundaries
- Align large structures or buffers to page size (4KB/2MB).
- Example:
```cpp
// Allocate 8KB aligned to a 4KB page boundary
alignas(4096) char buffer[8192]; // Guaranteed to occupy exactly 2 full pages
```
C. Use Memory Pools
- Pre-allocate a pool of objects in a contiguous region.
- Example:
```cpp
#include <cstdlib>

struct Order { int price; int volume; };

// Reserve 1000 Orders in one chunk (likely on 1-2 pages).
// Note: aligned_alloc requires size to be a multiple of the alignment,
// so round 1000 * sizeof(Order) = 8000 bytes up to 8192.
Order* pool = (Order*)aligned_alloc(4096, 8192);
```
D. Leverage Huge Pages (2MB/1GB)
- Larger pages = higher chance related data stays together.
- Example (Linux):
```cpp
#include <sys/mman.h>

void* buf = mmap(NULL, 2 * 1024 * 1024,
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                 -1, 0);
// One 2MB page holds 32,768 cache lines (vs. 64 in a 4KB page)
```
2. Why This Works
- Virtual memory allocators (like `malloc`) tend to assign contiguous virtual addresses to contiguous allocations.
- By bundling related data (e.g., an order book’s price levels), you increase the odds they share a page.
- Page alignment ensures no straddling (e.g., a struct split across two pages).
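To see this behavior, here is a minimal sketch (assuming 4KB pages; exact allocator behavior varies) that prints the virtual page index of each allocation:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    int* arr = new int[1024]; // one contiguous block (4096 bytes)
    int* a = new int;         // two scattered allocations
    int* b = new int;

    const uintptr_t kPageSize = 4096; // assumed page size
    printf("arr[0]    page: %lu\n", (unsigned long)(uintptr_t(&arr[0]) / kPageSize));
    printf("arr[1023] page: %lu\n", (unsigned long)(uintptr_t(&arr[1023]) / kPageSize));
    printf("a         page: %lu\n", (unsigned long)(uintptr_t(a) / kPageSize));
    printf("b         page: %lu\n", (unsigned long)(uintptr_t(b) / kPageSize));
}
```

Typically the array spans one or two consecutive pages, while the two scalar allocations may land elsewhere.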
3. Edge Cases to Watch For
| Scenario | Risk | Fix |
|---|---|---|
| Heap Fragmentation | Repeated new/delete scatters objects. | Use memory pools. |
| Compiler Padding | Structs may have gaps between fields. | #pragma pack(1) or manual padding. |
| Multi-threaded Allocators | Thread-local allocators may use different regions. | Use a central pool. |
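For the compiler-padding row, a sketch of how field ordering alone changes struct size (reordering shown here as an alternative to `#pragma pack(1)`; exact padding is ABI-dependent, sizes below assume a typical 64-bit ABI):

```cpp
#include <cstdint>

struct Padded {        // small fields interleaved with large ones force gaps
    char     flag;     // 1 byte + 7 bytes padding
    uint64_t id;       // 8 bytes
    char     side;     // 1 byte + 7 bytes tail padding
};                     // sizeof(Padded) == 24

struct Compact {       // same fields, largest first
    uint64_t id;       // 8 bytes
    char     flag;     // 1 byte
    char     side;     // 1 byte + 6 bytes tail padding
};                     // sizeof(Compact) == 16

static_assert(sizeof(Compact) < sizeof(Padded), "reordering removed padding");
```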
4. HFT-Specific Tricks
- Prefault Pages
- Touch all pages after allocation to ensure they’re in RAM:
```cpp
memset(buffer, 0, size); // Forces physical page allocation
mlock(buffer, size);     // Locks pages in RAM (no swapping)
```
- NUMA Binding
- Ensure pages are allocated near the executing CPU core:
```cpp
numa_run_on_node(0);                    // Pin thread to NUMA node 0
void* buf = numa_alloc_onnode(size, 0); // Allocate memory on node 0
```
- Custom Allocators
- Override `new`/`malloc` to group related objects (e.g., all orders for Symbol X); see the sketch below.
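A minimal sketch of that idea, assuming 4KB pages (the `SymbolPool` name and slab size are illustrative, not a production allocator):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

struct Order { int price; int volume; };

// Per-symbol bump allocator: every Order for one symbol comes from the
// same page-aligned slab, so related orders tend to share pages.
class SymbolPool {
    static constexpr size_t kSlabSize = 4 * 4096; // 4 pages per symbol
    char*  slab_;
    size_t offset_ = 0;
public:
    SymbolPool() : slab_((char*)std::aligned_alloc(4096, kSlabSize)) {}
    ~SymbolPool() { std::free(slab_); }

    Order* allocate() {
        if (offset_ + sizeof(Order) > kSlabSize) return nullptr; // slab exhausted
        Order* p = new (slab_ + offset_) Order{};
        offset_ += sizeof(Order);
        return p;
    }
};
```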
5. Verification: How to Check Page Locality
- Linux: Use `pmap -X <PID>` to see memory mappings.
- C++: Log pointer addresses and compute page boundaries:
```cpp
uintptr_t page_mask = ~(4095ULL); // 4KB page mask
bool same_page = (uintptr_t(&obj1) & page_mask) == (uintptr_t(&obj2) & page_mask);
```
Key Takeaway
You can’t directly control virtual addresses, but by:
- Allocating contiguously (arrays/pools).
- Aligning to page boundaries.
- Using huge pages.
you maximize the odds that related data shares a page—just like cache-aware programming optimizes for cache lines. In HFT, this reduces TLB misses and keeps latency predictable.
Want a deep dive into measuring TLB misses? Try:
```bash
perf stat -e dtlb_load_misses.miss_causes_a_walk ./your_program
```
I should clarify that I don't have complete and specific information about all page-aware programming techniques used in modern High-Frequency Trading (HFT) systems, as many of these techniques are proprietary and continuously evolving. However, I can share several common page-aware optimizations that are known to be used in performance-critical HFT systems:
- **Page-aligned memory allocation**
  - Explicitly aligning data structures to page boundaries (typically 4KB)
  - Using `posix_memalign()`, `aligned_alloc()`, or custom allocators
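  For instance, a minimal sketch using `posix_memalign()` (the 16KB size is illustrative):

  ```cpp
  #include <cstdlib>

  void* buf = nullptr;
  // Request 16KB starting exactly on a 4KB page boundary; returns 0 on success.
  if (posix_memalign(&buf, 4096, 16384) != 0) {
      // handle allocation failure
  }
  ```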
- **NUMA (Non-Uniform Memory Access) awareness**
  - Pinning threads to specific CPU cores
  - Allocating memory from the same NUMA node as the CPU running the code
  - Using `numactl` or the `libnuma` API for explicit control
- **Huge pages utilization**
  - Using 2MB or 1GB pages instead of standard 4KB pages
  - Reducing TLB misses and page table overhead
  - Configuring with `madvise()` or `/proc/sys/vm/hugetlb_*` settings
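  As a sketch, requesting transparent huge pages for an existing mapping via `madvise()` (the 64MB region size is illustrative; requires THP enabled):

  ```cpp
  #include <cstddef>
  #include <sys/mman.h>

  const size_t len = 64 * 1024 * 1024; // 64MB region
  void* buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  madvise(buf, len, MADV_HUGEPAGE); // hint: back this range with 2MB pages
  ```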
- **Page coloring**
  - Organizing data structures to avoid cache conflicts
  - Ensuring hot data is on different cache lines
- **Cache line padding**
  - Adding padding to data structures to prevent false sharing
  - Aligning critical data to cache line boundaries (typically 64 bytes)
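  A minimal sketch, assuming 64-byte cache lines: giving each heavily written counter its own line prevents false sharing between threads.

  ```cpp
  #include <atomic>

  struct alignas(64) PaddedCounter {
      std::atomic<long> value{0};
      // alignas(64) rounds the struct up to a full cache line,
      // so adjacent PaddedCounters never share a line.
  };

  PaddedCounter producer_count; // written by one thread
  PaddedCounter consumer_count; // written by another thread
  ```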
- **Memory prefetching**
  - Strategic data layout to enable hardware prefetching
  - Software prefetch instructions for predictable access patterns
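  For example, a sketch using GCC/Clang's `__builtin_prefetch` in a linear scan (the prefetch distance of 16 elements is a tunable guess, not a measured value):

  ```cpp
  #include <cstddef>

  long sum(const int* data, size_t n) {
      long total = 0;
      for (size_t i = 0; i < n; ++i) {
          if (i + 16 < n)
              __builtin_prefetch(&data[i + 16]); // request data before it's needed
          total += data[i];
      }
      return total;
  }
  ```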
- **TLB optimization**
  - Minimizing page table depth with huge pages
  - Optimizing data structures to minimize TLB misses
- **Page fault avoidance**
  - Pre-touching memory during initialization
  - Memory locking with `mlock()` to prevent swapping
  - Disabling copy-on-write with explicit memory copying
- **Zero-copy techniques**
  - Mapping shared memory directly to network buffers
  - Using kernel bypass technologies (DPDK, netmap)
- **Memory mapping optimization**
  - Using the `MAP_POPULATE` flag with `mmap()` to pre-fault pages
  - Careful use of page permissions for security isolation
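  A sketch of pre-faulting with `MAP_POPULATE` (the 4MB region size is illustrative):

  ```cpp
  #include <cstddef>
  #include <sys/mman.h>

  const size_t len = 4 * 1024 * 1024; // 4MB
  void* buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
  // Pages are faulted in before mmap returns, so the hot path
  // never takes a first-touch fault on this region.
  ```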
- **Cache-conscious data organization**
  - Grouping frequently accessed data together
  - Using structures-of-arrays instead of arrays-of-structures
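  For instance, in structure-of-arrays form a scan over prices touches only densely packed price data (the `Order` fields here are illustrative):

  ```cpp
  // Array-of-structures: scanning prices also drags volumes through the cache
  struct OrderAoS { int price; int volume; };
  OrderAoS book_aos[1024];

  // Structure-of-arrays: prices are contiguous, packing more per page and cache line
  struct OrderBookSoA {
      int prices[1024];
      int volumes[1024];
  };
  OrderBookSoA book_soa;
  ```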
- **Lock-free data structures with page considerations**
  - Ensuring atomic operations don't cross page boundaries
  - Considering cache coherence protocol effects
These techniques are often combined and adapted to specific hardware architectures and trading strategies. The effectiveness of each approach depends heavily on the specific workload, system architecture, and trading requirements.
Here are the key tools and commands for profiling page faults and TLB misses on Linux systems:
1. perf (Linux Performance Counters)
```bash
# Profile page faults
perf stat -e page-faults ./your_program

# Profile TLB misses (x86)
perf stat -e dTLB-load-misses,dTLB-store-misses,iTLB-load-misses ./your_program

# For more detailed analysis with call graphs (-g records them)
perf record -g -e page-faults,dTLB-load-misses ./your_program
perf report
```
2. PCM (Intel Performance Counter Monitor)
```bash
# Install: apt-get install pcm, or build from source
pcm.x 1        # Monitor memory accesses
pcm-memory.x   # Detailed memory subsystem stats
```
3. VTune Profiler (Intel)
```bash
# Memory access analysis
vtune -collect memory-access -knob analyze-mem-objects=true ./your_program

# Microarchitecture analysis for TLB stats
vtune -collect uarch-exploration ./your_program
```
4. PAPI (Performance Application Programming Interface)
```bash
# For custom applications instrumented with the PAPI library
papi_avail                       # List available counters
papi_native_avail | grep -i tlb  # Find TLB-related counters
```
5. valgrind/cachegrind
```bash
# Detailed cache simulation (note: cachegrind models caches, not the TLB)
valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 --LL=8388608,16,64 ./your_program
cg_annotate cachegrind.out.*
```
6. numastat
```bash
# NUMA-related statistics for a process
numastat -p PID
```
7. /proc filesystem
```bash
# Check page faults for a running process
cat /proc/PID/stat | awk '{print "Minor faults: "$10", Major faults: "$12}'

# Monitor page faults in real time
while true; do cat /proc/PID/stat | awk '{print "Minor: "$10", Major: "$12}'; sleep 1; done
```
8. bpftrace/BCC
```bash
# Count page faults by process (bpftrace one-liner)
sudo bpftrace -e 'kprobe:handle_mm_fault { @[comm] = count(); }'

# BCC scripts (install the BCC tools first)
sudo /usr/share/bcc/tools/memleak -p PID            # Memory leak analysis
sudo /usr/share/bcc/tools/funclatency do_page_fault # Page fault latency
```
For the most comprehensive analysis, I recommend starting with perf stat to get baseline metrics, then using more specialized tools like VTune or PCM for deeper investigation of specific issues.