IMPROVE: various things

This commit is contained in:
Jean-Claude 2022-08-21 23:29:20 +02:00
parent c2c4723b7b
commit 995300f7ce
Signed by: jeanclaude
GPG Key ID: 8A300F57CBB9F63E
11 changed files with 118 additions and 89 deletions

View File

@ -20,7 +20,7 @@ The successful development of this variant confirms that \retbleed{} cannot be m
In-depth mitigations are necessary to secure the CPU, which will likely affect performance significantly.
We have benchmarked the patches to evaluate the induced overhead.
While the overhead depends on the microarchitectures, it lies in the range of \SIrange{13}{37}{\percent}.
While the overhead depends on the microarchitecture, it lies in the range of \SI{13}{\percent} to \SI{37}{\percent}.
\vfil
\clearpage

View File

@ -27,7 +27,7 @@ The kernel is the core part of each \ac{os} and facilitates the interactions bet
It can be thought of as a special program controlling all hardware.
When software and hardware need to communicate, the kernel serves as a middleman.
The kernel has full read and write, but not execute, access to all the system's memory.
The kernel is given this access via the \techterm{physmap}, a large, contiguous region in the kernel's address space that maps to all the system's physical memory.~\cite{ret2dir}
The kernel is given this access via the \techterm{physmap}, a large, contiguous region in the kernel's address space that maps to all the system's physical memory~\cite{ret2dir}.
For this thesis, we consider the Linux Kernel~\cite{linuxKernel} as it is \ac{foss}, meaning we can freely download, inspect and modify it.
However, \retbleed{} exploits features implemented in hardware, and therefore, other kernels are equally vulnerable.
@ -53,7 +53,7 @@ Before creating a new mapping, it is always verified whether the subject\footnot
If the subject references a page that it is not allowed to access, the type of the \ac{pf} is \techterm{invalid}.
The handler informs the subject about the privilege boundary breach by sending the \verb+SIGSEGV+ signal.
In the absence of bugs in the kernel or the user process, an invalid \ac{pf} should only be raised if the user process has malicious intentions.
When talking about \acp{pf} in the remainder of this thesis, we refer to \techterm{invalid} \acp{pf}.
When talking about \acp{pf} in the remainder of this thesis, we refer to invalid \acp{pf}.
\section{Stack}
\label{sec:stack}
@ -72,27 +72,27 @@ Once the function is completed and calls \verb+return+, the return address is lo
\section{Cache Structure}
\label{sec:cache}
\todo{Correct whole Cache Structure section.}
Over time, CPU speeds have increased much faster than memory speeds, and this gap keeps widening.
Caches compensate for that difference in speed.
They are small banks of fast memory used by the CPU to store recently used data.
Caches store data in \techterm{lines} which hold fixed number of \si{bytes} each.
Multiple \techterm{lines} are combined in a \techterm{set}.
A cache consists of multiple sets where the number of such sets is described by the \techterm{associativity}.
Caches store data in \techterm{lines} that hold a fixed number of \si{bytes} each.
Multiple lines are combined in a \techterm{set}.
A cache consists of multiple sets.
The number of lines per set is described by the \techterm{associativity} of the cache.
It ranges from \techterm{1-way associative}, in which case the set contains exactly one line, to \techterm{fully associative}, in which case there is only one set that contains all lines.
Memory is split into aligned junks of the same size as the \techterm{cache line}.
Each of these junks maps to precisely one \techterm{cache set}, determined by certain bits of the memory address.
To differentiate between multiple \techterm{lines} in the matching set a \techterm{tag} is used.
Memory is split into aligned chunks of the same size as the cache line.
Each of these chunks maps to precisely one cache set, determined by certain bits of the memory address.
To differentiate between multiple lines in the matching set, a \techterm{tag} is used.
When the CPU wants to read data from a given address, it indexes the cache to see if the data is present.
If a line with a matching tag is found in the right set, the CPU loads it.
Otherwise, it must first load the line containing the desired data into the cache and then pass it on to the CPU.
If the cache is full, a \techterm{eviction policy} is used to determine the \techterm{line} to replace.
To achieve the best possible performance, these policies are carefully crafted with the goal to evict the \techterm{line} which is accessed furthest in the future.
If the cache is full, an \techterm{eviction policy} is used to determine the line to replace.
To achieve the best possible performance, these policies are carefully crafted with the goal of evicting the line that is accessed furthest in the future.
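To make the indexing concrete, the following sketch derives the line offset, set index, and tag from a memory address. The geometry (64-\si{byte} lines, 64 sets) is assumed purely for illustration; real cache configurations are microarchitecture dependent.

\begin{lstlisting}[style=CStyle,caption={Illustrative derivation of line offset, set index, and tag from an address, assuming 64-\si{byte} lines and 64 sets.},label={lst:cacheIndexSketch}]
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u /* assumed cache-line size */
#define NUM_SETS   64u /* assumed number of sets  */

/* The lowest bits select the byte within the line. */
static uint64_t line_offset(uint64_t addr) { return addr % LINE_BYTES; }
/* The next bits select the cache set the chunk maps to. */
static uint64_t set_index(uint64_t addr) { return (addr / LINE_BYTES) % NUM_SETS; }
/* The remaining upper bits form the tag. */
static uint64_t tag(uint64_t addr) { return addr / (LINE_BYTES * NUM_SETS); }

int main(void) {
    uint64_t a = 0x12345640; /* 64-byte aligned example address */
    /* Two addresses within the same chunk share set and tag. */
    assert(line_offset(a) == 0 && set_index(a) == set_index(a + 8));
    assert(tag(a) == tag(a + 8));
    /* Addresses NUM_SETS lines apart land in the same set but
       carry different tags, i.e., they compete for the same set. */
    assert(set_index(a) == set_index(a + LINE_BYTES * NUM_SETS));
    assert(tag(a) != tag(a + LINE_BYTES * NUM_SETS));
    return 0;
}
\end{lstlisting}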
Modern processors employ multiple layers of caches, where each layer has different properties.
Most importantly, there is a trade-off between speed and size.
@ -100,20 +100,20 @@ However, caches can also be \techterm{private} or \techterm{shared}, which descr
The \techterm{inclusion policies} describe the interaction between multiple layers.
When a cache is \techterm{inclusive}, all blocks it stores must also be present in all lower-level caches.
An \techterm{exclusive} cache is one where a cached block is not allowed to be cached by any lower layer.
If the cache is neither \techterm{inclusive} nor \techterm{exclusive}, then it is \techterm{non-inclusive}.
If the cache is neither inclusive nor exclusive, then it is \techterm{non-inclusive}.
In this case, the blocks from the upper-level cache may or may not be present in the lower-level cache.
Commodity CPUs have three cache levels, with L1 being the smallest but fastest one, and L3, also called \ac{llc}, being large but slow.
L2 lies somewhere in between.
L3 commonly \techterm{shared} while the other two are \techterm{private}.
All three levels are commonly \techterm{inclusive}.
L3 is shared while the other two are private.
All three levels are commonly inclusive.
\subsection{Cache Attacks}
\label{subsec:cacheAttacks}
The basic principle of caches is that retrieving data from the cache is much faster than loading it from memory.
This property has been exploited using different \techterm{side-channel attacks}\cite{flushAndReload, flushAndFlush, hund2013practical, osvik2006cache}
This property has been exploited using different \techterm{side-channel attacks}~\cite{flushAndReload,flushAndFlush,hund2013practical,osvik2006cache}.
We will discuss \textsc{Flush+Reload}\cite{flushAndReload}, as it is the one most relevant for \retbleed.
We will discuss \textsc{Flush+Reload}~\cite{flushAndReload}, as it is the one most relevant for \retbleed.
It targets the L3 cache, meaning the attacker and victim do not need to share the execution core.
However, the attacker and the victim must have shared memory.
@ -121,7 +121,7 @@ The attacker proceeds as follows:
Using the \verb+clflush+ instruction, the attacker ensures that none of the memory lines are cached.
Then, the attacker waits to give the victim time to do some memory operations.
By reloading the memory lines and measuring the required time, the attacker knows if a line was cached and can infer if the victim has accessed it.
The attacker uses a so-called \techterm{convert channel} to make this inference.
This inference mechanism is referred to as a \techterm{covert channel}.
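The timing primitive behind \textsc{Flush+Reload} can be sketched in a few lines of C for x86\_64 using the \verb+clflush+ and \verb+rdtscp+ intrinsics. The snippet only illustrates the measurement; a real attack compares the reload time against a calibrated threshold, which is machine dependent.

\begin{lstlisting}[style=CStyle,caption={Sketch of the \textsc{Flush+Reload} timing primitive on x86\_64. The threshold calibration of a real attack is omitted.},label={lst:flushReloadSketch}]
#include <stdint.h>
#include <x86intrin.h> /* _mm_clflush, _mm_mfence, __rdtscp */

static uint8_t probe[4096]; /* stands in for the shared memory */

/* Time a single load of *p in cycles, serialized with fences. */
static uint64_t timed_access(volatile uint8_t *p) {
    unsigned aux;
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                      /* the probed load */
    uint64_t t1 = __rdtscp(&aux);
    _mm_mfence();
    return t1 - t0;
}

int main(void) {
    (void)*(volatile uint8_t *)probe;      /* warm up: line is now cached */
    uint64_t t_hit = timed_access(probe);  /* fast: cache hit             */
    _mm_clflush(probe);                    /* Flush: evict the line       */
    _mm_mfence();
    uint64_t t_miss = timed_access(probe); /* Reload: a slow reload means
                                              nobody touched the line     */
    /* An attacker would conclude "victim accessed the line" iff the
       reload time stays below a calibrated threshold. */
    return !(t_hit > 0 && t_miss > 0);
}
\end{lstlisting}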
\section{Speculative Execution}
\label{sec:specularExecution}
@ -140,17 +140,17 @@ Before retiring, the induced changes are only visible internally to the microarc
The aforementioned tasks are implemented as hardware units, and instructions ``flow'' from one stage to the next.
Since these stages are independent circuits, the next instruction can already be fetched when the first one gets passed on to the decode stage.
Since these stages are independent circuits, when one instruction is in the \techterm{decode stage}, the next instruction can already be \techterm{fetched}.
Since these stages are independent circuits, when one instruction is in the decode stage, the next instruction can already be fetched.
An internal clock orchestrates these stages.
This way, each stage knows when it receives new input from the lower stage and when the results must be ready to be passed to the next stage.
One says that such a CPU is \techterm{pipelined}.
One says that such a CPU is pipelined.
Different microarchitectures have different types and numbers of stages.
Intel Haswell has, for example, over 14 stages~\cite{haswellChipWiki}.
Intel Haswell has, for example, over 14 stages~\cite{haswellChipWiki,agnerFogManual}.
\subsection{Speculation}
\label{subsec:speculation}
While pipelining increases throughput and hence the overall performance of the CPU, it happens that the CPU \techterm{executes} instructions that it should never have.
While pipelining increases throughput and hence the overall performance of the CPU, it can happen that the CPU executes instructions that it never should have executed.
Multiple situations can lead to this.
When a pipeline stage cannot complete within the given timeframe, which may happen when an operand needs to be fetched from memory, the whole pipeline naturally comes to a halt.
@ -161,7 +161,7 @@ For example, it cannot execute any instructions whose operand depends on the res
In addition, the system must give the illusion that the ``execution'' follows the \techterm{program order}.
This is achieved by retiring the instruction in the program order.
\ac{ooo} execution leads to \techterm{speculation}, as it is potentially unclear if an \ac{ooo} executed instruction lies on the architectural path.
\ac{ooo} execution leads to speculation, as it is potentially unclear if an \ac{ooo} executed instruction lies on the architectural path.
Speculation happens if the execution of a branching instruction gets delayed while further downstream instructions are already executed.
Alternatively, speculation can be caused if an instruction whose execution gets delayed introduces an unrecoverable fault or an abort once it is executed.
Attacks exploiting the latter case are generally referred to as \techterm{Meltdown}~\cite{meltdown} vulnerabilities.
@ -181,17 +181,17 @@ When the CPU detects a branching instruction, which it does after decoding the i
The \ac{bpu} consists of multiple predictors, and the one to use for a particular instruction is chosen based on the type of the instruction.
In the following, we will look at the two predictors relevant for this thesis.
Due to the proprietary nature of CPUs, only very little is officially communicated about the \ac{bpu}'s internals, and researchers have to rely on reverse-engineering efforts.~\cite{projectZero}
Due to the proprietary nature of CPUs, only very little is officially communicated about the \ac{bpu}'s internals, and researchers have to rely on reverse-engineering efforts~\cite{projectZero}.
\subsubsection{Direct/Indirect Branch Prediction}
\label{subsubsec:dirIndirBranchPrediction}
\techterm{Direct} branches are branches where the destination address is given explicitly as an address or relative offset to a value stored in a register.
\techterm{Indirect} branches provide a pointer to a memory location, which gives the jump destination.
Furthermore, branches are either \techterm{conditional} or \techterm{unconditional}.
As the name indicates, the destination for \techterm{conditional} branches depends on a condition, while \techterm{unconditional} branches are always taken.
As the name indicates, the destination for conditional branches depends on a condition, while unconditional branches are always taken.
The prediction has two aspects.
When a branch is \techterm{conditional}, the \ac{bpu} needs to predict if the branch is taken or not, and where does the branch lead to, while for \techterm{indirect} branches, only the destination needs to be predicted.
When a branch is conditional, the \ac{bpu} needs to predict whether the branch is taken and where it leads, while for indirect branches, only the destination needs to be predicted.
\techterm{Direct + unconditional} branches are the easiest to deal with as they do not require any prediction.
However, predicting \techterm{indirect + conditional} branches requires two predictions: is the branch taken, and if so, what is its destination?
They have the highest risk of misprediction.
@ -209,15 +209,15 @@ The indexing of the \ac{btb} is microarchitecture dependent but generally done u
In addition, auxiliary data structures may be used.
Some Intel microarchitectures use the branch history, condensed in a hash-like format, stored in the \ac{bhb}, for the indexing.
It allows for efficient representation of the $N$ last encountered branches.
For Intel Haswell, for example, $N = 29$.~\cite{projectZero,spectre}
For Intel Haswell, for example, $N = 29$~\cite{projectZero,spectre}.
Unlike caches, the ways of the \ac{btb} have much smaller tags, which results in \techterm{collisions}.
A collision is when two different branching instructions map to the same \ac{btb} set.
It has been shown that for Intel Haswell, two branches collide if bits \numrange{6}{11} are equivalent.~\cite{retbleed}
It has been shown that for Intel Haswell, two branches collide if bits \numrange{6}{11} are equivalent~\cite{retbleed}.
On AMD, it is a bit more complicated.
Two addresses collide if the address bits are permuted according to a specific schema.
While the collision works for Intel across privilege domains, for AMD, it only works for Zen 1 and Zen 2.
For Zen 3, \citeauthor{retbleed} could not detect collisions across privilege boundaries.~\cite{retbleed}
For Zen 3, \citeauthor{retbleed} could not detect collisions across privilege boundaries~\cite{retbleed}.
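Based on the bit range reported for Haswell, the collision condition can be sketched as a simple bit comparison; the 6-bit check below follows the cited reverse-engineering results and does not model the AMD permutation schema.

\begin{lstlisting}[style=CStyle,caption={Sketch of the Haswell \acs{btb} collision condition: two branch addresses collide if bits 6 to 11 are equal.},label={lst:btbCollisionSketch}]
#include <assert.h>
#include <stdint.h>

/* Two branch addresses collide in the Haswell BTB
   when address bits 6..11 match. */
static int btb_collides_haswell(uint64_t a, uint64_t b) {
    return ((a >> 6) & 0x3f) == ((b >> 6) & 0x3f);
}

int main(void) {
    /* A kernel and a user address with equal bits 6..11 collide. */
    assert(btb_collides_haswell(0xffffffff81000040, 0x0000000000400040));
    /* Addresses 4 KiB apart keep bits 6..11 and therefore collide. */
    assert(btb_collides_haswell(0x400040, 0x400040 + 4096));
    /* Changing bit 6 breaks the collision. */
    assert(!btb_collides_haswell(0x400040, 0x400080));
    return 0;
}
\end{lstlisting}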
\subsubsection{Function Return Prediction}
\label{subsubsec:funcReturnPrediction}
@ -227,7 +227,7 @@ It uses a stack-like cache called \ac{rsb} to support multiple nested function c
When encountering a call, the expected return address is pushed to the \ac{rsb}.\footnote{For the x86\_64 \ac{isa}, a function residing at \ac{pc} is expected to return to $\ac{pc} + 4$.}
Once we return, the top address of the \ac{rsb} is popped, and we speculatively execute the path starting at the retrieved address.
The capacity of the \ac{rsb} is limited, with, for example, $16$ entries for Haswell or $22$ for Ice Lake.~\cite{agnerFogManual}
The capacity of the \ac{rsb} is limited, with, for example, $16$ entries for Haswell or $22$ for Ice Lake~\cite{agnerFogManual}.
In case of an overflow, the \ac{rsb} wraps around and overwrites the oldest element.
Not much is known about the behavior on an underflow.
\citeauthor{retbleed} have shown that certain microarchitectures fall back on using the \ac{btb}~\cite{retbleed}.
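The push/pop discipline and the wrap-around can be illustrated with a toy model of the \ac{rsb} as a circular stack. The capacity of 16 matches Haswell; the underflow behavior is deliberately left unmodeled, mirroring the limited public knowledge.

\begin{lstlisting}[style=CStyle,caption={Toy model of an \acs{rsb}: a circular stack that overwrites the oldest entry on overflow. Underflow behavior is intentionally not modeled.},label={lst:rsbModelSketch}]
#include <assert.h>
#include <stdint.h>

#define RSB_CAP 16 /* capacity on Haswell */

typedef struct { uint64_t slot[RSB_CAP]; int top; } rsb_t;

/* On a call, push the expected return address. */
static void rsb_push(rsb_t *r, uint64_t ret_addr) {
    r->top = (r->top + 1) % RSB_CAP; /* wraps around on overflow */
    r->slot[r->top] = ret_addr;
}
/* On a return, pop the predicted return address. */
static uint64_t rsb_pop(rsb_t *r) {
    uint64_t a = r->slot[r->top];
    r->top = (r->top - 1 + RSB_CAP) % RSB_CAP;
    return a;
}

int main(void) {
    rsb_t r = { .top = 0 };
    /* 17 nested calls: the oldest pushed address gets overwritten. */
    for (uint64_t i = 1; i <= 17; i++) rsb_push(&r, 0x1000 + i * 4);
    assert(rsb_pop(&r) == 0x1000 + 17 * 4); /* most recent call */
    for (int i = 0; i < 15; i++) rsb_pop(&r);
    assert(rsb_pop(&r) != 0x1000 + 1 * 4);  /* oldest entry is gone */
    return 0;
}
\end{lstlisting}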
@ -237,17 +237,17 @@ We will look more closely at this behavior in \autoref{subsubsec:retbleed}.
\label{subsec:spectreAttacks}
While speculatively executed instructions do not retire and hence do not affect the architectural state, they leave traces in the microarchitectural state.
Using \techterm{side-channel attacks}, these traces can be used to infer information about the operands of the speculatively executed instructions.
Since the publication of \techterm{Spectre}~\cite{spectre} in 2018, new versions of \techterm{Spectre}, more generally referred to as \techterm{transient execution attacks}, were discovered.
Since the publication of \techterm{Spectre}~\cite{spectre} in 2018, new versions of Spectre, more generally referred to as \techterm{transient execution attacks}, were discovered.
By influencing the \techterm{transient code}, attackers can control what data to leak.
That is not just limited to data of the current user, but a wide variety of attacks work across privilege boundaries.
To achieve the desired speculation, attackers often manipulate the \ac{bpu} in carefully selected ways.~\cite{spectre,meltdown,retbleed}
To achieve the desired speculation, attackers often manipulate the \ac{bpu} in carefully selected ways~\cite{spectre,meltdown,retbleed}.
For a \techterm{Spectre attack} to be successful, three things are needed:
For a Spectre attack to be successful, three things are needed:
\begin{enumerate}
\item \textbf{Speculation Primitive.} Causes the speculation and makes the \techterm{disclosure gadget} to get executed \techterm{speculatively}.
\item \textbf{Speculation Primitive.} Causes the speculation and makes the disclosure gadget get executed speculatively.
\item \textbf{Disclosure Gadget.} Is executed speculatively and causes the leakage.
\item \textbf{Convert Channel.} Is used to read out the microarchitectural traces left over by the \techterm{transient execution}.
\item \textbf{Covert Channel.} Is used to read out the microarchitectural traces left over by the transient execution.
\end{enumerate}
Mitigating Spectre attacks is non-trivial as speculation is a core principle of modern CPUs.
@ -262,7 +262,7 @@ Spectre attacks can be mitigated in three different ways:~\cite{retbleed}
\end{itemize}
\end{itemize}
Next, we will look at a few relevant \techterm{transient execution attacks}.
Next, we will look at a few relevant transient execution attacks.
We start by looking at \techterm{Spectre~V1}, which should serve as a simple introductory example.
As \techterm{Spectre V2} shares some similarities with \retbleed, we will discuss it afterwards.
\techterm{Spectre\acs{rsb}} is relevant because we use it to craft our speculative version of \retbleed.
@ -295,8 +295,8 @@ for (int i = 0; i <= array_size; i++) {
}
\end{lstlisting}
To mitigate Spectre V1, one can use the \verb+lfence+ instruction or create data dependencies, to prevent the speculative out-of-bounds access.~\cite{amdSpectreMitigation}
Compilers, such as GCC, have been patched to be able to do that automatically.~\cite{spectreV1Mitigation}
To mitigate Spectre V1, one can use the \verb+lfence+ instruction or create data dependencies, to prevent the speculative out-of-bounds access~\cite{amdSpectreMitigation}.
Compilers, such as GCC, have been patched to be able to do that automatically~\cite{spectreV1Mitigation}.
Therefore, Spectre V1 can be mitigated by recompilation.
This attack is often employed with array out-of-bounds accesses, as in our example.
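A minimal sketch of the \verb+lfence+-based mitigation, using a hypothetical bounds-checked accessor: the fence placed after the comparison prevents the out-of-bounds load from being executed speculatively.

\begin{lstlisting}[style=CStyle,caption={Hypothetical bounds-checked read hardened with \lstinline+lfence+ against Spectre V1.},label={lst:lfenceSketch}]
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h> /* _mm_lfence */

static uint8_t array[16];

/* Hypothetical hardened accessor: the lfence stops speculation from
   running ahead of the bounds check with an out-of-bounds index. */
static uint8_t safe_read(size_t idx, size_t array_size) {
    if (idx < array_size) {
        _mm_lfence(); /* speculation barrier */
        return array[idx];
    }
    return 0;
}

int main(void) {
    array[3] = 42;
    assert(safe_read(3, sizeof array) == 42);  /* in bounds */
    assert(safe_read(100, sizeof array) == 0); /* rejected  */
    return 0;
}
\end{lstlisting}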
@ -318,8 +318,8 @@ This steers the speculative control flow to \verb+GADGET+.
Intel proposed \ac{ibrs}~\cite{ibrs} as a possible mitigation for Spectre V2.
\ac{ibrs} prevents the predictor from using branch prediction resolutions from lower privilege levels on higher ones.
This measure prevents the hijack of a victim process across privilege boundaries.
In the end, a different mitigation, called \techterm{retpoline}~\cite{retpolineIntel,retpolineGoogle}, was used due to its lower overhead.~\cite{retpolineOverIbrs}
\techterm{Retpoline} prevents speculative execution from using the indirect branch predictor.
In the end, a different mitigation, called \techterm{retpoline}~\cite{retpolineIntel,retpolineGoogle}, was used due to its lower overhead~\cite{retpolineOverIbrs}.
Retpoline prevents speculative execution from using the indirect branch predictor.
\subsubsection{Spectre\acs{rsb}}
\label{subsubsec:spectreRsb}
@ -336,7 +336,7 @@ In \autoref{lst:roguePrimitive}, we have used this method to cause speculation.
\subsubsection{Retbleed}
\label{subsubsec:retbleed}
\retbleed~\cite{retbleed} is a \techterm{transient execution attack}, sharing many similarities with Spectre V2.
\retbleed~\cite{retbleed} is a transient execution attack, sharing many similarities with Spectre V2.
It also hijacks branches by poisoning \ac{btb}, but in contrast to Spectre V2, it targets return instructions.
While Spectre\acs{rsb}~\cite{spectreRsb}, Ret2Spec~\cite{ret2spec}, and Spring~\cite{spring} all are return-based Spectre attacks, they target the \ac{rsb}, while \retbleed{} targets the \ac{btb}.
@ -375,7 +375,7 @@ A system where both \acp{poc} works successfully, is susceptible by \retbleed.\f
\label{para:rebleedE2E}
The actual \retbleed{} attack can leak arbitrary privileged memory at a rate of \SI{219}{\byte\per\second} for Intel Coffee Lake and \SI{3.9}{\kilo\byte\per\second} for AMD Zen 2.
All systems susceptible to both mentioned primitives are vulnerable to \retbleed.
AMD Zen 1, Zen 1+ and Zen 2, as well as Intel Kaby Lake and Coffee Lake, are among the vulnerable microarchitectures.~\cite{retbleed}
AMD Zen 1, Zen 1+ and Zen 2, as well as Intel Kaby Lake and Coffee Lake, are among the vulnerable microarchitectures~\cite{retbleed}.
To make this attack work, an attacker must overcome multiple challenges.
Firstly, vulnerable and exploitable return instructions must be detected and identified.
@ -389,11 +389,11 @@ Only then the victim return can be executed using a system call.
We have already introduced \ac{ibpb}-on-\ac{pf} as a possible mitigation in \autoref{sec:motivation} and \ref{sec:researchQuestions}.
While it enforces isolation by preventing \ac{bti} from user to kernel space, it is said to be incomplete.
\ac{eibrs} was shown to mitigate \retbleed.
\ac{eibrs}~\cite{ibrs} was shown to mitigate \retbleed.
Therefore, more recent microarchitectures like Coffee Lake Refresh and Alder Lake are secure.
On older microarchitectures, where \ac{rsb}-to-\ac{btb} can be observed and which do not support \ac{eibrs}, \ac{ibrs} is enabled.~\cite{retbleedIntelMitigation,retbleed}
On older microarchitectures, where \ac{rsb}-to-\ac{btb} can be observed and which do not support \ac{eibrs}, \ac{ibrs} is enabled~\cite{retbleedIntelMitigation,retbleed}.
While Intel takes the route of isolation, AMD prevents the speculation from happening altogether.
Their mitigation, called \techterm{jmp2ret}, replaces all return instructions in the kernel with jumps to a return thunk.
An \techterm{untrain} procedure secures the last remaining return instruction.~\cite{retbleedAmdMitigationI,retbleedAmdMitigationII,retbleed}
An \techterm{untrain} procedure secures the last remaining return instruction~\cite{retbleedAmdMitigationI,retbleedAmdMitigationII,retbleed}.
Some more details on the mitigations are given in \autoref{sec:discussion}, where we discuss mitigation overheads.

View File

@ -2,9 +2,9 @@
\section{Implementation}
\label{sec:implementation}
We use the PoCs provided by \retbleed{} as a basis for our PoCs\footnote{We will consider a slightly modified version of the \retbleed{} PoCs where we have done some simplifications.}.~\cite{retbleedRepo}
Before we develop a speculative version of the \cpbti{} PoC, we want to verify that speculative \ac{bti} works in the same privilege domain.
For that, we modify the \retbti{} PoC to create a version of it where the \ac{bti} is done speculatively.
We use the \acp{poc} provided by \retbleed{} as a basis for our \acp{poc}\footnote{We will consider a slightly modified version of the \retbleed{} \acp{poc} where we have done some simplifications.}~\cite{retbleedRepo}.
Before we develop a speculative version of the \cpbti{} \ac{poc}, we want to verify that speculative \ac{bti} works in the same privilege domain.
For that, we modify the \retbti{} \ac{poc} to create a version of it where the \ac{bti} is done speculatively.
We will refer to it as \specretbti.
If we succeed, we proceed by creating a speculative version of \cpbti{} to see if the speculative \ac{bti} works across privilege domains, as it does with non-speculative \ac{bti}.
@ -13,40 +13,36 @@ AMD and Intel microarchitectures have shown this behavior, but as seen in \autor
For Intel, this is achieved by underflowing the \ac{rsb}, which happens if too many return instructions are encountered in a row.
We refer to the mechanism we use to achieve that in our \ac{poc} as a \techterm{return cycle}.
Besides underflowing the \ac{rsb}, the \techterm{return cycle} has the additional purpose of creating a ``normalized'' branch history, meaning that the \ac{bhb} is set into a known and easily reproducible state.
Besides underflowing the \ac{rsb}, the return cycle has the additional purpose of creating a ``normalized'' branch history, meaning that the \ac{bhb} is set into a known and easily reproducible state.
Normalizing the history is important since the \ac{btb} is indexed using the \ac{bhb}.
The \ac{bpu} will not use the injected \ac{btb} entry if the history diverges too much from the one present during the poisoning.
To create the \techterm{return cycle}, a memory address holding a return instruction is repeatedly pushed to the stack.
To create the return cycle, a memory address holding a return instruction is repeatedly pushed to the stack.
An initial return instruction starts the recursive cycle.
After all addresses of the return instruction are popped from the stack, the control flow returns to the address on the stack, which was pushed prior to creating the cycle.
\autoref{lst:createReturnCycle} shows the used method.
Alternatively, a recursive function call can also create the return cycle.
The length of the return cycle is microarchitecture dependent, but $28$ cycles are optimal for Coffee Lake and $29$ for Coffee Lake Refresh.~\cite{retbleed}
The length of the return cycle is microarchitecture dependent, but $28$ cycles are optimal for Coffee Lake and $29$ for Coffee Lake Refresh~\cite{retbleed}.
\begin{lstlisting}[style=CStyle,caption={Creation of a \techterm{return cycle}. The address \lstinline+cycle\_dst+, to which the control flow return after the \techterm{return cycle}, is pushed to the stack first. Next, the \techterm{return cycle} is created by repeatedly pushing the address \lstinline+RET_PATH+, where a return instruction is stored, to the stack. An initial return instruction starts the cycle.},label={lst:createReturnCycle}]
\begin{lstlisting}[style=CStyle,caption={Creation of a return cycle. The address \lstinline+cycle_dst+, to which the control flow returns after the return cycle, is pushed to the stack first. Next, the return cycle is created by repeatedly pushing the address \lstinline+RET_PATH+, where a return instruction is stored, to the stack. An initial return instruction starts the cycle.},label={lst:createReturnCycle}]
// Store return instruction to memory location RET_PATH
memcpy((u8*)RET_PATH, "\xc3", 1);
asm(
// Address to which to return after the cycle
"pushq %[cycle_dst]\n\t"
// Create return cycle of length 30
".rept 30\n\t"
"push $RET_PATH\n\t"
".endr\n\t"
// Start the cycle
"ret\n\t"
:: [cycle_dst]"r"(BR_SRC1) :
);
\end{lstlisting}
%stopzone
AMD CPUs exhibit the \ac{rsb}-to-\ac{btb} fallback for return instructions in case they collide with the address of a previously encountered indirect branch.
Also, the cycle is also not required for setting up the branch history, as the \ac{btb} does not seem to be indexed using any kind of branch history.~\cite{retbleed}
Moreover, the cycle is not required for setting up the branch history, as the \ac{btb} does not seem to be indexed using any kind of branch history~\cite{retbleed}.
Instead, the \ac{btb} is indexed using the start and end addresses of the branch, which can be thought of as a ``basic block''.
\paragraph{Preview.}

View File

@ -7,7 +7,7 @@ This \ac{poc} aims to verify that speculative \ac{bti} works in the same privile
However, before working on \specretbti, we will discuss how the plain \retbti{} \ac{poc} works.
\paragraph{\retbti{} in detail.}
After the \techterm{return cycle}, spinning on a particular memory location, we get to \verb+BR1+.
After the return cycle, spinning on a particular memory location, we get to \verb+BR1+.
Here, the speculation primitive, a return instruction, is located.
This return brings us to the disclosure gadget, stored at \verb+TRAIN+, during the training phase.
It also does the \ac{bti}.
@ -57,7 +57,7 @@ That function changes the architectural return address such that it returns to \
\caption{The rogue function is also executed during the speculation phase, to ensure that histories are equivalent. However, in contrast to the training phase, no speculation window is created. After returning from the rogue function, the return instruction is mispredicted, using the injected entry pointing to \lstinline+TRAIN+.}
\label{fig:specRetbtiSpec}
\end{subfigure}
\caption{Control flow of the \specretbti{} PoC for Intel. During the training phase, depicted in (a), a speculatively executed return poisons the \ac{btb}. This leads to the hijacking of a return instruction, as visible in (b). Speculatively executed branches are indicated in red, while the architectural branches are black.}
\caption{Control flow of the \specretbti{} \ac{poc} for Intel. During the training phase, depicted in (a), a speculatively executed return poisons the \ac{btb}. This leads to the hijacking of a return instruction, as visible in (b). Speculatively executed branches are indicated in red, while the architectural branches are black.}
\label{fig:specRetbti}
\end{figure}

View File

@ -4,7 +4,7 @@
\label{subsubsec:specCpBti}
The development of the speculative version of \cpbti{} is of main interest to us.
It shows if speculative \ac{bti} works across privilege boundaries and, therefore, demonstrates if \retbleed's primitives can be implemented without raising any \acp{pf}.
A successful implementation of \speccpbti{} implies a positive answer to \rqref{rq1}.
% A successful implementation of \speccpbti{} implies a positive answer to \rqref{rq1}.
These \acp{poc} consist of a user space program and a kernel module.
The user space program is the attacker who poisons the \ac{btb} across privilege boundaries to hijack a return instruction executed by the kernel.
@ -37,10 +37,10 @@ The speculation phase, which we will discuss next, is depicted in \autoref{fig:c
\end{figure}
To make use of the poisoned \ac{btb} entry, control is handed over to the kernel.
It executes the \techterm{return cycle} located at \verb+KBR_SRC+.
It executes the return cycle located at \verb+KBR_SRC+.
Similarly to the \retbti{} \ac{poc} for AMD, the sources of the victim and attacker branches differ, as they lie in different address spaces.
As with the mentioned AMD \ac{poc}, \verb+KBR_SRC'+ is selected so that it collides with \verb+KBR_SRC+.
Therefore, as the \acp{pc} collide and the histories are equivalent\footnote{Even if the return cycle spins on colliding addresses, it hard to say if the histories are actually equivalent or just ``similar enough'' to make the prediction work. We will comment on that matter in \autoref{sec:discussion}.\todo{actually do that}}, the \ac{bpu} will use the malicious entry for its prediction, guiding the speculative control flow to \verb+KBR_DST+.
Therefore, as the \acp{pc} collide and the histories are equivalent\footnote{Even if the return cycle spins on colliding addresses, it is hard to say if the histories are actually equivalent or just ``similar enough'' to make the prediction work.}, the \ac{bpu} will use the malicious entry for its prediction, guiding the speculative control flow to \verb+KBR_DST+.
The version for AMD differs only in that no return cycles are used.

View File

@ -3,10 +3,10 @@
\section{Evaluation}
\label{sec:evaluation}
We have developed a speculative version of both \retbti{} and \cpbti.
The \retbti{} \acp{poc} for AMD and Intel, and \cpbti{} for Intel are working.
The \specretbti{} \acp{poc} for AMD and Intel, and \speccpbti{} for Intel are working.
For all \acp{poc} for Intel, we get reliable signals and have not observed any wrong outputs.
While we get some signal for the \specretbti{} \ac{poc} for AMD, the signal is not very stable.
For all working \acp{poc}, we have verified that the source of speculation comes from the right place.
For all working \acp{poc}, we have verified that the source of speculation comes from the desired location.
This process is further described in \autoref{a:sec:verifySourceOfSpeculation}.
The \cpbti{} \ac{poc} for AMD is not working.
@ -25,17 +25,16 @@ Since we have observed that the chosen $N$ can have an impact on the performance
To check for consistency, we repeat the experiment $10$ times for each $N$.
The results are aggregated over the runs as follows;
The number of false outputs is summed up.
The mean and the standard deviation of the true outputs are calculated.
In addition, the standard deviation is normalized to $N$.
\paragraph{Results.}
\autoref{tab:intelRetBti} and \ref{tab:intelSpecRetBti} show the results for \retbti{} and \specretbti{} for Intel.
For \retbti{}, the percentage of correct outputs is mostly constant over $N$, with a mean of \SI{29.53}{\percent}.
The mean of the normalized standard deviation is $0.0216$.
For \specretbti, the mean success rate is \SI{67.82}{\percent}, which is \SI{\sim 38}{\pp} higher than for the non-speculative version.
However, the speculative version is less stable, which is confirmed by the higher mean of the normalized standard deviation, which is $0.0777$.
\begin{table}[ht]
\centering
}
\end{table}
The results of the \cpbti{} \acp{poc} for Intel are listed in \autoref{tab:intelRetBti} and \ref{tab:intelSpecRetBti}.
For the non-speculative \cpbti{} \ac{poc}, the percentage of correct answers starts relatively low for a small $N$ and increases with increasing $N$.
After a peak at $N = 20000$, it decreases again.
The mean percentage of correct answers is \SI{12.23}{\percent}, and the mean normalized standard deviation is $0.1665$.
For the speculative version, the success percentage increases with increasing $N$.
The highest reached percentage is \SI{88.77}{\percent}, which we got for the highest tested $N$.
The mean success rate of \SI{60.07}{\percent} is approximately $5$ times as large as for the non-speculative case.
The output is much less stable, which is confirmed by the mean normalized standard deviation, which, at $0.2839$, is twice as large as for the non-speculative \ac{poc}.
\begin{table}[ht]
In both cases, the speculative version performs better than the non-speculative one.
We were not able to collect results for AMD.
The \specretbti{} \ac{poc} for AMD is functional.
However, the signal strength and reliability were too low to be properly recorded.
We were unable to get the \speccpbti{} \ac{poc} to work for AMD.
\paragraph{Discussion.}
The most notable result is that the speculative versions of the \acp{poc} have a higher success rate than the non-speculative ones.
The standard deviation is generally also higher, indicating that the performance fluctuates more.
When looking at the non-speculative \acp{poc}, one might notice that their performance is much worse than the ones shown by \citeauthor{retbleed}~\cite{retbleed}.
This has multiple reasons;
Firstly, we did not use the latest and most performant versions of the \acp{poc} to gather these results, but the versions on which we based the speculative \acp{poc}.
In addition, we disabled compiler optimizations for all \acp{poc}, as some optimizations caused the speculative \acp{poc} to stop working.


This gives a positive answer to \rqref{rq1}.
Moreover, the successful development of \specretbti{} shows that speculative \ac{bti} is possible in the same privilege domain.
In addition, \speccpbti{} shows that it even works across privilege boundaries.
The latter we were only able to show for Intel.
\subsection{Challenges}
\label{subsec:challenges}
With the proprietary nature of CPUs and the limited public knowledge on the \ac{bpu}'s internals, developing the speculative primitives was challenging.
Every now and then, the \ac{bpu} behaved unexpectedly.
Sometimes trial and error was required to figure out how to align certain instructions to achieve the best possible performance.
While we failed in creating the \speccpbti{} primitive for AMD, we believe that it is possible.
We presume that our \ac{poc} is not working due to some minor error, like alignment issues or other requirements of the \ac{bpu}, of which we are unaware.
All in all, these \acp{poc} are very fragile.
Many different components influence the success rate of these \acp{poc}.
If one is not right, the whole \ac{poc} may not work.
In \autoref{a:sec:improveSignalStrength}, we list the components that we considered during development.
\subsection{Retbleed in Virtualised Systems}
\label{subsec:retbleedInVirtualizedEnv}
So far, we have only considered bare-metal systems for our discussion on \ac{pf}-free \retbleed.
We will briefly discuss another scenario that allows for employing \retbleed{} without causing any \acp{pf}.
Hence, even if it is impossible to create a \ac{pf}-free training primitive for AMD, this gives reason enough to consider a more in-depth mitigation.
We assume a virtualized environment with a guest \ac{os} where an attacker has obtained root privileges.
The attacker aims to hijack a return instruction of a hypervisor process.
As the \ac{btb} is not flushed on a guest-to-hypervisor switch, the hypervisor is influenced by branch feedback from the guest.


We describe the used methodology in the next section, followed by an evaluation.
Byte-UnixBench~\cite{byteUnixBench} is the benchmarking suite we use for this evaluation.
This suite is composed of $12$ end-to-end benchmarks which test different aspects of the system.
While certain tests will not interact with the mitigations, as they do not leave the user space, other tests, like the \techterm{System Call Overhead} test, are designed to benchmark the cost of entering and leaving the kernel space.
\autoref{a:tab:rawIndexScore} shows the index score of all tests, with the mitigation disabled and enabled.
It is clearly visible which tests are affected by the mitigation and which are not.
Most tests are composed of a loop, where a specific task is performed in each iteration.
Depending on the \ac{sut}'s performance, this task takes a different amount of time to complete.
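Such a test loop can be sketched as follows; this is a strongly simplified model, not the actual Byte-UnixBench implementation, and the task and duration are placeholders.

```c
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t done = 0;

static void on_alarm(int sig) { (void)sig; done = 1; }

/* Run the given task repeatedly for a fixed number of seconds and
 * return the iteration count; a faster system completes more
 * iterations in the same time, yielding a higher score. */
static long run_benchmark(void (*task)(void), unsigned seconds) {
    long iterations = 0;
    done = 0;
    signal(SIGALRM, on_alarm);
    alarm(seconds);
    while (!done) {
        task();
        iterations++;
    }
    return iterations;
}

/* Placeholder workload standing in for a benchmark task. */
static void noop_task(void) {}
```

The iteration count per time unit is then converted into the index score reported by the suite.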
Firstly, only a single instance of the test is run, while in the second run, N copies are run in parallel.
\paragraph{System Under Test.}
As discussed in \autoref{para:retbleedMitigation}, different mitigations have been released for different microarchitectures.
To get a good overview of the performance impacts, we benchmark multiple vulnerable microarchitectures; Intel Coffee Lake, AMD Zen 1, and AMD Zen 2.
All \acp{sut} run unmodified Linux 5.19.0-rc6, into which the \retbleed{} mitigations have been merged~\cite{retbleedMitigation}.
For each \ac{sut}, we make two measurements; Once with \retbleed{} mitigation enabled and once with all mitigations disabled.
The mitigation is controlled using the kernel parameters \verb+retbleed=auto+ and \verb+retbleed=off+, respectively.
For AMD Zen 1, the default mitigation setting does not fully mitigate the issue, as a thread remains vulnerable to attacks from its sibling thread.
However, with less than \SI{1.5}{\percent} fluctuations in the index score, the results are sufficiently stable.
As these instabilities also occur with the mitigation disabled, they are most likely not related to the patches.
Coffee Lake exhibits a rather significant overhead.
This is due to the \ac{ibrs}~\cite{ibrs}-based mitigation~\cite{retbleedIntelMitigation}.
\ac{ibrs} is an indirect branch control mechanism that restricts the speculation of indirect branches, preventing \retbleed{} from hijacking branches across privilege boundaries.
Kernel developers started considering a lower-cost mitigation for Intel, which works by detecting and preventing the \ac{rsb} from underflowing~\cite{retbleedNewMitigation}.
AMD Zen 1 and Zen 2 use the same jmp2ret ``untrain return thunk'' as the basis for their mitigation.
The \SI{4.89}{\percent} overhead for Zen 1, with \ac{smt} enabled, in the multi-threaded case, probably directly represents the overhead of jmp2ret.
However, jmp2ret does not fully mitigate the issue, as it leaves a thread vulnerable to attacks from its sibling thread.
Therefore, we consider \SI{\sim 5.0}{\percent} as the base overhead for AMD Zen 1 and Zen 2.
To overcome this vulnerability issue, Zen 2 uses \ac{stibp}~\cite{stibp}.
This mechanism prevents indirect branch prediction resolution from influencing sibling threads.
\ac{stibp} adds \SI{\sim 8}{\percent} overhead on top of the base overhead, resulting in an overhead of \SI{13.13}{\percent}.
Without \ac{stibp} on Zen 1, the only option to protect a thread from its sibling is to disable \ac{smt}.


%Here you could also show important code snippets or additional plots.
\section{Improve Signal Strength and Reliability}
\label{a:sec:improveSignalStrength}
In this section, we briefly mention a few techniques we have employed to improve the signal strength and reliability of the \acp{poc}.
We will also emphasize findings we have made during the implementation related to these techniques.
\paragraph{Branch History.}
As the \ac{btb} for Intel is indexed using the branch history condensed in the \ac{bhb}, the probability that the injected branch is used for the prediction is the highest, if the history at the time of the hijack is equivalent to the one in training.
While the return cycle is vital for creating a normalized history, subsequent branches can easily disrupt the history again.
That is why we also call the rogue or fake-rogue function in the speculation phase.
While having equivalent branch histories is ideal, this is not achievable for \ac{cp} attacks.
Besides the overall branching schema, which should match, branch addresses should also be aligned.
This is achieved by the appropriate use of \verb+nop+s.
To verify that, we can use a trick.
Right before the victim branch, we modify the secret by, for example, incrementing it.
If the leaked secret conforms with our changes, the speculation comes from the right place.
Otherwise, we must reconsider our design and prevent speculation from other parts of the primitive.
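A strongly simplified model of this trick is sketched below; \verb+leak_via_victim_branch+ is a hypothetical stand-in for the covert-channel readout of the \ac{poc}, not a function from the actual primitive.

```c
#include <assert.h>

static unsigned char secret = 0x41;

/* Hypothetical stand-in for the PoC's covert-channel readout.  In the
 * real primitive this triggers the hijacked victim branch and decodes
 * the speculatively accessed byte from the cache side channel; here
 * we simply model a leak that observes the current secret. */
static unsigned char leak_via_victim_branch(void) {
    return secret;
}

/* The verification trick: modify the secret right before triggering
 * the victim branch.  If the leaked value reflects the modification,
 * the speculation originated at the victim branch; otherwise it came
 * from some other part of the primitive. */
static int speculation_comes_from_victim(void) {
    secret++; /* modify the secret just before the victim branch */
    unsigned char leaked = leak_via_victim_branch();
    return leaked == secret;
}
```

A leak of the stale, unmodified value would indicate that the signal stems from an earlier stage of the primitive, e.g. the training phase.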
\chapter{Mitigation}
\begin{table}[h]
\caption{Byte-UnixBench Benchmark for Coffee Lake with \retbleed{} mitigations disabled and enabled. ST represents the single-threaded case, while MT represents the multi-threaded case.}
\label{a:tab:rawIndexScore}
\begin{center}
\begin{tabular}{l || r r | r r}
\toprule
& \multicolumn{2}{c|}{Mitigation Off} & \multicolumn{2}{c}{Mitigation Auto}\\
Testcase & ST & MT & ST & MT \\\midrule
Dhrystone 2 using register variables & $5152.1$ & $39353.2$ & $5144.6$ & $39369.9$ \\
Double-Precision Whetstone & $1840.7$ & $18228.2$ & $1839.3$ & $18248.0$ \\
Execl Throughput & $1306.0$ & $8836.7$ & $1011.2$ & $7410.7$ \\
File Copy 1024 bufsize 2000 maxblocks & $2452.1$ & $5777.0$ & $1705.5$ & $4672.2$ \\
File Copy 256 bufsize 500 maxblocks & $1511.0$ & $3670.2$ & $1039.3$ & $2849.1$ \\
File Copy 4096 bufsize 8000 maxblocks & $5199.1$ & $11045.8$ & $3931.9$ & $10453.6$ \\
Pipe Throughput & $960.8$ & $6627.8$ & $643.2$ & $4431.5$ \\
Pipe-based Context Switching & $641.4$ & $3964.4$ & $567.4$ & $3058.2$ \\
Process Creation & $1057.4$ & $8014.8$ & $842.8$ & $6693.8$ \\
Shell Scripts (1 concurrent) & $3536.1$ & $19618.4$ & $3039.2$ & $16238.6$ \\
Shell Scripts (8 concurrent) & $12226.1$ & $17954.7$ & $9617.3$ & $14930.5$ \\
System Call Overhead & $445.1$ & $2959.5$ & $290.3$ & $1810.2$ \\\midrule
System Benchmarks Index Score & $1948.3$ & $9108.0$ & $1537.2$ & $7455.6$ \\
\bottomrule
\end{tabular}
\end{center}
\end{table}


With my signature I confirm that
\noindent
\begin{center}
\begin{tabular}{l@{\hskip .5in}l}
Zürich, 21.08.2022 & \includegraphics[scale=0.6]{../figures/SignatureSmall.pdf} \\
\makebox[2in]{\hrulefill} & \makebox[2.75in]{\hrulefill} \\
{\small{Place, Date}} & {\small{\authorName}}
\\[0.45in]


\clearpage
\input{./chapters/99-declarationOriginality.tex}
\end{document}