Intro to Assembly Language

Memory

Memory is where the temporary data and instructions of currently running programs are located. Computer memory is also known as Primary Memory.

Two main types of memory: Cache and Random Access Memory (RAM).

Cache

Cache is located within the CPU itself and hence is extremely fast compared to RAM, as it runs at the same clock speed as the CPU. It is very limited in size, very sophisticated, and expensive to manufacture because it sits so close to the CPU core.

The main benefit of cache memory is enabling the CPU to access upcoming instructions and data quicker than retrieving them from RAM.

Three levels of cache: L1, L2, and L3, in order of increasing size and decreasing speed.

RAM

RAM is much larger than cache, but accessing data from RAM takes many more clock cycles.

Example: retrieving an instruction from the registers takes only one clock cycle, and retrieving it from the L1 cache takes a few cycles, while retrieving it from RAM takes around 200 cycles…

The maximum possible RAM for a 32-bit OS was 2^32 bytes, which is only 4 GB, at which point we run out of unique addresses. With 64-bit, the address range goes up to 0xffffffffffffffff, a theoretical maximum of 2^64 bytes -> about 18.4 exabytes (18.4 million terabytes).
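The address-space arithmetic can be checked with a short Python sketch. The helper name address_space_bytes is made up for illustration, and the exabyte figure uses decimal units (10^18 bytes).

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits

max_32 = address_space_bytes(32)
max_64 = address_space_bytes(64)

print(max_32)                      # 4294967296 bytes
print(max_32 // 2**30)             # 4 (GiB)
print(hex(max_64 - 1))             # 0xffffffffffffffff, the highest 64-bit address
print(round(max_64 / 10**18, 1))   # 18.4 (decimal exabytes)
```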

RAM is split into four main segments: Stack, Heap, Data, and Text.

IO/Storage

Input/Output devices like keyboards, screens, or long-term storage units. The processor can access and control IO devices using Bus Interfaces, which act as ‘highways’ to transfer data and addresses, using electrical charges for binary data.

Each bus has a capacity of bits (or electrical charges) it can carry simultaneously, usually a multiple of 4 bits, ranging up to 128 bits. Bus interfaces are also usually used to access memory and other components outside the CPU itself. (Think about the PCB lanes on a board: those are the bus lines.)

CPU Architecture

The CPU contains both a Control Unit (CU), in charge of moving and controlling data, and an Arithmetic/Logic Unit (ALU), in charge of performing various arithmetic and logical calculations as requested by a program through assembly instructions.

Instruction Set Architecture (ISA) -> the way a CPU processes its instructions, which differs between architectures and also influences how efficiently the CPU processes them. This is why ARM binaries don’t run on x86 computers, and vice versa.

Example of writing the same instruction on different ISAs:

A single ISA may have several syntax interpretations for the same assembly code (add rax, 1 in Intel syntax vs. addq $0x1,%rax in AT&T syntax).

RISC -> based on processing simpler instructions; a program takes more cycles, but each one is shorter and uses less power

CISC -> based on fewer cycles, but each instruction takes more time and power to process

Clock Speed & Clock Cycle

Each CPU has a clock speed that indicates its overall speed. Every tick of the clock runs a clock cycle that processes a basic instruction, like fetching an address or storing an address. This is done by the CU or ALU.

The frequency at which cycles occur is counted in cycles per second (Hertz, Hz). Ex: If a CPU has a speed of 3.0 GHz, it can run 3 billion cycles every second (per core).
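As a rough sketch of what these numbers mean in time, the snippet below converts cycle counts to nanoseconds at the 3.0 GHz example speed. The function name cycles_to_nanoseconds is hypothetical, and the 200-cycle figure is the approximate RAM access cost quoted earlier in these notes.

```python
CLOCK_HZ = 3.0e9  # 3.0 GHz, the example speed from the text

def cycles_to_nanoseconds(cycles: int, clock_hz: float = CLOCK_HZ) -> float:
    """Time taken by a number of clock cycles, in nanoseconds."""
    return cycles / clock_hz * 1e9

print(cycles_to_nanoseconds(1))    # about 0.33 ns for one cycle
print(cycles_to_nanoseconds(200))  # about 66.7 ns for a ~200-cycle RAM access
```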


Instruction Cycle

Instruction Cycle -> the cycle it takes the CPU to process a single machine instruction

Four stages:

  1. Fetch -> take the next instruction’s address from the Instruction Address Register (IAR) that tells it where the next instruction is located
  2. Decode -> take instruction from the IAR and decode it from binary to see what is required to be executed
  3. Execute -> Fetch instruction operands from register/memory, and process the instruction in the ALU or CU
  4. Store -> Store the new value in the destination operand

All stages in the instruction cycle are carried out by the CU, but arithmetic instructions like “add, sub, etc.” are executed by the ALU.

Ex with add rax, 1, this is the instruction cycle:

  1. Fetch the instruction bytes 48 83 C0 01 from the address held in the rip register
  2. Decode 48 83 C0 01 to know it needs to perform an add of 1 to value at rax
  3. Get the current value at rax (by the CU), add 1 to it (by the ALU)
  4. Store the new value back to rax
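The four stages can be mimicked with a toy Python model. This is only a sketch under loud assumptions: the memory and registers objects and the step function are invented for illustration, and the "decoder" recognises nothing except the single 48 83 C0 01 encoding.

```python
# Toy model: a tiny "memory" holding the machine code of `add rax, 1`,
# and a register file with rip pointing at it. The starting rax is arbitrary.
memory = bytes([0x48, 0x83, 0xC0, 0x01])
registers = {"rip": 0, "rax": 41}
MASK_64 = 0xFFFFFFFFFFFFFFFF

def step(memory: bytes, registers: dict) -> None:
    # 1. Fetch: read the instruction bytes at the address held in rip.
    start = registers["rip"]
    raw = memory[start:start + 4]
    registers["rip"] = start + len(raw)

    # 2. Decode: this sketch recognises only the one encoding from the text
    #    (REX.W prefix 0x48, opcode 0x83, ModRM 0xC0 = add to rax, then imm8).
    #    Real hardware sign-extends imm8; the value 0x01 is unaffected.
    if raw[:3] != bytes([0x48, 0x83, 0xC0]):
        raise ValueError("unsupported instruction in this toy decoder")
    destination, immediate = "rax", raw[3]

    # 3. Execute: read the operand (CU) and add the immediate (ALU).
    result = (registers[destination] + immediate) & MASK_64

    # 4. Store: write the result back to the destination operand.
    registers[destination] = result

step(memory, registers)
print(registers)  # {'rip': 4, 'rax': 42}
```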

The above executes instructions sequentially, as was done in the past. With modern multi-thread and multi-core designs, processors can process multiple instructions in parallel by having multiple instruction/clock cycles running at the same time.

TIP: To find out what architecture the system supports, use lscpu. uname -m also prints the CPU architecture.

Instruction Set Architectures

Instruction Set Architecture (ISA) specifies the syntax and semantics of assembly language on each architecture. It is not just a different syntax: it affects the way and order in which instructions are executed and their level of complexity.

ISA consists of: instructions, registers, memory addresses, and data types.

There are two main ISAs in wide use:

  1. Complex Instruction Set Computer (CISC) -> Intel and AMD processors in most computers and servers
  2. Reduced Instruction Set Computer (RISC) -> ARM and Apple processors, in most smartphones and some modern laptops

CISC vs RISC

Registers, Addresses, and Data Types

Registers

Each CPU core has a set of registers. They are the fastest components in any computer, since they sit within the CPU core itself.

Two main types of registers to focus on: Data Registers and Pointer Registers

Data Registers    Pointer Registers
rax               rbp
rbx               rsp
rcx               rip
rdx
r8
r9
r10

Sub-Registers

Each 64-bit register can be divided into smaller sub-registers of 8 bits (1 byte), 16 bits (2 bytes), and 32 bits (4 bytes).


Size in bits    Size in bytes    Name                                    Example
16-bit          2 bytes          the base name                           ax
8-bit           1 byte           base name and/or ends with l            al
32-bit          4 bytes          base name + starts with the e prefix    eax
64-bit          8 bytes          base name + starts with the r prefix    rax
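A quick way to picture sub-registers is as the low bits of the full register. The Python sketch below uses bit masks on an arbitrary example value; it models the naming only, not how the CPU implements register access.

```python
rax = 0x1122334455667788  # arbitrary example value

eax = rax & 0xFFFFFFFF  # lower 32 bits
ax = rax & 0xFFFF       # lower 16 bits
al = rax & 0xFF         # lower 8 bits

print(hex(eax))  # 0x55667788
print(hex(ax))   # 0x7788
print(hex(al))   # 0x88
```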

Description                        64-bit Register    32-bit Register    16-bit Register    8-bit Register
Data/Arguments Registers
Syscall Number/Return value        rax                eax                ax                 al
Callee Saved                       rbx                ebx                bx                 bl
1st arg - Destination operand      rdi                edi                di                 dil
2nd arg - Source operand           rsi                esi                si                 sil
3rd arg                            rdx                edx                dx                 dl
4th arg - Loop counter             rcx                ecx                cx                 cl
5th arg                            r8                 r8d                r8w                r8b
6th arg                            r9                 r9d                r9w                r9b
Pointer Registers
Base Stack Pointer                 rbp                ebp                bp                 bpl
Current/Top Stack Pointer          rsp                esp                sp                 spl
Instruction Pointer ‘call only’    rip                eip                ip                 -

Note: there are various other registers, like the RFLAGS register, which maintains various flags used by the CPU, such as the zero flag ZF used for conditional instructions.

Memory Addresses

RAM is segmented into various regions like the stack, the heap, and other kernel-specific regions.

Each memory region has specific read, write, and execute permissions that specify if you can read from it, write to it, or call an address in it.

When an instruction goes through the Instruction Cycle to be executed, the first step is to fetch the instruction from the address it’s located at.

Different types of address fetching (i.e. addressing modes) in x86:

Addressing Mode    Description                                                        Example
Immediate          value is given within the instruction                              add 2
Register           register name that holds the value is given in the instruction    add rax
Direct             direct full address is given in the instruction                    call 0xffffffffaa8a25ff
Indirect           reference pointer is given in the instruction                      call 0x44d000 or call [rax]
Stack              address is on top of the stack                                     add rsp
In the table above, lower is slower. The less immediate the value is, the slower it is to fetch it.

Speed isn’t the biggest concern when learning Assembly, but understanding where and how each address is located helps in binary exploitation like Buffer Overflows. The same understanding has even more significant implications for advanced binary exploitation, like ROP or Heap exploitation.


Address Endianness

Endianness is the order in which the bytes of an address or value are stored in or retrieved from memory. Two types:

  1. Little-endian -> the least significant byte is stored first
  2. Big-endian -> the most significant byte is stored first

For the address 0x0011223344556677 to be stored in memory, little-endian processors store the 0x00 byte first in the right-most position, then the 0x11 byte so it becomes 0x1100, then the 0x22 byte so it becomes 0x221100, and so on. Once all bytes are in place, it would look like 0x7766554433221100.

Another example shows how this can affect stored values in binary. For the 2-byte integer 426, the binary representation is 00000001 10101010. Big-endian processors would store these bytes as 00000001 10101010 (left-to-right), while little-endian processors store them as 10101010 00000001 (right-to-left). If the bytes are read back in the wrong order, the value becomes 43521.

When we retrieve the value, the processor has to use the same endianness used when storing it, or it will get the wrong value. The order in which the bytes are stored and retrieved therefore makes a big difference.
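Both examples can be reproduced in Python with int.to_bytes and int.from_bytes. This is a sketch of the byte ordering only; the variable names are illustrative.

```python
address = 0x0011223344556677

little = address.to_bytes(8, byteorder="little")
big = address.to_bytes(8, byteorder="big")
print(little.hex())  # 7766554433221100
print(big.hex())     # 0011223344556677

# The 2-byte integer 426 from the text, stored big-endian...
stored = (426).to_bytes(2, byteorder="big")
print(stored.hex())  # 01aa
# ...and read back with the wrong (little-endian) byte order.
print(int.from_bytes(stored, byteorder="little"))  # 43521
print(int.from_bytes(stored, byteorder="big"))     # 426
```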

For this course, we’ll be using little-endian order, which is used by Intel/AMD x86 processors in most modern operating systems.

This means the bytes we store into memory go from right to left.

Seems counter-intuitive since people are used to reading from left-to-right. But there are multiple advantages when processing data, like being able to retrieve a sub-register without having to go through the entire register or being able to perform arithmetic in the correct order right-to-left.

Data Types

x86 arch supports many types of data sizes, which can be used with various instructions. Here are the most common data types:

Component              Length                Example
byte                   8 bits                0xab
word                   16 bits - 2 bytes     0xabcd
double word (dword)    32 bits - 4 bytes     0xabcdef12
quad word (qword)      64 bits - 8 bytes     0xabcdef1234567890

Whenever we use a variable with a certain data type or use a data type with an instruction, both operands should be of the same size!

Example: can’t use a variable defined as byte with rax, since rax is 8 bytes. We would have to use al, which is 1 byte.

Sub-register    Data Type
al              byte
ax              word
eax             dword
rax             qword
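The four sizes line up with the standard-size integer codes in Python's struct module, which gives a loose analogy for the size-matching rule. The FORMATS mapping below is an illustrative assumption, not something x86 defines.

```python
import struct

# struct format codes with standard sizes ("<" = little-endian, no padding).
FORMATS = {"byte": "<B", "word": "<H", "dword": "<I", "qword": "<Q"}

for name, fmt in FORMATS.items():
    print(name, struct.calcsize(fmt), "byte(s)")  # 1, 2, 4, 8

print(struct.pack("<H", 0xabcd).hex())  # cdab (little-endian word)

# A value that does not fit the chosen size is rejected, much like a
# size mismatch between operands.
try:
    struct.pack("<B", 0xabcd)
except struct.error as exc:
    print("does not fit in a byte:", exc)
```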

Assembly File Structure

We need to first understand the general structure of an assembly file and then how to assemble it and debug it.

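         global  _start                 ; export the entry point for the linker

         section .data
message: db      "Hello HTB Academy!"   ; the 18-byte string to print

         section .text
_start:
         mov     rax, 1                 ; syscall number 1 = write
         mov     rdi, 1                 ; 1st arg: file descriptor 1 (stdout)
         mov     rsi, message           ; 2nd arg: address of the string
         mov     rdx, 18                ; 3rd arg: number of bytes to write
         syscall

         mov     rax, 60                ; syscall number 60 = exit
         mov     rdi, 0                 ; 1st arg: exit code 0
         syscall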

Looking at the vertical parts of the code, each line can have three elements:

  1. Labels
  2. Instructions
  3. Operands
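To make the three columns concrete, here is a deliberately naive Python sketch that splits a line of the listing into those elements. The function split_line is hypothetical and is not how an assembler parses code: it does not handle comments, and it would break on a string operand containing a comma.

```python
def split_line(line: str):
    """Split one assembly line into (label, instruction, operands)."""
    label = None
    text = line.strip()
    first, _, rest = text.partition(" ")
    if first.endswith(":"):          # a label is written with a trailing colon
        label, text = first[:-1], rest.strip()
    instruction, _, operand_text = text.partition(" ")
    operands = [op.strip() for op in operand_text.split(",") if op.strip()]
    return label, instruction or None, operands

print(split_line("_start:"))                  # ('_start', None, [])
print(split_line("         mov     rax, 1"))  # (None, 'mov', ['rax', '1'])
print(split_line('message: db      "Hello HTB Academy!"'))
```

Treating db as an "instruction" follows the column layout of the listing; strictly speaking it is an assembler directive.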
