Wednesday, March 26, 2025

Hard Won Esoteric Knowledge

I do my development in Ubuntu 22.04 Linux running on the Windows Subsystem for Linux (WSL). I recently got a laptop refresh and the latest software doesn’t run. The fix is obscure, so I thought I’d document it.

sbcl runs fine out of the box in WSL2, but I’m encountering a bug where TCP connections to one particular server are being left in the CLOSE_WAIT state indefinitely. After several minutes, I hit the limit on the number of open files.
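One way to watch this happen is to count the sockets stuck in CLOSE_WAIT. The following is a minimal sketch, assuming a Linux /proc filesystem; count_close_wait is a name made up for this illustration, and the count covers every process in the network namespace, not just sbcl.

```shell
# Count TCP sockets in CLOSE_WAIT by reading the kernel's socket tables.
# In /proc/net/tcp the 4th column is the socket state in hex; 08 is CLOSE_WAIT.
# count_close_wait is a hypothetical helper name, not a standard tool.
count_close_wait() {
    # With no arguments, read the IPv4 and IPv6 tables;
    # otherwise read the files given on the command line.
    if [ "$#" -eq 0 ]; then
        set -- /proc/net/tcp /proc/net/tcp6
    fi
    # Skip unreadable files quietly. The header line never matches,
    # because its 4th field is the literal text "st".
    for f in "$@"; do
        [ -r "$f" ] && cat "$f"
    done | awk '$4 == "08" { n++ } END { print n + 0 }'
}

# Usage: run `count_close_wait` every so often and watch the number climb.
```

If the number keeps growing while the program runs, the leak is real and the open-file limit is only a matter of time.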

The “right thing” would be to track down who isn’t closing the connection properly, but it’s only a few hundred connections. It appears that the open-file ulimit is set to 1024, which is pretty easy to hit with this bug. Bumping the ulimit to something more reasonable is a lazy workaround. It isn’t a solution, since I’m still leaking open files, but I’ll be able to leak thousands of them without having problems.

But increasing the nofile limit turned out to be a problem. I edited all the magic files in /etc until they all said I could have 131071 open files. When I re-started WSL, all the ways I could start a shell agreed that the ulimit was 131071, yet when I started sbcl and ran this:

(uiop:run-program "prlimit" :output *standard-output*)

RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
AS         address space limit                unlimited unlimited bytes
CORE       max core file size                         0 unlimited bytes
CPU        CPU time                           unlimited unlimited seconds
DATA       max data size                      unlimited unlimited bytes
FSIZE      max file size                      unlimited unlimited bytes
LOCKS      max number of file locks held      unlimited unlimited locks
MEMLOCK    max locked-in-memory address space  67108864  67108864 bytes
MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
NICE       max nice prio allowed to raise             0         0 
NOFILE     max number of open files                1024   1048576 files
NPROC      max number of processes                62828     62828 processes
RSS        max resident set size              unlimited unlimited bytes
RTPRIO     max real-time priority                     0         0 
RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
SIGPENDING max number of pending signals          62828     62828 signals
STACK      max stack size                       8388608 unlimited bytes
NIL
NIL
0 (0 bits, #x0, #o0, #b0)

The soft NOFILE limit was still at the old value of 1024.

WSL was launching sbcl directly, without a shell, so none of the shell startup files that set the ulimit were being run.
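A quick way to confirm what a running process actually got, without going through the Lisp REPL, is to read its limits from /proc. This is a minimal sketch, assuming a Linux /proc filesystem; nofile_limits is a name invented for this example.

```shell
# Print the soft and hard open-file limits of a running process.
# nofile_limits is a hypothetical helper name. Usage: nofile_limits PID
nofile_limits() {
    # The relevant line of /proc/PID/limits looks like:
    #   Max open files            1024                 1048576              files
    awk '/^Max open files/ { print $4, $5 }' "/proc/$1/limits"
}

# Example: limits of the current shell. For sbcl, pass its PID instead,
# for instance the output of `pgrep -n sbcl` (assuming pgrep is installed).
nofile_limits "$$"
```

If the shell reports 131071 but the sbcl process reports 1024, the process was not started through the shell configuration that raises the limit.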

The solution is easy, but it took me a long time to figure it out. Not only do you need to edit all the magic files in /etc and add ulimit statements to your .bashrc, you also need to add ulimit statements to your .profile and then instruct wsl to launch your program under a login shell:

(require 'sly)
(setq sly-lisp-implementations
      '((sbcl ("C:\\Program Files\\WSL\\wsl.exe"
               "--distribution-id" "{df4f07a6-2142-405c-8a6a-63f1ca3a7e8d}"
               "--cd" "~"
               "--shell-type" "login"
               "/usr/local/bin/sbcl"))))

This bit of insanity allows me to run sbcl with 131071 open files in Linux as my inferior lisp program in a Windows Emacs running SLY. (Running Emacs under Windows gives me a way to use a modified Dvorak keyboard. I could run Emacs in the Linux subsystem, but the Wayland server is in a container and doesn’t let you modify the keyboard.)
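For the ulimit statements in .profile, something along these lines should do. It is a sketch rather than a copy of my actual file: raise_nofile is a hypothetical helper name, and it only raises the soft limit as far as the hard limit allows, so it cannot fail on a machine where /etc has not been edited yet.

```shell
# Hypothetical ~/.profile fragment: raise the soft open-file limit for
# login shells. raise_nofile is an illustrative name, not a standard command.
raise_nofile() {
    want=$1
    hard=$(ulimit -H -n)
    if [ "$hard" = "unlimited" ] || [ "$hard" -ge "$want" ]; then
        # The hard limit is high enough: take the requested value.
        ulimit -S -n "$want"
    else
        # Otherwise settle for as much as the hard limit permits.
        ulimit -S -n "$hard"
    fi
}

raise_nofile 131071
```

Because .profile is only read by login shells, this has no effect unless the program is launched under one, which is what the --shell-type login argument above arranges.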

3 comments:

Anonymous said...

This bug sounds like it could belong to an Embrace Extend Exterminate strategy of some "leading" company. I doubt that Canonical would distribute a Ubuntu version with such a bug. Anyway, the Emacs-Windows thing: why use Wayland when Xorg would do? Just because Wayland is the new thing, which proclaims to be a better architecture? Xorg is more mature, faster, has fewer bugs and better tools. :)

Anonymous said...

Seems super complicated. Not sure why folks claim Linux is hard to use when Windows requires all these hoops.

Joe Marshall said...

It is complicated. But I wouldn't attribute it all to Windows. It is me trying to run both at the same time. The fact that it works at all is pretty remarkable.

Update: I found the socket leak. The server I was contacting was sending a JSON reply, but was sending extra characters after the JSON. After reading the JSON, I closed the stream, but since there were unread characters pending, the OS kept the socket open. I added some code to drain the stream after reading the JSON and all the sockets in CLOSE_WAIT state disappeared.

I'm not sure if this is a bug. If you close an input stream (or abandon it to garbage collection) and it still has pending characters, maybe the system should drain the stream for you. There is no way to drain it yourself after you've closed or abandoned it.