I do my development in Ubuntu-22.04 Linux running on the Windows Subsystem for Linux. I recently got a laptop refresh and the latest software doesn’t run. The fix is obscure, so I thought I’d document it.
sbcl runs fine out of the box in WSL2, but I’m encountering a bug where TCP connections to one particular server are being left in the CLOSE_WAIT state indefinitely. After several minutes, I hit the limit on the number of open files.
The “right thing” would be to track down who isn’t closing the connection properly, but it’s only a few hundred connections. It appears that ulimit is set to 1024, which is pretty easy to hit with this bug.

Bumping ulimit to something more reasonable is a lazy workaround. It isn’t a solution (I’m still leaking open files), but I’ll be able to leak thousands of them without having problems.
But increasing nofiles turned out to be a problem. I edited all the magic files in /etc until they all said I could have 131071 open files.
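For concreteness, the load-bearing entries look something like this (a sketch with the value from this post; limits.conf is one of the magic files, though the complete set varies by setup):

    # /etc/security/limits.conf: per-user open-file limits
    *    soft    nofile    131071
    *    hard    nofile    131071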
When I restarted WSL, all the ways I could start a shell agreed that the ulimit was 131071, yet when I started sbcl and ran this:
(uiop:run-program "prlimit" :output *standard-output*)
RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
AS         address space limit                unlimited unlimited bytes
CORE       max core file size                         0 unlimited bytes
CPU        CPU time                           unlimited unlimited seconds
DATA       max data size                      unlimited unlimited bytes
FSIZE      max file size                      unlimited unlimited bytes
LOCKS      max number of file locks held      unlimited unlimited locks
MEMLOCK    max locked-in-memory address space  67108864  67108864 bytes
MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
NICE       max nice prio allowed to raise             0         0
NOFILE     max number of open files                1024   1048576 files
NPROC      max number of processes                62828     62828 processes
RSS        max resident set size              unlimited unlimited bytes
RTPRIO     max real-time priority                     0         0
RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
SIGPENDING max number of pending signals          62828     62828 signals
STACK      max stack size                       8388608 unlimited bytes
NIL
NIL
0 (0 bits, #x0, #o0, #b0)
The limit was at the old value of 1024.
WSL launched sbcl directly, without a shell, so the startup files containing my ulimit commands were never run.
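The difference is easy to demonstrate from the Windows side with wsl.exe’s --shell-type flag (a sketch; the before-and-after numbers are the ones from this post):

    REM Launched without a login shell: startup files are skipped, so this prints 1024.
    wsl.exe --shell-type none sh -c "ulimit -n"

    REM Launched under a login shell: ~/.profile runs first, so this prints 131071
    REM once the fix below is in place.
    wsl.exe --shell-type login sh -c "ulimit -n"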
The solution is easy, but it took me a long time to figure it out. Not only do you need to edit all the magic in /etc and add ulimit statements to your .bashrc, you should also add ulimit statements to your .profile, and then instruct wsl to launch your program under a login shell:
(require 'sly)
(setq sly-lisp-implementations
      '((sbcl ("C:\\Program Files\\WSL\\wsl.exe"
               "--distribution-id" "{df4f07a6-2142-405c-8a6a-63f1ca3a7e8d}"
               "--cd" "~"
               "--shell-type" "login"
               "/usr/local/bin/sbcl"))))
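The ulimit statements themselves are one-liners near the top of the dotfiles (a sketch; the exact placement is a matter of taste):

    # ~/.profile and ~/.bashrc: raise the soft limit toward the hard limit
    ulimit -n 131071

The "--shell-type" "login" argument is the crucial bit: it makes wsl.exe run .profile before handing control to sbcl.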
This bit of insanity allows me to run sbcl with 131071 open files in Linux as my inferior lisp program in a Windows Emacs running SLY. (Running Emacs under Windows gives me a way to use a modified Dvorak keyboard. I could run Emacs in the Linux subsystem, but the Wayland server is in a container and doesn’t let you modify the keyboard.)
5 comments:
This bug sounds like it could belong to an Embrace-Extend-Exterminate strategy of some "leading" company. I doubt that Canonical would distribute an Ubuntu version with such a bug. Anyway, the Emacs-Windows thing: why use Wayland when Xorg would do? Just because Wayland is the new thing, which proclaims to be a better architecture? Xorg is more mature, faster, has fewer bugs and better tools. :)
Seems super complicated. Not sure why folks claim Linux is hard to use when Windows requires all these hoops.
It is complicated. But I wouldn't attribute it all to Windows. It is me trying to run both at the same time. The fact that it works at all is pretty remarkable.
Update: I found the socket leak. The server I was contacting was sending a JSON reply, but was sending extra characters after the JSON. After reading the JSON, I closed the stream, but since there were unread characters pending, the OS kept the socket open. I added some code to drain the stream after reading the JSON and all the sockets in CLOSE_WAIT state disappeared.
I'm not sure if this is a bug. If you close an input stream (or abandon it to garbage collection) and it still has pending characters, maybe the system should drain the stream for you. There is no way to drain it yourself after you've closed or abandoned it.
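The drain itself is a one-liner. A minimal sketch of the idea, assuming a character stream (a binary stream would use read-byte instead):

    ;; Discard any unread input so that closing the stream releases the socket.
    (defun drain-stream (stream)
      (loop while (read-char stream nil nil)))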
"I closed the stream, but since there were unread characters pending, the OS kept the socket open." — strange. If a socket with unread data is close(2)'d, both Windows and Linux will send RST to the other side of it, and then close it unconditionally. I suspect it's the SBCL's logic that never closes such sockets for some reason?
Anyway, Linux's default limit of 1024 open files is ridiculous, of course; Windows XP Home Edition has zero problems with a million open file handles. It takes it quite some time to open them all, yes, but it doesn't run out of handles at any point. The reason this limit exists on Linux is, I believe, because of the way select(2) has always been implemented on UNIXen (and on Linux as well): the fd_set struct is defined in the system-wide header to be large enough to fit only 1024 file descriptors, and this definition is compiled into the binaries you run. So if you change your ulimit, and a program then calls FD_SET with an fd equal to 1500 or something, it will actually smash its own stack and die with SIGSEGV very soon. I've been bitten by this enough times to just stop using select(2) and always use either epoll(2) or kqueue(2) on non-Windows platforms. It's not like Linux's implementation of select(2) actually follows POSIX anyhow (neither does Windows, though it deviates in a different way; what a joy).
Anyway, Windows never had this problem with select: its version of FD_SET has always been a dynamically allocated list, which is slower and takes more memory, but at least it doesn't segfault. They had to do it this way since their file handles were never small consecutive integers to begin with.
Well, I don't know. I arranged to have an unwind-protect close the stream, but this just accumulated streams in the CLOSE_WAIT state. I even tried close with :abort t to no avail. What finally worked was reading the stream to EOF before calling close.
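Concretely, the shape that finally worked looks something like this (a sketch; parse-json stands in for whatever actually reads the reply):

    (defun read-reply (socket-stream)
      (unwind-protect
           (parse-json socket-stream) ; hypothetical: the real JSON reader
        ;; Drain to EOF before closing. Safe here because the peer has already
        ;; closed its end (that's what CLOSE_WAIT means), so read-char can't
        ;; block forever.
        (loop while (read-char socket-stream nil nil))
        (close socket-stream)))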