[syslinux] truncated files on write with tftpd-hpa

Alan Sundell asundell+syslinux at fas.harvard.edu
Wed May 7 10:25:08 PDT 2003


On Fri, May 02, 2003 at 05:00:08PM -0400, Alan Sundell wrote:
> On Thu, May 01, 2003 at 11:31:35PM -0700, H. Peter Anvin wrote:
> > This is bizarre.  Do you have an strace -f -tt ?
> 
> Not at the moment.  I don't think DigitalUNIX has that kind of strace,
> but I'll see what I can do (unfortunately, testing this involves me
> taking one of the network guys hostage to upload a switch
> configuration again and again).  That will have to be a next week sort
> of thing.

Well, I couldn't get you exactly what you asked for, but I think I've
got enough.

I tried a trace -f (there is no -t equivalent for DigitalUNIX's trace),
several times, but the trace seems to slow down the child so much that
all TFTP transfers timeout.  The server spends over 10 seconds in NIS
calls scanning the group map, I think because of initgroups(), so the
client gives up.

However, now I'm beginning to think that NIS slowness is actually at
least part of what's causing these symptoms to show up in the first
place, so I think the data from the trace are still useful.  The config
file still ends up truncated eventually (after the NIS calls finish), it
just never gets written successfully, since all the transfers timeout
and get the ICMP response from the client.


So, basically what I am saying is that even though the trace slows
things down a lot, which changes the behavior a bit, it still produces
the failure behavior, which is what I think we're interested in, anyway.

I hope you'll pardon me for filtering the trace a bit, since the full
file is almost 6 megabytes.  If you want me to send it to you, I can do
so prviately.  Anyway, following the first child PID (there are ten, one
for each WRQ, and they behave identically) as it forks off the parent,
we see something like:

| Tracing process /proc/25861
| 1 [ , {}, {}, {}, ]
| setsockopt (4, 0, 7, 0x11ffff870, 4) = 0
| syscall (4ld, 536868720ld, 0ld, 0ld, 0ld, 1073747528ld) = 29
| fork () = 26349
| Tracing process /proc/26349
| 26349
| [26349]: sigaction (SIGHUP, {{0x1,0x0,0}}, (struct sigaction *)0) = 0
| [26349]: close (4) = 0
| [26349]: socket (AF_INET,DGRAM,PF_UNSPEC) = 4
[beginning of initgroups()]
| [26349]: open ("/etc/group", O_RDONLY, 0666) = 6
| [26349]: fstat (6, 0x11fffd570) = 0 [ , <3822.482572.390 -rw-r--r-- 1 root system 386 1052325582,1052325578,1052325578> ]
| [26349]: ioctl (6, 0x40067408<Out,TIOCGETP,6>, 11fffd548) = -1 (Not a typewriter)
| [26349]: read (6, 0x40044008, 8192) = 386 [, "system:*:0:..., ]
| [26349]: getdomainname ("[our NIS domain]", 256) = 0
| [26349]: getpid () = 26349
| [26349]: close (5) = 0
| [26349]: open ("/var/yp/binding/[our NIS domain].2", O_RDONLY, 03777752550) = 5
[...NIS calls...]
| [26349]: setgroups (1, 0x11ffff808) = 0
[end of initgroups()]
| [26349]: chroot (".") = 0
| [26349]: syscall (65534ld, 65534ld, 65534ld, 0ld, 0ld, 536860344ld) = 0
| [26349]: sigaltstack () = 0
| [26349]: syscall (112ld, 112ld, 112ld, 0ld, 0ld, 536860344ld) = 0
| [26349]: sigaltstack () = 0
| [26349]: bind (4, 0x11ffff8e0=<2/INET, 0, 0.0.0.0>, 16) = 0
| [26349]: connect (4, 0x140011900=<2/INET, 1050, 140.247.14.128>, 16) = 0
| [26349]: gettimeofday (0x11fffd678, 0x0) = 0
| [26349]: open ("/etc/zoneinfo/localtime", O_RDONLY, 0151) = -1 (No such file or directory)
| [26349]: gettimeofday (0x11fffd5f0, 0x11fffd5e8) = 0
| [26349]: getpid () = 26349
| [26349]: send (3, 0x11fffd778, 122ld, 0ld) = 122
| [26349]: open ("/switch.config", O_WRONLY|O_TRUNC, 0666) = 6
| [26349]: fstat (6, 0x11ffff750) = 0 [ , <129.0.1839 -rw-rw-r-- 1 noc noc 0 1052268306,1052325634,1052325634> ]
| [26349]: fcntl (6, F_GETFL, 438) = 1
| [26349]: sigprocmask (1, 0x0, 0x0) = 0
| [26349]: send (4, 0x140011910, 4ld, 0ld) = 4
| [26349]: select (5, 0x11ffff530={0x00000010}, 0x0={}, 0x0={}, 0x11ffff530={1,0}) = 1 [ , {0x00000010}, {}, {}, ]
| [26349]: fcntl (4, F_GETFL, 0) = 2
| [26349]: fcntl (4, F_SETFL, 6) = 0
| [26349]: recv (4, 0x1400218d4, 65468ld, 0ld) = -1 (Connection refused)
| [26349]: fcntl (4, F_GETFL, 0) = 6
| [26349]: fcntl (4, F_SETFL, 2) = 0
| [26349]: gettimeofday (0x11fffd618, 0x0) = 0
| [26349]: getpid () = 26349
| [26349]: getpid () = 26349
| [26349]: table (PROCINFO, 26349, 0x1fffa770, 1, 120) = 1
| [26349]: getgid () = 65534
| [26349]: open ("/usr/lib/nls/msg/C/libc.cat", O_RDONLY, 072) = -1 (No such file or directory)
| [26349]: open ("/usr/share/.msg_conv-C", O_RDONLY, 0352) = -1 (No such file or directory)
| [26349]: send (3, 0x11fffd718, 66ld, 0ld) = 66
| [26349]: close (0) = 0
| [26349]: close (1) = 0
| [26349]: close (2) = 0
| [26349]: close (6) = 0
| [26349]: close (5) = 0
| [26349]: sigprocmask (1, 0xfffffffffffff137, 0x0) = 0
| [26349]: exit (0) = (0)


FD 3 is the syslog socket, so the sends there correspond to these two
syslog messages:

| May  7 11:40:34 noc2 tftpd[26349]: WRQ from [client] filename switch.config remapped to /switch.config
| May  7 11:40:34 noc2 tftpd[26349]: tftpd: read: Connection refused


I think that fits with the hypothesis I put forth in my previous
message.

Finally, to tie everything together and give some sense of overall
timing, here's the tcpdump:

| 12:39:58.143347 [client].1050 > [server].tftp: 29 WRQ "switch.config"
| 12:39:59.640360 [client].1050 > [server].tftp: 29 WRQ "switch.config"
| 12:40:01.163891 [client].1050 > [server].tftp: 29 WRQ "switch.config"
| 12:40:02.684298 [client].1050 > [server].tftp: 29 WRQ "switch.config"
| 12:40:04.382678 [client].1050 > [server].tftp: 29 WRQ "switch.config"
| 12:40:05.817275 [client].1050 > [server].tftp: 29 WRQ "switch.config"
| 12:40:07.342511 [client].1050 > [server].tftp: 29 WRQ "switch.config"
| 12:40:08.866143 [client].1050 > [server].tftp: 29 WRQ "switch.config"
| 12:40:10.386543 [client].1050 > [server].tftp: 29 WRQ "switch.config"
| 12:40:11.906220 [client].1050 > [server].tftp: 29 WRQ "switch.config"
| 12:40:34.351835 [server].3424 > [client].1050: udp 4
| 12:40:34.359648 [client] > [server]: icmp: [client] udp port 1050 unreachable
| 12:40:38.213899 [server].3425 > [client].1050: udp 4
| 12:40:38.219764 [client] > [server]: icmp: [client] udp port 1050 unreachable
| 12:40:45.168404 [server].3426 > [client].1050: udp 4
| 12:40:45.175240 [client] > [server]: icmp: [client] udp port 1050 unreachable
| 12:40:46.769020 [server].3427 > [client].1050: udp 4
| 12:40:46.776832 [client] > [server]: icmp: [client] udp port 1050 unreachable
| 12:40:48.367667 [server].3428 > [client].1050: udp 4
| 12:40:48.373527 [client] > [server]: icmp: [client] udp port 1050 unreachable
| 12:40:48.842292 [server].3429 > [client].1050: udp 4
| 12:40:48.848151 [client] > [server]: icmp: [client] udp port 1050 unreachable
| 12:40:49.188972 [server].3430 > [client].1050: udp 4
| 12:40:49.197761 [client] > [server]: icmp: [client] udp port 1050 unreachable
| 12:40:49.723166 [server].3431 > [client].1050: udp 4
| 12:40:49.733908 [client] > [server]: icmp: [client] udp port 1050 unreachable
| 12:40:50.091330 [server].3432 > [client].1050: udp 4
| 12:40:50.108908 [client] > [server]: icmp: [client] udp port 1050 unreachable
| 12:40:50.266135 [server].3433 > [client].1050: udp 4
| 12:40:50.274924 [client] > [server]: icmp: [client] udp port 1050 unreachable


It seems to me at this point that we might be able to make some changes
to the way we do group lookups to speed up initgroups() and possibly
reduce the frequency of the problem, but it also seems like this sort of
problem could occur for other reasons, too (e.g. filesystem slowness),
and that the daemon shouldn't be truncating these files, so it would be
nice to see some sort of resolution in the code, as well.

Thanks for your help,

--Alan Sundell



More information about the Syslinux mailing list