[syslinux] EFI & PXE-booting: very slow TFTP performance on a VMWare test setup

Gene Cumm gene.cumm at gmail.com
Wed Jun 17 10:02:51 PDT 2015


On Jun 17, 2015 11:27 AM, "Oscar Roozen via Syslinux" <syslinux at zytor.com>
wrote:
>
> Dear people on the Syslinux Mailinglist,
>
> Are there any known problems with the performance of TFTP in (U)EFI
> environments in general or maybe just on VMWare?

I don't believe so.  I use Workstation 10.0.2 on Ubuntu 12.04 for the
moment.

> Right now I'm running tests on two virtual environments using either
> VMWare Workstation 11.1.0 on a Fedora 21 system or VMWare Player 7.1.0
> on an Ubuntu 14.10 system. Both virtual systems support PXE booting in
> UEFI mode. Both systems behave exactly the same.
>
> The test-setup is as follows:
>
> Master:
>   2 CPU, 2GB RAM,
>   Network Adapter "Host-only"
>   Centos 7.1
>   * syslinux-6.03
>   * isc-dhcpd-V3.0.5-RedHat
>   * atftp-0.7.1
>
> Slaves:
>   2 CPU, 2GB RAM,
>   Network Adapter "Host-only"
>   Configured to boot from the network only

I hope this is a hexa-core host with 6+ GiB of RAM.  Have you considered
single-vCPU client VMs?  I presume you disabled VMware's isc-dhcp on vmnet1?

> The Master is configured to provide a pxelinux.0 and supporting stuff
> through tftp. The client boots, gets a DHCP answer, loads pxelinux.0,
> presents a menu through menu.c32, defaults to loading the kernel and an
> initrd and starts booting from there. No problems here in BIOS mode. The
> whole process takes less than 8 seconds, measured from the moment
> I hit enter in the menu.
>
> One alternate choice in our menu is loading a rescue-environment that
> has a much bigger initrd (87M instead of 13M) which loads in 12 seconds.

Nice.

> Now I'm testing the same setup in UEFI mode. The client boots, gets an
> answer from the DHCPd, loads efi64/syslinux.efi and the supporting stuff
> from the efi64-subtree, presents a menu, starts loading the kernel and
> initrd and continues booting from there. Sounds good? Yes, but instead
> of 8 seconds, this all takes at least a few minutes! And quite often the
> system just hangs before even presenting the menu or presents an empty
> menu, indicating it failed to load one of the .c32 files or menu
> definitions. No errors are shown. Sometimes VMWare pop-ups appear
> indicating the machine crashed on invalid code or could not load an OS.

Good idea to serve from a subtree.

> If the boot process finishes successfully, the kernel and initrd go on
> into our custom installer and after the slave is up and running, its
> network performance (actually, all performance) is equal to the slave
> that was booted in BIOS mode.
>
> It turns out to be impossible to load the rescue environment though.
> Loading the initrd of 87M takes longer than 15 minutes. After 15 minutes
> the VMWare workstation (or is it syslinux?) decides to reboot.

The VM's vmware.log might help.  Is it exactly 15 minutes from hitting
enter on your selection?  15 minutes of loading the initrd?

> By the way, this is no ARP issue or anything like that. I did check that
> both slaves have different MAC addresses and I think I covered most
> other obvious causes. The only obvious thing I have not yet tried is to
> run the setup on real machines or another Virtual Environment.
>
> (Actually, I tried VirtualBox first but could not get it to PXE boot in
> UEFI mode)
>
> So, the simple question is: Is anyone else experiencing slow performance
> while network-booting in UEFI mode? Is it caused by the UEFI API
> implementation of VMWare, is there an issue in gnu-efi or is it even the
> way syslinux interacts with the EFI-layer that causes this?

Not sure, but worth investigating.

> This is the question I've been asking myself for the past few days.
>
> I've been debugging the syslinux code by inspection and adding debug
> statements everywhere. I also removed "git submodule update --init"
> from efi/build-gnu-efi.sh so I could rebuild syslinux locally using:
>
> # make spotless
> # make bios
> # make efi64

'make bios' should be unnecessary.

> What I've found so far:
>
> /* TCPdump/Wireshark analysis - problem 1 */
>
> Running tcpdump from the host on /dev/vmnet1 I found the following.
>
> The file "syslinux.efi" loads from a client that does not understand the
> additional options. This is probably the native UEFI firmware in action.
> Also, this step never fails. The next file will always at least start to
> load, even when the slave decides to crash before (or after?) the next
> file finishes downloading. Even ldlinux can fail to load.
>
> On bigger files, there is an increasing chance that core_udp_recv in
> efi/udp.c hits the 15ms (or is it 15 undefined jiffies?) timeout and
> returns -1 to its caller in core/fs/pxe/tftp.c:tftp_get_packet().
>
> This function then goes to ack_again, causing atfptd to resend the packet
> that just timed out. Then on the next call the original(!) packet is
> received anyway and handled according to plan.
>
> But the next packet to be received will be the same packet again,
> causing tftp.c to hit this code:
>
>     if (serial != last_pkt) {
>         /*
>          * Wrong packet, ACK the packet and try again.
>          * This is presumably because the ACK got lost,
>          * so the server just resent the previous packet.
>          */
> < #if 0
> < printf("Wrong packet, wanted %04x, got %04x\n", \
> <                htons(last_pkt), htons(*(uint16_t *)(data+2)));
> ---
> > #if 1
> > printf("Wrong packet, wanted %04x, got %04x\n", \
> >                last_pkt, serial);
> #endif
>         goto ack_again;
>     }
>
> So, what happens now is that the next packet will also be sent twice.
> From now on, every packet will be sent twice! Until another one hits the
> timeout. From then on each packet will be sent three times.

That's not good.

> Unfortunately, this is not *the* problem. It accounts for some delay but
> this timeout only happens a few times. Typically 1 to 3 times during a
> download. It should still load the kernel and initrd in less than 24s,
> not in over 5 minutes.  So, there's something else going on, but what?
>
> I tried to figure out why a UDP packet on a virtual network with
> unlimited bandwidth and zero packet-loss would hit a 15ms timeout. The

You'd be surprised how much CPU and RAM speed influence this.  Handling a
sustained 2 Gbps, for example, doesn't go well on most servers.

> comment says it waits for 15ms, but it does this by counting volatile
> uint32_t __jiffies, which gets incremented from efi/main.c by an event
> initiated from the UEFI API. According to the comment near the #define
> DEFAULT_TIMER_TICK_DURATION, the jiffies should increment every 50 msec.
> So a jiffy is not 1 msec but 50 times longer. If this comment is
> correct, the actual timeout isn't 15ms but 750ms. Or is it 750us because
> a few zeros are missing? I don't have the UEFI spec so I can't verify.
>
> What I did next was just increase the timeout in efi/udp.c:core_udp_recv
> to 60 jiffies. This made the double-ACK problem disappear completely.
> From now on, every packet is received within the timeout. All calls to
> core_udp_recv return 0. But still, the booting process sometimes crashes
> and loading the kernel and initrd takes more than a few minutes. I have
> yet to successfully boot the rescue environment through TFTP, because it
> still takes more than 15 minutes to load.
>
> This leads me to the conclusion that this double-ACK problem is not the
> cause of the bad performance, so I had to investigate further.
>
>
> /* TCPdump/Wireshark analysis - problem 2 */
>
> After configuring wireshark to display the difference in time between
> each DATA and ACK packet, I found another weirdness. Obviously, there is
> a delay between sending an ACK after receiving a DATA packet on the
> client side and between receiving an ACK and sending the next DATA
> packet. To illustrate:
>
> Client:
>   * receives a DATA pkt
>   (delay1)
>   * sends an ACK pkt
>   (wait...)
>
> Server:
>   * receives an ACK pkt
>   (delay2)
>   * sends the next DATA pkt
>   (wait...)
>
> Delay1 and delay2 are caused by the time it takes to evaluate the packet
> and to compute what packet to send next. No fancy stuff, just some easy
> calculations and memory transactions. So they both should be very short
> and stay constant. But they don't.
>
> The server does just fine. It typically reacts within 0.000100 seconds.
> The client also starts out fine, reacting within about 0.000250 seconds,
> but this delay slowly increases.

This makes me wonder about the source; there are at least three candidate
factors: VMware's UEFI with multiple vCPUs, a memory or other resource
leak, or a strange bug in Syslinux.

> Look at the beginning of the download of vmlinuz:
>
> No.     Time        Source              Destination         Protocol Length Info
>   15002 0.274814    10.4.0.1            10.4.255.254        TFTP
> 105    Read Request, File: efi64/images/default-image/initrd, Transfer
> type: octet, tsize\000=0\000, blksize\000=1408\000
>   15003 0.001082    10.4.255.254        10.4.0.1            TFTP
> 72     Option Acknowledgement, tsize\000=13442875\000,
> blksize\000=1408\000
>   15004 0.003735    10.4.0.1            10.4.255.254        TFTP
> 60     Acknowledgement, Block: 0
>   15005 0.000081    10.4.255.254        10.4.0.1            TFTP
> 1454   Data Packet, Block: 1
>   15006 0.000236    10.4.0.1            10.4.255.254        TFTP
> 60     Acknowledgement, Block: 1
>   15007 0.000076    10.4.255.254        10.4.0.1            TFTP
> 1454   Data Packet, Block: 2
>   15008 0.000220    10.4.0.1            10.4.255.254        TFTP
> 60     Acknowledgement, Block: 2
>   15009 0.000051    10.4.255.254        10.4.0.1            TFTP
> 1454   Data Packet, Block: 3
>   15010 0.000244    10.4.0.1            10.4.255.254        TFTP
> 60     Acknowledgement, Block: 3
>   15011 0.000038    10.4.255.254        10.4.0.1            TFTP
> 1454   Data Packet, Block: 4
>
> The time displayed is the time since the previous displayed packet.
> After receiving packet 15005, the client needs 0.000236 seconds to send
> the ACK for block 1 in packet 15006. The server then takes 0.000076
> seconds to send block 2. And so on...
>
> Now fast forward a few thousand packets and observe the delays:
>
>   51117 0.000115    10.141.255.254        10.141.0.1            TFTP
>   1454   Data Packet, Block: 9120
>   51118 0.023451    10.141.0.1            10.141.255.254        TFTP
>   60     Acknowledgement, Block: 9120
>   51119 0.000098    10.141.255.254        10.141.0.1            TFTP
>   1454   Data Packet, Block: 9121
>   51120 0.017573    10.141.0.1            10.141.255.254        TFTP
>   60     Acknowledgement, Block: 9120
>   51121 0.000088    10.141.255.254        10.141.0.1            TFTP
>   1454   Data Packet, Block: 9121
>   51122 0.011682    10.141.0.1            10.141.255.254        TFTP
>   60     Acknowledgement, Block: 9121
>   51123 0.000089    10.141.255.254        10.141.0.1            TFTP
>   1454   Data Packet, Block: 9122
>   51124 0.017590    10.141.0.1            10.141.255.254        TFTP
>   60     Acknowledgement, Block: 9121
>   51125 0.000088    10.141.255.254        10.141.0.1            TFTP
>   1454   Data Packet, Block: 9122
>   51126 0.017732    10.141.0.1            10.141.255.254        TFTP
>   60     Acknowledgement, Block: 9122
>   51127 0.000114    10.141.255.254        10.141.0.1            TFTP
>   1454   Data Packet, Block: 9123
>   51128 0.011833    10.141.0.1            10.141.255.254        TFTP
>   60     Acknowledgement, Block: 9122
>   51129 0.000097    10.141.255.254        10.141.0.1            TFTP
>   1454   Data Packet, Block: 9123
>   51130 0.017628    10.141.0.1            10.141.255.254        TFTP
>   60     Acknowledgement, Block: 9123
>   51131 0.000088    10.141.255.254        10.141.0.1            TFTP
>   1454   Data Packet, Block: 9124
>   51132 0.017785    10.141.0.1            10.141.255.254        TFTP
>   60     Acknowledgement, Block: 9123
>   51133 0.000090    10.141.255.254        10.141.0.1            TFTP
>   1454   Data Packet, Block: 9124
>   51134 0.017949    10.141.0.1            10.141.255.254        TFTP
>   60     Acknowledgement, Block: 9124
>   51135 0.000090    10.141.255.254        10.141.0.1            TFTP
>   1454   Data Packet, Block: 9125
>   51136 0.030592    10.141.0.1            10.141.255.254        TFTP
>   60     Acknowledgement, Block: 9124
>   51137 0.000097    10.141.255.254        10.141.0.1            TFTP
>   1454   Data Packet, Block: 9125
>   51138 0.037680    10.141.0.1            10.141.255.254        TFTP
>   60     Acknowledgement, Block: 9125
>
> See how each Acknowledgement now takes 100 times as long to send? We're
> not in 0.000x territory anymore, we're down to a handful of packets per
> second.
>
> On the next transfer, the delays are back to normal again, just to start
> increasing and increasing, again slowing down to a crawl.

A very important observation!

> I've been ploughing through the code for quite some time now, but I
> cannot explain why this happens. The main obstacle for me is that I
> barely understand how this glue-code to the UEFI API actually works and
> that I don't have a copy of the spec.

I recall that the UEFI 2.4 spec is actually available for free.

> So, is there anybody on this list willing to give it a shot?

Certainly worth some comparisons.

By the way, what version/commit?

--Gene

