[syslinux] EFI & PXE-booting: very slow TFTP performance on a VMWare test setup

Oscar Roozen oscar.roozen at brightcomputing.com
Wed Jun 17 08:24:28 PDT 2015


Dear people on the Syslinux Mailinglist,

Are there any known problems with the performance of TFTP in (U)EFI
environments in general or maybe just on VMWare?

Right now I'm running tests on two virtual environments using either
VMWare Workstation 11.1.0 on a Fedora 21 system or VMWare Player 7.1.0
on an Ubuntu 14.10 system. Both virtual systems support PXE booting in
UEFI mode. Both systems behave exactly the same.

The test-setup is as follows:

Master:
  2 CPU, 2GB RAM,
  Network Adapter "Host-only"
  Centos 7.1
  * syslinux-6.03
  * isc-dhcpd-V3.0.5-RedHat
  * atftp-0.7.1

Slaves:
  2 CPU, 2GB RAM,
  Network Adapter "Host-only"
  Configured to boot from the network only

The Master is configured to provide a pxelinux.0 and supporting stuff
through tftp. The client boots, gets a DHCP answer, loads pxelinux.0,
presents a menu through menu.c32, defaults to loading the kernel and an
initrd and starts booting from there. No problems here in BIOS mode. The
whole process takes less than 8 seconds, measured from the moment I hit
enter in the menu.

One alternate choice in our menu is loading a rescue-environment that
has a much bigger initrd (87M instead of 13M) which loads in 12 seconds.

Now I'm testing the same setup in UEFI mode. The client boots, gets an
answer from the DHCPd, loads efi64/syslinux.efi and the supporting stuff
from the efi64-subtree, presents a menu, starts loading the kernel and
initrd and continues booting from there. Sounds good? Yes, but instead
of 8 seconds, this all takes at least a few minutes! And quite often the
system just hangs before even presenting the menu or presents an empty
menu, indicating it failed to load one of the .c32 files or menu
definitions. No errors are shown. Sometimes VMWare pop-ups appear
indicating the machine crashed on invalid code or could not load an OS.

If the boot process finishes successfully, the kernel and initrd go on
into our custom installer and after the slave is up and running, its
network performance (actually, all performance) is equal to the slave
that was booted in BIOS mode.

It turns out to be impossible to load the rescue environment though.
Loading the initrd of 87M takes longer than 15 minutes. After 15 minutes
the VMWare workstation (or is it syslinux?) decides to reboot.

By the way, this is not an ARP issue or anything like that. I checked
that both slaves have different MAC addresses, and I think I covered most
other obvious causes. The only obvious thing I have not yet tried is
running the setup on real machines or in another virtual environment.

(Actually, I tried VirtualBox first, but could not get it to PXE boot in
UEFI mode.)

So, the simple question is: Is anyone else experiencing slow performance
while network-booting in UEFI mode? Is it caused by the UEFI API
implementation of VMWare, is there an issue in gnu-efi or is it even the
way syslinux interacts with the EFI-layer that causes this?

This is the question I've been asking myself for the past few days.

I've been debugging the syslinux code by inspection and by adding debug
statements everywhere. I also removed "git submodule update --init"
from efi/build-gnu-efi.sh so I could rebuild syslinux locally using:

# make spotless
# make bios
# make efi64


What I've found so far:

/* TCPdump/Wireshark analysis - problem 1 */

Running tcpdump from the host on /dev/vmnet1 I found the following.

The file "syslinux.efi" is fetched by a client that does not understand
the additional TFTP options. This is probably the native UEFI firmware in
action. Also, this step never fails. The next file will always at least
start to load, even when the slave decides to crash before (or after?)
the next file finishes downloading. Even ldlinux can fail to load.

On bigger files, there is an increasing chance that core_udp_recv in
efi/udp.c hits the 15ms (or is it 15 undefined jiffies?) timeout and
returns -1 to its caller in core/fs/pxe/tftp.c:tftp_get_packet().

This function then goes to ack_again, causing atftpd to resend the packet
that just timed out. Then on the next call the original(!) packet is
received anyway and handled according to plan.

But the next packet to be received will be the same packet again,
causing tftp.c to hit this code:

    if (serial != last_pkt) {
        /*
         * Wrong packet, ACK the packet and try again.
         * This is presumably because the ACK got lost,
         * so the server just resent the previous packet.
         */
-#if 0
-printf("Wrong packet, wanted %04x, got %04x\n", \
-               htons(last_pkt), htons(*(uint16_t *)(data+2)));
+#if 1
+printf("Wrong packet, wanted %04x, got %04x\n", \
+               last_pkt, serial);
 #endif
        goto ack_again;
    }

So, what happens now is that the next packet will also be sent twice.
From now on, every packet will be sent twice! Until another one hits the
timeout. From then on each packet will be sent three times.

Unfortunately, this is not *the* problem. It accounts for some delay but
this timeout only happens a few times. Typically 1 to 3 times during a
download. It should still load the kernel and initrd in less than 24s,
not in over 5 minutes.  So, there's something else going on, but what?

I tried to figure out why a UDP packet on a virtual network with
unlimited bandwidth and zero packet-loss would hit a 15ms timeout. The
comment says it waits for 15ms, but it does this by counting volatile
uint32_t __jiffies, which gets incremented from efi/main.c by an event
initiated from the UEFI API. According to the comment near the #define
DEFAULT_TIMER_TICK_DURATION, the jiffies should increment every 50 msec.
So a jiffy is not 1 msec but 50 times longer. If this comment is
correct, the actual timeout isn't 15ms but 750ms. Or is it 750us because
a few zeros are missing? I don't have the UEFI spec so I can't verify.

What I did next was just increase the timeout in efi/udp.c:core_udp_recv
to 60 jiffies. This made the double-ACK problem disappear completely.
From now on, every packet is received within the timeout. All calls to
core_udp_recv return 0. But still, the booting process sometimes crashes
and loading the kernel and initrd takes more than a few minutes. I have
yet to successfully boot the rescue environment through TFTP, because it
still takes more than 15 minutes to load.

This leads me to the conclusion that this double-ACK problem is not the
cause of the bad performance, so I had to investigate further.


/* TCPdump/Wireshark analysis - problem 2 */

After configuring Wireshark to display the time difference between each
DATA and ACK packet, I found another oddity. Naturally, there is a delay
on the client side between receiving a DATA packet and sending the ACK,
and a delay on the server side between receiving an ACK and sending the
next DATA packet. To illustrate:

Client:
  * receives a DATA pkt
  (delay1)
  * sends an ACK pkt
  (wait...)

Server:
  * receives an ACK pkt
  (delay2)
  * sends the next DATA pkt
  (wait...)

Delay1 and delay2 are caused by the time it takes to evaluate the packet
and to compute what packet to send next. No fancy stuff, just some easy
calculations and memory transactions. So they both should be very short
and stay constant. But they don't.

The server does just fine. It typically reacts within 0.000100 seconds.
The client also starts out fine, reacting within about 0.000250 seconds,
but this delay slowly increases.

Look at the beginning of the download of the initrd:

No.     Time        Source              Destination         Protocol Length Info
  15002 0.274814    10.4.0.1            10.4.255.254        TFTP
105    Read Request, File: efi64/images/default-image/initrd, Transfer
type: octet, tsize\000=0\000, blksize\000=1408\000
  15003 0.001082    10.4.255.254        10.4.0.1            TFTP
72     Option Acknowledgement, tsize\000=13442875\000,
blksize\000=1408\000
  15004 0.003735    10.4.0.1            10.4.255.254        TFTP
60     Acknowledgement, Block: 0
  15005 0.000081    10.4.255.254        10.4.0.1            TFTP
1454   Data Packet, Block: 1
  15006 0.000236    10.4.0.1            10.4.255.254        TFTP
60     Acknowledgement, Block: 1
  15007 0.000076    10.4.255.254        10.4.0.1            TFTP
1454   Data Packet, Block: 2
  15008 0.000220    10.4.0.1            10.4.255.254        TFTP
60     Acknowledgement, Block: 2
  15009 0.000051    10.4.255.254        10.4.0.1            TFTP
1454   Data Packet, Block: 3
  15010 0.000244    10.4.0.1            10.4.255.254        TFTP
60     Acknowledgement, Block: 3
  15011 0.000038    10.4.255.254        10.4.0.1            TFTP
1454   Data Packet, Block: 4

The time displayed is the time since the previous displayed packet. So
after receiving packet 15005, the client needs 0.000236 seconds to send
the ACK for block 1 in packet 15006. The server then takes 0.000076
seconds to send block 2. And so on...

Now fast forward a few thousand packets and observe the delays:

  51117 0.000115    10.141.255.254        10.141.0.1            TFTP
  1454   Data Packet, Block: 9120
  51118 0.023451    10.141.0.1            10.141.255.254        TFTP
  60     Acknowledgement, Block: 9120
  51119 0.000098    10.141.255.254        10.141.0.1            TFTP
  1454   Data Packet, Block: 9121
  51120 0.017573    10.141.0.1            10.141.255.254        TFTP
  60     Acknowledgement, Block: 9120
  51121 0.000088    10.141.255.254        10.141.0.1            TFTP
  1454   Data Packet, Block: 9121
  51122 0.011682    10.141.0.1            10.141.255.254        TFTP
  60     Acknowledgement, Block: 9121
  51123 0.000089    10.141.255.254        10.141.0.1            TFTP
  1454   Data Packet, Block: 9122
  51124 0.017590    10.141.0.1            10.141.255.254        TFTP
  60     Acknowledgement, Block: 9121
  51125 0.000088    10.141.255.254        10.141.0.1            TFTP
  1454   Data Packet, Block: 9122
  51126 0.017732    10.141.0.1            10.141.255.254        TFTP
  60     Acknowledgement, Block: 9122
  51127 0.000114    10.141.255.254        10.141.0.1            TFTP
  1454   Data Packet, Block: 9123
  51128 0.011833    10.141.0.1            10.141.255.254        TFTP
  60     Acknowledgement, Block: 9122
  51129 0.000097    10.141.255.254        10.141.0.1            TFTP
  1454   Data Packet, Block: 9123
  51130 0.017628    10.141.0.1            10.141.255.254        TFTP
  60     Acknowledgement, Block: 9123
  51131 0.000088    10.141.255.254        10.141.0.1            TFTP
  1454   Data Packet, Block: 9124
  51132 0.017785    10.141.0.1            10.141.255.254        TFTP
  60     Acknowledgement, Block: 9123
  51133 0.000090    10.141.255.254        10.141.0.1            TFTP
  1454   Data Packet, Block: 9124
  51134 0.017949    10.141.0.1            10.141.255.254        TFTP
  60     Acknowledgement, Block: 9124
  51135 0.000090    10.141.255.254        10.141.0.1            TFTP
  1454   Data Packet, Block: 9125
  51136 0.030592    10.141.0.1            10.141.255.254        TFTP
  60     Acknowledgement, Block: 9124
  51137 0.000097    10.141.255.254        10.141.0.1            TFTP
  1454   Data Packet, Block: 9125
  51138 0.037680    10.141.0.1            10.141.255.254        TFTP
  60     Acknowledgement, Block: 9125

See how each Acknowledgement now takes 100 times as long to send? We're
not in 0.000x territory anymore; at roughly 17 ms per block we're down
to a few dozen packets per second.

On the next transfer, the delays are back to normal again, just to start
increasing and increasing, again slowing down to a crawl.

I've been ploughing through the code for quite some time now, but I
cannot explain why this happens. The main obstacle for me is that I
barely understand how this glue-code to the UEFI API actually works and
that I don't have a copy of the spec.

So, is there anybody on this list willing to give it a shot?

-- 

Oscar Roozen
Linux Developer
Bright Computing BV

