[syslinux] Trouble with ISOLINUX and IDE bus resets.

Tue May 11 07:36:32 PDT 2004

Hpa,
	Dell ships a CD called Dell OpenManage Server Assistant that,
starting with version 8.0 released last November, is a Linux-based
bootable CD. It uses ISOLINUX to load a Linux kernel/initrd combo to
start the system. Versions 8.0 through 8.2 use isolinux version 1.66;
starting with version 8.3 we have upgraded to version 2.08. First of
all, I'd like to say thanks for your excellent bootloaders. I have used
all three for various projects, and they have helped immensely on this
project.

	We have started to see a problem with the 1.66 version of
isolinux on Dell PowerEdge 6650 server systems. On the newer 2.08
version, the problem happens less often, but we still get a problem from
time to time. The cause of the problem seems to be that the CDROM device
goes offline or has some sort of problem. A device reset is issued, but
then it looks like when isolinux tries to re-read the sector, the BIOS
int 13h call destination for the data is invalid and it crashes the
system. We believe that the root cause is some hardware problem. The
interesting thing, though, is that the newer isolinux has the problem
less often, and other OS bootloaders (Windows NT in this case) also see
the Device Reset, but they retry the read calls and continue just fine.
When the newer isolinux fails, it usually gives "isolinux: Disk error
01, drive 82. Boot failed: press a key to retry". The 1.66 version
completely crashes the system with really neat video corruption.

	I have searched through the changelogs but did not see any
likely matches for this problem.

	I have posted an ASCII version of the IDE bus traces of the problem
here: http://www.michaels-house.net/~mebrown/IDE_bus_traces.tgz (4.7MB).
The raw data for this is from "Bus Doctor", and the raw bus doctor data
files are available upon request, but you need a Windows system to run
Bus Doctor on. There is an evaluation version available you can use if
you want to view the raw data.

	The BIOS team has stepped through the code and that is how we
determined that, after the read retry, the destination buffer given is a
bad address. Unfortunately I do not have any raw data from the BIOS guys
at this point. If there is something specific that you need I can ask
them to provide it.
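
	To make the suspected failure mode concrete, here is a minimal C
sketch of a retry loop that rebuilds the request (sector, count,
destination buffer) from a saved copy before every attempt, so that a
reset or a failed call cannot leave a stale destination behind. This is
not ISOLINUX's code, which is assembly; read_req, read_with_retry, and
the fake_read/fake_reset stand-ins for the INT 13h read and reset calls
are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIM_SECTOR 16  /* tiny sector size, for the simulation only */

/* Hypothetical request block: what to read and where to put it. */
struct read_req {
    uint32_t lba;    /* first sector to read */
    uint16_t count;  /* number of sectors */
    uint8_t *dest;   /* destination buffer */
};

/* Hypothetical hooks standing in for the BIOS read and reset calls. */
typedef int  (*read_fn)(struct read_req *req);  /* 0 = OK, else error code */
typedef void (*reset_fn)(void);

/* Retry a read, restoring every parameter from the caller's copy each time. */
int read_with_retry(const struct read_req *orig, read_fn do_read,
                    reset_fn do_reset, int max_tries)
{
    int err = -1;
    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        struct read_req req = *orig;  /* fresh copy: never reuse clobbered state */
        err = do_read(&req);
        if (err == 0)
            return 0;
        do_reset();                   /* reset the device, then try again */
    }
    return err;                       /* last error code after giving up */
}

/* Simulated flaky drive: fails a set number of times and trashes the request. */
static int fails_left = 2;
static int resets = 0;

static int fake_read(struct read_req *req)
{
    if (fails_left > 0) {
        fails_left--;
        req->dest = NULL;             /* simulate a clobbered destination */
        req->count = 0;
        return 0x01;
    }
    /* Fill each sector with a byte derived from its sector number. */
    for (uint16_t s = 0; s < req->count; s++)
        memset(req->dest + (size_t)s * SIM_SECTOR,
               (uint8_t)(req->lba + s), SIM_SECTOR);
    return 0;
}

static void fake_reset(void) { resets++; }
```

	The point of the sketch is only the copy at the top of the loop:
if the retry path reused the request as the failed call left it, the
second attempt would write to a bad address, which is what the BIOS
team observed.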

	Have you seen this kind of problem, or is there some other data
that we can provide that would help with a software workaround for
this problem?
--
Michael

Below I have copied several relevant entries from our internal bug
tracking system.
=================Trouble log==================
    RMSD Update 4/29/04  by Villanueva, Jorge (4/29/2004 4:58:31 PM)
From the traces I am seeing that after the Reset is issued the host
(Linux DSA 8.2) does not issue Read commands for a very large portion of
data when compared with a passing case.  This could be the cause of the
corruption.  When compared to the NT based DSA the host (NT) re-reads
some portion of the data right before the Device Reset and then
continues reading the rest.  The DSA 8.2 does not appear to be reading
in all the data.  I have attached the traces I have used for this
analysis. 
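
The comparison described above can be sketched roughly as follows,
assuming the read commands from each trace have been extracted into
(LBA, sector count) pairs. The read_cmd structure and missing_sectors
function are hypothetical and are not part of Bus Doctor or any trace
tooling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One read command taken from a bus trace (hypothetical layout). */
struct read_cmd {
    uint32_t lba;    /* first sector requested */
    uint32_t count;  /* number of sectors requested */
};

/* Returns 1 if sector `lba` falls inside any command in `cmds`. */
static int covered(const struct read_cmd *cmds, size_t n, uint32_t lba)
{
    for (size_t i = 0; i < n; i++)
        if (lba >= cmds[i].lba && lba - cmds[i].lba < cmds[i].count)
            return 1;
    return 0;
}

/*
 * Counts sectors read in the passing trace that never show up in the
 * failing trace. Assumes the passing commands do not overlap each other;
 * overlapping commands would be counted once per occurrence.
 */
size_t missing_sectors(const struct read_cmd *pass, size_t np,
                       const struct read_cmd *fail, size_t nf)
{
    size_t missing = 0;
    assert(pass != NULL && fail != NULL);
    for (size_t i = 0; i < np; i++)
        for (uint32_t s = 0; s < pass[i].count; s++)
            if (!covered(fail, nf, pass[i].lba + s))
                missing++;
    return missing;
}
```

A large result for a failing boot would match the observation that a
big portion of the data is never requested after the reset.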

       by Gedela, Neeraja (4/28/2004 3:32:40 PM)
Hooked up an American Arium emulator and with the help of BIOS engineer
Chin-Lung Chao, we reproduced the error. By stepping through the code
around the point of failure, we saw the DSA is executing data transfer
through the BIOS INT 13h calls by giving BIOS the CD sector from which
to read data and the memory location to which this data is to be
written. Ching-Lung says this memory location is an incorrect
memory location to write to, causing memory corruption.

     Added Trace files  by Villanueva, Jorge (4/23/2004 4:49:49 PM)
Added trace files of failing and passing cases. Both were done with the
same media and on the same config.  Please use Bus Doctor to view them;
you can get this software at http://www.datatransit.com/support/demosw.html

     More observations and update...  by Gedela, Neeraja (4/23/2004 2:25:25 PM)
* Removed gasket from chassis and could not reproduce the failure on the
2M451 drive w/DSA 8.1 A01 (~15 boots). Did an additional 11 boots at a
later time and the failure occurred twice.
* Could not reproduce on Boxster w/2M451 or with 0R397 and DSA 8.1 A01
* There is a misalignment of P1 connector (Larry Kosch from mechanical
team verified the difference by measuring) on the interposer (Dell P/N:
401JX) due to the fact that there is a warping of the metallic strip (of
drive carrier) to which the interposer is mounted via 2 holes causing
the holes to be skewed wrt each other instead of being in line
horizontally. There could be a variance in the connector contacts along
with the mating J1 connector on Dell P/N 64EEC interposer card.

* Seen following error once:
"isolinux: Disk error 01, drive 82 
Boot failed: press a key to retry"
* Jorge Villanueva from the RMSD team helped capture traces by hooking
up an IDE analyzer for both passing and failing cases with DSA 8.2 and a
passing case with DSA 7.5 
* From the captured traces, there seems to be a difference in the way
the NT code and Linux code handle errors. The error handling mechanism
of the NT code seems more robust than the Linux version, because the
error happened in both cases but the NT code actually recovers from it.
* On a general note, BIOS actually set the transfer mode to DMA if the
device is DMA capable, and the