What if you change your interrupt code to always return 0?
And/or change interrupt priority (either to very high or very low)?
Edit: I would also add a static counter variable to the interrupt handler and, every 1000 counts or so, debug-output it to serial (unconditionally, i.e. even if the "interrupt is not for me"). That way you can see what happens at lockup: maybe there's an interrupt flood.
I would generally also look more at what happens during lockup (the OS is still likely doing/executing something) not just how to reproduce it.
Edit2: One idea behind the tests mentioned is to check whether there is maybe some interrupt handler installed in the system by some other driver for the same IRQ that erroneously (sometimes) thinks and says "yes, this IRQ is for me" when it isn't, returns != 0, and so causes the other interrupt handlers for the same IRQ not to be called anymore. Those other interrupt handlers then never get a chance to clear the IRQ-pending state, and the interrupt keeps firing endlessly from then on -> lockup.
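The failure mode described above can be modelled in plain C. This is only a sketch under assumptions: `struct irq_server`, `struct device`, both handlers and the dispatcher are hypothetical stand-ins, not Exec's real interrupt server code, and the model assumes the chain stops at the first handler that returns non-zero.

```c
#include <stddef.h>

/* Hypothetical model of a shared IRQ: a handler returns non-zero to
 * claim the interrupt and 0 to say "not for me". */
typedef int (*irq_handler_fn)(void *data);

struct irq_server {
    irq_handler_fn code;
    void *data;
};

/* Model of a device with a level-triggered pending flag. */
struct device {
    int pending;   /* 1 while the device asserts the IRQ line */
    int serviced;  /* how often its handler cleared the flag */
};

/* Well-behaved handler: claims the IRQ only if its device is pending. */
int good_handler(void *data)
{
    struct device *dev = data;
    if (!dev->pending)
        return 0;
    dev->pending = 0;
    dev->serviced++;
    return 1;
}

/* Buggy handler: always claims the IRQ and never touches any device. */
int greedy_handler(void *data)
{
    (void)data;
    return 1;
}

/* Walk the chain and stop at the first handler that claims the IRQ.
 * Returns the number of handlers that were called. */
size_t dispatch_irq(const struct irq_server *chain, size_t count)
{
    size_t called = 0;
    for (size_t i = 0; i < count; i++) {
        called++;
        if (chain[i].code(chain[i].data) != 0)
            break;
    }
    return called;
}

/* Re-dispatch while the line is still asserted, up to max_rounds.
 * A return value equal to max_rounds models an interrupt storm. */
unsigned run_until_quiet(const struct irq_server *chain, size_t count,
                         const struct device *dev, unsigned max_rounds)
{
    unsigned rounds = 0;
    while (dev->pending && rounds < max_rounds) {
        dispatch_irq(chain, count);
        rounds++;
    }
    return rounds;
}
```

With only `good_handler` in the chain the line goes quiet after one round. Put `greedy_handler` in front of it and the device's pending flag is never cleared, so the loop runs until `max_rounds`.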
Boot from CD-ROM, and check that you don't have outdated tools in s:startup-sequence or user-startup. Often, when the OS is frozen, it's possible to get into an Amiga computer by using a serial cable, but you need to set up a terminal on aux: using newshell.
(NutsAboutAmiga)
Quote:
What if you change your interrupt code to always return 0?
You probably don't mean just return 0, as we still need to write to RXCH_RESET; if we don't, it will surely lock up. You mean to strip the interrupt handler code down to the bare minimum (write + return 0) and see if that changes anything. Is that what you mean?
Quote:
And/or change interrupt priority (either to very high or very low)?
Yep, will check 128 / -128
Quote:
I would generally also look more at what happens during lockup (the OS is still likely doing/executing something) not just how to reproduce it.
You think the CPU is still alive even if video is dead, mouse/keyboard are dead, and serial is dead?
Quote:
One idea behind the tests mentioned is to check whether there is maybe some interrupt handler installed in the system by some other driver for the same IRQ that erroneously (sometimes) thinks and says "yes, this IRQ is for me" when it isn't, returns != 0, and so causes the other interrupt handlers for the same IRQ not to be called anymore. Those other interrupt handlers then never get a chance to clear the IRQ-pending state, and the interrupt keeps firing endlessly from then on -> lockup.
That's worth trying too, of course. Thanks, I will check all of this out.
Another idea came to me while reading yours: memory can survive the "reset" button after a lockup. So from the IRQ handler we just record a counter, which line was reached, maybe a timestamp; when the lockup happens, we reset, boot with no s-s, read that memory, and see the last N interrupt events before the lockup. Can that work? I can do a simple test: write some junk to a fixed address, press reset, and immediately after boot check whether the value is still there. If yes, the idea can work. What do you think?
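A minimal sketch of such a reset-surviving trace log, in portable C. Everything here is an assumption for illustration: the `trace_*` names, the magic value and the entry layout are made up, and a real version would place `struct trace_log` at a fixed address that is known to survive reset and would have to make sure the writes actually reach RAM rather than sitting in the CPU cache.

```c
#include <stdint.h>
#include <string.h>

#define TRACE_MAGIC   0x54524143u  /* arbitrary marker, "TRAC" */
#define TRACE_ENTRIES 16u          /* must be a power of two */

struct trace_entry {
    uint32_t seq;    /* running event count */
    uint32_t line;   /* source line reached in the handler */
    uint32_t stamp;  /* caller-supplied timestamp */
};

struct trace_log {
    uint32_t magic;
    uint32_t next;   /* total number of entries ever written */
    struct trace_entry e[TRACE_ENTRIES];
};

/* Returns 1 if the region still holds a log from before the reset. */
int trace_survived(const struct trace_log *log)
{
    return log->magic == TRACE_MAGIC;
}

/* Start a fresh log; call only after the old contents were read out. */
void trace_reset(struct trace_log *log)
{
    memset(log, 0, sizeof(*log));
    log->magic = TRACE_MAGIC;
}

/* Record one event; the oldest entry is overwritten once the log is full. */
void trace_add(struct trace_log *log, uint32_t line, uint32_t stamp)
{
    struct trace_entry *e = &log->e[log->next & (TRACE_ENTRIES - 1u)];
    e->seq = log->next;
    e->line = line;
    e->stamp = stamp;
    log->next++;
}

/* Copy the most recent entries, oldest first, into out[].
 * Returns the number of entries copied (at most TRACE_ENTRIES). */
uint32_t trace_read(const struct trace_log *log, struct trace_entry *out)
{
    uint32_t n = log->next < TRACE_ENTRIES ? log->next : TRACE_ENTRIES;
    uint32_t first = log->next - n;
    for (uint32_t i = 0; i < n; i++)
        out[i] = log->e[(first + i) & (TRACE_ENTRIES - 1u)];
    return n;
}
```

After the reset, the boot-time reader would call `trace_survived()` first and only then `trace_read()`, so that random memory contents are not mistaken for a log.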
Also, thinking about it more: if nothing helps, the X1000 does have a COP (debug) header. The TRM says "The COP (debugger) header is provided for factory test purposes and its use is not recommended.", but we know what those "factory test purposes" mean: CPU tests :) But I currently don't know which PPC JTAG debuggers support the PA6T at all. Varisys knows, of course, but not me :) I could also go with OpenOCD + an FTDI-based JTAG adapter, but then I don't know if OpenOCD supports the PA6T-1682M. But that's for later; first I will try to do my best from the software side.
kas1e wrote:
@Georg You probably don't mean just return 0, as we still need to write to RXCH_RESET; if we don't, it will surely lock up. You mean to strip the interrupt handler code down to the bare minimum (write + return 0) and see if that changes anything. Is that what you mean?
No, don't strip/change code in the interrupt handler. Only change the return value to always be 0. Unless things are different in AOS4 I think it used to be so that an interrupt handler returns TRUE or 1 if the interrupt handler found out that "yes, the interrupt was for me" and FALSE or 0 if the interrupt handler finds out that it "was not for me".
That return value should really just be an optimization for the OS (to call fewer interrupt handlers if it thinks, or is being told, that the interrupt was already handled by the handler it is currently calling). But I think it should be safe to always return 0 and so cause the other interrupt handlers of an IRQ to be called anyway. I think in theory the interrupt handlers for an IRQ that can be shared need to handle the situation where the handler is called even if the IRQ was triggered by a different device than its own.
What does your interrupt handler return at the moment?
Quote:
Yep, will check 128 / -128
127 / -128
Quote:
You think the CPU is still alive even if video is dead, mouse/keyboard are dead, and serial is dead?
My guess is that it likely is. Would try to run a little program in the background which installs maybe a vertical blank interrupt and periodically outputs something to serial. Don't know if vertb is best for this. It should be some interrupt that has higher (hw) priority than those external device (network/audio/...) interrupts.
/* Print approximately once per second.
 * VERTB fires at ~50 Hz on most display modes. */
if (wd->tick - wd->last_tick >= 50)
{
    wd->last_tick = wd->tick;
    wd->seconds++;
    wd->IExec->DebugPrintF("[WD] alive: %lu s (tick=%lu)\n",
                           wd->seconds, wd->tick);
}
Then I installed it with "watchdog &" and ran the stress test over the network => lockup. All I got was 2 lines before everything died:
[stress] 8/8 connections open. Starting receive...
[stress] Time | Received | Est.pkts | Est.wraps | Active
[stress] ------|-----------|-----------|-----------|-------
[WD] alive: 39 s (tick=1950)
[WD] alive: 40 s (tick=2000)
Damn! :)
Maybe I need to make it print, say, every 1/4 of a second, so that it has time to print something after the cause of the lockup occurs but before everything dies?
@all I just added a watchdog monitor (also on VERTB) in the driver itself, to print _ALL_ the status registers of everything every 1/4 second. By all I mean all the IOB ones, all the MAC ones, all the DMA engine ones, all the DMA-interface-RX ones, and all the RX/TX channel ones; I also dump the state of all DMA channels while running, just in case. And the result: all status registers look correct, all of them.
Those changes were mostly a theory in case the irq is shared with other device drivers. Is there a tool (scout? sysmon?) where you can check if there are other installed interrupt handler for the irq your driver is using, or if your driver is the only one using the specific irq?
If it's not the only one, maybe try what happens if you disable the driver (sound? whatever) that's using same irq.
Maybe another thing you can try is to add a counter and a little serial debug output at beginning of your interrupt and at the end before the return. Don't know how many interrupts are typically happening during net transfer, so don't know if you have to do the output every 1000 ticks, every 10000 ticks or whatever.
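A minimal sketch of the enter/exit accounting suggested above. The `irq_stats` structure and function names are hypothetical, and the actual debug print (serial output or similar) is left to the caller; the sketch only decides when a line is due and how many handler invocations entered without exiting.

```c
#include <stdint.h>

struct irq_stats {
    uint32_t enter;        /* incremented first thing in the handler */
    uint32_t exit;         /* incremented just before the return */
    uint32_t report_every; /* emit one debug line per this many entries */
};

/* Call at handler entry. Returns 1 when a debug line is due. */
int irq_stats_enter(struct irq_stats *s)
{
    s->enter++;
    return s->report_every != 0 && (s->enter % s->report_every) == 0;
}

/* Call right before the handler returns. */
void irq_stats_exit(struct irq_stats *s)
{
    s->exit++;
}

/* Invocations that entered but did not reach the exit point.
 * Anything other than 0 (or 1 while inside the handler) would
 * point at a handler that got stuck. */
uint32_t irq_stats_in_flight(const struct irq_stats *s)
{
    return s->enter - s->exit;
}
```

The value of `report_every` has to be tuned to the interrupt rate, so that the debug output itself does not become the bottleneck.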
Instead of VERTB watchdog, maybe try also timer.device softint loop instead, as that's maybe less relying on "external" stuff (gfx card, bus / pci / ??):
/* Allocate message port, data & interrupt structures. Don't use CreatePort() */
/* or CreateMsgPort() since they allocate a signal (don't need that) for a */
/* PA_SIGNAL type port. We need PA_SOFTINT. */
if (tsidata = AllocMem(sizeof(struct TSIData), MEMF_PUBLIC|MEMF_CLEAR))
{
    if(port = AllocMem(sizeof(struct MsgPort), MEMF_PUBLIC|MEMF_CLEAR))
    {
        NewList(&(port->mp_MsgList)); /* Initialize message list */
        if (softint = AllocMem(sizeof(struct Interrupt), MEMF_PUBLIC|MEMF_CLEAR))
        {
            softint->is_Code = tsoftcode; /* The software interrupt routine */
            softint->is_Data = tsidata;
            softint->is_Node.ln_Pri = 0;
            port->mp_Node.ln_Type = NT_MSGPORT; /* Set up the PA_SOFTINT message port */
            port->mp_Flags = PA_SOFTINT; /* (no need to make this port public). */
            port->mp_SigTask = (struct Task *)softint; /* pointer to interrupt structure */
            /* Allocate timerequest */
            if (tr = (struct timerequest *) CreateExtIO(port, sizeof(struct timerequest)))
            {
                /* Open timer.device. NULL is success. */
                if (!(OpenDevice("timer.device", UNIT_MICROHZ, (struct IORequest *)tr, 0)))
                {
                    tsidata->tsi_Flag = ON; /* Init data structure to share globally. */
                    tsidata->tsi_Port = port;
                    /* Send of the first timerequest to start. IMPORTANT: Do NOT */
                    /* BeginIO() to any device other than audio or timer from */
                    /* within a software or hardware interrupt. The BeginIO() code */
                    /* may allocate memory, wait or perform other functions which */
                    /* are illegal or dangerous during interrupts. */
                    printf("starting softint. CTRL-C to break...\n");

    /* Remove the message from the port. */
    tr = (struct timerequest *)GetMsg(tsidata->tsi_Port);
    /* Keep on going if main() hasn't set flag to OFF. */
    if ((tr) && (tsidata->tsi_Flag == ON))
    {
        /* increment counter and re-send timerequest--IMPORTANT: This */
        /* self-perpetuating technique of calling BeginIO() during a software */
        /* interrupt may only be used with the audio and timer device. */
        tsidata->tsi_Counter++;
        tr->tr_node.io_Command = TR_ADDREQUEST;
        tr->tr_time.tv_micro = MICRO_DELAY;
        BeginIO((struct IORequest *)tr);
    }
    /* Tell main() we're out of here. */
    else tsidata->tsi_Flag = STOPPED;
}
Quote:
No, don't strip/change code in the interrupt handler. Only change the return value to always be 0. Unless things are different in AOS4 I think it used to be so that an interrupt handler returns TRUE or 1 if the interrupt handler found out that "yes, the interrupt was for me" and FALSE or 0 if the interrupt handler finds out that it "was not for me".
I'm not sure anymore, but I think the return value is ignored for shareable interrupts like the PCI ones and all interrupt handlers in the list are always called. Reason: more than one PCI device using the same IRQ number may raise an IRQ at the same time. A PCI device interrupt handler has to clear the PCI device interrupt if it was an interrupt for its own device, but must not clear the CPU exception used for it; that's done by the OS after all interrupt handlers in the list have been called.
Quote:
Is there a tool (scout? sysmon?) where you can check if there are other installed interrupt handler for the irq your driver is using, or if your driver is the only one using the specific irq?
There is a 20-year-old port of Scout. AFAIK it's the only tool which can display the interrupt handler lists. There is a comment from 2009 that it doesn't work on a Sam440ep. I don't know if that's hardware related, or if it no longer works at all on current AmigaOS 4.1 versions and would have to be updated.
Quote:
Instead of VERTB watchdog, maybe try also timer.device softint loop instead, as that's maybe less relying on "external" stuff (gfx card, bus / pci / ??):
Unlike on AmigaOS <= 3.x (using a custom chip timer) the AmigaOS 4.x timer.device doesn't depend on anything external, it's based on the PowerPC CPU timers.
Ranger shows each PCI device’s interrupt number. AFAIR the devices involved in NIC usage each have unique interrupt numbers, so I don’t think it’s likely some other random device is getting in the way.
However, with the driver needing several of the PCI devices, is it possible that one of them is triggering an interrupt you’re not expecting, e.g. for a statistics counter?
But it can't display the list of interrupt handlers, unlike Scout for example.
Quote:
AFAIR the devices involved in NIC usage each have unique interrupt numbers, so I don’t think it’s likely some other random device is getting in the way.
Even on the AmigaOne and Sam440/460 (emulation) PCI IRQs might be shared, but on the Pegasos2 (emulation) all PCI devices use the same IRQ number and everything (PATA, SATA, XHCI USB, NIC, sound, gfx, etc.) uses the same shared IRQ and interrupt handler list.
I went heavy and just wrote a small tool which lists all the interrupts we have in the system; by all I mean _all_. For the first 15 I just use ExecBase->IntVects[16], then I skip the next 5 (those, as I see from the includes, are PCI_INTERRUPT_LINE), and everything from 20 upwards I probe with AddIntServer. So in the end I just scan from 0 to 10000, and this is what we have on the X1000 (some of the first 15 I named myself from hardware/intbits.h in the SDK; I don't know whether those names really apply):
This also means that on our interrupt 169 we have nothing else (169 because 144 is the DMA engine, + 20 for the TX DMA channels, + 5 for the RX DMA channel we use).
Next, I tried Georg's second suggestion about adding a counter at the beginning and at the end. With VERTB at first: I print 4 times per second, because the RX interrupt currently fires about 8500 times per second, so ~2100 interrupts per print.
I.e. clean right up to the end: enter - exit - miss = 0 until the very last line, so the handler never got stuck inside and the CPU just died. No stuck handler, no interrupt flood...
Next, the 3rd suggestion: I replaced VERTB with the timer softint. I tried it only as an external app; there is probably not much sense in putting it inside the driver as I did with VERTB?
That means both VERTB and the CPU-timer softint died together. So the CPU is really dead, meaning it's something very, very heavy...
Next things I can think of for now:
1). As I understand it, when the CPU dies like that, what happens is a machine check (vector 0x200 on all PPCs, including our PA6T). If something causes a machine check, the register state is saved and the CPU jumps to 0x200; if the vector is corrupted -> boom. The idea is: we add a little stub right at the beginning of the machine check vector which dumps the registers to some memory area that can survive the 'reset' button. Then, after reset, we boot with "no s-s" and just read the data from that address. Or we can probably even read it from CFE itself. What do you think?
But no luck, of course, as there is more code involved in calling them. So, is it possible to install a stub right at 0x200, bypassing everything? Maybe even from CFE, before booting OS4?
2). Add a "polling" variant to the driver. No interrupts involved, no IOB, just minimal DMA (as without the MAC it will not work). It sucks, of course, but first, it will work and can be used somehow (or maybe not), and second, we will know for sure whether the issue is the use of IOB/etc. together or the simple transfer of data, and a lighter version will be easier to debug.
3). Buy a JTAG debugger and check the state when the CPU dies..
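The polling variant from item 2 could look roughly like the following model. It is only a sketch under assumptions: `struct rx_desc`, its `ready` flag, `RING_SIZE` and `rx_poll()` are hypothetical and do not reflect the real PA Semi descriptor format; the point is just that the driver walks the ring from a timer tick instead of from an interrupt.

```c
#include <stdint.h>

#define RING_SIZE 8u  /* hypothetical; must be a power of two */

struct rx_desc {
    volatile uint32_t ready; /* set by the (simulated) DMA engine */
    uint32_t len;            /* length of the received packet */
};

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    uint32_t next;  /* next descriptor the driver will inspect */
};

/* Process up to `budget` completed descriptors in ring order.
 * Adds the packet lengths to *bytes and returns the packet count. */
uint32_t rx_poll(struct rx_ring *ring, uint32_t budget, uint32_t *bytes)
{
    uint32_t done = 0;
    while (done < budget) {
        struct rx_desc *d = &ring->desc[ring->next & (RING_SIZE - 1u)];
        if (!d->ready)
            break;        /* hardware has not filled this one yet */
        *bytes += d->len;
        d->ready = 0;     /* hand the descriptor back */
        ring->next++;
        done++;
    }
    return done;
}
```

Called from a periodic timer, `rx_poll()` with a fixed budget bounds the work per tick, which also makes it easy to log how many descriptors were pending each time.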
I should also add that the Linux driver has a workaround for a known "Errata 5971" (see pasemi_mac.c):
if (n > RX_RING_SIZE) {
    /* Errata 5971 workaround: L2 target of headers */
    write_iob_reg(PAS_IOB_COM_PKTHDRCNT, 0);
    n &= (RX_RING_SIZE-1);
}
I of course added this one too, and confirmed by tests that everything is OK afterwards. This erratum looks like the kind of bug that can cause such issues, but I have definitely fixed it and verified that through all the register states, etc.
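One detail of the workaround above worth spelling out: `n &= (RX_RING_SIZE-1)` only behaves like a modulo when the ring size is a power of two. A small sketch to show this; `is_pow2()` and `ring_wrap()` are hypothetical helper names, not part of the Linux driver.

```c
#include <stdint.h>

/* Returns 1 if size is a non-zero power of two, else 0. */
int is_pow2(uint32_t size)
{
    return size != 0 && (size & (size - 1u)) == 0;
}

/* Wrap an index with a mask. Equivalent to n % size only when
 * size is a power of two; for other sizes the result is wrong. */
uint32_t ring_wrap(uint32_t n, uint32_t size)
{
    return n & (size - 1u);
}
```

For a ring of 64 entries, `ring_wrap(70, 64)` gives 6, the same as `70 % 64`. For a ring of 48 entries the mask would also give 6, while the correct modulo is 22, so a port that changes the ring size needs to keep it a power of two.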
Quote:
But it can't display the list of interrupt handlers, unlike Scout for example.
Yes, a live list is better, but there was a suggestion in a previous post that Scout might not work on current OS4, so Ranger's PCI list may be better than nothing. But in fact, I just remembered that Ranger has a list of interrupt handlers too (Exec->IntHandlers).
Quote:
Quote:
AFAIR the devices involved in NIC usage each have unique interrupt numbers, so I don’t think it’s likely some other random device is getting in the way.
Even on the AmigaOne and Sam440/460 (emulation) PCI IRQs might be shared, but on the Pegasos2 (emulation) all PCI devices use the same IRQ number and everything (PATA, SATA, XHCI USB, NIC, sound, gfx, etc.) uses the same shared IRQ and interrupt handler list.
I was referring specifically to the X1000's built-in NIC. I know NICs in general often share IRQ numbers.
Quote:
1). As I understand it, when the CPU dies like that, what happens is a machine check (vector 0x200 on all PPCs, including our PA6T). If something causes a machine check, the register state is saved and the CPU jumps to 0x200; if the vector is corrupted -> boom. The idea is: we add a little stub right at the beginning of the machine check vector which dumps the registers to some memory area that can survive the 'reset' button. Then, after reset, we boot with "no s-s" and just read the data from that address. Or we can probably even read it from CFE itself. What do you think?
If it's dead with no interrupts still running (not even timer interrupts work), then this looks like the way to go. Doesn't AmigaOS install some handler for MCE that dumps state? You can look at 0x200 from a debugger to see if there's any code there. I think you can't set it before AmigaOS boots, as the OS might replace it, but you should be able to install your own handler by overwriting 0x200 with a jump to your handler. When it is called, better not rely on any of the OS still working, so your handler should be some simple assembly that dumps state to serial or somewhere else using only the CPU, without calling any routines, just writing the serial registers. I don't know if this applies to the PA6T, but a PowerPC manual says this about MCE:
Quote:
The causes for machine check exceptions are implementation-dependent, but typically these causes are related to conditions such as bus parity errors or attempting to access an invalid physical address. The machine check exception is disabled when MSR[ME] = 0. If a machine check exception condition exists and the ME bit is cleared, the processor goes into the checkstop state.
So it can be disabled: check that the ME bit is set so that the handler gets called, and see if AmigaOS already has a handler. It also says a cause can be accessing an invalid physical address, so check everything in the driver that involves physical addresses. Maybe you're missing a virtual-to-physical translation somewhere: DMA uses physical addresses while the CPU uses virtual ones, but I don't know whether AmigaOS uses one-to-one mapping or needs some address translation for DMA addresses.
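The ME check mentioned above comes down to testing one bit of the MSR. A minimal sketch: the mask 0x1000 is the Machine Check Enable bit of the PowerPC MSR, while `machine_check_enabled()` is a hypothetical helper name. Obtaining the MSR value itself (for example with `mfmsr` in supervisor mode) is not shown and is assumed to happen elsewhere.

```c
#include <stdint.h>

#define MSR_ME 0x1000u  /* Machine Check Enable bit in the PowerPC MSR */

/* Returns 1 if machine check exceptions are enabled in this MSR value.
 * With ME clear, a machine check condition leads to checkstop instead
 * of a call to the handler at vector 0x200. */
int machine_check_enabled(uint64_t msr)
{
    return (msr & MSR_ME) != 0;
}
```

For example, an MSR value of 0x9032 has the bit set, whereas 0x8000 (only external interrupts enabled) does not.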
Quote:
Yes, a live list is better, but there was a suggestion in a previous post that Scout might not work on current OS4, so Ranger's PCI list may be better than nothing. But in fact, I just remembered that Ranger has a list of interrupt handlers too (Exec->IntHandlers).
Ranger didn't show much in this regard (certainly not the first 15), and it also didn't show anything past 100. Check it: it really shows you only the first 100, excluding the first 15.
My way, as I showed, queries everything, any number of them: the first 15, 5 skipped, and all the others up to any number, so past 100 too; there we find my own 169, the USB and sb600sata interrupts, etc. If anyone needs it, I can share the binary/source.
Quote:
I was referring specifically to the X1000's built-in NIC. I know NICs in general often share IRQ numbers.
As far as I can see, not in the case of the X1000.
Quote:
OS4 has different virtual and physical addresses. GetDMAList() can provide the translation.
Yes, but not for the EMAC ones at 0xe0000000: those always stay the same (they are of course mapped by CFE at start, but after that they are always the same for us and look like physical addresses, and we can write to them directly).
@balaton Thanks, will try this all out.
@All Now I have removed ALL interrupt-based code, absolutely all of it. I kept the polling/timer.device-based code (which still has to use DMA, of course), and I got the same lockup!! So it's the DMA data transfer! The only thing still active in the polling version is the DMA writing received packets into our buffers through the IOB, so... no interrupt flood, no IRQ priorities, etc...
That probably helps us a bit now..
Also, I found a pattern (at least, I hope so). A plain "ping" with the default 64-byte packets, or even ping -s 1472, does not cause a lockup. What causes the lockup is massive transfer, i.e. more constant ring wraps. I can't be 100% sure that this is the case, but at least for now it looks like it..
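To put rough numbers on the ring-wrap pattern above, here is a small sketch. The ring size of 64 used in the example is purely an assumption (the real RX ring size of the driver is not stated here), as is treating each of the ~8500 RX interrupts per second as one packet; `ring_wraps()` is a hypothetical helper.

```c
#include <stdint.h>

/* Number of complete ring wraps after `packets` received packets,
 * for a ring with `ring_size` descriptors (ring_size must be > 0). */
uint64_t ring_wraps(uint64_t packets, uint32_t ring_size)
{
    return packets / ring_size;
}
```

Under those assumptions, one second of bulk transfer gives `ring_wraps(8500, 64)`, i.e. 132 wraps, while a ping at one packet per second needs 64 seconds for a single wrap. If the lockup is tied to wraps, that difference in wrap rate would fit the observation that ping alone does not trigger it.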