Interrupt infinite loop issue

Altera_Forum · 09-17-2010

Hello,

I did a search, but frankly I don't have a lot of time to dig through threads. My initial searches came up with nothing.

I'm curious whether anyone has seen an issue where the entire Linux system locks up. The surface cause is that the external_interrupt routine in entry.S enters an infinite loop if ipending is 0x0 when the interrupt handler is entered. The bit-test loop looks for a pending interrupt request and exits _only_ when it finds a set bit. If no bits are set, it keeps looping forever, since ipending is never read again inside the loop.

This can be caused by faulty hardware: a component asserts its interrupt line, the Nios branches to the exception vector, and the component then removes the request before the Nios can read the ipending register, so the read returns all 0's.

In my specific case hardware is not the cause; this has been verified in STP (Signal Tap).

I added code to entry.S to skip the bit-test loop and return whenever ipending is zero (since we know it would otherwise go into an infinite loop; this _is_ a bug). The external_interrupt routine should have this kind of change added permanently. Preferably there would also be a mechanism to report this condition to the Linux kernel as a spurious interrupt and log it as a system error, but I don't know how to do that.
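
As an illustration of the guard described above, here is a minimal C model of the bit-test scan. It is only a sketch: the real routine is Nios II assembly in entry.S, and the names first_pending_irq, dispatch_external_interrupt, NO_IRQ_PENDING and spurious_irq_count are hypothetical, not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

#define NO_IRQ_PENDING (-1)   /* hypothetical sentinel for "ipending was 0" */

/* Hypothetical counter standing in for a real reporting mechanism. */
static unsigned long spurious_irq_count;

/* Model of the bit-test scan: returns the lowest set bit of ipending. */
static int first_pending_irq(uint32_t ipending)
{
    /* The guard: without this check, the loop below never terminates
     * when ipending is 0, because ipending is not re-read inside it. */
    if (ipending == 0)
        return NO_IRQ_PENDING;

    int irq = 0;
    while ((ipending & 1u) == 0) {
        ipending >>= 1;
        irq++;
    }
    assert(irq < 32);   /* holds because at least one bit was set */
    return irq;
}

/* Model of the handler entry: count spurious entries instead of hanging. */
static int dispatch_external_interrupt(uint32_t ipending)
{
    int irq = first_pending_irq(ipending);
    if (irq == NO_IRQ_PENDING)
        spurious_irq_count++;
    return irq;
}
```

How a spurious entry should actually be reported to the kernel is left open here; the counter only marks the place where that reporting would go.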

With further debugging, triggering Signal Tap on the ipending == 0 branch (in my modified entry.S), we were able to trace the instructions back and discovered that the root cause is the alt_sgdma_isr routine. Specifically, it appears that the exit code of alt_sgdma_isr clears both the TX SGDMA and RX SGDMA interrupts when processing a TX interrupt. Why does this routine clear the RX interrupt bit when processing a TX interrupt? That is a red flag to me right away. Keep in mind the TX SGDMA and RX SGDMA are two completely different components that operate independently. There really should be two separate ISRs, IMO, but it is what it is (possibly due to a Linux kernel restriction?).

What happens, on a very infrequent basis, is this: the RX SGDMA asserts its interrupt line to the Nios exactly one assembler instruction before the RX clear, and the RX SGDMA then deasserts its interrupt line as a result of the clear instruction. But it is too late: the Nios has seen the signal and several clocks later branches to service the interrupt. By then the interrupt signal is already gone, so external_interrupt reads ipending as zero and (without my changes to entry.S) enters the infinite loop. The event is infrequent on any one system, but my customer has hundreds of systems running this code, so across all of them the error shows up quite often.

So there are two issues here: 1. alt_sgdma_isr clears the RX interrupt when processing a TX interrupt, and 2. why are interrupts enabled inside alt_sgdma_isr at all? In my tests the exception vector is branched to before the end of alt_sgdma_isr is reached (the ISR itself is interrupted). Maybe that is OK, but clearing the RX interrupt while processing a TX interrupt is certainly not correct.

I propose changing the IRQ reset code in alt_sgdma_isr (altera_tse.c) to the following:

//reset irq
if (irq == tse_priv->rx_fifo_interrupt) {
        tse_priv->rx_sgdma_dev->control |= alt_sgdma_control_clear_interrupt_msk;
} else if (irq == tse_priv->tx_fifo_interrupt) {
        tse_priv->tx_sgdma_dev->control |= alt_sgdma_control_clear_interrupt_msk;
}
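
To show the intent of the change, here is a small user-space C model (not driver code). The types sgdma_model and tse_model and the function ack_irq are hypothetical stand-ins for the driver's private data and for the write of the clear-interrupt bit; the point is only that acknowledging one SGDMA leaves the other's pending request untouched.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one SGDMA: just its interrupt request line. */
struct sgdma_model {
    bool irq_pending;
};

/* Hypothetical model of the driver state: two independent SGDMAs,
 * each with its own IRQ number. */
struct tse_model {
    int rx_irq;
    int tx_irq;
    struct sgdma_model rx;
    struct sgdma_model tx;
};

/* Acknowledge only the SGDMA whose IRQ is being serviced. */
static void ack_irq(struct tse_model *t, int irq)
{
    /* The model only knows the two IRQ numbers stored in the struct. */
    assert(irq == t->rx_irq || irq == t->tx_irq);

    if (irq == t->rx_irq)
        t->rx.irq_pending = false;
    else
        t->tx.irq_pending = false;
}
```

With both requests pending, calling ack_irq with the TX IRQ number clears only the TX request, so an RX request that arrives during TX handling is still visible when the CPU next reads its pending bits.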

Does this make sense? Is there a better way to handle this?

So why post all this here? Because I am not a kernel expert. I would like the Linux kernel experts here to look at my post and tell me whether any of these issues have already been addressed and, if not, help me get these changes into the Linux for Nios distribution. I imagine other Linux for Nios users will eventually run into this when they try to put their systems into production.

If there is a better way to address this issue, I am all ears and would love to hear any recommendations.

If necessary, please contact me directly by email or PM, since I cannot mention the name of the customer here in my post.

Thanks for your help.

Rick Hill

TSFAE, Embedded Systems, Altera, Inc.

Altera_Forum · 09-17-2010

I haven't looked at it for a few months and didn't look into this much just now, but IIRC:

The altera_tse driver uses NAPI: http://www.linuxfoundation.org/collaborate/workgroups/networking/napi

What this essentially means is that when it gets a TX or RX interrupt it schedules a polling loop which handles all TX and RX events until there aren't any, then reenables interrupts. This is supposed to improve network performance by not handling interrupts one at a time.
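
The control flow described above can be modeled in plain C. This is a user-space sketch of the idea, not the kernel NAPI API: nic_model, model_isr and model_poll are hypothetical names, and the budget parameter stands in for the per-poll work limit that NAPI passes to a driver's poll function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a NIC driven in NAPI style. */
struct nic_model {
    bool irq_enabled;     /* device interrupts currently unmasked? */
    bool poll_scheduled;  /* polling loop queued or running? */
    int  pending_events;  /* TX/RX events waiting to be handled */
};

/* Interrupt handler: do no real work, just mask further interrupts
 * and hand over to the polling loop. */
static void model_isr(struct nic_model *nic)
{
    nic->irq_enabled = false;
    nic->poll_scheduled = true;
}

/* Poll function: handle up to `budget` events. Only when it runs out
 * of events before running out of budget does it stop polling and
 * re-enable interrupts. Returns the number of events handled. */
static int model_poll(struct nic_model *nic, int budget)
{
    assert(budget > 0);

    int done = 0;
    while (done < budget && nic->pending_events > 0) {
        nic->pending_events--;
        done++;
    }

    if (done < budget) {
        nic->poll_scheduled = false;
        nic->irq_enabled = true;
    }
    return done;
}
```

If the budget is used up, the poll stays scheduled with interrupts still masked, and the remaining events are handled on the next poll.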

Unfortunately, the altera_tse NAPI implementation is broken: http://sopc.et.ntust.edu.tw/pipermail/nios2-dev/2010-april/003726.html

I tried to fix the problem I ran into there (which is different from yours), but ended up causing a system freeze instead, which I didn't get around to debugging. It may be that's what you are running into, and if so I'm glad you found it.

Addendum since I see you are actually from Altera: I sent the above patch to Dalon Westergreen and he said he was actively working on the driver and SGDMA related things, but it seems he since got swamped with other tasks. As you can see the driver is broken in a couple ways and a bit strange in a few more ways. I started fixing it up when I ran into the above problem, but I am not a kernel expert either so I have a lot of catching up to do to fix the driver properly, so I put it off hoping someone else (maybe someone at Altera) will get to it. Maybe you can find out internally if there is any work being done on it or if someone else who has worked on the driver has insight into the problem you found.

Altera_Forum · 09-17-2010

Just glanced through my patch again; this comment in my version of alt_sgdma_isr may be relevant:


static inline void __alt_sgdma_isr(struct net_device *dev,
                                   struct alt_tse_private *tse_priv)
{
       tse_sgdma_disable_irq(tse_priv);
       
       if (napi_schedule_prep(&tse_priv->napi)) {
               if (netif_msg_intr(tse_priv))
                       printk(KERN_WARNING "%s: NAPI Starting\n", dev->name);
               
               __napi_schedule(&tse_priv->napi);
       } else {
               // BUG
               // if we get here, we received another irq while processing NAPI
               // this seems to happen when there is an rx interrupt and tx interrupt
               //  in short succession, but does no harm
               if (netif_msg_intr(tse_priv))
                       printk(KERN_WARNING "%s :TSE IRQ Received while IRQs disabled\n",
                                       dev->name);
       }
}

Maybe I was wrong on "does no harm."

Altera_Forum · 09-17-2010

Thanks for the response. I'll talk to Dalon.

Thanks!

Rick Hill

TSFAE, Embedded Systems, Altera, Inc.

Altera_Forum · 12-16-2010

Hi Rick,

Did you find out anything new about this?

Looks like your proposed change was just made:

http://gitorious.org/linux-nios2/linux-nios2/commit/c6bd185cfb659b935b8815725030cf077421cfbd