Re: Nios II application hang/lockup with an exception

PatrickY · ‎03-26-2024

Hello, All,

While running the program, we randomly get an exception with a cause of -1 (unknown reason). The exception always happens at the same statement, but not every time that statement is executed. If no exception handler is installed, the software simply hangs or locks up.

1. The exception:
The exception-causing statement is:
// Enable to raise the fifo almost empty (AE) interrupt.
altera_avalon_fifo_write_ienable(FIFO_CSR_BASE, ALTERA_AVALON_FIFO_IENABLE_AE_MSK);

Its corresponding assembly code is shown below (or in the attached picture named ExceptionAssemblyCode.png).

The exact exception code line is “mov r2,zero”.

The following statement is called before calling altera_avalon_fifo_write_ienable():
// Clear the fifo almost empty (AE) interrupt.
altera_avalon_fifo_clear_event(FIFO_CSR_BASE, ALTERA_AVALON_FIFO_EVENT_AE_MSK);

2. Function altera_avalon_fifo_write_ienable() implementation:
int altera_avalon_fifo_write_ienable(alt_u32 address, alt_u32 mask)
{
    IOWR_ALTERA_AVALON_FIFO_IENABLE(address, mask);
    if(IORD_ALTERA_AVALON_FIFO_IENABLE(address) == mask)
        return ALTERA_AVALON_FIFO_OK;
    else
        return ALTERA_AVALON_FIFO_IENABLE_WRITE_ERROR;
}

#define IOWR_ALTERA_AVALON_FIFO_IENABLE(base, data) \
    IOWR(base, ALTERA_AVALON_FIFO_IENABLE_REG, data)

#define IORD_ALTERA_AVALON_FIFO_IENABLE(base) \
    IORD(base, ALTERA_AVALON_FIFO_IENABLE_REG)
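
The write-then-read-back check above can be exercised on a host PC. The following is a minimal sketch, not the HAL implementation: every `mock_` name, the register index and the error value are stand-ins chosen for illustration, and an array replaces the real CSR block. It shows only how the return-code path behaves if another context rewrites the interrupt enable register between the write and the read-back; it does not reproduce the CPU exception.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in register index and return codes (assumed for illustration). */
enum { MOCK_IENABLE_REG = 3, MOCK_REG_COUNT = 6 };
enum { MOCK_FIFO_OK = 0, MOCK_FIFO_IENABLE_WRITE_ERROR = -2 };

/* An array models the memory-mapped CSR block of the FIFO. */
static volatile uint32_t mock_csr[MOCK_REG_COUNT];

/* Optional callback that runs between the write and the read-back,
 * modelling a preempting task or ISR that touches the same CSR. */
static void (*mock_preempt_hook)(void) = NULL;

static void mock_iowr(int reg, uint32_t data) { mock_csr[reg] = data; }
static uint32_t mock_iord(int reg) { return mock_csr[reg]; }

void mock_set_preempt_hook(void (*hook)(void)) { mock_preempt_hook = hook; }

/* Models another context clearing the interrupt enable register. */
void mock_other_context_clears_ienable(void) { mock_iowr(MOCK_IENABLE_REG, 0u); }

/* Same shape as the HAL function: write the mask, then verify by reading back. */
int mock_fifo_write_ienable(uint32_t mask)
{
    mock_iowr(MOCK_IENABLE_REG, mask);
    if (mock_preempt_hook != NULL) {
        mock_preempt_hook();            /* simulated preemption window */
    }
    if (mock_iord(MOCK_IENABLE_REG) == mask) {
        return MOCK_FIFO_OK;
    }
    return MOCK_FIFO_IENABLE_WRITE_ERROR;
}
```

With no hook installed the call returns the OK code; with the hook installed the read-back no longer matches and the error code is returned, which is why checking the return value is worthwhile even when no exception is raised.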

3. Processor and Development Environment:
It runs on a Nios II processor in a Cyclone V device. A FIFO driver for a 2K FIFO peripheral was developed with the altera_avalon_fifo functions.
It uses the uC/OS-II real-time OS. The application was developed with Quartus II Software version 13.0.1.123,
which was installed with QuartusSetup-13.0.1.232.exe. The Nios II 13.0sp1 Software Build Tools for Eclipse were part of the installation and were used.

4. What we have done:
4.1 Turned on stack overflow debugging; no stack overflow exception occurred.
4.2 Tried hard-coded values instead of symbolic constants and added a critical section as follows; the same exception still occurred.
// FIFO_CSR_BASE = 0x610020
// ALTERA_AVALON_FIFO_EVENT_AE_MSK = 0x08
// ALTERA_AVALON_FIFO_IENABLE_AE_MSK = 0x08
OS_ENTER_CRITICAL();
altera_avalon_fifo_clear_event(0x610020, 0x08);
altera_avalon_fifo_write_ienable(0x610020, 0x08);
OS_EXIT_CRITICAL();
4.3 Tried to add the time-delay or yield as follows. The same exception still occurred.
IOWR_ALTERA_AVALON_FIFO_EVENT(0x610020, 0x08);
__asm("nop");
if((IORD_ALTERA_AVALON_FIFO_EVENT(0x610020) & 0x08) != 0)
{
    printf("ALTERA_AVALON_FIFO_EVENT_CLEAR_ERROR\n");
}
__asm("nop");
IOWR_ALTERA_AVALON_FIFO_IENABLE(0x610020, 0x08);
__asm("nop");
if(IORD_ALTERA_AVALON_FIFO_IENABLE(0x610020) != 0x08)
{
    printf("ALTERA_AVALON_FIFO_IENABLE_WRITE_ERROR\n");
}
__asm("nop");
4.4 After turning on “enable_instruction_related_exceptions” and re-issuing the exception-causing
statement from the added exception handler, the same exception still occurred, but the software kept moving forward
without any impact on its performance; that is, it completed what it was expected to do.
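
For reference, a handler of the kind described in 4.4 could look roughly like the sketch below. It is written against local stand-in types so that it compiles on a host; all `stub_` names and the log structure are hypothetical. On the target, the assumption is that the HAL header sys/alt_exceptions.h provides the corresponding cause and result types and that the handler is registered with alt_instruction_exception_register(); both should be verified against the BSP in use. A cause of -1 appears to be the HAL's "cause not present" value, reported when the core was generated without extra exception information; treat that as an assumption to check in the header as well.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the HAL exception types, so this compiles on a host. */
typedef enum { STUB_CAUSE_NOT_PRESENT = -1 } stub_exception_cause;
typedef enum {
    STUB_RETURN_REISSUE_INST = 0,
    STUB_RETURN_SKIP_INST    = 1
} stub_exception_result;

/* Last exception seen plus a running count, for inspection from a debugger or a task. */
typedef struct {
    int      cause;
    uint32_t pc;
    uint32_t badaddr;
    uint32_t count;
} exception_log_t;

volatile exception_log_t g_exception_log;

/* Record the exception, then ask for the faulting instruction to be re-issued. */
stub_exception_result log_and_reissue(stub_exception_cause cause,
                                      uint32_t exception_pc,
                                      uint32_t badaddr)
{
    g_exception_log.cause   = (int)cause;
    g_exception_log.pc      = exception_pc;
    g_exception_log.badaddr = badaddr;
    g_exception_log.count   = g_exception_log.count + 1u;
    return STUB_RETURN_REISSUE_INST;
}
```

Logging the PC and a count makes it easier to see how often the exception fires and whether it always lands on the same instruction.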

Could you help find the root cause of the exception? Is there any solution to it?

Thanks.

Sincerely

Patrick

PatrickY · ‎03-27-2024

Hello,

One more update. When the code was changed from:

altera_avalon_fifo_clear_event(FIFO_CSR_BASE, ALTERA_AVALON_FIFO_EVENT_AE_MSK);
altera_avalon_fifo_write_ienable(FIFO_CSR_BASE, ALTERA_AVALON_FIFO_IENABLE_AE_MSK);

to:

// FIFO_CSR_BASE = 0x610020
// ALTERA_AVALON_FIFO_EVENT_AE_MSK = 0x08
// ALTERA_AVALON_FIFO_IENABLE_AE_MSK = 0x08
OS_ENTER_CRITICAL();
altera_avalon_fifo_clear_event(0x610020, 0x08);
altera_avalon_fifo_write_ienable(0x610020, 0x08);
OS_EXIT_CRITICAL();

It failed one statement earlier in the assembly code, per the address from ea-4, which is shown below:

Any support is greatly appreciated!

Patrick

khtan · ‎03-29-2024

Hi Patrick,

Sorry for the delayed response; the case came in while I was at offsite training. I'm Kian and I will be looking into this case. First of all, thanks for the initial analysis of the issue. I didn't see an attached picture named ExceptionAssemblyCode.png, but I could see the screenshot of the assembly code.

It sounds like a timing issue or race condition, since you mentioned that with the stack overflow debug enabled the issue disappears, or at least the exception didn't happen.

If I understand correctly, the execution sequence is

i) altera_avalon_fifo_clear_event(0x610020, 0x08);

ii) altera_avalon_fifo_write_ienable(0x610020, 0x08);

and both of these are writing to the same register.

The assembly screenshot seems to indicate write_ienable first, then clear_event?

I will check on my end and see what else I can dig out based on the information you provided.

Thanks

Regards

Kian

PatrickY · ‎03-29-2024

Hi, Kian,

Thank you for supporting the case. It does indeed seem to be a race condition, or unprotected access to the shared resource.

One correction: when the code was changed from:

altera_avalon_fifo_clear_event(FIFO_CSR_BASE, ALTERA_AVALON_FIFO_EVENT_AE_MSK);
altera_avalon_fifo_write_ienable(FIFO_CSR_BASE, ALTERA_AVALON_FIFO_IENABLE_AE_MSK);

to:

// FIFO_CSR_BASE = 0x610020
// ALTERA_AVALON_FIFO_EVENT_AE_MSK = 0x08
// ALTERA_AVALON_FIFO_IENABLE_AE_MSK = 0x08
OS_ENTER_CRITICAL();
altera_avalon_fifo_clear_event(0x610020, 0x08);
altera_avalon_fifo_write_ienable(0x610020, 0x08);
OS_EXIT_CRITICAL();

The exception in fact happened in a different function in a different task. The exception occurred on the first line of the following code:

nLevel = altera_avalon_fifo_read_level(TRIGGERDATA_FIFO_IN_CSR_BASE);
nStatus = altera_avalon_fifo_read_status(TRIGGERDATA_FIFO_IN_CSR_BASE, ALTERA_AVALON_FIFO_STATUS_ALL);

TRIGGERDATA_FIFO_IN_CSR_BASE is a different symbolic constant with the same value (address), 0x610020, as FIFO_CSR_BASE, which is used in the previously exception-causing function altera_avalon_fifo_write_ienable().

After commenting out these two lines:

nLevel = altera_avalon_fifo_read_level(TRIGGERDATA_FIFO_IN_CSR_BASE);
nStatus = altera_avalon_fifo_read_status(TRIGGERDATA_FIFO_IN_CSR_BASE, ALTERA_AVALON_FIFO_STATUS_ALL);

no exception happened anymore.

So I used the OS_ENTER_CRITICAL() and OS_EXIT_CRITICAL() to modify the code in both tasks as follows:

In one task (higher priority), changed from:

altera_avalon_fifo_clear_event(FIFO_CSR_BASE, ALTERA_AVALON_FIFO_EVENT_AE_MSK);
altera_avalon_fifo_write_ienable(FIFO_CSR_BASE, ALTERA_AVALON_FIFO_IENABLE_AE_MSK);

to:

// FIFO_CSR_BASE = 0x610020
// ALTERA_AVALON_FIFO_EVENT_AE_MSK = 0x08
// ALTERA_AVALON_FIFO_IENABLE_AE_MSK = 0x08
OS_ENTER_CRITICAL();
altera_avalon_fifo_clear_event(FIFO_CSR_BASE, ALTERA_AVALON_FIFO_EVENT_AE_MSK);
altera_avalon_fifo_write_ienable(FIFO_CSR_BASE, ALTERA_AVALON_FIFO_IENABLE_AE_MSK);
OS_EXIT_CRITICAL();

In the other task (lower priority):

Changed from:

nLevel = altera_avalon_fifo_read_level(TRIGGERDATA_FIFO_IN_CSR_BASE);
nStatus = altera_avalon_fifo_read_status(TRIGGERDATA_FIFO_IN_CSR_BASE, ALTERA_AVALON_FIFO_STATUS_ALL);

to:

OS_ENTER_CRITICAL();
nLevel = altera_avalon_fifo_read_level(TRIGGERDATA_FIFO_IN_CSR_BASE);
nStatus = altera_avalon_fifo_read_status(TRIGGERDATA_FIFO_IN_CSR_BASE, ALTERA_AVALON_FIFO_STATUS_ALL);
OS_EXIT_CRITICAL();

By doing so, access to the same memory-mapped FIFO peripheral at 0x610020 is synchronized (protected) by temporarily disabling the processor's interrupts.
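
One way to keep such protection consistent is to route every access to the FIFO CSR through a small set of guarded helpers, so that no call site can forget the critical section. The sketch below models that idea on a host PC. STUB_ENTER_CRITICAL/STUB_EXIT_CRITICAL, the `stub_` register variables and the helper names are all hypothetical stand-ins; on the target they would map to OS_ENTER_CRITICAL()/OS_EXIT_CRITICAL() and the altera_avalon_fifo calls shown above.

```c
#include <stdint.h>

/* Host stand-ins: a depth counter models "interrupts disabled". */
int stub_irq_disable_depth = 0;
int stub_unguarded_accesses = 0;   /* CSR touches made outside a critical section */

#define STUB_ENTER_CRITICAL() (stub_irq_disable_depth++)
#define STUB_EXIT_CRITICAL()  (stub_irq_disable_depth--)

/* Plain variables model the level, status, event and interrupt enable registers. */
static uint32_t stub_level = 5u, stub_status = 0x08u;
static uint32_t stub_event = 0x08u, stub_ienable = 0u;

static uint32_t stub_read(const uint32_t *reg)
{
    if (stub_irq_disable_depth == 0) { stub_unguarded_accesses++; }
    return *reg;
}

static void stub_write(uint32_t *reg, uint32_t value)
{
    if (stub_irq_disable_depth == 0) { stub_unguarded_accesses++; }
    *reg = value;
}

typedef struct { uint32_t level; uint32_t status; } fifo_snapshot_t;

/* Lower-priority task: read level and status as one protected unit. */
fifo_snapshot_t fifo_read_snapshot(void)
{
    fifo_snapshot_t snap;
    STUB_ENTER_CRITICAL();
    snap.level  = stub_read(&stub_level);
    snap.status = stub_read(&stub_status);
    STUB_EXIT_CRITICAL();
    return snap;
}

/* Higher-priority task: clear the AE event bit, then re-enable the AE interrupt. */
uint32_t fifo_rearm_almost_empty(uint32_t mask)
{
    uint32_t enabled;
    STUB_ENTER_CRITICAL();
    stub_write(&stub_event, stub_read(&stub_event) & ~mask);  /* models clearing the event bit */
    stub_write(&stub_ienable, mask);
    enabled = stub_read(&stub_ienable);
    STUB_EXIT_CRITICAL();
    return enabled;
}
```

The counter of unguarded accesses stays at zero as long as every access goes through the helpers, which is the property the per-call-site changes are trying to achieve by hand.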

Most likely the changes will fix the exception/hang/lockup, but I have not tested them yet. I will let you know the test result.

Two questions:

1. While interrupts are disabled, no task switch can occur during the code between OS_ENTER_CRITICAL() and OS_EXIT_CRITICAL(), which only delays task switching and/or momentarily increases the interrupt response time, right?

2. Should a mutex with PIP (priority inheritance priority) be used instead of a critical section?

Thank you, Kian.

Patrick

khtan · ‎04-02-2024

Hi Patrick,

Thanks for the update; it looks like you're getting closer to resolving the issue. Nice finding too on the

nLevel = altera_avalon_fifo_read_level(TRIGGERDATA_FIFO_IN_CSR_BASE);
nStatus = altera_avalon_fifo_read_status(TRIGGERDATA_FIFO_IN_CSR_BASE, ALTERA_AVALON_FIFO_STATUS_ALL);

As for your questions:

1. While interrupts are disabled, no task switch can occur during the code between OS_ENTER_CRITICAL() and OS_EXIT_CRITICAL(), which only delays task switching and/or momentarily increases the interrupt response time, right?

Yes, that is correct. Between OS_ENTER_CRITICAL and OS_EXIT_CRITICAL only a single task can run; no task switch or interrupt can happen. The downside, as you mentioned, is that it delays task switching and/or momentarily increases the interrupt response time. I believe that interrupts occurring while you are inside the critical section will not be serviced during that time, so some might be missed if the time between OS_ENTER_CRITICAL and OS_EXIT_CRITICAL is quite long.

I was talking with another colleague about this case and he also pointed out this

2. Should a mutex with PIP (priority inheritance priority) be used instead of a critical section?

A mutex is a good idea: although one task is delayed by the mutex, IRQs can still be serviced. It only works for task-to-task conflicts, though, which I think is valid in your case, since you mentioned that commenting out the two register reads made the issue go away (which leads me to think this is a task-to-task conflict). Note that if those code paths are triggered by an IRQ, a mutex cannot be used, as it might cause a deadlock.
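
The task-versus-ISR distinction can be sketched as follows. This is a host-side model, not uC/OS-II code: stub_mutex_pend, stub_mutex_post, the stub_int_nesting counter and the error codes are hypothetical stand-ins. The model assumes the kernel rejects a mutex pend made from interrupt context instead of blocking, which is worth checking against the uC/OS-II reference for the version in use.

```c
#include <stdbool.h>

/* Hypothetical result codes for this model only. */
enum { STUB_ERR_NONE = 0, STUB_ERR_BUSY = 1, STUB_ERR_PEND_ISR = 2 };

int stub_int_nesting = 0;              /* > 0 while "inside an ISR" */
static bool stub_mutex_owned = false;

/* Try to take the mutex. A real kernel would block a task when the mutex
 * is busy; this model just reports it. */
int stub_mutex_pend(void)
{
    if (stub_int_nesting > 0) { return STUB_ERR_PEND_ISR; }
    if (stub_mutex_owned)     { return STUB_ERR_BUSY; }
    stub_mutex_owned = true;
    return STUB_ERR_NONE;
}

void stub_mutex_post(void) { stub_mutex_owned = false; }

/* Task-level access: take the mutex, touch the FIFO CSR, release. */
int task_access_fifo(void)
{
    int err = stub_mutex_pend();
    if (err != STUB_ERR_NONE) { return err; }
    /* ... FIFO CSR reads/writes would go here ... */
    stub_mutex_post();
    return STUB_ERR_NONE;
}

/* ISR-level access attempting the same thing: the pend is rejected. */
int isr_access_fifo(void)
{
    int err;
    stub_int_nesting++;
    err = task_access_fifo();
    stub_int_nesting--;
    return err;
}
```

If any ISR touches the same registers, a critical section remains the option that covers both tasks and ISRs.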

Thanks

Regards

Kian

PatrickY · ‎04-04-2024

Hi, Kian,

Thank you for the detailed answers. We will stay with the critical section. We found a few other statements that access the same memory through the same HAL API calls, either reading or writing, in tasks and in one interrupt service routine, and added OS_ENTER_CRITICAL() and OS_EXIT_CRITICAL() to each of them. We are going to run more tests to confirm that no more exceptions occur.

Note: The previous test with the mentioned changes indeed did not have exceptions, but I had also removed the statement altera_avalon_fifo_clear_event() (I thought it was not necessary, but it is better to have it per the "Embedded Peripherals IP User Guide" doc). After adding it back, an exception occurred, which led us to find more similar access statements.

Will keep you posted on the test results.

Thanks.

Patrick

khtan · ‎04-07-2024

Hi Patrick,

Just checking in on this case: are you still having issues with exceptions being triggered after your changes?

Thanks

Regards

Kian

PatrickY · ‎04-08-2024

Hi, Kian,

As expected, after using the critical section to synchronize every access to the event and interrupt enable registers made through the HAL library calls altera_avalon_fifo_clear_event() and altera_avalon_fifo_write_ienable(), no exceptions occur in the application. That is, the issue is resolved.

But I have a couple of questions about the exception with unknown reason that happened before the critical section was used.

1. Does unprotected access to the shared resource/address cause an exception? Usually unprotected access can cause data corruption, and when the corrupted data are later used, that can cause an error or exception such as a memory access violation or use of invalid data. But the unprotected access itself would not cause the exception.
2. Per the “Embedded Peripherals IP User Guide” document, if altera_avalon_fifo_clear_event() fails, it returns ALTERA_AVALON_FIFO_EVENT_CLEAR_ERROR, and if altera_avalon_fifo_write_ienable() fails, it returns ALTERA_AVALON_FIFO_IENABLE_WRITE_ERROR. Does either error cause an exception? If so, would the exception be reported with no reason specified?

Thank you in advance.

Best regards,

Patrick

khtan · ‎04-15-2024

Hi Patrick

Sorry for the delay in getting back to you; last week was a festive holiday in our region. Glad that the issue has been resolved.

As for your questions:

1. Does unprotected access to the shared resource/address cause an exception? Usually unprotected access can cause data corruption, and when the corrupted data are later used, that can cause an error or exception such as a memory access violation or use of invalid data. But the unprotected access itself would not cause the exception.

Yes, exceptions happen when multiple functions (e.g. in this case clear_event and write_ienable) try to access the same register at the same or nearly the same time because of the RTOS task scheduler. If only one unprotected function/task is accessing the shared resource/address, the access is valid and will not cause an exception.

2. Per the “Embedded Peripherals IP User Guide” document, if altera_avalon_fifo_clear_event() fails, it returns ALTERA_AVALON_FIFO_EVENT_CLEAR_ERROR, and if altera_avalon_fifo_write_ienable() fails, it returns ALTERA_AVALON_FIFO_IENABLE_WRITE_ERROR. Does either error cause an exception? If so, would the exception be reported with no reason specified?

The HAL API will not cause any exception in the processor (it is designed not to).

Users are advised to read the return value and check that it is 0 before continuing to the next operation:

ret_code = altera_avalon_fifo_clear_event(....);
if(ret_code != 0)
{
    printf_error();
    return 0;
}

Thanks

Regards

Kian

khtan · ‎04-22-2024

Hi Patrick,

Is there anything else that you want to clarify related to this case? Otherwise, I would like to close the case and transition this thread to community support.

Thanks

Regards

Kian

khtan · ‎04-23-2024

Hi Patrick,

I will set this thread to close-pending. Please log in to ‘https://supporttickets.intel.com’, view the details of the desired request, and post a feed/response within the next 15 days to allow me to continue to support you. After 15 days, this thread will be transitioned to community support. The community users will be able to help you with your follow-up questions.

If you happen to close this thread, you might receive a survey. If you think you would rank your support experience less than 4 out of 10, please allow me to correct it before closing, or if the problem can’t be corrected, please let me know the cause so that I may improve your future service experience.

Regards

Kian
