
Issues with mpich/mpiexec on Dual Intel Xeon E5-2697A v4 System

BCapa

I am attempting to run an application that uses mpich/mpiexec to distribute its processes across cores. I have had no trouble running the same application extensively on similar hardware (a dual Intel Xeon E5-2680 v3 system) under the same version of Fedora (23), which suggests to me that the problem is memory or hardware related.

I note that the E5-2680 v3 system (which runs the application without issue) does not have TSX-NI, whereas the E5-2697A v4 system does. Could this be the issue? Is it possible to disable TSX-NI on my E5-2697A v4 system to diagnose this? Otherwise, would updating the CPU microcode help?

Very little debugging information is given when the application fails, but given how quickly it fails after launch, it is quite clear that something is very wrong here:

[wri@wrimodels12 runs]$ ems_domain --localize midatl

Starting UEMS Program ems_domain (V15.99.8) on wrimodels12 at Sat Dec 2 20:18:28 2017 UTC

* Localizing "midatl" domain - /home/wri/wrfems/uems/runs/midatl

Primary Domain

Projection : lat-lon

Standard Longitude : -41 Degrees

Reference Latitude : 42 Degrees

Reference Longitude : -41 Degrees

Grid NX x NY : 495 x 165

Grid Spacing : 0.170 Degrees

Geog Dset Res : modis_lakes+modis_30s+modis_15s+10m

* Burn'n up 32 processors to localize your domain. Please ignore the smoke - Failed (11)

! Error running GEOGRID - System Signal Code (SN) : 11 (Invalid Memory Reference - Seg Fault)

While perusing the log/domain_geogrid_stdout.log file I saw the following:

> YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

Also use the --nogeogrid flag for debugging.

[wri@wrimodels12 static]$ /home/wri/wrfems/uems/util/mpich2/bin/mpiexec -n 32 /home/wri/wrfems/uems/bin/geogrid

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= PID 47823 RUNNING AT wrimodels12

= EXIT CODE: 11

= CLEANING UP REMAINING PROCESSES

= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

This typically refers to a problem with your application.

Please see the FAQ page for debugging suggestions
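
The output above reports the failure only as "signal 11". As a minimal sketch of how to confirm which signal ends a single process, the Python below runs a command and decodes its return code. It assumes a POSIX system, where `subprocess` reports death by signal as a negative return code; the helper names `describe_exit` and `run_and_describe` are hypothetical and are not part of UEMS or MPICH.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a child killed by signal N is reported as return code -N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_and_describe(argv):
    """Run argv (a list of strings) and describe how the process ended."""
    return describe_exit(subprocess.run(argv).returncode)
```

For example, `run_and_describe(["/home/wri/wrfems/uems/bin/geogrid"])` would launch the binary directly, assuming it can start as a single process without mpiexec, which this thread does not confirm. Running it under mpiexec instead would report mpiexec's own exit status rather than the signal that killed the rank.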

idata

 

bcapasso: Thank you very much for contacting the Intel® communities. We will do our best to provide the information you are looking for.

Regarding your question about whether the problem with the application is related to TSX-NI support on the Intel® Xeon® E5-2697A v4 processor: it is hard to tell for sure, because it depends on the requirements of the application itself. Depending on the board model, you might be able to disable TSX-NI either through a BIOS setting or by applying a BIOS update.
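
Before changing BIOS settings, it may help to confirm what the operating system actually sees. The following is a minimal sketch, assuming a Linux x86 system where `/proc/cpuinfo` exposes a `flags` line (TSX shows up as the `hle` and `rtm` flags) and a `microcode` line; the function names are hypothetical. Running it on both the v3 and the v4 machine would show whether they differ in TSX exposure and which microcode revision each is using.

```python
def tsx_and_microcode(cpuinfo_text):
    """Return (sorted TSX flags found, microcode revision or None).

    Only the first "flags" and first "microcode" entries are read, which
    assumes every logical CPU reports the same values.
    """
    flags = set()
    microcode = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines separating CPU entries
        key = key.strip()
        if key == "flags" and not flags:
            flags = set(value.split())
        elif key == "microcode" and microcode is None:
            microcode = value.strip()
    # "hle" and "rtm" are the two TSX feature flags on x86 Linux.
    return sorted(flags & {"hle", "rtm"}), microcode


def read_cpuinfo(path="/proc/cpuinfo"):
    """Convenience wrapper: parse the live cpuinfo file on a Linux host."""
    with open(path) as handle:
        return tsx_and_microcode(handle.read())
```

An empty flag list after a BIOS change would indicate that TSX is no longer exposed to the OS. Note that this only reports what the kernel sees; it does not show whether the application or its libraries actually use TSX.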

 

 

Please also note that Intel performed its testing with Windows as the operating system. Since you are using Fedora, we recommend visiting the Fedora forums for further technical assistance on this subject:

https://fedoraforum.org/

If you have any further questions, please let me know.

 

 

Regards,
Alberto R

 
