Intel® Gaudi® AI Accelerator
Support for the Intel® Gaudi® AI Accelerator

Does multi-node work?

fmohamm
Employee

I am trying to get a multi-node run working on Gaudi2 (v1.20.0) by following the README here: https://github.com/HabanaAI/Megatron-LM/tree/main. I am working with containers. Even after multiple attempts, I cannot get the multi-node code running on the two Gaudi nodes that are available to me.

Is the readme file updated for latest  

 

James_Edwards
Employee

The documentation should be up to date. However, to debug your issue I will need more information regarding what you are trying to accomplish and the errors you are seeing. Please provide information on the example you are trying to run, how you are executing the example to run multi-node and the errors you are receiving.

fmohamm
Employee

I have two bare-metal Gaudi machines, and I am trying the script from https://github.com/HabanaAI/Megatron-LM/blob/1.20.0/examples/llama/README.md#setup together with the driver installation guide at https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#driver-installation.

Both machines are reached through the same jump server. During setup, I check the accelerator interface status with the following commands. The status is "up" on one machine and "down" on the other.

 

/opt/habanalabs/qual/[gaudi3,gaudi2,gaudi1]/bin/manage_network_ifs.sh --up

/opt/habanalabs/qual/[gaudi3,gaudi2,gaudi1]/bin/manage_network_ifs.sh --status

 

Are there any steps missing from the README?

 

Thanks 

 

James_Edwards
Employee

The customer has given me the following output on the "bad" node:


/opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh --status
accel0
3 ports down (8, 22, 23)
accel1
3 ports down (8, 22, 23)
accel2
3 ports down (8, 22, 23)
accel3
3 ports down (8, 22, 23)
accel4
3 ports down (8, 22, 23)
accel5
3 ports down (8, 22, 23)
accel6
3 ports down (8, 22, 23)
accel7
3 ports down (8, 22, 23)


I requested that he contact the lab admin to make sure the Gaudi platform has been wired to the accelerator network correctly.
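
For anyone comparing this output across nodes, here is a minimal Python sketch that summarizes which ports are reported down per accelerator. It assumes the output looks exactly like the lines quoted in this thread (an `accelN` line followed by `N ports down (...)`); the format printed when ports are up is not shown here, so the sketch does not try to parse it. The function names are hypothetical and not part of any Habana tool.

```python
import re

# Sample copied from the --status output quoted in this thread
# (first two accelerators only).
SAMPLE_STATUS = """\
accel0
3 ports down (8, 22, 23)
accel1
3 ports down (8, 22, 23)
"""


def parse_down_ports(status_text):
    """Map each accelerator name to the list of port numbers reported down."""
    down = {}
    current = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if re.fullmatch(r"accel\d+", line):
            # Start of a new accelerator section.
            current = line
            down[current] = []
            continue
        match = re.search(r"ports? down \(([\d,\s]+)\)", line)
        if match and current is not None:
            down[current] = [int(p) for p in match.group(1).split(",")]
    return down


def same_ports_everywhere(down):
    """True when every accelerator reports the identical list of down ports."""
    port_sets = {tuple(ports) for ports in down.values()}
    return len(port_sets) == 1


if __name__ == "__main__":
    result = parse_down_ports(SAMPLE_STATUS)
    for name, ports in result.items():
        print(f"{name}: {len(ports)} down -> {ports}")
    print("identical on all accelerators:", same_ports_everywhere(result))
```

If the same ports are down on every accelerator, as in the output above, that would be consistent with a cabling or switch-side cause rather than a single faulty card, though the script itself cannot confirm that.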

James_Edwards
Employee

Is there any update on this issue? Has the problem been resolved?

fmohamm
Employee

We built another machine from scratch. The new machine has the same issue, with exactly the same error I was getting on the previous machine.

James_Edwards
Employee

Is the machine you built from scratch connected to the same switch that is used for the accelerator network?

fmohamm
Employee

I talked to IT, and they said the machines are on different switches. How is the port status being `down` related to the machines being on different switches?

 

[Attached image: fmohamm_0-1745513410041.png]

 

James_Edwards
Employee

IT didn't really answer the question: the new machine could be on the switch that was used previously, or it could be on a different one. If the "new" system is connected to the switch correctly, IT should see the link LED indicators illuminated, showing that the connection is working. If the LEDs are on, the ports for the system should be up. If they are not, something is wrong with the switch or the cabling.


In either case, if the two systems are on different switches and those switches are not connected through a "spine" switch, the boxes will not be able to communicate with one another.
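
To illustrate the point about the spine switch, here is a small Python sketch that treats the cabling as a graph and checks whether one node can reach the other. The topology and the names (`node-a`, `leaf-1`, `spine`, and so on) are hypothetical and are not taken from the actual lab setup; it only models the reasoning, not real switch behavior.

```python
from collections import deque


def reachable(links, start, goal):
    """Breadth-first search over undirected links; True if goal is reachable from start."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical topology: each node is cabled to its own leaf switch.
isolated = [("node-a", "leaf-1"), ("node-b", "leaf-2")]
# Same cabling, plus both leaf switches uplinked to a common spine switch.
with_spine = isolated + [("leaf-1", "spine"), ("leaf-2", "spine")]

if __name__ == "__main__":
    print(reachable(isolated, "node-a", "node-b"))    # False
    print(reachable(with_spine, "node-a", "node-b"))  # True
```

Without the uplinks to a shared spine there is no path between the two nodes, which is why being on different switches matters for multi-node runs.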

 
