Why are the NCS outputs "nan" or "inf"?

idata · 11-23-2017

I have tried several Caffe models on the NCS. Two of them run correctly. However, the output of one model is unnormalized. The C program logged the following:

making run_cpp

cd cpp; ./run_cpp; cd ..

Successfully opened NCS device!

Successfully allocated graph for ../graph

w= 96 ,h= 112

Successfully loaded the tensor for image ../../../data/images/044.jpg

Successfully got the inference result for image ../../../data/images/044.jpg

resultData is 21080 bytes which is 10540 16-bit floats.

Index of top result is: -1

Probability of top result is: 0.000000

All 10540 category probabilities are "nan", so the index of the top result is -1 with a probability of 0.000000.

The model has 90 layers with a softmax output, and the image resolution is 96*112. I applied scaling (1/128) during training, and I use a single NCS to run the model.

I am wondering why the NCS output is "nan".

Can someone give me some suggestions? Thanks a lot!
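The "-1" index and "0.000000" probability in the log are consistent with a plain greater-than scan over an all-nan result. The sketch below assumes the example program picks the top result with a loop of this shape (the thread does not show that code, and `top_result` is a hypothetical name); it only illustrates that nan never wins a comparison.

```python
import math

def top_result(probs):
    """Return (index, value) of the largest entry using a plain '>' scan.

    Starts from index -1 and value 0.0, which is an assumption about how
    the example program initializes its search.
    """
    max_index, max_value = -1, 0.0
    for i, p in enumerate(probs):
        # Every comparison involving nan is False, so nan entries never win.
        if p > max_value:
            max_index, max_value = i, p
    return max_index, max_value

healthy = [0.1, 0.7, 0.2]
broken = [math.nan] * 10540   # same length as the result in the log

print(top_result(healthy))    # (1, 0.7)
print(top_result(broken))     # (-1, 0.0)
```

So a top index of -1 is a symptom of the nan values, not a separate failure.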

idata · 11-23-2017

The fp16 data output by the NCS seems to have overflowed when converted to fp32. Is this caused by training in float, or is there some other reason?
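Every fp16 value is exactly representable in fp32, so the widening step itself cannot overflow; any "nan" or "inf" seen afterwards must already be present in the 16-bit data. A minimal sketch for checking a raw result buffer, assuming it holds little-endian IEEE 754 half floats (`decode_fp16` and `summarize` are hypothetical helper names, and the sample bytes are hand-built, not real NCS output):

```python
import math
import struct

def decode_fp16(raw):
    """Decode a little-endian buffer of IEEE 754 half floats into Python floats."""
    count = len(raw) // 2
    return list(struct.unpack("<%de" % count, raw[:count * 2]))

def summarize(values):
    """Count nan/inf entries and report the largest finite value."""
    finite = [v for v in values if math.isfinite(v)]
    return {
        "count": len(values),
        "nan": sum(1 for v in values if math.isnan(v)),
        "inf": sum(1 for v in values if math.isinf(v)),
        "max_finite": max(finite) if finite else None,
    }

# Hand-built sample bit patterns: 1.0, largest finite fp16 (65504), +inf, nan.
sample = struct.pack("<4H", 0x3C00, 0x7BFF, 0x7C00, 0x7E00)
values = decode_fp16(sample)
print(summarize(values))
```

Running the same summary on the 21080-byte result buffer would show whether all 10540 entries are nan, or whether some are inf or merely very large.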

idata · 11-27-2017

@ssliu Thanks for reporting this. I would like to test the networks you tried and see if I can reproduce this issue myself. If you could provide a link to the networks you used, that would be very useful. Thanks again.

idata · 11-27-2017

@Tome_at_Intel How can I share the networks with you? Should I send the large model, deploy.prototxt, and an image for detection?

idata · 11-27-2017

@ssliu GitHub or Dropbox links would work.

idata · 11-27-2017

@Tome_at_Intel The GitHub link is https://github.com/sunnySSliu/NCS_comunicate.git. It uses Git LFS.

idata · 11-27-2017

@ssliu Thanks for providing your network and images. I tried your network and had no issues running your program, although I did receive very large probabilities. For your provided image file 012.jpg, I received an index of 9309 and a probability of 506.50000000. Please try this again and see if you get the same result. Thanks.

idata · 11-28-2017

@Tome_at_Intel You may need to revise the makefile. As it stands, the output is not taken from the last layer; you can delete the "-on" parameter of mvNCCompile in the makefile. Thanks.

idata · 11-28-2017

The large probabilities seem wrong: the values grow larger and larger from layer to layer until they overflow, so at one of the layers the output becomes "nan" or "inf". I have also run the model on a PC, and part of the output of the fc5 layer is shown below:

-1.12684 -0.263021 -1.17812 1.32531 0.319905 -0.0244644 -2.2793 0.140543 0.321131 -0.328076 0.985858 -3.03365 0.615863 -0.919282 -0.639025 1.48831 -1.14018 -0.144667 0.601186 0.142621 -1.43988 0.475403 -1.26686 -0.507238 -0.93005 0.427848 -0.728527 -1.86215 0.448485 -0.182363 -1.65505 -0.790986 -0.065641 -3.27148 0.636745 1.10612 -1.17482 -2.7694 -0.329878 -1.14171 -1.28136 1.32891 -0.511056 -0.557872 0.354566 0.340552 -0.807782 -1.80585 -1.60028 -0.346055 0.224037 -0.297021 -0.0793115 -1.65875 0.320309 -0.0186328 0.607013 0.784389 -0.377752 -1.7018 -2.22441 1.67076 0.251789 0.551989 -1.32332 0.742055 -0.366883 0.578996 0.35062 0.147858 -0.918806 -0.347651 -1.25856 2.46879 0.417314 -1.09436 -1.0119 0.511565 -1.68811 0.5903 2.67939 -1.11716 -2.07182 1.58757 -1.29253 -1.78229 -0.485864 -1.05624 -0.926988 -0.748615 0.904353 -0.1299 -0.541021 1.06192 -1.14407 0.589872 -0.942095 1.28527 1.05734 -0.297605 -1.93476 -0.526176 0.713658 0.37538 -0.156158 0.365746 -0.750868 -0.835334 -0.896935 0.678555 0.0420146 0.716639 -0.000560202 -0.681816 0.856575 -0.132422 -1.92006 1.18765 -0.254363 -0.525324 0.100624 -0.374459 -1.40312 -1.86245 1.22713 -0.293825 2.31105 -0.885568 -1.46361 0.874743 -1.66305 -1.88179 -0.49293 -0.779195 1.70638 1.24512 -0.279403 2.41547 0.582597 -0.406534 -1.8943 -1.15343 -2.33131 -0.339133 -0.386461 -0.943876 -0.981108 0.807584 -0.916267 0.922208 0.369756 -0.396057 -1.15999 1.67288 1.22885 -1.38413 -0.650468

They are not very large. Could this be caused by the model being trained in float? Or by the input being 96*112, which is not square?
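To illustrate how a value that is harmless on a PC can become "inf" and then "nan" in half precision, here is a rough sketch. It assumes a softmax evaluated naively in fp16, which the thread does not confirm for the NCS; `to_fp16`, `exp16`, and both softmax functions are hypothetical helpers that emulate fp16 rounding with Python's `struct` module. The logit 506.5 is the value reported earlier in the thread; the other two logits are made up.

```python
import math
import struct

def to_fp16(x):
    """Round a float to the nearest fp16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

def exp16(x):
    """exp(x) rounded to fp16; anything above 65504 becomes inf."""
    try:
        return to_fp16(math.exp(x))
    except OverflowError:
        return math.inf

def softmax_naive16(logits):
    exps = [exp16(v) for v in logits]
    total = to_fp16(sum(exps))
    # inf / inf is nan, and finite / inf is 0.0
    return [to_fp16(e / total) for e in exps]

def softmax_stable16(logits):
    m = max(logits)                      # subtract the max before exponentiating
    exps = [exp16(v - m) for v in logits]
    total = to_fp16(sum(exps))
    return [to_fp16(e / total) for e in exps]

logits = [506.5, 3.0, -1.0]
print(to_fp16(70000.0))                  # inf: above the fp16 maximum of 65504
print(softmax_naive16(logits))           # first entry is nan
print(softmax_stable16(logits))          # [1.0, 0.0, 0.0]
```

In fp16, exp(x) already overflows for x above roughly 11.09 (the natural log of 65504), so intermediate activations do not need to be enormous for this to happen.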

idata · 11-29-2017

@Tome_at_Intel How is the debugging going on your side?

idata · 11-29-2017

@ssliu It would be better to try square images, see whether the same issue appears, and then compare the results.

idata · 11-30-2017

@georgievm_cms I have already tried this. Since the model was trained at 96*112, I get the same erroneous output if I give it a square image.

idata · 11-30-2017

@ssliu I think the right test in your situation would be to train the model at either 96x96 or 112x112 and then use test images of the same size as the training images.

idata · 12-01-2017

@georgievm_cms We will train it with square inputs as soon as our server is idle. However, that cannot solve the fundamental problem, right? Please help debug the original problem. Thanks a lot.

idata · 01-17-2018

Hey, I encountered the same problem. There may be many causes, but one of them, and the one in my case, is that the input tensor must be in float16 format, regardless of the network's data type (as far as I have experienced). This is not addressed in any docs or commented in the example code.
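A small sketch of why the input format matters, assuming the device reads the input buffer as little-endian fp16 values. The helper names and pixel values are hypothetical, and the 1/128 scale is taken from the original post. Bytes written as fp32 but read as fp16 yield twice as many values, none of which match the intended ones.

```python
import struct

def to_fp16_tensor(pixels, scale=1.0 / 128):
    """Scale raw pixel values and pack them as little-endian fp16 bytes."""
    scaled = [p * scale for p in pixels]
    return struct.pack("<%de" % len(scaled), *scaled)

def read_as_fp16(raw):
    """Interpret a byte buffer as little-endian fp16 values."""
    count = len(raw) // 2
    return list(struct.unpack("<%de" % count, raw[:count * 2]))

pixels = [0, 64, 128, 255]            # hypothetical 8-bit pixel values

# Correct: the buffer really contains fp16 values.
good = read_as_fp16(to_fp16_tensor(pixels))
print(good)                           # [0.0, 0.5, 1.0, 1.9921875]

# Wrong: the same scaled values packed as fp32, then read as fp16.
fp32_bytes = struct.pack("<%df" % len(pixels), *[p / 128 for p in pixels])
bad = read_as_fp16(fp32_bytes)
print(len(bad), bad)                  # 8 values instead of 4, with different contents
```

With real image data the misread values are effectively arbitrary, which is one plausible route to the overflow and "nan" results discussed above.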

idata · 01-17-2018

@dvpo There is more information about using the NCSDK with the NCS at https://movidius.github.io/ncsdk/. For details on passing the input tensor as float16 data, see the C and Python API sections of the documentation site at https://movidius.github.io/ncsdk/c_api/mvncLoadTensor.html and https://movidius.github.io/ncsdk/py_api/Graph.LoadTensor.html.

idata · 01-19-2018

@Tome_at_Intel I did not realize that there is documentation on this. Thanks for pointing that out.