Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all things computer-vision-related on Intel® platforms.
6403 Discussions

Incorrect inference results from a minimal tensorflow model

idata
Employee
3,287 Views

Hi,

 

I have a minimal example of a trivial tensorflow (v1.4) conv net that I train (to overfitting) with only two examples, freeze, convert with mvNCCompile, and then test on a compute stick.

 

The code, and steps, are fully described in the github repo movidius_minimal_example

 

No steps have warnings or errors; but the inference results I get on the stick are incorrect.

 

What should be my next debugging step?

 

Thanks,

 

Mat

 

Note: mvNCCheck does fail, but I'm unsure if it's because of the structure of my minimal example…

 

$ mvNCCheck graph.frozen.pb -in imgs -on output
mvNCCheck v02.00, Copyright @ Movidius Ltd 2016
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py:766: DeprecationWarning: builtin type EagerTensor has no __module__ attribute
  EagerTensor = c_api.TFE_Py_InitEagerTensor(_EagerTensorBase)
/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/tf_inspect.py:45: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
  if d.decorator_argspec is not None), _inspect.getargspec(target))
USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1)
1) 0 0.46216
Expected: (1, 1)
1) 0 0.79395
------------------------------------------------------------
Obtained values
------------------------------------------------------------
Obtained Min Pixel Accuracy: 41.789668798446655% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: 41.789668798446655% (max allowed=1%), Fail
Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: 41.789667896678964% (max allowed=1%), Fail
Obtained Global Sum Difference: 0.331787109375
------------------------------------------------------------
0 Kudos
22 Replies
idata
Employee
2,539 Views

Reducing the model by removing the convolutional layers makes it work. So I've got something to work on; it's something about the convolutions…

0 Kudos
idata
Employee
2,539 Views

If I use tf.layers.conv2d(model, filters=5, kernel_size=3) or slim.conv2d(model, num_outputs=5, kernel_size=3) I get the same results.

 

With padding=SAME they both give incorrect inference results.

 

With padding=VALID they both throw an exception…

 

Traceback (most recent call last):
  File "./test_inference_on_ncs.py", line 29, in <module>
    output, _user_object = graph.GetResult()
  File "/usr/local/lib/python3.5/dist-packages/mvnc/mvncapi.py", line 264, in GetResult
    raise Exception(Status(status))
Exception: mvncStatus.MYRIAD_ERROR

 

I can't see what's fundamentally different between my definition of the conv2d layer compared to the models defined in the slim model zoo ….
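For anyone comparing the two padding modes: for a stride-1 convolution they differ only in the output spatial size, which is why swapping one for the other changes the graph shapes the compiler has to handle. A quick sketch of the arithmetic (plain Python, function name my own, not from the repo):

```python
# Output spatial size of a stride-1 conv for the two TF padding modes:
# SAME pads so the output keeps the input size; VALID shrinks by (kernel - 1).
def conv_out_size(in_size, kernel, padding):
    if padding == "SAME":
        return in_size                 # e.g. 32 -> 32
    elif padding == "VALID":
        return in_size - kernel + 1    # e.g. 32 -> 30 for a 3x3 kernel
    raise ValueError("unknown padding: %s" % padding)

assert conv_out_size(32, 3, "SAME") == 32
assert conv_out_size(32, 3, "VALID") == 30
```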

0 Kudos
idata
Employee
2,539 Views

OK…. so a conv layer with >=8 outputs works (as mentioned in this post)

 

I agree with this comment though: what is the approach we should use for a fully convolutional image-to-image architecture (e.g. U-Net, pix2pix) where we want the last output layer to have either 1 or 3 channels, representing a black-and-white or RGB image?

0 Kudos
idata
Employee
2,539 Views

The method I'm going to use is to just output 8 channels in the final layer and, for training (and inference), slice off the first channel for the loss calculation. It's unneeded work, but it will get me going.
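The workaround can be sketched like this (NumPy stand-in for the actual TF graph; shapes and names are illustrative, not from the repo):

```python
import numpy as np

# Pretend this is the final conv layer's output: 8 channels to satisfy the
# >=8 output constraint, even though only channel 0 carries the real signal.
batch, h, w = 1, 32, 32
net_output = np.random.rand(batch, h, w, 8).astype(np.float32)

# Slice off the first channel for the loss (training) and prediction (inference);
# the other 7 channels are wasted work but keep the compiler happy.
prediction = net_output[..., :1]
assert prediction.shape == (batch, h, w, 1)
```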

0 Kudos
idata
Employee
2,539 Views

@matpalm Can you post your frozen model so I can try to reproduce this issue on my bench?

0 Kudos
idata
Employee
2,539 Views

Sure, I'll do that today. I have a bundle of related cases with repro steps. I'll fork the GitHub repo I mentioned above with test cases for each, and I'll check the models in too.

0 Kudos
idata
Employee
2,539 Views

See this GitHub repo for models, reproduction steps, etc.: https://github.com/matpalm/movidius_bug_reports

 

conv_with_8_filters works

 

conv_with_6_filters (the same model but with 6 channels) fails
0 Kudos
idata
Employee
2,539 Views

also added an example of deconv failing with padding='SAME' under deconv_padding_same

0 Kudos
idata
Employee
2,539 Views

also added an example of the output shape being wrong after a conv -> deconv stack: conv_deconv_output_shape_wrong

0 Kudos
idata
Employee
2,539 Views

At this point I can't get any workaround to work; all the combos I can think of to hack my way around not being able to use num_channels=1 fail. So I'll put this project on hold and either check again next SDK release, or sooner if you have things you'd like me to test further…

0 Kudos
idata
Employee
2,539 Views

@matpalm I'll be reviewing this issue today and I will get back to you if I need/find anything. Thanks.

0 Kudos
idata
Employee
2,539 Views

Great, thanks! No rush from my end; I'm going to be away for a couple of weeks. I have other cases but didn't get time to make test cases for them; I suspect they might be all the same underlying problem…

0 Kudos
idata
Employee
2,539 Views

@matpalm I have been able to reproduce your issues. At the moment, our SDK requires the output from convolution layers to be >= 8, as you've already seen. A possible workaround is to add a conf file to the current working directory, giving it the same name as your pb or meta file. For example, if your model's name is "model.meta", then you create a brand new file called "model.conf".

 

In the conf file, add each convolution layer with an output that is less than 8 and, on the line right below it, add the line "generic_spatial". This selects a generic spatial convolution function. See the example below. Make sure to have an additional empty line as the last line of the conf file, or else the SDK won't parse it correctly. Let me know if this helps. Thanks.

 

conv1
generic_spatial
conv2
generic_spatial
conv3
generic_spatial
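A quick sketch of setting this up from the shell (file and layer names hypothetical; the `printf` ends with two newlines to leave the trailing blank line the parser needs):

```shell
# Model is model.meta, so the conf file must be model.conf in the cwd.
# Each conv layer with <8 outputs is followed by "generic_spatial", and the
# file ends with an extra blank line so the SDK parses it correctly.
printf 'conv1\ngeneric_spatial\nconv2\ngeneric_spatial\n\n' > model.conf
cat model.conf
```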
0 Kudos
idata
Employee
2,539 Views

Thanks @Tome_at_Intel .

 

Got to try this today, but it still doesn't work, sorry. The mode of failure is the same as far as I can see: the output from running the frozen network on the host differs from running it on the compute stick…

 

e.g.

 

host_positive_prediction (1,) [ 1.]
host_negative_prediction (1,) [ 1.38149336e-09]
ncs_positive_prediction (1,) [ 0.99902344]
ncs_negative_prediction (1,) [ 1.]

 

Just to confirm I've done the config correctly though;

 

When I include a conf file ….

 

e1/Conv2D
generic_spatial

 

… and run ./test.sh conv_with_6_filters, the mvNCCompile output includes

 

Spec opt found opt_conv_generic_spatial 1<< 10
Layer (a) e1/Conv2D use the optimisation mask which is: 0x400
0 0x80000000
Layer fully_connected/MatMul use the generic optimisations which is: 0x80000000
0 0x80000000
Layer output use the generic optimisations which is: 0x80000000

 

( whereas when I include an empty conf file I see just

 

0 0x80000000
Layer e1/Conv2D use the generic optimisations which is: 0x80000000
0 0x80000000
Layer fully_connected/MatMul use the generic optimisations which is: 0x80000000
0 0x80000000
Layer output use the generic optimisations which is: 0x80000000

 

)

 

So it appears to be picking up the config, but it still doesn't work?

0 Kudos
idata
Employee
2,539 Views

@matpalm Thanks for reporting this issue. I went back and ran your ./test.sh script multiple times without the conf file. It seems that I am able to get a passing result sometimes from the script using conv_with_6_filters.

 

host_positive_prediction (1,) [0.49975544]
host_negative_prediction (1,) [0.49975544]
ncs_positive_prediction (1,) [0.5]
ncs_negative_prediction (1,) [0.5]
PASS conv_with_6_filters

 

However sometimes I get failing results when running the test.sh script:

 

host_positive_prediction (1,) [0.49975544]
host_negative_prediction (1,) [0.49975544]
ncs_positive_prediction (1,) [0.5]
ncs_negative_prediction (1,) [0.492]
FAIL conv_with_6_filters

 

Not sure why this is happening when the same network is generated each time.

0 Kudos
idata
Employee
2,539 Views

Yeah, apologies on my part; this is a fault of my overly simple reproduction script. To save time I've set things up to do a minimal training run that tries to build a classifier mapping one example to 0.0 and the other to 1.0. The script runs a simple optimiser loop for a very short time, and it's possible for the optimisation to fail; in those cases you see what you've reported here: both host_positive_prediction and host_negative_prediction are 0.5. I see this sometimes when I run the script. The workaround is to rerun the script when this happens until you get host_positive_prediction 1.0 and host_negative_prediction 0.0. I should fix this; even if the optimisation takes longer, it's better for the reproduction to be more reliable…
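The rerun-until-converged workaround can be sketched like this (`train_once` is a stand-in for the repo's short training run, not code from it):

```python
# Retry a short, flaky training run until it actually separates the two
# examples, rather than accepting a collapsed 0.5 / 0.5 fit.
def train_until_separated(train_once, max_tries=10, tol=0.1):
    for _ in range(max_tries):
        pos, neg = train_once()  # returns (positive_prediction, negative_prediction)
        if pos > 1.0 - tol and neg < tol:
            return pos, neg      # converged: ~1.0 and ~0.0
    raise RuntimeError("optimiser never separated the two examples")

# Toy stand-in: collapses to 0.5 twice, then converges on the third run.
runs = iter([(0.5, 0.5), (0.5, 0.5), (0.99, 0.01)])
pos, neg = train_until_separated(lambda: next(runs))
assert (pos, neg) == (0.99, 0.01)
```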

0 Kudos
idata
Employee
2,539 Views

( let me fix this so it's reproducible every run; I'll ping you again when that's done )

0 Kudos
idata
Employee
2,539 Views

Hey @Tome_at_Intel after being away for a bit I've revisited this, thanks for waiting.

 

I've reduced it to an even more minimal example that, for me, trains every time (made more stable by using a network that does a simpler regression now). Can you take another look? https://github.com/matpalm/movidius_bug_reports

 

Running ./test.sh conv_with_regression three times I get …

 

expected positive_prediction [10]
expected negativee_prediction [5]
host_positive_prediction (1,) [ 10.]
host_negative_prediction (1,) [ 5.]
ncs_positive_prediction (1,) [ 4.50390625]
ncs_negative_prediction (1,) [ 4.19921875]

 

expected positive_prediction [10]
expected negativee_prediction [5]
host_positive_prediction (1,) [ 10.]
host_negative_prediction (1,) [ 5.]
ncs_positive_prediction (1,) [ 5.109375]
ncs_negative_prediction (1,) [ 4.82421875]

 

expected positive_prediction [10]
expected negativee_prediction [5]
host_positive_prediction (1,) [ 10.]
host_negative_prediction (1,) [ 4.99999952]
ncs_positive_prediction (1,) [ 5.5]
ncs_negative_prediction (1,) [ 5.03125]
0 Kudos
idata
Employee
2,539 Views

Updated to v2 of the API in the hope it might include a fix, but I'm still getting problems.

 

git clone https://github.com/matpalm/movidius_bug_reports

 

and run run_all_tests.sh, if someone has time to sanity-check what I'm doing wrong…

0 Kudos
idata
Employee
2,052 Views

Hi, I have been running the minimal example on Ubuntu 16.04 with NCSDK 2.05, setting the padding to VALID, which now works. It seems to have fixed the error for all filter counts under 8.

 

I am still having a problem with the conv_with_regression output though; something weird is going on at the flatten layer.

 

I cannot run the other examples (conv 6 filter, conv shape wrong) as the output node BiasAdd doesn't exist. I tried using output/Sigmoid, as that looks like what it should be from the graph file. What is the output node for these examples?

 

WITH FILTERS 5 PADDING SAME:

-ve prediction [[0.00200709]]
+ve prediction [[0.99799144]]
zeros prediction [[0.4813904]]
ones prediction [[0.68682736]]
ncs_negative_prediction (1,) [0.3876953]
ncs_positive_prediction (1,) [0.26342773]
ncs_zeros_prediction (1,) [0.48120117]
ncs_ones_prediction (1,) [0.40063477]

WITH FILTERS 5 PADDING VALID:

-ve prediction [[0.00233787]]
+ve prediction [[0.9975262]]
zeros prediction [[0.5118382]]
ones prediction [[0.43551898]]
ncs_negative_prediction (1,) [0.00232887]
ncs_positive_prediction (1,) [0.9970703]
ncs_zeros_prediction (1,) [0.51171875]
ncs_ones_prediction (1,) [0.43603516]

WITH FILTERS 4 PADDING SAME:

-ve prediction [[0.00228436]]
+ve prediction [[0.9978607]]
zeros prediction [[0.47025326]]
ones prediction [[0.18113402]]
ncs_negative_prediction (1,) [0.45507812]
ncs_positive_prediction (1,) [0.46972656]
ncs_zeros_prediction (1,) [0.49316406]
ncs_ones_prediction (1,) [0.42529297]

WITH FILTERS 4 PADDING VALID:

-ve prediction [[0.00204596]]
+ve prediction [[0.99782455]]
zeros prediction [[0.48026168]]
ones prediction [[0.2394675]]
ncs_negative_prediction (1,) [0.00204659]
ncs_positive_prediction (1,) [0.9980469]
ncs_zeros_prediction (1,) [0.4802246]
ncs_ones_prediction (1,) [0.23840332]

WITH FILTERS 3 PADDING VALID:

-ve prediction [[0.00195853]]
+ve prediction [[0.99770445]]
zeros prediction [[0.5915294]]
ones prediction [[0.6413879]]
ncs_negative_prediction (1,) [0.00197029]
ncs_positive_prediction (1,) [0.9980469]
ncs_zeros_prediction (1,) [0.5917969]
ncs_ones_prediction (1,) [0.64208984]

WITH FILTERS 2 PADDING VALID:

-ve prediction [[0.00195683]]
+ve prediction [[0.99803835]]
zeros prediction [[0.4144791]]
ones prediction [[0.27988973]]
ncs_negative_prediction (1,) [0.00194931]
ncs_positive_prediction (1,) [0.9980469]
ncs_zeros_prediction (1,) [0.41430664]
ncs_ones_prediction (1,) [0.27954102]
0 Kudos