A Hybrid Deep Boltzmann Machine for Contextual Scene Modeling
Scene models allow robots to reason about what is in the scene, what else should be in it, and what should not be in it. In this paper, we propose a hybrid Boltzmann Machine (BM) for scene modeling that integrates relations between objects. To do so, we extend the BM with tri-way edges between visible (object) nodes and make the network share relations across different objects. We evaluate our method against several baseline models (Deep Boltzmann Machines and Restricted Boltzmann Machines) on a scene classification dataset and show that it performs better on several scene reasoning tasks.
What is (missing or wrong) in the scene? A Hybrid Deep Boltzmann Machine for Contextualized Scene Modeling (accepted for ICRA 2018)
Method & Results
We propose a tri-way deep BM for scene modeling. Visible nodes are separated into two groups: relations (r) and objects (v). Each relation node is shared between two object nodes. All visible nodes (both relation and object nodes) are connected to the units of the first hidden layer, and hidden layers are stacked to make the model deep.
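To make the tri-way structure concrete, the following is a minimal NumPy sketch of an energy function with three-way (object, object, relation) terms and shared relation weights. All dimensions, weight names, and the exact factorization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: n_obj object nodes, n_rel relation types, n_hid hidden units.
n_obj, n_rel, n_hid = 5, 3, 8

# Pairwise weights from visible nodes (objects, relations) to the first hidden layer.
W_obj = rng.normal(scale=0.1, size=(n_obj, n_hid))
W_rel = rng.normal(scale=0.1, size=(n_rel, n_hid))  # shared across all object pairs

# Tri-way term: one weight per relation type, shared by every (object_i, object_j) pair.
T = rng.normal(scale=0.1, size=n_rel)

def energy(v, r, h):
    """Energy of the tri-way BM sketch (biases omitted for brevity).
    v: (n_obj,) object activations; h: (n_hid,) hidden activations;
    r: (n_obj, n_obj, n_rel) relation activations between object pairs."""
    e = -v @ W_obj @ h                              # object-to-hidden term
    e -= np.einsum('ijk,kh,h->', r, W_rel, h)       # relation-to-hidden term (shared weights)
    e -= np.einsum('i,j,ijk,k->', v, v, r, T)       # tri-way object-object-relation term
    return float(e)
```

Sharing `W_rel` and `T` across object pairs is what lets the model transfer a relation learned between one pair of objects to other pairs.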
Gibbs sampling is used for inference instead of variational inference since our dataset is relatively small (3485 samples in total) and the input vectors are very sparse (i.e., only a small number of relation nodes are active for each sample). Gibbs sampling is a Markov Chain Monte Carlo method whose estimates become exact as the chain converges, and such accurate inference is important for our problem given this sparsity.
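For readers unfamiliar with the procedure, here is a generic Gibbs-sampling sweep for a binary RBM (alternately sampling hidden units given visible units and vice versa); the toy weights are assumptions for illustration, not the trained model from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b_v, b_h):
    """One Gibbs sweep for a binary RBM: sample h ~ p(h|v), then v ~ p(v|h)."""
    p_h = sigmoid(v @ W + b_h)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b_v)
    v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, h

# Illustrative run with random toy weights.
n_vis, n_hid = 6, 4
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)

v = rng.integers(0, 2, size=n_vis).astype(float)
for _ in range(100):  # burn-in; the chain approaches the model distribution
    v, h = gibbs_step(v, W, b_v, b_h)
```

In the deep tri-way model the same alternating scheme applies layer by layer, with the tri-way terms entering the conditionals of the visible units.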
We conduct several experiments that include scenarios robots can encounter in real-world problems. We compare our model (Tri-way BM) with a Restricted Boltzmann Machine (RBM) and a General Boltzmann Machine (GBM), where the GBM has bidirectional links within the input layer and stacked hidden layers.
The first experiment is the estimation of relations between the objects in the scene for the current context. The model first determines the context using only the objects in the scene; then, using the context and the object activations, it estimates the relations among the objects in the scene. For this task, we define accuracy as the percentage of relations correctly estimated with respect to the labeled relations in the test dataset. Our model achieves the highest accuracy (Table 1).
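The accuracy definition above can be sketched as a simple set comparison; the triple format `(object_a, relation, object_b)` is an assumed representation for illustration.

```python
def relation_accuracy(predicted, labeled):
    """Fraction of labeled relations that the model estimated correctly.
    Both arguments are sets of (object_a, relation, object_b) triples."""
    if not labeled:
        return 1.0  # nothing to recover
    return len(predicted & labeled) / len(labeled)
```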
The second experiment is finding missing object(s) in the scene according to the current context. In this task, we randomly deactivate an object in the scene and expect the network to find the missing object. For this task, we define accuracy as the percentage of missing objects found correctly in the test dataset. Our model performs best (Table 2).
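The evaluation loop for this task can be sketched as follows. The `reconstruct` callback stands in for whichever model is being evaluated (Tri-way BM, DBM, or RBM); its name and the exact scoring rule are assumptions.

```python
import numpy as np

def missing_object_accuracy(test_scenes, reconstruct, rng):
    """For each scene (a binary object vector), deactivate one random active
    object and count a hit if the model's reconstruction ranks the removed
    object highest among the currently inactive objects."""
    hits = 0
    for scene in test_scenes:
        v = scene.copy()
        active = np.flatnonzero(v)
        removed = rng.choice(active)
        v[removed] = 0.0
        probs = np.asarray(reconstruct(v), dtype=float)
        probs = np.where(v > 0, -np.inf, probs)  # only rank inactive objects
        if np.argmax(probs) == removed:
            hits += 1
    return hits / len(test_scenes)
```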
The third experiment is finding objects that are out of the current context in the scene. For this task, we select scenes, randomly remove an object, and randomly add another object that is not in the scene. We define an error measure indicating how much the reconstructed data differs from the original data (i.e., the scene without the extra object). Our model performs best (Table 3).
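One natural instantiation of such an error measure is the mean absolute difference between the clean scene vector and the model's reconstruction of the corrupted scene; the specific choice of metric here is an assumption for illustration.

```python
import numpy as np

def out_of_context_error(original, reconstructed):
    """Mean absolute difference between the clean scene vector and the
    reconstruction of the corrupted scene (lower is better)."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return float(np.mean(np.abs(original - reconstructed)))
```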
The last experiment is random scene generation. In this task, we randomly activate hidden units, and the visible units are sampled according to the context determined by the hidden units.
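The generation procedure above can be sketched for a single-layer case: draw a random binary hidden vector, then sample the visible (object) units from the conditional p(v | h). The weights here are placeholders, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_scene(W, b_v, n_hid):
    """Randomly activate hidden units, then sample visible (object) units
    from p(v | h) -- a single-layer sketch of the generation procedure."""
    h = rng.integers(0, 2, size=n_hid).astype(float)
    p_v = sigmoid(h @ W.T + b_v)
    return (rng.random(p_v.shape) < p_v).astype(float)

# Illustrative call with toy weights: 3 objects, 4 hidden units.
scene = generate_scene(rng.normal(scale=0.1, size=(3, 4)), np.zeros(3), 4)
```

In the deep model, sampling would proceed top-down layer by layer before reaching the object and relation nodes.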