Vision, Interaction, and Language (VIL)


Under the affordance framework formalized by J.J. Gibson in 1979, for an agent to understand an environment does not simply mean to understand the physics behind it, but to understand what the environment offers to the agent. For example, a sphere affords being rolled in certain environments under certain physical conditions, whereas a cube affords being slid instead. Knowing whether an object rolls or slides therefore tells us a great deal about that object and how it behaves under different circumstances.

A comprehensive understanding of affordances is embedded in humans’ core knowledge systems, but today’s machines are far from acquiring similar forms of commonsense reasoning. We therefore want to design an intelligent system that can capture the notion of affordances through perception, translate this knowledge into language, and thereby ground language in experience.


<ENVIRONMENT> The room has white walls. The floor is a blue rug. <ENVIRONMENT> <OBJECTS> The scene has green ring, blue cylinder, green cube, orange hexagonal prism, purple rectangular prism, and brown hemisphere. <OBJECTS> <POSITIONS> The green ring is on top of the blue cylinder. (...) The brown hemisphere is at the bottom. <POSITIONS> <ACTIONS> The green ring fell down. (...) The brown hemisphere first rolled to one side, but then came back to its initial orientation. <ACTIONS>

<ENVIRONMENT> The room has white walls. The floor is a blue rug. <ENVIRONMENT> <OBJECTS> The scene has black cone, black sphere, blue cylinder, pink cube, black cube, purple cube, and green cylinder. <OBJECTS> <POSITIONS> The black cone is on top of the black sphere. (...) The green cylinder is at the bottom. <POSITIONS> <ACTIONS> The black cone and the black sphere fell down. The black sphere slid through the side of the black cone. The black sphere rolled, and then came to a halt. The rest of the objects did not move. <ACTIONS>
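Annotations in this format can be split back into their sections programmatically. The sketch below is a minimal illustration, not part of the VIL system itself: it assumes each section is delimited by a pair of identical sentinel tags (e.g. `<OBJECTS> ... <OBJECTS>`), as in the examples above, and the tag names are taken directly from those examples.

```python
import re

def parse_annotation(text):
    """Split a scene annotation into its tagged sections.

    Assumes the sentinel-tag format shown above, where each section is
    enclosed by a pair of identical tags, e.g. "<ACTIONS> ... <ACTIONS>".
    This is an illustrative sketch, not the dataset's official parser.
    """
    sections = {}
    for tag in ("ENVIRONMENT", "OBJECTS", "POSITIONS", "ACTIONS"):
        match = re.search(rf"<{tag}>(.*?)<{tag}>", text, flags=re.DOTALL)
        if match:
            sections[tag] = match.group(1).strip()
    return sections

annotation = (
    "<ENVIRONMENT> The room has white walls. The floor is a blue rug. <ENVIRONMENT> "
    "<OBJECTS> The scene has black cone, black sphere, and blue cylinder. <OBJECTS> "
    "<ACTIONS> The black cone and the black sphere fell down. <ACTIONS>"
)
print(parse_annotation(annotation)["ACTIONS"])
```

Because the same token opens and closes a section, a non-greedy match between two occurrences of the tag recovers each section's text.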