Methods and systems use one or more machine learning models to automatically generate three-dimensional sound. A computing device accesses a multimodal content item and, based on the multimodal content item, automatically generates the three-dimensional sound using the one or more machine learning models.