This paper presents a vision-based fall detection system that automatically monitors and detects falls, particularly among elderly people and patients. For video analysis, the system must extract both spatial and temporal features so that the model captures appearance and motion information simultaneously. Our approach is based on 3-dimensional convolutional neural networks, which learn spatiotemporal features directly from video clips. In addition, we adopt a thermal camera to address usability, day-and-night surveillance, and privacy concerns. We design a pan-tilt camera with two actuators to extend the field of view. Performance is evaluated on our thermal dataset, the TCL Fall Detection Dataset. The proposed model achieves 90.2% average clip accuracy, which is higher than that of the other approaches compared.
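To illustrate why a 3-dimensional convolution can capture motion as well as appearance, the following is a minimal pure-Python sketch, not the network used in this paper. The function `conv3d_valid`, the toy clips, and the hand-set `temporal_diff` kernel are all hypothetical and chosen only for illustration; in a trained 3D CNN the kernel weights are learned rather than fixed.

```python
def conv3d_valid(clip, kernel):
    """Valid (no padding, stride 1) 3D cross-correlation.

    clip   is indexed as clip[t][y][x]      (frames, rows, columns)
    kernel is indexed as kernel[dt][dy][dx]
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel spans time (dt) as well as space (dy, dx).
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical toy clips: 3 frames of 3x3 pixels each.
# A warm pixel that stays in place (no motion).
static_clip = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]] for _ in range(3)]
# A warm pixel that moves along the diagonal from frame to frame.
moving_clip = [
    [[1, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 1]],
]

# Hand-set kernel of size 2x1x1: next frame minus current frame.
temporal_diff = [[[-1.0]], [[1.0]]]

static_response = conv3d_valid(static_clip, temporal_diff)
moving_response = conv3d_valid(moving_clip, temporal_diff)
```

In this sketch the static clip produces an all-zero response, while the moving clip produces nonzero values wherever the pixel leaves or arrives. A purely 2D kernel applied frame by frame could not distinguish the two clips in this way, since each individual frame contains just one warm pixel.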