[ Article ]

The Journal of Korean Institute of Information Technology - Vol. 19, No. 3, pp.25-34

ISSN: 1598-8619 (Print) 2093-7571 (Online)

Print publication date 31 Mar 2021

Received 19 Jan 2021 Revised 23 Mar 2021 Accepted 26 Mar 2021

DOI: https://doi.org/10.14801/jkiit.2021.19.3.25

Traffic Accident Detection in First-Person Videos based on Depth and Background Motion Estimation

Jongwook Si^*, Sungyoung Kim^**

*MS Course, Department of Computer Engineering, Kumoh National Institute of Technology
**Professor, Department of Computer Engineering, Kumoh National Institute of Technology

Correspondence to: Sungyoung Kim, Dept. of Computer Engineering, Kumoh National Institute of Technology, 61 Daehak-ro (Yangho-dong), Gumi, Gyeongbuk, 39177, Korea. Tel.: +82-54-478-7530, Email: sykim@kumoh.ac.kr

Abstract

Vehicle dashboard-mounted cameras (also known as vehicle black boxes) are very helpful in preventing vehicle-related crime and in building an intelligent transportation system (ITS). The videos captured by these cameras are ego-motion videos. However, most existing methods detect traffic accidents in videos captured by fixed CCTV cameras. The situation is quite different in first-person videos because even stationary objects appear to move. In this paper, we propose a method to determine whether a traffic accident has occurred in a first-person video. Accident detection is based on future frame prediction, and the frame prediction is performed by a GAN. The generator of the GAN generates a predicted frame from several previous frames. We then use a traffic accident score computed between the predicted frame and the real frame to determine whether there is an accident. However, it is very difficult to predict frames when the background moves, as it does in first-person video. We therefore use depth and background motion estimation to improve the performance of future frame prediction. The proposed method shows better performance than previous studies.

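Abstract (Korean version)

Cameras mounted in vehicles (vehicle black boxes) are very useful for preventing vehicle-related crime and for building an intelligent transportation system (ITS). Video captured by an in-vehicle camera includes the motion of the camera itself. However, most existing studies on video-based traffic accident detection detect accidents in video captured by fixed CCTV cameras. In first-person video, motion appears even in stationary objects, so the situation differs considerably from that of CCTV footage. This paper proposes a method for determining whether a traffic accident has occurred in first-person video. Accident detection is based on future frame prediction, and frame prediction is performed with a GAN. The generator of the GAN predicts a future frame from previous frames. A traffic accident score between the predicted frame and the real frame is computed to decide whether an accident has occurred. Because frame prediction is very difficult when the background moves, this paper uses depth and background motion estimation to improve frame prediction performance. The proposed method shows better performance than previous studies.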

Keywords:

GAN, traffic accident detection, depth estimation, background motion estimation, first-person videos

Ⅰ. Introduction

As means of transportation have developed, many people have been dying in traffic accidents. According to statistics [1], the number of registered vehicles in Korea has been increasing by 3-4% every year, and it reached 23.68 million units in 2019. As the demand for personal transportation increases, so does the risk of accidents. According to other statistics [2], 46,208 people have died and about 3.4 million have been injured in traffic accidents in the last 10 years. While the death toll continues to decrease every year, the number of injured remains above 300,000. According to a research article [3], the need for dashboard-mounted cameras (i.e., automotive black boxes) in vehicles has been increasing: the installation rate of the cameras in vehicles was just 38.2% in 2013, but 88.9% in 2019.

The cameras are very helpful in preventing hit-and-run and vehicle-related crime. In addition, the construction of an intelligent transportation system may be accelerated by the increase in vehicles equipped with black boxes. An accident detected by a black box can be reported to other vehicles, which can then avoid the location of the accident. It would be convenient if this process could be handled automatically.

Several studies have proposed methods to detect vehicle accidents at intersections or on highways [4][5]. However, most of the existing methods detect accidents in videos captured by fixed CCTV cameras. Analyzing the movement of vehicles is not very difficult in fixed CCTV videos because the background hardly changes. On the other hand, the situation is quite different in video captured by a camera mounted on a vehicle. The background moves along with the vehicle, which makes it difficult to analyze the movement of other vehicles. As a result, there are not many studies on detecting vehicle accidents in vehicle black box videos.

Recently, research on vehicle accident detection based on GAN (Generative Adversarial Network) [6] has been actively conducted. First introduced in 2014, GAN is a technology that has attracted attention in many areas such as image creation, image synthesis, style conversion, natural language processing, and speech recognition. GAN is also used in research on vehicle accident detection [7]. That study used GAN and Flownet [8] to predict future frames in order to detect abnormal behavior in a fixed-camera environment.

In this paper, we propose a method to detect traffic accidents in black box videos based on GAN. The detection of accidents is based on future frame prediction, and the frame prediction is performed by a GAN. The generator of the GAN generates a future frame from several previous frames. The generator is trained to minimize the losses (intensity, gradient, depth and background motion loss) between the real images and the predicted images from a non-accident dataset. The discriminator helps to improve the quality of the images produced by the generator by distinguishing real images from generated images. In the inference step, the traffic accident score between the predicted frame and the real frame is used to determine whether there is an accident or not.

Ⅱ. Related works

2.1 GAN

GAN [6] was introduced in 2014. Yann LeCun described it as “the most interesting idea in the last 10 years in Machine Learning”. A GAN consists of two modules, a generator and a discriminator, that compete with each other and improve through that competition. The generator produces fake images that should look real enough to trick the discriminator, and the discriminator must classify real and fake images correctly. However, the performance of the original GAN was low. DCGAN [9] improved performance by replacing the fully-connected architecture with a convolutional architecture and modifying the activation functions. Pix2Pix [10] is a GAN trained with supervised learning that translates one image into another, but it is limited by the need for matching image pairs. CycleGAN [11] uses unsupervised learning and can switch between image styles while maintaining shape, but the resolution of its images is low. StyleGAN [12] can generate high-resolution style-transferred results by adding noise in the learning process and gradually increasing the image size from low to high resolution. In addition, it produces natural high-resolution images by using Adaptive Instance Normalization to address feature entanglement [12], which makes interpolation unnatural.
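For reference, the competition between the two modules is usually written as the minimax objective of the original GAN formulation [6]. The symbols below (G for the generator, D for the discriminator, x for a real sample, z for the generator input, and the distributions p_data and p_z) follow the common convention and are not notation used elsewhere in this article.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator tries to increase V by scoring real samples high and generated samples low, while the generator tries to decrease it by producing samples that the discriminator scores as real.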

2.2 Anomaly Detection Using GAN

There have been several studies on anomaly detection based on GAN. Liu et al. [7] predicted future frames by training the generator with an intensity loss, a gradient loss and an optical flow loss. A predicted frame was classified as abnormal if its score was low. Ganokratanaa et al. [13] detected anomalous events using the difference between the optical flow obtained from two consecutive frames and the predicted one. Both of these methods only consider the frame order in the forward direction. In contrast, Chen et al. [14] proposed a bi-directional anomaly detection method that considers both forward and reverse frame sequences. However, none of these methods provide high performance in first-person videos because they all use videos captured by a fixed camera as input.

2.3 Traffic Accident Detection

Vehicle accident detection methods are largely divided into methods that use fixed cameras such as CCTV and methods that use first-person view cameras such as vehicle black boxes. Shine et al. estimated the start time of anomalous events, proposed a new Fractional Data Distillation model to separate the frames in which an anomaly occurred in the test data, and used YOLOv3 to detect the anomalies found by both the Frame Extractor and the Background Extractor [15]. Son et al. used vectors and a bird's-eye view to represent the movement between vehicles and determined traffic accidents from the distance and vector values [16]. Kim et al. used ego-motion and object tracking to estimate the movement of the vehicle equipped with the black box, identify parked vehicles, and determine traffic accidents from distances and vectors [17]. Building on [7], Nguyen et al. introduced motion-encoded images, feeding stacked frames instead of several separate frames, and proposed converting night-time images to daytime images because detection performance on night-time images was low [18]. Yao et al. used the ego-motion of the driving vehicle and the FOL track algorithm to predict the future locations of objects and detected traffic accidents in black box video through the relationship between the bounding boxes [19].

2.4 Depth Estimation

Research on estimating depth information from images varies widely. To estimate depth from a single image, Godard et al. feed only the left image of a stereo pair into the network and predict two disparity maps from it, from which depth is estimated [20]. As an extension of [20], the camera pose can also be estimated through deep learning [21]. Li et al. estimated depth information by training a monocular visual odometry (MVO) system in an unsupervised way with stereo images as input [22]. J. Hong analyzed factors related to depth (texture, color temperature, etc.) by applying them differentially to graphic images [23]. I. Jee applied the belief propagation algorithm to multi-resolution regions to generate depth images accurately and quickly [24].

Ⅲ. Proposed Method

In this paper, we propose a method to detect traffic accidents in black box videos based on GAN. Inspired by the frame prediction approach of Liu et al. [7], we propose a new system that predicts future frames. Fig. 1 shows the overall structure of the system. First, the generator generates the next (predicted) frame from several previous frames. We adopt several losses to improve the performance of the generator: the intensity loss and gradient loss proposed in [7], and two additional losses (depth loss and background motion loss). The generator is trained to minimize the losses between the real images and the predicted images from a non-accident dataset. The discriminator helps to improve the quality of the images produced by the generator by distinguishing real images from generated images. In the inference step, the traffic accident score (see Section 3.4) between the predicted frame and the real frame is used to determine whether there is an accident or not.

Fig. 1.

Overview of system architecture

3.1 Future Frame Prediction

As described above, the proposed method is based on future frame prediction. The generator generates future frames with a modified U-net architecture. U-net [25] was proposed for image segmentation, but it has recently been widely used for prediction. We also use the U-net structure as the backbone network of the generator in the proposed GAN. Consecutive frames are fed into the generator to predict a future frame. Each frame has a size of 128x128. The modified U-net structure has five layers and generates feature maps as in Fig. 2.

Fig. 2.

U-net architecture in the proposed system
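To make the input and output of the generator concrete, the sketch below pairs every run of δ consecutive frames with the frame that follows it, which is the target the generator is asked to predict. The function name `make_prediction_samples` and the use of plain Python sequences are illustrative assumptions; the article does not describe its data pipeline.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")  # any frame representation, e.g. a 128x128 image


def make_prediction_samples(
    frames: Sequence[Frame], delta: int
) -> List[Tuple[List[Frame], Frame]]:
    """Pair each run of `delta` consecutive frames with the next frame.

    `delta` plays the role of the input sequence length of the generator.
    This helper is a hypothetical illustration, not the authors' code.
    """
    if delta < 1:
        raise ValueError("delta must be at least 1")
    samples = []
    # The last usable window ends one frame before the end of the video,
    # so that a real "next frame" exists as the prediction target.
    for start in range(len(frames) - delta):
        inputs = list(frames[start:start + delta])
        target = frames[start + delta]
        samples.append((inputs, target))
    return samples
```

With δ = 8, a 10-frame clip yields two samples: frames 0 to 7 predicting frame 8, and frames 1 to 8 predicting frame 9. A clip shorter than δ + 1 frames yields none.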

3.2 Depth Estimation

The generator is trained to minimize the losses (intensity, gradient, depth and background motion loss) between the real images and the predicted images from a non-accident dataset. The intensity loss and gradient loss were proposed in [7]. The intensity loss is the difference in pixel intensity between two frames, and the gradient loss is the difference between the gradients of neighboring pixels in the two frames. We describe the two additional losses, depth loss and background motion loss, in this and the next section. The depth loss is calculated from the depth estimation.
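The two losses taken from [7] can be sketched as follows. This is a minimal, dependency-free illustration on grayscale images stored as nested lists; averaging over pixels, the squared error for intensity, and the absolute error for gradients are assumptions chosen to match the verbal description above, and the exact norms and weights used in the article may differ.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def intensity_loss(pred: Image, real: Image) -> float:
    """Mean squared difference of pixel intensities (assumed form)."""
    total = 0.0
    count = 0
    for pred_row, real_row in zip(pred, real):
        for p, r in zip(pred_row, real_row):
            total += (p - r) ** 2
            count += 1
    return total / count if count else 0.0


def gradient_loss(pred: Image, real: Image) -> float:
    """Mean absolute difference between neighbour-pixel gradient magnitudes."""
    height, width = len(real), len(real[0])
    total = 0.0
    count = 0
    for y in range(height):
        for x in range(width):
            if x > 0:  # horizontal neighbour
                g_pred = abs(pred[y][x] - pred[y][x - 1])
                g_real = abs(real[y][x] - real[y][x - 1])
                total += abs(g_pred - g_real)
                count += 1
            if y > 0:  # vertical neighbour
                g_pred = abs(pred[y][x] - pred[y - 1][x])
                g_real = abs(real[y][x] - real[y - 1][x])
                total += abs(g_pred - g_real)
                count += 1
    return total / count if count else 0.0
```

The two terms are complementary: a prediction that is uniformly brighter than the real frame has a non-zero intensity loss but a zero gradient loss, whereas a blurred prediction is penalized mainly by the gradient term.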

Depth estimation is the process of calculating the depth of the objects in a frame. Depth information is estimated with the monocular depth estimation system of [20]. The system uses VGG-19 as its backbone network and represents the depth information as a four-level pyramid, as shown in Fig. 3. The pyramidal depth information is produced by the last four 1x1 convolution layers of the VGG-19 network. In this paper, we estimate depth information from the predicted frame and from the real frame, and define the depth loss based on the difference between them, as in Fig. 3.

Fig. 3.

Network for depth information estimation and calculation process of depth loss
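The depth loss can be sketched as a comparison of the two depth pyramids level by level. The article only states that the loss is based on the difference between the depth estimated from the predicted frame and from the real frame, so the mean absolute difference and the equal weighting of the pyramid levels below are assumptions, and the depth maps are taken as given (the depth network itself is not modeled here).

```python
from typing import List, Sequence

DepthMap = List[List[float]]  # one pyramid level as rows of depth values


def mean_abs_diff(a: DepthMap, b: DepthMap) -> float:
    """Mean absolute difference between two depth maps of equal size."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def depth_loss(
    pred_pyramid: Sequence[DepthMap], real_pyramid: Sequence[DepthMap]
) -> float:
    """Average the per-level differences over the whole pyramid.

    Equal level weights are an assumption made for this sketch.
    """
    if len(pred_pyramid) != len(real_pyramid):
        raise ValueError("pyramids must have the same number of levels")
    per_level = [mean_abs_diff(p, r) for p, r in zip(pred_pyramid, real_pyramid)]
    return sum(per_level) / len(per_level)
```

Averaging within each level before averaging across levels keeps the coarse levels from being outweighed by the much larger number of pixels in the fine levels.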

3.3 Background Motion Estimation

Vehicle movement is very important information for determining whether a vehicle accident has occurred [7]. If the optical flow is calculated between two sequential frames from a static video camera, we can easily extract the trajectories of moving objects and analyze their behavior. However, it is impossible to extract the exact trajectories of moving objects in first-person video in the same way as in static camera video. In first-person video, an invalid optical flow occurs in the background, which prevents proper training of the detection model.

In this paper, we estimate background motion by using Flownet [8]. The background motion is calculated from a pre-defined small area, not the entire image. Fig. 4 shows the overall structure of background motion estimation. The middle area of the lower part of a frame is used to estimate background motion. The size of the area is experimentally set to 12 pixels in width and 4 pixels in height. Using Flownet, the first optical flow is calculated between the current frame and the real next frame, and the second between the current frame and the predicted next frame. Then, the background motion loss is calculated from the two optical flows, as in Fig. 4.

Fig. 4.

Calculation process of background motion loss
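The region selection and the comparison of the two optical flows can be sketched as follows. The flow fields are assumed to be already computed (for example by Flownet) and are represented as rows of (u, v) displacement pairs. Placing the 12x4 area flush with the bottom edge and centered horizontally, and using the mean absolute difference of the flow components, are assumptions; the article specifies only the size of the area and its rough position.

```python
from typing import List, Tuple

Flow = List[List[Tuple[float, float]]]  # per-pixel (u, v) displacement


def bottom_center_region(
    height: int, width: int, region_w: int = 12, region_h: int = 4
) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of the assumed background area."""
    if region_w > width or region_h > height:
        raise ValueError("region does not fit inside the frame")
    left = (width - region_w) // 2
    top = height - region_h
    return top, left, height, left + region_w


def background_motion_loss(
    flow_real: Flow, flow_pred: Flow, region_w: int = 12, region_h: int = 4
) -> float:
    """Mean absolute flow difference inside the background area only."""
    height, width = len(flow_real), len(flow_real[0])
    top, left, bottom, right = bottom_center_region(height, width, region_w, region_h)
    total = 0.0
    for y in range(top, bottom):
        for x in range(left, right):
            u_real, v_real = flow_real[y][x]
            u_pred, v_pred = flow_pred[y][x]
            total += abs(u_real - u_pred) + abs(v_real - v_pred)
    return total / (region_w * region_h)
```

For a 128x128 frame this selects rows 124 to 127 and columns 58 to 69, typically the road surface directly in front of the vehicle, so flow differences elsewhere in the frame do not contribute to the loss.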

3.4 Traffic Accident Score Using Reverse Normalized PSNR

Peak Signal to Noise Ratio (PSNR) expresses the ratio between the maximum power of a signal and the power of the noise. If the difference between the real image and the predicted image is large, the PSNR is low and the frame can be classified as a traffic accident frame. PSNR is calculated as in Eq. (1). The Traffic Accident Score is then defined based on the PSNR as in Eq. (2). Because the value of PSNR can vary greatly depending on the content of the image, the score is normalized by dividing by the difference between the maximum and minimum PSNR values of the video, as in Eq. (2). The closer the score is to 1, the more likely the frame is to be a traffic accident frame.

$$\mathrm{PSNR}(f_i) = 10 \times \log \frac{\max^2}{\frac{1}{N}\sum_{i=0}^{N}\left(I - \hat{I}\right)^2}$$

(1)

$$\mathrm{Score}(v_i) = 1 - \frac{\mathrm{PSNR}(f_i) - \min \mathrm{PSNR}(v_i)}{\max \mathrm{PSNR}(v_i) - \min \mathrm{PSNR}(v_i)}$$

(2)
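Eqs. (1) and (2) can be sketched in a few lines. The base-10 logarithm is assumed, as is standard for PSNR (the choice of base does not affect the normalized score), pixel values are assumed to lie in [0, 1] so that the peak value is 1.0, and the small floor `eps` on the mean squared error is an addition of this sketch to avoid a division by zero when the two frames are identical.

```python
import math
from typing import List, Sequence

Image = List[List[float]]  # grayscale image as rows of pixel values


def psnr(real: Image, pred: Image, max_value: float = 1.0, eps: float = 1e-10) -> float:
    """PSNR between a real and a predicted frame, following Eq. (1)."""
    flat_real = [p for row in real for p in row]
    flat_pred = [p for row in pred for p in row]
    mse = sum((r - p) ** 2 for r, p in zip(flat_real, flat_pred)) / len(flat_real)
    mse = max(mse, eps)  # floor added in this sketch to keep the value finite
    return 10.0 * math.log10(max_value ** 2 / mse)


def accident_scores(psnr_values: Sequence[float]) -> List[float]:
    """Reverse min-max normalized PSNR over one video, following Eq. (2)."""
    lowest, highest = min(psnr_values), max(psnr_values)
    if highest == lowest:
        # Every frame was predicted equally well; no frame stands out.
        return [0.0] * len(psnr_values)
    return [1.0 - (p - lowest) / (highest - lowest) for p in psnr_values]
```

For example, a video whose frames have PSNR values of 30, 20 and 25 receives scores of 0.0, 1.0 and 0.5, so the worst-predicted frame is the one flagged as most likely to contain an accident.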

Ⅳ. Experiment

4.1 Datasets

The training dataset consists of 254 videos (81,417 frames) from the HEV-I dataset [26] and videos collected from YouTube. The test dataset consists of 54 videos (6,429 frames). The training dataset contains only non-accident videos. On the other hand, all of the videos in the test dataset contain accident scenes. The non-accident videos include sudden stops and turns as well as ordinary driving.

4.2 Evaluation

We evaluate the proposed method in a computational environment with an RTX 3090 GPU and Ubuntu 18.04. Each frame in training dataset is manually labeled as either the non-accident or the accident category. Then, the score of the frame is calculated using Eq. (2). Fig. 5 shows the detection performance as an ROC curve with δ=8, where δ is the length of the frame sequence fed into the generator of the GAN. The AUC of the ROC curve is 0.829.

Fig. 5.

ROC curve for the accident detection
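The AUC reported for the ROC curve can be computed directly from the per-frame labels and scores. The sketch below uses the pairwise (Mann-Whitney) interpretation of AUC: the probability that a randomly chosen accident frame receives a higher score than a randomly chosen non-accident frame, with ties counted as one half. The label encoding (1 for accident, 0 for non-accident) is an assumption, and the quadratic loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0  # accident frame ranked above non-accident frame
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to scores that carry no information about the label, and an AUC of 1.0 means every accident frame scores higher than every non-accident frame.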

Table 1 shows the performance of the proposed system for different values of δ; the AUC is highest when δ is 8. We also compared the proposed method with previous studies. Compared to the most recent study [15], the proposed method achieves a 6.9% higher AUC, as shown in Table 2. This improvement is attributed to the use of depth and background motion estimation.

Table 1.

Performance evaluation of the proposed method according to δ (the number of input frames)

Table 2.

AUC comparison between the proposed method and previous researches for first-person videos

We also evaluate the detection performance according to the type of loss. Table 3 compares the results. The baseline model uses the intensity loss and gradient loss that were also used in [7] and achieves an AUC of 78.3%. When the background motion loss (L_B) is added to the baseline model, the AUC improves to 79.1%. When the depth loss (L_D) is added instead, the AUC improves to 80.5%. We achieve the best performance of 82.9% when the background motion loss (L_B) and the depth loss (L_D) are used together.

Table 3.

AUC Comparison according to detailed settings of the proposed method

Fig. 6 shows one of the frames in which a vehicle accident occurred. The left image in Fig. 6 is the real scene and the right is the predicted scene. A white vehicle on the right has crashed into the wall. As can be seen, the predicted scene appears somewhat blurry compared to the real scene. However, this causes no difficulty in determining whether there is an accident. Fig. 7 shows an example of incorrect detection. The incorrect detection is caused by another vehicle appearing while the vehicle carrying the black box is making a very wide turn. The predicted frame appears more blurred than other predicted frames on average.

Fig. 6.

An example of the prediction of a traffic accident frame (left: ground truth, right: prediction)

Fig. 7.

An example of incorrect detection (left: ground truth, right: prediction)

Ⅴ. Conclusions

In this paper, we proposed a novel method for detecting traffic accidents in first-person videos. The first-person videos were captured by a vehicle dashboard-mounted camera, also called a vehicle black box. The detection of accidents was based on future frame prediction. The frame prediction was performed by a GAN in situations where the background moves because of the movement of the vehicle. We used depth and background motion estimation to improve prediction accuracy. The proposed method showed better performance than previous studies. However, vehicles located at a distance did not significantly affect the PSNR. In this case, a frame was sometimes misclassified as non-accident even though an accident had occurred. In addition, when a vehicle suddenly appeared in the frame, the frame was detected as an accident because it was very different from the prediction. In future work, it will be necessary to improve the prediction of object movement by removing the background.

Acknowledgments

This research was supported by Kumoh National Institute of Technology (2019104006).

References

[1] https://www.index.go.kr/potal/main/EachDtlPageDetail.do?idx_cd=1257, [accessed: Jan. 10, 2021]
[2] https://www.index.go.kr/potal/main/EachDtlPageDetail.do?idx_cd=1614, [accessed: Jan. 10, 2021]
[3] https://www.trendmonitor.co.kr/tmweb/trend/allTrend/detail.do?bIdx=1780&code=0304&trendType=CKOREA, [accessed: Jan. 10, 2021]
[4] P. Wang, C. Ni, and K. Li, "Vision-based highway traffic accident detection", Proc. of International Conference on AIIPCC, Sanya, China, pp. 1-5, Dec. 2019. [https://doi.org/10.1145/3371425.3371449]
[5] N. W. Aung and T. L lai, "Vehicle Accident Detection on Highway and Communication to the Closest Rescue Service", Proc. of IEEE Conference on ICCA, Yangon, Myanmar, pp. 1-7, Feb. 2020. [https://doi.org/10.1109/ICCA49400.2020.9022855]
[6] I. Goodfellow, J. P. Abadie, M. Mirza, B. Xu, D. W. Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Nets", Advances in Neural Information Processing Systems 27 (NIPS), pp. 2672-2680, Jun. 2014.
[7] W. Liu, D. Lian, and S. Gao, "Future Frame Prediction for Anomaly Detection – A New Baseline", Proc. of IEEE Conference on CVPR, Salt Lake City, USA, pp. 6536-6545, Mar. 2018. [https://doi.org/10.1109/CVPR.2018.00684]
[8] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. V. Smagt, D. Cremers, and T. Brox, "Flownet: Learning Optical Flow with Convolutional Networks", Proc. of the IEEE International Conference on ICCV, Santiago, Chile, pp. 2758-2766, May 2015. [https://doi.org/10.1109/ICCV.2015.316]
[9] A. Radford, L. Metz, and S. Chintala, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", Under review as a conference paper at ICLR 2016, Vancouver, BC, Canada, pp. 1-16, Jan. 2016. arXiv:1511.06434v2, [cs.LG]
[10] P. Isola, J. Y. Zhu, T. Zhou, and A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks", Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii, pp. 1125-1134, Jul. 2017. [https://doi.org/10.1109/CVPR.2017.632]
[11] J. Y. Zhu, T. Park, P. Isola, and A. Efros, "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks", Proc. of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 2223-2232, Oct. 2017.
[12] T. Karras, S. Laine, and T. Aila, "A Style-Based Generator Architecture for Generative Adversarial Networks", Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, pp. 4401-4410, Jun. 2019. [https://doi.org/10.1109/CVPR.2019.00453]
[13] T. Ganokratanaa, S. Aramvith, and N. Sebe, "Anomaly Event Detection Using Generative Adversarial Network for Surveillance Videos", Proc. of APSIPA Annual Summit and Conference, Lanzhou, China, pp. 1395-1399, Nov. 2019. [https://doi.org/10.1109/APSIPAASC47483.2019.9023261]
[14] D. Chen, P. Wang, L. Yue, Y. Zhang, and T. Jia, "Anomaly detection in surveillance video based on bidirectional prediction", Image and Vision Computing, Vol. 98, No. 1, pp. 1-8, Jun. 2020. [https://doi.org/10.1016/j.imavis.2020.103915]
[15] L. Shine, M. A. Vaishnav, and C. V. Jiji, "Fractional Data Distillation Model for Anomaly Detection in Traffic Videos", Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, USA, pp. 2581-2589, Jun. 2020. [https://doi.org/10.1109/CVPRW50498.2020.00311]
[16] H. Son, J. Si, D. Kim, Y. Lee, and S. Kim, "Traffic Accident Detection Using Bird's Eye View and Vehicle Motion Vector", Proc. of Korean Society of Computer Information, Jeju, Korea, pp. 71-72, Jul. 2020.
[17] D. Kim, H. Son, J. Si, and S. Kim, "Traffic Accident Detection Based on Ego Motion and Object Tracking", Journal of Advanced Information Technology and Convergence, Vol. 10, No. 1, pp. 15-23, Jul. 2020. [https://doi.org/10.14801/JAITC.2020.10.1.15]
[18] K. Nguyen, D. Dinh, M. N. Do, and M. Tran, "Anomaly Detection in Traffic Surveillance Videos with GAN-based Future Frame Prediction", Proc. of ICMR, Dublin, Ireland, pp. 457-463, Jun. 2020. [https://doi.org/10.1145/3372278.3390701]
[19] Y. Yao, M. Xu, Y. Wang, D. J. Crandall, and E. M. Atkins, "Unsupervised Traffic Accident Detection in First-Person Videos", IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems (IROS), Macau, China, pp. 273-280, Jul. 2019. [https://doi.org/10.1109/IROS40897.2019.8967556]
[20] C. Godard, O. M. Aodha, and G. J. Brostow, "Unsupervised Monocular Depth Estimation with Left-Right Consistency", Proc. of the IEEE Conference on CVPR, Honolulu, USA, pp. 270-279, Jul. 2017. [https://doi.org/10.1109/CVPR.2017.699]
[21] C. Godard, O. M. Aodha, and G. J. Brostow, "Digging into Self-Supervised Monocular Depth Estimation", Proc. of the IEEE/CVF International Conference on ICCV, Seoul, Korea, pp. 3828-3838, Oct. 2019. [https://doi.org/10.1109/ICCV.2019.00393]
[22] R. Li, S. Wang, Z. Long, and D. Gu, "UnDeepVO: Monocular Visual Odometry Through Unsupervised Deep Learning", Proc. of the IEEE International Conference on ICRA, Brisbane, Australia, pp. 7286-7291, May 2018.
[23] J. Hong, "A study on characteristics related to texture, colour temperature and contrast ratio to improve the depth of stereoscopic images", Journal of the Institute of Internet, Broadcasting and Communication (IIBC), Vol. 18, No. 4, pp. 37-42, Aug. 31, 2018.
[24] I. Jee, "A Study on the Generation and Processing of Depth Map for Multi-resolution Image Using Belief Propagation Algorithm", Journal of the Institute of Internet, Broadcasting and Communication (IIBC), Vol. 15, No. 6, pp. 201-208, Jun. 28, 2015. [https://doi.org/10.7236/JIIBC.2015.15.6.201]
[25] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation", Proc. of the MICCAI, Munich, Germany, pp. 234-241, Oct. 2015. [https://doi.org/10.1007/978-3-319-24574-4_28]
[26] Y. Yao, M. Xu, C. Choi, D. J. Crandall, E. M. Atkins, and B. Dariush, "Egocentric Vision-based Future Vehicle Localization for Intelligent Driving Assistance Systems", Proc. of the International Conference on ICRA, Montreal, Canada, pp. 9711-9717, May 2019. [https://doi.org/10.1109/ICRA.2019.8794474]

Authors

Jongwook Si

2020 : BS degree in Department of Computer Engineering, Kumoh National Institute of Technology.

2020 ~ Current : MS Course in Department of Computer Engineering, Kumoh National Institute of Technology.

Research interests : Image Analysis, Style Transfer, Image Generation, Generative Adversarial Network.

Sungyoung Kim

1994 : BS degree in Department of Computer Engineering, Pusan National University.

1996 : MS degree in Department of Computer Engineering, Pusan National University.

2003 : Ph.D. degree in Department of Computer Engineering, Pusan National University.

2004 ~ Current : Professor in Department of Computer Engineering, Kumoh National Institute of Technology.

Research interests : Image Processing, Computer Vision, Machine Learning, Deep Learning.