Integrate memory mechanism in multi-granularity deep framework for driver drowsiness detection
Correspondence to: Assoc. Prof. Tie Liu, College of Information Engineering, Capital Normal University, No.56, West North Ring Road, North street, Haidian, Beijing 100048, China. E-mail:
Driver drowsiness detection is a critical task for early warning of safe driving, while existing spatial feature-based methods face the challenges of large variations of head pose. This paper proposes a novel approach to integrate the memory mechanism in a multi-granularity deep framework to detect driver drowsiness, and the temporal dependencies over sequential frames are well integrated with the spatial deep learning framework on the frontal faces. The proposed approach includes two steps. First, the spatial Multi-granularity Convolutional Neural Network is designed to utilize a group of parallel Convolutional Neural Network extractors on well-aligned facial patches of different granularities and extract facial representations effectively for large variations of head pose. Furthermore, it can flexibly fuse detailed appearance clues of the main parts and local-to-global spatial constraints. Second, the memory mechanism is set up using a deep long short-term memory network of facial representations to explore long-term relationships with variable length over sequential frames, which is capable of distinguishing the states with temporal dependencies, such as blinking and closing eyes. The proposed approach achieves 90.05% accuracy and about 37 frames per second (FPS) speed on the evaluation set of the National Tsing Hua University Driver Drowsiness Detection dataset, which is applied to the intelligent vehicle for driver drowsiness detection. A dataset named Forward Instant Driver Drowsiness Detection is also built and will be publicly accessible to speed up the study of driver drowsiness detection.
Driver drowsiness is a critical problem that induces 6% of serious road accidents each year. This condition indicates that the driver lacks sleep, which can be detected by the variation of physiological signals [2–5], vehicle trajectory [6, 7], and facial expressions. Drowsiness detection using vehicle-based, physiological, and behavioral change measurement systems is possible, each with its own pros and cons. Subjective techniques cannot be used in a real driving situation but are helpful in simulations for determining drowsiness. Physiological signals, such as electrocardiogram, electroencephalogram (EEG), and electrooculography, can be utilized for drowsiness detection. Vehicle-movement-based detection is another technique. Here, information is obtained from sensors attached to the steering wheel, acceleration pedal, or body of the vehicle. Signals collected from sensors are continuously monitored for the identification of noticeable variations in order to detect driver drowsiness. Drowsiness never comes instantly but appears with visually noticeable symptoms. These symptoms generally appear well before drowsiness in every driver. Moreover, drowsiness can be reflected by facial expressions, such as nodding, yawning, and closing eyes. We, therefore, aim to develop a drowsiness detection method based on videos. Video-based methods have the potential to give timely warning prompts and receive feedback from drivers, making them highly valuable in practice.
Video-based drowsiness detection still faces numerous challenges, mainly stemming from illumination changes, head pose variations, and temporal dependencies. In particular, the large variation of head pose causes serious deformations of facial shape, which makes it difficult to extract effective spatial representations. Conventional approaches, such as those based on aligned facial points, provide a better way to represent drowsy features; however, they cannot distinguish between blinking and closing eyes because they neglect temporal relationships. Spatial-temporal descriptors are proposed to collect spatial and temporal features, but they are not good at distinguishing states with long-term dependencies, such as yawning and speaking. Besides, these hand-crafted descriptors are not powerful enough to capture the wide range of head pose variations and classify confusing states. For instance, looking aside and lowering the head lead to significant pose variations, while yawning and laughing, although similar in appearance, belong to different states.
Recently, deep learning methods have been widely used to learn facial spatial representations automatically from the global face [11–13]. However, the global face, when not well-aligned, provides weak representations, especially under large pose variations. Moreover, it is not flexible enough to fuse the configurations of local regions and concentrate representations on the most important parts, such as eyes, nose, and mouth, on which the majority of drowsy information focuses. It is another challenge to distinguish easy-to-confuse states, such as blinking and closing eyes. Additionally, 3D-Convolutional Neural Networks (CNN) with fixed time windows have been used to describe spatial and temporal features, but they do not have enough capability to model long-term relationships with variable time lengths.
We propose a Long-term Multi-granularity Deep Framework (LMDF) to detect driver drowsiness from well-aligned facial patches. Our method applies alignment technology to obtain the well-aligned facial patches over frames, and these patches are mainly located in the informative regions that supply critical drowsy information. A group of parallel convolution layers is applied to the multi-granularity facial patches, and the outputs of these layers are fused by a fully connected layer to generate spatial representations, which is named Multi-granularity CNN (MCNN). MCNN is able to fuse the appearance of those well-aligned patches and capture local-to-global constraints. To explore temporal dynamical characteristics, we fuse a memory mechanism to the MCNN; the Long Short-Term Memory (LSTM) network is applied to the spatial representations over sequential frames, which can distinguish the confusing states with temporal relationships, such as yawning, laughing, blinking, and closing eyes. The proposed method can, thus, not only extract effective facial representations from single-frame images but also mine temporal clues from videos.
As shown in Figure 1, the spatial and temporal features are extracted and concentrated to detect driver drowsiness. The contributions of our approach are mainly in the following three aspects:
Figure 1. The examples of driver drowsiness. (A) The normal status of a driver; (B) The drowsiness status of the driver. The spatial and temporal features are extracted and concentrated to detect driver drowsiness.
(1) We propose MCNN to learn the facial representations from the most important parts, which makes the detector robust to large pose variations.
(2) We propose an LMDF to learn facial spatial features and their long-term temporal dependencies.
(3) We build a Forward Instant Driver Drowsiness Detection (FI-DDD) dataset with higher precision of drowsy locations in the temporal dimension, which is a good test bed for evaluating practical systems that are required to detect drowsiness in time.
2. RELATED WORK
2.1. Traditional driver drowsiness detection methods
Driver drowsiness detection is becoming a hot topic in Advanced Driver Assistance Systems. Many traditional methods have been applied to this problem. Shirakata et al. utilized the change of pupil diameter to detect imperceptible drowsiness, which is effective but requires equipment that is inconvenient for a driver to wear. Nakamura et al. utilized face alignment to estimate the degree of drowsiness via K-Nearest Neighbors (k-NN), which cannot achieve online performance. Akrout et al. proposed spatial-temporal features for driver drowsiness detection. However, these features, based on the Hough transform, cannot work well in practical driving environments. Another method detects driver drowsiness based on time-series analysis of the steering wheel angular velocity. This approach uses a temporal detection window to determine the steering wheel angular velocity over a time series, during which specified indicators of driver drowsiness become evident. Besides, the representations used in those methods are hand-crafted, which may not be flexible enough to adapt to complex situations faced in driving. In contrast, our method automatically learns facial representations, which is more effective for practical tasks.
Earlier research on estimating the level of drowsiness focused on single measures, such as vehicle-based, physiological, behavioral, or subjective measures. Researchers have reported good results using numerous less intrusive techniques to detect the drowsiness of drivers, including eyelid movement and gaze or head movement monitoring. Rumagit et al. investigated the relationship between drowsiness and physiological conditions by utilizing an eye gaze tracker and the Japanese version of the Karolinska sleepiness scale within a driving simulator environment. Amirudin et al. analyzed two single measures, physiological and behavioral, using EEG signals and video sequences. Chmielińska et al. introduced an approach using CNNs and transfer learning techniques. Their paper presents the results of scientific investigations aimed at developing detectors of selected driver fatigue symptoms based on face images.
2.2. Driver drowsiness detection methods: CNN and RNN-based approaches
Deep learning approaches, such as CNN, have achieved success in representing information on images [19–21] and are widely used in the field of machine learning. The use of CNN models for image classification can avoid the problem of high complexity and difficulty in feature extraction found in traditional classification methods and is, therefore, increasingly applied in facial recognition. Compared to traditional image classification methods, deep learning methods can use large datasets for training and learn the best features to represent these data, making them more responsive to changes in the real world.
Recently, many researchers have also applied CNN to driver drowsiness detection. Park et al. combined the results of three existing networks by Support Vector Machines (SVM) to present the categories of videos. Later works [26–29] have also introduced various improved CNNs to capture facial regions under complex driving conditions to classify videos. However, those models can only classify videos into different categories; they cannot detect driver drowsiness online. In contrast, 3D-CNN is applied to extract spatial and temporal information by Yu et al. However, the method can only capture features with a fixed temporal window. The above two methods utilize global face images, which cannot flexibly configure those patches containing the majority of drowsy information. Moreover, they struggle to capture dependencies with variable temporal lengths.
Due to the strong performance of LSTM Networks on sequential data [30–32], an increasing number of researchers propose combinations of CNNs and LSTMs to learn spatial and temporal representations of sequential frames. Liang et al. proposed convolutional layers with intra-layer recurrent connections to integrate the context information for object recognition. Donahue et al. provided a method that extracts visual features from images by CNN and learns the long-term dependencies from sequential data by LSTMs. In particular, the approaches of Wang et al. and Jeong et al. process images with CNN and model sequential labels by LSTMs concurrently and then combine the two representations through projection layers [35, 36]. However, none of the above methods apply a multi-granularity approach to concentrate representations on important parts and flexibly fuse configurations of different regions.
2.3. Multi-granularity methods
Fine-grained methods mostly rely on object detection [37–39], classifying all regions after identifying areas that may contain objects. Coarse-grained methods extract and encode overall image features through convolutional networks or vision transformers, which can eliminate the interference of fine details, but their performance is often inferior to fine-grained methods. The multi-granularity methods combine coarse and fine granularity to capture discriminative spatial and temporal information at different semantic levels.
Recently, multi-granularity methods have achieved several excellent results in some applications of computer vision. Li et al. proposed a temporal multi-granularity approach for action recognition. Their method achieved state-of-the-art performance on action benchmarks but could not capture detailed appearance clues and local-to-global spatial information. Chen et al. applied multi-scale patches based on face alignment to face recognition. Wang et al. utilized multi-granularity regions detected by three granularities of CNN to generate a multi-granularity descriptor for fine-grained categorization, but this method cannot process sequential frames. Huang et al. proposed a multi-granularity extraction sub-network that extracts more efficient multi-granularity features while compressing the network parameters. They also included a feature rectification sub-network and a feature fusion sub-network to adaptively recalibrate and fuse the multi-granularity features. Finally, an LSTM network is applied to distinguish actions with similar appearances. However, this method cannot prioritize the most significant regions to get the most precise result and speed up the inference. Different from the approaches mentioned above, our method can capture both spatial multi-granularity information and long-term temporal dependencies. Particularly, our MCNN can learn representations on the most significant regions from well-aligned multi-granularity patches, and the proposed method has achieved state-of-the-art accuracy on the National Tsing Hua University Driver Drowsiness Detection (NTHU-DDD) dataset for driver drowsiness detection.
The proposed method utilizes MCNN to learn facial representations from single-frame images. The representations, extracted from well-aligned multi-granularity facial patches, contain detailed appearance information of the main parts and local-to-global constraints. Furthermore, our approach takes advantage of a deep LSTM network to explore the dynamical characteristics of the facial representations from sequential frames. The detailed structure of our LMDF combining MCNN and LSTMs is shown in Figure 2.
Figure 2. The long-term multi-granularity deep framework for driver drowsiness detection. The first stage is well-aligned multi-granularity patches that consist of local regions, main parts, and the global face. Parallel convolutional layers are well-applied to process these patches separately. In the second stage, a fully connected layer fuses local and global clues and generates a representation. The first two stages together construct the MCNN. The third stage uses Recurrent Neural Networks (RNN) with multiple LSTM blocks to mine the clues in the temporal dimension, together with a fully connected layer.
3.1. Well-aligned multi-granularity patches
It is well known that drowsy information is focused on several main facial parts, such as eyes, nose, and mouth. Alignment provides an excellent way to extract well-aligned features over frames, which effectively represent facial drowsy states. Besides, a global patch provides rough information for estimating a driver's head and full face states, which assists in the decision of the driver's drowsiness when the locations of parts are imprecise. Our method takes advantage of local regions and the global face at the same time.
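The idea of cropping patches of several granularities around aligned points can be sketched as follows. This is a minimal illustration, not the authors' implementation: the anchor coordinates, the patch sizes (128, 48, and 24 pixels), and the function names are hypothetical, and each patch is assumed to be no larger than the image.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def centered_box(center: Point, size: int, width: int, height: int) -> Box:
    """Return a size x size box centered on `center`, shifted to stay inside the image.

    Assumes size <= width and size <= height.
    """
    cx, cy = center
    left = int(round(cx - size / 2))
    top = int(round(cy - size / 2))
    # Shift the box back inside the image instead of shrinking it,
    # so every patch of one granularity keeps the same size.
    left = max(0, min(left, width - size))
    top = max(0, min(top, height - size))
    return (left, top, left + size, top + size)


def multi_granularity_boxes(anchors: Dict[str, List[Point]],
                            sizes: Dict[str, int],
                            width: int, height: int) -> Dict[str, List[Box]]:
    """Build one list of crop boxes per granularity from its anchor points."""
    return {name: [centered_box(p, sizes[name], width, height) for p in points]
            for name, points in anchors.items()}


# Hypothetical anchors on a 128 x 128 face image (illustrative values only).
anchors = {
    "global_face": [(64.0, 64.0)],
    "main_parts": [(40.0, 50.0), (88.0, 50.0), (64.0, 70.0), (64.0, 95.0)],
    "local_regions": [(30.0, 50.0), (120.0, 100.0)],
}
sizes = {"global_face": 128, "main_parts": 48, "local_regions": 24}
boxes = multi_granularity_boxes(anchors, sizes, width=128, height=128)
```

Because the boxes follow the aligned points, the same facial part lands in the same patch across frames even when the head pose changes, which is the property the paragraph above relies on.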
We utilize face alignment technology to locate facial shape points. Given an image
Those patches, including local regions, main parts, and the global face, are produced by three different mappings. As shown in Figure 3, a mapping
Figure 3. The procedure of extracting multi-granularity facial patches, which include three granularities: the main parts, local regions, and the global face.
By processing the input image
Compared to the original image, the patch set
3.2. Learning facial representations
Our approach learns representations with a CNN rather than with hand-crafted descriptors because of its good performance in learning spatial features. We apply several convolutional layers to process each one in the set of patches
Every patch needs to be processed by convolutional operations at first. For a patch
Figure 4. The three layers to capture the spatial features. The first layer is to project a normalized three-channel image onto a higher dimensional representation; the second layer is to enlarge the dimension of the representation; the third layer is to decrease the dimension with different parameters to the first layer.
A fully connected layer is utilized to combine those representations extracted by the mapping
With a specific weighted matrix
Driver drowsiness detection is a binary classification problem; thus, the state of an input frame is just drowsy or not. We label drowsiness with 1 as the positive sample and normal state with 0 as the negative sample. A label
To train the parameters of the CNN, we project the representation
3.3. Exploring dynamical characteristics
An LSTM block consists of an input gate, a forget gate, an output gate, and a memory cell. Because of the three gates, the LSTM block can learn long-term dependencies in sequential data, and its parameters are easier to train. The memory cell can store long-term information in its vector, which can be rewritten or operated on in the next time steps. Besides, the number of hidden units should be chosen according to the dimension of the input representation
We employ multiple-layer LSTMs to mine the temporal features for driver drowsiness. A mapping
A fully connected layer with weight
Similarly, the labels
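To make the role of the gates and the memory cell concrete, the following sketch implements one step of a standard LSTM block in plain Python. It follows the textbook formulation (input, forget, and output gates plus a candidate update); it is an assumption that the network used here matches this variant, and the parameter layout and helper names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Gate = Tuple[Matrix, Vector]  # (weight, bias) acting on the concatenation [x, h_prev]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(weight: Matrix, bias: Vector, inputs: Sequence[float]) -> Vector:
    """Compute weight @ inputs + bias with plain lists."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weight, bias)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Gate]) -> Tuple[Vector, Vector]:
    """One time step: returns the new hidden state and the new memory cell."""
    z = list(x) + list(h_prev)
    i = [sigmoid(v) for v in affine(*params["input"], z)]        # what to write
    f = [sigmoid(v) for v in affine(*params["forget"], z)]       # what to keep
    o = [sigmoid(v) for v in affine(*params["output"], z)]       # what to expose
    g = [math.tanh(v) for v in affine(*params["candidate"], z)]  # proposed content
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def constant_gate(hidden: int, inputs: int, bias: float) -> Gate:
    """Hypothetical helper: zero weights and a constant bias, for illustration."""
    return ([[0.0] * (inputs + hidden) for _ in range(hidden)], [bias] * hidden)


# With the forget gate almost fully open and the input gate almost closed,
# the memory cell keeps its content over many frames.
params = {
    "input": constant_gate(1, 2, -20.0),
    "forget": constant_gate(1, 2, 20.0),
    "output": constant_gate(1, 2, 0.0),
    "candidate": constant_gate(1, 2, 0.0),
}
h, c = [0.0], [2.0]
for _ in range(60):
    h, c = lstm_step([0.3, -0.1], h, c, params)
```

In a trained network the gate values depend on the input representation, so the block can decide per frame whether to keep, overwrite, or expose the stored information.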
The NTHU-DDD dataset was released for the driver drowsiness detection challenge of the ACCV 2016 workshop, and we compare our approach with others on it. To make the sequential labels close to practical driving environments, we relabel the video set with the instant detecting principle. A new dataset, called FI-DDD, is generated from the relabeled video set; on it, we learn parameters and analyze the performance of several subnetworks. Because the performance of our entire approach is evaluated on the original NTHU-DDD dataset, we also train a set of parameters that fits its long-term memory labels. Finally, our LMDF obtains an accuracy of 90.05% on the evaluation set of the NTHU-DDD dataset, and the proposed method achieves about 37 frames per second (FPS) on a Tesla M40 GPU.
NTHU-DDD Dataset: The NTHU-DDD dataset includes five scenarios listed as glasses, no glasses, glasses at night, no glasses at night, and sunglasses. The training set involves 18 volunteers consisting of ten men and eight women who act as drivers with four different states in every scenario, while the evaluation set has four volunteers, including two men and two women. Non-sleepy videos contain only normal states, while sleepy videos combine normal and drowsy states together. Besides, blinking with nodding and yawning videos only record drowsy eyes and mouth, respectively. The NTHU-DDD dataset offers four annotation files recording the states of drowsiness, eyes, head, and mouth for every video. Table 1 gives the labels of drowsiness and three main parts.
Table 1. The labeled states of each part on the NTHU-DDD dataset
|Mouth||Normal||Yawning||Talking and laughing|
It is worth emphasizing that the labels on the NTHU-DDD dataset are long-term memory, which means that the states of a frame may depend on the frames in the previous several seconds.
FI-DDD Dataset: The long-term memory in NTHU-DDD causes a problem: a driver would still receive warning prompts even after he had returned from a drowsy state to normal for a few seconds. At the same time, those labels are unable to locate the drowsy states with high precision in the temporal dimension. To solve these problems, we relabel those videos with an instant principle, which means the latency is limited to 0.5 s, namely 15 frames for 30 FPS videos. Typical states, such as closing eyes, yawning, and lowering the head, are still considered as evidence to judge whether a frame is drowsy. According to our labels, the videos are cut into clips that alternately contain only the drowsy or the normal state. To describe the transitional states between the normal and the drowsy, we reserve ten normal frames at the head and the tail of every clip with drowsiness. We name the relabeled dataset FI-DDD, which includes 14 drivers in the train set and four in the test set, as shown in Figure 5. The train set of FI-DDD in the daytime has 157 clips, and the test set has 92 clips, while in night scenarios, the train set has 126 clips, and the test set has 75 clips with about 530 frames on average.
Static image set: To train the parameters of CNN and analyze the effects of several factors, we build a static image set by sampling frames from the FI-DDD dataset. The samples in the image set are labeled as drowsy or normal, and the labels almost always indicate the true states of the corresponding images, even if a small number of images are matched with wrong labels due to a lack of temporal dependence. The static image set has 7,498 images in the daytime; the train set includes 5,239 images, and the test set has 2,259 images. It has 2,653 images in night scenarios; the train set includes 1,750 images, and the test set has 903 images.
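The clip-cutting rule can be sketched as follows. This is a minimal sketch under stated assumptions: per-frame labels are 1 for drowsy and 0 for normal, margins are clipped to the neighboring normal run, and the function names are hypothetical rather than taken from the dataset tooling.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def latency_in_frames(latency_s: float, fps: float) -> int:
    """Convert a latency bound in seconds to a number of frames."""
    return int(round(latency_s * fps))


def label_runs(labels: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Group consecutive equal labels into (start, end_exclusive, label) runs."""
    runs = []
    start = 0
    for label, group in groupby(labels):
        length = len(list(group))
        runs.append((start, start + length, label))
        start += length
    return runs


def drowsy_clips(labels: Sequence[int], margin: int = 10) -> List[Tuple[int, int]]:
    """Return (start, end_exclusive) for each drowsy clip, padded with up to
    `margin` normal frames at the head and the tail."""
    runs = label_runs(labels)
    clips = []
    for idx, (start, end, label) in enumerate(runs):
        if label != 1:
            continue
        # Neighboring runs are normal by construction, so their lengths
        # bound how many transitional frames are available.
        head = min(margin, start - runs[idx - 1][0]) if idx > 0 else 0
        tail = min(margin, runs[idx + 1][1] - end) if idx + 1 < len(runs) else 0
        clips.append((start - head, end + tail))
    return clips


# Illustrative label sequence: two drowsy episodes separated by a short gap.
labels = [0] * 30 + [1] * 20 + [0] * 5 + [1] * 10 + [0] * 40
clips = drowsy_clips(labels)
```

When two drowsy episodes are separated by fewer than ten normal frames, as in the example, the padded clips share the frames in the gap; whether the original labeling merges such episodes is not stated in the text.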
4.2. Implementation details
Face Alignment: We apply face alignment technology to locate the facial shape points for all videos. Face detection and tracking are combined to increase detection rates and provide more accurate face positions in videos. Face alignment algorithms are based on those face positions. The face detector is from OpenCV, and the face tracking approach is proposed by Danelljan et al. We implement the method of Ren et al., retrain the model, and preprocess all videos to obtain the 51 landmark points for every frame. Frames with no face are recognized as empty and filled with zero coordinates for the landmark points.
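The handling of empty frames can be illustrated with a short sketch. It assumes that a frame without a detected face is represented by `None`; the function name and the returned flag list are hypothetical additions for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
NUM_LANDMARKS = 51  # number of landmark points per frame, as stated in the text


def fill_missing_landmarks(
    frames: Sequence[Optional[Sequence[Point]]],
) -> Tuple[List[List[Point]], List[bool]]:
    """Replace frames without a face by zero coordinates.

    Returns the filled landmark lists and a per-frame flag telling
    whether a face was actually found.
    """
    filled: List[List[Point]] = []
    face_found: List[bool] = []
    for points in frames:
        if points is None:
            filled.append([(0.0, 0.0)] * NUM_LANDMARKS)
            face_found.append(False)
        else:
            if len(points) != NUM_LANDMARKS:
                raise ValueError("expected %d landmark points" % NUM_LANDMARKS)
            filled.append([(float(x), float(y)) for x, y in points])
            face_found.append(True)
    return filled, face_found


# Illustrative input: the first frame has no face, the second one does.
frames = [None, [(1, 2)] * NUM_LANDMARKS]
filled, face_found = fill_missing_landmarks(frames)
```

Keeping a fixed number of points per frame means every frame yields the same set of patches, so the sequence passed to the temporal model has no gaps.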
Multi-granularity: We obtain Multi-granularity patches considering two factors: different positions and sizes. We design to choose 15 positions from facial shape points, which are divided into three granularities: one global face with size
Dataset Usage: A static image set required for training the CNN parameters is sampled from the videos of FI-DDD with a specific frame interval. The result of CNN is directly related to multi-granularity patches and CNN parameters; we, thus, analyze the effects of those factors on the static image set, while all experiments for analyzing the effects of LSTM parameters are carried out on the FI-DDD dataset. To compare with the previous methods, we evaluate the proposed method on the evaluation set of the NTHU-DDD dataset.
4.3. Experimental analysis
As shown in Figure 6, the spatial and temporal features are extracted to detect driver drowsiness. The detection results of a driver under normal conditions are shown in Figure 6A, and the detection result of driver drowsiness is shown in Figure 6B, with the detected features marked with red color. In the experiments, there are two kinds of videos with different camera positions, and the proposed methods can work effectively for both.
Figure 6. The examples of driver drowsiness detection. (A) The normal status of a driver; (B) The detection result of driver drowsiness, while the detected features are marked with red colors.
To further explain the effects of alignment, multi-granularity, and CNN extractors, several groups of experiments are conducted on the static image set. We also provide experiments on the FI-DDD dataset to verify the effectiveness of LSTMs for detecting drowsiness on videos.
4.3.1. The importance of alignment
It is essential to carry out experiments to explain the significance of alignment and the effects of locating precision.
Non-alignment vs. With Alignment: We provide two additional non-alignment methods to sample the multi-granularity patches in facial bounding boxes: Uniform Sampling (US) and Specific Sampling (SS). The corresponding patch sizes of our Aligned Sampling (AS) method and the two non-alignment ones are the same. Figure 7 (Left) shows the comparison of AS, US, and SS. AS, which takes alignment into account, achieves the best accuracy at 87.4% on the test set of the static image set, which is 4.9% higher than SS and 6.2% higher than US. In conclusion, the alignment of facial patches, providing aligned representations, is an effective way to improve the accuracy of driver drowsiness detection.
Figure 7. Left: The comparison of different sampling methods, sampling over US, sampling SS, and our proposed sampling with AS; Right: The effect of alignment precision,
Effects of alignment precision: We evaluate the effects of the alignment precision and investigate the influence quantitatively by adding random noise with a Gaussian distribution
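The perturbation used to study alignment precision can be sketched as below. This is a hedged illustration: the noise is assumed to be zero-mean and applied independently to each coordinate, and the value of `sigma`, the seed, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def perturb_landmarks(points: Sequence[Point], sigma: float,
                      rng: random.Random) -> List[Point]:
    """Add independent zero-mean Gaussian noise to every coordinate."""
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
            for x, y in points]


def mean_displacement(original: Sequence[Point], noisy: Sequence[Point]) -> float:
    """Average Euclidean distance between matching points."""
    total = sum(math.hypot(nx - ox, ny - oy)
                for (ox, oy), (nx, ny) in zip(original, noisy))
    return total / len(original)


# Illustrative landmarks; real ones would come from the face alignment step.
landmarks = [(float(i), float(2 * i)) for i in range(51)]
noisy = perturb_landmarks(landmarks, sigma=2.0, rng=random.Random(0))
shift = mean_displacement(landmarks, noisy)
```

Sweeping `sigma` and re-evaluating the classifier on patches cropped around the noisy points gives an accuracy curve as a function of alignment error.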
4.3.2. The effects of multi-granularity patches
Multi-granularity patches consist of local regions, main parts, and the global face. It is significant to conduct experiments and explain the importance of those granularities on driver drowsiness detection. We apply a fully connected layer and softmax operation to classify representations presented by MCNN extractors and analyze the effects of multi-granularity patches by the results of the classification.
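The classification head described above can be sketched in a few lines. This is a minimal illustration assuming a two-class output with label 0 for normal and 1 for drowsy; the weights and features are hypothetical values, not trained parameters.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           features: Sequence[float]) -> List[float]:
    """Fully connected layer: one row of weights per output class."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def classify(features: Sequence[float], weights: Sequence[Sequence[float]],
             bias: Sequence[float]) -> Tuple[int, List[float]]:
    """Return the predicted label (0 normal, 1 drowsy) and class probabilities."""
    probs = softmax(linear(weights, bias, features))
    return probs.index(max(probs)), probs


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Negative log-likelihood of the true label."""
    return -math.log(probs[label])


# Hypothetical two-dimensional representation and weights.
features = [1.0, 2.0]
weights = [[0.0, 0.0], [1.0, 1.0]]
bias = [0.0, 0.0]
label, probs = classify(features, weights, bias)
loss = cross_entropy(probs, 1)
```

In the experiments the input to this head would be the fused representation of the chosen granularity, which lets the same classifier compare local regions, main parts, the global face, and their combination.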
Learning curve on different granularities: We take four different granularities, listed as local regions, main parts, the global face, and their combination, into account to analyze the effects of multi-granularity facial patches. Figure 8 illustrates the comparisons of those granularities: the method with global face granularity converges the slowest, the method with local regions converges the fastest, and the multi-granularity method achieves good performance in both convergence speed and accuracy. Aligned points can achieve higher precision in those local regions with abundant boundary texture, which results in more aligned representations and easier classification. Nevertheless, multi-granularity patches containing more aligned information are more effective in driver drowsiness detection.
Figure 8. The comparison of different granularities, the global face, main parts, local regions, and multi-granularity patches. The Curve of Acc over training times is achieved by CNN with different granularities on the test set of the static image set.
Effects of positions and sizes: We change the positions and sizes of facial patches, respectively. As shown in Figure 9 (Left), the main facial parts, including eyes, nose, and mouth, obtain the best accuracy among the single-granularity methods at 83.6%. The combination of the three granularities achieves the best overall accuracy at 87.4%. We conclude that the most effective single-granularity representation is extracted from the three main facial parts, while the fusion of local and global clues is an excellent way to obtain better facial representations.
Figure 9. Left: The comparison of patches with different positions, GF-the global face, MP-main parts (eyes, nose, and mouth), and LR-local regions(the corner of eyes, the sides of the nose, and the boundary of the mouth); Right: The comparison of patches with different sizes at all locations. Mg represents multi-granularity patches.
To examine the difference between single-size and multi-granularity methods, we set all patches to the same size and vary that size while keeping the patch locations constant. Figure 9 (Right) shows that regions with different sizes achieve 2.3% higher accuracy than single-size patches. This results from the variation in size among different physiological parts; for example, the global face is larger than a single eye. The above analysis shows that the multi-granularity method is an effective way to represent facial features.
4.3.3. The parameters selection of MCNN extractor
The structure parameters of the convolutional layers are listed in Table 2, which are the same in all parallel convolutional paths. A patch with size
Table 2. The parameters of the three convolutional layers
|Max pooling||Not Used|
A fully connected layer is applied to combine the multi-granularity clues and generate MCNN representations. The number of its hidden units
4.3.4. The significance of LSTMs
We first apply MCNN to detect driver drowsiness on videos, but it has no capacity to capture the temporal clues. To address this limitation, we consider MCNN+LSTMs. It is necessary to compare the situation with LSTMs and without LSTMs to understand the effects of LSTMs. All experiments at this part are carried out on the FI-DDD dataset in daytime scenarios. The parameter settings and adjustments follow the settings in .
Parameters setting: The representations given by MCNN extractors are 256-dimensional, and the number of hidden units in each LSTM block is equal to 256. The forget gate is enabled, and the max memory step is set to 60 frames. We randomly select a batch with 1,000 samples to train the LSTM parameters with a learning rate
MCNN-Only vs. MCNN + LSTMs: The experiments are carried out on four different granularities to research the effects of multi-granularity and LSTMs. Figure 11 shows the accuracy of MCNN only and MCNN + LSTMs for detecting videos on test sets under different granularities. The MCNN-Only method obtains 72.7% accuracy, while the approach of MCNN + LSTMs surpasses it by 15.6%. The reason is that the LSTMs have the ability to mine the clues in the temporal dimension, which is significant for recognizing lots of ambiguous states, such as closing eyes and blinking. Comparing the accuracies of different granularities, we discover that the well-aligned multi-granularity facial patches still achieve the best performance. The accuracy of the main parts ranks second, which means the granularity of the main parts certainly plays the most important role in improving the effectiveness compared to the other two granularities.
Comparisons with the previous methods
We evaluate the whole method on the evaluation set and compare it with the previous methods [11, 13, 25, 52, 53] achieved on the same dataset. Due to the long-term memory characteristics of the NTHU-DDD dataset, the max memory length is set to 120 frames, and other parameters remain the same as in the above experiments. Especially for night scenarios, we retrain a model with the night data of NTHU-DDD to detect driver drowsiness on near-infrared videos.
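One plausible reading of the max memory length is the number of consecutive frames over which the LSTM is unrolled. Under that assumption, a frame-level feature sequence can be split into segments as sketched below; the function name is hypothetical, and the 530-frame example reuses the average clip length reported for FI-DDD.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_segments(sequence: Sequence[T], max_len: int) -> List[Sequence[T]]:
    """Split a sequence into consecutive segments of at most `max_len` items."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]


# Stand-in for per-frame MCNN representations of one 530-frame clip.
frame_features = list(range(530))
segments_60 = split_into_segments(frame_features, 60)    # FI-DDD setting
segments_120 = split_into_segments(frame_features, 120)  # NTHU-DDD setting
```

A longer segment lets the temporal model see more context per update, which matches the motivation given above for raising the length from 60 to 120 frames on the long-term labels of NTHU-DDD.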
Accuracy: Table 3 compares our method with the previous work [11, 13, 25, 52, 53]. The proposed method achieves 90.05% accuracy, a significant improvement over the other existing methods and the state of the art for driver drowsiness detection on this dataset.
Table 3. The comparison of different methods on the evaluation set of the NTHU-DDD dataset with the detailed information of environments
|Methods||Platform||Spatial features||Sequential features||Speed||Accuracy|
|Yu et al.||GPU||3D-DCNN||Feature fusion||24
|Park et al.||-||DDD Network||SVM||-||73.06%|
|Yu et al.||GPU||3D-DCNN||Feature fusion||38.1 fps||76.2%|
|Wang et al.||-||CNN||LSTMs||40.64 fps||82.8%|
|MSTN ||-||CNN||LSTMs||60 fps||85.52%|
Speed:Table 3 shows the performance comparison of our method with other existing methods. The proposed method achieves a speed of 37 FPS on the GPU platform, satisfies the real-time performance requirements, and exceeds the majority of existing methods, second only to the methods proposed by Yu et al.. However, our method has significantly improved accuracy compared to their methods. At the same time, we measure the time consumption of all modules of our proposed method. From Table 4, CNN is the most time-consuming, and the approach achieves about 3 FPS on a CPU platform.
Table 4. Time consumption of each module of the proposed method. Others include reading, writing, and some converting operations
|CPU(E5) + GPU(M40)||11.1||10.7||0.6||4.5||26.9|
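The relation between the per-module times and the reported speed can be checked with a short calculation. The sketch below assumes the values in the row above are milliseconds per frame (the unit and the column labels are not available here, so the modules are left unnamed) and that the last value, 26.9, is the total.

```python
# Per-module times copied from the "CPU(E5) + GPU(M40)" row.
# ASSUMPTION: the values are milliseconds per frame; the column labels are
# not available, so the modules are left unnamed.
MODULE_TIMES_MS = [11.1, 10.7, 0.6, 4.5]
REPORTED_TOTAL_MS = 26.9


def frames_per_second(module_times_ms):
    """Convert per-frame module times in milliseconds to frames per second."""
    total_ms = sum(module_times_ms)
    return 1000.0 / total_ms


total_ms = sum(MODULE_TIMES_MS)
fps = frames_per_second(MODULE_TIMES_MS)
print(f"total = {total_ms:.1f} ms per frame, speed = {fps:.1f} FPS")
```

Under these assumptions the four module times sum to the listed total of 26.9, and 1000 / 26.9 gives roughly 37.2 FPS, which is consistent with the reported speed of about 37 FPS.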
Although our method has achieved good performance on existing datasets, there are still complex conditions and uncertain factors in real-world scenarios, such as significant changes in lighting and occlusion of the driver's face. In future research, we will continue to explore the model explainability and uncertainty quantification. We will also consider applying the method proposed in this paper to real-world scenarios and continue to explore how to improve the generalization of the method under complex conditions.
This paper proposes a novel approach that integrates a memory mechanism into a multi-granularity deep framework to detect driver drowsiness, so that the temporal dependencies over sequential frames are integrated with the spatial deep learning framework on frontal faces. First, the spatial MCNN is designed to utilize a group of parallel CNN extractors on well-aligned facial patches of different granularities and to extract facial representations effectively under large variations of head pose. Second, the memory mechanism is set up with a deep LSTM network over the facial representations to explore long-term relationships of variable length over sequential frames, which makes it capable of distinguishing states with temporal dependencies. The proposed method is evaluated on the NTHU-DDD dataset and achieves 90.05% accuracy at about 37 FPS, making it the state-of-the-art method for driver drowsiness detection. Moreover, a new dataset named FI-DDD is built with higher precision of drowsy locations in the temporal dimension. This dataset performs well in training model parameters and analyzing the effects of several factors, and it will be made publicly available to speed up the study of driver drowsiness detection.
Made substantial contributions to the conception and design of the study: Liu T, Chen D, Yuan Z
Performed data analysis and interpretation: Zhang H, Lyu J
All authors were involved in the writing of the paper.
Availability of data and materials
Financial support and sponsorship
This work was supported by the Beijing Natural Science Foundation (L201022).
Conflicts of interest
Ethical approval and consent to participate
Photographs of the faces used in this study were permitted and processed to protect privacy.
Consent for publication
Written informed consent was obtained from the participants.
© The Author(s) 2023.
1. Global status report on road safety 2013: supporting a decade of action: summary. WHO; 2013. Available from:
2. Wang J, Gong Y. Recognition of multiple drivers' emotional state. In: IEEE 19th International Conference on Pattern Recognition; 2008 Dec 08-11; Tampa, USA. IEEE; 2009. p. 1-4.
3. Yaacob H, Hossain F, Shari S, Khare SK, Ooi CP, Acharya UR. Application of artificial intelligence techniques for brain-computer interface in mental fatigue detection: a systematic review (2011-2022). IEEE Access 2023;11:74736-58.
4. Sharma S, Khare SK, Bajaj V, Ansari IA. Improving the separability of drowsiness and alert EEG signals using analytic form of wavelet transform. Appl Acoust 2021;181:108164.
5. Khare SK, Bajaj V, Sinha GR. Automatic drowsiness detection based on variational non-linear chirp mode decomposition using electroencephalogram signals. In: Modelling and Analysis of Active Biopotential Signals in Healthcare, Volume 1. IOP Publishing; 2020. Available from:
6. Colic A, Marques O, Furht B. Driver drowsiness detection - Systems and solutions. In: Springer Briefs in Computer Science. Springer; 2014.
7. Rezaei M, Klette R. Look at the driver, look at the road: No Distraction! No Accident! In: IEEE Conference on Computer Vision and Pattern Recognition; 2014 Jun 23-28. IEEE; 2014. p. 129-36.
8. Nakamura T, Maejima A, Morishima S. Detection of driver's drowsy facial expression. In: IEEE 2nd Asian Conference on Pattern Recognition; 2013 Nov 05-08. IEEE; 2014. p. 749-53.
9. Ullah MR, Aslam M, Ullah MI, Maria MEA. Driver's drowsiness detection through computer vision: a review. In: Mexican International Conference on Artificial Intelligence. Springer, Cham; 2017. p. 272-81.
10. Akrout B, Mahdi W. Spatio-temporal features for the automatic control of driver drowsiness state and lack of concentration. Mach Vision Appl 2015;26:1-13.
11. Yu J, Park S, Lee S, Jeon M. Representation learning, scene understanding, and feature fusion for drowsiness detection. In: Asian Conference on Computer Vision. Springer, Cham; 2016. p. 165-77.
12. Huynh XP, Park SM, Kim YG. Detection of driver drowsiness using 3D deep neural network and semi-supervised gradient boosting machine. In: Asian Conference on Computer Vision. Springer, Cham; 2016. p. 134-45.
13. Shih TH, Hsu CT. MSTN: Multistage spatial-temporal network for driver drowsiness detection. In: Asian Conference on Computer Vision. Springer, Cham; 2016. p. 146-53.
14. Shirakata T, Tanida K, Nishiyama J, Hirata Y. Detect the imperceptible drowsiness. SAE Int J Passeng Cars Electron Electr Syst 2010;3:98-108.
15. Gao Z, Le D, Hu H, Yu Z, Wu X. Driver drowsiness detection based on time series analysis of steering wheel angular velocity. In: IEEE 9th International Conference on Measuring Technology and Mechatronics Automation; 2017 Jan 14-15; Changsha, China. IEEE; 2017. p. 99-101.
16. Rumagit AM, Akbar IA, Igasaki T. Gazing time analysis for drowsiness assessment using eye gaze tracker. Telecommun Comput Electron Contr 2017;15:919.
17. Amirudin NAB, Saad N, Ali SSA, Adil SH. Detection and analysis of driver drowsiness. In: IEEE 3rd International Conference on Emerging Trends in Engineering, Sciences and Technology; 2018 Dec 21-22; Karachi, Pakistan. IEEE; 2019. p. 1-9.
18. Chmielińska J, Jakubowski J. Detection of driver fatigue symptoms using transfer learning. B Pol Acad Sci Tech 2018;66.
19. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM 2017;60:84-90.
20. Sun Y, Wang X, Tang X. Deep learning face representation from predicting 10,000 classes. In: IEEE Conference on Computer Vision and Pattern Recognition; 2014 Jun 23-28; Columbus, USA. IEEE; 2014. p. 1891-8.
22. Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 1998;86:2278-324.
23. Levi G, Hassner T. Age and gender classification using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; 2015 Jun 07-12; Boston, USA. IEEE; 2015. p. 34-42.
24. Trigueros DS, Meng L, Hartnett M. Face recognition: from traditional to deep learning methods. arXiv 2018; In press.
25. Park S, Pan F, Kang S, Yoo CD. Driver drowsiness detection system based on feature representation learning using various deep networks. In: Asian Conference on Computer Vision. Springer, Cham; 2016. p. 154-64.
26. Li K, Gong Y, Ren Z. A fatigue driving detection algorithm based on facial multi-feature fusion. IEEE Access 2020;8:101244-59.
27. Arakawa T. Trends and future prospects of the drowsiness detection and estimation technology. Sensors 2021;21:7921.
28. Dua M, Shakshi, Singla R, Raj S, Jangra A. Deep CNN models-based ensemble approach to driver drowsiness detection. Neural Comput Appl 2021;33:3155-68.
29. Celecia A, Figueiredo K, Vellasco M, González R. A portable fuzzy driver drowsiness estimation system. Sensors 2020;20:4093.
30. Mao J, Xu W, Yang Y, Wang J, Huang Z, Yuille A. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv 2014; In press.
31. Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing; 2013 May 26-31; Vancouver, Canada. IEEE; 2013. p. 6645-9.
32. Chernodub AN, Nowicki D. Sampling-based gradient regularization for capturing long-term dependencies in recurrent neural networks. In: International Conference on Neural Information Processing. Springer, Cham; 2016. p. 90-7.
33. Liang M, Hu X. Recurrent convolutional neural network for object recognition. In: IEEE Conference on Computer Vision and Pattern Recognition; 2015 Jun 07-12; Boston, USA. IEEE; 2015. p. 3367-75.
34. Donahue J, Hendricks LA, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description. In: IEEE Conference on Computer Vision and Pattern Recognition; 2015 Jun 07-12; Boston, USA. IEEE; 2015. p. 2625-34.
35. Wang J, Yang Y, Mao J, Huang Z, Huang C, Xu W. CNN-RNN: A unified framework for multi-label image classification. In: IEEE Conference on Computer Vision and Pattern Recognition; 2016 Jun 27-30; Las Vegas, USA. IEEE; 2016. p. 2285-94.
36. Jeong JH, Yu BW, Lee DH, Lee SW. Classification of drowsiness levels based on a deep spatio-temporal convolutional bidirectional LSTM network using electroencephalography signals. Brain Sci 2019;9:348.
37. Tan H, Bansal M. LXMERT: learning cross-modality encoder representations from transformers. arXiv 2019; In press.
38. Li W, Gao C, Niu G, et al. UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv 2020; In press.
39. Zeng Y, Zhang X, Li H. Multi-grained vision language pre-training: aligning texts with visual concepts. arXiv 2021; In press.
40. Huang Z, Zeng Z, Huang Y, et al. Seeing out of the box: end-to-end pre-training for vision-language representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021. p. 12976-85. Available from:
41. Kim W, Son B, Kim I. ViLT: vision-and-language transformer without convolution or region supervision. In: Proceedings of the 38th International Conference on Machine Learning. PMLR; 2021. p. 5583-94. Available from:
42. Zhang Z, Lan C, Zeng W, Chen Z. Multi-granularity reference-aided attentive feature aggregation for video-based person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. p. 10407-16. Available from:
43. Li Q, Qiu Z, Yao T, Mei T, Rui Y, Luo J. Action recognition by learning deep multi-granular spatio-temporal video representation. In: Proceedings of the ACM on International Conference on Multimedia Retrieval; 2016. p. 159-66.
44. Chen D, Cao X, Wen F, Sun J. Blessing of dimensionality: high-dimensional feature and its efficient compression for face verification. In: IEEE Conference on Computer Vision and Pattern Recognition; 2013 Jun 23-28; Portland, USA. IEEE; 2013. p. 3025-32.
45. Wang D, Shen Z, Shao J, Zhang W, Xue X, Zhang Z. Multiple granularity descriptors for fine-grained categorization. In: IEEE International Conference on Computer Vision; 2015 Dec 07-13; Santiago, Chile. IEEE; 2016. p. 2399-406.
46. Huang R, Wang Y, Li Z, Lei Z, Xu Y. RF-DCM: multi-granularity deep convolutional model based on feature recalibration and fusion for driver fatigue detection. IEEE Trans Intell Transp Syst 2020;23:630-40.
47. Weng CH, Lai YH, Lai SH. Driver drowsiness detection via a hierarchical temporal deep belief network. In: Asian Conference on Computer Vision. Springer; 2016. p. 117-33.
48. Ren S, Cao X, Wei Y, Sun J. Face alignment at 3000 FPS via regressing local binary features. In: IEEE Conference on Computer Vision and Pattern Recognition; 2014 Jun 23-28; Columbus, USA. IEEE; 2014. p. 1685-92.
49. Danelljan M, Häger G, Khan F, Felsberg M. Accurate scale estimation for robust visual tracking. In: Proceedings of the British Machine Vision Conference. BMVA Press; 2014. p. 1-12.
50. Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J. LSTM: a search space odyssey. IEEE Trans Neural Netw Learn Syst 2016;28:2222-32.
51. Khare SK, Bajaj V, Acharya UR. SchizoNET: a robust and accurate Margenau-Hill time-frequency distribution based deep neural network model for schizophrenia detection using EEG signals. Physiol Meas 2023;44:035005.
52. Yu J, Park S, Lee S, Jeon M. Driver drowsiness detection using condition-adaptive representation learning framework. IEEE T Intell Transp 2018;20:4206-18.
53. Wang C, Yan T, Jia H. Spatial-temporal feature representation learning for facial fatigue detection. Int J Pattern Recogn Artif Intell 2018;32:1856018.
Cite This Article
Zhang H, Liu T, Lyu J, Chen D, Yuan Z. Integrate memory mechanism in multi-granularity deep framework for driver drowsiness detection. Intell Robot 2023;3(4):614-31. http://dx.doi.org/10.20517/ir.2023.34