Research Article  |  Open Access  |  26 Nov 2023

Integrate memory mechanism in multi-granularity deep framework for driver drowsiness detection

Intell Robot 2023;3(4):614-31.
10.20517/ir.2023.34 |  © The Author(s) 2023.

Abstract

Driver drowsiness detection is a critical task for the early warning of safe driving, yet existing spatial-feature-based methods struggle with large variations in head pose. This paper proposes a novel approach that integrates a memory mechanism into a multi-granularity deep framework to detect driver drowsiness, in which the temporal dependencies over sequential frames are tightly integrated with a spatial deep learning framework on frontal faces. The proposed approach comprises two steps. First, a spatial Multi-granularity Convolutional Neural Network is designed to apply a group of parallel Convolutional Neural Network extractors to well-aligned facial patches of different granularities, extracting facial representations that remain effective under large variations of head pose. It can also flexibly fuse detailed appearance clues of the main facial parts with local-to-global spatial constraints. Second, a memory mechanism is built with a deep long short-term memory network over the facial representations to explore long-term relationships of variable length across sequential frames, which makes it possible to distinguish states with temporal dependencies, such as blinking versus closing the eyes. The proposed approach achieves 90.05% accuracy at about 37 frames per second (FPS) on the evaluation set of the National Tsing Hua University Driver Drowsiness Detection dataset, making it applicable to intelligent vehicles for driver drowsiness detection. A dataset named Forward Instant Driver Drowsiness Detection is also built and will be made publicly accessible to accelerate the study of driver drowsiness detection.

Keywords

Driver drowsiness detection, multi-granularity convolutional neural network, visual attention

1. INTRODUCTION

Driver drowsiness is a critical problem that induces 6% of serious road accidents each year [1]. This condition indicates that the driver lacks sleep, and it can be detected from variations in physiological signals [2-5], vehicle trajectory [6, 7], and facial expressions [8]. Drowsiness detection using vehicle-based, physiological, and behavioral change measurement systems is possible, each with its own pros and cons [9]. Subjective techniques cannot be used in a real driving situation but are helpful in simulations for determining drowsiness. Physiological signals, such as the electrocardiogram, electroencephalogram (EEG), and electrooculogram, can be utilized for drowsiness detection. Vehicle-movement-based detection is another technique, in which information is obtained from sensors attached to the steering wheel, acceleration pedal, or body of the vehicle; the signals collected from these sensors are continuously monitored to identify noticeable variations that indicate driver drowsiness. Drowsiness never comes instantly but appears with visually noticeable symptoms, which generally emerge well before drowsiness in every driver. Moreover, drowsiness is reflected by facial expressions, such as nodding, yawning, and closing the eyes. We, therefore, aim to develop a drowsiness detection method based on videos. Video-based methods have the potential to give timely warning prompts and receive feedback from drivers, making them highly valuable in practice.

Video-based drowsiness detection still faces numerous challenges, mainly stemming from the illumination condition changes, head pose variations, and temporal dependencies. In particular, the large variation of head pose causes serious deformations of facial shape, which makes it difficult to extract effective spatial representations. Conventional approaches, such as those based on aligned facial points [8], provide a better way to represent drowsy features; however, their limitation lies in their inability to distinguish between blinking and closing eyes due to the neglect of temporal relationships. Spatial-temporal descriptors [10] are proposed to collect spatial and temporal features, but they are not good at distinguishing states with long-term dependencies, such as yawning and speaking. Besides, these hand-crafted descriptors are not powerful enough to capture the wide range of head pose variations and classify confusing states. For instance, looking aside and lowering the head lead to significant pose variations, while yawning and laughing, although similar in appearance, belong to different states.

Recently, deep learning methods have been widely used to learn facial spatial representations automatically from the global face [11-13]. However, the global face, when not well aligned, is too weak to provide effective representations, especially under large pose variations. Moreover, it is not flexible enough to fuse the configurations of local regions or to concentrate representations on the most important parts, such as the eyes, nose, and mouth, on which the majority of drowsy information focuses. Distinguishing easy-to-confuse states, such as blinking and closing the eyes, poses another challenge. Additionally, 3D Convolutional Neural Networks (CNNs) with fixed time windows [11] have been tried for describing spatial and temporal features, but they lack the capability to model long-term relationships with variable time lengths.

We propose a Long-term Multi-granularity Deep Framework (LMDF) to detect driver drowsiness from well-aligned facial patches. Our method applies alignment technology to obtain well-aligned facial patches over frames; these patches are mainly located in the informative regions that supply critical drowsy information. A group of parallel convolution layers is applied to the multi-granularity facial patches, and the outputs of these layers are fused by a fully connected layer to generate spatial representations; this network is named the Multi-granularity CNN (MCNN). The MCNN is able to fuse the appearance of the well-aligned patches and capture local-to-global constraints. To explore temporal dynamical characteristics, we fuse a memory mechanism into the MCNN: a Long Short-Term Memory (LSTM) network is applied to the spatial representations over sequential frames, which can distinguish confusing states with temporal relationships, such as yawning versus laughing and blinking versus closing the eyes. The proposed method can, thus, not only extract effective facial representations from single-frame images but also mine temporal clues from videos.

As shown in Figure 1, the spatial and temporal features are extracted and concentrated to detect driver drowsiness. The contributions of our approach are mainly in the following three aspects:


Figure 1. The examples of driver drowsiness. (A) The normal status of a driver; (B) The drowsiness status of the driver. The spatial and temporal features are extracted and concentrated to detect driver drowsiness.

(1) We propose MCNN to learn the facial representations from the most important parts, which makes the detector robust to large pose variations.

(2) We propose an LMDF to learn facial spatial features and their long-term temporal dependencies.

(3) We build a Forward Instant Driver Drowsiness Detection (FI-DDD) dataset with higher precision of drowsy locations in the temporal dimension, which is a good test bed for evaluating practical systems that are required to detect drowsiness in time.

2. RELATED WORK

2.1 Traditional driver drowsiness detection methods

Driver drowsiness detection is becoming a hot topic in Advanced Driver Assistance Systems, and many traditional methods have been applied to the problem. The change in pupil diameter was utilized by Shirakata et al. to detect imperceptible drowsiness, which is effective but inconvenient because the driver must wear the measuring equipment [14]. Nakamura et al. utilized face alignment to estimate the degree of drowsiness via K-Nearest Neighbors (k-NN), which cannot achieve online performance [8]. Spatio-temporal features for driver drowsiness detection were proposed by Akrout et al. [10]; however, these features, based on the Hough transformation, do not work well in practical driving environments. Gao et al. [15] proposed a method for detecting driver drowsiness based on time-series analysis of the steering wheel angular velocity: a temporal detection window is used to determine the steering wheel angular velocity over a time series, during which specified indicators of driver drowsiness become evident. Besides, the representations used in those methods are hand-crafted, which may not be flexible enough to adapt to the complex situations faced in driving. In contrast, our method automatically learns facial representations, which is more effective for practical tasks.

In earlier research, single measures, such as vehicle-based, physiological, behavioral, or subjective measures, were the focus for estimating the level of drowsiness. Researchers have reported good results using numerous less intrusive techniques to detect driver drowsiness, including eyelid movement and gaze or head movement monitoring. Rumagit et al. investigated the relationship between drowsiness and physiological conditions by utilizing an eye gaze tracker and the Japanese version of the Karolinska sleepiness scale within a driving simulator environment [16]. Amirudin et al. analyzed two single measures, physiological and behavioral, using EEG signals and video sequences [17]. Chmielińska et al. introduced an approach using CNNs and transfer learning techniques [18]; their paper presents the results of scientific investigations aimed at developing detectors of selected driver fatigue symptoms based on face images.

2.2 Driver drowsiness detection methods: CNN and RNN-based approaches

Deep learning approaches, such as CNNs, have achieved success in representing information in images [19-21] and are widely used in the field of machine learning [22]. Using CNN models for image classification avoids the high complexity and difficulty of feature extraction found in traditional classification methods and is, therefore, increasingly applied in facial recognition [23]. Compared to traditional image classification methods, deep learning methods can be trained on large datasets and learn the best features to represent the data, making them more responsive to changes in the real world [24].

Recently, many researchers have also applied CNNs to driver drowsiness detection. Park et al. combined the results of three existing networks via Support Vector Machines (SVM) to classify videos [25]. Later works [26-29] introduced various improved CNNs that capture facial regions under complex driving conditions to classify videos. However, those models can only classify whole videos into different categories; they cannot detect driver drowsiness online. In contrast, a 3D-CNN was applied by Yu et al. to extract spatial and temporal information [11], but the method can only capture features within a fixed temporal window. The above two methods utilize global face images, which cannot flexibly configure the patches containing the majority of drowsy information. Moreover, they struggle to capture dependencies with variable temporal lengths.

Due to the strong performance of LSTM networks on sequential data [30-32], an increasing number of researchers have proposed combinations of CNNs and LSTMs to learn spatial and temporal representations of sequential frames. Interestingly, Liang et al. proposed convolutional layers with intra-layer recurrent connections to integrate context information for object recognition [33]. Donahue et al. provided a method that extracts visual features from images with a CNN and learns long-term dependencies from sequential data with LSTMs [34]. In particular, the approaches of Wang et al. and Jeong et al. process images with a CNN and model sequential labels with LSTMs concurrently, then combine the two representations through projection layers [35, 36]. However, none of the above methods apply a multi-granularity approach to concentrate representations on important parts and flexibly fuse the configurations of different regions.

2.3 Multi-granularity methods

Fine-grained methods mostly rely on object detection [37-39], classifying all regions after identifying areas that may contain objects. Coarse-grained methods extract and encode overall image features through convolutional networks [40] or vision transformers [41], which can eliminate the interference of fine details, but their performance is often inferior to that of fine-grained methods. Multi-granularity methods combine coarse and fine granularity to capture discriminative spatial and temporal information at different semantic levels [42].

Recently, multi-granularity methods have achieved excellent results in several computer vision applications. Li et al. proposed a temporal multi-granularity approach for action recognition [43]; their method achieved state-of-the-art performance on action benchmarks but could not capture detailed appearance clues or local-to-global spatial information. Chen et al. applied multi-scale patches based on face alignment to face recognition [44]. Wang et al. utilized multi-granularity regions detected by CNNs of three granularities to generate a multi-granularity descriptor for fine-grained categorization, but this method cannot process sequential frames [45]. Huang et al. proposed a multi-granularity extraction sub-network that extracts more efficient multi-granularity features while compressing the network parameters [46]; they also included a feature rectification sub-network and a feature fusion sub-network to adaptively recalibrate and fuse the multi-granularity features, and finally applied an LSTM network to distinguish actions with similar appearances. However, this method cannot prioritize the most significant regions to obtain the most precise result and speed up inference. Different from the approaches mentioned above, our method captures both spatial multi-granularity information and long-term temporal dependencies. In particular, our MCNN learns representations of the most significant regions from well-aligned multi-granularity patches, and the proposed method achieves state-of-the-art accuracy on the National Tsing Hua University Driver Drowsiness Detection (NTHU-DDD) dataset [47] for driver drowsiness detection.

3. METHODS

The proposed method utilizes MCNN to learn facial representations from single-frame images. The representations, extracted from well-aligned multi-granularity facial patches, contain detailed appearance information of the main parts and local-to-global constraints. Furthermore, our approach takes advantage of a deep LSTM network to explore the dynamical characteristics of the facial representations from sequential frames. The detailed structure of our LMDF combining MCNN and LSTMs is shown in Figure 2.


Figure 2. The long-term multi-granularity deep framework for driver drowsiness detection. The first stage is well-aligned multi-granularity patches that consist of local regions, main parts, and the global face. Parallel convolutional layers are well-applied to process these patches separately. In the second stage, a fully connected layer fuses local and global clues and generates a representation. The first two stages together construct the MCNN. The third stage uses Recurrent Neural Networks (RNN) with multiple LSTM blocks to mine the clues in the temporal dimension, together with a fully connected layer.

3.1 Well-aligned multi-granularity patches

It is well known that drowsy information is concentrated in several main facial parts, such as the eyes, nose, and mouth. Alignment provides an excellent way to extract well-aligned features over frames, which effectively represent facial drowsy states. Besides, a global patch provides rough information for estimating the states of the driver's head and full face, which assists the drowsiness decision when the locations of the parts are imprecise. Our method takes advantage of local regions and the global face at the same time.

We utilize face alignment technology to locate the facial shape points. Given an image $$ I^t $$ containing a face in the $$ t $$-th frame, we detect the landmark points of the facial shape $$ S^t $$ by regressing local binary features, as proposed by Ren et al. [48]. From those points, it is convenient to obtain the locations of the main parts and important local regions. According to the center points and specific sizes of all regions, we crop the patches from the original image and resize them to the same size, yielding the well-aligned multi-granularity patches that serve as the input of the CNN.

These patches, including local regions, main parts, and the global face, are produced by three different mappings. As shown in Figure 3, a mapping $$ {\bf{\Phi}}^M_p $$ selects center points of facial features, such as the eyes, nose, and mouth, from the facial shape $$ S^t $$ and crops patches of those parts from the input image $$ I^t $$ with given sizes $$ s_p $$. The mapping then converts these patches to a unified size $$ s_u $$. Thus, the single-granularity patches of the main parts $$ {\bf{I}}_p^t $$ are generated. The mappings $$ {\bf{\Phi}}^M_l $$ and $$ {\bf{\Phi}}^M_g $$ operate similarly to $$ {\bf{\Phi}}^M_p $$, differing only in the locations and sizes of the regions. The mapping $$ {\bf{\Phi}}^M_l $$ selects the corners of the eyes and mouth and the sides of the nose as regions of interest with size $$ s_l $$ and outputs local patches $$ {\bf{I}}_l^t $$. A global facial region of size $$ s_g $$ is chosen by the mapping $$ {\bf{\Phi}}^M_g $$, which produces a global facial patch $$ {\bf{I}}_g^t $$. Formally, the mappings are represented as

$$ {\bf{I}}_i^t = {\bf{\Phi}}^M_i(I^t, S^t, s_i, s_u), i \in \{l, g, p\} $$


Figure 3. The procedure of extracting multi-granularity facial patches, which include three granularities: the main parts, local regions, and the global face.

By processing the input image $$ I^t $$ through the three mappings, we can obtain a set of well-aligned patches $$ {\bf{I}}_c^t $$ consisting of the main parts, local regions, and global face, which is represented as

$$ {\bf{I}}_c^t = \{{\bf{I}}_{l, :}^t, {\bf{I}}_{p, :}^t, {\bf{I}}_{g, :}^t\} $$

where $$ {\bf{I}}^t_{i, :}, i\in\{l, p, g\} $$ represents all elements of a patch set $$ {\bf{I}}^t_i $$.

Compared to the original image, the patch set $$ {\bf{I}}_c^t $$, including both detailed appearance clues of parts and rough information of the full face, has more advantages in describing the facial states. Meanwhile, the relations between local and global regions are implied, which are the basis of mining useful features. Therefore, we take the set of patches $$ {\bf{I}}_c^t $$ as the input of CNN to learn effective representations.
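To make the construction of $$ {\bf{I}}_c^t $$ concrete, the sketch below shows how the three mappings could be implemented with OpenCV and NumPy. The sizes follow the settings given later in Section 4.2, while the landmark index lists and function names are hypothetical placeholders, since the actual point ordering depends on the alignment model.

```python
import cv2
import numpy as np

def crop_patch(image, center, size, unified=64):
    """Phi^M for one region: crop a size x size patch around a center point,
    clamp it to the image borders, and resize it to the unified size s_u."""
    h, w = image.shape[:2]
    half = size // 2
    x, y = int(center[0]), int(center[1])
    x0, y0 = max(x - half, 0), max(y - half, 0)
    x1, y1 = min(x + half, w), min(y + half, h)
    return cv2.resize(image[y0:y1, x0:x1], (unified, unified))

def multi_granularity_patches(image, shape):
    """Build I_c^t = {I_l, I_p, I_g}: ten local regions (32 px), four main
    parts (64 px), and one global face (160 px) from the 51 landmarks.
    The index lists below are assumptions, not the authors' actual choice."""
    LOCAL_IDX = [0, 3, 6, 9, 12, 15, 31, 34, 37, 40]   # corners/sides (assumed)
    PART_IDX = [19, 22, 25, 28]                        # eyes, nose, mouth (assumed)
    patches = [crop_patch(image, shape[i], 32) for i in LOCAL_IDX]
    patches += [crop_patch(image, shape[i], 64) for i in PART_IDX]
    patches.append(crop_patch(image, shape.mean(axis=0), 160))
    return np.stack(patches)  # (15, 64, 64, 3) for an RGB frame
```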

3.2 Learning facial representations

Our approach learns representations with a CNN rather than hand-crafting them, owing to the CNN's strong performance in learning spatial features. We apply several convolutional layers to process each patch in the set $$ {\bf{I}}_c^t $$ independently. To fuse the information of all patches, a fully connected layer is arranged after all convolutional operations, which generates $$ N $$-dimensional descriptors combining local and global clues.

Every patch is first processed by convolutional operations. For a patch $$ {\bf{I}}_{c, k}^t $$, the $$ k $$-th element of the patch set $$ {\bf{I}}_c^t $$ of length $$ L $$, three convolutional layers are utilized to capture the spatial features, as shown in Figure 4. The first consists of a convolution with Rectified Linear Unit (ReLU) activation followed by a max-pooling operation, which projects a normalized 3-channel image to a higher-dimensional representation. Only convolution and ReLU activation are used in the second layer, which further enlarges the dimension of the representation. The structure of the third convolutional layer is similar to the first but uses different parameters to decrease the dimension. A representation $$ {\bf{x}}_k^t $$ of the patch $$ {\bf{I}}_{c, k}^t $$ is generated by a mapping $$ {\bf{\Phi}}^C $$ consisting of those convolutional layers with parameters $$ {\mathit{\boldsymbol{\theta}}}^C_k $$, which is represented as

$$ {\bf{x}}_k^t = {\bf{\Phi}}^C({\mathit{\boldsymbol{\theta}}}^C_k, {\bf{I}}_{c, k}^t), k=1, 2, \dots, L $$


Figure 4. The three layers used to capture the spatial features. The first layer projects a normalized three-channel image onto a higher-dimensional representation; the second layer enlarges the dimension of the representation; the third layer decreases the dimension with parameters different from those of the first layer.

where $$ {\mathit{\boldsymbol{\theta}}}^C_k $$ is the $$ k $$-th element of the convolutional parameter set $$ {\mathit{\boldsymbol{\theta}}}^C $$.
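As a sketch of the mapping $$ {\bf{\Phi}}^C $$, the per-patch extractor below follows the layer structure described above, with the kernel and channel sizes given later in Table 2. The 'same' padding is our assumption, chosen so that a 64×64 patch yields a 16×16×4 tensor, i.e., a 1024-dimensional vector, consistent with Section 4.3.3; PyTorch is used purely for illustration.

```python
import torch.nn as nn

class PatchExtractor(nn.Module):
    """Phi^C: three convolutional layers applied to one 64x64 patch
    (channel/kernel sizes from Table 2; 'same' padding assumed)."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(64, 4, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )

    def forward(self, patch):                 # patch: (B, 3, 64, 64)
        return self.layers(patch).flatten(1)  # x_k^t: (B, 1024)
```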

A fully connected layer is utilized to combine the representations extracted by the mapping $$ {\bf{\Phi}}^C $$ from the set of patches. Before the combining operation, we concatenate those representations into a long vector $$ {\bf{x}}_c^t $$, formed as

$$ {\bf{x}}_c^t = [{\bf{x}}_{k}^t], k=1, 2, \dots, L $$

With a specific weight matrix $$ {\bf{W}}^C_f $$ and bias vector $$ {\bf{b}}^C_f $$, the combined $$ N $$-dimensional representation $$ {\bf{x}}^t $$ is produced by the fully connected layer as

$$ {\bf{x}}^t = \max\{{\bf{W}}^C_f{{\bf{x}}_c^t}+{\bf{b}}^C_f, {\bf{0}}\} $$

in which $$ {\bf{0}} $$ is a zero vector.

The descriptor $$ {\bf{x}}^t $$ contains not only detailed appearance information implied in every part but also the constrained relations between local regions and the global face. The effectiveness of the descriptor can be improved by appropriate objective functions and proper training methods.

Driver drowsiness detection is a binary classification problem; thus, the state of an input frame is either drowsy or not. We label drowsiness with 1 as the positive sample and the normal state with 0 as the negative sample. A label $$ c $$ is expressed as a one-hot vector $$ {\bf{y}}_c $$; for example, the vector [0, 1] denotes the positive label.

To train the parameters of the CNN, we project the representation $$ {\bf{x}}^t $$ into the probabilities of each category $$ c\in\{0, 1\} $$ through another fully connected layer with weights $$ {\bf{W}}^C_p $$ and a bias vector $$ {\bf{b}}^C_p $$, and the probability vector $$ {\bf{p}}(c \mid {\bf{x}}^t, {\bf{W}}^C_p, {\bf{b}}^C_p) $$ is normalized via a softmax layer. Cross-entropy, which indicates the correctness of the classification, is selected as the objective function, and we utilize the Adam optimizer to train the whole CNN. After training, the visual representations are generated by the convolutional layers and the first fully connected layer.
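Putting the pieces together, a minimal sketch of the MCNN and its training objective might look as follows, reusing the PatchExtractor from the previous sketch. The fusion dimension $$ N=256 $$ anticipates Section 4.3.3; the names and batch handling are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MCNN(nn.Module):
    """Parallel extractors, a fusion layer producing x^t, and a
    classification head used to train the CNN parameters (a sketch)."""
    def __init__(self, num_patches=15, dim=256):
        super().__init__()
        self.extractors = nn.ModuleList(PatchExtractor() for _ in range(num_patches))
        self.fuse = nn.Linear(num_patches * 1024, dim)  # W_f^C, b_f^C
        self.head = nn.Linear(dim, 2)                   # W_p^C, b_p^C

    def forward(self, patches):                         # (B, 15, 3, 64, 64)
        feats = [ext(patches[:, k]) for k, ext in enumerate(self.extractors)]
        x = torch.relu(self.fuse(torch.cat(feats, dim=1)))  # x^t = max{W x_c + b, 0}
        return x, self.head(x)                          # representation, class logits

model = MCNN()
optimizer = torch.optim.Adam(model.parameters())
# Training step (labels: 0 = normal, 1 = drowsy); cross_entropy applies
# the softmax internally:
#   x, logits = model(batch_patches)
#   loss = nn.functional.cross_entropy(logits, batch_labels)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```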

3.3 Exploring dynamical characteristics

The representation $$ {\bf{x}}^t $$ is extracted from a single frame, whereas whether a driver is drowsy must be judged over a certain period. We apply LSTMs to model the temporal dynamical characteristics of the spatial representations for driver drowsiness detection.

An LSTM block consists of an input gate, a forget gate, an output gate, and a memory cell. Because of the three gates, the LSTM block can learn long-term dependencies in sequential data, and its parameters are easier to train. The memory cell can store long-term information in its vector, which can be rewritten or operated on in the next time steps. Besides, the number of hidden units should be chosen according to the dimension of the input representation $$ {\bf{x}}^t $$.

We employ multiple-layer LSTMs to mine the temporal features for driver drowsiness. A mapping $$ {\bf{\Phi}}^R $$ containing three-layer LSTMs with parameters $$ {\mathit{\boldsymbol{\theta}}}^R $$ is utilized to explore temporal clues of the representation $$ {\bf{x}}^t $$ generated by MCNN extractors and presents the hidden states $$ {\bf{h}}^t_3 $$ of the third layer as a representation containing temporal dependencies, which is represented as:

$$ {\bf{h}}^t_3 = {\bf{\Phi}}^R({\mathit{\boldsymbol{\theta}}}^R, {\bf{h}}^{t-1}, {\bf{x}}^t) $$

where $$ {\bf{h}}^{t-1}=\{{\bf{h}}^{t-1}_1, {\bf{h}}^{t-1}_2, {\bf{h}}^{t-1}_3\} $$ is the set of hidden states of the three LSTM layers at the previous step.

A fully connected layer with weights $$ {\bf{W}}^R $$ and a bias vector $$ {\bf{b}}^R $$ is used to project the output of the mapping $$ {\bf{\Phi}}^R $$ into a two-dimensional vector, which is then decoded by a softmax operation into the probabilities $$ {\bf{p}}(c \mid {\bf{h}}^t_3, {\bf{W}}^R, {\bf{b}}^R) $$ of the two categories. To solve for the parameters, we take advantage of the Adam optimizer to train the LSTMs with a cross-entropy objective function.

The label $$ y^t $$ of the current frame can be predicted as the class with the maximum probability.

$$ y^t=\arg \max\limits_c {\mathit{\boldsymbol{p}}}(c \mid {\bf{h}}^t_3, {\bf{W}}^R, {\bf{b}}^R), \quad c\in\{0, 1\} $$

Similarly, the labels $$ {\bf{y}} $$ of the sequential data can be obtained.
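A minimal sketch of this memory mechanism, again in PyTorch for illustration: a three-layer LSTM (matching the three-layer mapping $$ {\bf{\Phi}}^R $$) runs over the per-frame MCNN representations, and a fully connected layer projects the top-layer hidden state $$ {\bf{h}}^t_3 $$ to the two class logits. The hidden size of 256 follows the parameter settings in Section 4.3.4; everything else is our assumption.

```python
import torch.nn as nn

class TemporalHead(nn.Module):
    """Phi^R plus the output projection (W^R, b^R): a sketch, not the
    authors' exact implementation."""
    def __init__(self, dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=dim, hidden_size=dim,
                            num_layers=3, batch_first=True)
        self.project = nn.Linear(dim, 2)

    def forward(self, x_seq, state=None):      # x_seq: (B, T, 256)
        h3, state = self.lstm(x_seq, state)    # h3 holds h_3^t for each t
        return self.project(h3), state         # per-frame logits, carried state

# Per-frame prediction y^t = argmax_c p(c | h_3^t):
#   logits, state = head(x_seq)
#   y = logits.argmax(dim=-1)   # (B, T) labels in {0, 1}
```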

4. EXPERIMENTS

The NTHU-DDD dataset was provided for the ACCV 2016 workshop challenge on driver drowsiness detection, and we compare our approach with others on it. To make the sequential labels closer to practical driving environments, we relabel the video set following an instant detection principle. A new dataset, called FI-DDD, is generated from the relabeled videos, on which we learn parameters and analyze the performance of several subnetworks. The performance of our entire approach is evaluated on the original NTHU-DDD dataset, for which we train a set of parameters to achieve long-term memory performance. Finally, our LMDF obtains an accuracy of 90.05% on the evaluation set of the NTHU-DDD dataset, and the proposed method runs at about 37 frames per second (FPS) on a Tesla M40 GPU.

4.1 Dataset

NTHU-DDD Dataset: The NTHU-DDD dataset includes five scenarios: glasses, no glasses, glasses at night, no glasses at night, and sunglasses. The training set involves 18 volunteers (ten men and eight women) who act as drivers with four different states in every scenario, while the evaluation set has four volunteers (two men and two women). Non-sleepy videos contain only normal states, while sleepy videos mix normal and drowsy states. Besides, blinking-with-nodding and yawning videos record only drowsy eyes and mouths, respectively. The NTHU-DDD dataset offers four annotation files recording the states of drowsiness, eyes, head, and mouth for every video. Table 1 gives the labels of drowsiness and the three main parts.

Table 1

The labeled states of each part on the NTHU-DDD dataset

Part | 0 | 1 | 2
Drowsiness | Normal | Drowsy | -
Eyes | Normal | Sleepy | -
Head | Normal | Nodding | Looking aside
Mouth | Normal | Yawning | Talking and laughing

It is worth emphasizing that the labels on the NTHU-DDD dataset are long-term, which means that the state of a frame may depend on frames from the previous several seconds.

FI-DDD Dataset: A problem arises from the long-term memory in NTHU-DDD: a driver would still receive warning prompts even after having returned from a drowsy state to normal for a few seconds. At the same time, those labels cannot locate the drowsy states with high precision in the temporal dimension. To solve these problems, we relabel the videos following an instant principle, limiting the latency to 0.5 s, namely 15 frames for 30 FPS videos. Typical states, such as closing the eyes, yawning, and lowering the head, are still considered evidence for judging whether a frame is drowsy. The videos are cut into several clips that alternately contain only drowsy or normal states according to our labels. To describe the transitional states between normal and drowsy, we retain ten normal frames at the head and tail of every clip containing drowsiness. We name the relabeled dataset FI-DDD, which includes 14 drivers in the train set and four in the test set, as shown in Figure 5. The FI-DDD train set in the daytime has 157 clips and the test set 92 clips, while in night scenarios, the train set has 126 clips and the test set 75 clips, with about 530 frames per clip on average.


Figure 5. FI-DDD Dataset.

Static image set: To train the parameters of the CNN and analyze the effects of several factors, we build a static image set by sampling frames from the FI-DDD dataset. The samples are labeled as drowsy or normal, and the labels almost always indicate the true states of the corresponding images, although a small number of images are matched with wrong labels due to the lack of temporal dependence. The static image set has 7,498 images in the daytime: the train set includes 5,239 images, and the test set has 2,259 images. It has 2,653 images in night scenarios: the train set includes 1,750 images, and the test set has 903 images.

4.2 Implementation details

Face Alignment: We apply face alignment technology to locate the facial shape points for all videos. Face detection and tracking are combined to increase detection rates and provide more accurate face positions in the videos, on which the face alignment algorithm is based. The face detector is from OpenCV, and the face tracking approach is that of Danelljan et al. [49]. We implement the method of Ren et al. [48], retrain the model, and preprocess all videos to obtain the 51 landmark points for every frame. Frames with no detected face are marked as empty, with the landmark coordinates filled with zeros.

Multi-granularity: We obtain multi-granularity patches considering two factors: positions and sizes. We choose 15 positions from the facial shape points, divided into three granularities: one global face with size $$ s_g=(160 \times 160) $$, four main parts with size $$ s_p=(64 \times 64) $$, and ten local regions with size $$ s_l=(32 \times 32) $$. The specific locations of all patches are shown in Figure 3. Before being sent to the CNN, the patches are resized to $$ s_u=(64 \times 64) $$, normalized to [-0.5, 0.5], and converted to 3 channels to ensure that our framework can process RGB data.
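A small sketch of this preprocessing step, under the assumption that single-channel (near-infrared) frames are converted to three channels by replication:

```python
import numpy as np

def preprocess(patch):
    """Normalize a uint8 patch to [-0.5, 0.5] and ensure 3 channels
    (channel replication for near-infrared frames is our assumption)."""
    if patch.ndim == 2:                        # grayscale / near-infrared
        patch = np.stack([patch] * 3, axis=-1)
    return patch.astype(np.float32) / 255.0 - 0.5
```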

Dataset Usage: The static image set required for training the CNN parameters is sampled from the videos of FI-DDD with a specific frame interval. Since the result of the CNN is directly related to the multi-granularity patches and CNN parameters, we analyze the effects of those factors on the static image set, while all experiments analyzing the effects of LSTM parameters are carried out on the FI-DDD dataset. To compare with previous methods, we evaluate the proposed method on the evaluation set of the NTHU-DDD dataset.

4.3 Experimental analysis

As shown in Figure 6, the spatial and temporal features are extracted to detect driver drowsiness. The detection result for a driver under normal conditions is shown in Figure 6A, and the detection result for driver drowsiness is shown in Figure 6B, with the detected features marked in red. In the experiments, there are two kinds of videos with different camera positions, and the proposed method works effectively for both.


Figure 6. Examples of driver drowsiness detection. (A) The normal status of a driver; (B) The detection result of driver drowsiness, with the detected features marked in red.

To further explain the effects of alignment, multi-granularity, and CNN extractors, several groups of experiments are conducted on the static image set. We also provide experiments on the FI-DDD dataset to verify the effectiveness of LSTMs for detecting drowsiness on videos.

4.3.1 The importance of alignment

It is essential to carry out experiments to explain the significance of alignment and the effects of locating precision.

Non-alignment vs. with alignment: We provide two non-alignment methods that sample the multi-granularity patches within facial bounding boxes: Uniform Sampling (US) and Specific Sampling (SS). The patch sizes of our Aligned Sampling (AS) method and the two non-alignment methods are the same. Figure 7 (Left) compares AS, US, and SS. AS, which uses alignment, achieves the best accuracy of 87.4% on the test set of the static image set, 4.9% higher than SS and 6.2% higher than US. In conclusion, the alignment of facial patches, which provides aligned representations, is an effective way to improve the accuracy of driver drowsiness detection.


Figure 7. Left: The comparison of different sampling methods: Uniform Sampling (US), Specific Sampling (SS), and our proposed Aligned Sampling (AS); Right: The effect of alignment precision, where $$ \sigma $$ is the standard deviation of the normal distribution. The results (Acc) are achieved by the CNN on the test set of the static image set.

Effects of alignment precision: We evaluate the effects of alignment precision quantitatively by adding random noise drawn from a Gaussian distribution $$ N(0, \sigma) $$ to the well-aligned facial points. Figure 7 (Right) shows the results on the test set of the static image set: accuracy decreases as the standard deviation of the noise increases, falling below 80% when $$ \sigma\geq10 $$ px. Since the accuracy stays above 83% for $$ \sigma $$ below 5 px, we conclude that the proposed MCNN is robust to corrupted locations when $$ \sigma\leq5 $$ px.
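This robustness test amounts to jittering each landmark independently before patch extraction, along the lines of the sketch below (the function name is ours):

```python
import numpy as np

def perturb_landmarks(shape, sigma, rng=None):
    """Add i.i.d. Gaussian noise with standard deviation sigma (in pixels)
    to the (51, 2) array of well-aligned landmark coordinates."""
    rng = rng or np.random.default_rng()
    return shape + rng.normal(0.0, sigma, size=shape.shape)
```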

4.3.2 The effects of multi-granularity patches

Multi-granularity patches consist of local regions, main parts, and the global face. It is important to conduct experiments to explain the contribution of each granularity to driver drowsiness detection. We apply a fully connected layer and a softmax operation to classify the representations produced by the MCNN extractors and analyze the effects of multi-granularity patches through the classification results.

Learning curves for different granularities: We consider four granularity settings, namely local regions, main parts, the global face, and their combination, to analyze the effects of multi-granularity facial patches. Figure 8 compares these settings: the method using only global-face granularity converges the slowest, the one using local regions converges the fastest, and the multi-granularity method performs well in both convergence speed and accuracy. Aligned points achieve higher precision in local regions with abundant boundary texture, which results in better-aligned representations and easier classification. Overall, multi-granularity patches, containing more aligned information, are more effective for driver drowsiness detection.


Figure 8. The comparison of different granularities: the global face, main parts, local regions, and multi-granularity patches. The curves of Acc over training iterations are achieved by the CNN with different granularities on the test set of the static image set.

Effects of positions and sizes: We vary the positions and sizes of the facial patches separately. As shown in Figure 9 (Left), the main facial parts, including the eyes, nose, and mouth, obtain the best accuracy of 83.6% among the single-granularity methods, while the combination of the three granularities clearly achieves the best overall accuracy of 87.4%. We conclude that the most effective single-granularity representation is extracted from the three main facial parts, and fusing local and global clues is an excellent way to obtain better facial representations.


Figure 9. Left: The comparison of patches at different positions: GF (the global face), MP (main parts: eyes, nose, and mouth), and LR (local regions: the corners of the eyes, the sides of the nose, and the boundary of the mouth); Right: The comparison of patches with different sizes at all locations. Mg represents multi-granularity patches.

We then set all patches to the same size and vary that size to examine the difference between single-size and multi-granularity methods while keeping the patch locations constant. Figure 9 (Right) shows that regions with granularity-specific sizes achieve 2.3% higher accuracy than single-size patches. This phenomenon results from the variation in size among different physiological parts; for example, the global face is larger than a single eye. The above analysis shows that the multi-granularity method is an effective way to represent facial features.

4.3.3 The parameter selection of the MCNN extractor

The structural parameters of the convolutional layers are listed in Table 2 and are the same in all parallel convolutional paths. A patch of size $$ 64 \times64 $$ processed by those convolutional layers is projected to a tensor of size $$ 16 \times 16 \times 4 $$. A representation of the patch is generated by reshaping the tensor into a 1024-dimensional vector, which is the input of the fully connected layer.

Table 2

The parameters of the three convolutional layers

Layers | Operations | Attributions
1st | Convolution | Size: $$ [5\times5\times3\times32] $$
    | Activation | ReLU
    | Max pooling | Strides: $$ [2\times2] $$
2nd | Convolution | Size: $$ [5\times5\times32\times64] $$
    | Activation | ReLU
    | Max pooling | Not used
3rd | Convolution | Size: $$ [5\times5\times64\times4] $$
    | Activation | ReLU
    | Max pooling | Strides: $$ [2\times2] $$

A fully connected layer is applied to combine the multi-granularity clues and generate the MCNN representation. The number of its hidden units $$ N $$, namely the dimension of the representation, affects the combination of the patches. By varying $$ N $$, we explore the relationship between the dimension of the MCNN representation and the classification accuracy with well-aligned multi-granularity facial patches. The comparison of different dimensions is shown in Figure 10, which indicates that the dimension has almost no influence on the convergence speed, but 256-dimensional representations achieve the highest accuracy. Therefore, we choose 256 hidden units.


Figure 10. The comparison of MCNN representations with different dimensions in terms of accuracy and convergence, achieved by the CNN on the test set of the static image set in the daytime.

4.3.4 The significance of LSTMs

We first apply the MCNN alone to detect driver drowsiness in videos, but it has no capacity to capture temporal clues. To address this limitation, we consider MCNN + LSTMs. It is necessary to compare the settings with LSTMs [50] and without LSTMs to understand the effects of LSTMs. All experiments in this part are carried out on the FI-DDD dataset in daytime scenarios. The parameter settings and adjustments follow [51].

Parameter settings: The representations given by the MCNN extractors are 256-dimensional, and the number of hidden units in each LSTM block is likewise 256. The forget gate is enabled, and the maximum memory step is set to 60 frames. We randomly select batches of 1,000 samples to train the LSTM parameters with a learning rate of $$ 3\times10^{-4} $$. The fully connected layer projects the states of the last LSTM block to a 2-dimensional vector, which is decoded into the probability of drowsiness by a softmax operation.

MCNN-Only vs. MCNN + LSTMs: Experiments are carried out on the four granularity settings to study the effects of multi-granularity and LSTMs. Figure 11 shows the accuracy of MCNN-Only and MCNN + LSTMs for detecting drowsiness in videos on the test sets under different granularities. The MCNN-Only method obtains 72.7% accuracy, while MCNN + LSTMs surpasses it by 15.6%. The reason is that the LSTMs can mine clues in the temporal dimension, which is significant for recognizing many ambiguous states, such as closing the eyes versus blinking. Comparing the accuracies of different granularities, we find that the well-aligned multi-granularity facial patches still achieve the best performance. The accuracy of the main parts ranks second, meaning that the main-part granularity plays the most important role in improving effectiveness compared to the other two granularities.


Figure 11. The comparison of accuracies achieved via MCNN-Only and MCNN + LSTMs on the test set of the FI-DDD dataset. Results for the different granularities are also shown.

4.3.5 Comparisons with previous methods

We evaluate the whole method on the evaluation set and compare it with previous methods [11, 13, 25, 52, 53] evaluated on the same dataset. Due to the long-term memory characteristics of the NTHU-DDD dataset, the maximum memory length is set to 120 frames, and the other parameters remain the same as in the above experiments. For night scenarios, we retrain a model with the night data of NTHU-DDD to detect driver drowsiness in near-infrared videos.

Accuracy: Table 3 compares our method with previous work [11, 13, 25, 52, 53]. The proposed method achieves 90.05% accuracy, a significant improvement over the existing methods, making it the state-of-the-art method for driver drowsiness detection.

Table 3

The comparison of different methods on the evaluation set of the NTHU-DDD dataset, with details of the evaluation environments

Methods | Platform | Spatial features | Sequential features | Speed | Accuracy
Yu et al. [11] | GPU | 3D-DCNN | Feature fusion | 24~32 fps | 72.60%
Park et al. [25] | - | DDD Network | SVM | - | 73.06%
Yu et al. [52] | GPU | 3D-DCNN | Feature fusion | 38.1 fps | 76.2%
Wang et al. [53] | - | CNN | LSTMs | 40.64 fps | 82.8%
MSTN [13] | - | CNN | LSTMs | 60 fps | 85.52%
Ours | GPU-M40 | MCNN | LSTMs | 37 fps | 90.05%

Speed: Table 3 also compares the speed of our method with the other methods. The proposed method achieves 37 FPS on the GPU platform, which satisfies real-time performance requirements and is comparable to the fastest existing methods, such as that of Yu et al. [52], while offering significantly higher accuracy. We also measure the time consumption of every module of the proposed method. As Table 4 shows, the CNN is the most time-consuming module, and the approach achieves about 3 FPS on a CPU platform.

Table 4

Time consumption (ms per frame) of each module of the proposed method. "Others" includes reading, writing, and some converting operations

Platform | Mg | CNN | LSTMs | Others | Total
CPU(E5) + GPU(M40) | 11.1 | 10.7 | 0.6 | 4.5 | 26.9
CPU(I7) | 5.6 | 302.3 | 0.6 | 3.2 | 311.7

Although our method achieves good performance on existing datasets, real-world scenarios still contain complex conditions and uncertain factors, such as significant lighting changes and occlusion of the driver's face. In future research, we will continue to explore model explainability and uncertainty quantification [54]. We will also consider applying the proposed method to real-world scenarios and continue to explore how to improve its generalization under complex conditions.

5. CONCLUSIONS

This paper proposes a novel approach that integrates a memory mechanism into a multi-granularity deep framework to detect driver drowsiness, in which the temporal dependencies over sequential frames are tightly integrated with a spatial deep learning framework on frontal faces. First, the spatial MCNN is designed to apply a group of parallel CNN extractors to well-aligned facial patches of different granularities and extract facial representations that remain effective under large variations of head pose. Second, the memory mechanism is built with a deep LSTM network over the facial representations to explore long-term relationships of variable length across sequential frames, which is capable of distinguishing states with temporal dependencies. The proposed method is evaluated on the NTHU-DDD dataset and achieves 90.05% accuracy at about 37 FPS, representing the state of the art in driver drowsiness detection. Moreover, a new dataset named FI-DDD is built with higher precision of drowsy locations in the temporal dimension. This dataset performs well for training model parameters and analyzing the effects of several factors and will be made publicly available to speed up the study.

DECLARATIONS

Authors’ contributions

Made substantial contributions to the conception and design of the study: Liu T, Chen D, Yuan Z

Performed data analysis and interpretation: Zhang H, Lyu J

All authors were involved in the writing of the paper.

Availability of data and materials

Not applicable.

Financial support and sponsorship

This work was supported by the Beijing Natural Science Foundation (L201022).

Conflicts of interest

Not applicable.

Ethical approval and consent to participate

Written informed consent was obtained from the participants.

Consent for publication

Photographs of the faces used in this study were permitted and processed to protect privacy.

Copyright

© The Author(s) 2023.

REFERENCES

1. Global status report on road safety 2013: supporting a decade of action: summary. WHO; 2013. Available from: https://www.drugsandalcohol.ie/19499/. [Last accessed on 20 Nov 2023].

2. Wang J, Gong Y. Recognition of multiple drivers’ emotional state. In: IEEE 19th International Conference on Pattern Recognition; 2008 Dec 08-11; Tampa, USA. IEEE; 2008. p. 1-4.

3. Yaacob H, Hossain F, Shari S, Khare SK, Ooi CP, Acharya UR. Application of artificial intelligence techniques for brain-computer interface in mental fatigue detection: a systematic review (2011-2022). IEEE Access 2023;11:74736-58.

4. Sharma S, Khare SK, Bajaj V, Ansari IA. Improving the separability of drowsiness and alert EEG signals using analytic form of wavelet transform. Appl Acoust 2021;181:108164.

5. Khare SK, Bajaj V, Sinha GR. Automatic drowsiness detection based on variational non-linear chirp mode decomposition using electroencephalogram signals. In: Modelling and Analysis of Active Biopotential Signals in Healthcare, Volume 1. IOP Publishing; 2020. Available from: https://iopscience.iop.org/book/edit/978-0-7503-3279-8/chapter/bk978-0-7503-3279-8ch5. [Last accessed on 20 Nov 2023].

6. Colic A, Marques O, Furht B. Driver drowsiness detection - Systems and solutions. In: Springer Briefs in Computer Science. Springer; 2014.

7. Rezaei M, Klette R. Look at the driver, look at the road: No Distraction! No Accident! In: IEEE Conference on Computer Vision and Pattern Recognition; 2014 Jun 23-28. IEEE; 2014. pp. 129-36.

8. Nakamura T, Maejima A, Morishima S. Detection of driver’s drowsy facial expression. In: IEEE 2nd Asian Conference on Pattern Recognition; 2013 Nov 05-08. IEEE; 2013. pp. 749-53.

9. Ullah MR, Aslam M, Ullah MI, Maria MEA. Driver’s drowsiness detection through computer vision: a review. In: Mexican International Conference on Artificial Intelligence. Springer, Cham; 2017. pp. 272-81.

10. Akrout B, Mahdi W. Spatio-temporal features for the automatic control of driver drowsiness state and lack of concentration. Mach Vision Appl 2015;26:1-13.

11. Yu J, Park S, Lee S, Jeon M. Representation learning, scene understanding, and feature fusion for drowsiness detection. In: Asian Conference on Computer Vision. Springer, Cham; 2016. pp. 165-77.

12. Huynh XP, Park SM, Kim YG. Detection of driver drowsiness using 3D deep neural network and semi-supervised gradient boosting machine. In: Asian Conference on Computer Vision. Springer, Cham; 2016. pp. 134-45.

13. Shih TH, Hsu CT. MSTN: Multistage spatial-temporal network for driver drowsiness detection. In: Asian Conference on Computer Vision. Springer, Cham; 2016. pp. 146-53.

14. Shirakata T, Tanida K, Nishiyama J, Hirata Y. Detect the imperceptible drowsiness. SAE Int J Passeng Cars Electron Electr Syst 2010;3:98-108.

15. Gao Z, Le D, Hu H, Yu Z, Wu X. Driver drowsiness detection based on time series analysis of steering wheel angular velocity. In: IEEE 9th International Conference on Measuring Technology and Mechatronics Automation; 2017 Jan 14-15; Changsha, China. IEEE; 2017. pp. 99-101.

16. Rumagit AM, Akbar IA, Igasaki T. Gazing time analysis for drowsiness assessment using eye gaze tracker. Telecommun Comput Electron Contr 2017;15:919-25.

17. Amirudin NAB, Saad N, Ali SSA, Adil SH. Detection and analysis of driver drowsiness. In: IEEE 3rd International Conference on Emerging Trends in Engineering, Sciences and Technology; 2018 Dec 21-22; Karachi, Pakistan. IEEE; 2018. p. 1-9.

18. Chmielińska J, Jakubowski J. Detection of driver fatigue symptoms using transfer learning. B Pol Acad Sci Tech 2018;66:869-74.

19. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM 2017;60:84-90.

20. Sun Y, Wang X, Tang X. Deep learning face representation from predicting 10,000 classes. In: IEEE Conference on Computer Vision and Pattern Recognition; 2014 Jun 23-28; Columbus, USA. IEEE; 2014. pp. 1891-8.

21. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436-44.

22. Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 1998;86:2278-324.

23. Levi G, Hassner T. Age and gender classification using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops; 2015 Jun 07-12; Boston, USA. IEEE; 2015. pp. 34-42.

24. Trigueros DS, Meng L, Hartnett M. Face recognition: from traditional to deep learning methods. arXiv 2018; In press.

25. Park S, Pan F, Kang S, Yoo CD. Driver drowsiness detection system based on feature representation learning using various deep networks. In: Asian Conference on Computer Vision. Springer, Cham; 2016. pp. 154-64.

26. Li K, Gong Y, Ren Z. A fatigue driving detection algorithm based on facial multi-feature fusion. IEEE Access 2020;8:101244-59.

27. Arakawa T. Trends and future prospects of the drowsiness detection and estimation technology. Sensors 2021;21:7921.

28. Dua M, Shakshi, Singla R, Raj S, Jangra A. Deep CNN models-based ensemble approach to driver drowsiness detection. Neural Comput Appl 2021;33:3155-68.

29. Celecia A, Figueiredo K, Vellasco M, González R. A portable fuzzy driver drowsiness estimation system. Sensors 2020;20:4093.

30. Mao J, Xu W, Yang Y, Wang J, Huang Z, Yuille A. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv 2014; In press.

31. Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing; 2013 May 26-31; Vancouver, Canada. IEEE; 2013. pp. 6645-9.

32. Chernodub A, Nowicki D. Sampling-based gradient regularization for capturing long-term dependencies in recurrent neural networks. In: International Conference on Neural Information Processing. Springer, Cham; 2016. pp. 90-7.

33. Liang M, Hu X. Recurrent convolutional neural network for object recognition. In: IEEE Conference on Computer Vision and Pattern Recognition; 2015 Jun 07-12; Boston, USA. IEEE; 2015. pp. 3367-75.

34. Donahue J, Hendricks LA, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description. In: IEEE Conference on Computer Vision and Pattern Recognition; 2015 Jun 07-12; Boston, USA. IEEE; 2015. pp. 2625-34.

35. Wang J, Yang Y, Mao J, Huang Z, Huang C, Xu W. CNN-RNN: A unified framework for multi-label image classification. In: IEEE conference on Computer Vision and Pattern Recognition; 2016 Jun 27-30; Las Vegas, USA. IEEE; 2016. pp. 2285-94.

36. Jeong JH, Yu BW, Lee DH, Lee SW. Classification of drowsiness levels based on a deep spatio-temporal convolutional bidirectional LSTM network using electroencephalography signals. Brain Sci 2019;9:348.

37. Tan H, Bansal M. LXMERT: learning cross-modality encoder representations from transformers. arXiv 2019; In press.

38. Li W, Gao C, Niu G, et al. UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv 2020; In press.

39. Zeng Y, Zhang X, Li H. Multi-grained vision language pre-training: aligning texts with visual concepts. arXiv 2021; In press.

40. Huang Z, Zeng Z, Huang Y, et al. Seeing out of the box: end-to-end pre-training for vision-language representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021. pp. 12976-85. Available from: https://openaccess.thecvf.com/content/CVPR2021/papers/Huang_Seeing_Out_of_the_Box_End-to-End_Pre-Training_for_Vision-Language_Representation_CVPR_2021_paper.pdf. [Last accessed on 20 Nov 2023].

41. Kim W, Son B, Kim I. ViLT: vision-and-language transformer without convolution or region supervision. In: Proceedings of the 38th International Conference on Machine Learning. PMLR; 2021. pp. 5583-94. Available from: https://proceedings.mlr.press/v139/kim21k.html. [Last accessed on 20 Nov 2023].

42. Zhang Z, Lan C, Zeng W, Chen Z. Multi-granularity reference-aided attentive feature aggregation for video-based person re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. pp. 10407-16. Available from: https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhang_Multi-Granularity_Reference-Aided_Attentive_Feature_Aggregation_for_Video-Based_Person_Re-Identification_CVPR_2020_paper.pdf. [Last accessed on 20 Nov 2023].

43. Li Q, Qiu Z, Yao T, Mei T, Rui Y, Luo J. Action recognition by learning deep multi-granular spatio-temporal video representation. In: Proceedings of the ACM on International Conference on Multimedia Retrieval; 2016. pp. 159-66.

44. Chen D, Cao X, Wen F, Sun J. Blessing of dimensionality: high-dimensional feature and its efficient compression for face verification. In: IEEE Conference on Computer Vision and Pattern Recognition; 2013 Jun 23-28; Portland, USA. IEEE; 2013. pp. 3025-32.

45. Wang D, Shen Z, Shao J, Zhang W, Xue X, Zhang Z. Multiple granularity descriptors for fine-grained categorization. In: IEEE International Conference on Computer Vision; 2015 Dec 07-13; Santiago, Chile. IEEE; 2015. pp. 2399-406.

46. Huang R, Wang Y, Li Z, Lei Z, Xu Y. RF-DCM: multi-granularity deep convolutional model based on feature recalibration and fusion for driver fatigue detection. IEEE Trans Intell Transp Syst 2020;23:630-40.

47. Weng CH, Lai YH, Lai SH. Driver drowsiness detection via a hierarchical temporal deep belief network. In: Asian Conference on Computer Vision. Springer; 2016. pp. 117-33.

48. Ren S, Cao X, Wei Y, Sun J. Face alignment at 3000 FPS via regressing local binary features. In: IEEE Conference on Computer Vision and Pattern Recognition; 2014 Jun 23-28; Columbus, USA. IEEE; 2014. pp. 1685-92.

49. Danelljan M, Häger G, Khan F, Felsberg M. Accurate scale estimation for robust visual tracking. In: Proceedings of the British Machine Vision Conference. BMVA Press; 2014. pp. 1-12. Available from: https://www.diva-portal.org/smash/get/diva2:785778/FULLTEXT01.pdf. [Last accessed on 20 Nov 2023].

50. Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J. LSTM: a search space odyssey. IEEE Trans Neural Netw Learn Syst 2017;28:2222-32.

51. Khare SK, Bajaj V, Acharya UR. SchizoNET: a robust and accurate Margenau-Hill time-frequency distribution based deep neural network model for schizophrenia detection using EEG signals. Physiol Meas 2023;44:035005.

52. Yu J, Park S, Lee S, Jeon M. Driver drowsiness detection using condition-adaptive representation learning framework. IEEE T Intell Transp 2018;20:4206-18.

53. Wang C, Yan T, Jia H. Spatial-temporal feature representation learning for facial fatigue detection. Int J Pattern Recogn Artif Intell 2018;32:1856018.

54. Khare SK, Blanes-Vidal V, Nadimi ES, Acharya UR. Emotion recognition and artificial intelligence: a systematic review (2014-2023) and research recommendations. Inform Fusion 2024;102:102019.


About This Article

© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
