A Hybrid Deep Learning Framework for Emotion Recognition in Children with Autism During NAO Robot-Mediated Interaction

Indranil Bhattacharjee¹, Vartika Narayani Srinet², Anirudha Bhattacharjee², Braj Bhushan², Bishakh Bhattacharya²,*
¹Department of Information Technology, School of Engineering, Cochin University of Science and Technology, Kochi, Kerala, India
²Indian Institute of Technology Kanpur, Uttar Pradesh, India
Emails: indranil@ug.cusat.ac.in, vartikana23@iitk.ac.in, anirub@iitk.ac.in, brajb@iitk.ac.in, *bishakh@iitk.ac.in

Understanding emotional responses in children with Autism Spectrum Disorder (ASD) during social interaction remains a critical challenge in both developmental psychology and human-robot interaction. This study presents a novel deep learning pipeline for emotion recognition in autistic children in response to a name-calling event by a humanoid robot (NAO), under controlled experimental settings. The dataset comprises around 50,000 facial frames extracted from video recordings of 15 children with ASD. A hybrid model combines a fine-tuned ResNet-50-based Convolutional Neural Network (CNN) and a three-layer Graph Convolutional Network (GCN), trained on visual features and on geometric features extracted from MediaPipe FaceMesh landmarks. Emotions were probabilistically labeled using a weighted ensemble of two models, DeepFace and FER, each contributing to soft-label generation across seven emotion classes. Final classification leveraged a fused embedding optimized via Kullback-Leibler divergence. The proposed method demonstrates robust performance in modeling subtle affective responses and offers significant promise for affective profiling of children with ASD in clinical and therapeutic human-robot interaction contexts, as the pipeline effectively captures micro emotional cues in neurodivergent children, addressing a major gap in autism-specific HRI research. This work represents the first such large-scale, real-world dataset and pipeline from India on autism-focused emotion analysis using social robotics, contributing an essential foundation for future personalized assistive technologies.

Autism, NAO, Child-Robot Interaction, Emotion analysis, ResNet-50, GCN, DeepFace, Mini-Xception, FER.


I. Introduction

The NAO humanoid robot, developed by SoftBank Robotics and standing 58 cm tall with 25 degrees of freedom, is widely utilized in educational and therapeutic environments due to its semi-anthropomorphic appearance and programmable capabilities. Globally, NAO has been applied in diverse contexts ranging from children’s education to autism interventions, yet its deployment in India remains sparse, which presents a significant opportunity for strengthening socio-cognitive support through technology-enhanced methods.

Amid growing concerns that excessive screen time and digital media consumption may impact children’s attentional capacities, there is increasing interest in robot-mediated interventions as proactive tools to foster engagement and learning. In children with Autism Spectrum Disorder (ASD), one of the hallmark early markers is a delayed or absent response to name-calling, a clinical indicator frequently used in diagnostic assessments. Evidence indicates that children with ASD demonstrate heightened responsiveness and engagement when interacting with robotic agents [], positioning Socially Assistive Robots (SARs) like NAO as promising platforms for eliciting measurable socio-behavioral responses.

While response-to-name (RTN) paradigms have been previously explored within ASD diagnostic protocols, their integration with robust, deep-learning-based emotion detection, especially detection that combines facial appearance with geometric landmark data, has not been fully realized. Conventional approaches tend to rely on either texture-based convolutional models, which may miss subtle expressions, or landmark sequences, which fail to account for global affective context, as discussed in [].

To address this limitation, we propose a novel hybrid CNN–GCN architecture, named Fusion-N, capable of extracting and fusing multi-scale emotional cues from both RGB imagery and facial landmarks simultaneously, as shown in Fig. 1. Our pipeline leverages ensemble-derived soft labels from the DeepFace and FER models, enabling probabilistic training that effectively models emotion ambiguity and anticipates ASD-specific expression patterns. We evaluated this approach on a dataset comprising almost 50,000 high-resolution frames obtained from 15 children with ASD during NAO-mediated RTN tasks and demonstrated its efficacy in accurately classifying nuanced emotion states, including fear and disgust, which are typically underrepresented in ASD datasets. This methodology contributes to the fields of affective computing, human-robot interaction, and computational neuro-psychology by introducing a multimodal framework for assessing emotion recognition in vulnerable developmental cohorts.

II. Related Work

Facial expression recognition (FER) has long been a cornerstone in affective computing and human-computer interaction. Among the most widely adopted face detection pipelines is the Multi-task Cascaded Convolutional Neural Network (MTCNN) framework by Zhang et al. [], which remains a benchmark for real-time face detection and alignment due to its efficiency in bounding-box regression and landmark localization.

Fig. 1. A focused top-level view of our multimodal pipeline structure. Fusion-N, the novel hybrid framework made using ResNet-50 and GCN.

For facial landmark extraction, Lugaresi et al. [] introduced MediaPipe FaceMesh, which provides dense 3D landmark detection (468 key points), forming a strong basis for extracting geometric and relational facial features and facilitating a more nuanced understanding of facial structure and microexpressions. Graph Convolutional Networks (GCNs) have become another cornerstone in modeling structured data by combining node features with graph topology. A seminal work by Kipf and Welling [] introduced the modern GCN architecture, which efficiently performs semi-supervised node classification via layer-wise propagation based on graph Laplacians.

To label emotional states, researchers have increasingly moved beyond single-label supervision to probabilistic soft labels that account for ambiguity and class overlap. The DeepFace library [], with robust backbones such as VGG-Face [] and FaceNet [], has been widely adopted for face recognition, especially in facial datasets characterized by real-world variability. Similarly, Mini-Xception architectures trained on FER2013 [] have demonstrated competitive performance with lower computational overhead, making them well suited to ensemble frameworks. These models are particularly helpful in analyzing common human expressions. A recent system, SENSES-ASD [], utilized Mini-Xception (trained on FER-2013) for facial emotion recognition in autistic adults and achieved a validation accuracy of approximately 60% []. The integration of DeepFace (Mini-Xception) and FER-based predictions through weighted averaging forms a non-obvious soft-label calibration method that is better suited to neurodivergent datasets, where emotional ambiguity is prevalent.

The increasing use of GCNs has also led to hybrid models that combine image-based CNN features with graph-based structural information. Bin Li and Lima [] implemented a ResNet-50-based architecture for facial expression recognition, showcasing its robustness across benchmark datasets. Our model, Fusion-N, integrates a ResNet-50 variant for global semantic extraction and a topology-aware GCN over facial landmarks to generate spatial embeddings. This hybrid architecture demonstrates higher accuracy and better generalization, especially when analyzing subtle or masked emotions such as fear or disgust, which are often underrepresented and harder to detect.

While many studies have focused on emotion recognition in typical populations, relatively fewer have addressed the unique challenges posed by children with ASD. [] underscored the importance of developing systems that can support or augment emotion recognition capabilities. The role of assistive technologies, particularly humanoid robots such as NAO, has grown significantly in autism research. Robins et al. [] were among the first to demonstrate the potential of robots in engaging children with ASD through structured interactions. Rudovic et al. [] expanded this domain by introducing personalized machine learning algorithms that enabled robots to adapt to individual emotional patterns in children with ASD.

Studies show that NAO robot interventions have the potential to enhance emotional expressiveness and social engagement in children with ASD significantly. Robot therapy promotes communication in minimally verbal children, increases social engagement with imitation activities, and stimulates better classroom participation compared to normal settings []. This is particularly significant in name-calling tests, in which a child’s reaction to their own name offers an insight into social awareness, attention, and affective states, all of which are significant diagnostic indicators in early diagnosis of autism. Costescu et al. [] similarly proved that children with ASD were more socially responsive when the NAO robot was engaged in imitative play and joint-attention exercises. These results strongly advocate for combining NAO-based interaction paradigms with computationally sophisticated emotion-analysis pipelines through the combination of soft-label supervision, dense facial-geometry modeling, and robot-mediated data collection. Such an integration provides a solid framework to study affective behavior in autistic children in ethically approved, ecologically valid experimental environments.

Fig. 2. This figure illustrates the setup of an autistic child engaging in free play in an unbiased environment with NAO and a facilitator seated nearby.

TABLE I: Data Specifications
Parameter	Value
Subjects	15 children with ASD
Videos	15 (1 per child)
Duration	3–5 minutes per child
Name Called	12 times (randomly spread)
FPS for Processing	15
Frames Extracted	48,891
Label Distribution	Balanced across 7 emotions

III. Methodology

The proposed emotion recognition pipeline for autistic children is a modular, multi-stage architecture designed to capture and interpret subtle affective cues from video data. The control flow of the pipeline is displayed in Fig. 3. The stages of the pipeline are as follows:

Fig. 3. Flowchart of the facial emotion recognition pipeline. The process begins with dataset creation through video collection, followed by face detection. Detected faces are validated, aligned, and then passed to the facial landmark extraction module. These features, along with the cropped face images, are fed into our novel hybrid model (Fusion-N) to generate emotion probability predictions.

III-A. Experimental Data Acquisition

After approval from the Institutional Ethics Committee of the Indian Institute of Technology, Kanpur and the center head, and with consent from the parents, the psychological analysis reports of the children were obtained to finalize our selection criteria: children on the mild-to-moderate autism spectrum, aged 6 to 10 years.

Sessions were conducted in a carefully curated environment to ensure the child’s comfort, with a trusted psychologist present and strict confidentiality maintained throughout.

The child participated in a semi-structured interaction session for a duration of 3–5 minutes in a known and relaxed environment, with provision of toys and play materials to minimize stress and improve ecological validity. In this free-play setting, the NAO robot performed a pre-programmed name-calling procedure, uttering each child’s name 12 times in random temporal order. The experimental configuration is shown in Fig. 2, and dataset information is given in Table I.

III-B. Face Extraction

Face detection is performed using the Multi-task Cascaded Convolutional Neural Network (MTCNN), which jointly handles face localization and bounding-box regression. To ensure clean inputs, frames are filtered for blur and validity, followed by secondary verification using Dlib’s CNN/HOG detector (results were the same in both cases) via face_recognition.face_locations, discussed by [] to reduce false positives. To address MTCNN’s over-cropping, temporary dynamic padding is applied during validation, though only unpadded images are retained for downstream processing. Verified bounding boxes are used to extract 468 3D facial landmarks via MediaPipe Face Mesh, capturing dense anatomical regions (e.g., brows, lips, jawline). Landmarks are normalized using min-max scaling relative to the nose tip for scale, rotation, and translation invariance. The resulting data is exported in CSV format for graph-based modeling.
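The landmark normalization step can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that "relative to the nose tip" means translating the landmarks so the nose tip becomes the origin and then dividing each axis by its range. The function name `normalize_landmarks` and the nose-tip index `NOSE_TIP_INDEX = 1` are assumptions made for the example.

```python
import numpy as np

NOSE_TIP_INDEX = 1  # assumption: position of the nose tip in the 468-point mesh


def normalize_landmarks(landmarks, anchor=NOSE_TIP_INDEX):
    """Centre landmarks on the anchor point and min-max scale each axis."""
    pts = np.asarray(landmarks, dtype=np.float64)
    centred = pts - pts[anchor]                        # nose tip becomes the origin
    span = centred.max(axis=0) - centred.min(axis=0)   # per-axis extent of the face
    span = np.where(span > 0, span, 1.0)               # guard against a degenerate axis
    return centred / span


# Synthetic stand-in for one face: 468 landmarks with (x, y, z) coordinates.
rng = np.random.default_rng(0)
face = rng.uniform(0.0, 1.0, size=(468, 3))
norm = normalize_landmarks(face)

# The same face, uniformly rescaled and shifted, maps to the same output.
shifted = normalize_landmarks(face * 2.5 + 10.0)
```

As written, the sketch yields translation and uniform-scale invariance only; any rotation handling would need an additional alignment step that is not shown here.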

III-C. Probabilistic Soft Label Generation

To accommodate the ambiguity of expressions common in ASD, we employed a soft-labeling mechanism using ensemble fusion. Emotion probabilities are computed by aggregating predictions from two independently trained models:

• DeepFace: A Mini-Xception model trained on FER-2013 [], providing semantic emotion embeddings.

• FER: A custom CNN-based model by Shenk [], also trained on FER-2013, outputting 7-class softmax distributions.

The final distribution $y_{\text{final}} \in \mathbb{R}^{7}$ is obtained as a weighted average:

$$y_{\text{final}} = \tfrac{1}{3} \cdot y_{\text{DeepFace}} + \tfrac{2}{3} \cdot y_{\text{FER}}$$

FER is trained and tested more extensively on low-quality images. During our validation tests, FER consistently produced lower error rates than DeepFace in low-resolution scenarios []. Assigning a greater weight to FER therefore enhances overall prediction quality, because the ensemble relies more on the model that performs better under the real conditions of our data (Table IV). Both models are trained on tightly cropped, aligned face images from FER-2013. Although they include their own detectors, we supplied preprocessed face crops to minimize issues such as failed detection, incorrect scale, or orientation, thereby improving prediction robustness. This ensemble strategy mitigates model-specific bias and enhances reliability across diverse visual inputs, as demonstrated in Table II. The full soft-labeling workflow is illustrated in Fig. 5.

TABLE II: Comparison of Emotion Detection Models and Fusion Strategy Used in the Proposed Pipeline
Model Source	Architecture	Output Type	Fusion Weight	Rationale
DeepFace	Mini-Xception	7-class probability distribution	$1/3$	Lightweight CNN pretrained on FER-2013, efficient for real-time inference
FER	Custom CNN (fer library)	7-class probability distribution	$2/3$	Accurate and fast, empirically better on subtle emotions
Ensemble Logic	Weighted average	Final 7-class soft probabilities	–	Reduces neutral bias using penalty regularization and sharpens predictions via temperature scaling

III-D. Hybrid CNN-GCN Classification (Fusion-N)

We introduced Fusion-N, a dual-branch architecture that jointly processes pixel-level and geometric information. A schematic diagram of Fusion-N is shown in Fig. 4.

Fig. 4. Simplified architecture of the proposed Fusion-N model. The network consists of two parallel branches: a CNN-based global feature extractor (left) that uses ResNet-50 with channel-wise attention to produce the global descriptor $F_{\text{CNN}} \in \mathbb{R}^{2048}$, and a GCN-based geometric branch (right) that encodes 3D facial landmarks into $F_{\text{GCN}}$ via a stack of GCN layers and mean pooling. The two feature streams are fused via simple concatenation after intra-branch attention refinement, resulting in the final representation $F_{\text{fused}} \in \mathbb{R}^{2176}$.

Fig. 5. Segmented architecture of the pipeline, illustrating the phases of face detection using MTCNN, face validation via face_recognition, landmark extraction using MediaPipe FaceMesh, and the creation of soft labels for training the Fusion-N model.

III-D1. CNN Branch

Aligned RGB face images of size $224 \times 224 \times 3$ are passed through a ResNet-50 backbone, with the first 44 parameter tensors frozen and the rest fine-tuned. The output feature vector $f_{\text{img}} \in \mathbb{R}^{2048}$ captures global semantic information and is refined by an attention module.

III-D2. GCN Branch

Facial graphs are constructed from 468 landmarks with edges defined by facial geometry (jawline, eyebrows, eyes, mouth). A 3-layer Graph Convolutional Network (GCN) extracts relational features, and the pooled 128-dimensional embedding $f_{\text{geom}} \in \mathbb{R}^{128}$ is further refined with attention.

III-D3. Fusion and Classification

The concatenated feature vector $f_{\text{joint}} = [f_{\text{img}} \,\|\, f_{\text{geom}}] \in \mathbb{R}^{2176}$ is passed through a series of dense layers with dropout and LayerNorm. Emotion class probabilities are predicted using a softmax layer.

III-D4. Loss Function

Model training minimizes the KL divergence between predicted scores $s_{\theta}$ and calibrated targets $\tilde{y}$:

$$\mathcal{L}_{\text{KL}} = \sum_{i} \tilde{y}_{i} \log\left(\frac{\tilde{y}_{i}}{s_{\theta,i}}\right) \tag{1}$$

where $i \in \{1, \dots, C\}$ indexes emotion classes.
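As a worked illustration of Eq. (1), the loss for a single sample can be computed directly from its definition. The sketch below is a plain-Python reference, not the training code; the function name `kl_divergence`, the clamping constant `eps`, and the example distributions are assumptions introduced here.

```python
import math


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions over C classes."""
    total = 0.0
    for y, s in zip(target, predicted):
        if y > 0.0:  # terms with a zero target probability contribute nothing
            total += y * math.log(y / max(s, eps))  # clamp s to avoid division by zero
    return total


# Hypothetical 7-class soft label and two candidate predictions.
soft_label = [0.05, 0.05, 0.10, 0.50, 0.10, 0.10, 0.10]
uniform = [1.0 / 7] * 7

loss_perfect = kl_divergence(soft_label, soft_label)  # prediction equals the target
loss_uniform = kl_divergence(soft_label, uniform)     # uninformative prediction
```

The loss vanishes when the prediction matches the soft label exactly and grows as the predicted distribution moves away from it, which is the property that makes it suitable for training against probabilistic targets.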

III-E. Framework Used

Face detection and pre-processing were performed using MTCNN, followed by validation through the face_recognition library from DLib[]. Quality control was implemented using Laplacian variance thresholding to remove blurry frames. Geometric normalization was applied to ensure alignment consistency.

For pose-invariant facial landmark extraction, we utilized the Face Mesh solution provided by MediaPipe [] . The 3D coordinates were normalized prior to further processing.

To generate soft emotion labels, the DeepFace [] and FER [] libraries were employed. These outputs were used in conjunction with the PyTorch Dataset API to structure a triplet input pipeline consisting of face images, landmarks, and corresponding soft labels.

IV. Optimization and Training Framework

Training is done with the AdamW optimizer [], using discriminative learning rates of $3 \times 10^{-6}$ and $1 \times 10^{-5}$ for the pretrained CNN backbone and classifier head, respectively, with a global $L_2$ weight decay of $5 \times 10^{-4}$ to prevent overfitting []. The main criterion is the label-smoothed KL divergence (smoothing factor of $0.1$), ensuring robust learning with softened target distributions. Training stability is maintained through gradient clipping (L2 norm limit of $1.0$), while effective exploration of the loss landscape is facilitated by a cosine annealing learning rate schedule with warm restarts ($T_0 = 10$, $T_m = 2$, $\eta_{\min} = 1 \times 10^{-5}$). The evaluation metrics include per-class precision, recall, F1 score, and overall accuracy, following recommended practices for balanced and robust evaluation, especially in the presence of minority classes [].
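The shape of a cosine annealing schedule with warm restarts can be sketched in a few lines. The example below follows the standard SGDR formulation with a first cycle of 10 epochs and a cycle-length multiplier of 2, as quoted above. The function name `sgdr_lr` and the base learning rate of $10^{-3}$ are illustrative assumptions chosen to make the curve visible; they are not the values used in training.

```python
import math


def sgdr_lr(epoch, base_lr, eta_min, t_0=10, t_mult=2):
    """Learning rate at an integer epoch under cosine annealing with warm restarts."""
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:   # skip over completed cycles
        t_cur -= t_i
        t_i *= t_mult     # each cycle is t_mult times longer than the previous one
    cosine = 1.0 + math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (base_lr - eta_min) * cosine


# Illustrative values only: base_lr is an assumption made for this sketch.
schedule = [sgdr_lr(e, base_lr=1e-3, eta_min=1e-5) for e in range(70)]
```

With these settings the rate restarts at epochs 10 and 30, so the cycles span 10, 20, and 40 epochs; within each cycle it decays smoothly from the base rate towards the minimum.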

V. Techniques Involved

This section presents a detailed computational framework for multimodal emotion recognition specifically designed for subjects with Autism Spectrum Disorder (ASD).

V-A. Hierarchical Facial Region-of-Interest Detection

To achieve precise anatomical localization of facial regions, we implemented a dual-step face verification strategy. Initially, the Multi-task Cascaded Convolutional Network (MTCNN) was employed as a preliminary detector to localize potential facial regions.

To ensure high-quality face inputs, all images were first filtered for blur (Laplacian threshold = 25) and low-confidence detections (MTCNN score $< 70\%$). A secondary validation using Dlib’s face_recognition (CNN/HOG) filtered out non-facial or corrupted frames; both backends yielded comparable results, with only clean, centered faces retained. Faces smaller than $30 \times 30$ pixels were discarded, and accepted crops were resized to $224 \times 224$.
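The three quality gates described above (minimum size, detector confidence, and blur) can be sketched as a single predicate. This is a NumPy-only illustration under stated assumptions: the Laplacian is approximated with the 4-neighbour kernel over interior pixels, frames whose Laplacian variance falls below the threshold are treated as blurry, and the names `laplacian_variance` and `keep_face` are hypothetical. The thresholds are the ones quoted in the text.

```python
import numpy as np

BLUR_THRESHOLD = 25.0   # Laplacian-variance cut-off quoted in the text
MIN_CONFIDENCE = 0.70   # minimum detector score quoted in the text
MIN_SIDE = 30           # minimum face width and height in pixels


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels."""
    g = np.asarray(gray, dtype=np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def keep_face(gray_crop, confidence):
    """Return True if a grayscale face crop passes size, confidence, and blur checks."""
    height, width = gray_crop.shape[:2]
    if height < MIN_SIDE or width < MIN_SIDE:
        return False
    if confidence < MIN_CONFIDENCE:
        return False
    return laplacian_variance(gray_crop) >= BLUR_THRESHOLD


# Synthetic crops: high-frequency noise (sharp) versus a constant patch (no edges).
rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
flat = np.full((64, 64), 128.0)
```

Ordering the cheap size and confidence checks before the Laplacian keeps the per-frame cost low when most rejections are due to small or low-confidence detections.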

To correct MTCNN’s tight cropping, temporary padding was applied during verification (not saved), preserving undistorted facial features. Final verified crops were aligned using reused MTCNN boxes and forwarded for landmark detection. Later, MediaPipe Face Mesh extracted 468 normalized 3D landmarks per face, enabling pose-invariant, topology-aware CSV features for robust graph modeling of neurodivergent expressions.

V-B. Confidence-Calibrated Label Incorporation

Several interactive facial emotion recognition tools targeting autistic individuals have been proposed. For instance, Abu-Nowar et al. (2024) introduced SENSES-ASD, a web/mobile platform utilizing a compact Mini-Xception CNN (approximately 60K parameters) trained on FER-2013 (35,887 grayscale images across seven emotions). The system initially achieved 60% validation accuracy, which improved to 66% after tuning, with training accuracy reaching 71% []. To account for the semantic ambiguity and inter-class overlap prevalent in ASD expression datasets, we proposed a novel confidence-aware soft-labeling mechanism based on ensemble modeling. This approach jointly leverages the high representational capacity of DeepFace (Mini-Xception) and the robustness of the FER network.

Dual-Model Ensemble

DeepFace Backbone

We used the Mini-Xception model from DeepFace [], a lightweight CNN trained on FER-2013, producing softmax outputs $p_{\text{DF}} \in \Delta^{C}$ across $C = 7$ emotion classes. These predictions contribute to our ensemble fusion strategy. Despite its efficiency, Mini-Xception has shown performance comparable to human-level accuracy on benchmark datasets.

FER Supplement

To enhance robustness against occlusions and low-resolution inputs, we incorporate a parallel FER branch (Shenk []) via the fer library. It outputs $p_{\text{FER}} \in \Delta^{C}$ and is also trained on FER-2013, but uses a deeper CNN than Mini-Xception.

Weighted Fusion

The final ensemble prediction is computed as:

$$p_{\text{ens}} = \tfrac{2}{3} \cdot p_{\text{FER}} + \tfrac{1}{3} \cdot p_{\text{DF}} \tag{2}$$

Emotion classifiers often over-predict the neutral class. To mitigate this bias, we apply a multiplicative penalty:

$$\tilde{p}_{\text{neutral}} = \gamma \cdot p_{\text{fuse},\text{neutral}}, \quad \gamma = 0.7, \tag{3}$$

where $p_{\text{fuse}}$ denotes the fused distribution over emotion classes (the ensemble output $p_{\text{ens}}$ of Eq. (2)) and $\gamma$ is a clinically validated scaling factor. The adjusted vector $\tilde{p}$ is re-normalized to ensure a valid probability distribution:

$$\hat{p} = \mathrm{softmax}(\tilde{p}). \tag{4}$$

Here, $\hat{p}$ represents the probability distribution across emotion classes after neutral adjustment.

Temperature scaling ($T = 0.7$) is applied via np.power(final_vector, 1.0/T) followed by normalization, enhancing distribution sharpness. This fusion balances speed and sensitivity. Mini-Xception favors real-time applications, while FER shows improved response to subtle expressions.
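The calibration steps of this subsection (weighted fusion, neutral penalty, re-normalization, and temperature sharpening) can be chained as in the sketch below. It follows the order in which the steps are described and is not the authors' implementation. The class ordering in `EMOTIONS`, the function name `calibrate_soft_label`, and the example input distributions are assumptions made for illustration.

```python
import numpy as np

# Assumed class order; the actual index of "neutral" depends on the label map used.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]
NEUTRAL = EMOTIONS.index("neutral")


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


def calibrate_soft_label(p_df, p_fer, gamma=0.7, temperature=0.7):
    """Fuse two 7-class distributions and apply the neutral penalty and sharpening."""
    p_df = np.asarray(p_df, dtype=np.float64)
    p_fer = np.asarray(p_fer, dtype=np.float64)
    fused = (2.0 / 3.0) * p_fer + (1.0 / 3.0) * p_df   # Eq. (2): weighted fusion
    penalised = fused.copy()
    penalised[NEUTRAL] *= gamma                        # Eq. (3): damp the neutral class
    renormalised = softmax(penalised)                  # Eq. (4): valid distribution again
    sharpened = np.power(renormalised, 1.0 / temperature)  # temperature scaling
    return sharpened / sharpened.sum()


# Hypothetical model outputs for one frame.
p_df = [0.05, 0.02, 0.08, 0.30, 0.10, 0.05, 0.40]
p_fer = [0.04, 0.01, 0.10, 0.45, 0.05, 0.05, 0.30]
label = calibrate_soft_label(p_df, p_fer)
unpenalised = calibrate_soft_label(p_df, p_fer, gamma=1.0)
```

Comparing `label` with `unpenalised` shows the effect of the penalty in isolation: the neutral share drops while the ranking of the remaining classes is unchanged.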

V-C. Primary Model Architecture: Fusion-N

We introduced Fusion-N, a hybrid deep neural network combining a Convolutional Neural Network (a fine-tuned ResNet-50) and a Graph Convolutional Network (GCN) to integrate global appearance features and localized relational (landmark) geometry. The architecture of Fusion-N is shown in Fig. 6.

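a. Attention on CNN feature vector

$$F_{\text{CNN}}^{\text{attn}} = A_{\text{CNN}} \odot F_{\text{CNN}} \tag{5}$$

where $\odot$ denotes the element-wise (Hadamard) product [], $A_{\text{CNN}}$ is the attention weight vector, and $F_{\text{CNN}}^{\text{attn}}$ is the refined CNN feature vector used downstream.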

b. Aggregated GCN Features

$$F_{\text{GCN}} = \frac{1}{N} \sum_{i=1}^{N} H_{i}^{(3)} \tag{6}$$

$F_{\text{GCN}}$ denotes the aggregated node representation after three GCN layers, $H_{i}^{(3)}$ is the output node feature from the third GCN layer for the $i^{\text{th}}$ node, and $N$ represents the number of nodes (i.e., facial landmarks). The term $\frac{1}{N} \sum_{i=1}^{N} H_{i}^{(3)}$ is the mean of the output features from all nodes in the third GCN layer.

This step summarizes the GCN features: mean pooling over the landmark node embeddings from the third GCN layer yields a single global feature vector per face.

c. Feature Fusion

$$F_{\text{fused}} = [F_{\text{CNN}}^{\text{attn}} \,\|\, F_{\text{GCN}}] \tag{7}$$

where $F_{\text{fused}}$ is the final fused feature representation obtained by concatenating $F_{\text{CNN}}^{\text{attn}}$ (attention-weighted CNN features) and $F_{\text{GCN}}$ (aggregated GCN features), with $[\cdot \,\|\, \cdot]$ denoting the concatenation operator.

This equation explains the concatenation of the features extracted from CNN (with channel-wise attention) and GCN to form a unified representation that combines both appearance and geometric information, and this fused vector is forwarded to the classification head.

V-C1. CNN-Based Global Feature Extraction

We leverage a pre-trained ResNet-50 backbone, which extracts high-level features from facial images and incorporates residual learning through skip connections. We used the standard ResNet-50 architecture [], comprising four residual stages with bottleneck blocks. The original ResNet-50 uses Batch Normalization, ReLU activations, and identity skip connections within its residual blocks to facilitate residual learning. In our architecture, we additionally apply a Layer Normalization step after the attention module to stabilize the reweighted feature distribution before fusion with the GCN branch. The final FC layer is removed, and the rest of the network is retained up to the Global Average Pooling (GAP) layer. This transforms ResNet-50 into a strict feature extractor, with the GAP layer producing a 2048-dimensional feature vector for each input image.

We adopt partial fine-tuning: the first 44 parameter tensors are frozen while the remaining tensors are fine-tuned, which enables learning of domain-specific features relevant to autism-oriented emotion data.
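The freezing rule can be illustrated without any deep learning framework. In the sketch below, `Param` is a hypothetical stand-in for a parameter tensor that carries only a `requires_grad` flag, `freeze_first_n` is a helper name introduced here, and the total of 160 tensors is an arbitrary illustrative count rather than a property of ResNet-50. With PyTorch, the same loop could be run over `model.parameters()`.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Stand-in for a parameter tensor (hypothetical; only the flag matters here)."""
    name: str
    requires_grad: bool = True


def freeze_first_n(params, n=44):
    """Disable gradients for the first n tensors; return (frozen, trainable) counts."""
    params = list(params)
    for index, param in enumerate(params):
        param.requires_grad = index >= n   # tensors 0..n-1 are frozen
    frozen = sum(not p.requires_grad for p in params)
    return frozen, len(params) - frozen


# Illustrative parameter list; the count of 160 is an assumption for the example.
params = [Param(f"tensor_{i}") for i in range(160)]
counts = freeze_first_n(params, n=44)
```

Freezing by position in the parameter list keeps the earliest, most generic filters fixed while leaving the deeper, more task-specific tensors free to adapt.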

To further enhance the discriminative capacity of the extracted features, a lightweight attention module is appended after ResNet-50. This module comprises two fully connected layers with ReLU and Sigmoid activations. The resulting output is a learned attention weight vector that reweights the 2048-dimensional features, emphasizing the most informative components.

The feature vector $F_{\text{CNN}} \in \mathbb{R}^{2048}$ is refined using an attention module applied on the feature vector:

$$A_{\text{CNN}} = \sigma\left(W_{2} \cdot \mathrm{ReLU}(W_{1} \cdot F_{\text{CNN}})\right) \tag{8}$$

Here, $F_{\text{CNN}}$ is the 2048-dimensional raw feature vector from the last ResNet layer, $W_{1}$ and $W_{2}$ are learned fully-connected weight matrices, ReLU is the rectified linear activation, $\sigma$ is the element-wise sigmoid function (squeezing values to $[0, 1]$), and $A_{\text{CNN}}$ is the attention weight vector (the same size as $F_{\text{CNN}}$).
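Equations (5) and (8) together amount to a small gating network over the 2048-dimensional descriptor. The NumPy sketch below shows the forward pass only, with randomly initialized weights standing in for learned ones. The bottleneck width `HIDDEN = 128` and the function name `channel_attention` are assumptions, since the hidden size of the attention module is not stated in the text, and bias terms are omitted to match Eq. (8).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(f_cnn, w1, w2):
    """Return the attention weights A_CNN and the reweighted feature vector."""
    hidden = np.maximum(w1 @ f_cnn, 0.0)   # ReLU(W1 · F_CNN)
    a_cnn = sigmoid(w2 @ hidden)           # Eq. (8): weights in (0, 1)
    return a_cnn, a_cnn * f_cnn            # Eq. (5): element-wise reweighting


rng = np.random.default_rng(0)
D, HIDDEN = 2048, 128                      # HIDDEN is an assumed bottleneck width
f_cnn = rng.standard_normal(D)             # stand-in for the ResNet-50 GAP output
w1 = rng.standard_normal((HIDDEN, D)) * 0.01
w2 = rng.standard_normal((D, HIDDEN)) * 0.01
a_cnn, f_attn = channel_attention(f_cnn, w1, w2)
```

Because the sigmoid bounds every weight between 0 and 1, the module can only attenuate feature components relative to one another; it never changes their sign.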

V-C2. GCN-Based Landmark Encoding

We represent each face as a fixed-topology graph $G = (V, E)$ where $|V| = 468$, and edges are manually constructed based on facial geometry (jawline, eyebrows, eyes, and mouth), partially following the MediaPipe topology (i.e., edge-index). A 3-layer GCN computes node embeddings:

$$H^{(\ell+1)} = \mathrm{ReLU}\left(\mathrm{GCNConv}(H^{(\ell)}, E)\right), \quad H^{(0)} = X \tag{9}$$

Here, $H^{(\ell)}$ is the node-feature matrix output by layer $\ell$, $E$ represents the graph’s edge list or adjacency matrix, and the GCNConv operator, originating from Kipf and Welling’s seminal GCN model [] and implemented in PyTorch Geometric [], performs the graph convolution. $X$ is the initial $468 \times 3$ matrix of landmark coordinates. ReLU activation is applied in the first two GCN layers, while the third produces the final 128-D embeddings.

Fig. 6. Architecture of the proposed Fusion-N model for facial emotion recognition. The framework comprises two branches: (i) a global feature extractor using a pre-trained ResNet-50 with an attention module applied on the 2048-D feature vector ($F_{\text{CNN}}$), and (ii) a geometric branch processing 3D facial landmarks through stacked GCN layers with mean pooling, followed by an attention module to refine the global landmark embedding ($F_{\text{GCN}}$). The features are fused via concatenation, forming a joint descriptor passed through fully connected layers with layer normalization, ReLU activation, and dropout. The final dense layer outputs emotion class probabilities using softmax activation.

Stacking the 3 GCN layers enables each landmark to gather information from its neighbors and neighbors-of-neighbors. A try-except block is implemented to handle cases where the GCN fails. In such cases, a zero vector of dimension-128 is filled in to maintain consistency.

Mean pooling followed by attention refinement yields:

$$F_{\text{GCN}} = \mathrm{Attn}\left(\frac{1}{N} \sum_{i=1}^{N} H_{i}^{(3)}\right) \tag{10}$$

Here, $H_{i}^{(3)}$ denotes the 128-D embedding of landmark $i$ after three GCN layers, $\mathrm{Attn}(\cdot)$ is a small fully-connected attention module applied on the pooled global embedding, and $N$ is the total number of landmarks (468). Layer Normalization is applied prior to fusion.
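The landmark branch can be sketched in NumPy using the propagation rule of Kipf and Welling, $\hat{A} = D^{-1/2}(A + I)D^{-1/2}$ followed by $\hat{A} H W$ per layer. This is a dense-matrix illustration and not the PyTorch Geometric implementation: the chain-shaped edge list, the hidden widths (64, 128), the random weights, and the names `normalized_adjacency` and `gcn_embed` are all assumptions, and the attention refinement of Eq. (10) is omitted. The zero-vector fallback mirrors the try-except behaviour described above.

```python
import numpy as np


def normalized_adjacency(edges, num_nodes):
    """Symmetrically normalized adjacency matrix with self-loops."""
    a = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0            # undirected edge
    a += np.eye(num_nodes)                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_embed(x, edges, weights, out_dim=128):
    """Stacked GCN layers (ReLU on all but the last) followed by mean pooling."""
    try:
        a_hat = normalized_adjacency(edges, x.shape[0])
        h = x
        for layer, w in enumerate(weights):
            h = a_hat @ h @ w              # one graph convolution
            if layer < len(weights) - 1:
                h = np.maximum(h, 0.0)     # ReLU after the first two layers
        return h.mean(axis=0)              # Eq. (6): mean over landmark nodes
    except Exception:
        return np.zeros(out_dim)           # fallback: zero vector keeps shapes consistent


rng = np.random.default_rng(0)
N = 468
x = rng.uniform(size=(N, 3))                              # synthetic 3D landmarks
edges = [(i, i + 1) for i in range(N - 1)]                # hypothetical chain, not the face mesh
shapes = [(3, 64), (64, 128), (128, 128)]                 # assumed layer widths
weights = [rng.standard_normal(s) * 0.1 for s in shapes]
f_gcn = gcn_embed(x, edges, weights)
fallback = gcn_embed(x, [(0, N + 5)], weights)            # invalid edge triggers the fallback
```

With three stacked layers, each node's embedding depends on landmarks up to three hops away in the graph, which is what lets local geometry around the eyes or mouth inform the pooled descriptor.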

V-C3. Feature Fusion and Classification

The attention-refined CNN features and the GCN features are concatenated, and the fused representation $[F_{\text{CNN}}^{\text{attn}} \,\|\, F_{\text{GCN}}] \in \mathbb{R}^{2176}$ is passed through the classification head, so that both branches contribute to the final prediction.

$$F_{\text{fused}} = [F_{\text{CNN}}^{\text{attn}} \,\|\, F_{\text{GCN}}] \in \mathbb{R}^{2176} \tag{11}$$

$$h_{1} = \mathrm{ReLU}\left(\mathrm{LN}(W_{1} \cdot F_{\text{fused}})\right) \tag{12}$$

$$h_{2} = \mathrm{ReLU}\left(\mathrm{LN}(W_{2} \cdot h_{1})\right) \tag{13}$$

$$\hat{y} = \mathrm{Softmax}(W_{3} \cdot h_{2}) \tag{14}$$

Here, $W_1 \in \mathbb{R}^{512 \times 2176}$ and $W_2 \in \mathbb{R}^{256 \times 512}$ are learned weight matrices, $W_3 \in \mathbb{R}^{7 \times 256}$ is the final linear projection, and $h_1$ and $h_2$ are intermediate 512-dimensional and 256-dimensional hidden vectors, respectively. ReLU is the rectified linear activation function, LN denotes layer normalization as introduced by Ba et al. [], $F_{CNN}^{attn}$ is the 2048-dimensional attention-refined CNN feature vector, and $\hat{y}$ is the predicted probability vector over the seven emotion classes. Both CNN and GCN branches contribute complementary information to the fused representation. This process is illustrated in Fig. 7.

Fig. 7: The attention-refined CNN feature vector (2048-D) is concatenated with the pooled GCN embedding (128-D) to form a 2176-D fused representation. It is passed through a classification head containing two fully connected layers, each followed by layer normalization, ReLU activation, and dropout for regularization. The last dense layer projects to the number of emotion classes, generating logits, which are then transformed into predicted class probabilities with a softmax function. This fusion approach combines the global appearance features of the CNN with the localized geometric cues of the GCN for robust facial emotion recognition.

Inputs of Fusion-N:

1. Images of shape $[B, 3, H, W]$, where $B$ is the batch size, 3 refers to the RGB channels, and $H \times W$ is the spatial resolution.

2. Landmarks of shape $[B, 468, 3]$, where $B$ is the batch size, 468 is the number of landmarks (from MediaPipe Face Mesh), and 3 denotes the $(x, y, z)$ coordinates.

Output of Fusion-N: Logits of shape $[B, \mathrm{num\_classes}]$, i.e., raw scores before softmax.

Feature dimensions: The model computes a 2048-dimensional attention-refined CNN feature vector and a 128-dimensional GCN embedding. CNN and GCN features are concatenated, and the fused 2176-dimensional vector is passed through the classification head for final emotion prediction.

Algorithm: Classifier Head Pseudo-Algorithm

Require: $X \in \mathbb{R}^{B \times 2176}$, the fused feature matrix (batch size $B$); $W_1 \in \mathbb{R}^{2176 \times 512}$, $b_1 \in \mathbb{R}^{512}$; $W_2 \in \mathbb{R}^{512 \times 256}$, $b_2 \in \mathbb{R}^{256}$; $W_3 \in \mathbb{R}^{256 \times 7}$, $b_3 \in \mathbb{R}^{7}$
Ensure: $\mathrm{logits} \in \mathbb{R}^{B \times 7}$, the pre-softmax scores for each emotion class
for $i \leftarrow 1$ to $B$ do
    FC1: $Z_1 \leftarrow X[i]\, W_1 + b_1$
    LN1: $N_1 \leftarrow \mathrm{LayerNorm}(Z_1)$
    ReLU1: $A_1 \leftarrow \mathrm{ReLU}(N_1)$
    Drop1: $D_1 \leftarrow \mathrm{Dropout}(A_1, p = 0.325)$
    FC2: $Z_2 \leftarrow D_1 W_2 + b_2$
    LN2: $N_2 \leftarrow \mathrm{LayerNorm}(Z_2)$
    ReLU2: $A_2 \leftarrow \mathrm{ReLU}(N_2)$
    Drop2: $D_2 \leftarrow \mathrm{Dropout}(A_2, p = 0.275)$
    FC3: $\mathrm{logits}[i] \leftarrow D_2 W_3 + b_3$
end for
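The classifier head pseudo-algorithm can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the weights are random placeholders, layer normalization omits the learned scale and shift, the per-sample loop is vectorized over the batch, and the function names are hypothetical. The layer widths and dropout rates (0.325 and 0.275) are taken from the pseudo-algorithm.

```python
import numpy as np


def layer_norm(z, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned affine)."""
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)


def dropout(a, p, rng, training):
    """Inverted dropout; acts as the identity when training is False."""
    if not training or p == 0.0:
        return a
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)


def classifier_head(x, params, rng=None, training=False):
    """Map fused features [B, 2176] to logits [B, 7]."""
    if rng is None:
        rng = np.random.default_rng()
    w1, b1, w2, b2, w3, b3 = params
    # FC1 -> LN1 -> ReLU1 -> Drop1
    d1 = dropout(np.maximum(layer_norm(x @ w1 + b1), 0.0), 0.325, rng, training)
    # FC2 -> LN2 -> ReLU2 -> Drop2
    d2 = dropout(np.maximum(layer_norm(d1 @ w2 + b2), 0.0), 0.275, rng, training)
    # FC3: raw class scores
    return d2 @ w3 + b3


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
shapes = [(2176, 512), (512,), (512, 256), (256,), (256, 7), (7,)]
params = [rng.normal(scale=0.02, size=s) for s in shapes]  # placeholder weights

x = rng.normal(size=(4, 2176))        # a batch of four fused feature vectors
logits = classifier_head(x, params)   # inference mode: dropout disabled
probs = softmax(logits)               # per-class probabilities, rows sum to 1
train_logits = classifier_head(x, params, rng=rng, training=True)
```

Returning logits rather than probabilities keeps the head compatible with losses that apply the softmax (or log-softmax) internally.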

V-C4 Rationale for Hybridization

While CNNs excel at modeling texture and color, they fail to capture geometric expressiveness, especially in ambiguous or flattened affect. GCNs, while geometrically robust, miss texture semantics. Fusion-N effectively combines both modalities, enhancing generalizability and interpretability in real-world ASD settings.

TABLE III: Fusion-N Architecture Comparison
Characteristic	CNN	GCN
Input	RGB facial images	Facial landmarks as a graph
Backbone	Pre-trained ResNet-50	3-layer Graph Convolutional Network
Feature Representation	Deep feature representation ($F_{CNN}$)	Graph representation ($H^{(3)}$)
Attention Module	Channel-wise attention	Attention after mean-pooling
Output Dimension	$F_{CNN}^{attn} \in \mathbb{R}^{2048}$	$F_{GCN} \in \mathbb{R}^{128}$

VI Results

VI-A Performance Comparison with Prior Work

VI-A1 Soft Label Generation via Ensemble Prediction

To validate our ensemble-based emotion labeling framework for ASD contexts, we used an external dataset of autistic children curated by Dr. Fatma M. Talaat []. A representative subset of 100 images was selected to balance the emotion classes, match our cohort’s age range, and maximize ethnic diversity, reflecting the cross-cultural variance emphasized in [].

Each image was annotated by a licensed clinical psychologist. Images that were difficult to label unambiguously were removed to avoid confusion, leaving 61 images, which were compared against predictions from our ensemble fusion pipeline integrating multiple pre-trained models. The approach achieved 90.16% accuracy relative to expert labels, indicating high reliability and a reduction in the annotation burden typical of ASD datasets. Compared to DeepFace alone (67.07%), Mini-Xception (FER) alone (71.95%), and their average-fused variant (73.17%), our ensemble achieved higher accuracy, as shown in Table IV, supporting its robustness and suitability for real-world clinical deployment.

TABLE IV: Accuracy comparison of individual models and ensemble methods.
Model	Accuracy (%)
DeepFace only	67.07
Mini-Xception (FER)	71.95
Average Fusion (DF + FER)	73.17
Ensemble Method (Weighted Average)	90.16
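The weighted-average fusion behind the soft labels can be sketched as below. The weight value, the example scores, and the function names are hypothetical, since the ensemble weights are not reported in this section; the sketch only assumes that each model yields one non-negative score per emotion, which is rescaled to a probability distribution before blending.

```python
EMOTIONS = ["happy", "sad", "angry", "fear", "disgust", "surprise", "neutral"]


def normalize(scores):
    """Rescale non-negative per-emotion scores so that they sum to 1."""
    total = sum(scores[e] for e in EMOTIONS)
    if total <= 0:
        raise ValueError("scores must contain positive mass")
    return {e: scores[e] / total for e in EMOTIONS}


def weighted_soft_label(deepface_scores, fer_scores, w_deepface=0.5):
    """Blend two per-emotion distributions into a single soft label."""
    if not 0.0 <= w_deepface <= 1.0:
        raise ValueError("w_deepface must lie in [0, 1]")
    p_df = normalize(deepface_scores)
    p_fer = normalize(fer_scores)
    w_fer = 1.0 - w_deepface
    return {e: w_deepface * p_df[e] + w_fer * p_fer[e] for e in EMOTIONS}


# Hypothetical scores for one frame; the two models may use different scales.
deepface_scores = {"happy": 60, "sad": 7, "angry": 5, "fear": 2,
                   "disgust": 1, "surprise": 5, "neutral": 20}
fer_scores = {"happy": 0.40, "sad": 0.05, "angry": 0.05, "fear": 0.05,
              "disgust": 0.0, "surprise": 0.05, "neutral": 0.40}

soft_label = weighted_soft_label(deepface_scores, fer_scores, w_deepface=0.4)
top_emotion = max(soft_label, key=soft_label.get)
```

Keeping the full blended distribution, rather than only `top_emotion`, is what allows the downstream classifier to be trained against soft targets.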

VI-A2 Hybrid Model Training and Optimization

Several prior works have explored emotion recognition models tailored for autistic children. Alhakbani [] developed a CNN trained on ASD facial images across five emotion classes, achieving 75% accuracy, reflecting the challenges of affect recognition in this population. Smitha and Vinod [] proposed a PCA-based system deployed on FPGA; though it reached 94.1% on JAFFE, performance dropped to 82.3% on real-world ASD data, underscoring domain-specific limitations. Wang et al. [] introduced a multimodal CVT architecture combining facial and speech inputs, where the facial-only branch achieved 79.12% and the fused model reached 90%, highlighting the benefits of cross-modal integration.

These unimodal facial expression systems (75%, 82.3%, 79.12%) offer directly comparable baselines for evaluating our model, as summarized in Table V. In contrast, our architecture, built on ResNet-50 and GCN backbones, was trained exclusively on an in-house ASD-specific dataset and achieved 96.2% accuracy. This improvement suggests the advantage of residual feature fusion for capturing subtle affective cues often missed by traditional CNNs or hand-crafted methods.

TABLE V: Comparison of unimodal facial-expression models evaluated on ASD datasets and their limitations.
Study	Accuracy (%)	Limitations
Alhakbani (2024) []	$\sim$75.0	Small and demographically narrow dataset with limited generalization.
Smitha & Vinod (2015) []	82.3	Low-resolution PCA features that lack geometric cues and real-time support.
Wang et al. (2025) []	79.1	Confusion between similar emotions; no temporal modeling or ablation.
Our Model (2025)	96.2	Not real-time; possible latency in live deployment.

VI-B Experimental results

VI-B1 Face pre-processing outcomes

Our preprocessing component analyzed 48,891 frames from NAO-mediated child–robot interaction videos, recorded in a naturalistic, unconstrained environment without head fixation or behavioral restrictions. Of these, 1,600 were discarded due to blurriness and 20,170 due to missed detections, leaving 19,322 valid face crops obtained through our two-stage pipeline, corresponding to a 39.5% face detection success rate. The comparatively low yield is consistent with the free-play setup, in which the NAO robot called the child’s name 12 times across sessions involving toys and spontaneous movement. The total preprocessing duration was 40,453.52 seconds ( $≈$ 11.2 hours). A summary of these statistics is provided in Table VI.

TABLE VI: Summary of face preprocessing statistics
Metric	Value
Total images found	48,891
Valid images	48,886
Blurry images skipped	1,600
Images with no faces	20,170
Total faces extracted	19,322
Success rate	39.5%
Processing time (seconds)	40,453.52
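The derived figures in Table VI can be reproduced from the raw counts with a few lines of arithmetic. The mean time per frame is not reported in the table; it is computed here from the table's values only as a rough indication of preprocessing cost.

```python
total_frames = 48_891          # total images found
faces_extracted = 19_322       # total faces extracted
processing_seconds = 40_453.52

success_rate = faces_extracted / total_frames
processing_hours = processing_seconds / 3600
seconds_per_frame = processing_seconds / total_frames  # derived, not reported

print(f"success rate: {success_rate:.1%}")
print(f"processing time: {processing_hours:.1f} h")
print(f"mean time per frame: {seconds_per_frame:.2f} s")
```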

VI-B2 Emotion distribution throughout the experiment

Each child participated in a 200-second interaction session, with video recorded at 15 frames per second, yielding a high number of frames per participant. These were processed through our facial landmark extraction and hybrid deep learning classification pipeline.

Fig. 8 presents the distribution of emotion labels obtained via our weighted ensemble method. Most frames were classified as neutral (8,969) and happy (5,309), suggesting a predominance of non-negative affective states during the interaction. Moderate representation was observed for angry (1,822), surprise (1,605), and sad (1,386), while disgust (152) and fear (79) were rare, likely due to the controlled experimental setting.
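As a quick consistency check, the per-class frame counts above can be summed and converted to shares. The counts are those stated in the text; the variable names are illustrative.

```python
label_counts = {
    "neutral": 8_969,
    "happy": 5_309,
    "angry": 1_822,
    "surprise": 1_605,
    "sad": 1_386,
    "disgust": 152,
    "fear": 79,
}

total_labeled = sum(label_counts.values())
shares = {emotion: count / total_labeled for emotion, count in label_counts.items()}

for emotion, share in shares.items():
    print(f"{emotion:>8}: {share:6.1%}")
```

The counts sum to 19,322, which matches the number of valid face crops reported for the preprocessing stage.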

Fig. 8: Bar-chart representation of emotion distribution.

VI-C Prediction Analysis

To quantitatively assess our ensemble-based emotion recognition system on the responses of children with ASD, a multi-layered visual and statistical analysis was conducted across seven emotion categories: happy, sad, angry, fear, disgust, surprise, and neutral. Emotion-wise softmax scores of the Fusion-N model were examined for prediction confidence, distribution shape, and separability between classes. In emotion_descriptive_stats.csv, mean confidence values were highest for happy ($M = 0.1459$), sad ($M = 0.1443$), and surprise ($M = 0.1434$), and lowest for neutral ($M = 0.1386$). Narrow standard deviations for all classes ($\sigma \approx 0.001$–$0.003$) indicate low variability in model confidence.

Fig. 9: Smoothed KDE Curves for Emotion Scores.

The boxplot (Fig. 10) indicated a greater median and wider outlier spread for happy, with most values concentrated in $[0.145, 0.155]$, while neutral was tightly restricted to $[0.138, 0.140]$. KDE smoothing (Fig. 9) indicated a right-skewed peak for happy ($\approx 0.148$), while overlapping distributions for sad, fear, and angry reflect the difficulty of distinguishing among these emotions, given their subtle expressivity in ASD.

Additionally, to examine the overall emotional tendencies of the autistic children, we grouped the emotions observed during the name-calling event into two categories: positive (happy, surprise) and negative (sad, angry, disgust). Fig. 11 (pie chart) shows that the majority of children, i.e., 73.3% (11 out of 15), exhibited predominantly positive emotions, while the remaining 26.7% (4 out of 15) were dominated by negative emotions. This observation aligns with prior work showing that robot-based interactive interventions can foster engagement and elicit positive responses in children with ASD [].
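One way to operationalize this grouping is sketched below. The rule used here (comparing labeled-frame counts per valence group, with ties counted as positive) and the example frame counts are assumptions for illustration; the text does not state how dominance was determined for each child. Only the 11-versus-4 split comes from the study.

```python
POSITIVE = ("happy", "surprise")
NEGATIVE = ("sad", "angry", "disgust")


def dominant_valence(frame_counts):
    """Label a child by whichever valence group has more labeled frames."""
    positive = sum(frame_counts.get(e, 0) for e in POSITIVE)
    negative = sum(frame_counts.get(e, 0) for e in NEGATIVE)
    return "positive" if positive >= negative else "negative"


def valence_shares(labels):
    """Return the percentage of children falling in each valence group."""
    n = len(labels)
    return {v: 100.0 * labels.count(v) / n for v in ("positive", "negative")}


# Hypothetical frame counts for two children (not taken from the study).
child_a = {"happy": 420, "surprise": 90, "sad": 60, "angry": 30, "neutral": 700}
child_b = {"happy": 80, "sad": 150, "angry": 120, "disgust": 10, "neutral": 500}
labels_demo = [dominant_valence(child_a), dominant_valence(child_b)]

# The split reported in the text: 11 positive-dominant, 4 negative-dominant.
reported = ["positive"] * 11 + ["negative"] * 4
shares = valence_shares(reported)
print({k: round(v, 1) for k, v in shares.items()})
```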

Fig. 10: Box-Whisker Plot for Emotion Confidence Scores.

Fig. 11: Pie chart representing the distribution of positive vs. negative emotions during the name-calling event. The teal shade represents positive emotions (happy, surprise) and the coral shade represents negative emotions (sad, angry, disgust, fear).

VI-D Statistical Significance Testing

ANOVA and Kruskal–Wallis tests between the seven emotion classes verified significant variation in model confidence:

• ANOVA: $F(6, N) = 202.00$, $p < 1.0 \times 10^{-180}$

• Kruskal–Wallis: $H(6) = 692.18$, $p < 3.0 \times 10^{-146}$

Post-hoc Tukey HSD tests indicated that neutral was always separable, with significantly lower confidence than happy, sad, angry, and disgust ($p < 0.001$). Both happy and sad achieved significantly higher confidence than neutral and disgust, demonstrating their salience in the ensemble’s predictions.
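For reference, the two omnibus statistics can be computed from grouped confidence scores as sketched below. The data are synthetic, the class means that do not appear in the text are placeholders, and the Kruskal–Wallis function assumes tie-free samples (no tie correction). With seven classes, both tests have $7 - 1 = 6$ between-group degrees of freedom, matching the $F(6, N)$ and $H(6)$ notation above; p-values are not computed here.

```python
import numpy as np


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def kruskal_wallis_h(groups):
    """Return the H statistic for tie-free samples (no tie correction)."""
    sizes = [len(g) for g in groups]
    pooled = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    n_total = len(pooled)
    ranks = np.empty(n_total)
    ranks[np.argsort(pooled)] = np.arange(1, n_total + 1)  # rank 1 = smallest
    h, start = 0.0, 0
    for n in sizes:
        h += ranks[start:start + n].sum() ** 2 / n
        start += n
    return 12.0 / (n_total * (n_total + 1)) * h - 3.0 * (n_total + 1)


rng = np.random.default_rng(0)
# Synthetic per-class scores; the last three means are placeholders.
class_means = [0.1459, 0.1443, 0.1434, 0.1386, 0.1420, 0.1425, 0.1430]
groups = [rng.normal(m, 0.002, size=200) for m in class_means]

f_stat, df_between, df_within = one_way_anova_f(groups)
h_stat = kruskal_wallis_h(groups)
```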

VII Conclusion

VII-A Ensemble-based labeling framework

The proposed framework integrates predictions from pre-trained models (DeepFace and FER) using a consensus strategy tailored to the expressive variability of autistic children. Given the inconsistent performance of off-the-shelf models on neurodiverse datasets, our ensemble was optimized to enhance robustness on ASD-specific facial data.

To assess generalizability, we evaluated the ensemble on a publicly available ASD dataset [], annotated by a certified clinical psychologist. The model achieved 90.16% accuracy relative to expert labels (Table IV), demonstrating strong clinical concordance and adaptability to unseen data. Our results support ensemble learning as a scalable, clinically aligned alternative to manual annotation in resource-constrained settings.

VII-B Predictive hypothesis

We compared emotion predictions for 15 children with autism during human–robot interaction facilitated by the NAO robot, across seven basic emotions. Descriptive statistics, visual distribution plots, and inferential statistical analyses were applied to determine emotional expressivity and inter-individual variability.

Mean and standard deviation values were calculated for each emotion per child. Happy, sad and surprise exhibited higher mean scores across most participants, whereas neutral, disgust, and angry remained at lower and relatively stable levels. Standard deviation patterns indicated greater variability in happy, sad, and fear, while disgust and neutral were more consistent.

Participant-8, Participant-9 and Participant-10 demonstrated a higher prevalence of happy and sad predictions, consistent with the theory of emotional salience in autism spectrum disorder (ASD) []. The emotion fear was more dominant in some children, reinforcing prior findings that ASD individuals often exhibit elevated anxiety or hyperarousal in novel contexts such as robot interaction [].

The emotions happy, sad, and surprise exhibited broader confidence intervals and denser distributions, suggesting richer expressivity. The box-and-whisker plots confirmed this with larger interquartile ranges. These emotions also showed several outliers, indicating transient emotional bursts, a known characteristic of affect dysregulation in ASD []. This aligns with the known heterogeneity in affective displays among individuals on the autism spectrum, where emotional responses can range from subdued to highly exaggerated depending on context, sensory sensitivity, or individual traits.

Implications and Literature Alignment

Our results are consistent with psychological research on emotion expression in ASD, in which children with developmental or emotional difficulties show an innate bias toward positive expressions in interactive and observational situations. In our dataset, 73.3% of the children exhibited positive emotional dominance, represented by happy and surprise. A notable minority (26.7%), however, showed negative dominance, namely sad, disgust, and angry, seen among participants 2, 5, 6, and 7. This diversity highlights the importance of individualized, emotion-sensitive interventions, since children with predominantly negative affect may benefit from specialized intervention in affective learning environments. Furthermore, these findings support the viability of using robotic stimuli such as NAO to examine and perhaps augment autistic children’s emotional expressivity, and demonstrate the potential of emotion-aware robotics as a tool in affective computing and autism therapy.

VII-C Future Scope and Discussions

While the current system performs reliably in offline conditions, its application in real-time scenarios remains a key area for enhancement. At present, NAO is used only as a facilitator, and the primary limitation lies in the latency introduced by sequential modules, particularly during face detection and preprocessing.

Future efforts can focus on optimizing the pipeline for real-time deployment by prioritising low-latency, adaptive, and hardware-efficient implementations to extend real-world applicability.

Adaptive learning with reference to personal emotional profiles can improve performance across various ASD settings by detecting nuanced differences in affective expressions. Tested and validated using a geographically representative dataset, our ResNet-50 + three-layer GCN architecture presents strong, generalizable capability for ASD emotion analysis in real-world scenarios.

VIII Acknowledgement

The authors thank the Smart Materials, Structures and Systems Laboratory of the Department of Mechanical Engineering and the Psychology Laboratory of the Department of Humanities and Social Sciences, IIT Kanpur, for the infrastructural facilities and support extended in conducting this research work. Special thanks go to Mr. Rohit Kumar Tiwari, a specialised clinical psychologist (Rehabilitation Psychology) at the Pushpa Khanna Memorial Centre, for his guidance in behavioral assessment, labelling of our global dataset, and support for dataset annotation. We also appreciate the cooperation and provision of logistic support from the Amrita Rehabilitation Centre and Pushpa Khanna Memorial Centre, both situated in Kanpur, India. Our sincere appreciation extends to the parents for trusting us and to the children for their voluntary participation. We lastly acknowledge all personnel who assisted in the process of data collection at partner centers and in our laboratory.

References