Facial recognition systems have been widely adopted in modern security systems, but existing methodologies lack the capability to identify suspicious activities. Other approaches, including CNNs applied to face recognition and hybrid CNN-LSTM models for behavior recognition, struggle to extract effective features of human appearance for sequence-to-sequence temporal analysis and to model the long-term dependencies inherent in the data. In this research, a hybrid ViT-LSTM model is developed to improve facial recognition and behavioral anomaly detection in support of smart security surveillance. The Vision Transformer (ViT) extracts high-dimensional facial features with better accuracy and captures the global spatial information that CNNs fail to capture in large-scale scenarios. These features are then processed over time by LSTM networks to identify behavioral patterns, providing sound temporal modeling of human activities. The hybrid structure connects both components through a fusion layer so that multitask classification can be performed for identity authentication and anomaly identification. The efficacy of the proposed model is demonstrated by training and validating it on a benchmark dataset, where it achieves higher face recognition accuracy and higher precision in behavioral anomaly detection than conventional CNN-LSTM models. In addition, the system has lower latency and high real-time performance, making it suitable for deployment in security-critical areas. This hybrid approach therefore presents a scalable, accurate, and efficient solution for next-generation security monitoring systems. © 2025 IEEE.
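
The data flow described in the abstract (per-frame ViT features, an LSTM over time, a fusion layer, and two task heads) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. It assumes the ViT has already produced one feature vector per frame, that fusion is a simple concatenation of the mean frame feature with the final LSTM hidden state, and that weights are random and untrained. All names (`LSTMCell`, `HybridHead`, `forward`) and sizes are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, x):
    """Return W @ x + b for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class LSTMCell:
    """Standard LSTM cell with input (i), forget (f), output (o), candidate (g) gates."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One weight matrix per gate, applied to the concatenated [x, h] vector.
        self.w = {g: init_matrix(hidden_size, input_size + hidden_size, rng) for g in "ifog"}
        self.b = {g: [0.0] * hidden_size for g in "ifog"}

    def step(self, x, h, c):
        xh = x + h  # list concatenation
        i = [sigmoid(v) for v in linear(self.w["i"], self.b["i"], xh)]
        f = [sigmoid(v) for v in linear(self.w["f"], self.b["f"], xh)]
        o = [sigmoid(v) for v in linear(self.w["o"], self.b["o"], xh)]
        g = [math.tanh(v) for v in linear(self.w["g"], self.b["g"], xh)]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


class HybridHead:
    """Temporal model, fusion, and two task heads (assumed design, not the paper's)."""

    def __init__(self, feature_size, hidden_size, num_identities, seed=0):
        rng = random.Random(seed)
        self.cell = LSTMCell(feature_size, hidden_size, rng)
        fused_size = feature_size + hidden_size
        self.w_id = init_matrix(num_identities, fused_size, rng)
        self.b_id = [0.0] * num_identities
        self.w_anom = init_matrix(1, fused_size, rng)
        self.b_anom = [0.0]

    def forward(self, frame_features):
        # frame_features: list of per-frame feature vectors (assumed ViT outputs).
        h = [0.0] * self.cell.hidden_size
        c = [0.0] * self.cell.hidden_size
        for x in frame_features:
            h, c = self.cell.step(x, h, c)
        # Assumed fusion: mean spatial feature concatenated with final temporal state.
        n = len(frame_features)
        mean_feature = [sum(col) / n for col in zip(*frame_features)]
        fused = mean_feature + h
        identity_logits = linear(self.w_id, self.b_id, fused)
        anomaly_prob = sigmoid(linear(self.w_anom, self.b_anom, fused)[0])
        return identity_logits, anomaly_prob


# Usage: a 5-frame clip with 8-dimensional placeholder features.
rng = random.Random(1)
clip = [[rng.random() for _ in range(8)] for _ in range(5)]
model = HybridHead(feature_size=8, hidden_size=4, num_identities=3)
logits, anomaly_prob = model.forward(clip)
```

Sharing one fused vector between the identity head and the anomaly head is what makes the sketch multitask: both outputs come from a single pass over the clip. A real system would replace the placeholder features with ViT embeddings and train all weights jointly.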