Novel Machine Learning Methods for Video Understanding and Medical Analysis
Date
2025-06-26
Abstract
Artificial intelligence has developed rapidly over the past decade and now permeates nearly every aspect of life. New applications have emerged in large numbers in areas such as human-computer interaction, virtual reality, autonomous driving, and intelligent medical systems. Video is high-dimensional data: it has one more dimension than images and therefore requires more computing resources. As more and more high-quality, large-scale video datasets are released, video understanding has become a cutting-edge research direction in the computer vision community. Action recognition is one of its most important tasks, and many successful network architectures for video action recognition exist.
In our work, we focus on proposing new designs and architectures for video understanding and on investigating their applications in medicine. We introduce a novel RGBt sampling strategy that fuses temporal information into single frames without increasing the computational load, and we explore different color sampling strategies to further improve network performance; frames that fuse the green channels of different frames achieve the best results (see the sketch below). We use tubes of different sizes to embed richer temporal information into tokens, again without increasing the computational load. We also introduce a novel bio-inspired neuron model, the MinBlock, which makes the network more information-selective. Furthermore, we propose a spatiotemporal architecture that slices videos in space-time and thus enables 2D-CNNs to extract temporal information directly. All of these methods are evaluated on at least two benchmark datasets, and all outperform their baselines.
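To make the RGBt idea concrete, here is a minimal sketch in PyTorch. The clip shape (T, C, H, W) and the frame offsets are assumptions for illustration; this shows the channel-fusion principle, not the exact implementation from this work:

```python
import torch

def rgbt_green_fusion(clip: torch.Tensor, t: int, offsets=(0, 1, 2)) -> torch.Tensor:
    """Build one 3-channel frame from the green channels of frames
    t, t+1, t+2 (clamped at the clip end), so a single frame carries
    temporal information at the cost of an ordinary RGB frame."""
    T = clip.shape[0]
    greens = [clip[min(t + o, T - 1), 1] for o in offsets]  # channel 1 = green
    return torch.stack(greens, dim=0)  # (3, H, W)

clip = torch.rand(8, 3, 64, 64)   # toy clip: 8 RGB frames of 64x64
frame = rgbt_green_fusion(clip, t=0)
print(frame.shape)                # torch.Size([3, 64, 64])
```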
We also focus on applying our networks in medicine. We use our slicing 2D-CNN architecture (illustrated after this paragraph) for glaucoma and visual impairment analysis, and we find that visual impairments may affect human walking patterns, which makes video analysis relevant for diagnosis. We also design a machine learning model to diagnose psychosis and show that it is possible to predict whether clinical high-risk patients will actually develop psychosis.
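The slicing idea can be sketched as a tensor rearrangement; the shapes and the toy convolution below are assumptions for illustration, not the architecture from this work. Cutting the video along the height axis yields time-width planes, so an ordinary 2D convolution spans the temporal axis directly:

```python
import torch
import torch.nn as nn

def spacetime_slices(video: torch.Tensor) -> torch.Tensor:
    """(B, C, T, H, W) -> (B*H, C, T, W): each height row becomes a
    time-width image that a plain 2D convolution can process."""
    B, C, T, H, W = video.shape
    return video.permute(0, 3, 1, 2, 4).reshape(B * H, C, T, W)

backbone = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # ordinary 2D convolution
video = torch.randn(2, 3, 8, 32, 32)
features = backbone(spacetime_slices(video))  # kernel now covers time and width
print(features.shape)                         # torch.Size([64, 16, 8, 32])
```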
Description
This research discusses advancements in video understanding within artificial intelligence, emphasizing new methods to enhance action recognition—an essential task in analyzing video data. The authors introduce innovative strategies, such as a novel RGBt sampling method that fuses temporal information into single frames without increasing computational costs, with green channel fusion yielding the best results. They also embed richer temporal data using tubes of varying sizes and propose a bio-inspired neuron model called MinBlock to improve information selectivity. Additionally, they develop a spatiotemporal architecture that enables 2D convolutional neural networks (CNNs) to extract temporal features directly from videos by slicing them in space-time.
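As an illustration of tube-based token embedding, the sketch below uses a ViViT-style 3D convolution whose kernel and stride equal the tube size; the tube sizes and embedding dimension are assumptions for illustration, not the configuration used in this work.

```python
import torch
import torch.nn as nn

def tube_tokens(video: torch.Tensor, tube=(2, 16, 16), dim=96) -> torch.Tensor:
    """(B, C, T, H, W) -> (B, N, dim): one token per non-overlapping tube.
    The projection is created ad hoc here, which is fine for a shape sketch."""
    proj = nn.Conv3d(video.shape[1], dim, kernel_size=tube, stride=tube)
    x = proj(video)                      # (B, dim, T/t, H/h, W/w)
    return x.flatten(2).transpose(1, 2)  # (B, num_tubes, dim)

video = torch.randn(2, 3, 8, 64, 64)
small = tube_tokens(video, tube=(2, 16, 16))  # shorter tubes, less context each
large = tube_tokens(video, tube=(4, 16, 16))  # longer tubes, more temporal context
print(small.shape, large.shape)               # (2, 64, 96) (2, 32, 96)
```

Longer tubes pack more temporal context into each token while producing fewer tokens, which is how varying the tube size trades temporal richness against token count.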
The research demonstrates that these techniques outperform baseline models on multiple benchmark datasets. Furthermore, the authors explore practical medical applications of their methods, including the analysis of glaucoma and visual impairments—highlighting how these impairments can influence walking patterns detectable via video. They also develop a machine learning model to predict the development of psychosis in high-risk patients, showcasing the potential of these video understanding approaches for medical diagnosis and prognosis.
Keywords
Deep Learning, Transformers, CNNs, Video Understanding, MRI Images.
Institute/Clinic
Institut für Neuro- und Bioinformatik