A research team with the ΒιΆΉΣ³»­΄«Γ½β€™s Center for Research in Computer Vision recently won a competition to improve computer vision by creating technology that can automatically track behavior in long security videos.

The competition, called the Activities in Extended Video Challenge for 2020, was sponsored by the U.S. Department of Commerce’s National Institute of Standards and Technology and was held virtually in June as part of the Conference on Computer Vision and Pattern Recognition.

Top computer vision teams from around the world, including teams from IBM, Massachusetts Institute of Technology, Carnegie Mellon University, and Purdue University, competed in the challenge.

β€œVideo surveillance is of great importance for security, and manually watching surveillance videos is not only difficult but inefficient,” says Yogesh Rawat, an assistant professor at the center and team leader. β€œAlso, with so many closed-circuit television cameras all around, it is not possible to manually watch those videos. We need automatic analysis of these security videos to improve efficiency as well as accuracy.”

That need for β€œextra eyes” is why the ΒιΆΉΣ³»­΄«Γ½ computer vision team developed a deep-learning system, named Gabriella, that can detect multiple activities happening in a security video efficiently, at a speed of 100 frames per second.

β€œThis is a first step toward analyzing these security videos, and it will have a lot of applications in national security,” Rawat says.

The team also included ΒιΆΉΣ³»­΄«Γ½ Trustee Chair Professor of Computer Science and center director Mubarak Shah, who says the win is a big plus for the group.

β€œVideo activity recognition in unconstrained domains is a very important problem with applications in self-driving cars, video surveillance and monitoring, human-computer interfaces, and video search,” Shah says.

β€œOur submission was the fastest and most accurate, the two criteria for the Deep Intermodal Video Analytics program,” he says.

Participation in the challenge supports the ΒιΆΉΣ³»­΄«Γ½ team’s role in the Deep Intermodal Video Analytics program, which is funded by the U.S. Office of the Director of National Intelligence’s Intelligence Advanced Research Projects Activity program through a sub-contract from the University of Maryland.

The ΒιΆΉΣ³»­΄«Γ½ team was runner-up to Carnegie Mellon University in 2018 and 2019 in the institute’s similar Text Retrieval Conference Video Retrieval Evaluation. This year, the Carnegie Mellon team was runner-up.

The ΒιΆΉΣ³»­΄«Γ½ team won by developing an end-to-end approach to computer analysis of video footage.

End-to-end means the computer takes raw RGB video as input and directly generates the required output, without the intermediate processing that the systems developed by the other teams required.

Intermediate processing tasks such as object detection, optical-flow computation, and tracking make the whole process very complicated and difficult to train as well as test, Rawat says.

β€œThe end-to-end system avoids all of this and therefore is preferred,” he says.

The ΒιΆΉΣ³»­΄«Γ½ team’s system can monitor 37 different activities over the course of more than 250 hours of video, including activities such as β€œtheft” and β€œperson abandons package,” and the scalable machine-learning system can be trained to recognize more if the data are available.

Monitoring for these kinds of activities over hours of video is difficult for computer vision because activities vary in length, multiple activities can occur in the same frame, the same person can perform different activities, and the scale of activities varies, with those closer to the camera appearing larger.
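The end-to-end idea described above can be sketched in a few lines. The toy model below is purely illustrative and is not Gabriella’s actual architecture: it stands in for any network that maps raw RGB frames directly to per-frame activity scores, with no separate object detection, optical-flow, or tracking stages. Independent sigmoid scores (rather than a single softmax) reflect the point that multiple activities can occur in the same frame.

```python
import numpy as np

# Hypothetical stand-in for an end-to-end activity detector: raw RGB video
# in, per-frame activity scores out. No intermediate processing stages.
NUM_ACTIVITIES = 37  # the challenge's activity classes

def end_to_end_scores(video, weights):
    """video: (T, H, W, 3) raw RGB frames with values in [0, 1].
    weights: (H*W*3, NUM_ACTIVITIES) model parameters (random here,
    learned in a real system).
    Returns (T, NUM_ACTIVITIES) per-frame activity scores in (0, 1)."""
    t = video.shape[0]
    flat = video.reshape(t, -1)           # flatten each frame to a vector
    logits = flat @ weights               # one direct map: input -> output
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid: one independent score
                                          # per activity, so several can
                                          # fire in the same frame

rng = np.random.default_rng(0)
video = rng.random((8, 16, 16, 3))        # 8 tiny synthetic frames
weights = rng.standard_normal((16 * 16 * 3, NUM_ACTIVITIES)) * 0.01
scores = end_to_end_scores(video, weights)
print(scores.shape)  # (8, 37): one score per frame per activity
```

A real end-to-end system would replace the single linear map with a deep spatiotemporal network, but the interface is the same: raw pixels in, activity predictions out, trainable as one unit.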

The ΒιΆΉΣ³»­΄«Γ½ research team also includes ΒιΆΉΣ³»­΄«Γ½ Department of Computer Science doctoral students Praveen Tirupattur, Aayush Rana, Kevin Duarte, Ugur Demir and Ishan Dave; and ΒιΆΉΣ³»­΄«Γ½ Office of Research doctoral fellow Nayeem Rizve.

β€œWe worked very hard to get to the top position,” Rawat says.