Recognizing 400 different activities in videos using Python and OpenCV

Vasu Gupta
5 min read · Sep 8, 2021

Overview

Recognizing activities in video streams has many applications: context-aware advertising, tracking time spent on different activities, triggering alerts when dangerous activities are detected, and more.

This blog post includes a Python code walkthrough for performing activity recognition in videos using a pretrained 3D convolutional ResNet model. The OpenCV library is used for inference. The goal of this post is to focus on the main concepts while keeping the working code example minimal.

Complete code can also be accessed here — LINK

Model correctly recognizing the activity as 'making pizza' (original video, before inference: Video Source)

Activity Recognition Model

A pretrained model from the authors of the paper 'Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?' (LINK) is used; it is available on their GitHub repository (LINK). However, a Python script must be run to generate the pretrained model file. To avoid this step, I have included code to download the model file directly from my Google Drive.

The model uses a 3D convolutional ResNet architecture. It is called 3D because the convolution filters slide over the image height, width and the temporal dimension, i.e. a sequence of consecutive frames from the video.

Input frames to the model must be in RGB format, resized to 112 x 112 pixels, and passed as an array of 16 consecutive frames from the video for a single prediction.
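A single prediction therefore consumes a 5-dimensional tensor of shape 1 x 3 x 16 x 112 x 112 (batch x channels x frames x height x width). Below is a minimal illustrative sketch of that shape only; the actual preprocessing is covered in the walkthrough further down.

# illustrative only: one model input = 1 clip x 3 colour channels x 16 frames x 112 x 112 pixels
import numpy as np
clip = np.zeros((1, 3, 16, 112, 112), dtype=np.float32)
print(clip.shape)  # (1, 3, 16, 112, 112)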

This model was trained on the Kinetics Human Action Video Dataset (LINK) and can recognize 400 different activities including yoga, making pizza, skating, etc.

The pretrained model file is in .onnx format, an open standard for machine learning interoperability [LINK]. The model file size is approximately 250 MB.

Code walkthrough

Below is a walkthrough of the complete Python code, which can also be accessed here: LINK

  1. Code snippet for downloading the files_activity_recognition.zip file from my Google Drive and unzipping it after downloading. The three required files are included: the pretrained model file, the class names text file and a sample input video file.
# downloading zip file from google drive
from google_drive_downloader import GoogleDriveDownloader as gdd

file_id = '1-2fFtZbQLsF3sSwpAZC_HOHyrCrF6GWz'
gdd.download_file_from_google_drive(file_id=file_id, dest_path='/content/files_activity_recognition.zip', unzip=True)

# unzipping file
!unzip files_activity_recognition.zip
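If the google_drive_downloader package is unavailable, the same archive can be fetched with the gdown package instead (an alternative sketched here, not part of the original walkthrough; assumes pip install gdown):

# alternative download using gdown (assumes the same file_id as above)
import gdown
gdown.download('https://drive.google.com/uc?id=' + file_id, '/content/files_activity_recognition.zip', quiet=False)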

2. Importing required libraries and defining various path locations

# importing required libraries
import numpy as np
import cv2
from time import time

# defining various input parameters
filepath_class_names = '/content/files_activity_recognition/class_names_list.txt'
filepath_model = '/content/files_activity_recognition/resnet-34_kinetics.onnx'
filepath_in_video = '/content/files_activity_recognition/video_making_pizza_resized.mp4'
filepath_out_video = '/content/output.mp4'

3. Loading the class names and the model file. OpenCV's dnn module is used for loading the model.

# loading class names
with open(filepath_class_names, 'r') as fh:
    class_names = fh.read().strip().split('\n')

# loading the model
model = cv2.dnn.readNet(filepath_model)
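If your OpenCV build was compiled with CUDA support (an assumption about your environment, not something the original code requires), inference can optionally be moved to the GPU:

# optional: run dnn inference on the GPU when OpenCV has CUDA support
model.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
model.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)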

4. Defining a function for preprocessing the frames before they are passed to the model. OpenCV's dnn.blobFromImages function is used to:

  1. resize the images to 112 x 112 pixels
  2. subtract the specified mean values from all pixels, a preprocessing step required by the model
  3. swap the red and blue channels to convert the images to RGB format. BGR is OpenCV's default format, so frames read from disk with OpenCV arrive in BGR, while the model expects RGB; swapRB=True performs this conversion.
  4. center crop the images after resizing (crop=True) so that the final image size is 112 x 112

The blob returned by cv2.dnn.blobFromImages is an array of shape N x C x H x W, where N is the number of frames (16), C is the number of channels (3 for R, G, B), and H and W are the height (112) and width (112) respectively.

The blob is transposed to C x N x H x W using np.transpose, since the 3D convolutions expect the channel dimension first. It is then reshaped to 1 x C x N x H x W, where the first dimension is the batch dimension with a single example in the batch. A single example consists of 16 consecutive frames of size 112 x 112 pixels with 3 color channels.

# defining function for preprocessing the frames
def preprocess(frames):
    model_img_w = 112  # as per model input image width
    model_img_h = 112  # as per model input image height
    mean = (114.7748, 107.7354, 99.4750)  # mean pixel values to subtract, as required by the pretrained model
    blob = cv2.dnn.blobFromImages(frames, scalefactor=1, size=(model_img_w, model_img_h), mean=mean, swapRB=True, crop=True)

    # (N, C, H, W) -> (C, N, H, W), then add a batch dimension -> (1, C, N, H, W)
    blob = np.transpose(blob, (1, 0, 2, 3))
    blob = np.expand_dims(blob, axis=0)
    return blob
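A quick sanity check of the returned shape, using dummy frames (purely illustrative):

# 16 dummy BGR frames should produce a blob of shape (1, 3, 16, 112, 112)
dummy_frames = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(16)]
print(preprocess(dummy_frames).shape)  # (1, 3, 16, 112, 112)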

5. Defining a helper function to retrieve the next 16 image frames from the video.

  1. vcap is a cv2.VideoCapture object used to read frames from the video via its .read() method
  2. grabbed is True if a frame was read successfully, otherwise False
  3. get_video_frames returns the next num_frames frames, or -1 if it cannot read all of them (end of video reached or a read error occurred)

def get_video_frames(vcap, num_frames):
    frames = []
    for i in range(num_frames):
        grabbed, frame = vcap.read()
        if not grabbed:
            print("No more frames to read")
            break
        frames.append(frame)

    if len(frames) < num_frames:
        return -1
    else:
        return frames
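The helper can be exercised on the sample video before wiring it into the main loop (illustrative usage only; a separate VideoCapture object is used so the main one is not consumed):

# illustrative usage: read the first 16 frames from the sample video
vcap_check = cv2.VideoCapture(filepath_in_video)
batch = get_video_frames(vcap_check, 16)
print(len(batch) if batch != -1 else "fewer than 16 frames available")
vcap_check.release()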

6. Defining a function that writes the predicted activity class onto a video frame before it is saved to the output video

def write_pred_on_frame(frame, pred_class_name):
    text = pred_class_name
    # black filled rectangle in the top-left corner as a background for the label
    cv2.rectangle(frame, (0, 0), (150, 30), (0, 0, 0), -1)
    cv2.putText(frame, text, (10, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 1)
    return frame

7. Instantiating cv2.VideoCapture and cv2.VideoWriter objects for reading and writing video frames respectively. fourcc is the four-character code of the codec used for encoding/decoding the video file.

vcap = cv2.VideoCapture(filepath_in_video)
if not vcap.isOpened():
    print("Error opening video")
else:
    width = int(vcap.get(3))
    height = int(vcap.get(4))
    fps = int(vcap.get(5))
    num_frames = int(vcap.get(7))
    duration = num_frames / fps
    print("frame width:{}, height:{}, fps:{}".format(width, height, fps))
    print("Video duration {} seconds".format(duration))

fourcc = cv2.VideoWriter_fourcc(*'XVID')  # XVID codec for writing MP4 files
vout = cv2.VideoWriter(filepath_out_video, fourcc, fps, (width, height), 1)  # 1 for coloured video
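The numeric property indices used above (3, 4, 5, 7) map to named constants in OpenCV; the following equivalent lookups return the same values and may be easier to read:

# equivalent property lookups using OpenCV's named constants
width = int(vcap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(vcap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = int(vcap.get(cv2.CAP_PROP_FPS))
num_frames = int(vcap.get(cv2.CAP_PROP_FRAME_COUNT))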

8. Finally, we get to the code for performing inference on the input video and writing the results to the output video. After running the code below, an output .mp4 video file is generated with frames annotated with the predicted activity.

  1. Loop until all input video frames have been read, fetching the next num_frames frames with get_video_frames
  2. Preprocess the frames
  3. Perform the prediction using the model.setInput and model.forward calls
  4. Find the class index and corresponding class name with the highest prediction score
  5. Annotate the output frames with the predicted class and write them to the output video using vout.write
  6. Release the vcap and vout objects after processing is finished

The sample input video used in the code is approximately 18 seconds long, and processing it on CPU on Google Colab (without GPU enabled) takes approximately 37 seconds.

# performing inference and writing to output video stream
num_frames = 16  # number of frames passed to the model for a single inference; fixed by the model and should not be changed
start = time()

while True:
    frames = get_video_frames(vcap, num_frames)
    if frames == -1:
        break

    frames_processed = preprocess(frames)
    model.setInput(frames_processed)
    pred = model.forward()  # resulting pred.shape will be (1, 400)
    pred_class_idx = np.argmax(pred)
    pred_class_name = class_names[pred_class_idx]

    for frame in frames:
        output_frame = write_pred_on_frame(frame, pred_class_name)
        vout.write(output_frame)

end = time()
print("Finished processing. Took {} seconds".format(end - start))

vcap.release()
vout.release()
Generated output video, predicted activity shown in top left corner
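If you are running the notebook in Google Colab (as assumed by the file paths used throughout), the annotated output video can be downloaded to your machine with:

# download the generated output video from Colab
from google.colab import files
files.download(filepath_out_video)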

Thanks for reading!!!

References

https://github.com/opencv/opencv/blob/master/samples/dnn/action_recognition.py

https://www.pyimagesearch.com/2019/11/25/human-activity-recognition-with-opencv-and-deep-learning/


Vasu Gupta

Computer Vision & Deep Learning Developer | Graduate from Stanford University & IIT Bhubaneswar | Enjoys exploring and building new things