Recognizing 400 different activities in videos using Python and OpenCV

Vasu Gupta
5 min read · Sep 8, 2021

Overview

Recognizing activities in video streams has many applications: context-aware advertising, tracking time spent on different activities, triggering alerts when dangerous activities are detected, and more.

This blog post includes a Python code walkthrough for performing activity recognition in videos using a pretrained 3D convolutional ResNet model. The OpenCV library is used for inference. The goal of this post is to focus on the main concepts while keeping the working code example minimal.

Complete code can also be accessed here — LINK

Model correctly recognizing the activity as 'making pizza' (original video, before inference: Video Source)

Activity Recognition Model

A pretrained model from the authors of the paper 'Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?' (LINK) is used; it is available on their GitHub repository (LINK). However, a Python script must be run to generate the pretrained model file. To avoid this step, I have included code to download the model file directly from my Google Drive.

The model uses a 3D convolutional ResNet architecture. It is called 3D because the convolution filters slide over the image height, width and the temporal dimension, i.e. a sequence of consecutive frames from the video.

Input frames to the model must be in RGB format, resized to 112 x 112 pixels, and passed as an array of 16 consecutive frames from the video for a single prediction.
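A single prediction therefore consumes a 5-dimensional tensor of shape 1 x 3 x 16 x 112 x 112 (batch x channels x frames x height x width). Below is a minimal illustrative sketch of that shape only; the actual preprocessing is covered in the walkthrough further down.

# illustrative only: one model input = 1 clip x 3 colour channels x 16 frames x 112 x 112 pixels
import numpy as np
clip = np.zeros((1, 3, 16, 112, 112), dtype=np.float32)
print(clip.shape)  # (1, 3, 16, 112, 112)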

This model was trained on the Kinetics Human Action Video Dataset (LINK) and can recognize 400 different activities including yoga, making pizza, skating, etc.

The pretrained model file is in .onnx format, an open standard for machine learning interoperability [LINK]. The model file size is approximately 250 MB.

Code walkthrough

Below is a walkthrough of the complete Python code, which can also be accessed here: LINK

  1. Code snippet for downloading the files_activity_recognition.zip file from my Google Drive and unzipping it after downloading. The three required files are included: the pretrained model file, the class names text file and a sample input video file.
# downloading zip file from google drive
from google_drive_downloader import GoogleDriveDownloader as gdd

file_id = '1-2fFtZbQLsF3sSwpAZC_HOHyrCrF6GWz'
gdd.download_file_from_google_drive(file_id=file_id, dest_path='/content/files_activity_recognition.zip', unzip=True)

# unzipping file
!unzip files_activity_recognition.zip
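If the google_drive_downloader package is unavailable, the same archive can be fetched with the gdown package instead (an alternative sketched here, not part of the original walkthrough; assumes pip install gdown):

# alternative download using gdown (assumes the same file_id as above)
import gdown
gdown.download('https://drive.google.com/uc?id=' + file_id, '/content/files_activity_recognition.zip', quiet=False)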

2. Importing required libraries and defining various path locations

# importing required libraries
import numpy as np
import cv2
from time import time

# defining various input parameters
filepath_class_names = '/content/files_activity_recognition/class_names_list.txt'
filepath_model = '/content/files_activity_recognition/resnet-34_kinetics.onnx'
filepath_in_video = '/content/files_activity_recognition/video_making_pizza_resized.mp4'
filepath_out_video = '/content/output.mp4'

3. Loading the class names and the model file. OpenCV's dnn module is used for loading the model.

# loading class names
with open(filepath_class_names, 'r') as fh:
    class_names = fh.read().strip().split('\n')

# loading the model
model = cv2.dnn.readNet(filepath_model)
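If your OpenCV build was compiled with CUDA support (an assumption about your environment, not something the original code requires), inference can optionally be moved to the GPU:

# optional: run dnn inference on the GPU when OpenCV has CUDA support
model.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
model.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)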

4. Defining a function for preprocessing the frames before they are passed to the model. OpenCV's dnn.blobFromImages function is used to:

  1. resize the images to 112 x 112 pixels
  2. subtract the specified mean values from all pixels, a preprocessing step required by the model
  3. swap the red and blue channels to convert the images to RGB format. BGR is OpenCV's default format, so frames read from disk with OpenCV arrive in BGR, while the model expects RGB; swapRB=True performs this conversion.
  4. center crop the images after resizing (crop=True) so that the final image size is 112 x 112

The blob returned by cv2.dnn.blobFromImages is an array of shape N x C x H x W, where N is the number of frames (16), C is the number of channels (3 for R, G, B), and H and W are the height (112) and width (112) respectively.

The blob is transposed to C x N x H x W using np.transpose, since the 3D convolutions expect the channel dimension first. It is then reshaped to 1 x C x N x H x W, where the first dimension is the batch dimension with a single example in the batch. A single example consists of 16 consecutive frames of size 112 x 112 pixels with 3 color channels.

# defining function for preprocessing the frames
def preprocess(frames):
    model_img_w = 112  # as per model input image width
    model_img_h = 112  # as per model input image height
    mean = (114.7748, 107.7354, 99.4750)  # mean pixel values to subtract, as required by the pretrained model
    blob = cv2.dnn.blobFromImages(frames, scalefactor=1, size=(model_img_w, model_img_h), mean=mean, swapRB=True, crop=True)

    # (N, C, H, W) -> (C, N, H, W), then add a batch dimension -> (1, C, N, H, W)
    blob = np.transpose(blob, (1, 0, 2, 3))
    blob = np.expand_dims(blob, axis=0)
    return blob
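A quick sanity check of the returned shape, using dummy frames (purely illustrative):

# 16 dummy BGR frames should produce a blob of shape (1, 3, 16, 112, 112)
dummy_frames = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(16)]
print(preprocess(dummy_frames).shape)  # (1, 3, 16, 112, 112)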

5. Defining a helper function to retrieve the next 16 image frames from the video.

  1. vcap is a cv2.VideoCapture object used to read frames from the video via its .read() method
  2. grabbed is True if a frame was read successfully, otherwise False
  3. get_video_frames returns the next num_frames frames, or -1 if it cannot read all of them (end of video reached or a read error occurred)

def get_video_frames(vcap, num_frames):
    frames = []
    for i in range(num_frames):
        grabbed, frame = vcap.read()
        if not grabbed:
            print("No more frames to read")
            break
        frames.append(frame)

    if len(frames) < num_frames:
        return -1
    else:
        return frames
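The helper can be exercised on the sample video before wiring it into the main loop (illustrative usage only; a separate VideoCapture object is used so the main one is not consumed):

# illustrative usage: read the first 16 frames from the sample video
vcap_check = cv2.VideoCapture(filepath_in_video)
batch = get_video_frames(vcap_check, 16)
print(len(batch) if batch != -1 else "fewer than 16 frames available")
vcap_check.release()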

6. Defining a function that writes the predicted activity class onto a video frame before it is saved to the output video

def write_pred_on_frame(frame, pred_class_name):
    text = pred_class_name
    # black filled rectangle in the top-left corner as a background for the label
    cv2.rectangle(frame, (0, 0), (150, 30), (0, 0, 0), -1)
    cv2.putText(frame, text, (10, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 1)
    return frame

7. Instantiating cv2.VideoCapture and cv2.VideoWriter objects for reading and writing video frames respectively. fourcc is the four-character code of the codec used for encoding/decoding the video file.

vcap = cv2.VideoCapture(filepath_in_video)
if not vcap.isOpened():
    print("Error opening video")
else:
    width = int(vcap.get(3))
    height = int(vcap.get(4))
    fps = int(vcap.get(5))
    num_frames = int(vcap.get(7))
    duration = num_frames / fps
    print("frame width:{}, height:{}, fps:{}".format(width, height, fps))
    print("Video duration {} seconds".format(duration))

fourcc = cv2.VideoWriter_fourcc(*'XVID')  # XVID codec for writing MP4 files
vout = cv2.VideoWriter(filepath_out_video, fourcc, fps, (width, height), 1)  # 1 for coloured video
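The numeric property indices used above (3, 4, 5, 7) map to named constants in OpenCV; the following equivalent lookups return the same values and may be easier to read:

# equivalent property lookups using OpenCV's named constants
width = int(vcap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(vcap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = int(vcap.get(cv2.CAP_PROP_FPS))
num_frames = int(vcap.get(cv2.CAP_PROP_FRAME_COUNT))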

8. Finally, we get to the code for performing inference on the input video and writing the results to the output video. After running the code below, an output .mp4 video file is generated with frames annotated with the predicted activity.

  1. Loop until all input video frames have been read, fetching the next num_frames frames with get_video_frames
  2. Preprocess the frames
  3. Perform the prediction using the model.setInput and model.forward calls
  4. Find the class index and corresponding class name with the highest prediction score
  5. Annotate the output frames with the predicted class and write them to the output video using vout.write
  6. Release the vcap and vout objects after processing is finished

The sample input video used in the code is approximately 18 seconds long, and processing it on CPU on Google Colab (without GPU enabled) takes approximately 37 seconds.

# performing inference and writing to output video stream
num_frames = 16  # number of frames passed to the model for a single inference; fixed by the model and should not be changed
start = time()

while True:
    frames = get_video_frames(vcap, num_frames)
    if frames == -1:
        break

    frames_processed = preprocess(frames)
    model.setInput(frames_processed)
    pred = model.forward()  # resulting pred.shape will be (1, 400)
    pred_class_idx = np.argmax(pred)
    pred_class_name = class_names[pred_class_idx]

    for frame in frames:
        output_frame = write_pred_on_frame(frame, pred_class_name)
        vout.write(output_frame)

end = time()
print("Finished processing. Took {} seconds".format(end - start))

vcap.release()
vout.release()
Generated output video, predicted activity shown in top left corner
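If you are running the notebook in Google Colab (as assumed by the file paths used throughout), the annotated output video can be downloaded to your machine with:

# download the generated output video from Colab
from google.colab import files
files.download(filepath_out_video)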

Thanks for reading!!!

References

https://github.com/opencv/opencv/blob/master/samples/dnn/action_recognition.py

https://www.pyimagesearch.com/2019/11/25/human-activity-recognition-with-opencv-and-deep-learning/


Vasu Gupta

Computer Vision & Deep Learning Developer | Graduate from Stanford University & IIT Bhubaneswar | Enjoys exploring and building new things