Multi-Person Tracking (MPT) System

This project develops a Multi-Person Tracking (MPT) system designed to address pressing challenges in security surveillance. Built on YOLOv7 for person detection and StrongSORT for tracking, the system is engineered to perform in complex environments: it assigns a unique identifier to each individual and creates tracklets for continuous monitoring, which plays a crucial role in understanding object interactions and behaviors. The system employs ResNet as the backbone for feature extraction. The primary focus of the project is a comprehensive solution to the specific problem of enhancing security measures through intelligent tracking and re-identification of individuals in complex environments. The integration of YOLOv7, StrongSORT, and OSNet enables precise detection, tracking, and re-association of persons, overcoming the limitations of traditional surveillance methods. The project investigates the system's efficacy in strengthening security, optimizing traffic management, and responding proactively to incidents.

Methodology

Labelling and Classifying Persons The initial phase involves training a model to detect and classify people in video frames. This process begins with setting up the environment, which includes preparing the datasets and configuring the necessary tools and frameworks. Data collection is a critical step, where video footage from various environments is gathered. Each frame of this footage is meticulously labeled, marking individuals to create a comprehensive dataset. This dataset is then used to train the model, ensuring it can accurately identify and classify persons in diverse settings. An evaluation phase follows, where the model's accuracy and efficiency are assessed, leading to adjustments that improve its performance.
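
As an illustration of the labelling step, the sketch below converts pixel-coordinate person boxes into YOLO-format annotation files (one text file per frame, one "class x_center y_center width height" row per person, normalized to [0, 1]). The file name, image size, and box values are hypothetical; the actual annotation tooling may differ.

```python
# Hedged sketch: write person annotations in YOLO label format.
def to_yolo_row(box, img_w, img_h, class_id=0):
    """box = (x1, y1, x2, y2) in pixels -> YOLO-format label row."""
    x1, y1, x2, y2 = box
    xc = (x1 + x2) / 2 / img_w   # normalized box center x
    yc = (y1 + y2) / 2 / img_h   # normalized box center y
    w = (x2 - x1) / img_w        # normalized box width
    h = (y2 - y1) / img_h        # normalized box height
    return f'{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}'

# One .txt file per frame; the box below is an example labelled person.
with open('frame_000001.txt', 'w') as f:
    for person_box in [(120, 80, 220, 360)]:
        f.write(to_yolo_row(person_box, img_w=1920, img_h=1080) + '\n')
```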


Person Detection with YOLOv7 For person detection, the system utilizes YOLOv7, a state-of-the-art deep learning algorithm known for its speed and accuracy. The camera feeds serve as input to the YOLOv7 model, which processes the video frames in real time. The model detects persons within these frames and outputs bounding boxes around them, indicating their location and size. This detection phase is crucial, as it lays the groundwork for the subsequent tracking and re-identification processes.
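
A minimal detection sketch is shown below. It assumes the WongKinYiu/yolov7 repository's torch.hub entry point (which behaves much like YOLOv5's), a locally downloaded yolov7.pt checkpoint, and an illustrative video path; adapt all of these to your setup.

```python
# Hedged sketch: per-frame person detection with YOLOv7 via torch.hub.
import cv2
import torch

# Assumes the yolov7 repo's hubconf and a downloaded checkpoint.
model = torch.hub.load('WongKinYiu/yolov7', 'custom', 'yolov7.pt')
model.conf = 0.5  # confidence threshold for reported detections

cap = cv2.VideoCapture('camera_feed.mp4')  # or a live camera index
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame[:, :, ::-1])  # OpenCV BGR -> RGB
    # Keep only the 'person' class (index 0 in COCO).
    for *xyxy, conf, cls in results.xyxy[0].tolist():
        if int(cls) == 0:
            x1, y1, x2, y2 = map(int, xyxy)
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.imshow('detections', frame)
    if cv2.waitKey(1) == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
```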


Person Tracking with StrongSORT Once persons are detected, the StrongSORT algorithm takes over for tracking. The detections from YOLOv7, encapsulated in bounding boxes, are fed into the StrongSORT algorithm. StrongSORT excels in tracking the movement of individuals across frames, maintaining a consistent identifier for each person. This continuous tracking is vital for monitoring individual trajectories and understanding movement patterns within the camera’s field of view.
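
The hand-off from detector to tracker can be sketched as below. StrongSORT here stands for a hypothetical wrapper class, and detect_persons / draw_box are stand-ins for the detection and drawing code; real StrongSORT implementations (e.g. in the boxmot package) differ in their constructor and update() signatures.

```python
# Sketch of the detect-then-track loop; class and helper names are
# hypothetical stand-ins, not a specific library's API.
import numpy as np

tracker = StrongSORT(reid_weights='osnet_x1_0.pt', max_age=30)

for frame in video_frames:
    # Detections as rows of [x1, y1, x2, y2, confidence].
    dets = np.asarray(detect_persons(frame))
    # The tracker associates detections with existing tracks and
    # returns boxes carrying persistent per-person identifiers.
    for x1, y1, x2, y2, track_id in tracker.update(dets, frame):
        draw_box(frame, (x1, y1, x2, y2), label=f'ID {int(track_id)}')
```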


Person Re-identification using OSNet The OSNet algorithm is employed for person re-identification. It is designed to associate images of the same person captured at different times or camera angles throughout their trajectory. This re-identification is crucial for maintaining continuous tracking, especially when individuals leave and re-enter the camera’s view or move across multiple cameras.
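
A sketch of the re-identification step using OSNet embeddings via the torchreid library is given below. The FeatureExtractor utility and the osnet_x1_0 model name follow torchreid's public API as we understand it; the image paths and the similarity threshold are illustrative.

```python
# Sketch: OSNet appearance embeddings + cosine similarity for re-ID.
import torch
import torch.nn.functional as F
from torchreid.utils import FeatureExtractor

extractor = FeatureExtractor(
    model_name='osnet_x1_0',
    device='cuda' if torch.cuda.is_available() else 'cpu',
)

# Crops of (possibly) the same person seen at different times/cameras.
features = extractor(['crop_cam1.jpg', 'crop_cam2.jpg'])
f1, f2 = F.normalize(features, dim=1)  # unit-length embeddings

similarity = float(f1 @ f2)  # cosine similarity in [-1, 1]
print('same person' if similarity > 0.6 else 'different person')
```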


Feature Extraction of Detected Persons Feature extraction involves analyzing the detected persons to gather useful data. These features, extracted using YOLOv7 and StrongSORT, include attributes such as appearance, gait, and movement patterns. They are pivotal for object association and tracking, helping the system distinguish between different individuals and ensuring accurate tracking.
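
One common way such features are used for association is to blend an appearance distance with a motion term into a single cost, as in the sketch below; the 0.7/0.3 weighting is illustrative, not tuned.

```python
# Sketch: fuse appearance and motion cues into one association cost.
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return 1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def association_cost(track_emb, det_emb, box_iou, w_app=0.7):
    """Lower is better; blends appearance distance with (1 - IoU)."""
    return w_app * cosine_distance(track_emb, det_emb) \
        + (1.0 - w_app) * (1.0 - box_iou)
```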


ID Propagation Algorithm The system explores various algorithms for tracking objects across different camera views. These include SORT (Simple Online and Realtime Tracking), StrongSORT, Kalman Filtering, and the Hungarian Algorithm. Each algorithm has its strengths in handling object tracking and identity propagation, ensuring seamless tracking as individuals move through
various camera coverage areas.
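
As a concrete example of one building block, the sketch below frames track-to-detection assignment as a linear assignment problem over an IoU-based cost matrix and solves it with the Hungarian algorithm via scipy's linear_sum_assignment. This is a simplified, IoU-only version; StrongSORT additionally uses appearance features and Kalman-predicted boxes.

```python
# Sketch: Hungarian matching of tracked boxes to new detections.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(track_boxes, det_boxes, iou_threshold=0.3):
    """Return (track_idx, det_idx) pairs whose IoU clears the threshold."""
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes]
                     for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)  # minimizes total cost
    return [(r, c) for r, c in zip(rows, cols)
            if cost[r, c] <= 1.0 - iou_threshold]
```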

Program Development Program development involves setting up the StrongSORT algorithm and developing the system for frame-by-frame extraction of details. This phase also includes defining the confidence score for classifying and boxing individuals and filtering the classes to be detected. The performance of the system, measured in frames per second, is a critical metric, indicating the system's ability to process video feeds in real time.
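
The confidence threshold, class filter, and frames-per-second measurement described above could be wired together as in the sketch below, where detect() and tracker are hypothetical stand-ins for the YOLOv7 and StrongSORT components.

```python
# Sketch of the frame-by-frame loop with confidence/class filtering
# and an FPS counter; detect() and tracker are hypothetical.
import time

CONF_THRESHOLD = 0.5  # minimum score for a detection to be kept
PERSON_CLASS = 0      # COCO class index for 'person'

frame_count = 0
start = time.perf_counter()
for frame in video_frames:
    detections = [d for d in detect(frame)
                  if d.conf >= CONF_THRESHOLD and d.cls == PERSON_CLASS]
    tracker.update(detections, frame)
    frame_count += 1

elapsed = time.perf_counter() - start
print(f'throughput: {frame_count / elapsed:.1f} frames per second')
```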


Testing and Evaluation The final phase is extensive testing
and evaluation. This involves assessing the system’s performance in
real-world scenarios and measuring its accuracy in detecting, tracking, and re-identifying individuals. Key performance indicators like
inference time and system responsiveness are measured, providing
insights into the system’s efficiency and effectiveness. This phase
is crucial for identifying areas of improvement and ensuring the system meets the required standards for real-time surveillance and
tracking.

Assessing multi-object tracking (MOT) is particularly difficult: existing metrics overemphasize the importance of either detection or association. Hence we use the novel MOT evaluation metric called higher-order tracking accuracy (HOTA). This metric explicitly balances the effects of accurate detection, association, and localization in a single aggregated value for comparing trackers. HOTA decomposes into a family of sub-metrics that measure each of the five basic error types separately, allowing a clear analysis of tracking performance. HOTA's efficacy has been evaluated on the MOTChallenge benchmark, demonstrating its ability to capture important aspects of MOT performance that established metrics do not assess.

MOTA (Multi-Object Tracking Accuracy) matches at the detection level. In every frame, a bijective (one-to-one) mapping is built between predicted detections (prDets) and ground-truth detections (gtDets). Matched prDet/gtDet pairs (correct predictions) are true positives (TPs); any remaining unmatched prDets (extra predictions) become false positives (FPs); any unmatched gtDets (missed objects) are classified as false negatives (FNs). prDets and gtDets may be matched only if they are sufficiently spatially similar, so MOTA requires defining a similarity score, S, between detections.

IDF1, in contrast, computes a bijective (one-to-one) mapping between the sets of ground-truth trajectories (gtTraj) and predicted trajectories (prTraj), rather than matching at the detection level. This induces new kinds of detection matches: identity true positives (IDTPs) are detection matches on overlapping parts of matched trajectories, while identity false negatives (IDFNs) and identity false positives (IDFPs) are the remaining gtDets and prDets from the non-overlapping parts of matched trajectories and from unmatched trajectories.

Track-mAP (mean average precision) compares predictions with the ground truth at the trajectory level. It requires a trajectory similarity score, Str, between trajectories (as opposed to MOTA and IDF1, which use a detection similarity score, S), as well as a threshold, αtr, below which trajectories are not permitted to match. Among the prTrajs whose similarity with a gtTraj exceeds the threshold, the one with the highest confidence score is matched to that gtTraj.

HOTA is intended to: (i) offer a single tracker evaluation score that fairly combines all the different aspects of tracking evaluation; (ii) assess long-term, higher-order tracking association; and (iii) decompose into sub-metrics that allow evaluation of the various elements of tracker effectiveness. In our quantitative analysis, we use the evaluation script available in the TrackEval repository, since it is the official method of evaluating results on the MOTChallenge benchmark.
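
Once the error counts above are available, the aggregate scores reduce to simple ratios. A minimal sketch from the standard definitions follows; the per-frame matching that produces the counts is omitted, and IDSW denotes identity switches, which MOTA also penalizes.

```python
# Minimal sketches of the aggregate tracking scores.
def mota(fn: int, fp: int, idsw: int, num_gt_dets: int) -> float:
    """MOTA = 1 - (|FN| + |FP| + |IDSW|) / |gtDet|."""
    return 1.0 - (fn + fp + idsw) / num_gt_dets

def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2*|IDTP| / (2*|IDTP| + |IDFP| + |IDFN|)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)

def hota_alpha(det_a: float, ass_a: float) -> float:
    """HOTA at one localization threshold: geometric mean of
    detection accuracy (DetA) and association accuracy (AssA)."""
    return (det_a * ass_a) ** 0.5
```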
For the input video data, we used three videos from the MOT15 dataset: ETH-Sunnyday, TUD-Campus, and TUD-Stadtmitte. ETH-Sunnyday is a street scene on a sunny day filmed from a moving platform. TUD-Campus is a short sequence with side-view pedestrians. TUD-Stadtmitte is a static camera at about 2 meters height showing people walking on the street. We run our tests for four categories of metrics: HOTA, CLEAR, Identity, and Count, and compare all results with the baseline values.

In most of the metrics our solution falls behind the baseline, and further analysis is needed to understand the shortcomings. Only for the LocA and LocA(0) metrics does our solution perform equally well or, in some cases, better than the baseline. Next, we evaluate the baseline and our work on the CLEAR metrics, an exhaustive set of 17 sub-metrics; the graphs include only the most important ones. In this comparison, our solution performs worse than the baseline on all metrics. An outlier is the CLR-Re metric on the ETH-Sunnyday dataset, where our work does better than the baseline. For the third category, Identity, our work performs comparably to, and in some cases better than, the baseline.

We test our work against the baseline for the Count metric category. We note that our solution produces more detections and IDs than the ground-truth values, which tells us that the model is re-identifying more objects than necessary; this is a future improvement we could work on. Apart from the quantitative metrics, we also measured average inference time: about 30 ms for YOLOv7 and under 20 ms for StrongSORT. For evaluation, we used a device with a GeForce RTX 3050 GPU with 8 GB of memory, which allowed us to achieve detections in real time.
