Pedestrian Detection and Tracking in Video Surveillance System: Issues, Comprehensive Review, and Challenges

Pedestrian detection and monitoring in a surveillance system are critical for numerous application areas, including unusual event detection, human gait analysis, evaluation of congested or crowded areas, gender classification, and fall detection in elderly people. Researchers' primary focus is to develop surveillance systems that can work in a dynamic environment, but there are major issues and challenges involved in designing such systems. These challenges occur at three different levels of pedestrian detection, viz. video acquisition, human detection, and tracking. The challenges in acquiring video include illumination variation, abrupt motion, complex background, shadows, and object deformation. Human detection and tracking challenges include varied poses, occlusion, and tracking in crowded areas. These challenges result in a lower recognition rate. A brief summary of surveillance systems, along with a comparison of pedestrian detection and tracking techniques in video surveillance, is presented in this chapter. The publicly available pedestrian benchmark databases and future research directions on pedestrian detection are also discussed.


Introduction
The word surveillance combines the French prefix sur, meaning "over," with the root veiller, meaning "to watch." In distinction to surveillance, Steve Mann in [1] introduces the term "sousveillance." In contrast to sur, sous means "under," i.e., it signifies that the camera is physically with the human (e.g., a camera mounted on the head). Both surveillance and sousveillance are used for continuous attentive observation of a suspect, prisoner, person, group, or ongoing behavior and activity in order to collect information. To improve conventional security systems, the use of surveillance systems has been increasingly encouraged by government and private organizations. Currently, surveillance systems are widely investigated and used effectively in several applications such as (a) transport systems (railway stations, airports, urban and rural motorway road networks), (b) government agencies (military base camps, prisons, strategic infrastructures, radar centers, laboratories, and hospitals), and (c) industrial environments, automated teller machines (ATMs), banks, shopping malls, and public buildings. Most surveillance systems at public and private places depend on a human operator, who detects any suspicious pedestrian activity in a video scene [2,3]. A pedestrian is a person who is walking or running on the street. In some communities, a person using a wheelchair is also considered a pedestrian. The most challenging task for automatic video surveillance is to detect and track suspicious pedestrian activity. For a real-time dynamic environment, learning-based methods do not provide an appropriate solution for scene analysis because it is difficult to obtain prior knowledge about all the objects. Still, learning-based methods are adopted due to their accuracy and robustness.
In the literature, several researchers have used deep-learning (DL) based models for classification in video surveillance in preference to traditional approaches. DL algorithms are used in many applications such as face recognition, image classification, speech recognition, text-to-speech generation, handwriting transcription, machine translation, medical diagnosis, autonomous cars (drivable area, lane keeping, pedestrian and landmark detection for the driver), digital assistants, ads, search, social recommendations, game playing, and content-based image retrieval. The advantage of DL approaches is their ability to learn complex scene features with very little preprocessing of raw data and their capability of learning from unlabeled raw data efficiently. Most recently, a deep-learning technique called the convolutional neural network (CNN) has shown high performance over conventional methods in video processing research. CNNs can efficiently handle complex and large data.
During the past decade, video surveillance systems have evolved from simple video acquisition systems to real-time intelligent autonomous systems. Figure 1 shows a timeline of the evolution of video surveillance.
Visual surveillance systems came into existence in 1942. Initially, closed-circuit television (CCTV) was used commercially as a security system, mainly for indoor environments. The main concerns with the initial CCTV systems were that (1) voltage signals were not openly transmitted in a distributed environment, (2) CCTV depended on strategic placement of cameras according to the geographical structure of the workplace, and (3) a human observer was required to monitor the recorded CCTV footage [4]. CCTV loses its primary advantage as an active, real-time medium because the video footage can be used only after an incident occurs, as legal evidence or a forensic tool. Next, in 1996, IP-based surveillance cameras were introduced by Axis. They overcome the limitations of the initial CCTV cameras: (1) an IP-based camera transmits raw images instead of voltage signals over a secure TCP/IP transmission channel, (2) an IP camera comes with video analytics, i.e., the camera itself can be used for analyzing the images, (3) an Ethernet cable can be used to supply power instead of a dedicated power supply, and (4) two-way audio signals can be transmitted over a single dedicated network [5]. Recent surveillance systems support remote monitoring on handheld devices such as mobile phones.
Video surveillance systems can be categorized based on camera system, application, and architecture. Camera systems include single-camera, multi-camera, fixed-camera, moving-camera, and hybrid-camera systems. Application-based systems include object tracking and recognition, ID re-identification, customized event notification and alert-based systems, and behavior analysis. Finally, architecture-based systems include standalone, cloud-based, and distributed systems [6]. A general framework of an automated visual surveillance system is shown in Figure 2 [7][8][9]. Normally, a video surveillance system is based on multiple cameras; the videos from the cameras are transferred through the network and stored in a database.
The data need to be fused before further processing. This can be done using data fusion techniques such as multi-sensor level, track-to-track, and appearance-to-appearance fusion [10][11][12]. After data fusion, the following steps are performed. The traditional video surveillance system consists of various steps such as (1) motion and object detection, (2) object classification, (3) object tracking, (4) behavior understanding and activity analysis, (5) pedestrian identification, and (6) data fusion. Each stage of the automated visual surveillance system is described as follows.
Figure 2. A general framework of an automated visual surveillance system [7][8][9].

Motion and object detection
Object detection is the first step; it deals with detecting instances of semantic objects of a certain class, such as humans, buildings, or cars, in a video sequence. The different approaches to object detection are frame-to-frame differencing, background subtraction, and motion analysis using optical flow techniques [13]. These approaches typically use extracted features and learning algorithms to recognize instances of an object category. The object detection process is divided into two categories. The first, object detection, mainly includes three types of methods: background subtraction, optical flow, and spatiotemporal filtering. The second, object classification, primarily uses visual features in shape-based, motion-based, and texture-based methods [13]. Motion detection is one of the key problems in video surveillance, as it is not only responsible for the extraction of moving objects but is also critical to many applications, including object-based video encoding, human motion analysis, and human-machine interaction [14,15].
After object detection, the next step is motion segmentation. This step detects regions corresponding to moving objects such as humans or vehicles. It mainly focuses on detecting moving regions in video frames and creating a database for tracking and behavior analysis. Motion detection identifies a change in the position of an object relative to its surroundings, or a change in the surroundings relative to an object. Motion detection can also be achieved using electronic motion sensors, which detect motion in the real environment.
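As an illustration of the frame-to-frame difference approach, the following minimal Python sketch marks a pixel as moving when its intensity changes by more than a threshold between two consecutive grayscale frames. The function name, the threshold of 25, and the toy frames are assumptions chosen for illustration; they are not taken from the cited works.

```python
def frame_difference(prev_frame, curr_frame, threshold=25):
    """Return a binary motion mask for two grayscale frames (lists of rows).

    A pixel is set to 1 when its absolute intensity change exceeds
    `threshold` (an assumed, illustrative value), otherwise 0.
    """
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


# Toy 3x4 frames: a bright object appears in the middle row, and one
# pixel in the last row changes only slightly (sensor noise).
prev_frame = [[10, 10, 10, 10],
              [10, 10, 10, 10],
              [10, 10, 10, 10]]
curr_frame = [[10, 10, 10, 10],
              [10, 200, 200, 10],
              [10, 10, 10, 12]]

mask = frame_difference(prev_frame, curr_frame)
for row in mask:
    print(row)
```

The small change in the last row stays below the threshold and is ignored, which is the usual reason for thresholding the difference image rather than using it directly.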

Object tracking
Tracking of objects in a video sequence means identifying the same object across a sequence of frames using the object's unique characteristics, represented in the form of features. Generally, the detection process is followed by tracking in video surveillance systems. Tracking is performed from one frame to the next using tracking algorithms such as kernel-based tracking, point-based tracking, and silhouette-based tracking [16].
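A very simple form of point-based tracking can be sketched as follows: each tracked object is represented by its centroid, and detections in the next frame are assigned to the nearest existing track. This is only a greedy nearest-neighbour sketch under assumed names and an assumed distance gate of 50 pixels; the point-based trackers surveyed in [16] are more elaborate.

```python
import math


def associate(tracks, detections, max_distance=50.0):
    """Greedily match track centroids to detection centroids.

    tracks:     dict mapping a track id to its last known (x, y) centroid
    detections: list of (x, y) centroids detected in the current frame
    Returns a dict mapping track id -> index of the matched detection.
    `max_distance` is an assumed gating threshold in pixels.
    """
    # All candidate pairs, closest first.
    pairs = sorted(
        (math.dist(pos, det), track_id, j)
        for track_id, pos in tracks.items()
        for j, det in enumerate(detections)
    )
    assigned, used = {}, set()
    for distance, track_id, j in pairs:
        if distance > max_distance:
            break  # remaining pairs are even farther apart
        if track_id in assigned or j in used:
            continue
        assigned[track_id] = j
        used.add(j)
    return assigned


tracks = {1: (100.0, 200.0), 2: (300.0, 220.0)}
detections = [(305.0, 224.0), (104.0, 203.0), (600.0, 50.0)]
matches = associate(tracks, detections)
print(matches)
```

The third detection is not matched to any track; in a complete tracker it would typically start a new track.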

Behavior and activity analysis
In some conditions, it is necessary to analyze the behavior of people and determine whether it is suspicious, such as the behavior of pedestrians at a crowded place (e.g., public marketplaces and government offices). In this step, the motion of objects is recognized from the video scene and a description of the action is generated. Ahmed Elaiw et al. [80] proposed a critical analysis and modelling strategy of human crowds with the intention of selecting the most relevant scale out of three approaches: (1) microscopic, in which pedestrians are detected individually based on location and velocity, and the motion parameter is neglected, (2) mesoscopic, in which pedestrians are detected based on position and velocity, depending on a distribution function, and (3) macroscopic, in which pedestrians are identified based on the average pedestrian quantity and moment of pedestrians. It can be used for efficient decision making in critical situations when human crowd safety is important. The safety of human crowds depends on the number and density of pedestrians moving through highly crowded places.

Person identification
The last step is human identification. Human face and gait are the main biometric features that can be used for personal identification in visual surveillance systems after a behavior analysis [8].
The goal of this chapter is to discuss the issues and challenges involved in designing a visual surveillance system. It also groups the pedestrian detection and tracking methods used for moving and fixed cameras into broad categories and gives an informative analysis of the relevant methods in each category. The main contributions of this chapter are as follows:
• A comparative analysis of publicly available pedestrian benchmark datasets with their use, specification, and environmental limitations
• An analysis of the issues and challenges of pedestrian detection and tracking in video sequences captured by moving and fixed cameras
• A categorization of pedestrian detection and tracking methods based on the general concept of the methods belonging to each category, with a description of the proposed improvements for each method
This chapter is organized into the following sections. Section 1 gives an introduction, the importance of video surveillance systems, recent advancements, and the general framework of video surveillance. Section 2 discusses different benchmark pedestrian datasets used to compare pedestrian detection and tracking methods. Section 3 presents a detailed discussion of the issues and challenges of pedestrian detection and tracking in video sequences. Section 4 groups the pedestrian detection and tracking methods for moving and fixed cameras into different categories and describes their general concept with the improvements in each category. Section 5 discusses possible future directions. Finally, the chapter concludes with a discussion in Section 6.

Pedestrian datasets reported in literature
The state-of-the-art methods for pedestrian detection and tracking, evaluated on benchmark databases, include adaptive local binary pattern (LBP) and histogram of oriented gradient (HOG) features integrated into a multiple kernel tracker, and spatiotemporal context information-based methods [10]. In this section, we outline the benchmark datasets that have been commonly used by researchers. Figure 3 shows a sample image from each pedestrian dataset. Next, we discuss each database with its specification, use, and environmental constraints, followed by a comparative analysis.

Massachusetts Institute of Technology (MIT) pedestrian dataset
It is one of the first pedestrian datasets; it is fairly small and relatively well solved at this point. The dataset contains 709 pedestrian images taken in city streets, of which 509 are training images and 200 are test images. Each image contains either a front or a back view with a relatively limited range of poses [11,12].

Caltech pedestrian dataset
The Caltech dataset consists of 640 × 480 resolution video taken from a vehicle driving through regular traffic in an urban environment. About 250,000 frames with a total of 350,000 bounding boxes and 2300 unique pedestrians were annotated for testing and training purposes. The annotation includes bounding boxes for each pedestrian walking on the street and detailed occlusion labels for each object captured in the video sequence. These annotations are used for validating the accuracy of pedestrian detection and tracking algorithms [10].
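Bounding-box annotations of this kind are commonly compared with detector output through an overlap measure. A minimal sketch of intersection over union (IoU) follows; the (x1, y1, x2, y2) box format and the example boxes are assumptions for illustration, and the evaluation protocol of any particular benchmark may differ.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


annotation = (0, 0, 10, 10)   # hypothetical ground-truth box
detection = (5, 0, 15, 10)    # hypothetical detector output
overlap = iou(annotation, detection)
print(round(overlap, 3))
```

A detection is then usually counted as correct when its overlap with an annotated box exceeds a chosen threshold; the threshold value is a protocol decision and is not specified here.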

General Motors-Advanced Technical Center (GM-ATCI) pedestrian dataset
The GM-ATCI dataset is a rear-view pedestrian database captured using a vehicle-mounted standard automotive rear-view display camera for evaluating rear-view pedestrian detection. In total, the dataset contains 250 clips with a total duration of 76 min and over 200K annotated pedestrian bounding boxes. The dataset was captured at different locations, including indoor and outdoor parking lots, city roads, and private driveways. It was collected in both day and night scenarios, with different weather and lighting conditions [15].

Daimler pedestrian dataset
The pedestrian images were captured from a vehicle-mounted calibrated stereo camera rig in an urban environment. This dataset contains tracking information and a large number of labeled bounding boxes with a float disparity map and a ground truth shape image. The training set contains 15,560 pedestrian samples with 6744 labeled pedestrians, and the testing set contains more than 21,790 images with 56,492 pedestrian labels [15].

National Information and Communication Technology Australia (NICTA) pedestrian dataset
It is a large-scale urban dataset collected in multiple cities and countries. The dataset contains around 25,551 unique pedestrians, allowing for a dataset of over 50K images with mirroring, and annotations for validating the accuracy of detection and tracking algorithms [16].

Swiss Federal Institute of Technology (ETH) pedestrian dataset
It is an urban dataset captured from a stereo rig mounted on a stroller. The database is used for pedestrian detection and tracking from moving platforms in an urban scenario. The dataset contains traffic agents such as cars and pedestrians. When observing a traffic scene from inside a vehicle, one can predict their further motion, or even interpret their intention. At the same time, one needs to stay clear of any obstacles, remain on the assigned road, and read or interpret any traffic signs on the side of the street. On top of that, a human is able to assess the situation: when close to a school or pedestrian crossing, one ideally adapts one's driving behavior [17].

TUD-Brussels pedestrian dataset
This dataset consists of image pairs recorded in a crowded urban setting from a moving platform with an onboard camera, representing a challenging automotive safety scenario in an urban environment [18].

National Institute for Research in Computer Science and Automation (INRIA) pedestrian dataset
INRIA is currently one of the most popular static pedestrian detection datasets. It contains moving people with significant variation in appearance, pose, clothing, background, illumination, coupled with moving cameras and backgrounds. Each pair shows two consecutive frames [19].

PASCAL visual object classes (VOC) 2007 and 2012 dataset
This is a static object dataset with diverse object views and poses. The goal of the visual object classes challenge is to recognize objects from a number of visual object classes in realistic scenes. The 20 selected object classes fall into categories that include (1) person, (2) animal, and (3) vehicle [20].

Microsoft Common Object in Context (COCO) 2018 dataset
COCO is a recent dataset created by Microsoft [22]. The dataset is designed to spur object detection research with a focus on detecting objects in context. The annotations include instance segmentations for objects belonging to 80 categories, stuff segmentations for 91 categories, keypoint annotations for person instances, and five captions per image. The COCO 2018 dataset challenges are (1) object detection with segmentation masks, (2) panoptic segmentation, (3) person keypoint estimation, and (4) dense pose detection. Figure 3(g) shows sample images from the MS COCO dataset.

Mapillary vistas research dataset
The Mapillary Vistas panoptic segmentation task targets the full perception stack for scene segmentation in street images [22]. Panoptic segmentation addresses both stuff and thing classes, unifying the typically distinct semantic and instance segmentation tasks.
A comparison of these datasets and their application to video surveillance systems is shown in Table 1. The comparison is performed in terms of the application of the dataset, its size, the environment scenarios in which it was created, and the type of annotation details used for testing, training, and validating detection and tracking algorithm performance. These datasets are used by researchers for testing the performance of their respective pedestrian detection and tracking algorithms.

Issues and challenges of pedestrian detection and tracking
A moving object is a nonrigid entity that moves over time in the image sequences of a video captured by a fixed or moving camera. In a video surveillance system, the region of interest is a human being that needs to be detected and tracked in the video [23]. However, this is not an easy task because of the many challenges and difficulties involved. These challenges occur at three different levels of pedestrian detection: video acquisition, human detection, and tracking. The challenges in acquiring video include illumination variation, abrupt motion, complex background, shadows, and object deformation. Human detection and tracking challenges include varied poses, occlusion, and tracking in crowded areas. Each of these issues and challenges is presented in this section.

Problems related to camera
Many factors related to video acquisition systems, acquisition methods, compression techniques, and the stability of cameras (or sensors) can directly affect the quality of a video sequence. In some cases, the device used for video acquisition might limit the design of object detection and tracking (e.g., when color information is unavailable or when the frame rate is very low). Moreover, block artifacts (as a result of compression) and blur (as a result of camera vibration) reduce the quality of video sequences [36]. Noise is another factor that can severely deteriorate the quality of image sequences. Besides, different cameras have different sensors, lenses, resolutions, and frame rates, producing different image qualities. A low-quality image sequence can affect moving object detection algorithms. Figure 4 shows an example of each challenge.

Camera motion
When detecting moving objects in the presence of moving cameras, the need for estimating and compensating for the camera motion is inevitable. However, this is not an easy task because of possible changes in the camera's depth and its complex movements. Many works consider an easier scenario with simple camera movements, i.e., pan-tilt-zoom (PTZ) cameras. This limited movement allows a planar homography to be used to compensate for camera motion, which results in a mosaic (or panorama) background for all frames of the video sequence [37].
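The idea of compensating camera motion with a planar homography can be sketched as follows. A point from the previous frame is warped into the current frame with a 3 × 3 matrix H; a background point then shows almost no residual motion, while an independently moving object does. The matrix below is an assumed pure translation (a small pan) and the point coordinates are made up; estimating H from real images is outside the scope of this sketch.

```python
import math


def apply_homography(H, point):
    """Map an (x, y) point through a 3x3 homography given as nested lists."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)


def residual_motion(H, prev_point, curr_point):
    """Distance between the observed point and its camera-compensated prediction."""
    px, py = apply_homography(H, prev_point)
    return math.hypot(curr_point[0] - px, curr_point[1] - py)


# Assumed homography: the image content shifts by (-12, +3) pixels
# between frames because of a small camera pan.
H = [[1, 0, -12],
     [0, 1, 3],
     [0, 0, 1]]

warped = apply_homography(H, (100, 50))
background_residual = residual_motion(H, (100, 50), (88, 53))
pedestrian_residual = residual_motion(H, (200, 80), (195, 83))
print(warped, background_residual, pedestrian_residual)
```

Thresholding the residual separates pixels explained by camera motion alone from pixels that belong to moving objects.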

Nonrigid object deformation
In some cases, different parts of a moving object might have different movements in terms of speed and orientation, for instance, a walking dog wagging its tail or a moving tank rotating its turret. When detecting such moving objects, most algorithms detect the different parts as different moving objects. This poses an enormous challenge, especially for nonrigid objects and in the presence of moving cameras. In Hou et al. [40], articulated models were proposed for moving nonrigid objects to handle nonrigid object deformation. In these models, each part of an articulated object is allowed to have different movements. It can be concluded that local features of a moving object, along with updated background models, are more efficient for dealing with this challenge.

Illumination variation
The lighting conditions of the scene and the target might change due to the motion of the light source, different times of day, reflection from bright surfaces, weather in outdoor scenes, partial or complete blockage of the light source by other objects, etc. The direct impact of these variations is a change in background appearance, which causes false positive detections for methods based on background modeling. Thus, it is essential for these methods to adapt their model to illumination variation. Meanwhile, because the object's appearance changes under illumination variation, appearance-based tracking methods may not be able to track the object in the sequence [23][24][25][26][27][28]. Thus, these methods need to use features that are invariant to illumination.
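One example of a feature that tolerates uniform brightness changes is the local binary pattern (LBP) mentioned earlier in this chapter. The sketch below computes the basic 8-neighbour LBP code of a pixel, not the adaptive variant used in the cited work, and shows that adding a constant brightness offset leaves the code unchanged. The neighbour ordering and the toy patch are assumptions.

```python
def lbp_code(image, r, c):
    """Basic 8-neighbour LBP code of the pixel at row r, column c.

    Each neighbour whose intensity is >= the centre contributes one bit.
    Neighbours are visited clockwise starting at the top-left pixel
    (an assumed ordering; other conventions exist).
    """
    center = image[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
# Same patch under a uniformly brighter illumination (+40 on every pixel).
brighter_patch = [[value + 40 for value in row] for row in patch]

code = lbp_code(patch, 1, 1)
brighter_code = lbp_code(brighter_patch, 1, 1)
print(code, brighter_code)
```

Because the code depends only on the ordering of intensities within the neighbourhood, any monotonic brightness change produces the same pattern; non-uniform changes such as cast shadows are not covered by this property.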

Presence of abrupt motion
Figure 4. (a) ... [39]. (b) An example of appearance change challenge (Dudek in the Ross dataset [39]). (c) An example of abrupt motion challenge (Motocross in the Kalal dataset [50]). (d) An example of occlusion challenge (car in the Kalal dataset [50]). (e) An example of free camera motion in the Michigan University dataset [10]. (f) An example of dynamic background challenge (Kitesurf in the Zhang dataset [60]). (g) An example of shadow challenge (pedestrian 4 in the Kalal dataset [50]). (h) An example of camera panning in the CDNET database [10]. (i) An example of camera zooming in the CDNET database [10]. (j) An example of a nonrigid moving object in a video sequence [67].
Sudden changes in the speed and direction of the object's motion, or sudden camera motion, are another challenge of video acquisition that affects object detection and tracking. If the object or the camera moves very slowly, temporal differencing methods may fail to detect the portions of the object coherent with the background [31]. Meanwhile, very fast motion produces a trail of ghost detected regions. So, if the object's or camera's motion is not considered, the object cannot be detected correctly by methods based on background modeling. On the other hand, for tracking-based methods, prediction of motion becomes hard or even impossible; as a result, the tracker might lose the target. Even if the tracker does not lose the target, the unpredictable motion can introduce a greater amount of error in the algorithms [32].
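The effect of abrupt motion on motion prediction can be illustrated with a constant-velocity predictor, one of the simplest motion models a tracker might assume. The model choice and the sample positions below are assumptions for illustration only.

```python
import math


def predict_constant_velocity(prev_pos, curr_pos):
    """Predict the next (x, y) position assuming the last displacement repeats."""
    return (2 * curr_pos[0] - prev_pos[0], 2 * curr_pos[1] - prev_pos[1])


def prediction_error(prev_pos, curr_pos, next_pos):
    """Distance between the predicted and the actually observed next position."""
    return math.dist(predict_constant_velocity(prev_pos, curr_pos), next_pos)


# Smooth motion: the target keeps moving 5 pixels to the right per frame.
smooth_error = prediction_error((0, 0), (5, 0), (10, 0))

# Abrupt motion: the target suddenly jumps 40 pixels downward instead.
abrupt_error = prediction_error((0, 0), (5, 0), (5, 40))

print(smooth_error, round(abrupt_error, 2))
```

If the tracker only searches a small window around the predicted position, an error of this size means the target falls outside the window, which is one way a tracker loses its target.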

Complex background
The background may be highly textured, especially in natural outdoor environments, where textures are highly variable. Moreover, the background may be dynamic, i.e., it may contain movement (e.g., a fountain, moving clouds, traffic lights, swaying trees, water waves). Such movement needs to be treated as background in many moving object detection algorithms, and it can be periodic or nonperiodic [34].
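One way to absorb small repetitive background movement is to model each pixel statistically instead of with a single reference value. The sketch below keeps a running mean and variance for one pixel and flags a value as foreground only when it deviates by more than k standard deviations. It is a simplified single-Gaussian illustration under assumed parameter values, not the method of any cited work.

```python
class RunningGaussianPixel:
    """Running mean/variance background model for a single grayscale pixel.

    alpha (learning rate), k (deviation threshold), initial_std and
    min_std are assumed, illustrative parameter values.
    """

    def __init__(self, initial, alpha=0.1, k=2.5, initial_std=10.0, min_std=2.0):
        self.mean = float(initial)
        self.var = initial_std ** 2
        self.alpha = alpha
        self.k = k
        self.min_std = min_std

    def observe(self, value):
        """Return True if `value` is foreground; update the model otherwise."""
        std = max(self.var ** 0.5, self.min_std)
        is_foreground = abs(value - self.mean) > self.k * std
        if not is_foreground:
            # Only background samples are blended into the model.
            diff = value - self.mean
            self.mean += self.alpha * diff
            self.var = (1 - self.alpha) * self.var + self.alpha * diff * diff
        return is_foreground


pixel = RunningGaussianPixel(initial=100)

# A flickering background pixel (e.g. water waves) alternating 100 / 110.
flicker_flags = [pixel.observe(v) for v in [100, 110] * 20]

# A much brighter value, as if a pedestrian passed over the pixel.
pedestrian_flag = pixel.observe(200)
print(any(flicker_flags), pedestrian_flag)
```

The flicker widens the learned variance enough to be treated as background, whereas the large jump is still reported as foreground.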

Shadows
The presence of shadows in video image sequences complicates the task of moving object detection. Shadows are created when an object occludes the light source. If the object does not move during the sequence, the resulting shadow is considered static and can effectively be incorporated into the background. However, a dynamic shadow, caused by a moving object, has a critical impact on accurately detecting moving objects, since it has the same motion properties as the moving object and is tightly connected to it. Shadows can often be removed from the images of a sequence using their observed properties such as color, edges, and texture, or by applying a model based on prior information such as illumination conditions and moving object shape [35,47,48]. However, dynamic shadows are still difficult to distinguish from moving objects, especially in outdoor environments where the background is usually complex.
Next, human detection and tracking issues and challenges are discussed in brief. They include varying poses, occlusion, and tracking in crowded areas.
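A color-based shadow test can be sketched as follows: a shadow pixel is assumed to be darker than the background at the same location while keeping roughly the same chromaticity. The brightness-ratio limits and the chromaticity tolerance are assumed values for illustration, and practical methods combine such a test with edge and texture cues.

```python
def is_shadow(pixel, background, min_ratio=0.4, max_ratio=0.9, chroma_tol=0.05):
    """Heuristic shadow test on (r, g, b) tuples.

    A pixel is labelled shadow when it is darker than the background
    (brightness ratio within [min_ratio, max_ratio]) and its normalized
    r and g chromaticities stay within `chroma_tol` of the background's.
    All thresholds are assumed, illustrative values.
    """
    p_sum, b_sum = sum(pixel), sum(background)
    if p_sum == 0 or b_sum == 0:
        return False
    ratio = p_sum / b_sum
    if not (min_ratio <= ratio <= max_ratio):
        return False
    return all(
        abs(p / p_sum - b / b_sum) <= chroma_tol
        for p, b in zip(pixel[:2], background[:2])
    )


background = (120, 100, 80)
shadow_pixel = (60, 50, 40)      # same colour, half as bright
object_pixel = (20, 20, 110)     # similar brightness drop, different colour

print(is_shadow(shadow_pixel, background), is_shadow(object_pixel, background))
```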

Pedestrian occlusion
The object may be occluded by other objects in the scene. In this case, some parts of the object can be camouflaged or hidden behind other objects (partial occlusion), or the object can be completely hidden by others (complete occlusion). As an example, consider the target to be a pedestrian walking on the sidewalk. The pedestrian may be occluded by trees, cars in the street, other pedestrians, etc. Occlusion severely affects the detection of objects in background modeling methods, where the object is completely missing or separated into unconnected regions [33]. If occlusion occurs, the object's appearance model can change for a short time, which can cause some object tracking methods to fail.

Pose variation: moving object appearance changes
In real scenarios, most objects move in 3D space, but we only have the projection of their 3D movement onto a 2D plane. Hence, any rotation in the direction of the third axis may change the object's appearance [29]. Tracking algorithm performance is affected by variation in pose: the same pedestrian looks different in consecutive frames if the pose changes continuously. Moreover, the objects themselves may change in pose and appearance, e.g., through facial expressions, changing clothes, or wearing a hat. Also, the target can be a nonrigid object whose appearance may change over time. In many applications, the goal is tracking humans or pedestrians, which makes tracking algorithms vulnerable in this challenging case [30]. Table 2 summarizes the comparative analysis of methodologies with their advantages, identified gaps, and observations for handling these challenging issues in a video surveillance system.

Pedestrian detection and tracking
In video-based surveillance, one of the key tasks is to detect the presence of pedestrians in a video sequence, i.e., localizing all subjects that are human [45,68]. This problem corresponds to determining regions, typically the smallest rectangular bounding boxes in the video sequence that enclose humans. In most surveillance systems, human behavior is recognized by analyzing the trajectories and positions of persons together with historical or prior knowledge about the scene. Figure 5 shows some examples of pedestrian detection and tracking. Haritaoglu et al. [46] describe a combined approach of shape analysis and body tracking that models different appearances of a person. It was designed for outdoor environments using a single camera. The system detects and tracks groups of people and monitors their behavior, even in the presence of partial occlusion. However, the performance depends mainly on the detected trajectories of the concerned objects in the video. Furthermore, in some cases the results are not sufficient for semantic recognition of dynamic human activities and event analysis. An advanced automatic video surveillance system consists of many features such as motion detection [69,70], human behavior analysis, and detection and tracking [71][72][73]. Human tracking is quite challenging, since humans exhibit high intra-class variability in shape and appearance due to different viewing perspectives and other visual properties.
Krahnstoever et al. [75] designed a real-time control system of active cameras for a multiple-camera surveillance system. Various researchers have since shifted focus from static fixed camera-based pedestrian detection to moving dynamic multicamera-based pedestrian detection. Pedestrian tracking has been done with stationary cameras using a shape-based method [76], which detects and compares the human-body shape in consecutive frames. The cameras were calibrated using a common site-wide metric coordinate system described in [77,78]. Funahasahi et al. [73] developed a system for tracking the human head and face parts by means of a hierarchical tracking method using a stationary camera and a PTZ camera. Recent surveillance systems focus on human tracking by detection, as described in [72][73][74][75]. Andriluka et al. [76][77][78] combined the initial estimate of the human pose across frames in a tracking-by-detection framework. Sapp et al. [79] coupled locations of body joints within and across frames from an ensemble of tractable submodels. Wu and Nevatia [80] proposed an approach for detection and tracking of partially occluded people using an assembly of body parts.
The tracking of humans is more challenging with moving cameras than with static cameras, as discussed in Section 2. Many effective pedestrian tracking techniques used with static cameras, such as background subtraction and modeling [80] and a constant ground plane assumption, are no longer applicable, which makes the task more difficult. Instead of using background modeling-based methods to extract the human information, human detectors are widely used to detect humans in the video. Therefore, the challenge is to successfully detect the humans in moving-camera video and then apply tracking techniques to the detected humans. However, although human detectors may effectively extract humans, they still have some limitations: they may produce false or missed detections, and when humans are partially or fully occluded, the detections can fail and the tracking can be unreliable until the humans reappear in the frames. It is observed that many researchers have worked on many of the challenges of pedestrian detection and tracking, but there is still no complete and reliable solution to all the challenges discussed. Most pedestrian detection and tracking algorithms were tested in indoor and outdoor environments. Attempts were also made to estimate the accuracy of the system based on detection rate, time, and computational complexity. From the performance evaluation of the algorithms presented by the authors, it is observed that deep learning-based pedestrian detection and tracking approaches can be an efficient choice for real-time environments [45,65]. There is still scope for improvement in existing approaches to pedestrian detection and tracking in surveillance systems.
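The notion of the smallest rectangular bounding box enclosing a detected human can be made concrete with a short sketch that derives the box from a binary foreground mask. The mask layout (a list of rows of 0/1 values) and the (x_min, y_min, x_max, y_max) output convention are assumptions for illustration.

```python
def bounding_box(mask):
    """Smallest axis-aligned box enclosing all non-zero pixels of a binary mask.

    Returns (x_min, y_min, x_max, y_max) in pixel indices, or None when
    the mask contains no foreground pixel.
    """
    if not mask or not any(any(row) for row in mask):
        return None
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (min(cols), min(rows), max(cols), max(rows))


# Toy foreground mask with one small blob.
mask = [[0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 1, 0, 0]]

box = bounding_box(mask)
print(box)
```

With several pedestrians in the scene, the mask would first be split into connected regions and the function applied to each region separately.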
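Detection-rate style accuracy figures are usually derived from counts of correct, false, and missed detections. A minimal sketch follows; the function name and the example counts are hypothetical and do not come from any of the surveyed evaluations.

```python
def detection_metrics(true_positives, false_positives, false_negatives):
    """Precision, recall (detection rate) and miss rate from detection counts."""
    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return {
        "precision": precision,
        "recall": recall,          # also called detection rate
        "miss_rate": 1.0 - recall,
    }


# Hypothetical counts: 80 pedestrians found, 10 false alarms, 20 missed.
metrics = detection_metrics(true_positives=80, false_positives=10, false_negatives=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

False detections lower precision, while missed detections caused by occlusion lower recall, so the two limitations of human detectors noted above show up in different metrics.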

Conclusions
This chapter describes and reviews the methodologies, strategies, and steps involved in video surveillance. It also addresses the challenges, issues, available databases, available solutions, and research trends for human detection and tracking in video surveillance systems. Based on the literature survey, most of the available techniques proposed by earlier researchers can perform object detection and tracking either within a single camera view or across multiple cameras. However, most of them fail to address the trade-off between accuracy and speed. Although the accuracy of the trackers is very good, they are often impractical because of their high computational requirements, and vice versa. Thus, to achieve an optimal trade-off, adaptive object detection and tracking method, it is essential