Automatic surveillance system using fish-eye lens camera
This letter presents an automatic surveillance system using a fish-eye lens camera. Our system achieves wide-area automatic surveillance without a dead angle using only one camera. We propose a new human detection method that selects the most adaptive classifier based on the locations of the human candidates. Human regions are detected efficiently from the fish-eye image and are then corrected to perspective views. Experiments are performed on indoor video sequences under different illumination and crowding conditions, with results demonstrating the efficiency of our algorithm.
OCIS codes: 110.0110, 100.0100, 150.0150.
doi: 10.3788/COL201109.021101.
Due to their large field of view, wide-angle lenses are widely used in various applications, such as surveillance, robotic navigation, and semi-automatic parking systems. Because the angle of view of the fish-eye lens used in our system is up to 185°, it achieves effective wide-area surveillance without a dead angle using only one camera. However, the lens introduces an inherent distortion, and the distorted image must be rectified or restored in order to recognize and understand the scene accurately. Human detection and tracking is a necessary component of automatic surveillance systems. However, in the images taken by our surveillance system, the region where a human enters the surveillance space is distorted, and it is difficult to detect humans using the original methods introduced in Refs. [1-9]. To our knowledge, no reliable pedestrian detection algorithm has been reported for fish-eye images. Reference [10] proposed a human detection method for fish-eye images that detects ellipses in the subtraction images of fish-eye pictures as human regions. However, in more crowded situations and when sudden illumination changes occur, their method shows a clear increase in false alarm rate.
In order to improve the efficiency of human detection in fish-eye images even in crowded indoor environments, we propose a new human detection method. The rotations and sizes of the human regions in the fish-eye image change with the locations of the humans in the surveillance area, and we propose a method to normalize these regions. Because the fish-eye lens camera is set on top of the surveillance area, the shapes of humans also change with their locations. In this letter, we create three types of classifiers to detect humans in any part of the surveillance area; the most adaptive classifier for each human candidate is chosen automatically. Moreover, we propose a method to minimize occlusion effects. We infer the possible occluded region of each human candidate based on its location in the fish-eye image. Once the occluded regions are detected, the occlusion effects can be minimized by adjusting the threshold of the classifier.
Unlike other systems such as those proposed in Refs. [11,12], in which the entire input fish-eye image is corrected first and the human regions are then detected from the corrected image, our method detects the human regions directly from the fish-eye image and corrects only those regions afterwards. In this way, the processing efficiency is improved and the processing time is significantly reduced.
The system is designed as illustrated in Fig. 1, wherein the fish-eye lens camera is set on top of the surveillance area. The input image of the fish-eye lens camera is illustrated in Fig. 2, with the background image illustrated in Fig. 2(a) and the input image illustrated in Fig. 2(b).
The edges of the background and input images are extracted using the Sobel operator[13], as illustrated in Figs. 3(a) and (b). In addition, the subtraction image between the input edge image and the background edge image is computed, as illustrated in Fig. 3(c). As shown in Fig. 3, all the head edges look like ellipses; thus, an efficient ellipse detection method[14] is adopted to extract the ellipses from the edge image as head candidates. The method proceeds as follows.
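For illustration, this preprocessing step can be sketched with OpenCV as follows (a minimal sketch; the Sobel kernel size and the binarization threshold are illustrative choices, not the settings of the original system):

```python
import cv2
import numpy as np

def edge_magnitude(gray):
    """Sobel gradient magnitude of a grayscale image."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    return cv2.magnitude(gx, gy)

def edge_subtraction(background, frame, thresh=40.0):
    """Binary edge image of the moving foreground: input edges minus background edges."""
    bg_gray = cv2.cvtColor(background, cv2.COLOR_BGR2GRAY).astype(np.float32)
    in_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    diff = cv2.subtract(edge_magnitude(in_gray), edge_magnitude(bg_gray))
    # Keep only edges that are present in the input but not in the background.
    _, binary = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return binary.astype(np.uint8)
```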
For each pair of pixels, (x1, y1) and (x2, y2), assumed to be the two vertices of the major axis, the following five parameters of an ellipse can be calculated following Ref. [14]:

$$x_0 = \frac{x_1 + x_2}{2}, \qquad y_0 = \frac{y_1 + y_2}{2},$$
$$a = \frac{1}{2}\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}, \qquad \alpha = \arctan\frac{y_2 - y_1}{x_2 - x_1},$$
and, for each third pixel (x, y),
$$b^2 = \frac{a^2 d^2 \sin^2\tau}{a^2 - d^2\cos^2\tau}, \qquad \cos\tau = \frac{a^2 + d^2 - f^2}{2ad},$$

where (x0, y0) is the center of the assumed ellipse, a is the half-length of the major axis, α is the orientation of the ellipse, b is the half-length of the minor axis, d is the distance between (x, y) and (x0, y0), and f is the distance between (x, y) and the vertex (x2, y2). A one-dimensional accumulator array is then used to vote on the half-length of the minor axis; if the votes reach a threshold, an ellipse is found, its parameters are output, and all pixels on that ellipse are removed from the image.
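For illustration, the voting scheme can be sketched as follows (a simplified NumPy version that omits the pixel-removal step of Ref. [14]; the axis-length bounds and vote threshold are illustrative):

```python
import numpy as np

def detect_ellipses(points, min_a=10.0, max_a=60.0, min_votes=40):
    """Simplified Xie-Ji-style ellipse detection via 1D Hough voting on the minor axis.

    points: (N, 2) array of edge-pixel coordinates.
    Returns a list of (x0, y0, a, b, alpha) tuples.
    """
    pts = np.asarray(points, dtype=float)
    ellipses = []
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            p1, p2 = pts[i], pts[j]
            a = np.linalg.norm(p2 - p1) / 2.0                 # half-length of major axis
            if not (min_a <= a <= max_a):
                continue
            center = (p1 + p2) / 2.0
            alpha = np.arctan2(p2[1] - p1[1], p2[0] - p1[0])  # orientation
            d = np.linalg.norm(pts - center, axis=1)          # third-pixel distances
            f = np.linalg.norm(pts - p2, axis=1)              # distance to vertex p2
            valid = (d > 1.0) & (d <= a)
            cos_tau = (a**2 + d[valid]**2 - f[valid]**2) / (2.0 * a * d[valid])
            cos_tau = np.clip(cos_tau, -1.0, 1.0)
            denom = a**2 - (d[valid] * cos_tau)**2
            ok = denom > 1e-6
            b = np.sqrt(a**2 * d[valid][ok]**2 * (1.0 - cos_tau[ok]**2) / denom[ok])
            votes = np.zeros(int(max_a) + 1, dtype=int)       # 1D accumulator for b
            for bi in b.astype(int):
                if 0 < bi <= int(max_a):
                    votes[bi] += 1
            b_best = int(np.argmax(votes))
            if votes[b_best] >= min_votes:
                ellipses.append((center[0], center[1], a, float(b_best), alpha))
    return ellipses
```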
The results of the extracted head candidates are illustrated in Fig. 4. The following process will be executed for each head candidate.
Based on the location of the head candidate, the method introduced in Ref. [15] is adopted in this letter to determine the size of the human candidate region at each location. Consider a cuboid slightly larger than a normal person standing on the floor in the real world; the projection of this cuboid can be taken as the human candidate region when the projection of the top of the cuboid lies near the head candidate.
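For illustration, such a candidate region can be estimated as follows (a minimal sketch assuming an overhead camera and an equidistant fish-eye model, r = fθ; the camera height, focal constant, image center, and cuboid dimensions are illustrative and are not the calibration of Ref. [15]):

```python
import numpy as np

# Illustrative parameters (assumed, not from Ref. [15]).
H = 3.0            # camera height above the floor (m)
F_LENS = 120.0     # equidistant fish-eye constant (pixels per radian)
CX, CY = 512, 512  # center of the fish-eye image circle

def project_fisheye(X, Y, Z):
    """Project a world point (camera at origin, Z pointing down to the floor) to pixels."""
    theta = np.arctan2(np.hypot(X, Y), Z)   # zenith angle from the optical axis
    phi = np.arctan2(Y, X)                  # azimuth angle
    r = F_LENS * theta                      # equidistant projection r = f * theta
    return CX + r * np.cos(phi), CY + r * np.sin(phi)

def candidate_region(foot_x, foot_y, w=0.6, h=1.8):
    """Bounding box of the projection of a w x w x h cuboid standing at (foot_x, foot_y)."""
    corners = [(foot_x + dx, foot_y + dy, H - dz)   # depth measured from the camera
               for dx in (-w / 2, w / 2)
               for dy in (-w / 2, w / 2)
               for dz in (0.0, h)]
    uv = [project_fisheye(X, Y, Z) for X, Y, Z in corners]
    us, vs = zip(*uv)
    return min(us), min(vs), max(us), max(vs)
```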
As shown in Fig. 1, points P′1, P′2, and P′3 in the fish-eye image are the projections of points P1, P2, and P3 in the real world, respectively. The projections of humans A and B are illustrated in Fig. 5. Every projected human lies along the line l_a connecting the center of the fish-eye image (O) and its head candidate center (P′2 and P′3); the feet of the human (P′1) are always closer to the center of the fish-eye image (O) than the human's head (P′2). The angle α between the vertical line and the line from the image center to the head candidate center is computed, and each human candidate region is rotated using

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix},$$

where (x, y) is the coordinate in the original image and (x′, y′) is the coordinate in the rotated image. The results of the normalized human candidate regions are illustrated in Fig. 6.
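For illustration, this normalization step can be sketched with OpenCV as follows (a minimal sketch; the sign of the rotation angle may need flipping depending on the image coordinate convention):

```python
import cv2
import numpy as np

def normalize_region(image, head_center, img_center):
    """Rotate the image so the line from the image center to the head
    candidate points straight up, normalizing the human's orientation."""
    dx = head_center[0] - img_center[0]
    dy = head_center[1] - img_center[1]
    # Angle between the vertical direction (up, in image coordinates where
    # y grows downward) and the center-to-head line.
    alpha = np.degrees(np.arctan2(dx, -dy))
    M = cv2.getRotationMatrix2D(img_center, alpha, 1.0)  # img_center: (float, float)
    h, w = image.shape[:2]
    rotated = cv2.warpAffine(image, M, (w, h))
    # After rotation the head candidate lies directly above the image center,
    # so the candidate region can be cropped as an axis-aligned rectangle.
    return rotated
```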
As shown in Fig. 5, when humans are at different locations, their shapes change, making it impossible to detect all of them using the same detector. In this letter, we create three types of classifiers to categorize human and non-human regions; selecting the most adaptive classifier is thus an important issue. The shapes of humans change with the distance from the head candidate center to the center of the image. We construct three classifiers using different training images, and the most adaptive classifier is selected using the following rules: if di ≤ θ1, classifier 1 is activated; if θ1 < di ≤ θ2, classifier 2 is activated; if di > θ2, classifier 3 is activated. Here, di is the distance between the center of the fish-eye image and the head candidate center, and θ1 and θ2 are threshold constants.
We propose a method for selecting the values of θ1 and θ2. We again consider a cuboid slightly larger than a normal person standing on the floor in the real world, whose height is three times its width.
Assume that the cuboid moves outward from the center of the surveillance area while its projection is observed. When the height of the projected rectangle becomes equal to its width, θ1 is set to the distance between the upper point of the cuboid's projection and the center of the image; in the same way, when the height of the projected rectangle becomes 1.5 times its width, θ2 is set to the distance between the upper point of the cuboid's projection and the center of the image. The selection rule is illustrated in code below.
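In code, the selection rule is straightforward (a sketch; theta1 and theta2 are the thresholds calibrated as above):

```python
def select_classifier(head_center, img_center, theta1, theta2):
    """Pick the classifier trained for the human's appearance at this radius."""
    di = ((head_center[0] - img_center[0]) ** 2 +
          (head_center[1] - img_center[1]) ** 2) ** 0.5
    if di <= theta1:
        return 1   # near the center: top-down, roughly square silhouettes
    elif di <= theta2:
        return 2   # intermediate radii: moderately elongated silhouettes
    else:
        return 3   # near the image border: tall, side-view-like silhouettes
```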
In order to achieve accurate human detection, we adopt histogram of oriented gradients (HOG) descriptors and a 13-stage cascade AdaBoost classifier[7,8] to construct each of the three classifiers.
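For illustration, the feature extraction can be sketched as follows (assuming scikit-image; the 128×64 window and the HOG parameters are illustrative, not necessarily the exact configuration of Refs. [7,8]):

```python
from skimage.feature import hog
from skimage.transform import resize

def hog_descriptor(region):
    """HOG descriptor of a normalized (grayscale) human candidate region."""
    window = resize(region, (128, 64))  # illustrative detection-window size
    return hog(window,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               block_norm='L2-Hys')
```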
Since the human detection approach handles partial occlusions poorly, its accuracy may decrease in crowded conditions. Hence, we propose a method to handle occlusion effects.
As shown in Fig. 1, human A is closer to the center of the surveillance area (O′) than human B in the real world; thus, the projection of human A in the fish-eye image is also closer to the center of the fish-eye image (O) than that of human B. From the fish-eye image taken by our system, the location of each human in the real world can be estimated. In this system, the human candidates are sorted in order of their distance from the center of the fish-eye image, and human detection proceeds in this sequence.
If a human candidate region is evaluated as a human, this rectangular area is labeled, as human A is in Fig. 5. When a portion of a human candidate region is occluded by a previously detected human, the occluded area (shown as the striped quadrilateral in Fig. 5) and the occlusion ratio of the region are computed using

$$r_{\mathrm{ocl}} = \frac{A_{\mathrm{ocl}}}{A_{\mathrm{all}}},$$

where Aocl is the occluded area and Aall is the area of the entire human candidate region.
The ratio rocl is used to adjust the threshold of the classifier. In this system, we adjust the number of cascade stages to minimize occlusion effects: increasing the number of stages makes humans harder to detect, whereas decreasing it makes them easier to detect. The number of stages is adjusted based on rocl using the following rules: if rocl < 0.3, the number of stages is set to 11; if 0.3 ≤ rocl < 0.5, the number of stages is set to 9; if rocl ≥ 0.5, the candidate is considered to be the same person as the human in front.
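For illustration, the occlusion-aware detection loop can be sketched as follows (evaluate_cascade is a hypothetical classifier call, and applying the full 13-stage cascade to unoccluded candidates is an assumption, since the letter's rules start at rocl < 0.3):

```python
def rect_overlap(r1, r2):
    """Area of intersection of two (x1, y1, x2, y2) rectangles."""
    w = min(r1[2], r2[2]) - max(r1[0], r2[0])
    h = min(r1[3], r2[3]) - max(r1[1], r2[1])
    return max(w, 0) * max(h, 0)

def detect_humans(candidates, img_center, evaluate_cascade):
    """Process candidates from the image center outward, relaxing the
    cascade for partially occluded regions.

    candidates: list of (x1, y1, x2, y2) rectangles.
    evaluate_cascade(rect, n_stages) -> bool is a hypothetical classifier call.
    """
    def dist(rect):
        cx, cy = (rect[0] + rect[2]) / 2, (rect[1] + rect[3]) / 2
        return ((cx - img_center[0]) ** 2 + (cy - img_center[1]) ** 2) ** 0.5

    detected = []
    # Candidates closer to the center are in front, so detect them first.
    for rect in sorted(candidates, key=dist):
        area = (rect[2] - rect[0]) * (rect[3] - rect[1])
        occluded = sum(rect_overlap(rect, d) for d in detected)
        r_ocl = occluded / area if area > 0 else 1.0
        if r_ocl >= 0.5:
            continue            # treated as the same person as the human in front
        if r_ocl == 0:
            n_stages = 13       # full cascade when unoccluded (assumed default)
        elif r_ocl < 0.3:
            n_stages = 11
        else:
            n_stages = 9
        if evaluate_cascade(rect, n_stages):
            detected.append(rect)
    return detected
```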
A fish-eye imaging system introduces an inherent distortion in the image; therefore, it is necessary to correct the detected human region to make it easier to interpret. In our proposed system, we adopt the method introduced in Refs. [16,17] to correct the detected human regions. The object plane shown in Fig. 7 is a typical region of interest; we aim to determine its mapping onto the image plane in order to properly correct the perspective of the object. The image plane corresponds to the input fish-eye image. The direction-of-view vector, DOV(x, y, z), determines the zenith and azimuth angles for mapping the object plane onto the image plane XY. The object plane is defined to be perpendicular to the vector DOV(x, y, z).
where l′ is the distance from the object plane to the image plane (see Fig. 7), and R is the radius of the fish-eye image circle. Equations (9) and (10) provide a direct mapping from the XY image space to the XvYv space, forming the mathematical foundation of the omni-directional viewing system. By specifying the desired zenith, azimuth, and object-plane rotation angles together with the magnification, the locations of x and y in the input image can be established. Using Eqs. (3) and (4), the points in the object plane can then be computed, and the corrected human regions are obtained.
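For illustration, this perspective correction can be sketched as follows (a generic reprojection that substitutes an equidistant model, r = fθ, for the exact mapping equations of Refs. [16,17]; the field of view, focal constant, and image center are all illustrative):

```python
import numpy as np

def correct_region(fisheye_img, zenith, azimuth, out_size=128,
                   fov=np.radians(60), f_lens=120.0, cx=512, cy=512):
    """Reproject a fish-eye patch around the viewing direction (zenith, azimuth)
    onto a perspective object plane perpendicular to the DOV vector."""
    # Direction-of-view vector and an orthonormal basis of the object plane.
    dov = np.array([np.sin(zenith) * np.cos(azimuth),
                    np.sin(zenith) * np.sin(azimuth),
                    np.cos(zenith)])
    u = np.array([-np.sin(azimuth), np.cos(azimuth), 0.0])  # in-plane horizontal axis
    v = np.cross(dov, u)                                    # in-plane vertical axis
    # Sample a grid on the object plane at unit distance along DOV.
    half = np.tan(fov / 2)
    s = np.linspace(-half, half, out_size)
    xv, yv = np.meshgrid(s, s)
    rays = dov[None, None, :] + xv[..., None] * u + yv[..., None] * v
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Map each ray back into the fish-eye image (equidistant projection assumed).
    theta = np.arccos(np.clip(rays[..., 2], -1.0, 1.0))     # zenith angle of the ray
    phi = np.arctan2(rays[..., 1], rays[..., 0])            # azimuth angle of the ray
    r = f_lens * theta
    px = np.clip((cx + r * np.cos(phi)).astype(int), 0, fisheye_img.shape[1] - 1)
    py = np.clip((cy + r * np.sin(phi)).astype(int), 0, fisheye_img.shape[0] - 1)
    return fisheye_img[py, px]                              # nearest-neighbor sampling
```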
We collected images for training and video sequences of scenes for testing. The training data consisted of 2000 positive images and 2000 negative images for each classifier, whereas the test data consisted of 800 positive images and 800 negative images. The training data and the test data were captured at different places and involved different people. All the test images were indoor scenes. The following conditions hold: the maximum number of pedestrians in each scene is 10; 98 people are captured under crowded conditions, in which each scene contains more than six people; 110 people are captured when sudden illumination changes occur; and 48 people are captured together with a large cart. We performed five comparison experiments to demonstrate the efficiency of our proposed method.
All the training data used for constructing classifier 1 follow this rule: the distance between the head candidate center and the center of the image is less than θ1. Figure 8 shows examples of detection results using only classifier 1; the rectangles show the detected human regions.
All the training data used for constructing classifier 2 follow this rule: the distance di between the head candidate center and the center of the image satisfies θ1 ≤ di ≤ θ2. Figure 9 shows examples of detection results using only classifier 2; the rectangles show the detected human regions.
All the training data used for constructing classifier 3 follow this rule: the distance between the head candidate center and the center of the image is larger than θ2. Figure 10 shows examples of detection results using only classifier 3; the rectangles show the detected human regions.
The training data used for constructing the overall classifier include all the training data used for constructing classifiers 1, 2, and 3, consisting of 6000 positive images and 6000 negative images. Figures 11-13 show examples of detection results using the overall classifier; the rectangles are the detected human regions. As shown in the figures, too many false alarms appear when the overall classifier is used.
In Ref. [10], the authors proposed a method to detect ellipse regions as human regions. The experiment for comparison was carried out using the same testing images that were used in our proposed system. Experimental results show that the performance of their method is very poor when sudden illumination changes occur. Figure 14 shows the examples of detection results using the ellipse detection method.
As shown in Figs. 15-17, the proposed method successfully detects humans in fish-eye images. The experimental results show that the performance of the proposed method is better than that of the other methods tested.
A receiver operating characteristic (ROC) curve, which plots the detection rate versus the false positive rate, shows the experimental results (Fig. 18). With a false negative rate of 10%, our method has a false positive rate that is 9.75% lower than the system using only classifier 1 for classification, 5.25% lower than the system using only classifier 2, 4.5% lower than the system using only classifier 3, 4% lower than the system using the overall classifier, and 11.25% lower than the system that finds ellipses as humans. These results indicate that our proposed method has better accuracy compared with other methods.
In conclusion, we present an automatic surveillance system using a fish-eye lens camera. We propose a human detection method for fish-eye images, which achieves wide-area automatic surveillance without a dead angle using only one camera. The experimental results demonstrate the efficiency of our algorithm.