CN106355171A - Video monitoring internetworking system

Info

Publication number: CN106355171A
Application number: CN201611063348.0A
Authority: CN
Inventors: 邱林新
Original assignee: Shenzhen Kaida Photoelectric Technology Co Ltd
Current assignee: Shenzhen Kaida Photoelectric Technology Co Ltd
Priority date: 2016-11-24
Filing date: 2016-11-24
Publication date: 2017-01-25

Abstract

The invention provides a video monitoring internetworking system that identifies personnel in two ways, by voice and by image. The system comprises a collection system, a voice recognition system and an image recognition system. The collection system collects voice and images; the voice recognition system comprises a dictionary scene voice module, a similarity comparison module and a voice recognition engine module; the image recognition system comprises a preprocessing module, a feature extraction module, a training module, a re-identification module and an evaluation module. The system has the advantage that personnel can be recognized effectively.

Description

A video monitoring network system

Technical field

The invention relates to the field of video monitoring, and in particular to a video monitoring networking system.

Background

Video surveillance is an important part of a security system. A traditional surveillance system comprises front-end cameras, transmission cables and a video surveillance platform. The cameras, which may be network digital cameras or analog cameras, capture the front-end video image signals. Such a system is a comprehensive system with strong preventive capability. Because it is intuitive, accurate, timely and rich in information, video surveillance is widely used in many settings. In recent years, with the rapid development of computer, network, image processing and transmission technology, video surveillance technology has also advanced considerably.

Summary of the invention

The invention aims to provide a video monitoring networking system capable of identifying personnel quickly and effectively.

The object of the invention is achieved by the following technical solution:

A video monitoring networking system is provided that can identify personnel in two ways, by voice and by image. It comprises a collection system, a voice recognition system and an image recognition system. The collection system collects voice and images. The voice recognition system includes a dictionary scene voice module, a similarity comparison module and a voice recognition engine module. The image recognition system includes a preprocessing module, a feature extraction module, a training module, a re-identification module and an evaluation module. The preprocessing module determines the position of the person in a pedestrian image and obtains a rectangular region containing the person. The feature extraction module extracts appearance features within the rectangular region containing the person. The training module trains multiple cross-modal projection models; each cross-modal projection model contains two projection functions, which respectively map the image features from different cameras into a common feature space and complete the similarity calculation. The re-identification module identifies whether the database contains a pedestrian image consistent with the query person and confirms the identity of the query person. The evaluation module evaluates the performance of the system.

The beneficial effect of the invention is that personnel are identified effectively.

Description of drawings

The invention is further described with reference to the accompanying drawing. The embodiment shown in the drawing does not limit the invention in any way, and a person of ordinary skill in the art can derive other drawings from it without creative effort.

Fig. 1 is a schematic diagram of the structural connections of the invention.

Reference signs:

Collection system 1, voice recognition system 2, image recognition system 3.

Detailed description

The invention is further described with reference to the following embodiments.

Referring to Fig. 1, the video monitoring networking system of this embodiment can identify personnel in two ways, by voice and by image. It comprises a collection system 1, a voice recognition system 2 and an image recognition system 3. The collection system 1 collects voice and images. The voice recognition system 2 includes a dictionary scene voice module, a similarity comparison module and a voice recognition engine module. The image recognition system 3 includes a preprocessing module, a feature extraction module, a training module, a re-identification module and an evaluation module. The preprocessing module determines the position of the person in a pedestrian image and obtains a rectangular region containing the person. The feature extraction module extracts appearance features within the rectangular region containing the person. The training module trains multiple cross-modal projection models; each cross-modal projection model contains two projection functions, which respectively map the image features from different cameras into a common feature space and complete the similarity calculation. The re-identification module identifies whether the database contains a pedestrian image consistent with the query person and confirms the identity of the query person. The evaluation module evaluates the performance of the system.

Preferably, the dictionary scene voice module is adapted to collect, in turn, the dictionary and scene speech in the user vocabulary, and to save the collected feature vectors as templates;

the similarity comparison module is adapted to compare the feature vector of the input speech signal with each feature vector template saved in the dictionary scene voice module in turn, and to output the template with the highest similarity as the speech recognition result.

This preferred embodiment achieves effective identification of personnel.
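The template comparison described above can be sketched as a nearest-template search. The text does not say which similarity measure is used, so cosine similarity is an assumption here, and the names `cosine_similarity`, `recognize` and the example templates are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize(input_vector, templates):
    """Return the label of the stored template most similar to the input.

    templates: dict mapping a label to its saved feature vector template.
    """
    if not templates:
        raise ValueError("no templates stored")
    return max(
        templates,
        key=lambda label: cosine_similarity(input_vector, templates[label]),
    )


# Hypothetical templates; real feature vectors would come from the collection system.
templates = {
    "open gate": [0.9, 0.1, 0.0],
    "sound alarm": [0.1, 0.8, 0.3],
}
best = recognize([0.2, 0.7, 0.4], templates)
```

Each input vector is compared with every stored template in turn, which matches the sequential comparison in the text; the cost grows linearly with the number of templates.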

Preferably, the templates in the dictionary scene voice module include monitoring system term templates and human voice plus dictionary templates.

This preferred embodiment increases the recognition speed.

Preferably, the preprocessing module includes an image fusion unit, which fuses images from different sources in order to capture the features of the image more completely. The fusion comprises: performing wavelet decomposition on each of the two source images to be fused using a biorthogonal wavelet transform, and determining the wavelet coefficients of the decomposed images; for the low-frequency coefficients, selecting the wavelet coefficients of the decomposed images in a set proportion to form the low-frequency wavelet coefficient matrix of the fused image; for the high-frequency coefficients, using a texture consistency measure to analyze the edge characteristics of the different high- and low-frequency coefficients in a given region, calculating the texture consistency measure of the image region, and determining the high-frequency wavelet coefficient matrix of the fused image according to predetermined rules. The texture consistency measure of an image region is defined as:

$EF(x) = \frac{3}{8}(EF_l + EF_c) + \frac{1}{4} EF_d$

where $EF(x)$ denotes the texture consistency measure of image region $x$, and $EF_l$, $EF_c$ and $EF_d$ denote the texture consistency measures of the high-frequency component images of region $x$ in the horizontal, vertical and diagonal directions, respectively. The low-frequency wavelet coefficient matrix and the high-frequency wavelet coefficient matrix of the fused image are then passed through the inverse discrete biorthogonal wavelet transform to obtain the fused image.

This preferred embodiment provides an image fusion unit. The texture consistency measure makes it easier to distinguish false edges in the image, so detail becomes richer and more faithful while the overall visual effect is preserved; defining a formula for the texture consistency measure of an image region also speeds up image fusion.
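As a small worked instance of the texture consistency formula, the following sketch combines three directional measures with the weights 3/8, 3/8 and 1/4 given in the text. How the directional measures themselves are computed is not specified, so they are taken as inputs; the function name `texture_consistency` is hypothetical.

```python
def texture_consistency(ef_l, ef_c, ef_d):
    """EF(x) = 3/8 * (EF_l + EF_c) + 1/4 * EF_d.

    ef_l, ef_c, ef_d: horizontal, vertical and diagonal texture
    consistency measures of one image region (supplied by the caller).
    """
    return 3.0 / 8.0 * (ef_l + ef_c) + 1.0 / 4.0 * ef_d


# The three weights sum to 1, so equal directional measures are returned unchanged.
uniform = texture_consistency(1.0, 1.0, 1.0)
mixed = texture_consistency(0.8, 0.4, 0.2)
```

The horizontal and vertical directions carry equal weight and together outweigh the diagonal direction three to one.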

Preferably, the predetermined rules include:

(1) If more than 88% of the pixel values in the image region have a large texture consistency measure, the region is defined as an edge region, and the high-frequency wavelet coefficients of the image with the largest edge texture consistency measure are selected to form the high-frequency wavelet coefficient matrix of the fused image;

(2) If more than 88% of the pixel values in the image region have a small texture consistency measure, the region is defined as a smooth region. The energy and matching degree of the two source images in the region are calculated, the weights of the two source images' wavelet coefficients in the wavelet coefficients of the fused image are determined from the energy and matching degree, and the high-frequency wavelet coefficient matrix of the fused image is determined by:

$R_G = \beta_A R_A + \beta_B R_B$

where $R_G$ denotes the high-frequency wavelet coefficient matrix of the fused image, $R_A$ and $\beta_A$ denote the wavelet coefficients of one source image and their weight in the wavelet coefficients of the fused image, and $R_B$ and $\beta_B$ denote the wavelet coefficients of the other source image and their weight in the wavelet coefficients of the fused image, with $\beta_A + \beta_B = 1$.

This preferred embodiment determines the high-frequency wavelet coefficient matrix of the fused image according to predetermined rules, which improves both the quality and the speed of fusion.
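The two rules can be sketched as a region classifier plus a weighted blend. The text does not define what counts as a "large" measure or how the weights follow from energy and matching degree, so `large_threshold` and `beta_a` are caller-supplied assumptions, and the "undetermined" outcome for regions that satisfy neither rule is also an assumption. All names are hypothetical.

```python
def classify_region(measures, large_threshold, fraction=0.88):
    """Classify a region from the per-pixel texture consistency measures.

    Returns "edge" if more than `fraction` of the measures are large,
    "smooth" if more than `fraction` are small, otherwise "undetermined".
    """
    count = len(measures)
    if count == 0:
        raise ValueError("empty region")
    large = sum(1 for m in measures if m >= large_threshold)
    if large / count > fraction:
        return "edge"
    if (count - large) / count > fraction:
        return "smooth"
    return "undetermined"


def fuse_smooth(r_a, r_b, beta_a):
    """R_G = beta_A * R_A + beta_B * R_B with beta_A + beta_B = 1."""
    if not 0.0 <= beta_a <= 1.0:
        raise ValueError("beta_a must lie in [0, 1]")
    if len(r_a) != len(r_b):
        raise ValueError("coefficient lists must have equal length")
    beta_b = 1.0 - beta_a
    return [beta_a * a + beta_b * b for a, b in zip(r_a, r_b)]


region_kind = classify_region([0.9] * 9 + [0.1], large_threshold=0.5)
fused = fuse_smooth([2.0, 4.0], [0.0, 0.0], beta_a=0.5)
```

Coefficients are flattened into lists here for brevity; a full implementation would apply the same blend element by element to each high-frequency sub-band matrix.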

Preferably, extracting appearance features within the rectangular region containing the person comprises:

(1) Normalizing the illumination of the image, specifically: a. let the image be $I$; convert $I$ to the logarithmic domain using a logarithm transform, and smooth it with a difference-of-Gaussians filter; b. apply global contrast equalization to $I$;

(2) Normalizing the image size;

(3) Dividing the image into blocks and extracting a feature vector for each image block;

(4) Concatenating the feature vectors of all image blocks, and then applying PCA dimensionality reduction to the concatenated features.

This preferred embodiment provides a feature extraction module that normalizes the illumination of the image before extracting features, which reduces the image distortion caused by illumination changes and makes feature extraction more accurate.
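A partial sketch of the pipeline above is given here, covering the log-domain conversion, a global contrast equalization, block splitting and concatenation. It omits the difference-of-Gaussians smoothing, size normalization and PCA steps. The equalization rule (division by the mean absolute value) and the per-block feature (mean and standard deviation) are assumptions, since the text does not specify either; all function names are hypothetical.

```python
import math


def to_log_domain(image):
    """Map pixel values to the logarithmic domain; log(1 + p) avoids log(0)."""
    return [[math.log(1.0 + p) for p in row] for row in image]


def equalize_contrast(image):
    """Assumed global equalization: divide by the mean absolute value."""
    values = [abs(p) for row in image for p in row]
    scale = sum(values) / len(values)
    if scale == 0.0:
        return [row[:] for row in image]
    return [[p / scale for p in row] for row in image]


def split_blocks(image, block_h, block_w):
    """Split the image into non-overlapping blocks, row by row."""
    rows, cols = len(image), len(image[0])
    blocks = []
    for top in range(0, rows - block_h + 1, block_h):
        for left in range(0, cols - block_w + 1, block_w):
            blocks.append(
                [row[left:left + block_w] for row in image[top:top + block_h]]
            )
    return blocks


def block_feature(block):
    """Assumed per-block feature vector: [mean, standard deviation]."""
    values = [p for row in block for p in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [mean, math.sqrt(variance)]


def extract_features(image, block_h, block_w):
    """Normalize the image, then concatenate the features of all blocks."""
    normalized = equalize_contrast(to_log_domain(image))
    features = []
    for block in split_blocks(normalized, block_h, block_w):
        features.extend(block_feature(block))
    return features


image = [
    [10, 12, 200, 210],
    [11, 13, 205, 215],
    [50, 52, 90, 95],
    [51, 53, 92, 96],
]
features = extract_features(image, 2, 2)
```

The concatenated vector has two entries per block; a dimensionality reduction step such as PCA would then be fitted on such vectors from many training images.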

Preferably, the training module includes a sample classification unit and a cross-modal projection model learning unit. The sample classification unit performs the following:

Let the feature spaces corresponding to the two cameras $C_1$ and $C_2$ be $X \subseteq \mathbb{R}^{d_1}$ and $Y \subseteq \mathbb{R}^{d_2}$, where $d_1$ and $d_2$ are the dimensions of the two camera feature spaces. Assume the training data set consists of $K$ pairs of cross-camera image features $(x_k, y_k)$, $k = 1, \dots, K$, where $s_k = s(x_k, y_k) \in \{-1, +1\}$ is the class label of a sample pair: $-1$ denotes different classes and $+1$ denotes the same class. According to the class labels, the training set is divided into a negative sample set and a positive sample set whose sizes satisfy $|D_1| + |D_2| = K$;

The cross-modal projection model learning unit performs the following:

Let the set of cross-modal projection models be $H = [h_1, h_2, \dots, h_L]$, where the $L$ sub-models handle $L$ kinds of data difference. Each sub-model consists of a pair of projection functions, $h_l = [p_{Xl}(x), p_{Yl}(y)]$. Omitting the subscript $l$, the projection functions $p_X(x)$ and $p_Y(y)$ project $x \in X$ and $y \in Y$ into a common feature space:

$\begin{cases} p_X(x) = \mathrm{sign}(u^T x + a) \\ p_Y(y) = \mathrm{sign}(v^T y + b) \end{cases}$

where $u \in \mathbb{R}^{d_1}$ and $v \in \mathbb{R}^{d_2}$ denote projection vectors, $a, b \in \mathbb{R}$ are linear offsets, and $p_X(x)$ and $p_Y(y)$ project the original features into the space $\{-1, +1\}$;

There are also projection functions $q_X(x)$ and $q_Y(y)$ that project $x \in X$ and $y \in Y$ into another common feature space:

$\begin{cases} q_X(x) = u^T x + a \\ q_Y(y) = v^T y + b \end{cases}$

The relationship between the data classes and the common feature space is established by defining an objective function [formula missing in source],

where $E$ denotes expectation and [symbol missing in source] denotes the index that trades off the importance of same-class and different-class sample pairs;

[formula missing in source]

where $w_k$ denotes the weight of the sample pair $\{x_k, y_k\}$ in the current sub-model learning, and $s_k = s(x_k, y_k) \in \{-1, +1\}$ denotes the class label of the sample pair.

The parameters $\{u, v, a, b\}$ are learned by minimizing the objective function, which yields the corresponding projection functions.

This preferred embodiment uses multiple cross-modal projection models and can therefore cope with a variety of differences in data distribution.
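For illustration, a single sub-model's pair of projections can be sketched as follows. The linear form matches the definition of $q_X$ and $q_Y$; taking the sign of the linear projection to land in {-1, +1} follows the sign terms used in the similarity expression of the re-identification step. Mapping a projection of exactly zero to +1 is an assumption, and the function names and parameter values are hypothetical rather than learned.

```python
def dot(w, x):
    """Inner product of a projection vector and a feature vector."""
    if len(w) != len(x):
        raise ValueError("dimension mismatch")
    return sum(wi * xi for wi, xi in zip(w, x))


def q_project(w, x, offset):
    """Linear projection, e.g. q_X(x) = u^T x + a."""
    return dot(w, x) + offset


def p_project(w, x, offset):
    """Binary projection into {-1, +1}; zero maps to +1 by assumption."""
    return 1 if q_project(w, x, offset) >= 0 else -1


# Hypothetical parameters for camera C1 (u, a); camera C2 would use (v, b).
u, a = [1.0, -1.0], 0.5
linear_value = q_project(u, [2.0, 0.5], a)
binary_value = p_project(u, [0.0, 3.0], a)
```

Learning the parameters would require the objective function, which this sketch does not attempt to reproduce.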

Preferably, identifying whether the database contains a pedestrian image consistent with the query person and confirming the identity of the query person comprises:

Suppose the set of queried persons is $\{f_i, STA(f_i)\}$, $i = 1, 2, \dots, N$, where $f_i$ denotes the $i$-th queried person and $STA(f_i)$ denotes the identity of the $i$-th queried person. For the set of query persons $\{g_j, STA(g_j)\}$, $j = 1, 2, \dots, M$:

$STA(g_j) = STA(f)$

$f = \underset{i}{\arg\max}\, Z(g_j, f_i)$

The similarity $Z(g_j, f_i)$ between $g_j$ and $f_i$ is expressed as:

$Z(g_j, f_i) = \mathrm{sign}(u^T g_j + a) \cdot \mathrm{sign}(v^T f_i + b) + \|(u^T g_j + a) - (v^T f_i + b)\|$

A threshold $T \in [1, 2]$ is set. If $Z(g_j, f_i) < T$, no image among the queried persons is consistent with the query person;

if $Z(g_j, f_i) \geq T$, the queried persons are sorted in descending order of similarity, and the one ranked first has the same identity as the query person.

This preferred embodiment improves the accuracy and efficiency with which the video monitoring networking system identifies personnel.
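The matching step can be sketched as below. The score is computed exactly as the expression for $Z$ is printed in the text, with the norm reducing to an absolute value because both projections are scalars. The parameter values, the gallery contents and the function names are hypothetical, and returning `None` when the best score falls below the threshold is an assumed way of reporting "no consistent image".

```python
def dot(w, x):
    """Inner product of a projection vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sign(value):
    """Sign in {-1, +1}; zero maps to +1 by assumption."""
    return 1 if value >= 0 else -1


def similarity(g, f, u, v, a, b):
    """Z(g, f) = sign(u.g + a) * sign(v.f + b) + |(u.g + a) - (v.f + b)|."""
    q_g = dot(u, g) + a
    q_f = dot(v, f) + b
    return sign(q_g) * sign(q_f) + abs(q_g - q_f)


def identify(query, gallery, u, v, a, b, threshold):
    """Return the identity of the top-ranked gallery entry, or None.

    gallery: list of (feature_vector, identity) pairs for the queried persons.
    """
    scored = [(similarity(query, f, u, v, a, b), ident) for f, ident in gallery]
    best_score, best_identity = max(scored, key=lambda pair: pair[0])
    return best_identity if best_score >= threshold else None


# Hypothetical one-dimensional features and parameters.
u, v, a, b = [1.0], [1.0], 0.0, 0.0
gallery = [([1.0], "alice"), ([0.5], "bob")]
match = identify([1.0], gallery, u, v, a, b, threshold=1.5)
no_match = identify([1.0], gallery, u, v, a, b, threshold=2.0)
```

Only the top-ranked entry is needed to assign an identity, so the sketch takes a maximum rather than sorting the whole gallery.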

Preferably, evaluating the performance of the image recognition system comprises:

Defining the evaluation function:

$F(n) = \frac{\sum_{n=1}^{N} S_n}{N^2}$

where $N$ denotes the number of queries and $S_n$ denotes the number of times the correct result is found within the top $n$ positions. The larger the value of the evaluation function, the better the re-identification performance of the system.

This preferred embodiment provides an evaluation module, which helps in improving the video monitoring networking system.
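The evaluation function can be computed from the rank at which each query's correct match appears. Representing the results as a list of 1-based ranks (with `None` for a query whose correct match was never found) is an assumption about the input format, and `evaluation_score` is a hypothetical name.

```python
def evaluation_score(ranks):
    """F = (sum over n = 1..N of S_n) / N^2.

    ranks: one entry per query, the 1-based rank of the correct match,
           or None if the correct match was not returned.
    S_n counts the queries whose correct match lies within the top n.
    """
    n_queries = len(ranks)
    if n_queries == 0:
        raise ValueError("at least one query is required")
    total = 0
    for n in range(1, n_queries + 1):
        total += sum(1 for r in ranks if r is not None and r <= n)
    return total / (n_queries ** 2)


perfect = evaluation_score([1, 1, 1])
spread = evaluation_score([1, 2, 3])
```

With this input format the score is 1 when every query is matched at rank 1 and 0 when no correct match is ever found.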

A set of recognition results for the video monitoring networking system of the invention is shown in the following table:

N     Average time for person identification     Person recognition accuracy
6     0.14 s                                     95.5%
12    0.12 s                                     95.3%
18    0.16 s                                     95.7%

Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit its scope of protection. Although the invention has been described in detail with reference to the preferred embodiments, a person of ordinary skill in the art should understand that the technical solution of the invention may be modified or replaced with equivalents without departing from its spirit and scope.

Claims

1. A video monitoring networking system, characterized in that it can identify personnel in two ways, by voice and by image, and comprises a collection system, a voice recognition system and an image recognition system; the collection system collects voice and images; the voice recognition system includes a dictionary scene voice module, a similarity comparison module and a voice recognition engine module; the image recognition system includes a preprocessing module, a feature extraction module, a training module, a re-identification module and an evaluation module; the preprocessing module is used to determine the position of the person in a pedestrian image and obtain a rectangular region containing the person; the feature extraction module is used to extract appearance features within the rectangular region containing the person; the training module is used to train multiple cross-modal projection models, each cross-modal projection model containing two projection functions, which respectively map the image features from different cameras into a common feature space and complete the similarity calculation; the re-identification module is used to identify whether the database contains a pedestrian image consistent with the query person and to confirm the identity of the query person; the evaluation module is used to evaluate system performance.

2. The video monitoring networking system according to claim 1, characterized in that the dictionary scene voice module is adapted to collect, in turn, the dictionary and scene speech in the user vocabulary, and to save the collected feature vectors as templates;

the similarity comparison module is adapted to compare the feature vector of the input speech signal with each feature vector template saved in the dictionary scene voice module in turn, and to output the template with the highest similarity as the speech recognition result.

3. The video monitoring networking system according to claim 2, characterized in that the templates in the dictionary scene voice module include monitoring system term templates and human voice plus dictionary templates.