CN116189680B

CN116189680B - Voice wake-up method of exhibition intelligent equipment

Info

Publication number: CN116189680B
Application number: CN202310486209.2A
Authority: CN
Inventors: 张慧; 周林娜
Original assignee: Beijing Crystal Digital Technology Co ltd
Current assignee: Beijing Crystal Digital Technology Co ltd
Priority date: 2023-05-04
Filing date: 2023-05-04
Publication date: 2023-09-26
Anticipated expiration: 2043-05-04
Also published as: CN116189680A

Abstract

The invention provides a voice wake-up method for an exhibition smart device, relating to the technical field of intelligent voice interaction. In the method, the smart device receives the voice with the highest decibel level within a preset area and a preset time, together with dynamic face images of all users; determines from the dynamic face image of the first user whether the first user is currently still within the preset area; and locks the dynamic face image of the first user, performs a first voice interaction with the first user, extracts a first voiceprint feature of the first user, and, based on the first voiceprint feature, filters out voiceprint features that do not match it. The method addresses the technical problem that existing smart devices interact with insufficient targeting when communicating with people in a noisy environment.

Description

A voice wake-up method for an exhibition smart device

Technical field

The present invention relates to the technical field of intelligent voice interaction, and in particular to a voice wake-up method for an exhibition smart device.

Background

Intelligent audio guides for exhibitions are devices that give spoken explanations of indoor exhibits so that visitors can understand them in depth. They are widely used in the field of audio guidance.

With the intelligent audio guides in common use today, the user switches the guide on manually, and the guide then looks up the user's question in a question bank and plays back fixed content. The content does not change with the user, and the guide cannot interact with a particular visitor in a targeted way. On smartphones, an intelligent voice assistant can be woken by the user's wake-up word and then carry on a human-machine dialogue. However, this approach is generally used for one person talking to one machine. In complex scenes with heavy noise and many people, such as exhibitions, content recognition accuracy is low and the speaker is hard to identify. As a result, the smart device struggles to identify whom it is interacting with and cannot understand instructions well, so the interaction is untargeted. Existing intelligent voice assistants are therefore difficult to use in crowded, noisy environments such as exhibitions and trade fairs.

An improved voice wake-up method for an exhibition smart device is therefore urgently needed to address the above technical problems.

Summary of the invention

In view of this, the purpose of the present invention is to provide a voice wake-up method for an exhibition smart device that can reliably lock onto the person interacting with it during human-machine interaction in a crowded, noisy environment, and can therefore give targeted answers.

The invention provides a voice wake-up method for an exhibition smart device, comprising: the smart device receiving the voice with the highest decibel level within a preset area and a preset time, together with dynamic face images of all users; taking the voice with the highest decibel level as a first voice and performing semantic analysis on the first voice to obtain first semantics; performing image analysis on the dynamic face images of all users to obtain a dynamic face image set, obtaining by image extraction a dynamic lip image set corresponding to the dynamic face image set, and obtaining by lip-reading analysis a second semantic set corresponding to the dynamic lip image set; if the first semantics is in the second semantic set, extracting from the second semantic set the dynamic lip image and the dynamic face image of the first user corresponding to the first semantics; if the first semantics is not in the second semantic set, receiving again the voice with the highest decibel level within the preset area and the preset time; determining from the dynamic face image of the first user whether the first user is currently still within the preset area; if not, receiving again the voice with the highest decibel level within the preset area and the preset time; if so, locking the dynamic face image of the first user, performing a first voice interaction with the first user, extracting a first voiceprint feature of the first user, and, based on the first voiceprint feature, filtering out voiceprint features that do not match it.

Preferably, the voice wake-up method further comprises: establishing a user database, the user database including a plurality of items of user characteristic information, the user characteristic information including age, gender, accent, and voice interaction records.

Preferably, after the first voice interaction, the first voiceprint feature is compared with the user characteristic information in the user database. The user characteristic information is further divided into manager data information and returning-visitor data information; the manager data information has corresponding manager voice interaction records, and the returning-visitor data information has corresponding returning-visitor interaction records. If the first voiceprint feature is determined to belong to the manager data information, the manager voice interaction records are retrieved for the next interaction; if the first voiceprint feature is determined to belong to the returning-visitor data information, the returning-visitor interaction records are retrieved for the next interaction.

Preferably, the age and the gender are extracted from the dynamic face image set. The speech recognition method for the accent comprises: constructing, according to the characteristics of a specific dialect, a syllable mapping table from Mandarin pronunciation to dialect pronunciation; extending an existing standard Mandarin speech recognizer according to the syllable mapping table to form a first search tree; and replacing the search tree in the standard Mandarin speech recognizer with the first search tree to form a second search tree.

Preferably, the voice with the highest decibel level includes a wake-up instruction.

Preferably, the lip-reading analysis method is: obtaining lip movement feature data from the dynamic lip images; determining a forward standard deviation and/or a backward standard deviation of the lip movement feature data; and, based on the forward standard deviation and/or the backward standard deviation, determining the word segmentation result of the second semantic set corresponding to the dynamic lip image set.

Preferably, the lip movement feature data includes: an upper-lip characteristic angle and an upper-lip area formed by the left lip corner, the right lip corner, and the upper lip peak; and a lower-lip characteristic angle and a lower-lip area formed by the left lip corner, the right lip corner, and the lowest point of the lower lip.

Preferably, determining the forward standard deviation of the lip movement feature data comprises: selecting a first frame of the dynamic lip image, the forward standard deviation being determined from the lip movement features of the first frame and of the frame images preceding it. Determining the backward standard deviation of the lip movement feature data comprises: selecting a first frame of the dynamic lip image, the backward standard deviation being determined from the lip movement features of the first frame and of the frame images following it.

Preferably, the method of filtering out voiceprint features that do not match the first voiceprint feature is as follows: the smart device includes a microphone array, a ToF detection module, and a DOA calculation module; the microphone array processes multi-channel voice signals and performs noise reduction and enhancement on them; the ToF detection module detects people within the preset area and generates person position information; the DOA calculation module calculates the current DOA interval data; and, through computation on the data input from the microphone array and the data produced by the DOA calculation module, voiceprint features that do not match the first voiceprint feature are filtered out.

Preferably, the method of performing semantic analysis on the first voice to obtain the first semantics comprises: reading in, from a grammar configuration file, a semantic-class-based augmented context-free grammar, in which all terminals, non-terminals, and rule classes are defined according to the domain task; the terminals are keywords classified by semantics; a keyword may contain Arabic numerals and English letters; each keyword has a corresponding pinyin; each rule is assigned a priority; the rule set of a given priority is obtained through lexical or non-lexical analysis; the rules are directly associated with semantics, and each rule corresponds to a semantic analysis function; performing word segmentation on the sentence input by the user; performing syntactic analysis on the segmentation result; and performing semantic analysis on the best syntactic analysis result to obtain the user's final search keyword information.

The technical solution of the present invention brings the following beneficial effects. Taking use at an exhibition venue as an example, the voice wake-up method comprises the following steps. The smart device searches for venue information within the preset area and the preset time, and receives the voice of the person speaking at the highest decibel level within the preset time and preset area, together with the dynamic face images of all users. The voice with the highest decibel level is set as the first voice, and semantic analysis is performed on it to obtain the first semantics corresponding to the first voice. At the same time, image analysis is performed on the dynamic face images of all users to obtain a dynamic face image set; a dynamic lip image set corresponding to the dynamic face image set is obtained by image extraction; and a second semantic set corresponding to the dynamic lip image set is obtained by lip-reading analysis.

Once the first semantics and the second semantic set have been obtained from the voice of the loudest speaker and the dynamic face images of all users within the preset area and preset time, the method determines whether the first semantics is in the second semantic set. If it is, the first semantics, dynamic face image, and dynamic lip image of the loudest speaker have all been obtained, so the characteristic information of the person being sought is locked and acquired. If it is not, recognition noise in the noisy environment is high, and the smart device receives again the voice with the highest decibel level within the preset area and preset time.

Whether the first user is currently still within the preset area is determined from the first user's dynamic face image. If not, the first user may have left the preset area, and the smart device receives again the voice with the highest decibel level within the preset area and preset time. If so, the smart device locks the dynamic face image of the first user, performs a first voice interaction with the first user, extracts the first voiceprint feature of the first user, and, based on it, filters out voiceprint features that do not match. In other words, if the first user is still present, the first voiceprint feature is extracted for more targeted interaction, and non-matching voiceprint features are filtered out.

On this basis, the voice wake-up method provided by the invention accurately finds and identifies the first user and successively obtains the first user's first semantics, dynamic face image, dynamic lip image, and first voiceprint feature. Even in a noisy environment it can find the user in a targeted way and communicate according to the user's characteristics, improving the user experience.

Other features and advantages of the invention will be set forth in the description that follows and will in part be apparent from the description or may be learned by practicing the invention. The objectives and other advantages of the invention are realized and attained by the structure particularly pointed out in the description, the claims, and the accompanying drawings.

To make the above objects, features, and advantages of the invention more apparent and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

Description of the drawings

To explain the specific embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.

Figure 1 is a structural block diagram of a voice wake-up method for an exhibition smart device according to an embodiment of the present invention.

Figure 2 is a structural block diagram of another voice wake-up method for an exhibition smart device according to an embodiment of the present invention.

Detailed description

To make the purpose, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

To facilitate understanding of this embodiment, a control device disclosed in the embodiment of the present invention is first described in detail.

The present invention provides a voice wake-up method for an exhibition smart device. Referring to Figure 1, the method comprises: the smart device receiving the voice with the highest decibel level within a preset area and a preset time, together with the dynamic face images of all users; taking the voice with the highest decibel level as the first voice and performing semantic analysis on it to obtain the first semantics; performing image analysis on the dynamic face images of all users to obtain a dynamic face image set; obtaining by image extraction a dynamic lip image set corresponding to the dynamic face image set; and obtaining by lip-reading analysis a second semantic set corresponding to the dynamic lip image set.

Taking use at an exhibition venue as an example, the voice wake-up method provided by this embodiment comprises the following steps. Step S110: the smart device searches for venue information within the preset area and the preset time, and receives the voice of the person speaking at the highest decibel level within the preset time and preset area, together with the dynamic face images of all users. The voice with the highest decibel level is set as the first voice, and semantic analysis is performed on it to obtain the first semantics corresponding to the first voice. At the same time, image analysis is performed on the dynamic face images of all users to obtain a dynamic face image set; a dynamic lip image set corresponding to the dynamic face image set is obtained by image extraction; and a second semantic set corresponding to the dynamic lip image set is obtained by lip-reading analysis.
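As an illustration of how the "voice with the highest decibel level" in step S110 might be selected, the following is a minimal sketch. It assumes the audio captured within the preset time has already been split into per-speaker segments of float samples in [-1, 1]; the segment ids, the function names, and the use of RMS level in dB relative to full scale are assumptions for illustration and are not specified by the patent.

```python
import math

def rms_dbfs(samples):
    """Return the RMS level of float samples in [-1, 1], in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)

def pick_first_voice(segments):
    """Pick the loudest segment captured within the preset window.

    `segments` maps a hypothetical segment id to its list of samples.
    Returns (segment_id, level_db), or None if nothing was captured.
    """
    if not segments:
        return None
    levels = {seg_id: rms_dbfs(samples) for seg_id, samples in segments.items()}
    best = max(levels, key=levels.get)
    return best, levels[best]
```

The selected segment would then play the role of the first voice passed on to semantic analysis.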

After the first semantics and the second semantic set have been obtained from the voice of the loudest speaker and the dynamic face images of all users within the preset area and preset time, step S120 is performed.

Step S120: determine whether the first semantics is in the second semantic set. If it is, the first semantics, dynamic face image, and dynamic lip image of the loudest speaker have all been obtained, so the characteristic information of the person being sought is locked and acquired. If it is not, recognition noise in the noisy environment is high; the smart device receives again the voice with the highest decibel level within the preset area and preset time, and step S110 starts again.

Step S130: determine from the first user's dynamic face image whether the first user is currently still within the preset area. If not, the first user may have left the preset area; the smart device receives again the voice with the highest decibel level within the preset area and preset time, and step S110 starts again. If so, proceed to step S140.

Step S140: lock the dynamic face image of the first user, perform a first voice interaction with the first user, extract the first user's first voiceprint feature, and, based on it, filter out voiceprint features that do not match. If the first user is still present at this point, the first voiceprint feature is extracted for more targeted interaction, and non-matching voiceprint features are filtered out. On this basis, the voice wake-up method provided by this embodiment accurately finds and identifies the first user and successively obtains the first user's first semantics, dynamic face image, dynamic lip image, and first voiceprint feature. Even in a noisy environment it can find the user in a targeted way and communicate according to the user's characteristics, improving the user experience.
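The decision flow of steps S120 and S130 can be sketched as follows. This is only an illustration under stated assumptions: the `Capture` structure, the user ids, and the use of exact string equality to test whether the first semantics is "in" the second semantic set are all hypothetical; the patent does not say how semantic equality is evaluated.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

@dataclass
class Capture:
    """Result of one pass of step S110 (all field names are hypothetical)."""
    first_semantics: str            # semantics of the loudest voice
    lip_semantics: Dict[str, str]   # user id -> semantics read from that user's lips
    users_in_range: Set[str]        # users currently seen inside the preset area

def match_first_user(capture: Capture) -> Optional[str]:
    """S120: find the user whose lip-read semantics equals the first semantics."""
    for user_id, semantics in capture.lip_semantics.items():
        if semantics == capture.first_semantics:
            return user_id
    return None

def wake_up_step(capture: Capture) -> Optional[str]:
    """Run S120 and S130 on one capture.

    Returns the id of the user to lock in S140, or None to signal
    that the device should start again from S110.
    """
    user_id = match_first_user(capture)
    if user_id is None:
        return None   # S120 failed: first semantics not in the second semantic set
    if user_id not in capture.users_in_range:
        return None   # S130 failed: the user has left the preset area
    return user_id    # proceed to S140: lock the face, extract the voiceprint
```

A caller would loop, producing a new `Capture` each time `wake_up_step` returns None, and move on to voiceprint extraction once a user id is returned.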

Further, the voice wake-up method provided by this embodiment also includes step S150: establishing a user database. The user database includes a plurality of items of user characteristic information, and the user characteristic information includes age, gender, accent, and voice interaction records.

Further, step S150 includes step S151: dividing the user characteristic information into manager data information and returning-visitor data information. The manager data information has corresponding manager voice interaction records, and the returning-visitor data information has corresponding returning-visitor interaction records.

Step S152: if the first voiceprint feature is determined to belong to the manager data information, the manager voice interaction records are retrieved for the next interaction; if the first voiceprint feature is determined to belong to the returning-visitor data information, the returning-visitor interaction records are retrieved for the next interaction.

After step S140 determines that the first user is still within the preset area and further interaction is needed, the user database is established in step S150. In step S151 the user characteristics in the user database are divided into manager data information and returning-visitor data information. If the first voiceprint feature is determined to match the manager data information, the manager's characteristics and question-and-answer records are retrieved so that a targeted answer can be given. If the first voiceprint feature is determined to match the returning-visitor data information, that information is retrieved for the interaction. The whole interaction thus becomes more targeted, and to the first user the exhibition smart device appears more intelligent.
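One way to picture the lookup in steps S150 to S152 is shown below. The sketch assumes that a voiceprint is a fixed-length numeric embedding and that matching is done by cosine similarity against a threshold; neither choice comes from the patent, and `UserRecord`, `lookup`, and the 0.8 threshold are hypothetical names and values.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence

@dataclass
class UserRecord:
    role: str                     # "manager" or "visitor" (a returning visitor)
    voiceprint: Sequence[float]   # hypothetical fixed-length embedding
    history: List[str] = field(default_factory=list)  # past interaction records

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def lookup(db: List[UserRecord], voiceprint: Sequence[float],
           threshold: float = 0.8) -> Optional[UserRecord]:
    """Return the best-matching record at or above the threshold, else None."""
    best, best_score = None, threshold
    for rec in db:
        score = cosine(rec.voiceprint, voiceprint)
        if score >= best_score:
            best, best_score = rec, score
    return best
```

If `lookup` returns a record, its `role` decides which interaction history is retrieved for the next exchange; a None result would correspond to a first-time visitor.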

The voice wake-up method provided by this embodiment further includes: establishing a user database that includes a plurality of items of user characteristic information, the user characteristic information including age, gender, accent, and voice interaction records. The age and gender are extracted from the dynamic face image set. The speech recognition method for the accent comprises: constructing, according to the characteristics of a specific dialect, a syllable mapping table from Mandarin pronunciation to dialect pronunciation; extending an existing standard Mandarin speech recognizer according to the syllable mapping table to form a first search tree; and replacing the search tree in the standard Mandarin speech recognizer with the first search tree to form a second search tree. The method of constructing the syllable mapping table from Mandarin pronunciation to dialect pronunciation according to the characteristics of a specific dialect comprises: summarizing the syllable mapping regularities of the relevant dialect on the basis of linguistic knowledge; and, for any word-independent syllable mapping, if the mapping occurs on the initial, registering the initial mapping pair {I*(x)}→{I*(y)}, which indicates that in a syllable with initial x the initial is mapped to y. This widens the range of application of the voice wake-up method provided by this embodiment.
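The initial mapping pair {I*(x)}→{I*(y)} can be illustrated with a small sketch that expands the pinyin pronunciation of a word into its accented variants. The mapping used in the usage note (zh→z, sh→s) is only an illustrative example of an accent pattern, not a table taken from the patent; treating y and w as initials is a simplification, and the construction of the recognizer's search tree from these variants is not shown.

```python
from itertools import product

# Two-letter initials come first so that "zh" is tried before "z".
# "y" and "w" are treated as initials here purely for simplicity.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g", "k",
            "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syllable):
    """Split a pinyin syllable into (initial, final); the initial may be empty."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable

def variants(syllable, initial_map):
    """Return the standard syllable plus its accented variant, if the initial is mapped."""
    ini, final = split_syllable(syllable)
    out = [syllable]
    if ini in initial_map:
        out.append(initial_map[ini] + final)
    return out

def expand_pronunciations(syllables, initial_map):
    """Enumerate every pronunciation of a word under the initial mapping."""
    return [list(p) for p in product(*(variants(s, initial_map) for s in syllables))]
```

For example, `expand_pronunciations(["zhan", "shi"], {"zh": "z", "sh": "s"})` yields four pronunciations, from the standard `["zhan", "shi"]` to the fully accented `["zan", "si"]`; each could be added as an extra path for the same word in an extended lexicon.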

Further, the voice with the highest decibel level in step S110 includes a wake-up instruction, and the wake-up instruction includes wake-up keywords, wake-up sentence patterns, interrogative particles, and the like. After the first user has been found, wake-up instructions can also be used during the interaction to ensure timely communication with the first user.
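A wake-up instruction check covering the three kinds of cue named above (keywords, sentence patterns, interrogative particles) might look like the following sketch. The wake word, the sentence pattern, and the particle list are hypothetical examples chosen for illustration; the patent does not enumerate them.

```python
import re

WAKE_KEYWORDS = ("你好小展",)               # hypothetical wake word
WAKE_PATTERNS = (re.compile(r"^请问.+"),)   # sentence pattern: "may I ask ..."
QUESTION_PARTICLES = ("吗", "呢")           # sentence-final interrogative particles

def is_wake_up(text: str) -> bool:
    """Return True if the recognized text contains any wake-up cue."""
    # Drop surrounding whitespace and trailing punctuation before checking.
    text = text.strip().rstrip("？?。.!！")
    if any(keyword in text for keyword in WAKE_KEYWORDS):
        return True
    if any(pattern.search(text) for pattern in WAKE_PATTERNS):
        return True
    return text.endswith(QUESTION_PARTICLES)
```

The check runs on recognized text, so in practice it would be applied to the output of the speech recognizer for the first voice.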

Further, the second semantic set corresponding to the dynamic lip image set is obtained by lip-reading analysis. The lip-reading analysis method is: obtaining lip movement feature data from the dynamic lip images; determining a forward standard deviation and/or a backward standard deviation of the lip movement feature data; and, based on the forward standard deviation and/or the backward standard deviation, determining the word segmentation result of the second semantic set corresponding to the dynamic lip image set.

The lip movement feature data includes: an upper-lip characteristic angle and an upper-lip area formed by the left lip corner, the right lip corner, and the upper lip peak; and a lower-lip characteristic angle and a lower-lip area formed by the left lip corner, the right lip corner, and the lowest point of the lower lip. The upper-lip characteristic angle and upper-lip area characterize the user's upper lip while speaking, and the lower-lip characteristic angle and lower-lip area characterize the user's lower lip while speaking. Each different utterance by the user has its own set of upper-lip and lower-lip features.
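The lip features described above can be computed from four landmark points per frame. The sketch below assumes 2D image coordinates, takes the characteristic angle to be the angle at the lip peak (or lowest point) subtended by the two lip corners, and takes the area to be that of the triangle formed by the three points. Both interpretations are assumptions, since the patent names the points but not the formulas.

```python
import math

def angle_at(vertex, a, b):
    """Angle in degrees at `vertex` between the rays toward points a and b."""
    v1 = (a[0] - vertex[0], a[1] - vertex[1])
    v2 = (b[0] - vertex[0], b[1] - vertex[1])
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def triangle_area(a, b, c):
    """Area of the triangle a-b-c, from the 2D cross product."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def lip_features(left, right, upper_peak, lower_low):
    """Return the four per-frame lip features from the landmark points."""
    return {
        "upper_angle": angle_at(upper_peak, left, right),
        "upper_area": triangle_area(left, right, upper_peak),
        "lower_angle": angle_at(lower_low, left, right),
        "lower_area": triangle_area(left, right, lower_low),
    }
```

Computing these four values for every frame gives the time series of lip movement feature data on which the standard deviations are evaluated.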

Further, determining the forward standard deviation of the lip movement feature data comprises: selecting a first frame of the dynamic lip image, the forward standard deviation being determined from the lip movement features of the first frame and of the frame images preceding it. Determining the backward standard deviation of the lip movement feature data comprises: selecting a first frame of the dynamic lip image, the backward standard deviation being determined from the lip movement features of the first frame and of the frame images following it. From the dynamically determined forward and backward standard deviations of the lip movement feature data, the word segmentation result of the second semantic set corresponding to the dynamic lip image set can be determined, and the content of the user's interaction thereby recognized.
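The forward and backward standard deviations can be sketched on a single scalar feature, such as the upper-lip angle per frame. The window length `k`, the use of the population standard deviation, and the rule that frames where both deviations are small are treated as pauses between words are all assumptions made for illustration; the patent does not state how the deviations are turned into a segmentation.

```python
from statistics import pstdev

def forward_std(values, i, k=1):
    """Standard deviation over frame i and up to k frames before it."""
    return pstdev(values[max(0, i - k): i + 1])

def backward_std(values, i, k=1):
    """Standard deviation over frame i and up to k frames after it."""
    return pstdev(values[i: i + k + 1])

def pause_frames(values, threshold, k=1):
    """Indices of frames where the lips are nearly still in both directions.

    Such frames are treated here as candidate word boundaries.
    """
    return [
        i for i in range(len(values))
        if forward_std(values, i, k) <= threshold
        and backward_std(values, i, k) <= threshold
    ]
```

With the feature series `[10, 10, 10, 30, 50, 50, 50]`, a threshold of 1, and `k=1`, frames 0, 1, 5, and 6 are reported as still, which brackets the movement in frames 2 to 4.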

进一步的，基于第一声纹特征过滤与第一声纹特征不匹配的声纹特征的方法为：智能设备包括麦克风阵列、ToF（飞行的时间）检测模块、DOA（波达方向）计算模块；麦克风阵列处理多路语音信号，对语音信号进行降噪及增强；ToF检测模块检测预设地域范围人员，并生成人员位置信息通过DOA计算模块产生当前的DOA区间数据；计算基于上述麦克风阵列输入的数据和基于DOA计算模块产生的数据，过滤与第一声纹特征不匹配的声纹特征。Further, a method for filtering voiceprint features that do not match the first voiceprint feature based on the first voiceprint feature is: the smart device includes a microphone array, a ToF (time of flight) detection module, and a DOA (direction of arrival) calculation module; The microphone array processes multi-channel voice signals, denoising and enhancing the voice signals; the ToF detection module detects people in the preset geographical range, and generates person location information. The DOA calculation module generates the current DOA interval data; the calculation is based on the above microphone array input. Data and based on the data generated by the DOA calculation module, the voiceprint features that do not match the first voiceprint feature are filtered.

In the voice wake-up method for an exhibition smart device provided by the present invention, the ToF detection module detects the presence and location of people in the exhibition and passes this information to the DOA calculation module, which calculates the current DOA interval data. This makes the DOA calculation during voice wake-up more precise and reduces DOA calculation errors, so that the target of the subsequent speech noise reduction is accurate, which ultimately raises the recognition accuracy and improves the user experience.
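The way ToF location data can constrain the DOA calculation may be sketched as below. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes the ToF module reports a person's position as (x, y) in metres in a device-centred frame, that the DOA interval is a fixed angular band around that person's azimuth, and that the microphone array yields one azimuth estimate per detected source. The function names and the 15-degree half width are hypothetical.

```python
import math
from typing import Iterable, List, Tuple

Interval = Tuple[float, float]  # (low, high) azimuth in degrees, in [0, 360)


def doa_interval(position: Tuple[float, float], half_width_deg: float = 15.0) -> Interval:
    """Build an azimuth interval around a person located by the ToF module."""
    x, y = position
    azimuth = math.degrees(math.atan2(y, x)) % 360.0
    return ((azimuth - half_width_deg) % 360.0, (azimuth + half_width_deg) % 360.0)


def in_interval(angle_deg: float, interval: Interval) -> bool:
    """Check membership, allowing the interval to wrap through 0 degrees."""
    low, high = interval
    angle = angle_deg % 360.0
    if low <= high:
        return low <= angle <= high
    return angle >= low or angle <= high


def keep_sources_in_interval(doa_estimates_deg: Iterable[float], interval: Interval) -> List[float]:
    """Keep only the sound sources whose estimated DOA lies in the interval."""
    return [a for a in doa_estimates_deg if in_interval(a, interval)]


# The locked user stands directly in front of the device (azimuth 0 degrees).
interval = doa_interval((1.0, 0.0), half_width_deg=15.0)
kept = keep_sources_in_interval([5.0, 120.0, 355.0], interval)
```

Sources outside the interval would be treated as interference before voiceprint matching; how the interval width is chosen in practice is not stated in the source.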

Further, the method of performing semantic analysis on the first voice to obtain the first semantics includes: first, reading a context-free enhanced grammar based on semantic classes from a grammar configuration file. Specifically, all terminal symbols, non-terminal symbols, and rule classes in the grammar are defined according to the domain task. The terminal symbols are keywords classified by semantics; a keyword may contain Arabic numerals and English letters, and each keyword has a corresponding pinyin. Each rule is assigned a priority level, and the prioritized rule set is obtained through lexical or non-lexical analysis. The rules are directly associated with semantics, and each rule corresponds to a semantic analysis function. The domain-task-defined grammar is an existing technique in semantic analysis.

Then, word segmentation is performed on the sentence input by the user; syntactic analysis is performed on the word segmentation result; and semantic analysis is performed on the optimal syntactic analysis result to obtain the user's final search keyword information.
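The segmentation, rule matching, and semantic-function steps above can be sketched as follows. Everything concrete here is an assumption introduced for illustration: the three-entry lexicon, the semantic class names, the two rules and their priorities, and the use of forward maximum matching for segmentation and of contiguous pattern matching in place of a full context-free parser. The source does not specify any of these.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple

# Hypothetical lexicon: keyword -> (semantic class, pinyin).
LEXICON: Dict[str, Tuple[str, str]] = {
    "三号": ("NUMBER", "san hao"),
    "展厅": ("PLACE", "zhan ting"),
    "在哪里": ("QUERY_WHERE", "zai na li"),
}


class Rule(NamedTuple):
    pattern: Tuple[str, ...]              # sequence of semantic classes
    priority: int                         # higher value wins
    action: Callable[[List[str]], dict]   # semantic analysis function


# Hypothetical rule set; each rule carries its own semantic function.
RULES = [
    Rule(("NUMBER", "PLACE", "QUERY_WHERE"), 2,
         lambda t: {"intent": "locate", "keywords": t[:2]}),
    Rule(("PLACE", "QUERY_WHERE"), 1,
         lambda t: {"intent": "locate", "keywords": t[:1]}),
]


def segment(sentence: str) -> List[str]:
    """Forward maximum matching against the lexicon keywords."""
    max_len = max(len(k) for k in LEXICON)
    tokens, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in LEXICON:
                tokens.append(piece)
                i += size
                break
        else:
            i += 1  # skip a character that starts no keyword
    return tokens


def analyse(sentence: str) -> Optional[dict]:
    """Segment, match rules on the class sequence, apply the best rule."""
    tokens = segment(sentence)
    classes = tuple(LEXICON[t][0] for t in tokens)
    best = None
    for rule in RULES:
        n = len(rule.pattern)
        for start in range(len(classes) - n + 1):
            if classes[start:start + n] == rule.pattern:
                if best is None or rule.priority > best[0].priority:
                    best = (rule, tokens[start:start + n])
                break
    return None if best is None else best[0].action(best[1])


result = analyse("三号展厅在哪里")  # "Where is exhibition hall three?"
```

When several rules match, the sketch keeps the one with the highest priority and passes its matched keywords to that rule's semantic function, mirroring the "optimal syntactic analysis result" step in the description.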

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operating steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

In addition, in the description of the embodiments of the present invention, unless otherwise expressly specified and limited, the terms "mounted", "linked", and "connected" should be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, or indirect through an intermediate medium; and it may be an internal communication between two elements. Those skilled in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.

In the description of the present invention, it should be noted that orientations or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be construed as limiting the present invention. Furthermore, the terms "first", "second", and "third" are used for descriptive purposes only and cannot be construed as indicating or implying relative importance.

Finally, it should be noted that the above embodiments are only specific implementations of the present invention, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed by the present invention, still modify the technical solutions recorded in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A voice wake-up method for an exhibition smart device, characterized by comprising:

The smart device receives, within a preset geographical range and a preset time, the voice with the highest decibel level and the face dynamic images of all users; the voice with the highest decibel level is the first voice, and semantic analysis is performed on the first voice to obtain the first semantics; image analysis is performed on the face dynamic images of all users to obtain a face dynamic image set, a lip dynamic image set corresponding to the face dynamic image set is obtained through image extraction, and a second semantic set corresponding to the lip dynamic image set is obtained through lip language analysis;

If the first semantics is in the second semantic set, extract from the second semantic set the lip dynamic image of the first user and the face dynamic image of the first user corresponding to the first semantics; if the first semantics is not in the second semantic set, re-receive the voice with the highest decibel level within the preset geographical range and the preset time;

Determine, based on the face dynamic image of the first user, whether the first user is currently still within the preset geographical range. If not, re-receive the voice with the highest decibel level within the preset geographical range and the preset time; if so, lock the face dynamic image of the first user, perform the first voice interaction with the first user, extract the first voiceprint feature of the first user, and, based on the first voiceprint feature, filter out voiceprint features that do not match the first voiceprint feature;

The method of filtering voiceprint features that do not match the first voiceprint feature is: the smart device includes a microphone array, a ToF detection module, and a DOA calculation module; the microphone array processes multi-channel voice signals, performing noise reduction and enhancement on the voice signals; the ToF detection module detects people within the preset geographical range and generates person location information; the DOA calculation module calculates and generates the current DOA interval data; and, based on the data input by the microphone array and the data generated by the DOA calculation module, voiceprint features that do not match the first voiceprint feature are filtered out.

2. The voice wake-up method for an exhibition smart device according to claim 1, further comprising:

Establishing a user database that includes multiple items of user characteristic information, the user characteristic information including age, gender, accent, and voice interaction records.

3. The voice wake-up method for an exhibition smart device according to claim 2, characterized in that, after the first voice interaction, the first voiceprint feature is compared with the user characteristic information in the user database;

The user characteristic information is further divided into manager data information and visiting user data information. The manager data information corresponds to a manager voice interaction record, and the visiting user data information corresponds to a visiting user interaction record;

If it is determined that the first voiceprint feature belongs to the manager data information, the manager voice interaction record is retrieved for the next interaction; if it is determined that the first voiceprint feature belongs to the visiting user data information, the visiting user interaction record is retrieved for the next interaction.

4. The voice wake-up method for an exhibition smart device according to claim 2, characterized in that the age and the gender are extracted from the face dynamic image set; the speech recognition method for the accent includes: constructing, according to the characteristics of a specific dialect, a syllable mapping table from Mandarin pronunciation to dialect pronunciation; expanding an existing standard Mandarin speech recognizer according to the syllable mapping table to form a first search tree; and using the first search tree to replace the search tree in the standard Mandarin speech recognizer, forming a second search tree.

5. The voice wake-up method for an exhibition smart device according to claim 1, wherein the voice with the highest decibel level includes a wake-up instruction.

6. The voice wake-up method for an exhibition smart device according to claim 1, characterized in that the lip language analysis method is:

Obtain lip motion feature data from the lip dynamic image;

Determine the forward standard deviation and/or inverse standard deviation of the lip motion feature data;

Based on the forward standard deviation and/or the inverse standard deviation, a word segmentation result of the second semantic set corresponding to the lip dynamic image set is determined.

7. The voice wake-up method for an exhibition smart device according to claim 6, wherein the lip motion feature data includes: the upper-lip characteristic angle and the upper-lip area formed by the left lip corner, the right lip corner, and the peak of the upper lip; and the lower-lip characteristic angle and the lower-lip area formed by the left lip corner, the right lip corner, and the low point of the lower lip.

8. The voice wake-up method for an exhibition smart device according to claim 6, wherein determining the forward standard deviation of the lip motion feature data includes: selecting a first video frame of the lip dynamic image, the forward standard deviation being determined from the lip motion features of the first video frame and of the frames preceding the first video frame; and determining the inverse standard deviation of the lip motion feature data includes: selecting a first video frame of the lip dynamic image, the inverse standard deviation being determined from the lip motion features of the first video frame and of the frames following the first video frame.

9. The voice wake-up method for an exhibition smart device according to claim 1, wherein the method of performing semantic analysis on the first voice to obtain the first semantics includes:

Reading a context-free enhanced grammar based on semantic classes from a grammar configuration file, wherein all terminal symbols, non-terminal symbols, and rule classes in the grammar are defined according to the domain task; the terminal symbols are keywords classified by semantics, the keywords include Arabic numerals and English letters, and each keyword has a corresponding pinyin; each of the rules is assigned a priority level, and the prioritized rule set is obtained through lexical or non-lexical analysis; the rules are directly associated with semantics, and each of the rules corresponds to a semantic analysis function; performing word segmentation on the sentence input by the user; performing syntactic analysis on the word segmentation result; and performing semantic analysis on the optimal syntactic analysis result to obtain the user's final search keyword information.