CN120071921A

CN120071921A - Earphone interaction method and device based on voice, electronic equipment and storage medium

Info

Publication number: CN120071921A
Application number: CN202510045868.1A
Authority: CN
Inventors: 黄环; 吴海全; 姜德军; 迟欣; 曹磊; 何桂晓
Original assignee: Shenzhen Grandsun Electronics Co Ltd
Current assignee: Shenzhen Grandsun Electronics Co Ltd
Priority date: 2025-01-13
Filing date: 2025-01-13
Publication date: 2025-05-30

Abstract

The embodiments of the present application provide a voice-based headset interaction method and device, an electronic device, and a storage medium, which belong to the field of headset technology. The method is applied to a headset compartment, which is connected to the headset, and the headset compartment is provided with a display screen. The method includes: obtaining interaction instruction audio collected by the headset; performing interaction intent recognition on the interaction instruction audio to obtain a target intention instruction; sending the target intention instruction to a preset server, and obtaining instruction processing data returned by the server; performing data analysis on the instruction processing data to obtain target interaction display data and target interaction audio data; displaying the target interaction display data on the display screen, and sending the target interaction audio data to the headset, so that the headset plays audio according to the target interaction audio data. The embodiments of the present application can improve the flexibility of the headset when used.

Description

Earphone interaction method and device based on voice, electronic equipment and storage medium

Technical Field

The present application relates to the field of earphone technology, and in particular, to a voice-based earphone interaction method and apparatus, an electronic device, and a storage medium.

Background

Earphone interaction means that an earphone user interacts with the earphone: the user gives an instruction to the earphone, and the earphone executes it. Typically, after the earphone is taken out of the earphone cabin, it is connected to a terminal device other than the earphone, such as a computer or a mobile phone. An instruction given to the earphone is forwarded through the earphone to the terminal device, which processes the instruction. The earphone is therefore coupled to the other terminal device, which reduces its flexibility in use. How to improve the flexibility of the earphone in use is thus a problem to be solved.

Disclosure of Invention

The main purpose of the embodiments of the present application is to provide a voice-based earphone interaction method and device, an electronic device, and a storage medium, so as to improve the flexibility of the earphone in use.

To achieve the above object, a first aspect of an embodiment of the present application provides a voice-based earphone interaction method, which is applied to an earphone cabin, wherein the earphone cabin is connected with an earphone, and the earphone cabin is provided with a display screen, and the method includes:

acquiring interaction instruction audio acquired by the earphone;

performing interactive intention recognition on the interactive instruction audio to obtain a target intention instruction;

the target intention instruction is sent to a preset server, and instruction processing data returned by the server are obtained;

carrying out data analysis on the instruction processing data to obtain target interaction display data and target interaction audio data;

and carrying out picture display on the target interaction display data based on the display screen, and sending the target interaction audio data to the earphone so that the earphone can carry out audio playing according to the target interaction audio data.

In some embodiments, the identifying the interactive intention of the interactive instruction audio to obtain a target intention instruction includes:

denoising the interactive instruction audio to obtain denoising instruction audio;

carrying out semantic recognition on the denoising instruction audio to obtain target semantic data;

and generating an instruction according to the target semantic data to obtain the target intention instruction.

In some embodiments, the denoising processing is performed on the interaction instruction audio to obtain denoising instruction audio, including:

performing noise detection on the interactive instruction audio to obtain noise type data and noise intensity data;

performing noise removal on the interactive instruction audio according to the noise type data and the noise intensity data to obtain original denoising audio;

performing echo cancellation on the original denoising audio to obtain preliminary denoising audio;

and performing human voice enhancement on the preliminary denoising audio to obtain the denoising instruction audio.

In some embodiments, the performing human voice enhancement on the preliminary denoising audio to obtain the denoising instruction audio includes:

performing voice extraction on the preliminary denoising audio to obtain voice audio data;

performing voiceprint recognition on the voice audio data to obtain target voice audio;

and carrying out voice reinforcement on the target voice audio to obtain the denoising instruction audio.

In some embodiments, the performing semantic recognition on the denoising instruction audio to obtain target semantic data includes:

performing text conversion on the denoising instruction audio to obtain a denoising instruction text;

performing entity identification on the denoising instruction text to obtain a denoising instruction entity;

carrying out grammar recognition on the denoising instruction text to obtain a denoising instruction grammar unit;

and carrying out semantic synthesis according to the denoising instruction entity and the denoising instruction grammar unit to obtain the target semantic data.

In some embodiments, the data parsing the instruction processing data to obtain target interaction display data and target interaction audio data includes:

carrying out format analysis on the instruction processing data to obtain original display data, the target interactive audio data and the data category of the original display data;

if the data category is represented as a broadcasting category, text rendering is carried out according to the original display data, and the target interactive display data is obtained;

if the data category is expressed as an image category, performing image rendering according to the original display data to obtain the target interactive display data;

and if the data category is expressed as a video category, performing video rendering according to the original display data to obtain the target interactive display data.
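The category-based branching above can be sketched as a small dispatch table. This is a minimal illustration only: the dictionary keys (`category`, `display`, `audio`), the category labels, and the three `render_*` functions are hypothetical stand-ins, since the application does not specify the format of the instruction processing data or how rendering is performed on the display screen.

```python
from typing import Callable, Dict, Tuple


# Hypothetical renderers: each turns original display data into something
# the display screen could show. Real rendering is device specific.
def render_text(raw: bytes) -> str:
    return "text:" + raw.decode("utf-8")


def render_image(raw: bytes) -> str:
    return f"image:{len(raw)} bytes"


def render_video(raw: bytes) -> str:
    return f"video:{len(raw)} bytes"


# Assumed category labels; the text names broadcasting, image and video categories.
RENDERERS: Dict[str, Callable[[bytes], str]] = {
    "broadcast": render_text,
    "image": render_image,
    "video": render_video,
}


def parse_instruction_data(data: dict) -> Tuple[str, bytes]:
    """Split server data into (target display data, target audio data)."""
    category = data["category"]
    if category not in RENDERERS:
        raise ValueError(f"unknown data category: {category}")
    display = RENDERERS[category](data["display"])
    return display, data["audio"]
```

A table keyed by category keeps the three branches symmetric, so a further category would only need one more entry.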

In some embodiments, before the acquiring of the interaction instruction audio collected by the earphone, the method further includes:

acquiring earphone collection audio collected by the earphone;

performing mel feature extraction on the earphone collection audio to obtain audio mel features;

performing wake-up word detection on the audio mel features according to a preset wake-up word detection network to obtain wake-up word detection data;

and performing audio segmentation on the earphone collection audio according to the wake-up word detection data to obtain the interaction instruction audio.

To achieve the above object, a second aspect of the embodiments of the present application provides a voice-based earphone interaction device, which is applied to an earphone cabin, the earphone cabin is connected with an earphone, the earphone cabin is provided with a display screen, and the device includes:

the data acquisition module is used for acquiring interaction instruction audio collected by the earphone;

The intention recognition module is used for carrying out interactive intention recognition on the interactive instruction audio to obtain a target intention instruction;

the instruction processing module is used for sending the target intention instruction to a preset server and acquiring instruction processing data returned by the server;

The data analysis module is used for carrying out data analysis on the instruction processing data to obtain target interaction display data and target interaction audio data;

and the picture display module is used for displaying pictures of the target interaction display data based on the display screen, and sending the target interaction audio data to the earphone so that the earphone plays the audio according to the target interaction audio data.

To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, including a memory and a processor, the memory storing a computer program, and the processor implementing the method of the first aspect when executing the computer program.

To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect.

According to the voice-based earphone interaction method and device, electronic device, and storage medium provided by the present application, interaction instruction audio collected by the earphone is first acquired, and interaction intention recognition is performed on it to obtain a target intention instruction. The target intention instruction is then sent to a preset server, and instruction processing data returned by the server is acquired. Data analysis is performed on the instruction processing data to obtain target interaction display data and target interaction audio data. Finally, the target interaction display data is displayed on the display screen, and the target interaction audio data is sent to the earphone so that the earphone plays audio according to it. In this way, the earphone cabin recognizes the target intention instruction of the earphone user and initiates a request to the server to fulfil it, and the instruction is finally carried out through the display screen provided on the earphone cabin and through the earphone. That is, the instruction issued by the earphone user can be completed by the earphone cabin and the earphone alone, without connecting to other terminal equipment, which decouples the earphone from other terminal equipment and ultimately improves the flexibility of the earphone in use.

Drawings

FIG. 1 is a flow chart of a method of voice-based headset interaction provided by an embodiment of the present application;

FIG. 2 is a flow chart of a method of voice-based headset interaction provided by another embodiment of the present application;

Fig. 3 is a flowchart of step S102 in fig. 1;

Fig. 4 is a flowchart of step S301 in fig. 3;

Fig. 5 is a flowchart of step S404 in fig. 4;

Fig. 6 is a flowchart of step S302 in fig. 3;

Fig. 7 is a flowchart of step S104 in fig. 1;

Fig. 8 is a schematic structural diagram of a voice-based earphone interaction device according to an embodiment of the present application;

Fig. 9 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.

First, several terms used in the present application are explained:

Mel-frequency cepstral coefficients (MFCC): a technique used in sound processing to characterize audio signals. It is based on the characteristics of human auditory perception and uses the mel scale to model the ear's nonlinear perception of sound frequency. The coefficients are obtained by applying mel-scale filtering to the spectrum of the sound signal and then computing its log-power cepstrum. MFCCs are widely used in speech recognition, speech synthesis, sound classification, and music information retrieval. The technique uses mathematical transformations to extract compact features from complex sound signals that aid sound analysis, thereby simulating and extending human hearing in the field of artificial intelligence and helping machines better understand and process speech and music data.

Independent component analysis (ICA): a computational method for separating statistically independent sub-components from multi-dimensional statistical data. It is commonly used in signal processing, particularly for mixed signals, such as recovering the original signals from a mixture of multiple signal sources. ICA can be applied to the separation of sound signals, for example separating each speaker's voice in the cocktail party problem, and is also widely used in image processing, bioinformatics, financial data analysis, and other fields. With this technique, valuable information can be extracted from complex data for further analysis and processing. ICA belongs to the field of machine learning and is an important tool for data analysis and interpretation in artificial intelligence; it simulates and extends human analytical capability through mathematical and statistical methods, helping machines understand and process multi-source data more effectively.

Voiceprint: the unique acoustic features of a human voice that can be used to identify and verify an individual's identity. Each person's vocal cord structure and pronunciation habits differ, so each person's voice is unique, and these voice characteristics can be converted into a voiceprint, much as fingerprints are used for individual identification. Voiceprint recognition is a biometric recognition technology and an important branch of artificial intelligence, widely applied in security authentication, forensic identification, personal assistants, smart home systems, and the like. It mainly comprises sound signal acquisition, feature extraction, and pattern matching. Using machine learning and pattern recognition methods, useful information can be extracted from the sound signal and used to verify or identify the speaker. The technology can not only raise the intelligence level of security systems but also play an important role in human-computer interaction, simulating and extending human perception and use of sound.

Grammar unit (grammatical unit): grammar units are the basic components that make up language structure, including words, phrases, clauses, and sentences. These units play a central role in syntactic parsing, where they are used to build and understand the syntactic structure of language. Natural language processing techniques use computer algorithms to identify and parse these units in order to understand and generate human language. Application fields include machine translation, speech recognition, automatic summarization, and question-answering systems. In-depth study and application of grammar units can greatly improve a computer's ability to process and understand natural language, thereby extending and enhancing the language-processing capabilities of artificial intelligence.

Based on the above, the embodiment of the application provides a voice-based earphone interaction method and device, electronic equipment and storage medium, aiming at improving the flexibility of earphone use.

The voice-based earphone interaction method and device, electronic device, and storage medium provided by the embodiments of the present application are described in detail through the following embodiments.

The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.

Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operating/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.

The embodiment of the application provides a voice-based earphone interaction method, and relates to the technical field of earphones. The voice-based earphone interaction method provided by the embodiment of the application can be applied to a terminal, to a server, or to software running in the terminal or the server. In some embodiments, the terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like. The server may be configured as an independent physical server, as a server cluster or distributed system formed by a plurality of physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The software may be an application implementing the voice-based earphone interaction method, but is not limited to the above forms.

The application is operational with numerous general purpose or special purpose computer system environments or configurations, such as a personal computer, a server computer, a hand-held or portable device, a tablet device, a multiprocessor system, a microprocessor-based system, a set top box, a programmable consumer electronics device, a network PC, a minicomputer, a mainframe computer, a distributed computing environment that includes any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

It should be noted that, in each specific embodiment of the present application, when related processing is required according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through popup or jump to a confirmation page and the like, and after the independent permission or independent consent of the user is definitely acquired, the necessary relevant data of the user for enabling the embodiment of the application to normally operate is acquired.

Fig. 1 is an optional flowchart of a voice-based earphone interaction method provided by an embodiment of the present application, where the method in fig. 1 is applied to an earphone cabin, and the earphone cabin is connected to an earphone, and the earphone cabin is provided with a display screen, and the method may include, but is not limited to, steps S101 to S105.

Step S101, acquiring interaction instruction audio acquired by an earphone;

Step S102, carrying out interactive intention recognition on interactive instruction audio to obtain a target intention instruction;

Step S103, a target intention instruction is sent to a preset server, and instruction processing data returned by the server are obtained;

Step S104, data analysis is carried out on the instruction processing data to obtain target interaction display data and target interaction audio data;

Step S105, performing picture display on the target interaction display data based on the display screen, and sending the target interaction audio data to the earphone so that the earphone performs audio playing according to the target interaction audio data.

In steps S101 to S105 of the embodiment of the present application, interaction instruction audio collected by the earphone is first acquired, and interaction intention recognition is performed on it to obtain a target intention instruction. The target intention instruction is then sent to a preset server, and instruction processing data returned by the server is acquired. Data analysis is performed on the instruction processing data to obtain target interaction display data and target interaction audio data. Finally, the target interaction display data is displayed on the display screen, and the target interaction audio data is sent to the earphone so that the earphone plays audio according to it. In this way, the earphone cabin recognizes the target intention instruction of the earphone user and initiates a request to the server to fulfil it, and the instruction is finally carried out through the display screen provided on the earphone cabin and through the earphone. That is, the instruction issued by the earphone user can be completed by the earphone cabin and the earphone alone, without connecting to other terminal equipment, which decouples the earphone from other terminal equipment and ultimately improves the flexibility of the earphone in use.
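The order of steps S101 to S105 can be sketched as a single function running on the earphone cabin. This is a minimal sketch under stated assumptions: every callable passed in (`recognize_intent`, `query_server`, `parse_response`, `show_on_screen`, `send_to_earphone`) and the `InstructionResult` type are hypothetical placeholders, because the application does not define concrete interfaces for the recognizer, the server, the display screen, or the earphone link.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class InstructionResult:
    display_data: str   # target interaction display data (assumed to be text here)
    audio_data: bytes   # target interaction audio data


def run_interaction(
    instruction_audio: bytes,                                  # S101: audio from the earphone
    recognize_intent: Callable[[bytes], str],                  # hypothetical recognizer
    query_server: Callable[[str], dict],                       # hypothetical server client
    parse_response: Callable[[dict], InstructionResult],       # hypothetical parser
    show_on_screen: Callable[[str], None],                     # hypothetical display driver
    send_to_earphone: Callable[[bytes], None],                 # hypothetical earphone link
) -> InstructionResult:
    """Run steps S102 to S105 in order on already acquired instruction audio."""
    intent = recognize_intent(instruction_audio)   # S102: target intention instruction
    response = query_server(intent)                # S103: instruction processing data
    result = parse_response(response)              # S104: display data and audio data
    show_on_screen(result.display_data)            # S105: picture display
    send_to_earphone(result.audio_data)            # S105: audio playback on the earphone
    return result
```

Passing the collaborators in as callables keeps the sketch independent of any particular hardware and makes clear that no other terminal device takes part in the flow.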

Referring to fig. 2, in some embodiments, the voice-based earphone interaction method may further include, but is not limited to, steps S201 to S204 before step S101:

Step S201, acquiring earphone collection audio acquired by an earphone;

Step S202, extracting Mel characteristics of audio acquired by the earphone to obtain audio Mel characteristics;

Step S203, wake-up word detection is carried out on the audio Mel characteristics according to a preset wake-up word detection network, and wake-up word detection data are obtained;

Step S204, performing audio segmentation on the earphone acquisition audio according to the wake-up word detection data to obtain interaction instruction audio.

In steps S201 to S204 of the embodiment of the present application, the earphone collection audio collected by the earphone is acquired, and mel feature extraction is performed on it to obtain audio mel features. Wake-up word detection is then performed on the audio mel features according to a preset wake-up word detection network to obtain wake-up word detection data, and the earphone collection audio is segmented according to the wake-up word detection data to obtain the interaction instruction audio. This accurately determines the interaction instruction audio issued by the earphone user and avoids mistaking the user's non-instruction audio for interaction instruction audio, that is, it avoids false wake-ups. On the one hand this improves the accuracy of earphone interaction; on the other hand it reduces the energy consumption of the earphone in use and prolongs its standby time.

In step S201 of some embodiments, the earphone collection audio is the audio signal recorded by a built-in microphone on the earphone worn by the user. The audio signal includes the user's voice instruction, ambient sound, sound emitted by the device itself, and the like. For example, the user's voice instruction is "send a message to Zhang San"; the ambient sound is background noise in the user's environment or the voices of people other than the user; and the sound emitted by the device itself is an alert or notification sound, such as an incoming call ringtone, a message alert, or a system start-up sound.

In step S202 of some embodiments, mel feature extraction refers to converting the earphone collection audio into mel-frequency cepstral coefficients (MFCC), specifically through the sequential steps of pre-emphasis, framing, windowing, fast Fourier transform, mel filter bank processing, log-energy computation, and discrete cosine transform. The audio mel features are the mel-frequency cepstral coefficient feature parameters obtained from the earphone collection audio through mel feature extraction.
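The processing chain named in step S202 can be sketched with NumPy as follows. This is an illustrative sketch, not the implementation of the application: the sample rate, frame length, hop length, filter count, coefficient count, and pre-emphasis factor are assumed common defaults, and the mel scale formula used is one widely used convention.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank


def mfcc(signal, sample_rate=16000, frame_ms=25, hop_ms=10,
         n_fft=512, n_filters=26, n_coeffs=13, pre_emphasis=0.97):
    """Return an array of shape (n_frames, n_coeffs). All defaults are assumptions."""
    signal = np.asarray(signal, dtype=float)
    # 1. Pre-emphasis: boost high frequencies
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # 2. Framing: overlapping fixed-length frames (pad very short input)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(emphasized) < frame_len:
        emphasized = np.pad(emphasized, (0, frame_len - len(emphasized)))
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    # 3. Windowing: Hamming window on each frame
    frames = emphasized[idx] * np.hamming(frame_len)
    # 4. Fast Fourier transform -> power spectrum
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # 5. Mel filter bank processing
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # 6. Log-energy computation (floor avoids log(0))
    log_energies = np.log(np.maximum(energies, 1e-10))
    # 7. Discrete cosine transform (DCT-II, unnormalized), keep first n_coeffs
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (2 * n + 1) / (2 * n_filters))
    return log_energies @ basis.T
```

For example, `mfcc(audio)` on one second of 16 kHz audio yields one 13-dimensional feature vector per 10 ms hop, which is the kind of input a wake-up word detection network could consume.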

In step S203 of some embodiments, wake-up word detection refers to identifying and detecting a specific preset wake-up word in the earphone collection audio in real time, so as to wake up the earphone cabin. The wake-up word detection network is a deep learning network; in one embodiment it is a pre-trained recurrent neural network. Wake-up word detection yields the time node at which the preset wake-up word is located in the earphone collection audio, that is, the wake-up word detection data.

In step S204 of some embodiments, audio segmentation refers to cutting the earphone collection audio into an independent audio segment according to the wake-up word detection data, that is, cutting out the audio that follows the time node at which the wake-up word is located. In one embodiment, the audio from 500 ms after the wake-up word is cut out until the volume of the audio falls below a preset segmentation threshold, and the resulting segment is the interaction instruction audio.
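The segmentation rule in this embodiment can be sketched as below. The sketch rests on several assumptions that the text leaves open: the wake-up word detection data is taken to be the end time of the wake-up word in seconds, the segment is taken to start 500 ms after that time, "volume" is measured as the root-mean-square amplitude of short frames, and the frame length and threshold values are arbitrary illustrative choices.

```python
import numpy as np


def slice_instruction_audio(audio, sample_rate, wake_word_end_s,
                            offset_ms=500, frame_ms=20, volume_threshold=0.01):
    """Cut out the audio after the wake-up word until the volume drops.

    audio            : 1-D float array of earphone collection audio
    wake_word_end_s  : assumed form of the wake-up word detection data (seconds)
    volume_threshold : assumed preset segmentation threshold (RMS amplitude)
    """
    start = int((wake_word_end_s + offset_ms / 1000) * sample_rate)
    frame_len = int(sample_rate * frame_ms / 1000)
    end = start
    # Advance frame by frame; stop at the first frame quieter than the threshold
    while end + frame_len <= len(audio):
        frame = audio[end:end + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms < volume_threshold:
            break
        end += frame_len
    return audio[start:end]
```

Note that with this reading the returned segment is empty if the first frame after the offset is already below the threshold; a practical system would likely tolerate a short initial pause, which is outside what the text describes.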

In step S101 of some embodiments, the interaction instruction audio is an audio signal recorded by a built-in microphone on the earphone worn by the user, and is the audio segment obtained through wake-up word detection and audio segmentation.

Referring to fig. 3, in some embodiments, step S102 may include, but is not limited to, steps S301 to S303:

Step S301, denoising the interactive instruction audio to obtain denoising instruction audio;

Step S302, carrying out semantic recognition on denoising instruction audio to obtain target semantic data;

Step S303, generating instructions according to the target semantic data to obtain target intention instructions.

In steps S301 to S303 of the embodiment of the present application, the interaction instruction audio is denoised to obtain denoising instruction audio, semantic recognition is performed on the denoising instruction audio to obtain target semantic data, and an instruction is generated according to the target semantic data to obtain the target intention instruction. Background noise is thereby effectively filtered out, ensuring that the instruction issued by the earphone user can be accurately understood in various environments and achieving accurate earphone interaction.

Referring to fig. 4, in some embodiments, step S301 may include, but is not limited to, steps S401 to S404:

Step S401, noise detection is carried out on the interactive instruction audio to obtain noise type data and noise intensity data;

Step S402, noise removal is carried out on the interactive instruction audio according to the noise type data and the noise intensity data, and original denoising audio is obtained;

Step S403, carrying out echo cancellation on the original denoising audio to obtain preliminary denoising audio;

Step S404, performing human voice enhancement on the preliminary denoising audio to obtain denoising instruction audio.

In steps S401 to S404 of the embodiment of the present application, noise detection is performed on the interaction instruction audio to obtain noise type data and noise intensity data, and noise removal is then performed on the interaction instruction audio according to the noise type data and the noise intensity data to obtain original denoising audio. Echo cancellation is performed on the original denoising audio to obtain preliminary denoising audio, and human voice enhancement is finally performed on the preliminary denoising audio to obtain the denoising instruction audio. Various environmental noises and echoes are thereby effectively filtered out, ensuring that the instruction issued by the earphone user can still be clearly captured in complex scenes.

In step S401 of some embodiments, noise detection refers to identifying and analyzing the background noise present in the interaction instruction audio. Noise detection may be performed by spectrum analysis, feature extraction, machine learning, or statistical analysis. In one embodiment, noise detection is performed by statistical analysis: the noise type is identified by analyzing the number of zero crossings of the audio signal, and the noise intensity is quantified by calculating the energy levels in different frequency bands. Noise type data refers to the specific information about the kind of background noise obtained by noise detection, such as traffic noise, crowd noise, or mechanical noise. Noise intensity data refers to the specific numerical information about the intensity of the background noise obtained by noise detection; for example, the average intensity of traffic noise is 70 dB and the peak intensity of crowd noise is 65 dB.
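The statistical approach of this embodiment, zero crossings for the noise type and band energies for the noise intensity, can be sketched as follows. The two type labels, the zero-crossing threshold, and the band edges are assumptions chosen for illustration; the application does not say how zero-crossing counts map to types such as traffic or crowd noise. The decibel values here are relative to unit energy, not calibrated sound pressure levels like the 70 dB figure in the text.

```python
import numpy as np


def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(signal)
    return float(np.mean(signs[1:] != signs[:-1]))


def band_energies_db(signal, sample_rate, band_edges_hz):
    """Energy of each frequency band in dB relative to unit energy."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    result = []
    for low, high in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        energy = spectrum[(freqs >= low) & (freqs < high)].sum()
        result.append(float(10.0 * np.log10(max(energy, 1e-12))))
    return result


def detect_noise(signal, sample_rate, zcr_threshold=0.1,
                 band_edges_hz=(0, 500, 2000, 8000)):
    """Return (noise type data, noise intensity data).

    The labels and threshold are hypothetical: a high zero-crossing rate is
    read as predominantly high-frequency noise, a low rate as low-frequency.
    """
    zcr = zero_crossing_rate(signal)
    noise_type = "high_frequency" if zcr > zcr_threshold else "low_frequency"
    return noise_type, band_energies_db(signal, sample_rate, band_edges_hz)
```

In practice the analysis would be run on segments judged to contain only background noise, since speech itself also contributes zero crossings and band energy.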

In step S402 of some embodiments, noise removal refers to effectively removing the background noise components from the interaction instruction audio while preserving the speech signal. The noise removal method may be Wiener filtering, a deep-learning-based method, or the like, and the present application is not particularly limited in this respect. The original denoising audio is the audio signal obtained after noise removal is completed, that is, a section of audio from which the noise has been removed.
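As one possible reading of the Wiener filtering option, the sketch below applies a frame-wise spectral Wiener gain with 50% overlap-add. It is a simplified illustration under stated assumptions: a separate noise-only sample at least one frame long is assumed to be available for estimating the noise power spectrum, and the frame length and gain floor are arbitrary. How the noise type data and noise intensity data would steer these parameters is not specified in the text and is not modeled here.

```python
import numpy as np


def wiener_denoise(noisy, noise_sample, frame_len=512, gain_floor=0.05):
    """Spectral Wiener filter; the first and last half frames are attenuated.

    noisy        : 1-D float array, speech plus background noise
    noise_sample : 1-D float array of noise only (assumed available,
                   must contain at least frame_len samples)
    """
    hop = frame_len // 2
    # Periodic Hann window: copies shifted by half a frame sum to one
    window = np.hanning(frame_len + 1)[:-1]

    # Average noise power spectrum over the frames of the noise-only sample
    noise_frames = [noise_sample[i:i + frame_len] * window
                    for i in range(0, len(noise_sample) - frame_len + 1, hop)]
    noise_psd = np.mean([np.abs(np.fft.rfft(f)) ** 2 for f in noise_frames], axis=0)

    output = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        spectrum = np.fft.rfft(noisy[start:start + frame_len] * window)
        noisy_psd = np.abs(spectrum) ** 2
        # Wiener gain S / (S + N) with S estimated as noisy_psd - noise_psd,
        # limited from below to reduce musical-noise artifacts
        gain = np.maximum(1.0 - noise_psd / np.maximum(noisy_psd, 1e-12), gain_floor)
        output[start:start + frame_len] += np.fft.irfft(gain * spectrum, n=frame_len)
    return output
```

The gain stays close to one in frequency bins dominated by speech and falls toward the floor in bins where the observed power is close to the estimated noise power.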

In step S403 of some embodiments, echo cancellation refers to removing echo components in the audio that are generated by sound reflection or by sound emitted by the device itself, so as to improve the clarity and intelligibility of the speech signal and prevent echo interference. In one embodiment, the earphone is provided with a plurality of microphones, and independent component analysis (ICA) is used to separate the speech signal from the echo signal based on the statistical independence of the input signals of the plurality of microphones. The preliminary denoising audio is a segment of audio obtained from the original denoising audio, in which the noise part and the echo part have been eliminated and the human voice part is retained.

Referring to fig. 5, in some embodiments, step S404 includes, but is not limited to, steps S501 to S503:

Step S501, voice extraction is carried out on the preliminary denoising audio to obtain voice audio data;

Step S502, voiceprint recognition is carried out on the voice audio data to obtain target voice audio;

Step S503, performing voice reinforcement on the target voice audio to obtain the denoising instruction audio.

In steps S501 to S503 shown in the embodiment of the present application, voice extraction is performed on the preliminary denoising audio to obtain voice audio data, voiceprint recognition is then performed on the voice audio data to obtain target voice audio, and finally voice reinforcement is performed on the target voice audio to obtain the denoising instruction audio. In this way, the voice of the target user is accurately extracted and recognized, ensuring that the earphone and the earphone bin respond only to the instruction of the target user, namely the earphone user, and preventing erroneous responses.

In step S501 of some embodiments, voice extraction refers to separating and extracting the human voice part from the preliminary denoising audio, so that the voice instruction of the target user can be accurately identified and isolated from a complex audio environment. Voice extraction may be implemented by independent component analysis or by a deep learning method. In one embodiment, voice extraction is performed by independent component analysis (ICA): the audio signals collected by the plurality of microphones on the earphone are each subjected to noise removal and echo cancellation to obtain a plurality of audio signals, namely the preliminary denoising audio, and independent component analysis is performed on these signals to separate the human voice from other sound sources. The voice audio data is a segment of audio, namely the audio signal of the human voice part obtained by voice extraction from the preliminary denoising audio, and may include the voices of a plurality of sound sources.

In step S502 of some embodiments, voiceprint recognition refers to verifying or recognizing the identity of a speaker by analyzing the acoustic features unique to an individual, that is, identifying the sound source corresponding to the earphone user in the voice audio data, which may include a plurality of sound sources. The voiceprint features of the earphone user are preset in the earphone bin, the audio that matches the voiceprint of the earphone user is extracted, and the extracted audio is the target voice audio.
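One way to picture the matching in step S502 is to compare a feature vector of each separated sound source against the preset voiceprint of the earphone user. The sketch below is illustrative only: it assumes that each source has already been reduced to a fixed-length embedding, and the function names, the cosine-similarity measure, and the 0.8 threshold are assumptions that do not appear in the application.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_target_source(source_embeddings, enrolled_voiceprint, threshold=0.8):
    """Return the index of the source best matching the enrolled voiceprint.

    Returns None when no source reaches the (hypothetical) threshold, in which
    case no audio would be treated as the target voice audio.
    """
    best_index, best_score = None, threshold
    for index, embedding in enumerate(source_embeddings):
        score = cosine_similarity(embedding, enrolled_voiceprint)
        if score >= best_score:
            best_index, best_score = index, score
    return best_index
```

Returning None when nothing matches reflects the stated goal of responding only to the earphone user.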

In step S503 of some embodiments, voice reinforcement refers to enhancing the clarity, loudness, and intelligibility of the voice part in the audio while further suppressing residual noise and interference, so as to optimize the quality of the voice instruction and ensure that the voice instruction of the earphone user can be accurately recognized and understood in various environments. In one embodiment, a threshold is set, background noise below the threshold is automatically suppressed, and only the voice signal above the threshold is retained, thereby realizing voice reinforcement; the audio obtained by voice reinforcement is the denoising instruction audio. The denoising instruction audio includes only the voice of the earphone user; noise, echo, and the voices of persons other than the earphone user are eliminated.
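The threshold-based suppression described for step S503 resembles a simple noise gate. The following sketch assumes that the threshold is applied to the root-mean-square level of fixed-size frames; the frame size, the threshold value, and the function name are hypothetical, since the application does not state how the threshold is measured.

```python
import math


def noise_gate(samples, frame_size, threshold):
    """Zero out frames whose RMS level is below the threshold; keep the rest."""
    gated = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        # Frames below the threshold are treated as residual background noise.
        gated.extend(frame if rms >= threshold else [0.0] * len(frame))
    return gated
```

For example, `noise_gate([0.01, -0.01, 0.5, -0.5], 2, 0.1)` silences the first frame and keeps the second.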

Referring to fig. 6, in some embodiments, step S302 includes, but is not limited to, steps S601 to S604:

Step S601, performing text conversion on the denoising instruction audio to obtain a denoising instruction text;

Step S602, performing entity recognition on the denoising instruction text to obtain a denoising instruction entity;

Step S603, carrying out grammar recognition on the denoising instruction text to obtain a denoising instruction grammar unit;

Step S604, carrying out semantic synthesis according to the denoising instruction entity and the denoising instruction grammar unit to obtain target semantic data.

In the steps S601 to S604 shown in the embodiment of the present application, text conversion is performed on the denoising instruction audio to obtain a denoising instruction text, then entity recognition is performed on the denoising instruction text to obtain a denoising instruction entity, grammar recognition is performed on the denoising instruction text to obtain a denoising instruction grammar unit, and finally semantic synthesis is performed according to the denoising instruction entity and the denoising instruction grammar unit to obtain target semantic data, so that accurate entity recognition and grammar analysis are realized, it is ensured that the earphone bin can correctly understand the intention and the requirement of an earphone user, and the earphone bin can correctly respond to the instruction through accurate semantic synthesis.

In step S601 of some embodiments, text conversion refers to converting the speech signal in the denoising instruction audio into corresponding text by using speech recognition technology. The text conversion may be speech recognition based on a hidden Markov model or speech recognition based on a DeepSpeech neural network model. In one embodiment, text conversion is performed on the denoising instruction audio based on a pre-trained DeepSpeech neural network model, and the text-form data obtained after conversion is the denoising instruction text.

In step S602 of some embodiments, entity recognition refers to identifying and classifying entities with a specific meaning in the denoising instruction text, such as person names, place names, organizations, times, or numbers. In one embodiment, rule-based entity recognition is used: predefined rules and dictionaries are matched against specific patterns and keywords in the text to identify the entities in the denoising instruction text, thereby obtaining the denoising instruction entity. For example, "navigate to the nearest coffee shop" is the denoising instruction entity.
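Rule-based matching of this kind can be sketched with a small keyword dictionary and a regular expression. The dictionary contents, the entity labels, and the time pattern below are hypothetical examples chosen for illustration; the application does not disclose its actual rules or dictionaries.

```python
import re

# Hypothetical dictionary of place keywords.
PLACE_KEYWORDS = ["coffee shop", "gas station", "airport"]
# Hypothetical pattern for clock times such as "3 pm" or "10:30 am".
TIME_PATTERN = re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", re.IGNORECASE)


def recognize_entities(text):
    """Return (label, surface form) pairs found by dictionary and pattern rules."""
    entities = []
    lowered = text.lower()
    for keyword in PLACE_KEYWORDS:
        if keyword in lowered:
            entities.append(("place", keyword))
    for match in TIME_PATTERN.finditer(text):
        entities.append(("time", match.group(0)))
    return entities
```

With these illustrative rules, the text "Navigate to the nearest coffee shop at 3 pm" yields one place entity and one time entity.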

In step S603 of some embodiments, grammar recognition refers to analyzing the denoising instruction text and identifying and parsing the structure and components of the sentence, including part-of-speech tagging, phrase structure analysis, and syntactic relationship recognition. For example, if the denoising instruction text is "navigate to nearest coffee shop", constituency parsing of the denoising instruction text identifies "navigate to" as a verb phrase and "nearest coffee shop" as a noun phrase. The denoising instruction grammar unit refers to a component with a specific grammatical function extracted after grammar recognition. Grammar units include, but are not limited to, a subject, a predicate, an object, a time object, and a place object; for example, the denoising instruction grammar unit is ("verb phrase": "navigate to", "noun phrase": "nearest coffee shop").

In step S604 of some embodiments, semantic synthesis refers to integrating and analyzing the denoising instruction entity and the denoising instruction grammar unit to generate comprehensive and accurate target semantic data. In one embodiment, semantic synthesis is performed on the denoising instruction entity and the denoising instruction grammar unit based on a pre-trained natural language model; specifically, a preset prompt, the denoising instruction entity, and the denoising instruction grammar unit are input into the natural language model to obtain the target semantic data.

In step S303 of some embodiments, instruction generation refers to generating an instruction in JSON format based on the target semantic data, where the instruction includes, but is not limited to, (user identity, operation type, operation object). For example, the target intention instruction is (user one, navigation, nearest coffee shop).
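A possible JSON encoding of the (user identity, operation type, operation object) triple is sketched below. The key names are assumptions chosen for illustration, since the application states only that the instruction is in JSON format and lists the three items.

```python
import json


def build_intent_instruction(user_identity, operation_type, operation_object):
    """Serialize the three instruction items as a JSON string (key names are hypothetical)."""
    instruction = {
        "user_identity": user_identity,
        "operation_type": operation_type,
        "operation_object": operation_object,
    }
    return json.dumps(instruction, ensure_ascii=False)
```

Calling `build_intent_instruction("user one", "navigation", "nearest coffee shop")` produces a JSON object carrying the example instruction from the text.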

In step S103 of some embodiments, the preset server is a preset cloud server that establishes a wireless connection with the earphone bin; the wireless connection may be realized through Bluetooth, a local area network, or the Internet. The target intention instruction is sent to the server, and the server returns instruction processing data in JSON format, where the instruction processing data includes, but is not limited to, a data category, target interaction audio data, and original display data. The data category indicates the category of the original display data, such as text, image, or video; the original display data is the corresponding text, image, or video; and the target interaction audio data is the audio to be played on the earphone.

Referring to fig. 7, in some embodiments, step S104 may include, but is not limited to, steps S701 to S704:

Step S701, carrying out format analysis on the instruction processing data to obtain the original display data, the target interaction audio data, and the data category of the original display data;

Step S702, if the data category is represented as a broadcasting category, performing text rendering according to the original display data to obtain target interaction display data;

Step S703, if the data category is represented as an image category, performing image rendering according to the original display data to obtain target interactive display data;

Step S704, if the data category is represented as a video category, video rendering is performed according to the original display data, so as to obtain target interactive display data.

In steps S701 to S704 shown in the embodiment of the present application, format analysis is performed on the instruction processing data to obtain the original display data, the target interactive audio data, and the data category of the original display data. If the data category is represented as a broadcast category, text rendering is performed according to the original display data to obtain the target interactive display data; if the data category is represented as an image category, image rendering is performed according to the original display data to obtain the target interactive display data; and if the data category is represented as a video category, video rendering is performed according to the original display data to obtain the target interactive display data. In this way, multiple functions are realized through the earphone, and the flexibility of the earphone and the earphone bin in use is improved.

In step S701 of some embodiments, format parsing refers to analyzing and processing the instruction processing data to identify its structure, content, and data type, that is, extracting the values corresponding to the fields from the JSON-format instruction processing data, so as to obtain the original display data, the target interactive audio data, and the data category of the original display data.
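The field extraction in step S701 can be sketched as follows. The key names are hypothetical stand-ins for the data category, the target interactive audio data, and the original display data, because the application does not give the actual field names of the JSON returned by the server.

```python
import json

# Hypothetical field names for the three items named in the text.
REQUIRED_FIELDS = ("data_category", "target_interaction_audio_data", "original_display_data")


def parse_instruction_processing_data(payload):
    """Extract the three required values from a JSON string, in a fixed order."""
    data = json.loads(payload)
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return tuple(data[field] for field in REQUIRED_FIELDS)
```

Raising an error on a missing field is a design choice of this sketch; it keeps a malformed server response from reaching the rendering steps.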

In step S702 of some embodiments, if the data category is represented as a broadcast category, text rendering is performed on the original display data to obtain target interactive display data. For example, the data category is broadcasting, the original display data is a section of weather broadcasting, and then the content of the weather broadcasting is rendered into a text, so that the target interactive display data is obtained.

In step S703 of some embodiments, if the data category is represented as an image category, the original display data is subjected to image resolution adjustment, and the adjusted original display data is used as the target interactive display data.

In step S704 of some embodiments, if the data category is represented as a video category, image resolution adjustment is performed on each video frame, and the adjusted video frames are used as the target interactive display data.
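The branching in steps S702 to S704 can be summarized, under stated assumptions, as a dispatch on the data category followed by a resolution adjustment for images and video frames. In the sketch below, the category strings, the dictionary layout of the display data, and the aspect-ratio-preserving scaling rule are assumptions for illustration; the application says only that the resolution is adjusted.

```python
def fit_resolution(src_width, src_height, screen_width, screen_height):
    """Scale a source size to fit the screen while preserving the aspect ratio."""
    scale = min(screen_width / src_width, screen_height / src_height)
    return max(1, round(src_width * scale)), max(1, round(src_height * scale))


def plan_rendering(data_category, original_display_data, screen_size):
    """Choose a rendering mode from the data category (category names are hypothetical)."""
    if data_category == "broadcast":
        return {"mode": "text", "content": str(original_display_data)}
    if data_category == "image":
        width, height = original_display_data["size"]
        return {"mode": "image", "size": fit_resolution(width, height, *screen_size)}
    if data_category == "video":
        # The same target size would be applied to every video frame.
        width, height = original_display_data["size"]
        return {"mode": "video", "frame_size": fit_resolution(width, height, *screen_size)}
    raise ValueError(f"unsupported data category: {data_category}")
```

For a hypothetical 320 by 240 display screen, a 1920 by 1080 frame would be scaled to 320 by 180.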

In step S105 of some embodiments, the target interactive display data is displayed on the display screen, the corresponding target interactive audio data is sent to the earphone, and the earphone plays the audio, so that the earphone and the earphone bin together complete the final processing and presentation of the instruction issued by the earphone user.

Referring to fig. 8, an embodiment of the present application further provides a voice-based earphone interaction device, which may implement the foregoing voice-based earphone interaction method, where the device includes:

an acquisition data module 801, configured to acquire interaction instruction audio acquired by an earphone;

The intention recognition module 802 is configured to recognize an interactive intention of the interactive instruction audio to obtain a target intention instruction;

the instruction processing module 803 is configured to send a target intention instruction to a preset server, and obtain instruction processing data returned by the server;

the data parsing module 804 is configured to parse the instruction processing data to obtain target interactive display data and target interactive audio data;

The picture display module 805 is configured to display a picture of the target interactive display data based on the display screen, and send the target interactive audio data to the headphones, so that the headphones play audio according to the target interactive audio data.

The specific implementation manner of the voice-based earphone interaction device is basically the same as the specific embodiment of the voice-based earphone interaction method, and will not be described herein.
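The relationship between modules 801 to 805 can be pictured as a simple pipeline in which each module feeds the next. The class below is only an illustrative skeleton: the class name, the parameter names, and the idea of injecting each module as a callable are assumptions, and the real modules would wrap the steps described in the method embodiments.

```python
class EarphoneInteractionDevice:
    """Chains five callables that stand in for modules 801 to 805."""

    def __init__(self, acquire_audio, recognize_intent, process_instruction,
                 parse_data, display_and_play):
        self.acquire_audio = acquire_audio              # module 801
        self.recognize_intent = recognize_intent        # module 802
        self.process_instruction = process_instruction  # module 803
        self.parse_data = parse_data                    # module 804
        self.display_and_play = display_and_play        # module 805

    def run_once(self):
        """Run one interaction from audio acquisition to display and playback."""
        audio = self.acquire_audio()
        instruction = self.recognize_intent(audio)
        processing_data = self.process_instruction(instruction)
        display_data, audio_data = self.parse_data(processing_data)
        return self.display_and_play(display_data, audio_data)
```

Injecting the modules as callables lets each one be replaced by a stub, which is convenient when checking the order of the data flow without real audio or a server.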

The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the earphone interaction method based on voice when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.

Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:

a processor 901, which may be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), or one or more integrated circuits, and which is configured to execute related programs so as to implement the technical solution provided by the embodiments of the present application;

a memory 902, which may be implemented in the form of read-only memory (ROM), a static storage device, a dynamic storage device, or random access memory (RAM). The memory 902 may store an operating system and other application programs. When the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program code is stored in the memory 902 and is invoked by the processor 901 to execute the voice-based earphone interaction method of the embodiments of the present application;

An input/output interface 903 for inputting and outputting information;

a communication interface 904, configured to implement communication between the device and other devices, where communication may be implemented in a wired manner (e.g., USB or network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, or Bluetooth);

a bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);

wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.

The embodiment of the application also provides a computer readable storage medium, which stores a computer program, and the computer program realizes the voice-based earphone interaction method when being executed by a processor.

The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

According to the voice-based earphone interaction method and device, the electronic equipment, and the storage medium, the interaction instruction audio collected by the earphone is first acquired, and interaction intention recognition is performed on the interaction instruction audio to obtain a target intention instruction. The target intention instruction is then sent to a preset server, and the instruction processing data returned by the server is acquired. Data analysis is performed on the instruction processing data to obtain target interaction display data and target interaction audio data. Finally, the target interaction display data is displayed on the display screen, and the target interaction audio data is sent to the earphone, so that the earphone plays audio according to the target interaction audio data. In this way, the earphone bin recognizes the target intention instruction of the earphone user and initiates a request to the server to fulfill it, and the result is presented through the display screen provided on the earphone bin and played through the earphone. That is, the intention instruction issued by the earphone user can be completed by the earphone bin and the earphone alone, without connecting to other terminal equipment, so that the earphone is decoupled from other terminal equipment and the flexibility of the earphone in use is improved.

The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.

It will be appreciated by persons skilled in the art that the embodiments of the application are not limited by the illustrations, and that more or fewer steps than those shown may be included, or certain steps may be combined, or different steps may be included.

The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.

The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or a similar expression means any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including a plurality of instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. The storage medium includes various media capable of storing programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and are not thereby limiting the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims

1. A voice-based earphone interaction method, which is characterized by being applied to an earphone bin, wherein the earphone bin is connected with an earphone, and is provided with a display screen, and the method comprises the following steps:

acquiring interaction instruction audio acquired by the earphone;

2. The method of claim 1, wherein the performing the interactive intention recognition on the interactive instruction audio to obtain a target intention instruction comprises:

3. The method according to claim 2, wherein the denoising the interactive instruction audio to obtain denoised instruction audio includes:

4. The method of claim 3, wherein said performing human voice enhancement on said preliminary denoising audio to obtain said denoising instruction audio comprises:

5. The method according to claim 2, wherein said performing semantic recognition on the denoised instruction audio to obtain target semantic data comprises:

6. The method according to any one of claims 1 to 5, wherein the data parsing the instruction processing data to obtain target interactive presentation data and target interactive audio data includes:

7. The method of any one of claims 1 to 5, further comprising, prior to the acquiring the interactive instruction audio captured by the headphones:

Acquiring earphone acquisition audio acquired by the earphone;

8. A voice-based earphone interaction device, characterized in that it is applied to an earphone bin, the earphone bin is connected with an earphone, the earphone bin is provided with a display screen, the device comprises:

9. An electronic device comprising a memory storing a computer program and a processor implementing the voice-based headset interaction method of any of claims 1 to 7 when the computer program is executed by the processor.

10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the voice-based headset interaction method of any of claims 1 to 7.