CN115482829B

CN115482829B - Voice processing method, audio and video communication device and vehicle

Info

Publication number: CN115482829B
Application number: CN202211020210.8A
Authority: CN
Inventors: 王子腾; 纳跃跃; 田彪; 付强
Original assignee: Alibaba Damo Institute Hangzhou Technology Co Ltd
Current assignee: Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date: 2022-08-24
Filing date: 2022-08-24
Publication date: 2026-01-27
Anticipated expiration: 2042-08-24
Also published as: CN115482829A

Abstract

This application discloses a speech processing method, an audio-visual communication device, and a vehicle. The method includes: acquiring a raw speech set collected by a sound pickup device, wherein the raw speech set includes a first speech and a second speech, wherein the first speech is a speech signal emitted from a sound source located within an effective interaction area, and the second speech is a speech signal emitted from a sound source other than the sound source located within the effective interaction area, and the sound source within the effective interaction area is the speech interaction object identified by the sound pickup device; performing enhancement processing on the first speech and the second speech respectively to obtain enhanced first speech and second speech; and using a deep learning model to perform signal recovery processing on the raw speech set, the enhanced first speech, and the second speech to generate target speech. This application solves the technical problem in related technologies of the difficulty in picking up speech information from sound sources within an effective interaction area.

Description

Voice processing method, audio and video communication device and vehicle

Technical Field

The present application relates to the field of data processing, and in particular, to a voice processing method, an audio/video communication device, and a vehicle.

Background

Currently, in AIoT (AI + IoT, artificial intelligence of things) voice interaction scenarios, a microphone array is used to pick up the voice of a target speaker and pass it to a downstream voice recognition model for recognition. However, noise and interference from non-target sound sources are usually present in the voice interaction environment, which degrades the voice quality of sound sources in the effective interaction area and makes voice information harder to pick up.

In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiment of the application provides a voice processing method, audio and video communication equipment and a vehicle, which at least solve the technical problem that voice information of a sound source in an effective interaction area is difficult to pick up in the related technology.

According to one aspect of the embodiment of the application, a voice processing method is provided, which comprises the steps of obtaining an original voice set acquired by pickup equipment, wherein the original voice set comprises first voices and second voices, the first voices are voice signals sent by sound sources located in an effective interaction area, the second voices are voice signals sent by other sound sources except the sound sources located in the effective interaction area, the sound sources located in the effective interaction area are voice interaction objects for directional recognition of the pickup equipment, respectively carrying out enhancement processing on the first voices and the second voices to obtain enhanced first voices and second voices, and carrying out signal recovery processing on the original voice set and the enhanced first voices and the enhanced second voices by using a deep learning model to generate target voices, wherein the target voices are voice information directionally picked up by the pickup equipment.

According to one aspect of the embodiment of the application, a voice processing method is provided, which comprises the steps of capturing an original voice set acquired by pickup equipment arranged on audio and video communication equipment, wherein the original voice set comprises first voices and second voices, the first voices are voice signals sent out by sound sources located in an effective interaction area, the second voices are voice signals sent out by other sound sources except the sound sources located in the effective interaction area, the sound sources located in the effective interaction area are voice interaction objects for directional recognition of the pickup equipment, respectively carrying out enhancement processing on the first voices and the second voices to obtain enhanced first voices and second voices, carrying out signal recovery processing on the original voice set and the enhanced first voices and the enhanced second voices to generate target voices, and controlling the audio and video communication equipment to output the target voices.

According to one aspect of the embodiment of the application, a voice processing method is provided, which comprises the steps of capturing an original voice set acquired by pickup equipment arranged on a target vehicle, wherein the original voice set comprises first voices and second voices, the first voices are voice signals sent out by sound sources located in an effective interaction area, the second voices are voice signals sent out by other sound sources except the sound sources located in the effective interaction area, the sound sources located in the effective interaction area are voice interaction objects for directional recognition of the pickup equipment, performing enhancement processing on the first voices and the second voices respectively to obtain enhanced first voices and enhanced second voices, performing signal recovery processing on the original voice set and the enhanced first voices and the enhanced second voices by using a deep learning model to generate target voices, and controlling the target vehicle based on the target voices.

According to one aspect of the embodiment of the application, a voice processing method is provided, which comprises the steps that a cloud server receives an original voice set uploaded by a client, wherein the original voice set is acquired through pickup equipment, the original voice set comprises first voices and second voices, the first voices are voice signals sent out by sound sources located in an effective interaction area, the second voices are voice signals sent out by other sound sources except the sound sources located in the effective interaction area, the sound sources located in the effective interaction area are voice interaction objects of pickup equipment directional recognition, the cloud server carries out enhancement processing on the first voices and the second voices respectively to obtain enhanced first voices and second voices, the cloud server carries out signal recovery processing on the original voice set and the enhanced first voices and the enhanced second voices by means of a deep learning model to generate target voices, the target voices are voice information directionally picked up by the pickup equipment, and the cloud server outputs the target voices to the client.

According to one aspect of the embodiment of the application, a voice processing system is provided, which comprises a sound pickup device and a processing device, wherein the sound pickup device is used for collecting an original voice set, the original voice set comprises a first voice and a second voice, the first voice is a voice signal sent by a sound source located in an effective interaction area, the second voice is a voice signal sent by other sound sources except the sound source located in the effective interaction area, the sound source located in the effective interaction area is a voice interaction object for directional recognition of the sound pickup device, the processing device is connected with the sound pickup device and used for conducting enhancement processing on the first voice and the second voice respectively to obtain the first voice and the second voice after the enhancement processing, and the first voice and the second voice after the enhancement processing are subjected to signal recovery processing by utilizing a deep learning model to generate target voice, and the target voice is voice information directionally picked up by the sound pickup device.

According to one aspect of the embodiment of the application, an audio-video communication device is provided, which comprises a pickup device arranged on the audio-video communication device and used for collecting an original voice set, wherein the original voice set comprises a first voice and a second voice, the first voice is a voice signal sent by a sound source positioned in an effective interaction area, the second voice is a voice signal sent by other sound sources except the sound source positioned in the effective interaction area, the sound source positioned in the effective interaction area is a voice interaction object for directional recognition of the pickup device, a processor is connected with the pickup device and used for carrying out enhancement processing on the first voice and the second voice respectively to obtain the first voice and the second voice after the enhancement processing, and carrying out signal recovery processing on the original voice set and the first voice and the second voice after the enhancement processing by utilizing a deep learning model to generate target voice, and an output device is connected with the processor and used for outputting the target voice.

According to one aspect of the embodiment of the application, a vehicle is provided, which comprises a pickup device arranged on the vehicle and used for collecting an original voice set, wherein the original voice set comprises a first voice and a second voice, the first voice is a voice signal sent out by a sound source positioned in an effective interaction area, the second voice is a voice signal sent out by other sound sources except the sound source positioned in the effective interaction area, the sound source positioned in the effective interaction area is a voice interaction object for directional recognition of the pickup device, and the controller is connected with the pickup device and used for carrying out enhancement processing on the first voice and the second voice respectively to obtain the first voice and the second voice after the enhancement processing, carrying out signal recovery processing on the original voice set, the first voice and the second voice after the enhancement processing by using a deep learning model, generating a target voice and controlling the target vehicle based on the target voice.

According to an aspect of the embodiment of the present application, there is also provided a computer-readable storage medium, including a stored program, where the program, when executed, controls a device in which the storage medium is located to perform any one of the above-mentioned speech processing methods.

In the embodiment of the application, an original voice set acquired by pickup equipment is first obtained, wherein the original voice set comprises a first voice and a second voice, the first voice is voice information sent by a sound source located in an effective interaction area, the second voice is voice information sent by sound sources other than the sound source in the effective interaction area, and the sound source located in the effective interaction area is the voice interaction object directionally recognized by the pickup equipment. The first voice and the second voice are then enhanced respectively to obtain the enhanced first voice and the enhanced second voice. Finally, a deep learning model is used to perform signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice to generate a target voice, the target voice being the voice information directionally picked up by the pickup equipment. In this way, sound source interference and environmental noise from outside the effective interaction area are effectively suppressed, and the extraction of voice information within the effective interaction area is improved. It is easy to note that enhancement processing can be performed separately on the first voice emitted by the sound source in the effective interaction area and on the second voice emitted by the other sound sources, and that the second voice is effectively suppressed in combination with the deep learning model, so that the pickup equipment can directionally pick up the voice information in the effective interaction area. This solves the technical problem in the related art that voice information of a sound source in the effective interaction area is difficult to pick up.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a block diagram of a hardware structure of a computer terminal (or mobile device) for implementing a voice processing method according to an embodiment of the present application;

FIG. 2 is a flowchart of a speech processing method according to embodiment 1 of the present application;

FIG. 3 is a top view of a simulated room environment according to an embodiment of the application;

FIG. 4 is a user interface schematic diagram of a training deep learning model according to an embodiment of the application;

FIG. 5 is a schematic diagram of a deep learning model according to an embodiment of the application;

FIG. 6 is a block diagram of a speech processing flow according to an embodiment of the present application;

FIG. 7 is a flowchart of a voice processing method according to embodiment 2 of the present application;

FIG. 8 is a flowchart of a voice processing method according to embodiment 3 of the present application;

FIG. 9 is a flowchart of a voice processing method according to embodiment 4 of the present application;

FIG. 10 is a schematic diagram of a speech processing system according to embodiment 5 of the present application;

FIG. 11 is a schematic diagram of an audio-video communication device according to embodiment 6 of the present application;

FIG. 12 is a schematic illustration of a vehicle according to embodiment 7 of the application;

FIG. 13 is a schematic view of a speech processing apparatus according to embodiment 8 of the present application;

FIG. 14 is a schematic view of a speech processing apparatus according to embodiment 9 of the present application;

FIG. 15 is a schematic view of a speech processing apparatus according to embodiment 10 of the present application;

FIG. 16 is a schematic diagram of a speech processing apparatus according to embodiment 11 of the present application;

FIG. 17 is a block diagram showing the structure of a computer terminal according to embodiment 12 of the present application.

Detailed Description

In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

First, some of the terms appearing in the description of the embodiments of the application are explained as follows:

Beamforming (for example, with a superdirective beamformer), also known as spatial filtering, is a processing technique that uses a microphone array to receive signals directionally.

Null beamformer (notch): mainly used for filtering out signals at a certain frequency point.

RIR (Room Impulse Response): describes the room transfer function from the sound source location to the microphone location.
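As an illustration of the RIR definition, the signal observed at a microphone can be modeled as the source signal convolved with the room impulse response. The following is a minimal sketch with NumPy; the function name and the toy impulse response values are hypothetical and are not taken from the application.

```python
import numpy as np

def simulate_microphone_signal(source, rir):
    """Model the microphone signal as the source convolved with the RIR."""
    return np.convolve(source, rir)

# Toy RIR (hypothetical values): a direct path plus one echo three samples later.
rir = np.array([1.0, 0.0, 0.0, 0.5])
source = np.array([1.0, 2.0, 0.0, 0.0])
mic_signal = simulate_microphone_signal(source, rir)
```

In a simulated room, one such RIR would be needed for every pair of sound source position and microphone position.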

At present, array algorithms such as beamforming or blind source separation are generally used to improve the voice quality of sound sources in an effective interaction area. However, the performance of such methods based on classical signal processing is limited, especially when the microphone array has few microphones and there are many interfering sound sources.

In order to solve the above problems, the present application combines a directional beamforming algorithm with a deep learning model, which can effectively suppress sound source interference and environmental noise from non-target directions and makes it easier to pick up voice information from the target direction, thereby ensuring the voice interaction experience in the target direction.

Example 1

There is also provided in accordance with an embodiment of the present application a speech processing method, it being noted that the steps shown in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order other than that shown or described herein.

The method embodiments provided by the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or a similar computing device. FIG. 1 is a block diagram of a hardware structure of a computer terminal (or mobile device) for implementing a voice processing method according to an embodiment of the present application. As shown in FIG. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal 10 may further include a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in FIG. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). The data processing circuit acts as a processor control (e.g., selection of the path of the variable resistor termination connected to the interface).

The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the speech processing method in the embodiment of the present application, and the processor 102 executes the software programs and modules stored in the memory 104 to perform various functional applications and data processing, that is, implement the speech processing method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).

In the above-described operating environment, the present application provides a speech processing method as shown in fig. 2. It should be noted that, the voice processing method of this embodiment may be executed by the computer terminal of the embodiment shown in fig. 1. Fig. 2 is a flowchart of a voice processing method according to embodiment 1 of the present application. As shown in fig. 2, the method may include the steps of:

Step S202, an original voice set acquired by the pickup device is acquired.

The original voice set comprises first voices and second voices, wherein the first voices are voice signals sent by sound sources located in an effective interaction area, the second voices are voice signals sent by other sound sources except the sound sources located in the effective interaction area, and the sound sources located in the effective interaction area are voice interaction objects for directional recognition of sound pickup equipment.

The sound pickup apparatus may be a sound pickup, wherein the sound pickup is an accessory for collecting live sound, which is an electroacoustic instrument for amplifying sound by receiving vibration of the sound. The sound pickup can be composed of a microphone and an audio amplifying circuit, and can be generally divided into a digital sound pickup and an analog sound pickup, wherein the digital sound pickup is sound sensing equipment for converting an analog audio signal into a digital signal through a digital signal processing system and performing corresponding digital signal processing, and the analog sound pickup is used for amplifying collected sound through the microphone.

The pickup device may be one or more microphones. A plurality of microphones may form a microphone array, for example an array formed by linearly arranging a plurality of directional microphones, which may be a uniform linear array or a non-uniform linear array; the specific type may be determined according to actual needs.

The voice interaction object may be a sound source in the effective interaction area. The sound source may be one or more users making sounds, a device capable of making sound such as a smart speaker or a television, or a pet. In a driving environment, the sound source may also be a passenger.

The effective interaction area may be an interaction area where the sound pickup apparatus can recognize voice. The effective interaction area may be an interaction area preset according to an actual scene. Optionally, any interaction area may be selected as an effective interaction area, and an interaction area close to the sound pickup apparatus may be selected as an effective interaction area. The effective interaction area may be selected according to actual requirements.

The first voice may be voice information emitted from a sound source located in the effective interaction area, and the second voice may be a voice signal emitted from a sound source other than the sound source located in the effective interaction area. Interaction areas outside the effective interaction area may be ineffective interaction areas. It should be noted that, when the pickup device collects the original voice set, the quality of the first voice is lowered by interference from the second voice in the ineffective interaction area, which reduces the effect of voice recognition and worsens the voice interaction experience. Therefore, after the original voice set is collected, the first voice and the second voice in the original voice set need to be processed so as to improve the quality of the first voice and thereby the voice recognition effect.

In an alternative embodiment, a video call scenario is taken as an example, where a region near a terminal microphone may be an effective interaction region, and a region far from the terminal microphone may be an ineffective interaction region, where a voice signal emitted by a sound source near the terminal microphone may be a first voice, and a voice signal emitted by another sound source far from the terminal microphone is a second voice.

In an alternative embodiment, the sound pickup device may acquire an original voice set, and determine, according to a preset effective interaction area, a voice signal belonging to a first voice and a voice signal belonging to a second voice in the original voice set, so as to distinguish a voice signal sent by a sound source in the effective interaction area from a voice signal sent by a sound source other than the effective interaction area, so as to facilitate subsequent extraction to obtain voice information directionally picked up by the sound pickup device.

Step S204, enhancement processing is performed on the first voice and the second voice respectively, and the first voice and the second voice after the enhancement processing are obtained.

In an alternative embodiment, the voice signal in the effective interaction area may be enhanced by using a beamforming algorithm to obtain the enhanced first voice. Optionally, a Fourier transform may be performed on the original voice set; in each frequency band, the filter coefficients of the beam are solved with a convex optimization tool (CVX) according to the microphone array topology and the first voice in the effective interaction area; and the original voice set is then filtered to obtain the processed frequency-domain signal, that is, the enhanced first voice.
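The per-frequency filtering described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: instead of solving the filter coefficients with a convex optimization tool as in the application, it uses the well-known closed-form superdirective solution (distortionless response against a diffuse noise field, with diagonal loading), a far-field model and a uniform linear array. The function names, the speed of sound constant, the array geometry and the loading value are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed)

def steering_vector(freq_hz, mic_positions_m, angle_rad):
    """Far-field steering vector of a linear array for a source at angle_rad."""
    delays = mic_positions_m * np.cos(angle_rad) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)

def superdirective_weights(freq_hz, mic_positions_m, target_angle_rad, loading=1e-2):
    """Closed-form weights with unit gain toward the target direction."""
    d = steering_vector(freq_hz, mic_positions_m, target_angle_rad)
    spacing = np.abs(mic_positions_m[:, None] - mic_positions_m[None, :])
    # Diffuse-field coherence sin(2*pi*f*r/c) / (2*pi*f*r/c), plus diagonal loading.
    coherence = np.sinc(2.0 * freq_hz * spacing / SPEED_OF_SOUND)
    coherence = coherence + loading * np.eye(len(mic_positions_m))
    solved = np.linalg.solve(coherence, d)
    return solved / (d.conj() @ solved)

def beamform(multichannel_stft, weights):
    """multichannel_stft: (mics, freqs, frames); weights: (freqs, mics)."""
    return np.einsum("fm,mft->ft", weights.conj(), multichannel_stft)

# Usage: a 4-microphone array with 5 cm spacing, target at broadside.
mic_positions = np.arange(4) * 0.05
freqs = np.array([500.0, 1000.0, 2000.0])
target_angle = np.pi / 2
weights = np.stack([superdirective_weights(f, mic_positions, target_angle) for f in freqs])

# Synthetic frequency-domain observation of a single source in the target direction.
rng = np.random.default_rng(0)
source = rng.standard_normal((3, 10)) + 1j * rng.standard_normal((3, 10))
steering = np.stack([steering_vector(f, mic_positions, target_angle) for f in freqs])
multichannel_stft = steering.T[:, :, None] * source[None, :, :]
enhanced_first = beamform(multichannel_stft, weights)
```

Because the weights are constrained to unit gain toward the target direction, a source in that direction passes through unchanged while sound from other directions is attenuated.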

In another alternative embodiment, a notch algorithm may be used to suppress the voice signal in the effective interaction area, which is equivalent to enhancing the voice signals emitted by sound sources other than the sound source in the effective interaction area, so that the enhanced second voice is obtained. Optionally, a Fourier transform may be performed on the original voice set; in each frequency band, the notch filter coefficients are solved with the convex optimization tool according to the microphone array topology and the first voice in the effective interaction area; and the original voice set is then filtered to obtain the processed frequency-domain signal, that is, the enhanced second voice.
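The notch (null) filtering step can be illustrated in a similar way. The sketch below does not use a convex optimization tool; as a stand-in, it projects a reference-microphone selector onto the subspace orthogonal to the target steering vector, which places an exact null in the target direction. The far-field model, the function names and the array geometry are assumptions made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed)

def steering_vector(freq_hz, mic_positions_m, angle_rad):
    """Far-field steering vector of a linear array for a source at angle_rad."""
    delays = mic_positions_m * np.cos(angle_rad) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)

def null_weights(freq_hz, mic_positions_m, target_angle_rad, ref_mic=0):
    """Weights closest to the reference microphone with a null toward the target."""
    d = steering_vector(freq_hz, mic_positions_m, target_angle_rad)
    selector = np.zeros(len(mic_positions_m), dtype=complex)
    selector[ref_mic] = 1.0
    # Remove the component of the selector that lies along the target direction.
    return selector - d * (d.conj() @ selector) / (d.conj() @ d)

# Usage: null at broadside (target), check the response toward endfire (interferer).
mic_positions = np.arange(4) * 0.05
weights = null_weights(1000.0, mic_positions, np.pi / 2)
target_response = abs(weights.conj() @ steering_vector(1000.0, mic_positions, np.pi / 2))
other_response = abs(weights.conj() @ steering_vector(1000.0, mic_positions, 0.0))
```

The response toward the target direction is zero up to rounding error, while sound from the other direction still passes, so the output is dominated by sources outside the effective interaction area.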

In yet another alternative embodiment, the first voice and the second voice are enhanced respectively, so that both the voice signals emitted by sound sources in the effective interaction area and the voice signals emitted by sound sources in the ineffective interaction area become more prominent. The contrast between the two kinds of voice signals thus becomes clearer, which facilitates the subsequent directional pickup of voice information in the effective interaction area.

Step S206, performing signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice by using the deep learning model to generate target voice.

The target voice is voice information directionally picked up by the pickup device.

The deep learning model may be a pre-trained model capable of identifying the voice emitted by a sound source in the effective interaction area. The deep learning model may be a stack of deep feedforward sequential memory network (Deep Feedforward Sequential Memory Network, abbreviated as DFSMN) layers and linear layers; it may also be a long short-term memory network (Long Short-Term Memory, abbreviated as LSTM), a convolutional neural network (Convolutional Neural Network, abbreviated as CNN), or any other neural network model, which is not limited herein.

The model structure of the deep learning model can be an input layer, a hidden layer, a linear mapping layer, a sequence memory module and an output layer, wherein the model structure of the deep learning model can be adjusted according to actual requirements.
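To make the role of the sequence memory module more concrete, the following is a minimal NumPy sketch of an FSMN-style memory block, in which each frame is combined with weighted past and future frames through learnable per-dimension taps. The tap counts, shapes and function name are hypothetical; the application does not specify these details in this passage.

```python
import numpy as np

def fsmn_memory_block(hidden, lookback_taps, lookahead_taps):
    """FSMN-style memory: each frame plus tap-weighted past and future frames.

    hidden: (frames, dim); lookback_taps: (n_back, dim); lookahead_taps: (n_ahead, dim).
    """
    frames = hidden.shape[0]
    memory = hidden.copy()  # skip connection keeps the current frame
    for i, tap in enumerate(lookback_taps, start=1):
        if i >= frames:
            break
        memory[i:] += tap * hidden[:frames - i]  # contribution of frame t - i
    for j, tap in enumerate(lookahead_taps, start=1):
        if j >= frames:
            break
        memory[:frames - j] += tap * hidden[j:]  # contribution of frame t + j
    return memory

# Usage: three frames of 2-dimensional features, one look-back tap, no look-ahead.
hidden = np.arange(6.0).reshape(3, 2)
memory = fsmn_memory_block(hidden, np.ones((1, 2)), np.zeros((0, 2)))
```

In a full model, several such blocks would be interleaved with linear mapping layers, and the taps would be learned during training rather than fixed by hand.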

In an alternative embodiment, features of the original voice set, the enhanced first voice and the enhanced second voice may first be extracted and input into the deep learning model. The deep learning model performs positioning processing on the features to obtain a time-frequency mask of the target voice, and signal recovery processing is then performed on the enhanced first voice according to the time-frequency mask. During signal recovery processing, interference signals in the enhanced first voice can be masked out, so that a higher-quality target voice is obtained.

The time-frequency masking may be divided into time-domain masking and frequency-domain masking, and the enhanced first speech may mask the enhanced second speech that occurs in the vicinity thereof simultaneously through the frequency-domain masking, and the enhanced first speech may mask the enhanced second speech that is adjacent in time to the enhanced first speech through the time-domain masking. That is, the enhanced first speech may be masked by time-frequency masking to the enhanced second speech to obtain the enhanced directionally picked-up speech information.
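The signal recovery step can be sketched as follows: features from the three signals are stacked as model input, and the predicted time-frequency mask is multiplied element-wise with the enhanced first voice in the frequency domain. The log-magnitude features, the use of a single reference channel for the original voice set, the clipping of the mask to [0, 1] and all function names are assumptions made for illustration; the deep learning model itself is not implemented here, and the mask values are placeholders.

```python
import numpy as np

def stack_log_magnitude_features(raw_stft, enhanced_first_stft, enhanced_second_stft, eps=1e-8):
    """Concatenate log-magnitude spectra of the three inputs: (frames, 3 * freqs)."""
    streams = (raw_stft, enhanced_first_stft, enhanced_second_stft)
    return np.concatenate([np.log(np.abs(s) + eps) for s in streams], axis=0).T

def apply_time_frequency_mask(enhanced_first_stft, mask):
    """Scale each time-frequency bin of the enhanced first voice by the mask."""
    return np.clip(mask, 0.0, 1.0) * enhanced_first_stft

# Usage with toy spectra of shape (freqs, frames) = (2, 2).
enhanced_first = np.array([[1.0 + 1.0j, 2.0 + 0.0j], [3.0 + 0.0j, 0.0 + 4.0j]])
enhanced_second = 0.5 * enhanced_first
raw_reference = enhanced_first + enhanced_second
features = stack_log_magnitude_features(raw_reference, enhanced_first, enhanced_second)

# Placeholder mask standing in for the model output.
mask = np.array([[1.0, 0.0], [0.5, 1.2]])
target = apply_time_frequency_mask(enhanced_first, mask)
```

Bins with a mask value near 0 are treated as interference and removed, while bins with a value near 1 are kept as target voice.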

In yet another alternative embodiment, the information of the effective interaction area may be used as an input of the deep learning model, and the output of the model may be adjusted as the setting of the effective interaction area changes, so that the array beam may dynamically point in different directions.

Taking a phone call scene as an example: the voice interaction area within a preset range of the terminal microphone may be determined as the effective interaction area, and other ranges outside the preset range are ineffective interaction areas. The first voice may be the voice of the user making the call, and the second voice may be the voice of a speaker or noise in the other ranges. The original voice set collected by the microphone is acquired first; enhancement processing is performed on the first voice within the preset range in the original voice set, and enhancement processing is performed on the second voice in the other ranges. The deep learning model, in combination with spatial information, then performs signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice, so as to obtain higher-quality voice from the effective interaction area. This improves the recognition result for the user's voice and thus the call quality of the phone.

Taking a smart speaker scene as an example: in the interaction scene of the smart speaker, the voice interaction area within a preset range of the voice acquisition device of the smart speaker may be determined as the effective interaction area, and other ranges outside the preset range are ineffective interaction areas. The first voice may be the voice of a user issuing an interaction instruction to the smart speaker, and the second voice may be noise or the voice of a speaker in the other ranges. The original voice set collected by the voice acquisition device is acquired first; enhancement processing is performed on the first voice within the preset range in the original voice set, and enhancement processing is performed on the second voice in the other ranges. The deep learning model, in combination with spatial information, then performs signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice, so as to obtain higher-quality voice from the effective interaction area. This improves the result of voice recognition on the interaction instruction issued by the user and thus the voice interaction experience of the smart speaker.

Taking a vehicle control scene as an example: in the vehicle control interaction scene, the voice interaction area within a preset range of the voice acquisition device in the vehicle may be determined as the effective interaction area, and other ranges outside the preset range are ineffective interaction areas. The first voice may be the voice of a user issuing an interaction instruction to the vehicle, and the second voice may be noise or the voice of a speaker in the other ranges, such as the horn of another vehicle. The original voice set collected by the voice acquisition device is acquired first; enhancement processing is performed on the first voice within the preset range in the original voice set, and enhancement processing is performed on the second voice in the other ranges. The deep learning model, in combination with spatial information, then performs signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice, so as to obtain higher-quality voice from the effective interaction area. This improves the result of voice recognition on the interaction instruction issued by the user and thus the accuracy of vehicle control.

Through the above steps, an original voice set collected by the pickup device is first acquired, wherein the original voice set comprises a first voice and a second voice, the first voice is voice information emitted by a sound source located in the effective interaction area, the second voice is voice information emitted by sound sources other than the sound source in the effective interaction area, and the sound source located in the effective interaction area is the voice interaction object directionally recognized by the pickup device. Enhancement processing is performed on the first voice and the second voice respectively to obtain the enhanced first voice and the enhanced second voice. Signal recovery processing is then performed on the original voice set, the enhanced first voice and the enhanced second voice by using the deep learning model to generate the target voice, wherein the target voice is the voice information directionally picked up by the pickup device. In this way, sound source interference and environmental noise from outside the effective interaction area are effectively suppressed, and the extraction of voice information in the effective interaction area is improved. It is easy to note that, because the first voice emitted by the sound source in the effective interaction area and the second voice emitted by the other sound sources are enhanced separately, and the second voice is effectively suppressed in combination with the deep learning model, the pickup device can directionally pick up the voice information in the effective interaction area, which solves the technical problem in the related art that the voice information of a sound source in the effective interaction area is difficult to pick up.

In the embodiment of the application, performing enhancement processing on the first voice and the second voice respectively to obtain the enhanced first voice and the enhanced second voice comprises: performing superposition processing on the original voice set by using a beamforming algorithm to obtain the enhanced first voice, and performing filtering processing on the original voice set by using a notch algorithm to obtain the enhanced second voice.

The beamforming algorithm described above may be an adaptive algorithm based on direction estimation, for example, but not limited to, a superdirective beamforming algorithm.

In an alternative embodiment, the beamforming algorithm and the notch algorithm may both act in the target direction of the effective interaction area. The effective interaction area may be the region extending 15° to either side of the 0° direction directly in front of the sound pickup device, with the central angle as the target direction; that is, the beamforming direction and the notch direction are both 0°. This is only an illustration, and the target direction and the effective interaction area may be set according to the actual situation.

In another alternative embodiment, the beamforming algorithm may apply a Fourier transform to the original speech set and, in each frequency band, use a convex optimization tool to solve for the filter coefficients of the first speech according to the microphone array topology and the direction of the effective interaction area. After filtering with the resulting filter, the array outputs in the target direction are superimposed in phase, so as to obtain the enhanced first speech.
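As an illustrative sketch only, the in-phase superposition in the target direction can be shown with a plain delay-and-sum beamformer. This is a simpler stand-in and not the convex-optimization design described above. The linear array geometry, the speed of sound, and the names `steering_vector`, `delay_and_sum_weights` and `beam_response` are assumptions introduced for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def steering_vector(mic_x, angle_deg, freq_hz):
    """Far-field steering vector for a linear array laid out along the x axis.

    angle_deg is measured from broadside (0 deg = directly in front of the array).
    """
    delays = np.asarray(mic_x) * np.sin(np.deg2rad(angle_deg)) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)


def delay_and_sum_weights(mic_x, target_deg, freq_hz):
    """Weights that phase-align the channels for the target direction and average them."""
    d = steering_vector(mic_x, target_deg, freq_hz)
    return d / len(d)


def beam_response(weights, mic_x, angle_deg, freq_hz):
    """Magnitude of the array output w^H d for a unit plane wave from angle_deg."""
    return abs(np.vdot(weights, steering_vector(mic_x, angle_deg, freq_hz)))


# Two microphones 4 cm apart, beam pointed at 0 deg, evaluated at 2 kHz.
mic_x = [0.0, 0.04]
w = delay_and_sum_weights(mic_x, target_deg=0.0, freq_hz=2000.0)
on_target = beam_response(w, mic_x, 0.0, 2000.0)    # unit gain toward the target
off_target = beam_response(w, mic_x, 90.0, 2000.0)  # attenuated from the side
```

With only two closely spaced microphones the off-target attenuation of this simple design is modest, which is consistent with the motivation for combining the beam output with a learned time-frequency mask.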

In another alternative embodiment, the notch algorithm may apply a Fourier transform to the original speech set and, in each frequency band, use a convex optimization tool to solve for the filter coefficients of the first speech according to the microphone array topology and the direction of the effective interaction area. The first speech in the original speech set is then blocked by the notch filter, which has the effect of enhancing the second speech from the ineffective interaction area, so as to obtain the enhanced second speech.
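As an illustrative sketch only, the idea of blocking the target direction can be shown for a two-microphone array with a delay-and-subtract null. This is a simplified stand-in and not the convex-optimization design described above. The array geometry, the speed of sound, and the names `steering_vector`, `notch_weights` and `response` are assumptions introduced for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def steering_vector(mic_x, angle_deg, freq_hz):
    """Far-field steering vector for a linear array along the x axis (0 deg = broadside)."""
    delays = np.asarray(mic_x) * np.sin(np.deg2rad(angle_deg)) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)


def notch_weights(mic_x, target_deg, freq_hz):
    """Two-microphone weights whose output w^H d is zero for the target direction."""
    assert len(mic_x) == 2, "this sketch only covers a two-microphone array"
    d = steering_vector(mic_x, target_deg, freq_hz)
    # Align both channels to the target direction, then subtract one from the other.
    return np.array([d[0], -d[1]]) / 2.0


def response(weights, mic_x, angle_deg, freq_hz):
    """Magnitude of the array output for a unit plane wave from angle_deg."""
    return abs(np.vdot(weights, steering_vector(mic_x, angle_deg, freq_hz)))


mic_x = [0.0, 0.04]
w = notch_weights(mic_x, target_deg=0.0, freq_hz=2000.0)
blocked = response(w, mic_x, 0.0, 2000.0)   # target direction is cancelled
passed = response(w, mic_x, 90.0, 2000.0)   # off-target sound passes through
```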

In the embodiment of the application, the deep learning model is utilized to carry out signal recovery processing on the original voice set and the first voice and the second voice after the enhancement processing to generate the target voice, and the method comprises the steps of carrying out feature extraction on the original voice set and the first voice and the second voice after the enhancement processing to obtain original voice features, first voice features and second voice features, inputting the original voice features, the first voice features and the second voice features into the deep learning model to obtain time-frequency masking of the target voice, and carrying out signal recovery processing on the first voice after the enhancement processing based on the time-frequency masking to obtain the target voice.

The original voice features comprise first voice features which are not enhanced and second voice features which are not enhanced.

The time-frequency mask may be a phase-sensitive mask (PSM), and the time-frequency mask may be applied in the direction of the first speech to perform signal recovery processing on the enhanced first speech, so as to obtain the target speech.
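As an illustrative sketch, the commonly used definition of a phase-sensitive mask is |S|/|Y| · cos(θ_S − θ_Y), where S is the clean target spectrum and Y is the spectrum the mask is applied to. The clipping range, the small constant `eps`, and the function names below are assumptions introduced for illustration; the application does not specify them.

```python
import numpy as np


def phase_sensitive_mask(target_spec, mixture_spec, eps=1e-8, clip=(0.0, 1.0)):
    """Reference PSM for complex spectrograms of identical shape.

    Clipping to [0, 1] is an assumed, commonly used truncation.
    """
    ratio = np.abs(target_spec) / (np.abs(mixture_spec) + eps)
    phase_diff = np.angle(target_spec) - np.angle(mixture_spec)
    return np.clip(ratio * np.cos(phase_diff), *clip)


def apply_mask(mask, enhanced_spec):
    """Signal recovery: scale each time-frequency bin of the enhanced first speech."""
    return mask * enhanced_spec


# Toy example: the target is exactly half of the enhanced first speech in every bin.
enhanced = np.array([[1 + 1j, 2 - 1j], [0.5j, -3 + 0j]])
target = 0.5 * enhanced
mask = phase_sensitive_mask(target, enhanced)
recovered = apply_mask(mask, enhanced)
```

In this toy case every mask value is close to 0.5 and the recovered spectrum is close to the target.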

In an alternative embodiment, the original speech set, the enhanced first speech and the enhanced second speech may each be subjected to a short-time Fourier transform, after which Fbank features (filter-bank audio features) are extracted and mean-variance normalization is applied to them, so as to obtain the original speech feature, the first speech feature and the second speech feature, where the Fbank features may be, for example, 80-dimensional.

In another alternative embodiment, the original voice feature, the first voice feature and the second voice feature may be input into the deep learning model to obtain a time-frequency mask capable of masking interfering voices other than the target voice. The time-frequency mask may then be applied to the enhanced first voice signal to mask the interference signals in it, so as to obtain a target voice of higher quality. When this target voice is used in a voice interaction scene, the accuracy of voice recognition is improved.

In an alternative embodiment, the deep learning model may also use various other features that represent the spatial information of a sound source to derive the time-frequency mask of the target speech, such as the inter-channel level difference (ILD) or the inter-channel phase difference (IPD). That is, inter-channel level difference or inter-channel phase difference features may be extracted from the original speech set, the enhanced first speech and the enhanced second speech, to obtain the original speech feature, the first speech feature and the second speech feature.
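As an illustrative sketch, the two spatial features named above can be computed per time-frequency bin from the complex spectrograms of two channels. The function name `inter_channel_features`, the decibel scaling of the level difference, and the constant `eps` are assumptions introduced for illustration.

```python
import numpy as np


def inter_channel_features(spec_a, spec_b, eps=1e-8):
    """Return (ILD in dB, IPD in radians) between two complex spectrograms."""
    ild = 20.0 * np.log10((np.abs(spec_a) + eps) / (np.abs(spec_b) + eps))
    # Phase of a * conj(b) is the phase difference wrapped to (-pi, pi].
    ipd = np.angle(spec_a * np.conj(spec_b))
    return ild, ipd


# Toy example: channel b is half as loud as channel a and lags it by 0.3 rad.
spec_a = np.ones((4, 3), dtype=complex)
spec_b = 0.5 * np.exp(-0.3j) * spec_a
ild, ipd = inter_channel_features(spec_a, spec_b)
```

For the toy input the level difference is about 6.02 dB and the phase difference is 0.3 rad in every bin.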

In the embodiment of the application, the method further comprises the steps of constructing a plurality of simulation scenes corresponding to the sound pickup equipment, wherein the number and types of sound sources contained in different simulation scenes are different, the types of the sound sources comprise at least one of a target sound source, an interference sound source and a noise sound source, the target sound source is located in an effective interaction area, the interference sound source and the noise sound source are located in other interaction areas except the effective interaction area, a simulation voice set corresponding to the simulation scenes is generated, and the simulation voice set is utilized to train the deep learning model.

The plurality of simulated scenes described above may be simulated randomly sized rooms.

The number of the sound sources included in the simulation scene may be one or more, the types of the sound sources included in the simulation scene may be a target sound source, an interference sound source and a noise sound source, wherein the target sound source may be a sound source in an effective interaction area, the interference sound source may be a sound source in an ineffective interaction area, and the noise sound source may be an irregular sound source randomly existing in the ineffective interaction area.

In an alternative embodiment, the microphone array may be placed at a random position in the simulated scene, the target sound source may be kept directly in front of the array, the interfering sound sources or noise may be placed randomly within the set ineffective interaction area, and the target sound source, the interfering sound source and the noise source may appear randomly according to a certain proportion. Optionally, the various sound source combinations found in actual scenes may be simulated, such as target sound source + interfering sound source, interfering sound source + noise sound source, target sound source + noise sound source, and scenes in which each sound source exists alone.
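As an illustrative sketch, random scene configurations of the kind described above could be drawn as follows. The room size ranges, the 15° half-width of the effective interaction area, the 5° guard band, the uniform choice among combinations, and the names `SCENE_TYPES` and `sample_scene` are all assumptions introduced for illustration.

```python
import random

# Source combinations named in the text: each source alone, and each pair.
SCENE_TYPES = [
    ("target",),
    ("interference",),
    ("noise",),
    ("target", "interference"),
    ("target", "noise"),
    ("interference", "noise"),
]


def sample_scene(rng, half_width_deg=15.0, guard_deg=5.0):
    """Draw one simulated scene: a random room plus sources with arrival angles."""
    room = [rng.uniform(3.0, 8.0), rng.uniform(3.0, 8.0), rng.uniform(2.5, 4.0)]
    sources = []
    for kind in rng.choice(SCENE_TYPES):
        if kind == "target":
            # The target stays inside the effective interaction area in front of the array.
            angle = rng.uniform(-half_width_deg, half_width_deg)
        else:
            # Interference and noise are placed outside the effective interaction area.
            side = rng.choice([-1.0, 1.0])
            angle = side * rng.uniform(half_width_deg + guard_deg, 90.0)
        sources.append({"kind": kind, "angle_deg": angle})
    return {"room_size_m": room, "sources": sources}


rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(5)]
```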

Fig. 3 is a top view of a simulated room environment according to an embodiment of the application. As shown in fig. 3, a microphone array may be placed in a simulated room, and the region in which the microphone array collects sound may be divided into an effective interaction region and ineffective interaction regions. One sound source may be set in the effective interaction region and two sound sources may be set in the ineffective interaction region on the left, and the simulated microphone array collects the simulated voice set emitted by these sound sources. The number and types of sound sources in the simulated room environment, as well as the size of the room and the placement of the microphone array, may be changed to generate the simulated voice set corresponding to each simulated scene. The deep learning model can be trained by using the simulated voice sets.

In another alternative embodiment, the simulated voice set and the simulated scene may be used as training samples, the simulated voice set may be input into a deep learning model, the deep learning model may obtain a time-frequency mask of an effective interaction region, a loss function may be constructed according to a sound source and the time-frequency mask in the effective interaction region in the simulated scene, and model parameters of the deep learning model may be updated by using the loss function, so that the deep learning model may obtain a time-frequency mask of a target voice with higher accuracy, that is, a time-frequency mask of a voice signal in the effective interaction region. It should be noted that, in general, the time-frequency masking is to mask a sound source outside the effective interaction area, so that a loss function can be constructed according to the time-frequency masking obtained by the deep learning model and the sound source in the effective interaction area in the simulation scene to determine whether the time-frequency masking obtained by the deep learning model is accurate.
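As an illustrative sketch, one possible loss of this kind compares the masked enhanced first speech with the clean signal of the sound source in the effective interaction area, both as complex spectrograms. The application does not specify the form of the loss; the mean-squared signal-approximation form and the name `signal_approximation_loss` are assumptions introduced for illustration.

```python
import numpy as np


def signal_approximation_loss(predicted_mask, enhanced_spec, target_spec):
    """Mean squared error between the masked enhanced speech and the clean target."""
    error = predicted_mask * enhanced_spec - target_spec
    return float(np.mean(np.abs(error) ** 2))


# Toy example: the clean target is exactly half of the enhanced first speech.
enhanced = np.array([1 + 1j, 2 + 0j])
target = 0.5 * enhanced
good = signal_approximation_loss(np.array([0.5, 0.5]), enhanced, target)  # close to 0
bad = signal_approximation_loss(np.array([1.0, 1.0]), enhanced, target)   # larger
```

A mask that passes interference through leaves residual energy in the masked output, so it scores a higher loss than a mask that isolates the target.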

Fig. 4 is a schematic diagram of a user interface for training the deep learning model according to an embodiment of the application. As shown in fig. 4, a user may upload a plurality of constructed simulation scene files by clicking the upload control on the user interface, or by dragging the constructed simulation scene files into the dashed frame. After the upload succeeds, images or thumbnails of the uploaded simulation scenes may be displayed in the display frame on the upper right, and the user may use these images to check whether the uploaded content needs to be changed. If not, the user may click the generation control to generate the simulated voice sets corresponding to the simulation scenes, and the generated simulated voice sets may be displayed in the display frame on the lower right.

In the embodiment of the application, generating the simulated voice sets corresponding to the plurality of simulated scenes comprises determining simulated voices emitted by each sound source in each simulated scene, determining a transfer function corresponding to each sound source in each simulated scene through a mirror image method, and convolving the simulated voices and the transfer function to obtain the simulated voice sets corresponding to each simulated scene.

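The image method (IMAGE) described above is borrowed from the method of images used to calculate electrostatic or steady magnetic fields; in room acoustics it models wall reflections as mirror-image copies of the sound source. The image method can be implemented with the open-source tool RIR-Generator (a room impulse response generator).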

The transfer function may be a room transfer function, wherein the room transfer function represents the acoustic propagation from the sound source position to the microphone position.

In an alternative embodiment, the simulated voice emitted by each sound source in each simulated scene may be determined first, and the simulated voice may then be convolved with the room transfer function to obtain the signal of that simulated voice as received at the microphone position. Since the position of each sound source is known, it can be determined whether the simulated voice comes from the effective interaction area or from the ineffective interaction area, and the simulated voice set corresponding to each simulated scene is thereby obtained.
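As an illustrative sketch, the convolution step can be written as follows. The impulse responses here are hand-made toy values, not the output of RIR-Generator or any image-method tool, and the name `simulate_microphone_signals` is an assumption introduced for illustration.

```python
import numpy as np


def simulate_microphone_signals(source_signals, rirs):
    """Mix several sources at several microphones.

    source_signals: list of 1-D arrays, one per sound source.
    rirs[s][m]: 1-D impulse response from source s to microphone m.
    Returns an array of shape (num_mics, num_samples).
    """
    num_mics = len(rirs[0])
    length = max(
        len(signal) + len(h) - 1
        for signal, source_rirs in zip(source_signals, rirs)
        for h in source_rirs
    )
    mics = np.zeros((num_mics, length))
    for signal, source_rirs in zip(source_signals, rirs):
        for m, h in enumerate(source_rirs):
            received = np.convolve(signal, h)
            mics[m, :len(received)] += received
    return mics


# Toy example: one source; microphone 1 hears it two samples later at half amplitude.
speech = np.array([1.0, 2.0, 3.0])
rirs = [[np.array([1.0]), np.array([0.0, 0.0, 0.5])]]
mic_signals = simulate_microphone_signals([speech], rirs)
# mic_signals[0] is [1, 2, 3, 0, 0]; mic_signals[1] is [0, 0, 0.5, 1, 1.5]
```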

In the embodiment of the application, the feature extraction is performed on the original voice set and the first voice and the second voice after the enhancement processing respectively to obtain the original voice feature, the first voice feature and the second voice feature, wherein the step of performing short-time Fourier transform on the original voice set and the first voice and the second voice after the enhancement processing respectively to obtain an original frequency domain signal, a first frequency domain signal and a second frequency domain signal, the step of performing feature extraction on the original frequency domain signal, the first frequency domain signal and the second frequency domain signal respectively to obtain the original frequency domain feature, the first frequency domain feature and the second frequency domain feature, and the step of performing normalization on the original frequency domain feature, the first frequency domain feature and the second frequency domain feature respectively to obtain the original voice feature, the first voice feature and the second voice feature.

In an alternative embodiment, the original speech set, the enhanced first speech and the enhanced second speech may each be subjected to a short-time Fourier transform to obtain an original frequency domain signal, a first frequency domain signal and a second frequency domain signal, where the short-time Fourier transform is a mathematical transform related to the Fourier transform that determines the frequency and phase of the sinusoidal components in a local section of a time-varying signal. Feature extraction is then performed on the original frequency domain signal, the first frequency domain signal and the second frequency domain signal respectively to obtain an original frequency domain feature, a first frequency domain feature and a second frequency domain feature, each of which may be an 80-dimensional feature. Mean-variance normalization is then applied to the original frequency domain feature, the first frequency domain feature and the second frequency domain feature respectively to obtain the original speech feature, the first speech feature and the second speech feature. The mean-variance normalization normalizes the mean and the variance of the features, so that the mean of every feature is close to 0.
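As an illustrative sketch, the three steps above (short-time Fourier transform, 80-dimensional feature extraction, mean-variance normalization) can be written with NumPy alone. The frame length, hop size, Hann window, 16 kHz sample rate, triangular mel filter shape and per-utterance normalization are assumptions introduced for illustration; the application only states that the features may be 80-dimensional.

```python
import numpy as np


def stft(signal, frame_len=512, hop=256):
    """Short-time Fourier transform with a Hann window; returns (frames, bins)."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + frame_len] * window for i in range(num_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def mel_filterbank(num_filters, frame_len, sample_rate):
    """Triangular filters spaced evenly on the mel scale; returns (filters, bins)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bin_freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    bank = np.zeros((num_filters, len(bin_freqs)))
    for i in range(num_filters):
        left, center, right = hz_points[i:i + 3]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def fbank_features(signal, sample_rate=16000, num_filters=80, frame_len=512, hop=256):
    """Log mel filter-bank energies; returns (frames, num_filters)."""
    power = np.abs(stft(signal, frame_len, hop)) ** 2
    bank = mel_filterbank(num_filters, frame_len, sample_rate)
    return np.log(power @ bank.T + 1e-10)


def mean_variance_normalize(features, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance over time."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)


# One second of noise stands in for one channel of the original speech set.
signal = np.random.default_rng(0).standard_normal(16000)
speech_feature = mean_variance_normalize(fbank_features(signal))
```

The same function would be applied separately to the original speech set, the enhanced first speech and the enhanced second speech.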

In the embodiment of the application, the deep learning model comprises an input layer, a plurality of compact feedforward sequential storage networks, a plurality of first hidden layers, a first linear mapping layer and an output layer which are sequentially connected, wherein each compact feedforward sequential storage network comprises a second hidden layer, a second linear mapping layer and a memory module, and the memory module in the former compact feedforward sequential storage network is connected with the memory module in the latter compact feedforward sequential storage network.

The plurality of first hidden layers described above are used to convert the output of the compact feed-forward sequential storage network to an input usable by the first linear mapping layer. Similarly, the second hidden layer is used to convert the output of the input layer into an input that can be used by the compact feed-forward sequential storage network.

The compact feedforward sequential storage network may be a compact feedforward sequential memory network (cFSMN). It maps the output of the second hidden layer to a low-dimensional vector through the second linear mapping layer and inputs the low-dimensional vector into the memory module. The memory module performs a weighted summation over the input low-dimensional vectors, and the output of the compact feedforward sequential storage network is then obtained through an affine transformation and a nonlinear function. The memory module may further pass the obtained low-dimensional vector to the memory module of the next compact feedforward sequential storage network. The memory module in the former compact feedforward sequential storage network is connected with the memory module in the latter one, so that a memory module can perform a weighted summation over the low-dimensional vectors obtained by the memory modules in different compact feedforward sequential storage networks.
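As an illustrative sketch, one such block (hidden layer, linear mapping to a low-dimensional vector, memory module, then affine transformation and nonlinearity) can be written as a forward pass in NumPy. The layer sizes, the ReLU nonlinearity, the causal look-back of four frames, the additive memory connection between blocks, the random initialization and the class name `CompactFsmnBlock` are assumptions introduced for illustration, not the model of the present application.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


class CompactFsmnBlock:
    """Forward pass of one block on a sequence of shape (frames, in_dim)."""

    def __init__(self, rng, in_dim, hidden_dim, proj_dim, out_dim, lookback=4):
        self.w_hidden = rng.standard_normal((in_dim, hidden_dim)) * 0.1
        self.b_hidden = np.zeros(hidden_dim)
        self.w_proj = rng.standard_normal((hidden_dim, proj_dim)) * 0.1
        self.taps = rng.standard_normal((lookback + 1, proj_dim)) * 0.1
        self.w_out = rng.standard_normal((proj_dim, out_dim)) * 0.1
        self.b_out = np.zeros(out_dim)

    def forward(self, x, prev_memory=None):
        hidden = relu(x @ self.w_hidden + self.b_hidden)  # hidden layer
        proj = hidden @ self.w_proj                        # low-dimensional vector
        memory = proj.copy()
        for i, tap in enumerate(self.taps):
            if i >= len(proj):
                break
            # shifted[t] = proj[t - i], zero for t < i, so the memory is causal.
            shifted = np.zeros_like(proj)
            shifted[i:] = proj[:len(proj) - i]
            memory += tap * shifted                        # weighted summation
        if prev_memory is not None:
            memory = memory + prev_memory                  # link from the previous block
        out = relu(memory @ self.w_out + self.b_out)       # affine map + nonlinearity
        return out, memory


rng = np.random.default_rng(0)
block1 = CompactFsmnBlock(rng, in_dim=6, hidden_dim=16, proj_dim=8, out_dim=6)
block2 = CompactFsmnBlock(rng, in_dim=6, hidden_dim=16, proj_dim=8, out_dim=6)
features = rng.standard_normal((10, 6))
h1, m1 = block1.forward(features)
h2, m2 = block2.forward(h1, prev_memory=m1)
```

Passing `m1` into the second block mirrors the connection between the memory modules of consecutive blocks described above.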

The first linear mapping layer is used for mapping the outputs of the first hidden layers to low-dimensional vectors, and the low-dimensional vectors can be output through the output layer after an affine transformation and a nonlinear function.

Fig. 5 is a schematic diagram of a deep learning model according to an embodiment of the present application. As shown in fig. 5, the original speech feature, the first speech feature and the second speech feature may be input into the input layer. The compact feedforward sequential storage networks, which are sequentially connected, perform a weighted summation over the low-dimensional vectors of the input features, and the weighted sum is passed through an affine transformation and a nonlinear function to obtain the output of the compact feedforward sequential storage network. The plurality of first hidden layers convert this output into an input that can be used by the first linear mapping layer, the first linear mapping layer maps its input to low-dimensional vectors, and the low-dimensional vectors are passed through an affine transformation and a nonlinear function to output the time-frequency mask of the target speech.

Fig. 6 is a block diagram of a speech processing flow according to an embodiment of the present application. The pickup device may be a microphone array, and the original speech set may be the microphone signals collected by the microphone array, which may include speech signals from the effective interaction area and speech signals from sound sources other than the sound source in the effective interaction area. A target direction notch and a target direction beam may be applied to the microphone signals. The target direction notch suppresses the speech signals from the effective interaction area by using a notch algorithm, thereby enhancing the signals of the other sound sources, that is, obtaining the enhanced second speech. The target direction beam enhances the speech signals from the effective interaction area by using a directional beamforming algorithm, that is, obtaining the enhanced first speech. Feature extraction and normalization are performed on the original speech set, the enhanced first speech and the enhanced second speech, and the results are input into a deep neural network model (that is, the deep learning model described above) to obtain a time-frequency mask of the signal in the effective interaction area. Signal recovery is then performed on the enhanced first speech according to the time-frequency mask, so as to obtain target speech of higher quality and improve the voice interaction experience in the voice interaction scene.

Existing schemes for improving voice quality in voice interaction scenes aim to extract the voice of a speaker in an arbitrary direction and consider only relatively simple multi-speaker scenes, so their processing capability is limited when speaker interference and noise coexist. Moreover, these methods only consider large arrays with many microphones, and lack analysis of the case where the number of microphones is small (for example, two microphones) and the array spacing is small (for example, less than 4 cm).

The deep learning model makes combined use of the beam pointing to the effective interaction area and the notch signal of the effective interaction area, which improves the spatial filtering capability of the model and allows it to be applied to scenes with few microphones and small microphone spacing. The simulation of the model training data fully considers the various scenes that occur in practice, namely target sound source + interference sound source, interference sound source + noise sound source, target sound source + noise sound source, and scenes in which each sound source exists alone. This effectively improves the practical performance of the model and gives it the capability to handle speaker interference and noise occurring at the same time. Verification in actual scenes shows that an effective interaction area with a 20-degree range can be realized on a two-microphone array, clearly outperforming the array directivity of algorithms such as classical beamforming and blind source separation.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the various embodiments of the present application.

Example 2

Fig. 7 is a flowchart of a voice processing method according to embodiment 2 of the present application, and as shown in fig. 7, the method may include the steps of:

Step S702, acquiring an original voice set collected by a sound pickup device provided on an audio/video communication device.

The original voice set comprises first voices and second voices, wherein the first voices are voice signals sent out by sound sources located in an effective interaction area, the second voices are voice signals sent out by other sound sources except the sound sources located in the effective interaction area, and the sound sources located in the effective interaction area are voice interaction objects directionally recognized by the pickup equipment.

The audio/video communication device in the above steps may be, but is not limited to, an audio/video conferencing device, a smart speaker, or a smart home appliance (such as a television or refrigerator with a voice control function).

Step S704, enhancement processing is performed on the first voice and the second voice respectively, so as to obtain the first voice and the second voice after the enhancement processing.

Step S706, the deep learning model is utilized to perform signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice, and target voice is generated.

In step S708, the audio/video communication apparatus is controlled to output the target voice.

It should be noted that the preferred implementations of this embodiment are the same as those provided in Example 1 in terms of scheme, application scenario and implementation process, but are not limited to what is provided in Example 1.

Example 3

Fig. 8 is a flowchart of a voice processing method according to embodiment 3 of the present application, and as shown in fig. 8, the method may include the steps of:

Step S802, acquiring an original voice set collected by a sound pickup device provided on a target vehicle.

The target vehicle may be a fuel vehicle, a new energy vehicle, an automatic driving vehicle, an unmanned vehicle, or the like, and is not limited herein.

Step S804, enhancement processing is performed on the first voice and the second voice respectively, so as to obtain the first voice and the second voice after the enhancement processing.

Step S806, performing signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice by using the deep learning model, and generating a target voice.

Step S808, the target vehicle is controlled based on the target voice.

Example 4

Fig. 9 is a flowchart of a voice processing method according to embodiment 4 of the present application, and as shown in fig. 9, the method may include the steps of:

In step S902, the cloud server receives the original speech set uploaded by the client.

The original voice set is collected by the pickup device and comprises a first voice and a second voice, wherein the first voice is a voice signal emitted by a sound source located in an effective interaction area, the second voice is a voice signal emitted by sound sources other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is the voice interaction object directionally recognized by the pickup device.

In step S904, the cloud server performs enhancement processing on the first voice and the second voice, so as to obtain the first voice and the second voice after the enhancement processing.

In step S906, the cloud server performs signal recovery processing on the original speech set, the enhanced first speech, and the enhanced second speech by using the deep learning model, and generates a target speech.

In step S908, the cloud server outputs the target voice to the client.

Example 5

There is also provided in accordance with an embodiment of the present application a speech processing system, in which steps shown in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and in which although a logical order is shown in the flowcharts, in some cases steps shown or described may be performed in an order other than that shown or described herein.

Fig. 10 is a schematic diagram of a speech processing system according to embodiment 5 of the present application, and as shown in fig. 10, a speech processing system 1000 includes:

The sound pickup device 1002 is configured to collect an original set of voices, where the original set of voices includes a first voice and a second voice, where the first voice is a voice signal sent from a sound source located in an effective interaction area, the second voice is a voice signal sent from another sound source other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object directionally recognized by the sound pickup device;

and the processing device 1004 is connected with the pickup device and is used for respectively carrying out enhancement processing on the first voice and the second voice to obtain the first voice and the second voice after the enhancement processing, and carrying out signal recovery processing on the original voice set and the first voice and the second voice after the enhancement processing by using the deep learning model to generate target voice, wherein the target voice is voice information directionally picked up by the pickup device.

Example 6

According to the embodiment of the application, an audio and video communication device is also provided. Fig. 11 is a schematic diagram of an audio-video communication device according to embodiment 6 of the present application, and as shown in fig. 11, an audio-video communication device 1100 includes:

The pickup device 1102 is disposed on the audio-video communication device 1100 and is configured to collect an original voice set, where the original voice set includes a first voice and a second voice, where the first voice is a voice signal sent from a sound source located in an effective interaction area, the second voice is a voice signal sent from another sound source other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object directionally recognized by the pickup device;

The processor 1104 is connected with the pickup device 1102, and is configured to perform enhancement processing on the first voice and the second voice, obtain the first voice and the second voice after enhancement processing, and perform signal recovery processing on the original voice set and the first voice and the second voice after enhancement processing by using the deep learning model, so as to generate a target voice;

an output device 1106, coupled to the processor 1104, for outputting the target speech.

Example 7

There is further provided a vehicle according to an embodiment of the present application, fig. 12 is a schematic view of a vehicle according to embodiment 7 of the present application, and as shown in fig. 12, the vehicle 1200 includes:

A sound pickup apparatus 1202 provided on the vehicle 1200 for collecting an original set of voices, wherein the original set of voices includes a first voice and a second voice, wherein the first voice is a voice signal from a sound source located in an effective interaction area, the second voice is a voice signal from other sound sources than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object for directional recognition by the sound pickup apparatus;

And a controller 1204, coupled to the pickup device 1202, for performing enhancement processing on the first voice and the second voice, respectively, to obtain the enhanced first voice and the enhanced second voice, performing signal recovery processing on the original voice set and the enhanced first voice and second voice by using the deep learning model, generating a target voice, and controlling the target vehicle based on the target voice.

Example 8

According to an embodiment of the present application, there is further provided a speech processing apparatus for implementing the above-mentioned speech processing method, and fig. 13 is a schematic diagram of a speech processing apparatus according to embodiment 8 of the present application, and as shown in fig. 13, the apparatus 1300 includes an acquisition module 1302, an enhancement processing module 1304, and a recovery processing module 1306.

The acquisition module is used for acquiring an original voice set collected by the sound pickup device, wherein the original voice set comprises a first voice and a second voice, the first voice is a voice signal sent by a sound source located in an effective interaction area, the second voice is a voice signal sent by sound sources other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object directionally recognized by the sound pickup device. The enhancement processing module is used for performing enhancement processing on the first voice and the second voice respectively, to obtain the enhanced first voice and the enhanced second voice. The recovery processing module is used for performing signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice by using a deep learning model to generate a target voice, wherein the target voice is voice information directionally picked up by the sound pickup device.

It should be noted that the acquisition module 1302, the enhancement processing module 1304, and the recovery processing module 1306 correspond to steps S202 to S206 of embodiment 1. The three modules share the examples and application scenarios of the corresponding steps, but are not limited to what is disclosed in embodiment 1. It should also be noted that the above modules may be implemented as part of an apparatus in the computing terminal 10 provided in embodiment 1.

In the embodiment of the application, the enhancement processing module comprises a superposition unit and a filtering unit.

The superposition unit is used for performing superposition processing on the original voice set by using a beamforming algorithm to obtain the enhanced first voice, and the filtering unit is used for performing filtering processing on the original voice set by using a notch algorithm to obtain the enhanced second voice.
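The two enhancement paths can be illustrated with a minimal sketch. It assumes a two-microphone array, integer sample delays toward the effective interaction area, and a notch that simply places a spatial null in that direction; the text does not specify any of these details. The function names `delay_and_sum` and `null_toward` are hypothetical.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each channel by its integer delay (in samples) and average.

    np.roll wraps around at the edges, which is acceptable only for this toy example.
    """
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)

def null_toward(channels, delays):
    """Two-channel notch: align to the steered direction, then subtract.

    Anything arriving from the steered direction cancels; other sources remain.
    """
    a = np.roll(channels[0], -delays[0])
    b = np.roll(channels[1], -delays[1])
    return 0.5 * (a - b)

rng = np.random.default_rng(0)
target = rng.standard_normal(256)   # source inside the effective interaction area
interf = rng.standard_normal(256)   # source outside it
# Assumed propagation: target reaches mic 1 two samples late, interferer five samples late.
mics = [target + interf, np.roll(target, 2) + np.roll(interf, 5)]

first_enhanced = delay_and_sum(mics, [0, 2])   # target adds coherently
second_enhanced = null_toward(mics, [0, 2])    # target cancels, interferer remains
```

In this sketch the summed output keeps the target at full amplitude while the interferer is averaged incoherently, and the null output contains no target component at all, which is why the two outputs are useful as separate references for a later model.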

In the embodiment of the application, the recovery processing module comprises an extraction unit, an input unit and a recovery unit.

The extraction unit is used for performing feature extraction on the original voice set, the enhanced first voice and the enhanced second voice respectively, to obtain an original voice feature, a first voice feature and a second voice feature. The input unit is used for inputting the original voice feature, the first voice feature and the second voice feature into the deep learning model to obtain a time-frequency mask of the target voice. The recovery unit is used for performing signal recovery processing on the enhanced first voice based on the time-frequency mask to obtain the target voice.
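The mask-based recovery step can be sketched as follows. The single sigmoid layer in `predict_mask` is a hypothetical stand-in for the deep learning model, and the feature sizes and random inputs are placeholders chosen only so that the example runs.

```python
import numpy as np

def predict_mask(features, weights, bias):
    """Hypothetical stand-in for the deep learning model: one sigmoid layer."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))

def recover_target(first_spec, mask):
    """Scale each time-frequency bin of the enhanced first voice by the mask."""
    return mask * first_spec

rng = np.random.default_rng(0)
frames, bins = 10, 129
# Placeholder input: original, first and second voice features concatenated per frame.
features = rng.standard_normal((frames, 3 * bins))
weights = 0.01 * rng.standard_normal((3 * bins, bins))
bias = np.zeros(bins)
# Placeholder complex spectrum of the enhanced first voice.
first_spec = rng.standard_normal((frames, bins)) + 1j * rng.standard_normal((frames, bins))

mask = predict_mask(features, weights, bias)
target_spec = recover_target(first_spec, mask)
```

Because the mask is real-valued and lies between 0 and 1, it attenuates bins dominated by interference and leaves the phase of the enhanced first voice unchanged.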

The device further comprises a construction module, a generation module and a training module.

The construction module is used for constructing a plurality of simulated scenes corresponding to the sound pickup device, wherein the number and types of sound sources contained in different simulated scenes differ, and the sound source type comprises at least one of a target sound source, an interference sound source and a noise sound source, the target sound source being located in the effective interaction area and the interference sound source and the noise sound source being located in interaction areas other than the effective interaction area. The generation module is used for generating simulated voice sets corresponding to the plurality of simulated scenes, and the training module is used for training the deep learning model by using the simulated voice sets.
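A minimal sketch of scene construction is given below. The sampling scheme, the `max_sources` limit and the string labels are assumptions made for illustration; the text only requires that scenes differ in the number and types of sound sources and that target sources lie in the effective interaction area.

```python
import random

SOURCE_TYPES = ("target", "interference", "noise")

def build_scene(rng, max_sources=4):
    """Draw one simulated scene as a list of (source type, region) pairs."""
    n_sources = rng.randint(1, max_sources)
    scene = []
    for _ in range(n_sources):
        kind = rng.choice(SOURCE_TYPES)
        # Only target sources are placed inside the effective interaction area.
        region = "effective" if kind == "target" else "other"
        scene.append((kind, region))
    return scene

rng = random.Random(0)
scenes = [build_scene(rng) for _ in range(100)]
```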

In the embodiment of the application, the generating module comprises a determining unit and a convolution unit.

The determining unit is used for determining the simulated voice emitted by each sound source in each simulated scene, and for determining the transfer function corresponding to each sound source in each simulated scene by the mirror image method. The convolution unit is used for convolving the simulated voice with the transfer function to obtain the simulated voice set corresponding to each simulated scene.
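The convolution step can be sketched as below. The impulse responses are hand-written placeholders; computing them from room geometry with the mirror image method is not shown. The function name `simulate_mic_signals` and the data layout are assumptions.

```python
import numpy as np

def simulate_mic_signals(sources, transfer_functions):
    """Mix sources into microphone signals.

    sources: list of 1-D arrays, one per sound source.
    transfer_functions[s][m]: impulse response from source s to microphone m.
    """
    n_mics = len(transfer_functions[0])
    length = max(len(s) + len(h) - 1
                 for s, hs in zip(sources, transfer_functions) for h in hs)
    mics = np.zeros((n_mics, length))
    for s, hs in zip(sources, transfer_functions):
        for m, h in enumerate(hs):
            y = np.convolve(s, h)      # source as heard at microphone m
            mics[m, :len(y)] += y      # sources add linearly at each microphone
    return mics

source = np.array([1.0, 2.0, 3.0])
# Placeholder responses: direct path to mic 0, one-sample delay at half gain to mic 1.
responses = [[np.array([1.0]), np.array([0.0, 0.5])]]
simulated = simulate_mic_signals([source], responses)
```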

In the embodiment of the application, the extraction unit comprises a processing subunit, an extraction subunit and a normalization subunit.

The processing subunit is used for respectively carrying out short-time Fourier transform on the original voice set and the first voice and the second voice after the enhancement processing to obtain an original frequency domain signal, a first frequency domain signal and a second frequency domain signal, the extracting subunit is used for respectively carrying out feature extraction on the original frequency domain signal, the first frequency domain signal and the second frequency domain signal to obtain an original frequency domain feature, a first frequency domain feature and a second frequency domain feature, and the normalization subunit is used for respectively carrying out normalization on the original frequency domain feature, the first frequency domain feature and the second frequency domain feature to obtain the original voice feature, the first voice feature and the second voice feature.
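The three sub-steps above can be sketched as follows. The Hann window, the frame and hop sizes, the log-magnitude feature and the per-bin mean and variance normalization are all assumptions; the text names only the short-time Fourier transform, feature extraction and normalization.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Short-time Fourier transform with a Hann window; returns (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def log_magnitude(spec, eps=1e-8):
    """Assumed frequency-domain feature: log of the magnitude spectrum."""
    return np.log(np.abs(spec) + eps)

def normalize(features, eps=1e-8):
    """Zero mean and unit variance per frequency bin, computed over time."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

rng = np.random.default_rng(0)
signal = rng.standard_normal(2048)   # placeholder for one channel or enhanced voice
spec = stft(signal)
voice_feature = normalize(log_magnitude(spec))
```

The same three functions would be applied separately to the original voice set, the enhanced first voice and the enhanced second voice.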

The deep learning model comprises an input layer, a plurality of compact feedforward sequential memory networks, a plurality of first hidden layers, a first linear mapping layer and an output layer which are sequentially connected, wherein each compact feedforward sequential memory network comprises a second hidden layer, a second linear mapping layer and a memory module, and the memory module in the former compact feedforward sequential memory network is connected with the memory module in the latter compact feedforward sequential memory network.
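A forward pass through a stack of such blocks can be sketched as follows. This is an illustration under several assumptions: a ReLU hidden layer, look-back taps only, random weights and arbitrary layer sizes. The input layer, the first hidden layers, the first linear mapping layer and the output layer of the full model are omitted, and the name `cfsmn_block` is hypothetical.

```python
import numpy as np

def cfsmn_block(x, prev_memory, w_hidden, w_proj, taps):
    """One block: hidden layer, linear mapping layer, memory module.

    x: (T, d_in); prev_memory: (T, d_proj) or None; taps: (n_taps, d_proj).
    """
    hidden = np.maximum(x @ w_hidden, 0.0)    # hidden layer (ReLU assumed)
    proj = hidden @ w_proj                    # linear mapping layer
    memory = proj.copy()
    for i, a in enumerate(taps, start=1):     # weighted sum over past frames
        memory[i:] += a * proj[:-i]
    if prev_memory is not None:               # link from the previous memory module
        memory = memory + prev_memory
    return memory

rng = np.random.default_rng(0)
T, d_in, d_hid, d_proj = 20, 16, 32, 8
x = rng.standard_normal((T, d_in))

memory, layer_input = None, x
for _ in range(3):                            # three stacked blocks
    w_h = 0.1 * rng.standard_normal((layer_input.shape[1], d_hid))
    w_p = 0.1 * rng.standard_normal((d_hid, d_proj))
    taps = 0.1 * rng.standard_normal((4, d_proj))
    memory = cfsmn_block(layer_input, memory, w_h, w_p, taps)
    layer_input = memory
```

The connection between consecutive memory modules lets information and gradients bypass the intermediate hidden layers, which is the usual motivation for linking memory modules in deep stacks.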

Example 9

There is further provided a speech processing apparatus for implementing the above-mentioned speech processing method according to an embodiment of the present application, and fig. 14 is a schematic diagram of a speech processing apparatus according to embodiment 9 of the present application, and as shown in fig. 14, the apparatus 1400 includes a capturing module 1402, an enhancement processing module 1404, a recovery processing module 1406, and a control module 1408.

The capturing module is used for capturing an original voice set collected by a sound pickup device arranged on the audio and video communication device, wherein the original voice set comprises a first voice and a second voice, the first voice is a voice signal sent by a sound source located in an effective interaction area, the second voice is a voice signal sent by sound sources other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object directionally recognized by the sound pickup device. The enhancement processing module is used for performing enhancement processing on the first voice and the second voice respectively, to obtain the enhanced first voice and the enhanced second voice. The recovery processing module is used for performing signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice by using a deep learning model to generate a target voice. The control module is used for controlling the audio and video communication device to output the target voice.

It should be noted that the capturing module 1402, the enhancement processing module 1404, the recovery processing module 1406, and the control module 1408 correspond to steps S902 to S908 of embodiment 2. The four modules share the examples and application scenarios of the corresponding steps, but are not limited to what is disclosed in embodiment 1. It should also be noted that the above modules may be implemented as part of an apparatus in the computing terminal 10 provided in embodiment 1.

Example 10

There is further provided a speech processing apparatus for implementing the above-mentioned speech processing method according to an embodiment of the present application, and fig. 15 is a schematic diagram of a speech processing apparatus according to embodiment 10 of the present application, and as shown in fig. 15, the apparatus 1500 includes a capturing module 1502, an enhancement processing module 1504, a recovery processing module 1506, and a control module 1508.

The capturing module is used for capturing an original voice set collected by a sound pickup device arranged on a target vehicle, wherein the original voice set comprises a first voice and a second voice, the first voice is a voice signal sent by a sound source located in an effective interaction area, the second voice is a voice signal sent by sound sources other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object directionally recognized by the sound pickup device. The enhancement processing module is used for performing enhancement processing on the first voice and the second voice respectively, to obtain the enhanced first voice and the enhanced second voice. The recovery processing module is used for performing signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice by using a deep learning model to generate a target voice. The control module is used for controlling the target vehicle based on the target voice.

It should be noted that the capturing module 1502, the enhancement processing module 1504, the recovery processing module 1506, and the control module 1508 correspond to steps S802 to S808 of embodiment 3. The four modules share the examples and application scenarios of the corresponding steps, but are not limited to what is disclosed in embodiment 1. It should also be noted that the above modules may be implemented as part of an apparatus in the computing terminal 10 provided in embodiment 1.

Example 11

According to an embodiment of the present application, there is further provided a speech processing apparatus for implementing the above-mentioned speech processing method, and fig. 16 is a schematic diagram of a speech processing apparatus according to embodiment 11 of the present application, and as shown in fig. 16, the apparatus 1600 includes a receiving module 1602, an enhancement processing module 1604, a recovery processing module 1606, and an output module 1608.

The receiving module is used for receiving, through the cloud server, an original voice set uploaded by a client, wherein the original voice set is collected by a sound pickup device and comprises a first voice and a second voice, the first voice is a voice signal sent by a sound source located in an effective interaction area, the second voice is a voice signal sent by sound sources other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object directionally recognized by the sound pickup device. The enhancement processing module is used for performing, through the cloud server, enhancement processing on the first voice and the second voice respectively, to obtain the enhanced first voice and the enhanced second voice. The recovery processing module is used for performing, through the cloud server, signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice by using a deep learning model to generate a target voice, wherein the target voice is voice information directionally picked up by the sound pickup device. The output module is used for outputting the target voice to the client through the cloud server.

It should be noted that the receiving module 1602, the enhancement processing module 1604, the recovery processing module 1606, and the output module 1608 correspond to steps S902 to S908 of embodiment 4. The four modules share the examples and application scenarios of the corresponding steps, but are not limited to what is disclosed in embodiment 1. It should also be noted that the above modules may be implemented as part of an apparatus in the computing terminal 10 provided in embodiment 1.

Example 12

Embodiments of the present application may provide a computer terminal, which may be any one of a group of computer terminals. Alternatively, in the present embodiment, the above-described computer terminal may be replaced with a terminal device such as a mobile terminal.

Alternatively, in this embodiment, the above-mentioned computer terminal may be located in at least one network device among a plurality of network devices of the computer network.

In this embodiment, the computer terminal may execute program code for the following steps of the voice processing method: acquiring an original voice set collected by the sound pickup device, wherein the original voice set comprises a first voice and a second voice, the first voice is a voice signal sent by a sound source located in an effective interaction area, the second voice is a voice signal sent by sound sources other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object directionally recognized by the sound pickup device; performing enhancement processing on the first voice and the second voice respectively, to obtain the enhanced first voice and the enhanced second voice; and performing signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice by using a deep learning model to generate a target voice, wherein the target voice is voice information directionally picked up by the sound pickup device.

Optionally, fig. 17 is a block diagram of a computer terminal according to embodiment 12 of the present application. As shown in fig. 17, the computer terminal A may include one or more processors (only one is shown in the figure) and a memory.

The memory may be used to store software programs and modules, such as the program instructions/modules corresponding to the voice processing method and apparatus in the embodiments of the present application. The processor runs the software programs and modules stored in the memory to execute various functional applications and data processing, that is, to implement the voice processing method described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, which may be connected to the terminal A through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The processor can call, through the transmission device, the information and application programs stored in the memory to execute the following steps: acquiring an original voice set collected by the sound pickup device, wherein the original voice set comprises a first voice and a second voice, the first voice is a voice signal sent by a sound source located in an effective interaction area, the second voice is a voice signal sent by sound sources other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object directionally recognized by the sound pickup device; performing enhancement processing on the first voice and the second voice respectively, to obtain the enhanced first voice and the enhanced second voice; and performing signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice by using a deep learning model to generate a target voice, wherein the target voice is voice information directionally picked up by the sound pickup device.

Optionally, the above processor may further execute program code for performing superposition processing on the original speech set by using a beam forming algorithm to obtain the enhanced first speech, and performing filtering processing on the original speech set by using a notch algorithm to obtain the enhanced second speech.

Optionally, the processor may further execute program codes for performing feature extraction on the original speech set and the first speech and the second speech after the enhancement processing to obtain an original speech feature, a first speech feature and a second speech feature, inputting the original speech feature, the first speech feature and the second speech feature into a deep learning model to obtain a time-frequency mask of the target speech, and performing signal recovery processing on the first speech after the enhancement processing based on the time-frequency mask to obtain the target speech.

Optionally, the processor may further execute program code for constructing a plurality of simulation scenes corresponding to the sound pickup device, where the number and types of sound sources included in the different simulation scenes are different, and the types of the sound sources include at least one of a target sound source, an interference sound source, and a noise sound source, the target sound source is located in an effective interaction area, the interference sound source and the noise sound source are located in other interaction areas except the effective interaction area, generating a simulation voice set corresponding to the plurality of simulation scenes, and training the deep learning model by using the simulation voice set.

Optionally, the processor may further execute program code for determining the simulated voice emitted by each sound source in each simulated scene, determining the transfer function corresponding to each sound source in each simulated scene by the mirror image method, and convolving the simulated voice with the transfer function to obtain the simulated voice set corresponding to each simulated scene.

Optionally, the processor may further execute program codes for performing short-time fourier transform on the original speech set and the first speech and the second speech after the enhancement processing to obtain an original frequency domain signal, a first frequency domain signal and a second frequency domain signal, performing feature extraction on the original frequency domain signal, the first frequency domain signal and the second frequency domain signal to obtain an original frequency domain feature, a first frequency domain feature and a second frequency domain feature, and performing normalization on the original frequency domain feature, the first frequency domain feature and the second frequency domain feature to obtain the original speech feature, the first speech feature and the second speech feature.

Optionally, in the program code executed by the processor, the deep learning model comprises an input layer, a plurality of compact feedforward sequential memory networks, a plurality of first hidden layers, a first linear mapping layer, and an output layer connected in sequence, wherein each compact feedforward sequential memory network comprises a second hidden layer, a second linear mapping layer, and a memory module, and the memory module in the former compact feedforward sequential memory network is connected to the memory module in the latter compact feedforward sequential memory network.

The processor can call, through the transmission device, the information and application programs stored in the memory to execute the following steps: capturing an original voice set collected by a sound pickup device arranged on the audio and video communication device, wherein the original voice set comprises a first voice and a second voice, the first voice is a voice signal sent by a sound source located in an effective interaction area, the second voice is a voice signal sent by sound sources other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object directionally recognized by the sound pickup device; performing enhancement processing on the first voice and the second voice respectively, to obtain the enhanced first voice and the enhanced second voice; performing signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice by using a deep learning model to generate a target voice; and controlling the audio and video communication device to output the target voice.

The processor can call, through the transmission device, the information and application programs stored in the memory to execute the following steps: capturing an original voice set collected by a sound pickup device arranged on a target vehicle, wherein the original voice set comprises a first voice and a second voice, the first voice is a voice signal sent by a sound source located in an effective interaction area, the second voice is a voice signal sent by sound sources other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object directionally recognized by the sound pickup device; performing enhancement processing on the first voice and the second voice respectively, to obtain the enhanced first voice and the enhanced second voice; performing signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice by using a deep learning model to generate a target voice; and controlling the target vehicle based on the target voice.

The processor can call, through the transmission device, the information and application programs stored in the memory to execute the following steps: the cloud server receives an original voice set uploaded by a client, wherein the original voice set is collected by a sound pickup device and comprises a first voice and a second voice, the first voice is a voice signal sent by a sound source located in an effective interaction area, the second voice is a voice signal sent by sound sources other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object directionally recognized by the sound pickup device; the cloud server performs enhancement processing on the first voice and the second voice respectively, to obtain the enhanced first voice and the enhanced second voice; and the cloud server performs signal recovery processing on the original voice set, the enhanced first voice and the enhanced second voice by using a deep learning model to generate a target voice, wherein the target voice is voice information directionally picked up by the sound pickup device.

In the embodiment of the application, an original voice set collected by the sound pickup device is first acquired, wherein the original voice set comprises a first voice and a second voice, the first voice is a voice signal sent by a sound source located in an effective interaction area, the second voice is a voice signal sent by sound sources other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is a voice interaction object directionally recognized by the sound pickup device. Enhancement processing is then performed on the first voice and the second voice respectively, to obtain the enhanced first voice and the enhanced second voice, and signal recovery processing is performed on the original voice set, the enhanced first voice and the enhanced second voice by using a deep learning model to generate a target voice, wherein the target voice is voice information directionally picked up by the sound pickup device. In this way, sound source interference and environmental noise from outside the effective interaction area are effectively suppressed, and the extraction of voice information within the effective interaction area is improved. It is easy to note that enhancement processing can be performed separately on the first voice emitted by the sound source in the effective interaction area and on the second voice emitted by sound sources outside it, and that, in combination with the deep learning model, the second voice emitted by the other sound sources is effectively suppressed. The sound pickup device can therefore directionally pick up the voice information in the effective interaction area, which solves the technical problem in the related art that the voice information of sound sources in the effective interaction area is difficult to pick up.

It will be appreciated by those skilled in the art that the configuration shown in fig. 17 is merely illustrative, and the computer terminal may be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, or another terminal device. Fig. 17 does not limit the structure of the electronic device. For example, the computer terminal A may include more or fewer components (such as a network interface or a display device) than shown in fig. 17, or have a configuration different from that shown in fig. 17.

Those of ordinary skill in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program that instructs hardware related to the terminal device, and the program may be stored in a computer-readable storage medium, where the storage medium may include a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or the like.

Example 13

Embodiments of the present application also provide a computer-readable storage medium. Alternatively, in the present embodiment, the above-described computer-readable storage medium may be used to store the program code executed by the speech processing method provided in the above-described embodiment 1.

Alternatively, in this embodiment, the storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.

Optionally, in this embodiment, the computer readable storage medium is configured to store program code for performing the steps of obtaining an original speech set collected by the sound pickup apparatus, wherein the original speech set includes a first speech and a second speech, wherein the first speech is a speech signal from a sound source located in an effective interaction area, the second speech is a speech signal from a sound source other than the sound source located in the effective interaction area, the sound source located in the effective interaction area is a speech interaction object for directional recognition of the sound pickup apparatus, performing enhancement processing on the first speech and the second speech respectively to obtain the enhanced first speech and the enhanced second speech, and performing signal recovery processing on the original speech set, the enhanced first speech and the enhanced second speech by using a deep learning model to generate a target speech, wherein the target speech is speech information directionally picked up by the sound pickup apparatus.

Optionally, the storage medium is further configured to store program code for performing the following steps: superimposing the original speech set using a beamforming algorithm to obtain the enhanced first speech; and filtering the original speech set using a notch filtering algorithm to obtain the enhanced second speech.
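As an illustration only, the two enhancement steps can be sketched for a single frequency bin. The sketch assumes a far-field source and a linear microphone array, and it substitutes a simple delay-and-sum beamformer and an orthogonal-projection spatial notch for the beamforming and notch filtering algorithms, which the present application does not specify in this passage. The names steering_vector, delay_and_sum and spatial_notch, the array geometry and the speed of sound are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def steering_vector(freq_hz, mic_x_m, angle_rad):
    """Far-field phase shifts for a linear array laid out along the x axis."""
    delays = mic_x_m * np.cos(angle_rad) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)


def delay_and_sum(x_bin, d):
    """Superimpose the phase-aligned channels: (d^H x) / M."""
    return np.vdot(d, x_bin) / len(d)


def spatial_notch(x_bin, d):
    """Remove the component of x that arrives from the target direction."""
    return x_bin - d * (np.vdot(d, x_bin) / len(d))


# Hypothetical 4-microphone array with 5 cm spacing, one 1 kHz bin.
mic_x = np.array([0.0, 0.05, 0.10, 0.15])
d_target = steering_vector(1000.0, mic_x, np.deg2rad(90.0))  # effective area
d_other = steering_vector(1000.0, mic_x, np.deg2rad(20.0))   # interferer

# Mixture of a target component and an interfering component.
x = d_target * (1.0 + 0.5j) + d_other * (0.3 - 0.2j)

enhanced_first = delay_and_sum(x, d_target)    # emphasizes the target
enhanced_second = spatial_notch(x, d_target)   # target direction blocked
```

In this sketch the notch output is orthogonal to the target steering vector, so it carries only what arrives from other directions; a practical system would apply the same operations to every frequency bin and frame.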

Optionally, the storage medium is further configured to store program code for performing the following steps: performing feature extraction on the original speech set, the enhanced first speech and the enhanced second speech, respectively, to obtain an original speech feature, a first speech feature and a second speech feature; inputting the original speech feature, the first speech feature and the second speech feature into the deep learning model to obtain a time-frequency mask of the target speech; and performing signal recovery processing on the enhanced first speech based on the time-frequency mask to obtain the target speech.
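The final recovery step can be illustrated with a minimal sketch. It assumes that the mask is real-valued, lies in the range 0 to 1, and is applied by element-wise multiplication to the spectrogram of the enhanced first speech; the present application does not state the mask type, so these are assumptions, and apply_time_frequency_mask is a hypothetical name.

```python
import numpy as np


def apply_time_frequency_mask(enhanced_first_stft, mask):
    """Scale each time-frequency bin of the enhanced first speech by the mask."""
    if enhanced_first_stft.shape != mask.shape:
        raise ValueError("mask and spectrogram must have the same shape")
    # Clipping keeps the assumed [0, 1] range even if the model overshoots.
    return np.clip(mask, 0.0, 1.0) * enhanced_first_stft


# Toy 2x2 spectrogram (frames x frequency bins) and a toy mask.
spec = np.array([[1 + 1j, 2 + 0j], [0 + 3j, 4 - 4j]])
mask = np.array([[1.0, 0.0], [0.5, 1.2]])
target_spec = apply_time_frequency_mask(spec, mask)
```

The masked spectrogram would then be converted back to a waveform with an inverse short-time Fourier transform.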

Optionally, the storage medium is further configured to store program code for performing the following steps: constructing a plurality of simulated scenes corresponding to the sound pickup device, wherein the number and types of sound sources contained in different simulated scenes are different, the types of sound sources include at least one of a target sound source, an interfering sound source and a noise sound source, the target sound source is located in the effective interaction area, and the interfering sound source and the noise sound source are located in interaction areas other than the effective interaction area; generating simulated speech sets corresponding to the plurality of simulated scenes; and training the deep learning model using the simulated speech sets.
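One way to picture the scene construction is a random sampler that varies the number and types of sources per scene. The sketch below is hypothetical throughout: it describes each source only by an arrival angle, takes the effective interaction area to be the 60 to 120 degree sector, and uses made-up limits and probabilities; none of these values come from the present application.

```python
import random
from dataclasses import dataclass


@dataclass
class Source:
    kind: str         # "target", "interference", or "noise"
    angle_deg: float  # arrival angle relative to the array, 0..180


def angle_outside(rng, zone):
    """Draw an angle from [0, lo) or [hi, 180], outside the effective area."""
    lo, hi = zone
    u = rng.uniform(0.0, lo + (180.0 - hi))
    return u if u < lo else hi + (u - lo)


def sample_scene(rng, zone=(60.0, 120.0), max_interferers=2, max_noise=2):
    """Sample one simulated scene; source counts and types vary per scene."""
    sources = []
    if rng.random() < 0.9:  # hypothetical share of scenes with a target
        sources.append(Source("target", rng.uniform(*zone)))
    for kind, limit in (("interference", max_interferers), ("noise", max_noise)):
        for _ in range(rng.randint(0, limit)):
            sources.append(Source(kind, angle_outside(rng, zone)))
    if not sources:  # keep at least one source in every scene
        sources.append(Source("noise", angle_outside(rng, zone)))
    return sources


rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(50)]
```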

Optionally, the storage medium is further configured to store program code for performing the following steps: determining the simulated speech emitted by each sound source in each simulated scene; determining the transfer function corresponding to each sound source in each simulated scene by the image (mirror) method; and convolving the simulated speech with the transfer function to obtain the simulated speech set corresponding to each simulated scene.
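The convolution step can be sketched as follows. The sketch assumes that the transfer functions are already available as time-domain impulse responses, one per source and microphone; it does not implement the mirror method itself, and the short impulse responses in the example are made-up placeholders rather than simulated room responses. The name simulate_mic_signals is hypothetical.

```python
import numpy as np


def simulate_mic_signals(source_signals, impulse_responses):
    """Convolve each source with its per-microphone response and sum.

    source_signals: list of 1-D arrays, one per sound source.
    impulse_responses: impulse_responses[s][m] is the 1-D response from
        source s to microphone m (assumed to be computed elsewhere).
    Returns an array of shape (n_mics, n_samples).
    """
    n_mics = len(impulse_responses[0])
    length = max(
        len(sig) + len(h) - 1
        for sig, responses in zip(source_signals, impulse_responses)
        for h in responses
    )
    mixture = np.zeros((n_mics, length))
    for sig, responses in zip(source_signals, impulse_responses):
        for m, h in enumerate(responses):
            y = np.convolve(sig, h)
            mixture[m, : len(y)] += y
    return mixture


# Placeholder example: two sources, two microphones.
target = np.array([1.0, 0.0, 0.0, 0.0])
interferer = np.array([0.0, 2.0, 0.0, 0.0])
responses = [
    [np.array([1.0, 0.5]), np.array([0.0, 1.0, 0.5])],  # target to mics
    [np.array([0.0, 1.0]), np.array([1.0])],            # interferer to mics
]
simulated_set = simulate_mic_signals([target, interferer], responses)
```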

Optionally, the storage medium is further configured to store program code for performing the following steps: performing a short-time Fourier transform on the original speech set, the enhanced first speech and the enhanced second speech, respectively, to obtain an original frequency domain signal, a first frequency domain signal and a second frequency domain signal; performing feature extraction on the original frequency domain signal, the first frequency domain signal and the second frequency domain signal, respectively, to obtain an original frequency domain feature, a first frequency domain feature and a second frequency domain feature; and normalizing the original frequency domain feature, the first frequency domain feature and the second frequency domain feature, respectively, to obtain the original speech feature, the first speech feature and the second speech feature.
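The three-stage pipeline (transform, feature extraction, normalization) can be sketched for a single channel. The log-magnitude feature, the per-bin mean and variance normalization, the Hann window and the frame sizes are all assumptions chosen for illustration; the present application does not fix these choices in this passage, and the function names are hypothetical.

```python
import numpy as np


def stft(x, frame_len=512, hop=256):
    """Short-time Fourier transform; returns (n_frames, frame_len // 2 + 1)."""
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def log_magnitude(spec, eps=1e-8):
    """Assumed frequency domain feature: log of the magnitude spectrum."""
    return np.log(np.abs(spec) + eps)


def normalize(feat, eps=1e-8):
    """Zero mean and unit variance per frequency bin, taken over frames."""
    return (feat - feat.mean(axis=0)) / (feat.std(axis=0) + eps)


# Placeholder signal standing in for one channel of the original speech set.
rng = np.random.default_rng(0)
signal = rng.standard_normal(4096)
speech_feature = normalize(log_magnitude(stft(signal)))
```

The same three functions would be applied separately to the original speech set, the enhanced first speech and the enhanced second speech before the features are passed to the model.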

Optionally, the storage medium is further configured to store program code for a deep learning model that comprises an input layer, a plurality of compact feedforward sequential memory networks, a plurality of first hidden layers, a first linear mapping layer and an output layer connected in sequence, wherein each compact feedforward sequential memory network comprises a second hidden layer, a second linear mapping layer and a memory module, and the memory module in a preceding compact feedforward sequential memory network is connected to the memory module in the following compact feedforward sequential memory network.
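A single block of such a network can be sketched as a forward pass in NumPy. The sketch assumes a ReLU hidden layer, a bias-free linear projection, a look-back-only memory module with per-dimension tap weights, and an additive connection between the memory modules of consecutive blocks; the weights are random placeholders, and the class name CompactMemoryBlock and all dimensions are hypothetical. It is not the trained model of the present application.

```python
import numpy as np


class CompactMemoryBlock:
    """Hidden layer -> linear projection -> memory module (look-back taps)."""

    def __init__(self, in_dim, hidden_dim, proj_dim, order, rng):
        self.w_hidden = rng.standard_normal((in_dim, hidden_dim)) * 0.1
        self.b_hidden = np.zeros(hidden_dim)
        self.w_proj = rng.standard_normal((hidden_dim, proj_dim)) * 0.1
        self.taps = rng.standard_normal((order, proj_dim)) * 0.1

    def __call__(self, x, prev_memory=None):
        hidden = np.maximum(x @ self.w_hidden + self.b_hidden, 0.0)  # ReLU
        proj = hidden @ self.w_proj
        memory = proj.copy()
        n_frames = len(proj)
        for i in range(min(len(self.taps), n_frames)):
            # Tap i weights the projection from i frames in the past.
            memory[i:] += self.taps[i] * proj[: n_frames - i]
        if prev_memory is not None:
            # Connection from the memory module of the preceding block.
            memory = memory + prev_memory
        return memory


def run_stack(blocks, features):
    """Feed features through the blocks, chaining their memory modules."""
    memory, x = None, features
    for block in blocks:
        memory = block(x, prev_memory=memory)
        x = memory
    return memory


rng = np.random.default_rng(0)
blocks = [CompactMemoryBlock(40 if k == 0 else 16, 32, 16, 4, rng) for k in range(3)]
features = rng.standard_normal((20, 40))  # 20 frames, 40-dim placeholder input
stack_output = run_stack(blocks, features)
```

In a full model the stack output would pass through further hidden layers and a linear mapping layer before the output layer produces the time-frequency mask.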

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: capturing an original speech set collected by a sound pickup device provided on an audio and video communication device, wherein the original speech set includes a first speech and a second speech, the first speech is a speech signal from a sound source located in an effective interaction area, the second speech is a speech signal from a sound source other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is the voice interaction object that the sound pickup device directionally recognizes; performing enhancement processing on the first speech and the second speech, respectively, to obtain the enhanced first speech and the enhanced second speech; performing signal recovery processing on the original speech set, the enhanced first speech and the enhanced second speech by using a deep learning model to generate a target speech; and controlling the audio and video communication device to output the target speech.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: capturing an original speech set collected by a sound pickup device provided on a target vehicle, wherein the original speech set includes a first speech and a second speech, the first speech is a speech signal from a sound source located in an effective interaction area, the second speech is a speech signal from a sound source other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is the voice interaction object that the sound pickup device directionally recognizes; performing enhancement processing on the first speech and the second speech, respectively, to obtain the enhanced first speech and the enhanced second speech; performing signal recovery processing on the original speech set, the enhanced first speech and the enhanced second speech to generate a target speech; and controlling the target vehicle based on the target speech.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: receiving, by a cloud server, an original speech set uploaded by a client, wherein the original speech set is acquired by the sound pickup device and includes a first speech and a second speech, the first speech is a speech signal from a sound source located in an effective interaction area, the second speech is a speech signal from a sound source other than the sound source located in the effective interaction area, and the sound source located in the effective interaction area is the voice interaction object that the sound pickup device directionally recognizes; performing, by the cloud server, enhancement processing on the first speech and the second speech to obtain the enhanced first speech and the enhanced second speech; performing, by the cloud server, signal recovery processing on the original speech set, the enhanced first speech and the enhanced second speech by using a deep learning model to generate a target speech, wherein the target speech is the speech information directionally picked up by the sound pickup device; and outputting, by the cloud server, the target speech to the client.

The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis. For any portion that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The apparatus embodiments described above are merely exemplary. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the technical solution of the present application may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The storage medium includes any medium that can store program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are also intended to fall within the scope of protection of the present application.

Claims

1. A speech processing method, characterized in that it includes:

acquiring an original speech set collected by a sound pickup device, wherein the original speech set includes a first speech and a second speech, the first speech is a speech signal emitted from a sound source located within an effective interaction area, the second speech is a speech signal emitted from a sound source other than the sound source located within the effective interaction area, the sound source located within the effective interaction area is the speech interaction object directionally identified by the sound pickup device, and the effective interaction area is an interaction area located in a target direction that is preset according to the actual scene;

superimposing the original speech set using a beamforming algorithm to obtain an enhanced first speech;

filtering the original speech set using a notch filtering algorithm to obtain an enhanced second speech, wherein the filtering blocks the first speech in the original speech set so as to enhance the second speech;

performing, using a deep learning model and spatial information, signal recovery processing on the original speech set, the enhanced first speech and the enhanced second speech to generate a target speech, wherein the target speech is the speech information directionally picked up by the sound pickup device, and the spatial information represents the relevant speech features between channels.

2. The method according to claim 1, characterized in that performing, using a deep learning model and spatial information, signal recovery processing on the original speech set, the enhanced first speech and the enhanced second speech to generate the target speech includes:

performing feature extraction on the original speech set, the enhanced first speech and the enhanced second speech, respectively, using the spatial information, to obtain original speech features, first speech features and second speech features;

inputting the original speech features, the first speech features and the second speech features into the deep learning model to obtain a time-frequency mask of the target speech;

performing signal recovery processing on the enhanced first speech based on the time-frequency mask to obtain the target speech.

3. The method according to claim 2, characterized in that the method further comprises:

constructing multiple simulated scenarios corresponding to the sound pickup device, wherein the number and types of sound sources contained in different simulated scenarios are different, the types of sound sources include at least one of the following: a target sound source, an interference sound source and a noise sound source, the target sound source is located in the effective interaction area, and the interference sound source and the noise sound source are located in interaction areas other than the effective interaction area;

generating a set of simulated speech corresponding to the multiple simulated scenarios;

training the deep learning model using the set of simulated speech.

4. The method according to claim 3, characterized in that generating the set of simulated speech corresponding to the plurality of simulated scenarios includes:

determining the simulated speech emitted by each sound source in each simulated scenario;

determining the transfer function corresponding to each sound source in each simulated scenario by the image (mirror) method;

convolving the simulated speech with the transfer function to obtain the set of simulated speech corresponding to each simulated scenario.

5. The method according to claim 2, characterized in that performing feature extraction on the original speech set, the enhanced first speech and the enhanced second speech, respectively, to obtain the original speech features, the first speech features and the second speech features includes:

performing short-time Fourier transforms on the original speech set, the enhanced first speech and the enhanced second speech, respectively, to obtain an original frequency domain signal, a first frequency domain signal and a second frequency domain signal;

performing feature extraction on the original frequency domain signal, the first frequency domain signal and the second frequency domain signal, respectively, to obtain original frequency domain features, first frequency domain features and second frequency domain features;

normalizing the original frequency domain features, the first frequency domain features and the second frequency domain features, respectively, to obtain the original speech features, the first speech features and the second speech features.

6. The method according to claim 1, wherein the deep learning model comprises: an input layer, a plurality of compact feedforward sequential memory networks, a plurality of first hidden layers, a first linear mapping layer and an output layer connected in sequence, wherein each compact feedforward sequential memory network comprises: a second hidden layer, a second linear mapping layer and a memory module, and the memory module in the preceding compact feedforward sequential memory network is connected to the memory module in the following compact feedforward sequential memory network.

7. A speech processing method, characterized in that it includes:

capturing an original speech set collected by a sound pickup device provided on an audio and video communication device, wherein the original speech set includes a first speech and a second speech, the first speech is a speech signal emitted from a sound source located within an effective interaction area, the second speech is a speech signal emitted from a sound source other than the sound source located within the effective interaction area, the sound source located within the effective interaction area is the speech interaction object directionally identified by the sound pickup device, and the effective interaction area is an interaction area located in a target direction that is preset according to the actual scene;

A deep learning model is used to perform signal recovery processing on the original speech set, the enhanced first speech and the second speech using spatial information to generate the target speech, wherein the spatial information is used to represent the relevant speech features between channels;

controlling the audio and video communication device to output the target speech.

8. A speech processing method, characterized in that it includes:

capturing an original speech set collected by a sound pickup device installed on a target vehicle, wherein the original speech set includes a first speech and a second speech, the first speech is a speech signal emitted from a sound source located within an effective interaction area, the second speech is a speech signal emitted from a sound source other than the sound source located within the effective interaction area, the sound source located within the effective interaction area is the speech interaction object directionally identified by the sound pickup device, and the effective interaction area is an interaction area located in a target direction that is preset according to the actual scene;

The target vehicle is controlled based on the target voice.

9. A speech processing method, characterized in that it includes:

receiving, by a cloud server, an original speech set uploaded by a client, wherein the original speech set is acquired by a sound pickup device and includes a first speech and a second speech, the first speech is a speech signal emitted from a sound source located within an effective interaction area, the second speech is a speech signal emitted from a sound source other than the sound source located within the effective interaction area, the sound source located within the effective interaction area is the speech interaction object directionally identified by the sound pickup device, and the effective interaction area is an interaction area located in a target direction that is preset according to the actual scene;

superimposing, by the cloud server, the original speech set using a beamforming algorithm to obtain an enhanced first speech, and filtering the original speech set using a notch filtering algorithm to obtain an enhanced second speech, wherein the filtering blocks the first speech in the original speech set so as to enhance the second speech;

performing, by the cloud server, using a deep learning model and spatial information, signal recovery processing on the original speech set, the enhanced first speech and the enhanced second speech to generate a target speech, wherein the target speech is the speech information directionally picked up by the sound pickup device, and the spatial information represents the relevant speech features between channels;

outputting, by the cloud server, the target speech to the client.

10. A speech processing system, characterized in that it comprises:

a sound pickup device, configured to collect an original speech set, wherein the original speech set includes a first speech and a second speech, the first speech is a speech signal emitted from a sound source located within an effective interaction area, the second speech is a speech signal emitted from a sound source other than the sound source located within the effective interaction area, the sound source located within the effective interaction area is the speech interaction object directionally identified by the sound pickup device, and the effective interaction area is an interaction area located in a target direction that is preset according to the actual scene;

a processing device, connected to the sound pickup device and configured to: superimpose the original speech set using a beamforming algorithm to obtain an enhanced first speech; filter the original speech set using a notch filtering algorithm to obtain an enhanced second speech, wherein the filtering blocks the first speech in the original speech set so as to enhance the second speech; and perform, using a deep learning model and spatial information, signal recovery processing on the original speech set, the enhanced first speech and the enhanced second speech to generate a target speech, wherein the target speech is the speech information directionally picked up by the sound pickup device, and the spatial information represents the relevant speech features between channels.

11. An audio and video communication device, characterized in that it comprises:

a sound pickup device, installed on the audio and video communication device and configured to collect an original speech set, wherein the original speech set includes a first speech and a second speech, the first speech is a speech signal emitted from a sound source located within an effective interaction area, the second speech is a speech signal emitted from a sound source other than the sound source located within the effective interaction area, the sound source located within the effective interaction area is the speech interaction object directionally identified by the sound pickup device, and the effective interaction area is an interaction area located in a target direction that is preset according to the actual scene;

a processor, connected to the sound pickup device and configured to: superimpose the original speech set using a beamforming algorithm to obtain an enhanced first speech; filter the original speech set using a notch filtering algorithm to obtain an enhanced second speech, wherein the filtering blocks the first speech in the original speech set so as to enhance the second speech; and perform, using a deep learning model and spatial information, signal recovery processing on the original speech set, the enhanced first speech and the enhanced second speech to generate a target speech, wherein the spatial information represents the relevant speech features between channels;

an output device, connected to the processor and configured to output the target speech.

12. A vehicle, characterized in that it comprises:

a sound pickup device, installed on the vehicle and configured to collect an original speech set, wherein the original speech set includes a first speech and a second speech, the first speech is a speech signal emitted from a sound source located within an effective interaction area, the second speech is a speech signal emitted from a sound source other than the sound source located within the effective interaction area, the sound source located within the effective interaction area is the speech interaction object directionally identified by the sound pickup device, and the effective interaction area is an interaction area located in a target direction that is preset according to the actual scene;

a controller, connected to the sound pickup device and configured to: superimpose the original speech set using a beamforming algorithm to obtain an enhanced first speech; filter the original speech set using a notch filtering algorithm to obtain an enhanced second speech, wherein the filtering blocks the first speech in the original speech set so as to enhance the second speech; perform, using a deep learning model and spatial information, signal recovery processing on the original speech set, the enhanced first speech and the enhanced second speech to generate a target speech, wherein the spatial information represents the relevant speech features between channels; and control the vehicle based on the target speech.

13. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored program, wherein, when the program is executed, it controls the device where the storage medium is located to perform the speech processing method according to any one of claims 1 to 9.