CN111063342B

CN111063342B - Speech recognition method, speech recognition device, computer equipment and storage medium

Info

Publication number: CN111063342B
Application number: CN202010001662.6A
Authority: CN
Inventors: 吴渤; 于蒙; 陈联武; 温超; 苏丹; 俞栋
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-01-02
Filing date: 2020-01-02
Publication date: 2022-09-30
Anticipated expiration: 2040-01-02
Also published as: CN111063342A

Abstract

The application discloses a voice recognition method, a voice recognition device, computer equipment and a storage medium, and belongs to the field of data processing. The method comprises the following steps: inputting the collected audio data into a time domain separation model, and predicting by the time domain separation model based on the audio data to obtain time domain separation information, wherein the time domain separation information is used for separating noise data and voice data in the audio data; performing voice separation on the audio data based on the time domain separation information to obtain time domain voice data; performing feature extraction on the time domain voice data to obtain time domain voice feature information corresponding to the time domain voice data; and performing voice recognition on the time domain voice feature information corresponding to the time domain voice data, and determining voice content corresponding to the time domain voice data. With this voice recognition method, the computer equipment does not need to convert the time domain audio information into the frequency domain before carrying out voice separation; the whole process, including voice separation, can be completed in the time domain, which improves the speed of voice recognition.

Description

Speech recognition method, speech recognition device, computer equipment and storage medium

Technical Field

The present application relates to the field of data processing, and in particular, to a speech recognition method, apparatus, computer device, and storage medium.

Background

With the development of computer technology, people want to control various intelligent devices to implement different functions in a simpler and easier manner. Because voice recognition is convenient, manufacturers attach great importance to it and hope to use voice recognition technology to reduce manual operation by users and improve the usability of their products.

In the related technology, the collected audio information is often subjected to time-frequency conversion to obtain frequency domain information corresponding to the audio information, noise in the audio information is removed in a frequency domain, human voice is kept, then the frequency domain information is restored to be a waveform in a time domain, feature extraction and voice recognition are carried out on the waveform, intelligent equipment is controlled through a voice recognition result to achieve a corresponding function, and manual operation is reduced.

However, in this speech recognition process, converting the frequency domain information back into a waveform in the time domain consumes considerable time and computing resources. Speech recognition is therefore slow, and the smart device responds slowly to the user's voice instruction.

Disclosure of Invention

The embodiment of the application provides a voice recognition method, a voice recognition device, computer equipment and a storage medium, which can improve the speed of response of intelligent equipment to a user voice instruction, and the technical scheme is as follows:

in one aspect, a speech recognition method is provided, and the method includes:

inputting collected audio data into a time domain separation model, and predicting by the time domain separation model based on the audio data to obtain time domain separation information, wherein the time domain separation information is used for separating noise data and voice data in the audio data;

performing voice separation on the audio data based on the time domain separation information to obtain time domain voice data;

performing feature extraction on the time domain voice data to obtain time domain voice feature information corresponding to the time domain voice data;

and performing voice recognition on the time domain voice characteristic information corresponding to the time domain voice data, and determining the voice content corresponding to the time domain voice data.

In one aspect, a speech recognition method is provided, and the method includes:

converting frequency domain information of voice data in the audio data into a spectrogram;

performing feature extraction on the spectrogram to obtain frequency domain voice feature information corresponding to the spectrogram;

inputting the frequency domain voice feature information into a frequency domain voice recognition model, and predicting by the frequency domain voice recognition model based on the frequency domain voice feature information to obtain a phoneme corresponding to the frequency domain voice feature information;

and determining the voice content corresponding to the voice data based on a plurality of phonemes.

In one aspect, a speech recognition apparatus is provided, the apparatus comprising:

the prediction module is used for inputting the collected audio data into a time domain separation model, and the time domain separation model carries out prediction based on the audio data to obtain time domain separation information, wherein the time domain separation information is used for separating noise data and voice data in the audio data;

the voice separation module is used for carrying out voice separation on the audio data based on the time domain separation information to obtain time domain voice data;

the characteristic extraction module is used for extracting the characteristics of the time domain voice data to obtain time domain voice characteristic information corresponding to the time domain voice data;

and the voice recognition module is used for performing voice recognition on the time domain voice characteristic information corresponding to the time domain voice data and determining the voice content corresponding to the time domain voice data.

In one possible embodiment, the speech recognition module comprises:

the second prediction unit is used for inputting the time domain voice feature information into a time domain voice recognition model, and the time domain voice recognition model carries out prediction based on the time domain voice feature information to obtain the corresponding probability between the time domain voice feature information and a plurality of phonemes;

a phoneme determining unit, configured to determine a phoneme with the highest probability as a phoneme corresponding to the time-domain speech feature information;

and the voice content determining unit is used for determining the voice content corresponding to the time domain voice data based on the plurality of phonemes.

In one aspect, a speech recognition apparatus is provided, the apparatus comprising:

the conversion module is used for converting the frequency domain information of the voice data in the audio data into a spectrogram;

the characteristic extraction module is used for extracting the characteristics of the spectrogram to obtain frequency domain voice characteristic information corresponding to the spectrogram;

the phoneme prediction module is used for inputting the frequency domain voice characteristic information into a frequency domain voice recognition model, and the frequency domain voice recognition model predicts the frequency domain voice characteristic information based on the frequency domain voice characteristic information to obtain phonemes corresponding to the frequency domain voice characteristic information;

and the voice content determining module is used for determining the voice content corresponding to the voice data based on the plurality of phonemes.

In one possible implementation, the phoneme prediction module includes:

a probability prediction unit, configured to input the frequency-domain speech feature information into a frequency-domain speech recognition model, and perform prediction by the frequency-domain speech recognition model based on the frequency-domain speech feature information to obtain probabilities corresponding to multiple phonemes;

and the determining unit is used for determining the phoneme with the maximum probability as the phoneme corresponding to the voice characteristic information.

In a possible embodiment, the apparatus further comprises:

the frequency domain separation information prediction module is used for inputting frequency domain audio data into a frequency domain separation model, and the frequency domain separation model carries out prediction based on the frequency domain audio data to obtain frequency domain separation information, wherein the frequency domain separation information is used for separating noise data and voice data in the frequency domain audio data;

and the voice enhancement module is used for carrying out voice enhancement on the frequency domain audio data based on the frequency domain separation information to obtain the frequency domain information of the voice data in the audio data.

In one aspect, a computer device is provided that includes one or more processors and one or more memories having at least one program code stored therein, the program code being loaded and executed by the one or more processors to perform the operations performed by the speech recognition method.

In one aspect, a storage medium is provided, in which at least one program code is stored, the program code being loaded and executed by a processor to implement the operations performed by the speech recognition method.

With the voice recognition method, the computer equipment does not need to convert the time domain audio information into the frequency domain before carrying out voice separation; the whole process, including voice separation, can be completed in the time domain, which improves the speed of voice recognition. Voice recognition can also be carried out directly in the frequency domain, so that the voice information in the frequency domain does not need to be converted back into the time domain for feature extraction and voice recognition, which likewise improves the speed of voice recognition.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic diagram of an implementation environment of a speech recognition method according to an embodiment of the present application;

FIG. 2 is a flow chart of a speech recognition method according to an embodiment of the present application;

FIG. 3 is a logic flow diagram of a speech recognition method provided by an embodiment of the present application;

FIG. 4 is a flow chart of a speech recognition method according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution.

The term "at least one" in this application means one or more, "a plurality" means two or more, for example, a plurality of audio frames of the same length means two or more audio frames of the same length.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision technology, voice processing technology, natural language processing technology, machine learning/deep learning and the like.

Machine Learning (ML) is an interdisciplinary subject that draws on probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other fields. It studies how a computer can simulate or realize human learning behavior so as to acquire new knowledge or skills and reorganize its existing knowledge structure to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.

Fourier transform is a time-frequency transform method, which can transform information in the time domain to the frequency domain.

Phonemes are the smallest phonetic units, divided according to the natural attributes of speech. They are analyzed according to the pronunciation actions within a syllable, with one action constituting one phoneme, and they are divided into vowels and consonants. For example, the Chinese syllable 啊 (ā) has only one phoneme, 爱 (ài) has two phonemes, and 代 (dài) has three phonemes.

Speech enhancement refers to the process of separating noise data from speech data in audio data.

FIG. 1 is a schematic diagram of an implementation environment of a speech recognition method according to an embodiment of the present application. Referring to FIG. 1, the implementation environment includes a computer device 110 and a server 140.

The computer device 110 is connected to the server 140 through a wireless network or a wired network. The computer device 110 may be a smart phone, a tablet computer, a smart speaker, etc. An application that supports speech recognition technology is installed and running on the computer device 110. Illustratively, the computer device 110 is a computer device used by a user, and a user account is logged in to the application running on the computer device 110.

Optionally, the server 140 includes an access server, a background server and a database. The access server is used to provide access services to the computer device 110. The background server is used for providing background services related to voice recognition. The database may include a user information database, a sample database, and the like; of course, different services provided by the server may correspond to different databases. There may be one or more background servers. When there are multiple background servers, at least two background servers provide different services, and/or at least two background servers provide the same service, for example in a load balancing manner, which is not limited in the embodiments of the present application.

Computer device 110 may refer broadly to one of a plurality of computer devices, and the present embodiment is illustrated with computer device 110 only.

Those skilled in the art will appreciate that the number of computer devices described above may be greater or fewer. For example, the number of the computer devices may be only one, or several tens or hundreds, or more, and in this case, other computer devices may be included in the implementation environment. The embodiment of the invention does not limit the number and types of the computer equipment.

The voice recognition method can be applied to products such as a vehicle-mounted terminal, a television box, a voice recognition product and an intelligent sound box, can be applied to the front end of the products, and can also be realized through interaction between the front end and a server. If the front end of the product is less computationally powerful, then only the speech enhancement portion may be performed, with the speech recognition portion being performed by the server.

Taking the vehicle-mounted terminal as an example, the vehicle-mounted terminal can collect audio data, and perform voice enhancement on the audio data to obtain voice data. The vehicle-mounted terminal can send the voice data to a background server connected with the vehicle-mounted terminal, and the background server performs feature extraction and voice recognition on the received voice data to obtain voice content corresponding to the voice data. The background server may send the voice content corresponding to the voice data to the vehicle-mounted terminal, and the vehicle-mounted terminal executes corresponding driving control or processing based on the acquired voice content, for example, operations such as opening or closing a sunroof, opening or closing a navigation system, and opening or closing lighting.

Taking the television box as an example, a user can send audio data to the television box through a remote controller matched with the television box, and the television box can perform voice enhancement on the audio information to obtain voice data. The television box can send the voice data to a background server connected with the television box, and the background server performs feature extraction and voice recognition on the received voice data to obtain voice content corresponding to the voice data. The background server may send the voice content corresponding to the voice data to the television box, and the television box executes corresponding operations based on the acquired voice content, such as operations of switching playing content and turning on or off the television box.

Taking an automatic voice recognition product as an example, a user can wake up the automatic voice recognition product through a preset voice instruction, after the automatic voice recognition product is woken up by the user, the automatic voice recognition product can collect audio data and send the audio data to a background server, the background server performs voice enhancement on the audio data to obtain voice data, and then feature extraction and voice recognition are performed on the voice data to obtain voice content corresponding to the voice data. The background server can send the voice content corresponding to the voice data to the automatic voice recognition product, and the automatic voice recognition product executes corresponding operations based on the acquired voice content, such as setting an alarm clock, converting a language, taking a picture, and the like.

Taking the smart sound box as an example, a user can wake up the smart sound box through a preset voice instruction, the smart sound box can collect audio data after being awakened by the user, the audio data are sent to the background server, the background server performs voice enhancement on the audio data to obtain voice data, and then feature extraction and voice recognition are performed on the voice data to obtain voice content corresponding to the voice data. The background server may send the voice content corresponding to the voice data to the smart sound box, and the smart sound box executes corresponding operations based on the acquired voice content, such as operations of switching songs, single song cycle, time telling, and the like.

If the front-end product, such as the vehicle-mounted terminal, the television box, the voice recognition product, the smart sound box and the like, has enough computing power, all voice recognition operations can be executed at the front end without communicating with the server.

It should be noted that the speech recognition method provided in the embodiments of the present application can be applied to various products based on speech recognition functions, and the foregoing description is only for convenience of understanding and is not to be construed as an inappropriate limitation to the present application.

Fig. 2 is a flowchart of a speech recognition method provided in an embodiment of the present application, fig. 3 is a logic flowchart of the speech recognition method provided in the embodiment of the present application, and referring to fig. 2 and fig. 3, the method includes:

201. The computer device inputs the collected audio data into the time domain separation model, and the time domain separation model outputs time domain separation information, which is used for separating noise data and voice data in the audio data.

The time domain separation model in the embodiment of the application can be obtained by training based on sample time domain voice data and sample audio data, wherein the sample audio data is generated by mixing the sample time domain voice data and sample time domain noise data.

In an iteration process, the computer device can input target sample audio data into the time domain separation model to be trained, the time domain separation model to be trained predicts based on the target sample audio data and outputs predicted time domain separation information, and the computer device can obtain predicted time domain voice data based on the predicted time domain separation information and the target sample audio data. The computer device may determine difference information between the predicted time-domain speech data and target sample time-domain speech data, and adjust model parameters of the time-domain separation model to be trained based on the difference information, wherein the target sample time-domain speech data is sample time-domain speech data corresponding to the target sample audio data. When the model parameters of the time-domain separation model to be trained meet the target conditions, the computer equipment can stop model training and take the model at the moment as the time-domain separation model. In particular, the computer device may segment the sample time-domain speech data and the sample audio data into vectors of target lengths, represented by sample speech vectors and sample audio vectors, respectively.

The computer device can input the sample audio vector into the time domain separation model to be trained, and the time domain separation model to be trained operates on it based on the initialized weights to obtain a predicted time domain separation vector. The computer device may multiply the predicted time domain separation vector and the sample audio vector to obtain a predicted speech vector. The computer device may determine similarity information between the predicted speech vector and the sample speech vector, and adjust the model parameters of the time domain separation model to be trained based on the similarity information. When the value of the loss function of the time domain separation model to be trained reaches a target threshold, the computer device may stop the model training and use the model at this time as the time domain separation model. In addition, the time domain separation model may also be a pre-trained open source model, such as the Time-Domain Audio Separation Network (TasNet) or the Deep Extractor for Music Sources (Demucs), which is not limited in the embodiments of the present application.
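The training iteration described above can be illustrated with a toy sketch. It is not the model of the embodiment: the "model" here is only one learnable value per sample (passed through a sigmoid to form the separation vector), the mean squared error stands in for the unspecified difference or similarity measure, and the names `train_mask`, `lr`, `loss_threshold` and `max_steps`, as well as the sample values, are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse(predicted, target):
    """Mean squared error, used here as an assumed difference measure."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def train_mask(mixture, clean, lr=5.0, loss_threshold=1e-4, max_steps=20000):
    """Toy stand-in for the model to be trained: one learnable logit per sample."""
    weights = [0.0] * len(mixture)  # initialised weights
    losses = []
    n = len(mixture)
    for _ in range(max_steps):
        # Predicted time domain separation vector.
        mask = [sigmoid(w) for w in weights]
        # Predicted speech vector = separation vector * sample audio vector.
        predicted = [m * x for m, x in zip(mask, mixture)]
        loss = mse(predicted, clean)
        losses.append(loss)
        if loss <= loss_threshold:  # target condition reached, stop training
            break
        # Gradient descent step on each logit.
        for i in range(n):
            grad = 2.0 * (predicted[i] - clean[i]) * mixture[i] * mask[i] * (1.0 - mask[i]) / n
            weights[i] -= lr * grad
    return weights, losses


# Sample audio vector generated by mixing sample speech and sample noise.
clean = [0.5, -0.4, 0.3, 0.1]
noise = [0.1, -0.1, 0.1, 0.1]
mixture = [s + v for s, v in zip(clean, noise)]
weights, losses = train_mask(mixture, clean)
```

A real time domain separation model would predict the separation vector from the input audio rather than store one value per sample; the sketch only shows the order of operations within one training run.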

In a possible implementation manner, the computer device may divide the acquired audio data into a plurality of audio frames of the same length, where each audio frame shares an overlapping part of fixed length with the audio frames immediately before and after it in time, which prevents audio data from being lost at the frame boundaries. The computer device may input the audio frames into the time domain separation model sequentially in time order, the time domain separation model performs prediction based on each audio frame to obtain a plurality of pieces of first separation information, and the plurality of pieces of first separation information are combined in time order to obtain the time domain separation information.
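The framing step above might be sketched as follows. The function names `split_into_frames` and `merge_frames`, the zero padding of the last frame, and the rule for merging overlapping frames (keeping the first copy of each overlapped sample) are assumptions made for illustration; the embodiment does not specify them.

```python
def split_into_frames(samples, frame_len, overlap):
    """Split samples into equal-length frames; neighbouring frames share `overlap` samples."""
    if not 0 <= overlap < frame_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frame_len")
    hop = frame_len - overlap
    frames = []
    start = 0
    while start < len(samples):
        frame = list(samples[start:start + frame_len])
        # Assumption: zero-pad the final frame so that all frames have the same length.
        frame += [0.0] * (frame_len - len(frame))
        frames.append(frame)
        if start + frame_len >= len(samples):
            break
        start += hop
    return frames


def merge_frames(frames, overlap):
    """Combine per-frame results in time order, dropping each frame's overlapped head."""
    if not frames:
        return []
    merged = list(frames[0])
    for frame in frames[1:]:
        merged.extend(frame[overlap:])
    return merged


frames = split_into_frames([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], frame_len=4, overlap=2)
restored = merge_frames(frames, overlap=2)
```

In this sketch the per-frame outputs of the separation model would take the place of `frames` before `merge_frames` is called.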

202. The computer device performs voice separation on the audio data based on the time domain separation information to obtain time domain voice data.

In a possible implementation manner, the computer device may directly perform speech separation in the time domain, based on the collected time domain audio data and the time domain separation information obtained from the time domain separation model, so as to obtain time domain speech data. Specifically, the computer device may represent the audio data and the time domain separation information in matrix form, and may directly multiply the matrix representing the audio data by the matrix representing the time domain separation information to obtain a matrix representing the time domain speech data. For example, the matrix representing the audio data is a one-dimensional matrix {1,1,2,3,4,5}, and the time domain separation information is also a one-dimensional matrix {1,0,1,1,0,0}^T. Multiplying the two element by element yields {1,0,2,3,0,0}, which may be used to represent the time domain speech data.

In this embodiment, the computer device can separate the noise data and the voice data in the audio data directly in the time domain to obtain the time domain voice data. The audio data does not need to be converted into the frequency domain before the noise data and the voice data are separated, which reduces the amount of computation performed by the computer device.
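The numerical example above can be reproduced with a short sketch, assuming the multiplication is applied element by element as the stated result implies. The function name `apply_time_domain_mask` is hypothetical.

```python
def apply_time_domain_mask(audio, separation):
    """Multiply audio samples by the separation values, element by element."""
    if len(audio) != len(separation):
        raise ValueError("audio and separation information must have the same length")
    return [a * s for a, s in zip(audio, separation)]


audio = [1, 1, 2, 3, 4, 5]
separation = [1, 0, 1, 1, 0, 0]
speech = apply_time_domain_mask(audio, separation)
print(speech)  # [1, 0, 2, 3, 0, 0]
```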

203. The computer device performs feature extraction on the time domain voice data to obtain time domain voice feature information corresponding to the time domain voice data.

In a possible implementation manner, the computer device may input the time domain speech data into a feature extraction model. For any speech frame, the feature extraction model performs feature extraction based on the relationship between that speech frame and its associated speech frames, that is, the speech frames adjacent to it in time, to obtain feature information of the speech frame. The feature information of each speech frame is then combined, and the time domain speech feature information corresponding to the time domain speech data is output.

Specifically, the computer device may divide the time domain speech data into a plurality of speech frames of a target length, and input any speech frame together with a target number of associated speech frames into the feature extraction model. The feature extraction model may perform feature extraction on the speech frame and the associated speech frames respectively, assign different weights to the speech frame and the associated speech frames, and perform weighted summation on the extracted feature information of these speech frames to obtain target speech frame feature information. The plurality of pieces of target speech frame feature information are combined in time order to obtain the time domain speech feature information corresponding to the time domain speech data.

For example, the computer device may divide the time domain speech data into a plurality of speech frames with a length of 32 ms, and input one speech frame and four associated speech frames into the feature extraction model, where two of the associated speech frames were collected before the speech frame and two were collected after it. The feature extraction model may perform feature extraction on the five speech frames respectively to obtain one piece of first speech frame feature information and four pieces of second speech frame feature information, where the first speech frame feature information is the feature information corresponding to the speech frame, and the second speech frame feature information is the feature information corresponding to the associated speech frames. The feature extraction model may assign a weight of 0.7 to the first speech frame feature information and a weight of 0.075 to each of the four pieces of second speech frame feature information, and perform weighted summation on the five pieces of feature information to obtain the target speech frame feature information. The plurality of pieces of target speech frame feature information are combined in time order to obtain the time domain speech feature information corresponding to the time domain speech data.
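The weighted summation in this example might be sketched as follows. The weights 0.7 and 0.075 come from the example above; the function name `weighted_frame_feature`, the representation of feature information as lists of floats, and the sample feature values are assumptions for illustration.

```python
def weighted_frame_feature(center, neighbours, center_weight=0.7, neighbour_weight=0.075):
    """Weighted sum of one frame's features and the features of its associated frames."""
    combined = [center_weight * value for value in center]
    for features in neighbours:
        if len(features) != len(center):
            raise ValueError("all feature vectors must have the same length")
        for i, value in enumerate(features):
            combined[i] += neighbour_weight * value
    return combined


# Hypothetical two-dimensional features: one speech frame and four associated frames
# (two collected before it and two collected after it).
first_feature = [1.0, 2.0]
second_features = [[0.0, 2.0], [2.0, 2.0], [1.0, 0.0], [1.0, 4.0]]
target_feature = weighted_frame_feature(first_feature, second_features)
```

With these weights the five coefficients sum to 1 (0.7 + 4 × 0.075), so the result stays on the same scale as the individual feature vectors.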

It should be noted that the number of the speech frames input into the feature extraction model and the weight given to the feature information of each speech frame by the feature extraction model may be set according to actual needs, which is not limited in the embodiment of the present application.

204. The computer device inputs the time domain voice feature information into the time domain voice recognition model, and the time domain voice recognition model performs prediction based on the time domain voice feature information to obtain the phoneme corresponding to the time domain voice feature information.

The time domain speech recognition model is trained on sample time domain speech feature information and the sample phonemes corresponding to that information. The time domain speech feature information may include information such as the speech intensity and speech intonation of the speech data. Through training, the time domain speech recognition model acquires the ability to predict the probabilities of the corresponding phonemes based on time domain speech feature information. The goal of training is that, after sample time domain speech feature information is input into the time domain speech recognition model, the probability obtained for the corresponding sample phoneme is as high as possible.

Of course, not all time domain speech feature information has a corresponding phoneme, for example during a pause or gap in a person's speech. Because speech feature information during a pause or gap differs greatly from speech feature information while the person is speaking, the time domain speech recognition model in the present application may also set a target condition for screening the time domain speech feature information, that is, for determining whether it is a speech feature produced while a person is speaking. If the time domain speech feature information meets the target condition, subsequent speech recognition can be performed to obtain the corresponding phoneme. If it does not meet the target condition, the time domain speech recognition model can directly output a blank as the corresponding phoneme, which ensures that every piece of input time domain speech feature information has a corresponding output. For example, the time domain speech recognition model may compare the speech intensity in the time domain speech feature information with a speech intensity threshold. If the speech intensity is greater than the threshold, the time domain speech feature information is determined to be speech feature information of a person speaking, and the subsequent speech recognition operation is performed on it. If the speech intensity is smaller than the threshold, the time domain speech feature information is determined to be blank information, and a blank phoneme is output directly.
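The screening step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that speech intensity is approximated by the root mean square of a feature vector; the names `BLANK`, `screen_frame`, and `recognize` are hypothetical and do not come from the application.

```python
import numpy as np

BLANK = "<blank>"  # hypothetical symbol for the blank phoneme


def screen_frame(features, intensity_threshold, recognize):
    """Output a blank for low-intensity input, otherwise run recognition.

    Assumption: intensity is the RMS of the feature vector. The application
    only states that the feature information includes a speech intensity.
    """
    intensity = float(np.sqrt(np.mean(np.square(features))))
    if intensity > intensity_threshold:
        # Target condition met: continue with phoneme recognition.
        return recognize(features)
    # Target condition not met (the equal case is treated as blank here).
    return BLANK
```

A caller would pass the actual recognition model as `recognize`, so that every input still yields exactly one output, either a phoneme or a blank.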

It should be noted that the time domain speech recognition model in the embodiment of the present application may be a shallow acoustic model, such as a Gaussian Mixture Model (GMM) or a Hidden Markov Model (HMM). It may also be a deep learning model, such as a network structure based on a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), or a Recurrent Neural Network (RNN), or a model obtained by improving the above models, such as a Deep Fully Convolutional Neural Network (DFCNN) and a high-performance Deep Neural Network computing Library (cld for Deep Neural Network, ksnn). The type of the time domain speech recognition model is not limited in the embodiment of the present application.

In one possible implementation, the computer device may input the time domain speech feature information into the time domain speech recognition model. The model performs prediction based on that information to obtain the probabilities that it corresponds to each of a plurality of phonemes, and the phoneme with the highest probability is determined as the phoneme corresponding to the time domain speech feature information. For example, the computer device may represent the time domain speech feature information as a vector and input this first vector into the time domain speech recognition model. The model operates on the first vector through a plurality of hidden layers, pooling layers, and fully connected layers to obtain a second vector representing the probabilities that the first vector corresponds to each of the plurality of phonemes, for example (0.1, 0.11, 0.2, 0.6, 0.7, …), where 0.1 may represent a probability of 0.1 that the first vector corresponds to the phoneme "a", and 0.7 may represent a probability of 0.7 that it corresponds to the phoneme "e". If 0.7 is the largest value in the second vector, the model may determine that the phoneme corresponding to the first vector, and therefore to the time domain speech feature information, is "e". Of course, the structure of the time domain speech recognition model may be set according to actual needs, which is not limited in the embodiment of the present application.
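Selecting the most probable phoneme from the second vector can be sketched as follows. The phoneme inventory `PHONEMES` is a hypothetical five-entry list chosen so that the first entry is "a" and the fifth is "e", as in the example above; the intermediate entries are assumptions.

```python
import numpy as np

# Hypothetical phoneme inventory; only "a" and "e" appear in the example.
PHONEMES = ["a", "b", "c", "d", "e"]


def pick_phoneme(scores, phonemes=PHONEMES):
    """Return the phoneme whose score in the second vector is largest."""
    scores = np.asarray(scores, dtype=float)
    if scores.shape != (len(phonemes),):
        raise ValueError("one score is required per phoneme")
    return phonemes[int(np.argmax(scores))]


print(pick_phoneme([0.1, 0.11, 0.2, 0.6, 0.7]))  # e
```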

In one possible implementation, the computer device may input a plurality of temporally adjacent pieces of time domain speech feature information into the time domain speech recognition model. The model performs prediction based on them to obtain the probabilities that they correspond to each of a plurality of phoneme groups, and the phoneme group with the highest probability is determined as the phoneme group corresponding to the plurality of pieces of time domain speech feature information. For example, the computer device may represent each piece of time domain speech feature information as a vector and input the resulting plurality of first vectors into the time domain speech recognition model. The model operates on the first vectors through a plurality of hidden layers, pooling layers, and fully connected layers to obtain a second vector representing the probabilities that the plurality of first vectors correspond to each of the plurality of phoneme groups, for example (0.1, 0.11, 0.2, 0.6, 0.7, …), where 0.1 may represent a probability of 0.1 that the first vectors correspond to the phoneme group "ca", and 0.7 may represent a probability of 0.7 that they correspond to the phoneme group "bo". If 0.7 is the largest value in the second vector, the model may determine that the phoneme group corresponding to the plurality of first vectors, and therefore to the plurality of pieces of time domain speech feature information, is "bo". Of course, the structure of the time domain speech recognition model may be set according to actual needs, which is not limited in the embodiment of the present application.

205. The computer device inserts separators between the plurality of phonemes, where the phonemes between any two adjacent separators correspond to the same target phoneme, and determines the speech content corresponding to the time domain speech data based on the plurality of target phonemes.

In one possible embodiment, the computer device may combine the phonemes predicted by the speech recognition model in chronological order and insert separators between the plurality of phonemes based on a separator insertion model, which may predict the insertion positions of the separators based on temporally adjacent time domain speech feature information. The computer device may determine the phonemes between any two adjacent separators as the same target phoneme, determine a plurality of target syllables based on the plurality of target phonemes, and determine the text information corresponding to the speech data based on the plurality of target syllables. Specifically, the main function of the separator insertion model is to distinguish two consecutive identical target phonemes. For example, suppose the computer device obtains the phonemes "hhheeellloo" in chronological order. If it directly merged each run of identical phonemes into one target phoneme, that is, "hhh" into "h", "eee" into "e", "lll" into "l", and "oo" into "o", the resulting text information would be "helo", which differs from the actual result "hello". Based on the separator insertion model, the computer device may instead insert separators into the phonemes "hhheeellloo", for example "/hhh/eee/ll/l/oo/", where "/" represents a separator, and thereby obtain the correct text information "hello". This can improve the accuracy of speech recognition.
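The effect of the separators can be illustrated with a short sketch that contrasts merging runs of identical phonemes with merging per separator-delimited segment. It assumes each phoneme is a single character, that every segment holds repetitions of one phoneme, and that the separator positions have already been predicted; the function names are hypothetical.

```python
from itertools import groupby


def collapse_without_separators(phonemes):
    """Merge every run of identical phonemes into a single target phoneme."""
    return "".join(key for key, _ in groupby(phonemes))


def collapse_with_separators(marked, sep="/"):
    """Merge the phonemes inside each separator-delimited segment.

    Assumption: each segment contains repetitions of one phoneme, so the
    first character of the segment is taken as the target phoneme.
    """
    segments = [segment for segment in marked.split(sep) if segment]
    return "".join(segment[0] for segment in segments)


print(collapse_without_separators("hhheeellloo"))     # helo
print(collapse_with_separators("/hhh/eee/ll/l/oo/"))  # hello
```

Without separators the two consecutive "l" target phonemes are merged into one; with a separator between "ll" and "l" they are kept apart.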

With the speech recognition method provided by the embodiment of the present application, the computer device can perform speech separation on the audio information directly in the time domain to separate the noise information from the speech information, without converting the audio information into the frequency domain for speech enhancement. It can also perform feature extraction and speech recognition on the speech information directly in the time domain, so that the overall speed of speech recognition is improved.

Fig. 4 is a flowchart of a speech recognition method provided in an embodiment of the present application, and referring to fig. 4, the method includes:

401. The computer device performs time-frequency transformation on the acquired audio data to obtain frequency domain audio data.

In one possible implementation, the computer device may sample audio at a target sampling frequency to obtain audio data. The computer device may group N sampling points into one audio frame, which speeds up its processing of the audio data. This process is called framing. N is the number of sampling points per frame and is a positive integer whose value can be set according to actual needs, for example 256 or 512; the value of N is not limited in the embodiment of the present application. In addition, when framing, the computer device may make two adjacent audio frames overlap. The overlapping portion is called the frame shift, and its size is related to N, for example 1/2 or 1/3 of N, which is not limited in the embodiment of the present application. Framing in this way avoids excessive change between two adjacent audio frames, so that the computer device can obtain more accurate results in subsequent processing of the audio data.
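Framing with overlap can be sketched as below. This is a minimal illustration assuming NumPy; the function name `frame_signal`, the default values, and the choice to drop trailing samples that do not fill a whole frame are assumptions, not details from the application.

```python
import numpy as np


def frame_signal(samples, frame_len=256, overlap=128):
    """Split a 1-D signal into frames of frame_len samples.

    Adjacent frames share `overlap` samples (here 1/2 of N by default).
    Trailing samples that do not fill a whole frame are dropped.
    """
    samples = np.asarray(samples, dtype=float)
    if not 0 <= overlap < frame_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frame_len")
    hop = frame_len - overlap  # distance between starts of adjacent frames
    if len(samples) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(samples) - frame_len) // hop
    return np.stack(
        [samples[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )


frames = frame_signal(np.arange(1024), frame_len=256, overlap=128)
print(frames.shape)  # (7, 256)
```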

After framing, the computer device may also window the audio frames, specifically, the computer device may multiply each audio frame by a window function to obtain a windowed audio frame.

After windowing, the computer device may perform time-frequency transformation on the windowed audio frames, converting the audio data from the time domain to the frequency domain to obtain the frequency domain audio data of each audio frame. The time-frequency transformation makes it easier for the computer device to obtain the characteristics of the audio data and to analyze and process the audio data further. Specifically, the computer device may perform the time-frequency transformation on the acquired audio data by using the Fast Fourier Transform (FFT), the wavelet transform, or any other method capable of implementing time-frequency transformation, to obtain the frequency domain audio data.
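Windowing followed by an FFT can be sketched as follows. The application does not name a window function, so the Hamming window used here is an assumption, as is the function name `frames_to_spectrum`; the input is assumed to be a 2-D array with one audio frame per row.

```python
import numpy as np


def frames_to_spectrum(frames):
    """Window each frame and transform it to the frequency domain.

    Assumption: a Hamming window is used. The real-input FFT returns
    N // 2 + 1 complex bins for each frame of N samples.
    """
    frames = np.asarray(frames, dtype=float)
    window = np.hamming(frames.shape[1])
    return np.fft.rfft(frames * window, axis=1)


# A tone with exactly 16 cycles per 256-sample frame peaks at bin 16.
n = np.arange(256)
tone = np.sin(2 * np.pi * 16 * n / 256)
spectrum = frames_to_spectrum(tone[np.newaxis, :])
print(spectrum.shape, int(np.argmax(np.abs(spectrum[0]))))  # (1, 129) 16
```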

402. The computer equipment inputs the frequency domain audio data into the frequency domain separation model, the frequency domain separation model carries out prediction based on the frequency domain audio data to obtain frequency domain separation information, and the frequency domain separation information is used for separating noise data and voice data in the frequency domain audio data.

The frequency domain separation model in the embodiment of the application can be obtained by training based on sample frequency domain voice data and sample frequency domain audio data, wherein the sample frequency domain audio data is generated by mixing the sample frequency domain voice data and the sample frequency domain noise data. In an iteration process, the computer device can input target sample frequency domain audio data into the frequency domain separation model to be trained, the frequency domain separation model to be trained is used for predicting based on the target sample frequency domain audio data and outputting predicted frequency domain separation information, and the computer device can obtain predicted frequency domain voice data based on the predicted frequency domain separation information and the target sample frequency domain audio data. The computer device may determine difference information between the predicted frequency-domain speech data and target sample frequency-domain speech data, and adjust model parameters of the frequency-domain separation model to be trained based on the difference information, wherein the target sample frequency-domain speech data is sample frequency-domain speech data corresponding to the target sample frequency-domain audio data. If the model parameters of the frequency domain separation model to be trained meet the target conditions, the computer equipment can stop the model training and take the model at the moment as the frequency domain separation model.

In particular, the computer device may represent the sample frequency domain speech data and the sample frequency domain audio data in the form of vectors, denoted the sample frequency domain speech vector and the sample frequency domain audio vector respectively. The computer device may input the sample frequency domain audio vector into the frequency domain separation model to be trained, and the model performs an operation based on its initialized weights to obtain a predicted frequency domain separation vector. The computer device may multiply the predicted frequency domain separation vector by the sample frequency domain audio vector to obtain a predicted frequency domain speech vector. The computer device may determine similarity information between the predicted frequency domain speech vector and the sample frequency domain speech vector, and adjust the model parameters of the frequency domain separation model to be trained based on the similarity information. When the loss function of the frequency domain separation model to be trained reaches the target threshold, the computer device may stop the model training and use the model at this time as the frequency domain separation model. In addition, the frequency domain separation model may also use an existing open-source method, such as Independent Component Analysis (ICA), delay-and-sum beamforming (DSB), or a Linearly Constrained Minimum Variance (LCMV) filter, which is not limited in the embodiment of the present application.
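The training procedure described above can be sketched with a deliberately small model. Everything model-specific here is an assumption made for illustration: a single linear layer with a sigmoid output produces the separation vector, the difference is measured by mean squared error, and plain gradient descent updates the parameters. The application does not prescribe any of these choices.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_mask_model(mixed, clean, lr=0.5, target_loss=1e-3,
                     max_iters=2000, seed=0):
    """Fit a one-layer mask predictor (hypothetical stand-in model).

    mixed: sample frequency domain audio vectors, shape (batch, bins)
    clean: sample frequency domain speech vectors, same shape
    """
    rng = np.random.default_rng(seed)
    n_bins = mixed.shape[1]
    W = rng.normal(scale=0.01, size=(n_bins, n_bins))  # initialized weights
    b = np.zeros(n_bins)
    losses = []
    for _ in range(max_iters):
        mask = sigmoid(mixed @ W + b)   # predicted separation vector
        predicted = mask * mixed        # predicted speech vector
        diff = predicted - clean        # difference information
        loss = float(np.mean(diff ** 2))
        losses.append(loss)
        if loss <= target_loss:         # stop once the target is reached
            break
        # Gradient of the MSE loss with respect to the pre-sigmoid values.
        grad_z = (2.0 * diff / diff.size) * mixed * mask * (1.0 - mask)
        W -= lr * (mixed.T @ grad_z)
        b -= lr * grad_z.sum(axis=0)
    return W, b, losses
```

In use, `mixed` would hold the mixed speech-plus-noise samples and `clean` the corresponding speech-only samples; the returned `losses` list shows whether the difference shrinks over the iterations.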

In a possible implementation manner, the computer device may sequentially input the frequency domain audio data into the frequency domain separation model according to time sequence, perform prediction by the frequency domain separation model based on the frequency domain audio data to obtain a plurality of pieces of second separation information corresponding to time, and combine the plurality of pieces of second separation information according to the time sequence information to obtain the frequency domain separation information.

403. The computer device performs speech enhancement on the frequency domain audio data based on the frequency domain separation information to obtain the frequency domain information of the speech data in the audio data.

In one possible implementation, the computer device may perform speech enhancement in the frequency domain based on the frequency domain audio data and the frequency domain separation information to obtain the frequency domain information of the speech data in the audio data, and use that frequency domain information as the frequency domain speech data. In particular, the computer device may represent the frequency domain audio data and the frequency domain separation information in the form of matrices. For example, if the frequency domain audio data is represented by the one-dimensional matrix {1,1,2,3,4,5} and the frequency domain separation information by the one-dimensional matrix {1,0,1,1,0,0}, multiplying the two element by element yields {1,0,2,3,0,0}, which may be used to represent the frequency domain speech data.
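The numeric example above corresponds to an element-wise product, which can be checked with a few lines of NumPy. The function name `apply_separation` is hypothetical.

```python
import numpy as np


def apply_separation(frequency_audio, separation_info):
    """Multiply frequency domain audio data by separation information, element by element."""
    frequency_audio = np.asarray(frequency_audio)
    separation_info = np.asarray(separation_info)
    if frequency_audio.shape != separation_info.shape:
        raise ValueError("audio data and separation information must match in shape")
    return frequency_audio * separation_info


speech = apply_separation([1, 1, 2, 3, 4, 5], [1, 0, 1, 1, 0, 0])
print(speech.tolist())  # [1, 0, 2, 3, 0, 0]
```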

404. The computer device converts the frequency domain speech data into a spectrogram and performs feature extraction on the spectrogram to obtain frequency domain speech feature information corresponding to the spectrogram.

In a possible implementation, the computer device may convert the frequency domain speech data into a spectrogram and input the spectrogram into the feature extraction model. The feature extraction model performs feature extraction on the spectrogram and outputs the feature information corresponding to the spectrogram, which is used as the frequency domain speech feature information corresponding to the frequency domain speech data. Specifically, if the computer device represents the frequency domain speech data in matrix form, it may convert the frequency domain speech data into a spectrogram based on that matrix, perform feature extraction on the spectrogram to obtain the feature information of the spectrogram, and use the feature information of the spectrogram to represent the feature information of the frequency domain speech data. Because the frequency domain speech data does not need to be converted back into the time domain before feature extraction, the speed of speech recognition is improved.
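One common way to turn a matrix of frequency domain data into a spectrogram is to take the magnitude of each bin on a decibel scale. The sketch below assumes that convention, assumes the input has one frame per row, and uses a hypothetical function name and floor value; the application does not specify how the spectrogram is computed.

```python
import numpy as np


def to_spectrogram(frequency_frames, floor=1e-10):
    """Convert complex frequency domain frames into a log-magnitude spectrogram.

    Input shape: (frames, bins). Output shape: (bins, frames), so that
    time runs along the horizontal axis as in a typical spectrogram image.
    """
    magnitude = np.abs(np.asarray(frequency_frames))
    # The floor avoids taking the logarithm of zero for silent bins.
    level_db = 20.0 * np.log10(np.maximum(magnitude, floor))
    return level_db.T
```

The resulting two-dimensional array can then be passed to a feature extraction model in the same way as an image.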

In a possible implementation manner, the computer device may also directly perform feature extraction on the frequency domain voice data without converting the frequency domain voice data into a spectrogram, so as to obtain frequency domain voice feature information corresponding to the frequency domain voice data. Specifically, if the computer device represents the frequency domain speech data in a matrix form, the computer device may directly perform feature extraction on the matrix of the frequency domain speech data to obtain frequency domain speech feature information. Under the implementation mode, the computer equipment does not need to convert the frequency domain voice data into the time domain and then perform feature extraction, and the speed of voice recognition is improved.

405. The computer device inputs the frequency domain speech feature information into the frequency domain speech recognition model, and the frequency domain speech recognition model performs prediction based on the frequency domain speech feature information to obtain the phonemes corresponding to the frequency domain speech feature information.

The frequency domain speech recognition model is trained on sample frequency domain speech feature information and the sample phonemes corresponding to that information. The frequency domain speech feature information may include information such as the speech intensity and speech intonation of the speech data. Through training, the frequency domain speech recognition model acquires the ability to predict, from frequency domain speech feature information, the probability of each corresponding phoneme. The training goal is that, after sample frequency domain speech feature information is input into the frequency domain speech recognition model, the probability output for the corresponding sample phoneme is as high as possible.

Of course, not all frequency domain speech feature information has a corresponding phoneme, for example during a pause or gap in a person's speech. Because speech feature information during a pause or gap differs greatly from speech feature information while the person is speaking, the frequency domain speech recognition model in the present application may also set a target condition for screening the frequency domain speech feature information, that is, for determining whether it is a speech feature produced while a person is speaking. If the frequency domain speech feature information meets the target condition, subsequent speech recognition can be performed to obtain the corresponding phoneme. If it does not meet the target condition, the frequency domain speech recognition model can directly output a blank as the corresponding phoneme, which ensures that every piece of input frequency domain speech feature information has a corresponding output. For example, the frequency domain speech recognition model may compare the speech intensity in the frequency domain speech feature information with a speech intensity threshold. If the speech intensity is greater than the threshold, the frequency domain speech feature information is determined to be speech feature information of a person speaking, and the subsequent speech recognition operation is performed on it. If the speech intensity is smaller than the threshold, the frequency domain speech feature information is determined to be blank information, and a blank phoneme is output directly.

It should be noted that the frequency domain speech recognition model in the embodiment of the present application may be a shallow acoustic model, such as a Gaussian Mixture Model (GMM) or a Hidden Markov Model (HMM). It may also be a deep learning model, such as a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), or a Recurrent Neural Network (RNN), or a model obtained by improving the above models, such as a Deep Fully Convolutional Neural Network (DFCNN) and a high-performance Deep Neural Network computing Library (cld for Deep Neural Networks, nn). The type of the frequency domain speech recognition model is not limited in the embodiment of the present application.

In one possible implementation, the computer device may input the frequency domain speech feature information into the frequency domain speech recognition model. The model performs prediction based on that information to obtain the probabilities that it corresponds to each of a plurality of phonemes, and the phoneme with the highest probability is determined as the phoneme corresponding to the frequency domain speech feature information. For example, the computer device may represent the frequency domain speech feature information as a vector and input this first vector into the frequency domain speech recognition model. The model operates on the first vector through a plurality of hidden layers, pooling layers, and fully connected layers to obtain a second vector representing the probabilities that the first vector corresponds to each of the plurality of phonemes, for example (0.1, 0.11, 0.2, 0.6, 0.7, …), where 0.1 may represent a probability of 0.1 that the first vector corresponds to the phoneme "a", and 0.7 may represent a probability of 0.7 that it corresponds to the phoneme "e". If 0.7 is the largest value in the second vector, the model may determine that the phoneme corresponding to the first vector, and therefore to the frequency domain speech feature information, is "e". Of course, the structure of the frequency domain speech recognition model may be set according to actual needs, which is not limited in the embodiment of the present application.

In one possible implementation, the computer device may input a plurality of temporally adjacent pieces of frequency domain speech feature information into the frequency domain speech recognition model. The model performs prediction based on them to obtain the probabilities that they correspond to each of a plurality of phoneme groups, and the phoneme group with the highest probability is determined as the phoneme group corresponding to the plurality of pieces of frequency domain speech feature information. For example, the computer device may represent each piece of frequency domain speech feature information as a vector and input the resulting plurality of first vectors into the frequency domain speech recognition model. The model operates on the first vectors through a plurality of hidden layers, pooling layers, and fully connected layers to obtain a second vector representing the probabilities that the plurality of first vectors correspond to each of the plurality of phoneme groups, for example (0.1, 0.11, 0.2, 0.6, 0.7, …), where 0.1 may represent a probability of 0.1 that the first vectors correspond to the phoneme group "ca", and 0.7 may represent a probability of 0.7 that they correspond to the phoneme group "bo". If 0.7 is the largest value in the second vector, the model may determine that the phoneme group corresponding to the plurality of first vectors, and therefore to the plurality of pieces of frequency domain speech feature information, is "bo". Of course, the structure of the frequency domain speech recognition model may be set according to actual needs, which is not limited in the embodiment of the present application.

406. The computer device inserts separators between the plurality of phonemes, where the phonemes between any two adjacent separators correspond to the same target phoneme, and determines the text information corresponding to the speech data based on the plurality of target phonemes.

In one possible embodiment, the computer device may combine the phonemes predicted by the speech recognition model in chronological order and insert separators between the plurality of phonemes based on a separator insertion model, which may predict the insertion positions of the separators based on temporally adjacent frequency domain speech feature information. The computer device may determine the phonemes between any two adjacent separators as the same target phoneme, determine a plurality of target syllables based on the plurality of target phonemes, and determine the text information corresponding to the speech data based on the plurality of target syllables. Specifically, the main function of the separator insertion model is to distinguish two consecutive identical target phonemes. For example, suppose the computer device obtains the phonemes "hhheeellloo" in chronological order. If it directly merged each run of identical phonemes into one target phoneme, that is, "hhh" into "h", "eee" into "e", "lll" into "l", and "oo" into "o", the resulting text information would be "helo", which differs from the actual result "hello". Based on the separator insertion model, the computer device may instead insert separators into the phonemes "hhheeellloo", for example "/hhh/eee/ll/l/oo/", where "/" represents a separator, and thereby obtain the correct text information "hello".

According to the voice recognition method provided by the embodiment of the application, the computer equipment carries out voice enhancement on the audio data in the frequency domain after converting the audio data into the frequency domain, separates the noise data from the voice data, directly carries out feature extraction and voice recognition on the voice data in the frequency domain, does not need to convert the frequency domain voice information into the time domain and then carries out feature extraction and voice recognition, and improves the speed of voice recognition.

Fig. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application, and referring to fig. 5, the apparatus includes: a prediction module 501, a speech separation module 502, a feature extraction module 503, and a speech recognition module 504.

The prediction module 501 is configured to input the acquired audio data into the time-domain separation model, and perform prediction based on the audio data by the time-domain separation model to obtain time-domain separation information, where the time-domain separation information is used to separate noise data and voice data in the audio data.

The voice separation module 502 is configured to perform voice separation on the audio data based on the time domain separation information to obtain time domain voice data.

The feature extraction module 503 is configured to perform feature extraction on the time-domain voice data to obtain time-domain voice feature information corresponding to the time-domain voice data.

The speech recognition module 504 is configured to perform speech recognition on time-domain speech feature information corresponding to the time-domain speech data, and determine speech content corresponding to the time-domain speech data.
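The cooperation of the four modules can be sketched as a simple pipeline object. This is only a structural illustration: the class name, field names, and the use of plain callables in place of trained models are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SpeechRecognitionApparatus:
    """Hypothetical wiring of modules 501 to 504 as interchangeable callables."""

    predict: Callable[[Any], Any]       # prediction module 501
    separate: Callable[[Any, Any], Any] # speech separation module 502
    extract: Callable[[Any], Any]       # feature extraction module 503
    recognize: Callable[[Any], Any]     # speech recognition module 504

    def run(self, audio):
        separation_info = self.predict(audio)           # time domain separation information
        speech = self.separate(audio, separation_info)  # time domain speech data
        features = self.extract(speech)                 # time domain speech feature information
        return self.recognize(features)                 # speech content
```

Each field would be bound to the corresponding trained model or processing routine; keeping them as separate callables mirrors the note that the functions may be allocated to different functional modules as needed.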

In one possible implementation, the prediction module includes:

The segmentation unit is configured to segment the audio data into a plurality of audio frames of the same length and to input the plurality of audio frames into the time domain separation model in chronological order.

The first prediction unit is used for predicting based on a plurality of audio frames with the same length by the time domain separation model to obtain a plurality of pieces of first separation information, and combining the plurality of pieces of first separation information according to a time sequence to obtain the time domain separation information.

In one possible implementation, the feature extraction module is configured to:

inputting the time domain voice data into a feature extraction model, performing feature extraction on the voice frame by the feature extraction model based on the relation between any voice frame and the associated voice frame adjacent to the voice frame time sequence to obtain the feature information of the voice frame, combining the feature information of each voice frame, and outputting the time domain voice feature information corresponding to the time domain voice data.

In one possible embodiment, the speech recognition module comprises:

The second prediction unit is configured to input the time domain speech feature information into the time domain speech recognition model, and the time domain speech recognition model performs prediction based on the time domain speech feature information to obtain the probabilities that the time domain speech feature information corresponds to each of a plurality of phonemes.

The phoneme determining unit is configured to determine the phoneme with the highest probability as the phoneme corresponding to the time domain speech feature information.

Through the speech recognition device provided by the embodiment of the application, the computer equipment can directly carry out speech separation on the audio information in the time domain, separate the noise information from the speech information, does not need to convert the audio information into the frequency domain and then carry out speech enhancement, and can also directly carry out feature extraction and speech recognition on the speech information in the time domain, thereby improving the speed of speech recognition on the whole.

It should be noted that the speech recognition apparatus provided in the above embodiment is illustrated only by the above division of functional modules when performing speech recognition. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the computer device may be divided into different functional modules to complete all or part of the functions described above. In addition, the speech recognition apparatus and the speech recognition method provided by the above embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.

Fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application, and referring to fig. 6, the apparatus includes: a conversion module 601, a feature extraction module 602, a phoneme prediction module 603, and a speech content determination module 604.

The conversion module 601 is configured to convert frequency domain information of the voice data in the audio data into a spectrogram.

The feature extraction module 602 is configured to perform feature extraction on the spectrogram to obtain frequency domain speech feature information corresponding to the spectrogram.

The phoneme prediction module 603 is configured to input the frequency domain speech feature information into the frequency domain speech recognition model, which performs prediction based on the frequency domain speech feature information to obtain a phoneme corresponding to the frequency domain speech feature information.

The speech content determination module 604 is configured to determine the speech content corresponding to the voice data based on a plurality of phonemes.
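
As a hedged illustration of the conversion step, the sketch below computes a magnitude spectrogram from audio samples with a windowed discrete Fourier transform. The frame length, hop size, Hann window, and the function name `magnitude_spectrogram` are assumptions made for this example; the embodiments do not specify how the spectrogram is produced.

```python
import cmath
import math
from typing import List


def magnitude_spectrogram(
    samples: List[float], frame_len: int, hop: int
) -> List[List[float]]:
    """Return one row of DFT magnitudes (bins 0..frame_len//2) per frame."""
    # Hann window reduces spectral leakage at the frame boundaries.
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
        for n in range(frame_len)
    ]
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):
            # Naive O(N^2) DFT kept for clarity; an FFT would be used in practice.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            row.append(abs(acc))
        rows.append(row)
    return rows
```

Each row of the result corresponds to one time step and each column to one frequency bin, which is the two-dimensional layout a feature extraction model would consume.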

In one possible implementation, the feature extraction module is configured to:

inputting the spectrogram into a feature extraction model, performing feature extraction on the spectrogram by using the feature extraction model, outputting spectrogram feature information corresponding to the spectrogram, and taking the spectrogram feature information as frequency domain voice feature information corresponding to voice data.

In one possible implementation, the phoneme prediction module includes:

a probability prediction unit, configured to input the frequency domain voice feature information into the frequency domain voice recognition model, which performs prediction based on the frequency domain voice feature information to obtain the probabilities that the frequency domain voice feature information corresponds to each of a plurality of phonemes.
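
A minimal sketch of how such probabilities might be turned into phonemes, assuming (as the method claims describe for the time domain case) that the phoneme with the maximum probability is selected for each piece of feature information. The phoneme labels and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def most_probable_phonemes(frame_probs: List[Dict[str, float]]) -> List[str]:
    """For each piece of feature information, pick the likeliest phoneme.

    frame_probs holds one mapping of phoneme label to probability per
    piece of feature information, in time order.
    """
    return [max(probs, key=probs.get) for probs in frame_probs]
```

For example, `most_probable_phonemes([{"n": 0.7, "i": 0.2}, {"n": 0.1, "i": 0.8}])` selects `"n"` for the first entry and `"i"` for the second.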

In one possible embodiment, the apparatus further comprises:

a frequency domain separation information prediction module, configured to input the frequency domain audio data into the frequency domain separation model, which performs prediction based on the frequency domain audio data to obtain frequency domain separation information, wherein the frequency domain separation information is used for separating noise data and voice data in the frequency domain audio data.
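
The sketch below shows one way frequency domain separation information could be applied, under the assumption that it takes the form of a mask with the same shape as the spectrogram and is applied by element-wise multiplication, by analogy with the time domain case. The noise-floor "model" and both function names are hypothetical placeholders for the trained frequency domain separation model.

```python
from typing import List

Spectrogram = List[List[float]]


def toy_frequency_separation_model(
    spectrogram: Spectrogram, noise_floor: float
) -> Spectrogram:
    """Hypothetical stand-in model: keep bins above an assumed noise floor."""
    return [[1.0 if v > noise_floor else 0.0 for v in row] for row in spectrogram]


def apply_frequency_mask(spectrogram: Spectrogram, mask: Spectrogram) -> Spectrogram:
    """Multiply each time-frequency bin by its mask value."""
    same_shape = len(spectrogram) == len(mask) and all(
        len(s_row) == len(m_row) for s_row, m_row in zip(spectrogram, mask)
    )
    if not same_shape:
        raise ValueError("mask and spectrogram must have the same shape")
    return [
        [m * v for m, v in zip(m_row, s_row)]
        for s_row, m_row in zip(spectrogram, mask)
    ]
```

Bins judged to be noise are set to zero while the remaining bins keep their magnitude, so the masked spectrogram can be passed on to feature extraction without returning to the time domain.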

With the speech recognition apparatus provided by the embodiments of the present application, after converting the audio data into the frequency domain, the computer device performs speech enhancement on the audio data in the frequency domain to separate the noise data from the voice data, and then performs feature extraction and speech recognition on the voice data directly in the frequency domain. The frequency domain voice information does not need to be converted back into the time domain before feature extraction and speech recognition, which improves the speed of speech recognition.

Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 700 may be: smart phones, tablet computers, smart home devices, smart bracelets, notebook computers, or desktop computers. Computer device 700 may also be referred to by other names such as user device, portable computer device, laptop computer device, desktop computer device, and so forth.

Generally, the computer device 700 includes: one or more processors 701 and one or more memories 702.

Processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor, where the main processor, also called a Central Processing Unit (CPU), is a processor for processing data in an awake state, and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and drawing content that needs to be displayed on the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 702 may include one or more storage media, which may be non-transitory. Memory 702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory storage medium in the memory 702 is used to store at least one program code for execution by the processor 701 to implement the speech recognition methods provided by the method embodiments herein.

In some embodiments, the computer device 700 may also optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 703 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 704, display 705, camera 706, audio circuitry 707, positioning components 708, and power source 709.

The peripheral interface 703 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 701 and the memory 702. In some embodiments, processor 701, memory 702, and peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 704 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 704 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 704 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 704 may communicate with other computer devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.

The display screen 705 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, the display screen 705 also has the ability to capture touch signals on or over the surface of the display screen 705. The touch signal may be input to the processor 701 as a control signal for processing. In this case, the display screen 705 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 705, disposed on the front panel of the computer device 700; in other embodiments, there may be at least two display screens 705, respectively disposed on different surfaces of the computer device 700 or in a folded design; in still other embodiments, the display screen 705 may be a flexible display screen disposed on a curved surface or a folded surface of the computer device 700. The display screen 705 may even be arranged in a non-rectangular irregular shape, that is, a specially shaped screen. The display screen 705 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.

The camera assembly 706 is used to capture images or video. Optionally, the camera assembly 706 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the computer device, and the rear camera is disposed on the rear surface of the computer device. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, the camera assembly 706 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.

The audio circuitry 707 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 701 for processing or to the radio frequency circuit 704 to realize voice communication. For stereo sound acquisition or noise reduction purposes, there may be multiple microphones located at different positions on the computer device 700. The microphone may also be an array microphone or an omni-directional acquisition microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electric signal not only into a sound wave audible to humans, but also into a sound wave inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuitry 707 may also include a headphone jack.

The positioning component 708 is used to locate the current geographic location of the computer device 700 for navigation or LBS (Location Based Service). The positioning component 708 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

The power supply 709 is used to supply power to the various components of the computer device 700. The power source 709 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When power source 709 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, the computer device 700 also includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyro sensor 712, pressure sensor 713, fingerprint sensor 714, optical sensor 715, and proximity sensor 716.

The acceleration sensor 711 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the computer apparatus 700. For example, the acceleration sensor 711 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 701 may control the display screen 705 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 711. The acceleration sensor 711 may also be used for acquisition of motion data of a game or a user.

The gyro sensor 712 may detect a body direction and a rotation angle of the computer device 700, and the gyro sensor 712 may collect a 3D motion of the user on the computer device 700 in cooperation with the acceleration sensor 711. From the data collected by the gyro sensor 712, the processor 701 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.

The pressure sensor 713 may be disposed on a side bezel of the computer device 700 and/or underneath the display screen 705. When the pressure sensor 713 is disposed on a side bezel of the computer device 700, it can detect the user's grip signal on the computer device 700, and the processor 701 performs left-hand or right-hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed underneath the display screen 705, the processor 701 controls the operable controls on the UI according to the user's pressure operation on the display screen 705. The operable controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.

The fingerprint sensor 714 is used for collecting a fingerprint of a user, and the processor 701 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 701 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. Fingerprint sensor 714 may be provided on the front, back, or side of computer device 700. When a physical key or vendor Logo is provided on the computer device 700, the fingerprint sensor 714 may be integrated with the physical key or vendor Logo.

The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the display screen 705 based on the ambient light intensity collected by the optical sensor 715. Specifically, when the ambient light intensity is high, the display brightness of the display screen 705 is increased; when the ambient light intensity is low, the display brightness of the display screen 705 is adjusted down. In another embodiment, processor 701 may also dynamically adjust the shooting parameters of camera assembly 706 based on the ambient light intensity collected by optical sensor 715.

A proximity sensor 716, also known as a distance sensor, is typically disposed on a front panel of the computer device 700. The proximity sensor 716 is used to capture the distance between the user and the front of the computer device 700. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front of the computer device 700 gradually decreases, the processor 701 controls the display screen 705 to switch from the bright screen state to the dark screen state; when the proximity sensor 716 detects that the distance between the user and the front of the computer device 700 gradually increases, the processor 701 controls the display screen 705 to switch from the dark screen state to the bright screen state.

Those skilled in the art will appreciate that the configuration illustrated in Fig. 7 does not limit the computer device 700, which may include more or fewer components than those illustrated, combine some components, or employ a different arrangement of components.

In an exemplary embodiment, there is also provided a storage medium, such as a memory, including program code executable by a processor to perform the speech recognition method in the above-described embodiments. For example, the storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.

Those skilled in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by program code instructing the relevant hardware, and the program code may be stored in a storage medium such as a read-only memory, a magnetic disk, or an optical disk.

The above description covers only exemplary embodiments of the present application and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall fall within the protection scope of the present application.

Claims

1. A method of speech recognition, the method comprising:

dividing the collected audio data into a plurality of audio frames with the same length, and inputting the audio frames with the same length into a time domain separation model according to the time sequence;

predicting by the time domain separation model based on the plurality of audio frames with the same length to obtain a plurality of pieces of first separation information, and combining the plurality of pieces of first separation information according to a time sequence to obtain time domain separation information, wherein the time domain separation information is used for separating noise data and voice data in the audio data;

multiplying the time domain separation information by the audio data to obtain time domain voice data;

dividing the time domain voice data into a plurality of voice frames with target length;

inputting any one of the voice frames and the associated voice frames with the target number into a feature extraction model, respectively performing feature extraction on the voice frames and the associated voice frames with the target number through the feature extraction model, and giving different weights to the associated voice frames with the target number and the voice frames, wherein the associated voice frames with the target number are voice frames adjacent to the voice frame in time sequence;

carrying out weighted summation on the feature information of the plurality of voice frames after feature extraction to obtain the target voice frame feature information of the voice frames;

combining the target voice characteristic information of the voice frames with the target lengths according to the time sequence to obtain time domain voice characteristic information corresponding to the time domain voice data;

and performing speech recognition on the time domain voice feature information corresponding to the time domain voice data to obtain the voice content corresponding to the time domain voice data.

2. The method according to claim 1, wherein the performing speech recognition on the time-domain speech feature information corresponding to the time-domain speech data to obtain the speech content corresponding to the time-domain speech data comprises:

inputting the time domain voice feature information into a time domain voice recognition model, and predicting by the time domain voice recognition model based on the time domain voice feature information to obtain corresponding probabilities between the time domain voice feature information and a plurality of phonemes;

determining the phoneme with the maximum probability as the phoneme corresponding to the time domain voice feature information;

and determining the voice content corresponding to the time-domain voice data based on a plurality of phonemes.

3. The method of claim 2, wherein the determining the speech content corresponding to the time-domain speech data based on the plurality of phonemes comprises:

inserting separators between the phonemes to obtain a plurality of target phonemes, wherein the phonemes between any two separators correspond to the same target phoneme;

and determining the voice content corresponding to the time domain voice data based on the plurality of target phonemes.

4. The method of claim 3, wherein inserting separators between the plurality of phonemes to obtain a plurality of target phonemes comprises:

combining the plurality of phonemes in a temporal order;

inserting a delimiter between the combined plurality of phonemes based on a delimiter insertion model for predicting insertion positions of the delimiter based on a plurality of temporally adjacent time-domain speech feature information;

and determining the phoneme between any two separators as the same target phoneme.

5. The method of claim 3, wherein determining the speech content corresponding to the time-domain speech data based on the plurality of target phonemes comprises:

determining a plurality of target syllables based on the plurality of target phonemes;

and determining text information corresponding to the time domain voice data based on the plurality of target syllables.

6. The method according to claim 1, wherein the performing speech recognition on the time-domain speech feature information corresponding to the time-domain speech data to obtain the speech content corresponding to the time-domain speech data comprises:

inputting a plurality of time domain voice feature information adjacent in time into a time domain voice recognition model, and predicting by the time domain voice recognition model based on the time domain voice feature information to obtain corresponding probabilities between the time domain voice feature information and a plurality of phoneme combinations;

determining the phoneme combination with the maximum probability as the phoneme combination corresponding to the time domain speech feature information;

and determining the voice content corresponding to the time-domain voice data based on the phoneme combination.

7. A speech recognition apparatus, characterized in that the apparatus comprises:

the prediction module is used for dividing the collected audio data into a plurality of audio frames with the same length, and inputting the audio frames with the same length into the time domain separation model according to the time sequence; predicting by the time domain separation model based on the plurality of audio frames with the same length to obtain a plurality of pieces of first separation information, and combining the plurality of pieces of first separation information according to a time sequence to obtain time domain separation information, wherein the time domain separation information is used for separating noise data and voice data in the audio data;

the voice separation module is used for multiplying the time domain separation information and the audio data to obtain time domain voice data;

the feature extraction module is used for dividing the time domain voice data into a plurality of voice frames with target lengths; inputting any one of the voice frames and a target number of associated voice frames into a feature extraction model, respectively performing feature extraction on the voice frames and the target number of associated voice frames through the feature extraction model, and giving different weights to the target number of associated voice frames and the voice frames, wherein the target number of associated voice frames are voice frames adjacent to the voice frame in time sequence; carrying out weighted summation on the feature information of the plurality of voice frames after feature extraction to obtain the target voice frame feature information of the voice frames; combining a plurality of target voice characteristic information of the voice frames with the target lengths according to the time sequence to obtain time domain voice characteristic information corresponding to the time domain voice data;

and the voice recognition module is used for carrying out voice recognition on the time domain voice characteristic information corresponding to the time domain voice data and determining the voice content corresponding to the time domain voice data.

8. The apparatus of claim 7, wherein the speech recognition module comprises:

9. The apparatus of claim 8, wherein the speech content determining unit is configured to insert separators between the multiple phonemes to obtain multiple target phonemes, and phonemes between any two separators correspond to a same target phoneme; and determining the voice content corresponding to the time domain voice data based on the plurality of target phonemes.

10. The apparatus according to claim 9, wherein said speech content determining unit is configured to combine said plurality of phonemes in chronological order; inserting separators between the combined plurality of phonemes based on a separator insertion model for predicting insertion positions of the separators based on temporally adjacent plurality of time-domain speech feature information; and determining the phoneme between any two separators as the same target phoneme.

11. The apparatus of claim 9, wherein the speech content determining unit is configured to determine a plurality of target syllables based on the plurality of target phonemes; and determining text information corresponding to the time-domain voice data based on the plurality of target syllables.

12. The apparatus of claim 8, wherein the speech content determining unit is configured to input a plurality of time-domain speech feature information that are adjacent in time into a time-domain speech recognition model, and perform prediction by the time-domain speech recognition model based on the plurality of time-domain speech feature information to obtain probabilities of correspondence between the plurality of time-domain speech feature information and a plurality of phoneme combinations; determining the phoneme combination with the maximum probability as the phoneme combination corresponding to the time domain speech feature information; and determining the voice content corresponding to the time domain voice data based on the phoneme combination.

13. A computer device, characterized in that the computer device comprises one or more processors and one or more memories having at least one program code stored therein, which is loaded and executed by the one or more processors to implement the speech recognition method according to any one of claims 1 to 6.

14. A storage medium having stored therein at least one program code, which is loaded and executed by a processor to implement the speech recognition method according to any one of claims 1 to 6.
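
For illustration only, and without limiting the claims, the feature extraction recited in claim 1 (extracting features from a voice frame and its temporally adjacent frames, weighting them differently, and summing) might be sketched as follows. The two-value toy feature extractor, the offset-to-weight mapping, and the choice to repeat the first or last frame at the edges are all assumptions introduced for this example.

```python
from typing import Callable, Dict, List

Frame = List[float]
Features = List[float]


def toy_feature_extractor(frame: Frame) -> Features:
    """Hypothetical stand-in for the feature extraction model.

    Returns [mean absolute amplitude, peak absolute amplitude].
    """
    mags = [abs(x) for x in frame]
    return [sum(mags) / len(mags), max(mags)]


def target_frame_features(
    frames: List[Frame],
    index: int,
    weights: Dict[int, float],
    extract: Callable[[Frame], Features] = toy_feature_extractor,
) -> Features:
    """Weighted sum of the features of frame `index` and its neighbours.

    `weights` maps a time offset (0 is the frame itself) to its weight.
    Offsets that fall outside the signal reuse the nearest edge frame,
    which is an assumption of this sketch.
    """
    total: Features = []
    for offset, weight in weights.items():
        neighbour = min(max(index + offset, 0), len(frames) - 1)
        feats = extract(frames[neighbour])
        if not total:
            total = [0.0] * len(feats)
        total = [t + weight * f for t, f in zip(total, feats)]
    return total


def time_domain_features(
    frames: List[Frame],
    weights: Dict[int, float],
    extract: Callable[[Frame], Features] = toy_feature_extractor,
) -> List[Features]:
    """Combine the per-frame target features in time order."""
    return [
        target_frame_features(frames, i, weights, extract)
        for i in range(len(frames))
    ]
```

With weights such as `{-1: 0.25, 0: 0.5, 1: 0.25}`, each frame contributes most to its own target features while its immediate neighbours supply context.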