JP2001282279A

JP2001282279A - Voice information processing method and apparatus, and storage medium

Info

Publication number: JP2001282279A
Application number: JP2000099535A
Authority: JP
Inventors: Toshiaki Fukada; 俊明深田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2000-03-31
Filing date: 2000-03-31
Publication date: 2001-10-12
Also published as: US6778960B2; US7089186B2; US20040215459A1; US20010032080A1

Abstract

(57) [Abstract] [Problem] To set the durations of a phoneme sequence accurately and to give natural phoneme durations suited to the phonemic and linguistic environment. [Solution] The duration of a phoneme sequence of a predetermined unit is obtained based on a duration model for a global segment (S302). The duration of each phoneme constituting the phoneme sequence is obtained based on a duration model for a local segment (S303). The duration of each phoneme is then set based on the duration of the phoneme sequence and the durations of the individual phonemes (S304).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field] The present invention relates to a speech information processing method and apparatus for setting the durations of phonemes in speech synthesis, and to a computer-readable storage medium storing a program that carries out the speech synthesis method.

[0002]

[Description of the Related Art] In recent years, speech synthesizers have been developed that convert an arbitrary character sequence into a phoneme sequence and then convert that phoneme sequence into synthesized speech according to a predetermined synthesis-by-rule method.

[0003]

[Problems to Be Solved by the Invention] Synthesized speech output from conventional speech synthesizers sounds unnatural and mechanical compared with natural human speech.

[0004] One cause is the accuracy of the control rules that generate the duration of each phoneme, for example in the phoneme sequence "o, X, s, e, i" that makes up the character sequence "おんせい" (onsei, "speech"). When the accuracy is poor, appropriate durations are not assigned to the phonemes, so the synthesized speech sounds unnatural and mechanical.

[0005] The present invention has been made in view of the above conventional example. Its object is to provide a speech information processing method and apparatus that can set the durations of a phoneme sequence accurately and give natural phoneme durations suited to the phonemic and linguistic environment.

[0006]

[Means for Solving the Problems] To achieve the above object, a speech information processing apparatus of the present invention has the following configuration. That is, it comprises: means for obtaining the duration of a phoneme sequence of a predetermined unit based on a duration model for a global segment; means for obtaining the duration of each phoneme constituting the phoneme sequence based on a duration model for a local segment; setting means for setting the duration of each phoneme based on the duration of the phoneme sequence and the durations of the individual phonemes; and speech synthesis means for synthesizing speech based on the phoneme durations set by the setting means.

[0007] To achieve the above object, a speech information processing method of the present invention comprises the following steps. That is, it comprises: a step of obtaining the duration of a phoneme sequence of a predetermined unit based on a duration model for a global segment; a step of obtaining the duration of each phoneme constituting the phoneme sequence based on a duration model for a local segment; a setting step of setting the duration of each phoneme based on the duration of the phoneme sequence and the durations of the individual phonemes; and a speech synthesis step of synthesizing speech based on the phoneme durations set in the setting step.

[0008]

[Description of the Preferred Embodiments] Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

[0009] [Embodiment 1] FIG. 1 is a block diagram showing the configuration of a speech synthesizer according to Embodiment 1 of the present invention.

[0010] In FIG. 1, reference numeral 101 denotes a CPU, which performs the various controls of the speech synthesizer of this embodiment according to a control program stored in a ROM 102 or a control program loaded from an external storage device 104 into a RAM 103. The ROM 102 stores various parameters, the control program executed by the CPU 101, and the like. The RAM 103 provides a work area when the CPU 101 executes the various controls, and stores the control program executed by the CPU 101. Reference numeral 104 denotes an external storage device such as a hard disk, a floppy (registered trademark) disk, or a CD-ROM; when the external storage device is a hard disk, it stores various programs installed from a CD-ROM, a floppy disk, or the like. Reference numeral 105 denotes an input unit having a keyboard and a pointing device such as a mouse. The input unit 105 may also receive data from the Internet or the like, for example over a communication line. Reference numeral 106 denotes a display unit such as a liquid crystal display or a CRT, which displays various data under the control of the CPU 101. Reference numeral 107 denotes a speaker, which converts a speech signal (an electrical signal) into audible sound and outputs it. Reference numeral 108 denotes a bus connecting the above units. Reference numeral 109 denotes a speech synthesis unit.

[0011] FIG. 2 is a flowchart showing the operation of the speech synthesis unit 109 according to Embodiment 1. Each step described below is realized by the CPU 101 executing a control program stored in the ROM 102 or a control program loaded from the external storage device 104 into the RAM 103.

[0012] First, in step S201, Japanese text data containing a mixture of kanji and kana is input from the input unit 105. The process then proceeds to step S202, where the input text data is analyzed using a language analysis dictionary 201 to extract information such as the phoneme sequence (reading) and accents for the input text data. Next, in step S203, this information is used to generate the prosody (prosodic information), such as the duration, fundamental frequency (pitch pattern), and power, of each phoneme constituting the phoneme sequence obtained in step S202. Here, phoneme durations are determined using a duration model 202, and the fundamental frequency, power, and the like are determined using a prosody control model 203.

[0013] Next, in step S204, based on the phoneme sequence extracted by the analysis in step S202 and the prosody generated in step S203, a plurality of speech units (waveforms or feature parameters) for generating synthesized speech corresponding to the phoneme sequence are selected from a speech unit dictionary 204. In step S205, a synthesized speech signal is generated from the selected speech units, and in step S206, speech is output from the speaker 107 based on the generated synthesized speech signal. Finally, in step S207, it is determined whether processing of all the input text data has finished; if not, the process returns to step S201 and the above processing continues.

[0014] FIG. 3 is a flowchart describing in detail part of the prosody generation processing in step S203 of FIG. 2. FIG. 3 shows the procedure that uses the duration model 202 to set the duration of a phoneme sequence of a predetermined unit (hereinafter, a global segment) and the duration of each phoneme constituting that phoneme sequence (hereinafter, a local segment). The duration model 202 includes a duration model 301 for global segments (also called the global duration model) and a duration model 302 for local segments (also called the local duration model).

[0015] First, in step S301, the analysis result for the input text data, obtained by the text processing in step S202 of FIG. 2, is input. This analysis result includes information on the phonemic environment, obtained from phonemic information such as phonemes, and information on the linguistic environment, obtained from linguistic information such as the number of moras, the number of accent phrases, and parts of speech. Next, in step S302, the duration of the global segment is set based on the global duration model 301. Here, a global segment consists of a unit that can be processed as a single group in an utterance (an utterance unit), such as an accent phrase, word, phrase, or sentence.

[0016] Next, in step S303, the duration of each local segment is set based on the local duration model 302. Here, a local segment consists of a phonemic unit that makes up an utterance unit, such as a phoneme, syllable, or mora.

[0017] Finally, in step S304, the difference between the global segment duration obtained as the sum of the local segment durations from step S303 and the global segment duration set in step S302 is eliminated. The local segment durations are expanded or contracted using a local duration expansion/contraction model 303 so that their sum equals the global duration set in step S302, which determines the local duration of each phoneme.

[0018] As a concrete example, suppose the text data "花が" (hana ga, "the flower") is input. Taking the phoneme sequence analyzed from this character string as the global segment and dividing it into local segments with the mora as the phonemic unit yields "ha", "na", and "ga". If the average duration of each mora is, say, 100 milliseconds and the actually measured duration of this global segment is 600 milliseconds, then the global duration obtained as the sum of the local segment durations is 300 milliseconds against a global segment duration of 600 milliseconds, leaving a difference of 300 milliseconds.
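The arithmetic of this example can be sketched in Python as follows. This is only an illustration of the numbers given in the text; the function and variable names are assumptions and do not appear in the patent.

```python
# Hypothetical helper illustrating the worked example in [0018].
MORA_AVG_MS = 100.0  # average mora duration assumed in the example


def duration_gap(moras, global_ms, mora_avg_ms=MORA_AVG_MS):
    """Return (sum of local durations, gap to the global duration)."""
    local_sum = mora_avg_ms * len(moras)
    return local_sum, global_ms - local_sum


local_sum, gap = duration_gap(["ha", "na", "ga"], 600.0)
print(local_sum, gap)  # 300.0 300.0
```

The 300 millisecond gap is the quantity that step S304 has to distribute over the three moras.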

[0019] Next, the method of creating the global duration model 301 for global segments and the processing in step S302 that sets the duration of a global segment are described with reference to the flowchart of FIG. 4.

[0020] FIG. 4 is a flowchart showing a method of creating the global duration model 301 for global segments.

[0021] First, in step S401, global durations are extracted using a speech file 401 containing a plurality of training samples for creating the global duration model, and a side information file 402 containing the information needed to extract durations, such as the start and end times of phonemes and syllables. Next, in step S402, the global duration model 301, which takes a predetermined linguistic environment into account, is created using the global duration information extracted in step S401 and a phonemic/linguistic environment file 403. This file contains information on the phonemic environment, obtained from phonemic information such as phonemes, and on the linguistic environment, obtained from linguistic information such as the number of moras, the number of accent phrases, and parts of speech.
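A minimal Python sketch of the extraction in step S401 is given below. The patent does not specify the layout of the side information file, so the record format used here (unit label, start time, end time, in milliseconds) and the function name are assumptions made purely for illustration.

```python
from typing import List, Tuple

# Assumed record layout: (unit label, start time in ms, end time in ms).
Unit = Tuple[str, float, float]


def global_duration_ms(units: List[Unit]) -> float:
    """Duration of a global segment that spans all of the given units."""
    if not units:
        raise ValueError("a global segment needs at least one unit")
    start = min(u[1] for u in units)  # earliest start time
    end = max(u[2] for u in units)    # latest end time
    return end - start


sample = [("ha", 0.0, 210.0), ("na", 210.0, 400.0), ("ga", 400.0, 600.0)]
print(global_duration_ms(sample))  # 600.0
```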

[0022] The specific procedure is as follows. Let K be the number of training samples in the speech file 401 used to create the global segment duration model 301, and let d_k be the duration of the global segment in the k-th training sample. This embodiment does not create a model that predicts the global duration d_k directly. Instead, using the average global segment duration \bar{d} obtained from the K training samples, it creates a model that predicts s_k, the global segment duration d_k normalized as

    s_k = d_k / \bar{d}    ... (1)

The average duration \bar{d} can be obtained in various ways. For example, when \bar{d} is taken to be the average mora duration (the average duration per mora), it can be obtained as

    \bar{d} = (1/K) \sum_{k=1}^{K} (d_k / N_k)    ... (2)

where N_k is the number of moras in the k-th training sample.
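Equations (1) and (2) can be sketched in Python as follows. The function names and the sample values are assumptions chosen for illustration; only the two formulas come from the text.

```python
def mean_mora_duration(durations_ms, mora_counts):
    """Equation (2): average over samples of the per-mora duration d_k / N_k."""
    if not durations_ms or len(durations_ms) != len(mora_counts):
        raise ValueError("need one mora count per training sample")
    per_mora = [d / n for d, n in zip(durations_ms, mora_counts)]
    return sum(per_mora) / len(per_mora)


def normalize_by_ratio(durations_ms, d_bar):
    """Equation (1): s_k = d_k / d_bar for every training sample."""
    return [d / d_bar for d in durations_ms]


# Two made-up training samples: 600 ms over 3 moras, 400 ms over 4 moras.
durations = [600.0, 400.0]
d_bar = mean_mora_duration(durations, [3, 4])
print(d_bar)  # 150.0
targets = normalize_by_ratio(durations, d_bar)  # regression targets s_k
```

The list `targets` is what the regression model described next would be trained to predict.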

[0023] The predicted value \hat{s}_k of s_k, the normalized global duration d_k, can then be obtained with linear multiple regression analysis as follows.

[0024]

    \hat{s}_k = a_0 + \sum_{i=1}^{I} \sum_{j=1}^{J_i} a_{i,j} x_{k,i,j}    ... (3)

Here, I is the number of phonemic/linguistic environment factors (items), and J_i is the number of categories for factor i (for example, phoneme type or number of accent phrases). x_{k,i,j} is the explanatory variable for category j (for example, a phoneme set or accent type) of factor i in sample k, a_{i,j} is the regression coefficient for category j of factor i, and a_0 is a constant term. Using this predicted value \hat{s}_k, the global duration \hat{d}_k of the global segment for the k-th sample is obtained from equation (1) as

    \hat{d}_k = \hat{s}_k \times \bar{d}    ... (4)

Equation (4) is the global duration model 301.
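The prediction side of equations (3) and (4) can be sketched in Python as below. The coefficient values are invented for illustration and are not taken from the patent; fitting the coefficients by least squares is not shown.

```python
def predict_normalized(a0, coeffs, x):
    """Equation (3): a0 + sum over factors i and categories j of a[i][j] * x[i][j]."""
    total = a0
    for a_i, x_i in zip(coeffs, x):
        if len(a_i) != len(x_i):
            raise ValueError("coefficient and variable lengths differ")
        total += sum(a * v for a, v in zip(a_i, x_i))
    return total


def predict_global_duration(a0, coeffs, x, d_bar):
    """Equation (4): undo the ratio normalization of equation (1)."""
    return predict_normalized(a0, coeffs, x) * d_bar


# Illustrative coefficients for two factors with 2 and 3 categories.
a0 = 1.0
coeffs = [[0.5, 0.25], [2.0, 1.0, 0.0]]
x = [[1, 0], [0, 1, 0]]  # explanatory variables of one sample
print(predict_normalized(a0, coeffs, x))              # 2.5
print(predict_global_duration(a0, coeffs, x, 150.0))  # 375.0
```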

[0025] The values of I and J_i can be chosen in a great many ways. For example, if the phoneme types in the global segment and the number of accent phrases are chosen as the factors i, with a set of 26 phoneme types and the number of accent phrases in the global segment (1, 2, 3, or 4 or more) as their respective categories j, then I = 2, J_1 = 26, and J_2 = 4.
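One way to build the explanatory variables for this choice of factors is sketched below. The patent does not say how the variables are valued, so the 0/1 indicator coding is an assumption, as are the function names. The short inventory in the usage lines is a placeholder and not the 26-phoneme set referred to in the text.

```python
def accent_phrase_category(count):
    """Map an accent phrase count to category index 0..3 (1, 2, 3, 4 or more)."""
    if count < 1:
        raise ValueError("a global segment has at least one accent phrase")
    return min(count, 4) - 1


def one_hot(index, size):
    """Indicator vector with a single 1 at the given index."""
    return [1 if j == index else 0 for j in range(size)]


def phoneme_indicators(phonemes, inventory):
    """Assumed coding: 1 if the phoneme type occurs in the segment, else 0."""
    present = set(phonemes)
    return [1 if p in present else 0 for p in inventory]


inventory = ["a", "g", "h", "n", "o"]  # placeholder inventory (J_1 = 5 here)
x = [
    phoneme_indicators(["h", "a", "n", "a", "g", "a"], inventory),
    one_hot(accent_phrase_category(1), 4),  # J_2 = 4
]
print(x)  # [[1, 1, 1, 1, 0], [1, 0, 0, 0]]
```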

[0026] Next, the method of creating the local duration model 302 for local segments and the processing in step S303 that sets the local durations of local segments are described with reference to the flowchart of FIG. 5. This processing is carried out as follows, in the same manner as for global segments.

[0027] FIG. 5 is a flowchart showing a method of creating the local duration model 302 for local segments.

[0028] First, in step S501, local durations are extracted using a speech file 501 containing a plurality of training samples for creating the duration model for local segments, and a side information file 502 containing the information needed to extract durations, such as the start and end times of phonemes and syllables. Next, in step S502, the local segment duration model 302, which takes a predetermined phonemic environment into account, is created using the local duration information extracted in step S501 and a phonemic/linguistic environment file 503. This file contains information on the phonemic environment, obtained from phonemic information such as phonemes, and on the linguistic environment, obtained from linguistic information such as the number of moras, the number of accent phrases, and parts of speech.

[0029] The specific procedure may be the same as that for the global duration model 301 described above. That is, a model may be created in which the local durations are normalized by the average local segment duration obtained from K training samples, and the local duration model 302 may be created from that model.

[0030] Finally, in step S304, expansion/contraction processing is performed using statistics on phoneme durations (mean and variance). It removes the difference (for example, 600 - 300 = 300 milliseconds in the concrete example above) between the global duration obtained in step S302 and the global duration obtained as the sum of the local durations of the local segments from step S303, so that the sum of the local durations equals the global duration from step S302. This can be realized, for example, with an expansion/contraction method that uses statistics on phoneme durations, such as the one disclosed in Japanese Patent Laid-Open No. 11-259095.

[0031] As one example of determining the duration of a phoneme, the mean, standard deviation, and minimum of the phoneme duration are obtained for each phoneme type (\alpha_i) and stored in memory, and these values are used to determine the initial value d_{\alpha_i} of the duration d_i for phoneme \alpha_i. The phoneme duration d_i is then determined from this initial value.

[0032]

    d_i = d_{\alpha_i} + \rho \sigma_{\alpha_i}^2
    \rho = (T - \sum_{i=1}^{N} d_{\alpha_i}) / \sum_{i=1}^{N} \sigma_{\alpha_i}^2

Here, T denotes the total utterance duration (T = \sum d_i) and \sigma_{\alpha_i} denotes the standard deviation of the phoneme duration. The sums run over i = 1 to N (the number of samples).
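The two formulas above can be sketched in Python as follows. The function name and the standard deviation values are assumptions for illustration; the initial values and target total reuse the 100 millisecond moras and 600 millisecond segment of the earlier example. The minimum duration mentioned in [0031] is not used in this sketch.

```python
def fit_to_total(initial_ms, std_ms, total_ms):
    """Spread the gap to total_ms over the units in proportion to their variance."""
    if len(initial_ms) != len(std_ms):
        raise ValueError("need one standard deviation per unit")
    variances = [s * s for s in std_ms]
    var_sum = sum(variances)
    if var_sum == 0:
        raise ValueError("at least one unit must have non-zero variance")
    rho = (total_ms - sum(initial_ms)) / var_sum
    return [d + rho * v for d, v in zip(initial_ms, variances)]


# Three moras of 100 ms each, stretched to a 600 ms global segment.
result = fit_to_total([100.0, 100.0, 100.0], [10.0, 10.0, 20.0], 600.0)
print(result)  # [150.0, 150.0, 300.0]
```

Units with a larger standard deviation absorb more of the adjustment, and the adjusted durations sum to the target total.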

[0033] [Embodiment 2] In Embodiment 1 above, a model was trained to estimate the value in equation (1), that is, the global segment duration d_k divided by the average global segment duration \bar{d}, and the local durations were reset using the global duration obtained from that model. In Embodiment 2, the global duration model is instead built on the difference between the global segment duration and the average duration. The hardware configuration and procedures of Embodiment 2 are the same as those of Embodiment 1 (FIGS. 1 to 5), so their description is omitted.

[0034] In Embodiment 2, equation (1) of Embodiment 1 is replaced with

    s_k = d_k - \bar{d}    ... (5)

so that s_k, the normalized duration d_k, is obtained by subtracting the average duration \bar{d} from the global segment duration of each training sample. Using the s_k obtained in this way, a prediction model for s_k can be created with linear multiple regression analysis in the same form as equation (3), as in Embodiment 1. Using the predicted value \hat{s}_k obtained from this model, the global segment duration \hat{d}_k for the k-th sample is obtained from equation (5) as

    \hat{d}_k = \hat{s}_k + \bar{d}    ... (6)

Equation (6) is the global duration model of Embodiment 2. The local duration model can be modeled in the same way.
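The difference-based normalization of equations (5) and (6) can be sketched in Python as follows. The function names and numbers are assumptions for illustration only.

```python
def normalize_by_difference(durations_ms, d_bar):
    """Equation (5): s_k = d_k - d_bar for every training sample."""
    return [d - d_bar for d in durations_ms]


def restore_from_difference(s_hat, d_bar):
    """Equation (6): predicted duration = predicted s_k + d_bar."""
    return s_hat + d_bar


targets = normalize_by_difference([600.0, 400.0], 150.0)
print(targets)                                       # [450.0, 250.0]
print(restore_from_difference(targets[0], 150.0))    # 600.0
```

Compared with the ratio form of equation (1), the regression targets here are offsets in milliseconds rather than dimensionless scale factors.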

[0035] The configurations in the above embodiments illustrate one embodiment of the present invention, and various modifications are possible. Examples of such modifications follow.

[0036] In the above embodiments, the average mora duration was used as the average global segment duration \bar{d}. Using the mora as the unit for the average is only one example; other phonemic units such as syllables or phonemes can be used. The present invention is also applicable to languages other than Japanese.

[0037] In the above embodiments, the factors and categories of the linear multiple regression model for global segments are only examples; other factors and categories may be used.

[0038] The object of the present invention is also achieved by supplying a system or apparatus with a storage medium on which the program code of software realizing the functions of the above embodiments is recorded, and having a computer (or a CPU or MPU) of the system or apparatus read and execute the program code stored on the storage medium. In this case, the program code read from the storage medium itself realizes the functions of the above embodiments, and the storage medium storing the program code constitutes the present invention. Storage media that can be used to supply such program code include, for example, a floppy disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, DVD, magnetic tape, nonvolatile memory card, and ROM.

[0039] The functions of the above embodiments are realized not only when the computer executes the program code it has read. The invention also covers the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above embodiments are realized by that processing.

[0040] The invention further covers the case where the program code read from the storage medium is written to a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on the function expansion board or in the function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above embodiments are realized by that processing.

[0041] As described above, according to this embodiment, using means that set the durations of global and local segments with high accuracy allows durations to be modeled more accurately, which improves the naturalness of the synthesized speech produced by a speech synthesizer.

[0042]

[Effects of the Invention] As described above, the present invention makes it possible to set the durations of a phoneme sequence accurately and to give natural phoneme durations suited to the phonemic and linguistic environment.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the hardware configuration of a speech synthesizer according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the speech synthesis procedure in the speech synthesizer according to the embodiment of the present invention.

FIG. 3 is a flowchart showing the procedure for setting the durations of a phoneme sequence using the duration model in the prosody generation processing of step S203 in FIG. 2.

FIG. 4 is a flowchart showing a method of creating the global duration model for global segments according to the embodiment.

FIG. 5 is a flowchart showing a method of creating the local duration model for local segments according to the embodiment.

Claims

[Claims]

1. A speech information processing method comprising: a step of obtaining the duration of a phoneme sequence of a predetermined unit based on a duration model for a global segment; a step of obtaining the duration of each phoneme constituting the phoneme sequence based on a duration model for a local segment; a setting step of setting the duration of each phoneme based on the duration of the phoneme sequence and the durations of the individual phonemes; and a speech synthesis step of synthesizing speech based on the phoneme durations set in the setting step.

2. The speech information processing method according to claim 1, wherein the local segment comprises at least one of a phoneme, a syllable, and a mora, and the global segment comprises at least one of an accent phrase, a word, a phrase, and a sentence.

3. The speech information processing method according to claim 1, wherein the duration model for the global segment is modeled on the ratio of the duration of the global segment to the average duration of the global segment.

4. The speech information processing method according to claim 1, wherein the duration model for the global segment is modeled on the difference between the duration of the global segment and the average duration of the global segment.

5. The speech information processing method according to claim 1, wherein the duration model for the global segment is modeled by a linear multiple regression model.

6. A computer-readable storage medium storing a program for executing the speech information processing method according to claim 1.

7. A speech information processing apparatus comprising: means for obtaining the duration of a phoneme sequence of a predetermined unit based on a duration model for a global segment; means for obtaining the duration of each phoneme constituting the phoneme sequence based on a duration model for a local segment; setting means for setting the duration of each phoneme based on the duration of the phoneme sequence and the durations of the individual phonemes; and speech synthesis means for synthesizing speech based on the phoneme durations set by the setting means.

8. The speech information processing apparatus according to claim 7, wherein the local segment comprises at least one of a phoneme, a syllable, and a mora, and the global segment comprises at least one of an accent phrase, a word, a phrase, and a sentence.

9. The speech information processing apparatus according to claim 7, wherein the duration model for the global segment is modeled on the ratio of the duration of the global segment to the average duration of the global segment.

10. The speech information processing apparatus according to claim 7, wherein the duration model for the global segment is modeled on the difference between the duration of the global segment and the average duration of the global segment.

11. The speech information processing apparatus according to claim 7, wherein the duration model for the global segment is modeled by a linear multiple regression model.