JP3915019B2 - VLIW processor, program generation device, and recording medium

Info

Publication number: JP3915019B2
Application number: JP16787598A
Authority: JP
Inventors: 信哉宮地; 信生檜垣; 哲也田中
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1998-06-16
Filing date: 1998-06-16
Publication date: 2007-05-16
Anticipated expiration: 2018-06-16
Also published as: JP2000003279A

Description

【０００１】
【発明の属する技術分野】
本発明は、命令供給が十分に行えない環境で使用されても供給されたものから実行する事により、性能劣化を抑制するＶＬＩＷプロセッサ、プログラム生成装置および記録媒体に関するものである。
【０００２】
【従来の技術】
近年のマイクロプロセッサ応用製品の高機能化および高速化に伴い、高い処理能力を持つマイクロプロセッサ（以下、単に「プロセッサ」という。）が望まれている。このため、最近では、１サイクルに複数の命令を同時に実行することが行われている。
【０００３】
命令レベルの並列処理を実現する方法として、ダイナミックスケジューリングによるものとスタティックスケジューリングによるものがある。
【０００４】
ダイナミックスケジューリングによるものの代表例としてスーパースカラ方式がある。この方式では、実行時に命令コードを解読後、ハードウェアにて動的に命令間の依存関係を解析して並列実行可能か否かを判定し、適切な組み合わせの命令を並列実行する。スタティックスケジューリングによるものの代表例としてＶＬＩＷ（ＶｅｒｙＬｏｎｇＩｎｓｔｒｕｃｔｉｏｎＷｏｒｄ）方式がある。この方式は、実行コード生成時にコンパイラ等により静的に命令間の依存関係を解析し、命令コードの移動を行って実行効率の良い命令ストリームを生成する。一般のＶＬＩＷ方式では、同時実行可能な複数の命令（ここでは「単位命令」と呼ぶ。）を一つの固定長命令供給単位（ここでは「一語」と呼ぶ。）に記述する。この方式を採ると、ハードウェアで命令間の依存解析を行う必要が無いため、ハードウェアを単純化できるというメリットがある。
【０００５】
以下、従来技術におけるＶＬＩＷプロセッサの動作を図１３を用いて説明する。
【０００６】
図１３は、従来技術におけるＶＬＩＷプロセッサの構成図であり、１０はデータ、命令等が格納されているメモリ、２０はメモリ１０から命令等を取り出す命令供給発行部、３０は命令供給発行部２０で取り出された命令を解読し解読結果を命令実行部へ与える命令解読部である。命令供給発行部２０は、メモリ１０からの命令等の取り出しを制御する命令フェッチ制御部２１とメモリ１０から取り出した命令等を格納する命令レジスタ２２からなる。また、命令解読部３０は、命令の発行を制御する命令発行制御部３１とデコーダ３２と解読結果を格納するレジスタ３３からなる。このプロセッサは、３２ビットの単位命令４つから構成される一語を同時に実行することが可能なＶＬＩＷプロセッサで、１２８ビット単位で命令フェッチされる。
【０００７】
まず、命令供給発行部２０内の命令フェッチ制御部２１は、ＰＣ２（プログラムカウンタ）、クロック１に基づいて実行する命令のアドレスをアドレスバス１１からメモリ１０に与える。これにより、メモリ１０は指定されたアドレスに対応する命令を１２８ビットのデータバスによって、命令レジスタ２２内の４つの命令レジスタに３２ビットずつ命令を供給する。命令レジスタ２２は、クロック１に基づいてメモリ１０から供給されたデータを格納する。これとともに、命令フェッチが完了したことを意味する命令フェッチフラグ２３を”１”とする。このとき、４つの命令レジスタ２２には、常に命令が格納される。なお、命令フェッチを開始したとき（ジャンプ命令や割り込みが生じた場合等）、誤った命令の解読を防止するため命令フェッチフラグ２３は”０”とされ、キャンセル信号３４によりデコーダからＮＯＰ（ＮｏＯｐｅｒａｔｉｏｎ）が出力される。
【０００８】
次に、命令解読部３０におけるデコーダ３２は、命令フェッチフラグ２３により命令レジスタ２２に命令が格納されたという情報を得て、命令を解読した結果を出力する。そして、レジスタ３３はクロック１によって解読した結果を格納する。
【０００９】
最後に、レジスタ３３に格納された解読結果は、命令実行部に供給され（図示せず）、命令が実行されることとなる。
【００１０】
【発明が解決しようとする課題】
しかしながら、上記従来のＶＬＩＷプロセッサでは、命令フェッチを一語長よりも小さい単位で行った場合や命令を可変長とした場合、命令レジスタに命令が供給されるタイミングに差異が生じるため性能が劣化してしまうことがあった。
【００１１】
すなわち、従来のＶＬＩＷプロセッサは一語長と命令フェッチ単位とが一致しているが、ＶＬＩＷを組み込みマイコンに適応するとコストの理由から命令フェッチ幅が一語の幅よりも小さくせざるを得ない場合がある。
【００１２】
また、たとえ最大語長と命令フェッチ単位とが一致していても可変長命令の場合、２回の命令フェッチによって初めて１つの命令を取り込むことができる場合もある。
【００１３】
以下、具体的に図面を用いて説明する。
（１）命令フェッチを一語長よりも小さい単位で行った場合
図１４はプログラム例であり、図１５は同プログラムを実行した場合のパイプラインの流れを説明したものである。
【００１４】
図１４では、（１０００００００）₁₆番地に、メモリから読み込んだ結果をｒ０レジスタに格納させる命令”ｍｏｖ（ｍｅｍ）、ｒ０”が、（１００００００４）₁₆番地にはｒ１レジスタの値を１つ増加させる命令”ａｄｄ＃１、ｒ１、ｒ１”が、以下同様に（１０００００１Ｆ）₁₆番地まで命令が配置されている。
【００１５】
この場合、図１５に示すように、タイミングｔ１で（１０００００００）₁₆番地の３２ビット長の２つの命令が、タイミングｔ２で（１００００００８）₁₆番地の３２ビット長の２つの命令が命令フェッチされ、タイミングｔ３で４つの命令が同時にデコード、ｔ４で実行される。しかし、（１０００００００）₁₆番地の命令”ｍｏｖ（ｍｅｍ）、ｒ０”は、ＭＥＭステージでメモリを読み込んだ結果をレジスタｒ０に書き込むものであるのに対して、後続する命令である（１００００００Ｃ）₁₆番地の命令”ａｄｄ＃１、ｒ０、ｒ０”はレジスタｒ０の内容を使用するものであるためＷＢステージでレジスタの書込を行うまで内容を参照出来ない。このため、レジスタ干渉が発生し、（１００００００Ｃ）₁₆番地の命令”ａｄｄ＃１、ｒ０、ｒ０”はタイミングｔ６で実行できず、タイミングｔ７で実行されることになる。
【００１６】
結果として、命令供給不足とレジスタ干渉の為に、すべての命令を実行するまでに９サイクル必要となる。
（２）命令を可変長とした場合
図１６はプログラム例であり、図１７は同プログラムを実行した場合のパイプラインの流れを説明したものである。
【００１７】
図１６では、（１０００００００）₁₆番地に、メモリから読み込んだ結果をｒ０レジスタに格納させる命令”ｍｏｖ（ｍｅｍ）、ｒ０”が、（１００００００４）₁₆番地にはレジスタｒ１の値を１つ増加させる命令”ａｄｄ＃１、ｒ１、ｒ１”が、以下、同様に（１０００００１Ｆ）₁₆番地まで命令が配置されている。なお、本命令中で、”ａｄｄ＃１２３４５６７８、ｒ３、ｒ３”命令は６４ビット単位命令であり、他は３２ビット単位命令である。
【００１８】
この場合、図１７に示すように、（１００００００Ｃ）₁₆番地の命令は６４ビット長の命令であるため、タイミングｔ１、ｔ２の２回の命令フェッチによって初めて４つの命令が揃い、タイミングｔ３で４つの命令が同時にデコードされ、ｔ４で実行される。しかし、（１０００００００）₁₆番地の命令”ｍｏｖ（ｍｅｍ）、ｒ０”は、ＭＥＭステージでメモリを読み込んだ結果を書き込んだものであるのに対して、後続する（１０００００１０）₁₆番地の命令”ａｄｄ＃１、ｒ０、ｒ０”はレジスタｒ０の内容を使用するものであるため、ＷＢステージでレジスタの書込を行うまで、内容を参照出来ない。このため、レジスタ干渉が発生し、（１０００００１０）₁₆番地の命令”ａｄｄ＃１、ｒ０、ｒ０”はタイミングｔ６で実行できず、タイミングｔ７で実行されることになる。
【００１９】
結果として、命令供給不足とレジスタ干渉の為に、すべての命令を実行するまでに９サイクル必要となる。
【００２０】
このように、上記従来のＶＬＩＷプロセッサは、なるべくハードウェアを簡略化することにより高速化を図るものであるため、並列処理できる全ての命令が揃った段階でこれらの命令を同時に実行するものであり、この前提が成り立たない場合には十分な性能を発揮できないという問題点があった。
【００２１】
本願発明は、上記従来の課題を解決するもので、命令フェッチを一語長よりも小さい単位で行った場合や命令を可変長とした場合であっても十分な性能を発揮することができるプロセッサを提供するものである。
【００２２】
【課題を解決するための手段】
本願発明は、並列実行できる全ての命令が命令フェッチされなくても、命令フェッチされた命令から先に実行することを特徴とするＶＬＩＷプロセッサである。
【００２３】
【発明の実施の形態】
以下、本発明について、図面を用いて詳細に説明する。
【００２４】
（第１の実施の形態）
本実施の形態は、一語長よりも小さい単位で命令フェッチをした場合でも、効率よく命令を実行可能とするプロセッサ等に関するものである。すなわち、４つの命令を同時に実行できるＶＬＩＷプロセッサであっても、２つの命令が揃った段階で、デコード、実行を開始することにより、極力レジスタ干渉によるパイプラインインタロックを軽減するものである。また、先行的に実行した命令がＩ／Ｏに関する命令である場合、より早くデータを得ることができる。
（１）プロセッサ
図１は本発明の第１の実施の形態におけるプロセッサのブロック図である。図１３に示した従来のＶＬＩＷプロセッサと比較すると、（ａ）データバス１１２が一語長よりも小さい６４ビットである点、（ｂ）４つの命令レジスタのうち左側の２つの命令レジスタに命令が格納されたか、右側の２つの命令レジスタに命令が格納されたかを示す位置情報１２４を持つ点、（ｃ）ＮＯＰを出力させるためのキャンセル信号１３４、１３５がある点で異なる。
【００２５】
このプロセッサは、位置情報１２４により命令レジスタ１２２のどこに命令が格納されたかを認識し、この情報を元にキャンセル信号１３４、１３５を生成しＮＯＰを出力することにより、命令レジスタ１２２に命令が格納されたものから順に解読・実行することを実現している。
【００２６】
まず、命令供給発行部１２０内の命令フェッチ制御部１２１は、ＰＣ１０２、クロック１０１に基づいて実行する命令のアドレスをアドレスバス１１１からメモリ１１０に与える。これにより、メモリ１１０は６４ビットのデータバス１１２を介して、命令レジスタ１２２内の左側の２つの命令レジスタに３２ビットずつ命令を供給する。命令レジスタ１２２は、クロック１０１に基づいてメモリ１１０から供給されたデータを格納する。これとともに、命令フェッチが完了したことを表すため命令フェッチフラグ１２３を”１”、さらに命令レジスタ１２２内の左側の２つに命令が格納されたことを表すため位置情報１２４を”０”とする。このとき、４つの命令レジスタ１２２のうち、左側の２つめの命令レジスタには命令が格納されているが、右側の２つの命令レジスタには命令が格納されていないことになる。なお、従来と同様に命令フェッチが完了していない場合、命令フェッチフラグ１２３は”０”であり、このためキャンセル信号１３４、１３５は”０”となり、ＮＯＰ信号生成器１３７はＮＯＰを出力する。
【００２７】
次に、命令解読部１３０におけるデコーダ１３２は、命令フェッチフラグ１２３により命令レジスタ１２２に命令が格納されたという情報を得て、命令を解読した結果を出力する。このとき、位置情報１２４が”０”であり命令レジスタ１２２のうち左側の２つの命令レジスタにしか命令が格納されていないことを表しているので、キャンセル信号生成器１３１はキャンセル信号１３４を”１”に、キャンセル信号１３５を”０”にする。これにより、デコーダ１３２におけるＮＯＰ生成器１３７のうち左側の２つからは命令レジスタ１２２に格納された命令の解読結果が出力され、右側の２つからはＮＯＰが出力される。そして、レジスタ１３３はクロック１０１によって解読した結果を格納する。なお、ＮＯＰ生成器１３７は、命令解読器１３６の出力とキャンセル信号との論理積を演算するＡＮＤ回路である。すなわち、キャンセル信号１３４、１３５が”０”となっているときは、解読器１３６の出力に関わらず、ＮＯＰを意味する”０”を出力する。
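As a rough illustration of the gating described in the paragraph above, the following Python sketch models how the instruction fetch flag 123 and the position information 124 might be turned into the cancel signals 134 and 135, and how those signals force NOPs at the decoder outputs. The function names, the integer encoding of decoded results, and the use of position value 1 for the right-hand register pair are assumptions made for this example; the text only states that position 0 denotes the left pair and that an output of 0 means NOP.

```python
NOP = 0  # the description states that an output of 0 means NOP


def cancel_signals(fetch_flag, position):
    """Return (cancel_134, cancel_135) for the left and right register pairs.

    fetch_flag: 1 when an instruction fetch has completed, 0 otherwise.
    position:   0 when the left two instruction registers were filled;
                1 (assumed encoding) when the right two were filled.
    """
    if fetch_flag == 0:
        return (0, 0)  # nothing valid yet: both halves are cancelled
    return (1, 0) if position == 0 else (0, 1)


def gate(decoded, fetch_flag, position):
    """AND each of the four decoder outputs with its cancel signal."""
    c134, c135 = cancel_signals(fetch_flag, position)
    masks = [c134, c134, c135, c135]
    # A cancel signal of 0 forces the output to NOP regardless of the input,
    # which is what an AND with a replicated 0 bit would do in hardware.
    return [d if m else NOP for d, m in zip(decoded, masks)]


if __name__ == "__main__":
    print(gate([5, 6, 7, 8], fetch_flag=1, position=0))
```

Here the decoded values 5 to 8 are placeholders; any non-zero value stands for a real decoded instruction.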
【００２８】
最後に、レジスタ１３３に格納された解読結果は、命令実行部に供給され（図示せず）、命令が実行されることとなる。
【００２９】
なお、次の命令フェッチの際には、フェッチされた命令等は命令レジスタ１２２の右側の２つに格納され、位置情報１２４もこれに対応して更新され、そしてキャンセル信号１３４は”０”、キャンセル信号１３５は”１”となる。
【００３０】
次に、図１４に示すプログラムを実行した場合のパイプラインの流れについて、図２を用いて説明する。
【００３１】
本プロセッサのパイプラインは、命令供給発行部１２０によって命令フェッチを行うステージ（ＩＦステージ）、命令解読部１３０によって命令フェッチした命令を解読するステージ（ＤＥＣステージ）、解読した命令を演算器を使って実行する実行ステージ（以下ＥＸステージ）、解読した命令がメモリアクセス命令であった場合にメモリアクセスを行うメモリステージ（ＭＥＭステージ）、演算やメモリアクセス結果をレジスタに反映させる書込ステージ（以下ＷＢステージ）の５段パイプラインとなっている。さらに、レジスタ間演算の様なＥＸステージで演算した実行結果を書き込んだレジスタの値は、ＷＢステージでレジスタで実際の書込を行わなくともＥＸステージ、或いはＭＥＭステージから後続する命令のＥＸステージへバイパスする事によって、直後に配置した命令でも参照可能である。
【００３２】
図１４では、（１０００００００）₁₆番地に、メモリから読み込んだ結果をｒ０レジスタに格納させる命令”ｍｏｖ（ｍｅｍ）、ｒ０”が、（１００００００４）₁₆番地にはｒ１レジスタの値を１つ増加させる命令”ａｄｄ＃１、ｒ１、ｒ１”が、以下、同様に（１０００００１Ｆ）₁₆番地まで命令が配置されている。
【００３３】
この場合、図２に示すように、タイミングｔ１で（１０００００００）₁₆番地の３２ビット長の２つの命令が命令フェッチされ、タイミングｔ２で２つの命令が同時にデコード、ｔ３で実行される。そして、タイミングｔ６ではＷＢステージを終え、レジスタｒ０の内容は使用できる状態になっている。
【００３４】
一方、タイミングｔ４で（１０００００１８）₁₆番地の命令”ａｄｄ＃１、ｒ０、ｒ０”の命令フェッチが行われ、タイミングｔ６でＥＸステージに入る。このとき、レジスタｒ０は使用できる状態になっているため、レジスタ干渉によるパイプラインインタロックは生じない。結果として、すべての命令を実行するまでに８サイクル必要となる。
【００３５】
図１６に示すパイプラインの流れと図２に示すパイプラインの流れとを比較すると、（１０００００１８）₁₆番地の命令”ａｄｄ＃１、ｒ０、ｒ０”がＥＸステージに入るのはタイミングｔ６で同一である。しかし、（１０００００００）₁₆番地の命令”ｍｏｖ（ｍｅｍ）、ｒ０”がＷＢステージを完了するのが、図１６ではタイミングｔ６であるのに対し、図２ではタイミングｔ５である点で異なる。これは、図１５では６４ビットの命令フェッチが２回行われ、１２８ビットの命令フェッチが完了した段階でデコード、実行されているのに対し、図２では６４ビットの命令フェッチが行われると次の６４ビットの命令フェッチを待たずにデコード、実行を行っているからである。このため、図１６ではすべての命令を実行するまでに９サイクル必要であるのに対し、図２では８サイクルで実行が完了している。
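The cycle counts compared above can be reproduced with a very small timing model. The sketch below is only an approximation under stated assumptions: eight 32-bit instructions, two instructions fetched per cycle over the 64-bit bus, a five-stage IF/DEC/EX/MEM/WB pipeline, a load at index 0 whose result becomes usable only in the cycle after its WB stage, and a dependent instruction at index 6 (address 0x10000018, as in the first embodiment's walkthrough). The function name and parameters are hypothetical.

```python
def total_cycles(group_size, n_instr=8, fetch_per_cycle=2,
                 load_idx=0, use_idx=6):
    """Return the cycle in which the last instruction completes WB.

    group_size is the number of instructions that must all be fetched
    before decoding starts: 4 models the conventional processor, 2 models
    decoding after every 64-bit fetch.
    """
    wb_of = {}    # instruction index -> cycle of its WB stage
    prev_ex = 0   # EX cycle of the previous group (in-order issue)
    wb = 0
    for start in range(0, n_instr, group_size):
        members = range(start, min(start + group_size, n_instr))
        last = members[-1]
        fetch_done = last // fetch_per_cycle + 1   # IF cycle of the last member
        ex = max(fetch_done + 2, prev_ex + 1)      # DEC, then EX
        if use_idx in members and load_idx in wb_of:
            # Interlock: wait until the cycle after the load's WB stage.
            ex = max(ex, wb_of[load_idx] + 1)
        wb = ex + 2                                # MEM, then WB
        for m in members:
            wb_of[m] = wb
        prev_ex = ex
    return wb


if __name__ == "__main__":
    print(total_cycles(group_size=4), total_cycles(group_size=2))
```

With these assumptions the model returns 9 when decoding waits for all four instructions and 8 when decoding starts after each 64-bit fetch, which agrees with the counts given in the text.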
【００３６】
なお、本実施の形態では、命令の一語長が１２８ビットであるのに対して、データバスが６４ビットである場合を例としているがこれに限られるものではない。例えば、命令の一語長は６４ビットでも２５６ビットでも良く、データバスは３２ビット、１６ビット等２のべき乗であれば足りる。すなわち、命令の一語長よりもデータバスの幅が小さく、一回の命令フェッチで命令の一語長をフェッチできないケースであれば足りる。この場合、命令の一語長を何回の命令フェッチでフェッチできるかによって、位置情報１２４、キャンセル信号１３４、１３５の数が変わる。本実施の形態では、２回の命令フェッチによって命令の一語長をフェッチしているので、位置情報１２４は１ビット（１ビットで２つの情報を表すことができる）で、キャンセル信号は２種類設けている。また、４つの命令を同時に実行するＶＬＩＷを前提としているがこれに限られない。
【００３７】
また、本実施の形態では、メモリ１１０のみが接続されている場合について説明したが、さらに１２８ビットで命令フェッチされるメモリが接続されている場合であっても良い。例えば、内蔵メモリは速度重視で１２８ビットで命令フェッチされるものとし、外部メモリはコストの関係で６４ビットで命令フェッチされるものとし、データバス１１２を介して同列にメモリを接続しメモリ領域によっていずれのメモリを使用するかを切り換えてもよい。この場合、１２８ビットで命令フェッチされるメモリから読み出された場合はもちろんのこと、６４ビット単位で命令フェッチされるメモリから読み出された場合も性能の劣化をなるべく起こさないようにできる。
（２）プログラム生成装置
以上、第１の実施の形態のプロセッサについて述べたが、従来のＶＬＩＷプロセッサ用のプログラム生成装置を本第１の実施の形態のプロセッサに適応しようとすると、例えば、一語中に、命令”ａｄｄ＃１、ｒ０、ｒ０”が４つ連続した命令を実行する場合、命令供給が十分で一語中の命令を同時に実行した場合にはｒ０レジスタの値が”１”増加するのに対して、命令供給が不十分で一語中の命令を１単位命令毎に逐次実行した場合にはｒ０レジスタの値が”４”増加し、命令供給の状態によって実行結果が異なってしまうという問題点が発生する。
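The discrepancy described above (r0 increases by 1 under simultaneous execution but by 4 under sequential execution) can be checked with a minimal Python sketch. The tuple encoding (immediate, source, destination) for "add #imm, src, dst" and the function names are assumptions introduced for this example only.

```python
def run_parallel(regs, word):
    """Every unit instruction reads the register values from before the word."""
    before = dict(regs)
    after = dict(regs)
    for imm, src, dst in word:
        after[dst] = before[src] + imm
    return after


def run_sequential(regs, word):
    """Each unit instruction sees the results of the ones to its left."""
    after = dict(regs)
    for imm, src, dst in word:
        after[dst] = after[src] + imm
    return after


# One word holding four copies of "add #1, r0, r0".
word = [(1, "r0", "r0")] * 4

if __name__ == "__main__":
    print(run_parallel({"r0": 0}, word)["r0"])    # 1
    print(run_sequential({"r0": 0}, word)["r0"])  # 4
```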
【００３８】
（第１のプログラム生成装置の構成）
図６は本発明の第１の実施の形態における第１のプログラム生成装置のブロック図である。
【００３９】
３００は命令列を格納しているメモリ、３２０は一語内の単位命令を同時実行した場合と一語内の単位命令を逐次実行した場合で実行結果が異なる命令列を抽出する回避対象コード検出手段、３３０は問題となる命令列を回避する命令列を生成する逐次実行保証コード生成手段、３４０は逐次実行保証コード生成手段が生成したプログラムを格納する命令列格納手段である。
【００４０】
以上の様に構成された本発明の第１の実施の形態の第１のプログラム生成装置について、以下、その動作を説明する。
【００４１】
回避対象コード検出手段３２０はソースコード格納手段３００に格納された命令列を入力すると、その命令列中で、一語内の単位命令を同時実行した場合と、一語内の単位命令を逐次実行した場合で実行結果が異なる命令列を回避対象命令列として抽出する。実行結果が異なる命令列とは、具体的には、一語中の任意の単位命令が出力する結果を後続する単位命令が参照する場合の出力命令と参照命令の組み合わせであり、例えば、一語中に含まれる命令”ａｄｄｒ０、ｒ１、ｒ１”と後続する命令”ａｄｄｒ１、ｒ２、ｒ３”の組み合わせである。
【００４２】
図７は回避対象コード検出手段が回避対象命令列を生成するアルゴリズムを示したものである。
【００４３】
ステップ４０１はソースプログラムから１語を読み出すステップ、ステップ４０２は読み込んだ１語を先頭側から１命令単位ずつ読み出すステップ、ステップ４０３はステップ４０２で読み込んだ１命令単位中の出力レジスタ情報を登録するステップ、ステップ４０４は後続する命令単位を先頭側から１命令単位ずつ読み出すステップ、ステップ４０５はステップ４０４で読み込んだ１命令単位中の参照レジスタを登録するステップ、ステップ４０６はステップ４０２で登録した出力レジスタとステップ４０５で登録した参照レジスタが一致しているかどうかを判断するステップ、ステップ４０７はステップ４０５で一致していた場合に後続する命令単位を登録するステップ、ステップ４０８は後続する命令単位があるかを判断し存在する場合にはステップ４０４以降を実行する判断ステップ、ステップ４０９は登録された出力命令と参照命令の組み合わせが存在する場合には回避対象コードとして出力するステップ、ステップ４１０は後続する命令単位があるかを判断し存在する場合にはステップ４０２以降を実行する判断ステップ、ステップ４１１は後続する１語があるかを判断し存在する場合にはステップ４０１以降を実行する判断ステップである。
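Steps 401 to 411 amount to a nested scan over each word: for every unit instruction, record its output register and compare it with the registers read by the unit instructions that follow it in the same word. A minimal Python sketch of that scan is shown below. The `Unit` record, its field names, and the function name are hypothetical; the mapping of loop levels to step numbers in the comments is an interpretation of the description, not a reproduction of FIG. 7.

```python
from typing import List, NamedTuple, Tuple


class Unit(NamedTuple):
    op: str                 # mnemonic, e.g. "add"
    srcs: Tuple[str, ...]   # registers the unit instruction reads
    dst: str                # register the unit instruction writes


def find_hazards(word: List[Unit]) -> List[Tuple[int, int]]:
    """Return (producer, consumer) index pairs within one word."""
    hazards = []
    for i, producer in enumerate(word):          # steps 402-403
        for j in range(i + 1, len(word)):        # steps 404-405
            if producer.dst in word[j].srcs:     # step 406
                hazards.append((i, j))           # step 407
    return hazards                               # step 409


def scan_program(words: List[List[Unit]]):
    """Outer loop over words (steps 401 and 411)."""
    return [find_hazards(w) for w in words]


# "add r0, r1, r0 & sub r0, r1, r1 & add #1, r2, r2 & add #1, r3, r3"
example = [
    Unit("add", ("r0", "r1"), "r0"),
    Unit("sub", ("r0", "r1"), "r1"),
    Unit("add", ("r2",), "r2"),
    Unit("add", ("r3",), "r3"),
]

if __name__ == "__main__":
    print(find_hazards(example))
```

For the example word, only the first two unit instructions form a producer and consumer pair, because the second reads the r0 that the first writes.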
【００４４】
逐次実行保証コード生成手段３３０は、回避対象コード検出手段３２０の出力する回避対象命令列の情報を用いて、ソースコード格納手段３００に格納された命令列を、同時実行した場合と逐次実行した場合で動作が同一になる命令列への変換を行う。具体的には、命令列中で使用されていないレジスタを検索し、問題となる命令列中の問題となるレジスタを出力する命令の出力レジスタを使用されていないレジスタで置き換えると共に、後続する語で問題となるレジスタを参照する命令の参照レジスタを置き換えたレジスタに置き換える。例えば、一語中に命令”ａｄｄｒ０、ｒ１、ｒ１”と後続する命令”ａｄｄｒ１、ｒ２、ｒ３”が存在し、後続する語に命令”ａｄｄ＃１、ｒ１、ｒ１”が存在する場合（以降、”ａｄｄｒ０、ｒ１、ｒ１＆ａｄｄｒ１、ｒ２、ｒ３；ａｄｄ＃１、ｒ１、ｒ１”と記述する。ここで”＆”は同一語に含まれ、逐次実行の場合には左から右へ実行する事を、”；”は、後続する語との境界であることを示す）は、命令列中で使用していないレジスタをｒ４とすると、問題となる命令列中の問題となるレジスタｒ１を出力する命令”ａｄｄｒ０、ｒ１、ｒ１”の出力レジスタを使用されていないレジスタで置き換え”ａｄｄｒ０、ｒ１、ｒ４”にすると共に、後続する語で問題となるレジスタを参照する命令”ａｄｄ＃１、ｒ１、ｒ１”の参照レジスタを置き換えたレジスタに置き換え”ａｄｄ＃１、ｒ４、ｒ１”にする。変換された命令列は命令列格納手段３４０に出力される。
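The register replacement described above can be sketched in a few lines of Python. This is a simplified illustration that assumes each unit instruction is a tuple (mnemonic, source operands, destination register), that only the immediately following word needs its references rewritten, and that a free register is found by a plain search over a fixed pool; the helper names and the pool are hypothetical.

```python
def pick_free_register(words, pool):
    """Return the first register in pool that no instruction reads or writes."""
    used = set()
    for word in words:
        for _, srcs, dst in word:
            used.update(srcs)
            used.add(dst)
    for reg in pool:
        if reg not in used:
            return reg
    return None  # the text falls back to push/pop in this case


def rename(word, next_word, producer_idx, free_reg):
    """Redirect the producer's output to free_reg and patch later reads."""
    op, srcs, old = word[producer_idx]
    new_word = list(word)
    new_word[producer_idx] = (op, srcs, free_reg)
    new_next = [
        (o, tuple(free_reg if s == old else s for s in ss), d)
        for o, ss, d in next_word
    ]
    return new_word, new_next


# "add r0, r1, r1 & add r1, r2, r3 ; add #1, r1, r1"
word = [("add", ("r0", "r1"), "r1"), ("add", ("r1", "r2"), "r3")]
nxt = [("add", ("#1", "r1"), "r1")]

if __name__ == "__main__":
    free = pick_free_register([word, nxt], ["r%d" % i for i in range(8)])
    print(rename(word, nxt, 0, free))
```

With r4 as the free register, the result corresponds to "add r0, r1, r4 & add r1, r2, r3 ; add #1, r4, r1", the transformed sequence given in the text.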
【００４５】
使用されていないレジスタの検索は、検索を全く行わずに問題となる命令語の前後にスタックへの退避復帰処理を装入することによってレジスタを確保することも可能であるし、最適化コンパイラのレジスタ割付けの要素技術を流用することによって基本ブロック内部や基本ブロックを越えた検索を行い、使用されていないレジスタが存在しない場合には問題となる命令語の前後にスタックへの退避復帰処理を装入することによってレジスタを確保するという方法も可能である。
【００４６】
（命令列生成装置の動作）
次に具体的な命令を解読実行した場合の本命令列生成装置の動作について説明する。
【００４７】
図８（ａ）は、ソースコード格納手段３００に格納された従来のＶＬＩＷプロセッサ用のプログラム生成装置が生成した命令列である。
【００４８】
まず、（１０００００００）₁₆番地から始まる一語の処理を行う。
回避対象コード検出手段３２０はソースコード格納手段３００に格納された（１０００００００）₁₆番地から始まる命令列一語分”ａｄｄ＃１、ｒ０、ｒ０＆ａｄｄ＃１、ｒ１、ｒ１＆ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を入力し、その命令列中で、一語を同時実行した場合と一語内の単位命令を逐次実行した場合で実行結果が異なる命令列がないかを検査する。この命令列中には問題となる命令列は存在しないので、回避対象コード検出手段３２０は問題となる命令列を出力しない。
【００４９】
逐次実行保証コード生成手段３３０は、回避対象コード検出手段３２０が回避対象命令列を出力しないので、ソースコード格納手段３００に格納された（１０００００００）₁₆番地から始まる命令列一語分をそのまま命令列格納手段３４０へ出力する。
【００５０】
次に、後続する（１０００００１０）₁₆番地から始まる一語の処理を行う。
回避対象コード検出手段３２０はソースコード格納手段３００に格納された（１０００００１０）₁₆番地から始まる命令列一語分”ａｄｄｒ０、ｒ１、ｒ０＆ｓｕｂｒ０、ｒ１、ｒ１＆ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を入力し、その命令列中で、一語を同時実行した場合と一語内の単位命令を逐次実行した場合で実行結果が異なる命令列がないかを検査する。この命令列中には、”ａｄｄｒ０、ｒ１、ｒ０＆ｓｕｂｒ０、ｒ１、ｒ１”が該当する命令となる。
【００５１】
逐次実行保証コード生成手段３３０は、回避対象コード検出手段３２０の出力する回避対象命令列”ａｄｄｒ０、ｒ１、ｒ０＆ｓｕｂｒ０、ｒ１、ｒ１”の情報を用いて、ソースコード格納手段３００に格納された命令列を、同時実行した場合と逐次実行した場合で動作が同一になる命令列への変換を行う。後続する命令列を参照し、使用していないレジスタとしてｒ４レジスタを使い、回避対象命令列中の命令”ａｄｄｒ０、ｒ１、ｒ０”を命令”ａｄｄｒ０、ｒ１、ｒ４”に変換すると共に、後続するｒ０を参照する命令を検索し、命令”ａｄｄ＃１、ｒ０、ｒ０を命令ａｄｄ＃１、ｒ４、ｒ０”に変換した後、命令列格納手段３４０に出力する。
【００５２】
同様にして、（１０００００２０）₁₆番地から始まる命令列一語を処理する事によって、”ａｄｄｒ１、ｒ２、ｒ１＆ｓｕｂｒ１、ｒ２、ｒ２；ａｄｄ＃１、ｒ１、ｒ１”を ”ａｄｄｒ１、ｒ２、ｒ５＆ｓｕｂｒ１、ｒ２、ｒ２；ａｄｄ＃１、ｒ５、ｒ１”に変換する。
【００５３】
また、（１０００００３０）₁₆番地から始まる命令列一語を処理し、回避対象コードが存在するが使用していないレジスタが存在しない場合には、たとえばｒ６レジスタをスタックへの退避命令”ｐｕｓｈｒ６”により確保し、スタックからの復帰命令”ｐｏｐｒ６”により復元する事により、”ａｄｄｒ２、ｒ３、ｒ２＆ｓｕｂｒ２、ｒ３、ｒ３”を”ｐｕｓｈｒ６；ａｄｄｒ２、ｒ３、ｒ６＆ｓｕｂｒ２、ｒ３、ｒ３；ｍｏｖｒ６、ｒ２＆ｐｏｐｒ６”に変換する。
【００５４】
以上の処理によって、回避対象コード検出手段３２０は、図８（ｂ）の様に、斜線部分の命令列を検出し、逐次実行保証コード生成手段３３０は、図８（ｃ）の様に、回避対象コード検出手段３２０の出力する斜線部分の命令列の出力レジスタを変更すると共に、後続する語に含まれる、濃い斜線部分の出力レジスタを参照する参照レジスタを変更した命令列や追加したスタックへのアクセス命令やＮＯＰ命令の命令列を命令列格納手段３４０へ出力する。
【００５５】
（第２のプログラム生成装置の構成）
図９は本発明の第１の実施の形態における第２のプログラム生成装置のブロック図である。
【００５６】
３００は命令列を格納しているメモリシステム、
３１０はプロセッサの命令フェッチ境界を検出する命令フェッチ境界検出手段、３２０は一語内の単位命令を同時実行した場合と一語内の単位命令を命令フェッチ境界を単位に逐次実行した場合で実行結果が異なる命令列を抽出する回避対象コード検出手段、３３０は問題となる命令列を回避する命令列を生成する逐次実行保証コード生成手段、３４０は逐次実行保証コード生成手段が生成したプログラムを格納する命令列格納手段である。
【００５７】
以上の様に構成された本発明の第１の実施の形態における第２のプログラム生成装置について、以下、その動作を説明する。
【００５８】
命令フェッチ境界検出手段３１０はソースコード格納手段３００に格納された命令列を入力すると、その命令列中で、プロセッサの命令フェッチの境界がどこに存在するかを検出する。本実施の形態ではプロセッサの命令フェッチ幅は６４ビットであるので、プロセッサの命令フェッチ境界は、（１０００００００）₁₆、（１００００００８）₁₆、（１０００００１０）₁₆番地という様なアドレスの下位が０または８の番地となる。
【００５９】
回避対象コード検出手段３２０はソースコード格納手段３００に格納された命令列、および、命令フェッチ境界検出手段３１０から出力される命令フェッチ境界情報を入力すると、その命令列中で、一語内の単位命令を同時実行した場合と一語内の単位命令を命令フェッチ境界を単位に逐次実行した場合で実行結果が異なる命令列を抽出する。実行結果が異なる命令列とは、具体的には、一語中の任意の単位命令が出力する結果を後続する単位命令が参照する場合の出力命令と参照命令の組み合わせのうち、命令フェッチ境界を跨いでいるものであり、例えば、一語中に含まれる命令”ａｄｄｒ０、ｒ１、ｒ１”と後続する命令”ａｄｄｒ１、ｒ２、ｒ３”の組み合わせで、命令フェッチ境界を跨いでいるものである。
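To make the boundary condition concrete, the sketch below computes the instruction fetch boundaries inside a word and tests whether a producer and consumer pair straddles one. It assumes a 64-bit (8-byte) fetch width and 128-bit (16-byte) words as in this embodiment; the function names and default parameters are introduced here for illustration.

```python
def boundaries_in_word(word_addr, word_bytes=16, fetch_bytes=8):
    """Fetch-boundary addresses that fall strictly inside one word."""
    first = (word_addr // fetch_bytes + 1) * fetch_bytes
    return list(range(first, word_addr + word_bytes, fetch_bytes))


def crosses_boundary(addr_a, addr_b, fetch_bytes=8):
    """True when two unit instructions arrive in different fetches."""
    return addr_a // fetch_bytes != addr_b // fetch_bytes


if __name__ == "__main__":
    print([hex(a) for a in boundaries_in_word(0x10000000)])
    # Second and third unit instructions of the word at 0x10000020.
    print(crosses_boundary(0x10000024, 0x10000028))
    # First and second unit instructions of the word at 0x10000010.
    print(crosses_boundary(0x10000010, 0x10000014))
```

Only pairs for which `crosses_boundary` is true need to be treated as avoidance targets by this second program generation device, since unit instructions delivered by the same fetch are still executed together.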
【００６０】
逐次実行保証コード生成手段３３０は、回避対象コード検出手段３２０の出力する回避対象命令列の情報を用いて、ソースコード格納手段３００に格納された命令列を、同時実行した場合と逐次実行した場合で動作が同一になる命令列への変換を行う。具体的には、命令列中で使用されていないレジスタを検索し、問題となる命令列中の問題となるレジスタを出力する命令の出力レジスタを使用されていないレジスタで置き換えると共に、後続する語で問題となるレジスタを参照する命令の参照レジスタを置き換えたレジスタに置き換える。例えば、一語中に命令”ａｄｄｒ０、ｒ１、ｒ１”と後続する命令”ａｄｄｒ１、ｒ２、ｒ３”が存在し、後続する語に命令”ａｄｄ＃１、ｒ１、ｒ１”が存在する場合（以降、”ａｄｄｒ０、ｒ１、ｒ１＆ａｄｄｒ１、ｒ２、ｒ３；ａｄｄ＃１、ｒ１、ｒ１”と記述する。ここで”＆”は同一語に含まれ、逐次実行の場合には左から右へ実行する事を、”；”は、次の語との境界であることを示す）は、命令列中で使用していないレジスタをｒ４とすると、問題となる命令列中の問題となるレジスタｒ１を出力する命令”ａｄｄｒ０、ｒ１、ｒ１”の出力レジスタを使用されていないレジスタで置き換え”ａｄｄｒ０、ｒ１、ｒ４”にすると共に、後続する語で問題となるレジスタを参照する命令”ａｄｄ＃１、ｒ１、ｒ１”の参照レジスタを置き換えたレジスタに置き換え”ａｄｄ＃１、ｒ４、ｒ１”にする。変換された命令列は命令列格納手段３４０に出力される。
【００６１】
（命令列生成装置の動作）
次に具体的な命令を解読実行した場合の本命令列生成装置の動作について説明する。
【００６２】
図１０（ａ）は、ソースコード格納手段３００に格納された従来のＶＬＩＷプロセッサ用のプログラム生成装置が生成した命令列である。
【００６３】
まず、（１０００００００）₁₆番地から始まる一語の処理を行う。
命令境界検出手段３１０はソースコード格納手段３００に格納された（１０００００００）₁₆番地から始まる命令列一語分中の命令境界である、（１００００００８）₁₆番地を検出する。
【００６４】
回避対象コード検出手段３２０はソースコード格納手段３００に格納された（１０００００００）₁₆番地から始まる命令列一語分”ａｄｄ＃１、ｒ０、ｒ０＆ａｄｄ＃１、ｒ１、ｒ１＆ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を入力し、その命令列中で、一語を同時実行した場合と、一語内の命令境界検出手段３１０の出力する命令フェッチ境界を単位として単位命令を逐次実行した場合で実行結果が異なる命令列がないかを検査する。つまり、命令列一語分”ａｄｄ＃１、ｒ０、ｒ０＆ａｄｄ＃１、ｒ１、ｒ１＆ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を同時実行した場合と、”ａｄｄ＃１、ｒ０、ｒ０＆ａｄｄ＃１、ｒ１、ｒ１”の２つの単位命令と ”ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”の２つの単位命令を逐次実行した場合に実行結果が異なる事はないかを検査する。この命令列中には問題となる命令列は存在しないので、回避対象コード検出手段３２０は問題となる命令列を出力しない。
【００６５】
逐次実行保証コード生成手段は３３０は、回避対象コード検出手段３２０が回避対象命令列を出力しないので、ソースコード格納手段３００に格納された（１０００００００）₁₆番地から始まる命令列一語分をそのまま命令列格納手段３４０へ出力する。
【００６６】
次に、後続する（１０００００１０）₁₆番地から始まる一語の処理を行う。
命令境界検出手段３１０はソースコード格納手段３００に格納された（１０００００１０）₁₆番地から始まる命令列一語分中の命令境界である、（１０００００１８）₁₆番地を検出する。
【００６７】
回避対象コード検出手段３２０はソースコード格納手段３００に格納された（１０００００１０）₁₆番地から始まる命令列一語分”ａｄｄｒ０、ｒ１、ｒ０＆ｓｕｂｒ０、ｒ１、ｒ１＆ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を入力し、その命令列中で、一語を同時実行した場合と、一語内の命令境界検出手段３１０の出力する命令フェッチ境界を単位として単位命令を逐次実行した場合で実行結果が異なる命令列がないかを検査する。つまり、命令列一語分”ａｄｄｒ０、ｒ１、ｒ０＆ｓｕｂｒ０、ｒ１、ｒ１＆ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を同時実行した場合と、”ａｄｄｒ０、ｒ１、ｒ０＆ｓｕｂｒ０、ｒ１、ｒ１”の２つの単位命令と”ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”の２つの単位命令を逐次実行した場合に実行結果が異なる事はないかを検査する。この命令列中にも問題となる命令列は存在しないので、回避対象コード検出手段３２０は問題となる命令列を出力しない。
【００６８】
逐次実行保証コード生成手段３３０は、回避対象コード検出手段３２０が回避対象命令列を出力しないので、ソースコード格納手段３００に格納された（１０００００１０）₁₆番地から始まる命令列一語分をそのまま命令列格納手段３４０へ出力する。
【００６９】
次に、後続する（１０００００２０）₁₆番地から始まる一語の処理を行う。
命令境界検出手段３１０はソースコード格納手段３００に格納された（１０００００２０）₁₆番地から始まる命令列一語分中の命令境界である、（１０００００２８）₁₆番地を検出する。
【００７０】
回避対象コード検出手段３２０はソースコード格納手段３００に格納された（１０００００２０）₁₆番地から始まる命令列一語分”ａｄｄ＃１、ｒ０、ｒ０＆ａｄｄｒ１、ｒ２、ｒ１＆ｓｕｂｒ１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を入力し、その命令列中で、一語を同時実行した場合と、一語内の命令境界検出手段２１０の出力する命令フェッチ境界を単位として単位命令を逐次実行した場合で実行結果が異なる命令列がないかを検査する。つまり、命令列一語分”ａｄｄ＃１、ｒ０、ｒ０＆ａｄｄｒ１、ｒ２、ｒ１＆ｓｕｂｒ１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を同時実行した場合と、”ａｄｄ＃１、ｒ０、ｒ０＆ａｄｄｒ１、ｒ２、ｒ１”の２つの単位命令と”ｓｕｂｒ１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”の２つの単位命令を逐次実行した場合に実行結果が異なる事はないかを検査する。この場合、”ａｄｄｒ１、ｒ２、ｒ１＆ｓｕｂｒ１、ｒ２、ｒ２”命令が該当する命令となる。
【００７１】
逐次実行保証コード生成手段３３０は、回避対象コード検出手段３２０の出力する回避対象命令列”ａｄｄｒ１、ｒ２、ｒ１＆ｓｕｂｒ１、ｒ２、ｒ２”の情報を用いて、ソースコード格納手段３００に格納された命令列を、同時実行した場合と逐次実行した場合で動作が同一になる命令列への変換を行う。後続する命令列を参照し、使用していないレジスタとしてｒ４レジスタを使い、回避対象命令列中の命令”ａｄｄｒ１、ｒ２、ｒ１”を命令”ａｄｄｒ１、ｒ２、ｒ５”に変換すると共に、後続するｒ１を参照する命令を検索し、命令”ａｄｄ＃１、ｒ１、ｒ１”を命令”ａｄｄ＃１、ｒ５、ｒ１”に変換した後、命令列格納手段３４０に出力する。
【００７２】
以降、（１０００００３０）₁₆番地から始まる命令列一語は問題が無いのでそのまま命令列格納手段３４０に出力する。
【００７３】
以上の処理によって、命令フェッチ境界検出手段３１０は図１０（ａ）の太線で示す命令フェッチ境界情報を出力し、回避対象コード検出手段３２０は、図１０（ａ）の様に、斜線部分の命令列を検出し、逐次実行保証コード生成手段３３０は、図１０（ｂ）の様に、回避対象コード検出手段３２０の出力する斜線部分の命令列の出力レジスタを変更すると共に、後続する語に含まれる、出力レジスタを参照する濃い斜線部分の命令列の参照レジスタを変更し、命令列を命令列格納手段３４０へ出力する。
【００７４】
なお、本実施の形態では、命令フェッチ幅６４ビット、１２８ビット固定長、最大同時実行４命令のＶＬＩＷプロセッサを想定しているが、これらの値は特に限定しない。例えば、命令の一語長は６４ビットでも２５６ビットでも良く、データバスの幅は１６ビットでも３２ビットでも良く、すなわち、命令の一語長よりもデータバスの幅が小さいケースが存在すれば足りる。
【００７５】
また、逐次実行保証コード生成手段は、命令列中で使用されていないレジスタを検索し、問題となる命令列中の問題となるレジスタを出力する命令の出力レジスタを使用されていないレジスタで置き換えると共に、後続する語で問題となるレジスタを参照する命令の参照レジスタを置き換えたレジスタに置き換えるアルゴリズムで説明を行ったが、あらかじめ問題となるレジスタを使用されていないレジスタに転送し、問題となるレジスタを参照する命令の参照レジスタを置き換えたレジスタに置き換えるアルゴリズムを行っても構わない。具体的には、実施例では、”ａｄｄｒ０，ｒ１，ｒ０＆ｓｕｂｒ０，ｒ１，ｒ１；ａｄｄ＃１，ｒ０，ｒ０”の命令列を”ｍｏｖｒ０、ｒ４；ａｄｄｒ０，ｒ１，ｒ０＆ａｄｄｒ４，ｒ１，ｒ１；ａｄｄ＃１，ｒ０，ｒ０”としてもよい。
【００７６】
また、回避対象コード検出手段が出力する命令列は、出力命令と参照命令の組み合わせであるので、２命令とは限らない。参照命令が複数ある場合には３命令以上の組み合わせになる場合も存在する。
【００７７】
また、命令列格納手段は、フロッピーディスクやテープやハードディスクやメモリなどの記録媒体でも構わないし、コンパイラやアセンブラオプティマイザ等の最適化プログラムへの入力ファイルであっても構わない。最適化プログラムで処理を繰り返すことにより出力ファイルの更なる最適化を図ることが可能となる。
【００７８】
また、命令フェッチ境界検出手段の認識する命令フェッチ幅は、固定である必要はなく、例えば、それぞれのメモリ領域毎に異なる値を設定しても構わない。その場合には、命令フェッチ境界検出手段は、アドレス情報で命令フェッチ幅を判断する。
【００７９】
また、命令フェッチ幅情報は、プログラム生成装置に組み込んでも構わないし、外部から情報を与えても構わない。具体的には、コンパイラやアセンブラやリンカに、定数として組み込んだ形で指定しても構わないし、引き数や環境ファイルの形で指定しても構わない。また、指定する命令フェッチ幅は一定でも構わないし、空間毎に個別に与えても構わない。
【００８０】
（第２の実施の形態）
本実施の形態は、可変長命令についても効率よく命令を実行できるプロセッサ等に関するものである。
（１）プロセッサ
図３は本発明第２の実施の形態におけるＶＬＩＷプロセッサのブロック図である。このプロセッサは、３２ビットと６４ビットの２通りの単位命令を持ち、最大４つの単位命令から構成される可変長の一語を同時に実行可能なＶＬＩＷプロセッサである。
【００８１】
基本的な構造は図１のＶＬＩＷプロセッサと同じであるが、可変長命令を扱うために、（ａ）命令供給発行部２２０において、メモリ１１０から１２８バイト単位で命令フェッチした命令を命令バッファ２２５を用いて命令バッファ中に３２ビットを１単位とし最大８個のレジスタに格納している点、（ｂ）３２ビット命令または６４ビット命令を切り換えるためにセレクタ２２９を有している点で異なる。
【００８２】
このＶＬＩＷプロセッサは同時に実行できる４つの命令が２回の命令フェッチによって初めて供給されるものであっても、４つの命令の命令フェッチを待たずにデコード、実行するものである。なお、同時に実行できる最大の命令数は４つであるが、命令中に埋め込まれた同時実行できる命令の境界情報により、４以下の同時実行できる命令の数を指定できるが、この機構については図面を省略している。
【００８３】
以上の様に構成された本発明の第２の実施の形態のプロセッサについて、以下、その動作を説明する。
（命令供給部２２０）
まず、命令供給発行部２２０内の命令フェッチ制御部２２１は、ＰＣ２０２、クロック２０１に基づいて実行する命令のアドレスをアドレスバス２１１からメモリ２１０に与える。これにより、メモリ２１０は命令を１２８ビットのデータバス２１２を介して、命令レジスタ２２２内の４つの命令レジスタに３２ビットづつ命令を供給する。命令レジスタ２２２は、クロック２０１に基づいてメモリ２１０から供給されたデータを格納する。これとともに、４つの命令レジスタに命令を格納したことを表すため、格納フラグ２２３を（００００１１１１）₂とする。なお、命令バッファ２２５は１２８バイトで命令フェッチされた命令を一旦格納しておくことにより、命令レジスタ２２２に最大２５６ビットの命令を格納するためのものである。
（命令解読部２３０）
次に、命令解読部２３０におけるデコーダ２３２のうち第１命令解読器は一番左端のセレクタ２２９の出力をデコードする。デコードの際には、命令が３２ビット命令である６４ビット命令かを認識し命令長情報２４１とデコード結果２４２とを出力する。具体的には、図４に示すように３２ビットを１単位する先頭に３２ビット命令か６４ビット命令かを示すフォーマット情報が割り当てられているので、この情報をそのまま命令長情報２４１として出力する。なお、セレクタ２２９はそれぞれ、命令が３２ビット命令であるか６４ビット命令であるかに関係なく常に６４ビットのデータを出力する。
【００８４】
デコーダ２３２のうち第１命令発行器は、格納フラグ２２３の値（００００１１１１）₂を用いて命令が供給されているか否かを判断する。具体的には、命令が３２ビット命令であった場合には、使用フラグ更新部２４０が（００００００００）₂を命令長情報２４１に基づいて左から”１”を入れつつ右に１ビットシフトし（１０００００００）₂を得る。そして、これと格納フラグ２２３の値（００００１１１１）₂とについてそれぞれのビット単位で論理積を演算し、（００００００００）₂となった場合（すべてのビットが”０”）には命令が供給されていると判断し”１”をキャンセル信号２３４として出力する。なお、６４ビット命令の場合、使用フラグ更新部２４０は左から”１”を入れつつ右に２ビットシフトし（（１１００００００）₂を得て、格納フラグ２２３の値（００００１１１１）₂ついてそれぞれのビットの論理積を演算し、（００００００００）₂を得て命令が供給されていると判断し”１”をキャンセル信号２３４として出力する。なお、使用フラグ更新部２４０は、キャンセル信号２３４が”０”すなわち命令供給不足であった場合、シフトはしない。
【００８５】
一番左端の格納フラグシフタ２３９は、命令長情報２４１に基づいて、右から”１”を入れつつ格納フラグ２２３を左シフトする。具体的には、第１命令解読部で３２ビット命令を解読した場合は格納フラグ２２３（００００１１１１）₂を１ビット左にシフトして（０００１１１１１）₂を得てこれを第２命令発行器に渡す。６４ビット命令であった場合は、２ビット左にシフトして（００１１１１１１）₂を得てこれを第２命令発行器に渡す。例えば、格納フラグ２２３が（００００１１１１）₂であるにも関わらず、第１、２命令解読部でそれぞれ６４ビット命令が解読された場合、第３命令発行器は格納フラグシフタ２３９から（１１１１１１１１）₂を受け取り、命令供給不足と判断する。これとともに、第２命令解読器に対応したセレクタ２３９で選択すべき命令レジスタ２２２を切り換える。なお、第１〜第４命令解読器で使用したビット数は使用フラグ更新部２４０で計算され、使用フラグ２２４として格納される。
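The supply check performed by the instruction issuers can be approximated as follows. The sketch reads the storage flag 223 as an 8-bit value in which a 0 bit marks a 32-bit slot that holds fetched data and a 1 bit marks an empty slot; that reading is inferred from the examples in the text (for instance, (11111111)₂ being judged as a supply shortage) and is an assumption, as are the function name and the list-based interface.

```python
def issue_word(lengths, store_flag=0b00001111):
    """Return one cancel signal per unit instruction (1 = valid, 0 = NOP).

    lengths: instruction sizes in 32-bit units (1 for a 32-bit instruction,
             2 for a 64-bit instruction), in issue order.
    """
    cancels = []
    flag = store_flag
    for n in lengths:
        # n ones at the left: (10000000) for 32-bit, (11000000) for 64-bit.
        mask = ((1 << n) - 1) << (8 - n)
        supplied = (mask & flag) == 0      # all needed slots hold data
        cancels.append(1 if supplied else 0)
        # Storage flag shifter: shift left by n, inserting 1s from the right.
        flag = ((flag << n) | ((1 << n) - 1)) & 0xFF
    return cancels


if __name__ == "__main__":
    # Three 32-bit instructions followed by one 64-bit instruction.
    print(issue_word([1, 1, 1, 2]))
    # Two 64-bit instructions followed by two 32-bit instructions.
    print(issue_word([2, 2, 1, 1]))
```

In the first call the fourth instruction does not fit in the 128 bits fetched so far, so only three instructions are issued; in the second call the third issuer sees (11111111)₂ and reports a shortage, as in the example in the text.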
【００８６】
そして、ＮＯＰ生成器２３７はデコード結果を出力する。ＮＯＰ生成器２３７は図１のＮＯＰ生成器１３７と同じで、解読器２３６の出力とキャンセル信号２３４との論理積を演算するＡＮＤ回路である。すなわち、キャンセル信号２３４が”０”となっているときは、解読器２３６の出力に関わらず、ＮＯＰを意味する”０”を出力する。
【００８７】
次に、図１６のプログラムを実行した場合のパイプラインの流れについて、図５を用いて説明する。
【００８８】
図１６では、（１０００００００）₁₆番地に、メモリから読み込んだ結果をｒ０レジスタに格納させる命令”ｍｏｖ（ｍｅｍ）、ｒ０”が、（１００００００４）₁₆番地にはレジスタｒ１の値を１つ増加させる命令”ａｄｄ＃１、ｒ１、ｒ１”が、以下、同様に（１０００００１Ｆ）₁₆番地まで命令が配置されている。なお、本命令中で、”ａｄｄ＃１２３４５６７８、ｒ３、ｒ３”命令は６４ビット単位命令であり、他は３２ビット単位命令である。
【００８９】
この場合、図５に示すように、（１０００００１０）₁₆番地の命令は６４ビット長の命令であるため、タイミングｔ１、ｔ２の２回の命令フェッチによって初めて４つの命令が揃うが、このプロセッサでは図５に示すように２回目の命令フェッチをまたずに（１０００００００）₁₆番地の命令”ｍｏｖ（ｍｅｍ）、ｒ０”を含む３つの命令をデコード、実行する。そして、タイミングｔ６でレジスタｒ０が使用できる状態になる。
【００９０】
一方、タイミングｔ３で（１０００００２９）₁₆番地の命令”ａｄｄ＃１、ｒ０、ｒ０”の命令フェッチが行われ、タイミングｔ５でＥＸステージに入るが、レジスタｒ０が使用できる状態にまだなっていないためレジスタ干渉によるパイプラインインタロックが発生する。そして、タイミングｔ６でレジスタｒ０は使用できる状態になっているため、”ａｄｄ＃１、ｒ０、ｒ０”が実行される。結果として、すべての命令を実行するまでに８サイクル必要となる。
【００９１】
図１７に示すパイプラインの流れと図５に示すパイプラインの流れとを比較すると、（１０００００２０）₁₆番地の命令”ａｄｄ＃１、ｒ０、ｒ０”がＥＸステージに入るのはタイミングｔ５で同一である。しかし、（１０００００００）₁₆番地の命令”ｍｏｖ（ｍｅｍ）、ｒ０”がＷＢステージを完了するのが、図１７ではタイミングｔ７であるのに対し、図５ではタイミングｔ６である点で異なる。これは、図１７では並列実行する４つの命令全てがそろった段階でデコード、実行されているのに対し、図５では２回目の命令フェッチを待たず（４つ目の命令が命令フェッチされるのを待たずに）にデコード、実行を行っているからである。このため、図１７ではすべての命令を実行するまでに９サイクル必要（タイミングｔ５、ｔ６でパイプラインインタロックが発生）であるのに対し、図５では８サイクルで実行が完了（タイミングｔ５でのみパイプラインインタロックが発生）している。
【００９２】
なお、タイミングｔ２で、（１０００００１０）₁₆番地の命令”ａｄｄ＃１２３４５６７８、ｒ３、ｒ３”命令がフェッチされると同時に、（１０００００２０）₁₆番地までの命令もフェッチされるが、”ａｄｄ＃１２３４５６７８、ｒ３、ｒ３”命令が同時に実行できる命令の境界であるため、この命令のみをタイミングｔ３で実行する。
【００９３】
また、本実施の形態では、４つの命令を同時に実行できるハードウェアを持つＶＬＩＷプロセッサに対し、常に４つの命令を供給することを前提としているが、同じハードウェアに対して、同時実行できる命令の境界を示す技術を用いて４つ未満の命令を供給するものとしても良い。この場合であっても、同時実行できる命令の数に満たない場合であっても、１回の命令フェッチごとにデコード、実行を行う。
（プログラム生成装置）
（第１のプログラム生成装置の構成）
図６は本発明の第２の実施の形態における第１のプログラム生成装置のブロック図である。
【００９４】
基本的な構造は第１の実施の形態の第１のプログラム生成装置と同じであるが、単位命令や一語のビット幅が可変であることに起因して、回避対象コード検出手段３２０、および、逐次実行保証コード生成手段３３０が、単位命令中の並列実行境界情報３０１、および、フォーマット情報３０２を認識する点が異なる。
【００９５】
（命令列生成装置の動作）
以上の様に構成された本発明の第２の実施の形態の第１のプログラム生成装置について、以下、具体的な命令を解読実行した場合の動作を説明する。
【００９６】
図１１（ａ）は、ソースコード格納手段３００に格納された従来のＶＬＩＷプロセッサ用のプログラム生成装置が生成した命令列である。
【００９７】
まず、（１０００００００）₁₆番地から始まる一語の処理を行う。
回避対象コード検出手段３２０はソースコード格納手段３００に格納された（１０００００００）₁₆番地から始まる命令列一語分”ａｄｄ＃１、ｒ０、ｒ０＆ａｄｄ＃１、ｒ１、ｒ１＆ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１２３４５６７８、ｒ３、ｒ３”を入力し、その命令列中で、一語を同時実行した場合と一語内の単位命令を逐次実行した場合で実行結果が異なる命令列がないかを検査する。この命令列中には問題となる命令列は存在しないので、回避対象コード検出手段３２０は問題となる命令列を出力しない。
【００９８】
逐次実行保証コード生成手段３３０は、回避対象コード検出手段３２０が回避対象命令列を出力しないので、ソースコード格納手段３００に格納された（１０００００００）₁₆番地から始まる命令列一語分をそのまま命令列格納手段３４０へ出力する。
【００９９】
次に、後続する（１０００００１４）₁₆番地から始まる一語の処理を行う。
回避対象コード検出手段３２０はソースコード格納手段３００に格納された（１０００００１４）₁₆番地から始まる命令列一語分”ａｄｄｒ０、ｒ１、ｒ０＆ｓｕｂ＃１２３４５６７８、ｒ０、ｒ１＆ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を入力し、その命令列中で一語を同時実行した場合と一語内の単位命令を逐次実行した場合で実行結果が異なる命令列がないかを検査する。この命令列中には、”ａｄｄｒ０、ｒ１、ｒ０＆ｓｕｂ＃１２３４５６７８、ｒ０、ｒ１”が該当する命令となる。
【０１００】
逐次実行保証コード生成手段３３０は、回避対象コード検出手段３２０の出力する回避対象命令列”ａｄｄｒ０、ｒ１、ｒ０＆ｓｕｂ＃１２３４５６７８、ｒ０、ｒ１”の情報を用いて、ソースコード格納手段３００に格納された命令列を、同時実行した場合と逐次実行した場合で動作が同一になる命令列への変換を行う。後続する命令列を参照し、使用していないレジスタとしてｒ４レジスタを使い、回避対象命令列中の命令”ａｄｄｒ０、ｒ１、ｒ０”を命令”ａｄｄｒ０、ｒ１、ｒ４”に変換すると共に、後続するｒ０を参照する命令を検索し、命令”ａｄｄ＃１、ｒ０、ｒ０”を命令”ａｄｄ＃１、ｒ４、ｒ０”に変換した後、命令列格納手段３４０に出力する。
【０１０１】
以降、（１０００００２８）₁₆番地から始まる命令列一語を処理する事によって、”ａｄｄｒ１、ｒ２、ｒ１＆ｓｕｂ＃１２３４５６７８、ｒ１、ｒ２；ａｄｄ＃１、ｒ１、ｒ１をａｄｄｒ１、ｒ２、ｒ５＆ｓｕｂ＃１２３４５６７８、ｒ１、ｒ２；ａｄｄ＃１、ｒ５、ｒ１”に、（１０００００３ｃ）₁₆番地から始まる命令列一語を処理することによって、”ａｄｄｒ２、ｒ３、ｒ２＆ｓｕｂ＃１２３４５６７８、ｒ２、ｒ３”を”ａｄｄｒ２、ｒ３、ｒ６＆ｓｕｂ＃１２３４５６７８、ｒ２、ｒ３”に変換する。
【０１０２】
以上の処理によって、回避対象コード検出手段３２０は、図１１（ｂ）の様に、網かけ部分の命令列を検出し、逐次実行保証コード生成手段３３０は、図１１（ｃ）の様に、回避対象コード検出手段３２０の出力する網かけ部分の命令列の出力レジスタを変更すると共に、後続する語に含まれる、出力レジスタを参照する濃い網かけ部分の命令列の参照レジスタを変更し、命令列を命令列格納手段３４０へ出力する。
【０１０３】
（第２のプログラム生成装置の構成）
図９は本発明の第２の実施の形態における第２のプログラム生成装置のブロック図である。
【０１０４】
基本的な構造は第１の実施の形態の第２のプログラム生成装置と同じであるが、単位命令や一語のビット幅が可変であることに起因して、回避対象コード検出手段３２０、および、逐次実行保証コード生成手段３３０が、フォーマット情報３０２を認識する点、及び、回避対象コード検出手段３２０において、命令フェッチ境界が単位命令中にあった場合には、命令フェッチ境界が該当する単位命令の先頭に存在すると見なして評価する点、及び、命令フェッチ境界検出手段の検出する命令フェッチ幅が目的とするプロセッサの命令フェッチ幅である１２８ビットとなっている点が異なる。
【０１０５】
（命令列生成装置の動作）
次に具体的な命令を解読実行した場合の本命令列生成装置の動作について説明する。
【０１０６】
図１２（ａ）は、ソースコード格納手段３００に格納された従来のＶＬＩＷプロセッサ用のプログラム生成装置が生成した命令列である。
【０１０７】
まず、（１０００００００）₁₆番地から始まる一語の処理を行う。
命令境界検出手段３１０はソースコード格納手段３００に格納された（１０００００００）₁₆番地から始まる命令列一語分中の命令境界である、（１０００００１０）₁₆番地を検出する。
【０１０８】
回避対象コード検出手段３２０はソースコード格納手段３００に格納された（１０００００００）₁₆番地から始まる命令列一語分”ａｄｄ＃１、ｒ０、ｒ０＆ａｄｄ＃１、ｒ１、ｒ１＆ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１２３４５６７８、ｒ３、ｒ３”を入力し、その命令列中で、一語を同時実行した場合と一語内の命令境界検出手段３１０の出力する命令フェッチ境界を単位として単位命令を逐次実行した場合で実行結果が異なる命令列がないかを検査する。この命令列中には問題となる命令列は存在しないので、回避対象コード検出手段３２０は問題となる命令列を出力しない。
【０１０９】
逐次実行保証コード生成手段３３０は、回避対象コード検出手段３２０が回避対象命令列を出力しないので、ソースコード格納手段３００に格納された（１０００００００）₁₆番地から始まる命令列一語分をそのまま命令列格納手段３４０へ出力する。
【０１１０】
次に、後続する（１０００００１４）₁₆番地から始まる一語の処理を行う。
命令境界検出手段３１０はソースコード格納手段３００に格納された（１０００００１４）₁₆番地から始まる命令列一語分中の命令境界である、（１０００００２０）₁₆番地を検出する。
【０１１１】
回避対象コード検出手段３２０はソースコード格納手段３００に格納された（１０００００１４）₁₆番地から始まる命令列一語分”ａｄｄｒ０、ｒ１、ｒ０＆ｓｕｂ＃１２３４５６７８、ｒ０、ｒ１＆ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を入力し、その命令列中で、一語を同時実行した場合と、一語内の命令境界検出手段３１０の出力する命令フェッチ境界を単位として単位命令を逐次実行した場合で実行結果が異なる命令列がないかを検査する。つまり、命令列一語分”ａｄｄｒ０、ｒ１、ｒ０＆ｓｕｂ＃１２３４５６７８、ｒ０、ｒ１＆ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を同時実行した場合と、”ａｄｄｒ０、ｒ１、ｒ０＆ｓｕｂ＃１２３４５６７８、ｒ０、ｒ１”の２つの単位命令と”ａｄｄ＃１、ｒ２、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”の２つの単位命令を逐次実行した場合に実行結果が異なる事はないかを検査する。この命令列中にも問題となる命令列は存在しないので、回避対象コード検出手段３２０は問題となる命令列を出力しない。
【０１１２】
逐次実行保証コード生成手段３３０は、回避対象コード検出手段３２０が回避対象命令列を出力しないので、ソースコード格納手段３００に格納された（１０００００１４）₁₆番地から始まる命令列一語分をそのまま命令列格納手段３４０へ出力する。
【０１１３】
次に、後続する（１０００００２８）₁₆番地から始まる一語の処理を行う。
命令境界検出手段３１０はソースコード格納手段３００に格納された（１０００００２８）₁₆番地から始まる命令列一語分中の命令境界である、（１０００００３０）₁₆番地を検出する。
【０１１４】
回避対象コード検出手段３２０はソースコード格納手段３００に格納された（１０００００２８）₁₆番地から始まる命令列一語分”ａｄｄ＃１、ｒ０、ｒ０＆ａｄｄｒ１、ｒ２、ｒ１＆ｓｕｂ＃１２３４５６７８、ｒ１、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を入力し、その命令列中で、一語を同時実行した場合と、一語内の命令境界検出手段３１０の出力する命令フェッチ境界を単位として単位命令を逐次実行した場合で実行結果が異なる命令列がないかを検査する。つまり、命令列一語分”ａｄｄ＃１、ｒ０、ｒ０＆ａｄｄｒ１、ｒ２、ｒ１＆ｓｕｂ＃１２３４５６７８、ｒ１、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”を同時実行した場合と、”ａｄｄ＃１、ｒ０、ｒ０＆ａｄｄｒ１、ｒ２、ｒ１”の２つの単位命令と”ｓｕｂ＃１２３４５６７８、ｒ１、ｒ２＆ａｄｄ＃１、ｒ３、ｒ３”の２つの単位命令を逐次実行した場合に実行結果が異なる事はないかを検査する。この場合、”ａｄｄｒ１、ｒ２、ｒ１＆ｓｕｂ＃１２３４５６７８、ｒ１、ｒ２”命令が該当する命令となる。
【０１１５】
逐次実行保証コード生成手段３３０は、回避対象コード検出手段３２０の出力する回避対象命令列”ａｄｄｒ１、ｒ２、ｒ１＆ｓｕｂ＃１２３４５６７８、ｒ１、ｒ２”の情報を用いて、ソースコード格納手段３００に格納された命令列を、同時実行した場合と逐次実行した場合で動作が同一になる命令列への変換を行う。後続する命令列を参照し、使用していないレジスタとしてｒ４レジスタを使い、回避対象命令列中の命令”ａｄｄｒ１、ｒ２、ｒ１”を命令”ａｄｄｒ１、ｒ２、ｒ５”に変換すると共に、後続するｒ１を参照する命令を検索し、命令”ａｄｄ＃１、ｒ１、ｒ１”を命令”ａｄｄ＃１、ｒ５、ｒ１”に変換した後、命令列格納手段３４０に出力する。
【０１１６】
以降、（１０００００３０）₁₆番地から始まる命令列一語は問題が無いのでそのまま命令列格納手段３４０に出力する。
【０１１７】
以上の処理によって、命令フェッチ境界検出手段３１０は図１２（ａ）の太線で示す命令フェッチ境界情報を出力し、回避対象コード検出手段３２０は、図１２（ａ）の様に、網かけ部分の命令列を検出し、逐次実行保証コード生成手段３３０は、図１２（ｂ）の様に、回避対象コード検出手段３２０の出力する網かけ部分の命令列の出力レジスタを変更すると共に、後続する語に含まれる、出力レジスタを参照する濃い網かけ部分の命令列の参照レジスタを変更し、命令列を命令列格納手段３４０へ出力する。
【０１１８】
なお、本実施の形態では、命令フェッチ幅１２８ビット、３２ビットと６４ビットの可変長、最大同時実行４命令のＶＬＩＷプロセッサを想定しているが、これらの値は特に限定しない。
【０１１９】
また、逐次実行保証コード生成手段は、命令列中で使用されていないレジスタを検索し、問題となる命令列中の問題となるレジスタを出力する命令の出力レジスタを使用されていないレジスタで置き換えると共に、後続する語で問題となるレジスタを参照する命令の参照レジスタを置き換えたレジスタに置き換えるアルゴリズムで説明を行ったが、第１の実施例における第２のプログラム生成装置と同じく、あらかじめ問題となるレジスタを使用されていないレジスタに転送し、問題となるレジスタを参照する命令の参照レジスタを置き換えたレジスタに置き換えるアルゴリズムを行っても構わない。
【０１２０】
また、回避対象コード検出手段が出力する命令列は、出力命令と参照命令の組み合わせであるので、２命令とは限らない。参照命令が複数ある場合には３命令以上の組み合わせになる場合も存在する。
【０１２１】
また、命令列格納手段は、フロッピーディスクやテープやハードディスクやメモリなどの記録媒体でも構わないし、コンパイラやアセンブラオプティマイザ等の最適化プログラムへの入力ファイルであっても構わない。最適化プログラムで処理を繰り返すことにより出力ファイルの更なる最適化を図ることが可能となる。
【０１２２】
また、命令フェッチ境界検出手段の認識する命令フェッチ幅は、固定である必要はなく、例えば、それぞれのメモリ領域毎に異なる値を設定しても構わない。その場合には、命令フェッチ境界検出手段は、アドレス情報で命令フェッチ幅を判断する。
【０１２３】
また、命令フェッチ幅情報は、プログラム生成装置に組み込んでも構わないし、外部から情報を与えても構わない。具体的には、コンパイラやアセンブラやリンカに、定数として組み込んだ形で指定しても構わないし、引き数や環境ファイルの形で指定しても構わない。また、指定する命令フェッチ幅は一定でも構わないし、空間毎に個別に与えても構わない。
【０１２４】
【発明の効果】
以上のように、本願発明によれば、命令供給が十分に行えない環境で使用されても供給されたものから実行する事により、性能劣化を抑制することができる。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態におけるプロセッサのブロック構成図
【図２】本発明の第１の実施の形態における第１のプログラム例及びパイプライン図
【図３】本発明の第１、第２の実施の形態における第２のプログラム例及びパイプライン図
【図４】本発明の第１、第２の実施の形態における第１のプログラム生成装置のブロック図
【図５】本発明の第１、第２の実施の形態における第１のプログラム生成装置におけるプログラム図
【図６】本発明の第１、第２の実施の形態における第１のプログラム生成装置のブロック図
【図７】本発明の第１の実施の形態における第１のプログラム生成装置における回避対象コード検出手段の検出アルゴリズムを示す図
【図８】本発明の第１の実施の形態における第１のプログラム生成装置のプログラム図
【図９】本発明の第１、第２の実施の形態における第２のプログラム生成装置のブロック図
【図１０】本発明の第１の実施の形態における第２のプログラム生成装置のプログラム図
【図１１】本発明の第２の実施の形態における第１のプログラム生成装置のプログラム図
【図１２】本発明の第２の実施の形態における第２のプログラム生成装置のプログラム図
【図１３】第１の従来例におけるプロセッサのブロック構成図
【図１４】第１のプログラム例を示す図
【図１５】従来例における第１のプログラム例のパイプライン図
【図１６】第２のプログラム例を示す図
【図１７】従来例における第２のプログラム例のパイプライン図
【符号の説明】
１０１、２０１クロック
１０２、２０２ＰＣ
１１０、２１０メモリ
１１１、２１１アドレスバス
１１２、２１２データバス
１２０、２２０命令供給発行部
１２１、２２１命令フェッチ制御部
１２２、２２２命令レジスタ
１２３命令フェッチフラグ
１２４位置情報
１３０、２３０命令解読部
１３１キャンセル信号生成部
１３２、２３２デコーダ
１３３、２３３レジスタ
１３４、１３５、２３４キャンセル信号
１３６、２３６解読器
１３７、２３７ＮＯＰ信号生成器
２２３格納フラグ
２２４使用フラグ
[0001]
[Technical field of the invention]
The present invention relates to a VLIW processor, a program generation device, and a recording medium that suppress performance degradation, even when used in an environment where instructions cannot be supplied fast enough, by executing instructions starting from those that have already been supplied.
[0002]
[Prior art]
With the increasing functionality and speed of microprocessor-based products in recent years, microprocessors with high processing capability (hereinafter simply referred to as “processors”) are in demand. For this reason, processors have recently come to execute a plurality of instructions simultaneously in one cycle.
[0003]
There are two methods for realizing instruction level parallel processing: dynamic scheduling and static scheduling.
[0004]
A superscalar method is a typical example of the dynamic scheduling. In this method, after decoding an instruction code at the time of execution, the dependency relationship between instructions is dynamically analyzed by hardware to determine whether or not parallel execution is possible, and an appropriate combination of instructions is executed in parallel. A typical example of the static scheduling is a VLIW (Very Long Instruction Word) system. In this method, a dependency relationship between instructions is statically analyzed by a compiler or the like when an execution code is generated, and an instruction stream with high execution efficiency is generated by moving the instruction code. In the general VLIW system, a plurality of instructions (here called “unit instructions”) that can be executed simultaneously are described in one fixed-length instruction supply unit (here called “one word”). Adopting this method has the advantage that the hardware can be simplified because it is not necessary to perform dependency analysis between instructions in hardware.
[0005]
The operation of a prior-art VLIW processor is described below with reference to FIG. 13.
[0006]
FIG. 13 is a configuration diagram of a VLIW processor according to the prior art. Reference numeral 10 denotes a memory storing data, instructions, and the like; 20 denotes an instruction supply/issuance unit that fetches instructions and the like from the memory 10; and 30 denotes an instruction decoding unit that decodes the instructions fetched by the instruction supply/issuance unit 20 and passes the decoding results to the instruction execution unit. The instruction supply/issuance unit 20 consists of an instruction fetch control unit 21, which controls the fetching of instructions and the like from the memory 10, and an instruction register 22, which stores the instructions and the like fetched from the memory 10. The instruction decoding unit 30 consists of an instruction issuance control unit 31, which controls the issuance of instructions, a decoder 32, and a register 33, which stores the decoding results. This processor is a VLIW processor that can simultaneously execute one word composed of four 32-bit unit instructions, and it fetches instructions in units of 128 bits.
[0007]
First, the instruction fetch control unit 21 in the instruction supply/issuance unit 20 supplies the address of the instruction to be executed to the memory 10 over the address bus 11, based on the PC 2 (program counter) and clock 1. The memory 10 then supplies the instructions at the designated address over the 128-bit data bus, 32 bits each, to the four instruction registers in the instruction register 22. The instruction register 22 stores the data supplied from the memory 10 in synchronization with clock 1. At the same time, the instruction fetch flag 23, which indicates that the instruction fetch has completed, is set to “1”. At this point, all four instruction registers 22 always hold instructions. When an instruction fetch has just been started (for example, after a jump instruction or an interrupt), the instruction fetch flag 23 is set to “0” to prevent erroneous instructions from being decoded, and the cancel signal 34 causes the decoder to output a NOP (No Operation).
[0008]
Next, the decoder 32 in the instruction decoding unit 30 learns from the instruction fetch flag 23 that instructions have been stored in the instruction register 22, and outputs the result of decoding them. The register 33 then stores the decoded result in synchronization with clock 1.
[0009]
Finally, the decoding result stored in the register 33 is supplied to the instruction execution unit (not shown), and the instruction is executed.
[0010]
[Problems to be solved by the invention]
However, in the conventional VLIW processor described above, when instructions are fetched in units smaller than one word length, or when instructions are of variable length, the instructions arrive at the instruction registers at different times, and performance can degrade as a result.
[0011]
That is, the conventional VLIW processor assumes that the length of one word equals the instruction fetch unit; however, when VLIW is applied to an embedded microcomputer, the instruction fetch width may have to be made smaller than the width of one word for cost reasons.
[0012]
Even if the maximum word length matches the instruction fetch unit, with variable-length instructions a single instruction may become available only after two instruction fetches.
[0013]
A concrete description follows with reference to the drawings.
(1) When instruction fetch is performed in units smaller than one word length
FIG. 14 shows an example of a program, and FIG. 15 explains the flow of the pipeline when the program is executed.
[0014]
In FIG. 14, the instruction “mov (mem), r0”, which stores the result read from memory in the register r0, is placed at address (10000000)₁₆; the instruction “add #1, r1, r1”, which increments the value of the register r1 by 1, is placed at address (10000004)₁₆; and instructions are placed in the same manner up to address (1000001F)₁₆.
[0015]
In this case, as shown in FIG. 15, the two 32-bit instructions starting at address (10000000)₁₆ are fetched at timing t1, and the two 32-bit instructions starting at address (10000008)₁₆ are fetched at timing t2. At timing t3 the four instructions are decoded simultaneously, and they are executed at t4. However, the instruction “mov (mem), r0” at address (10000000)₁₆ writes the result read from memory in the MEM stage to the register r0, while the subsequent instruction “add #1, r0, r0” at address (1000000C)₁₆ uses the contents of the register r0, which cannot be referenced until the register is written in the WB stage. Register interference therefore occurs, and the instruction “add #1, r0, r0” at address (1000000C)₁₆ cannot be executed at timing t6 and is executed at timing t7 instead.
[0016]
As a result, nine cycles are required to execute all instructions due to insufficient supply of instructions and register interference.
(2) When the instruction is variable length
FIG. 16 shows an example of a program, and FIG. 17 explains the flow of a pipeline when the program is executed.
[0017]
In FIG. 16, the instruction “mov (mem), r0”, which stores the result read from memory in the register r0, is placed at address (10000000)₁₆; the instruction “add #1, r1, r1”, which increments the value of the register r1 by 1, is placed at address (10000004)₁₆; and instructions are placed in the same manner up to address (1000001F)₁₆. Among these instructions, “add #12345678, r3, r3” is a 64-bit unit instruction, and the others are 32-bit unit instructions.
[0018]
In this case, as shown in FIG. 17, because a 64-bit instruction is included, the four instructions become available only after the two instruction fetches at timings t1 and t2; the four instructions are decoded simultaneously at timing t3 and executed at t4. However, the instruction “mov (mem), r0” at address (10000000)₁₆ writes the result read from memory in the MEM stage to the register r0, while the instruction “add #1, r0, r0” at address (10000010)₁₆ uses the contents of the register r0, which cannot be referenced until the register is written in the WB stage. Register interference therefore occurs, and the instruction “add #1, r0, r0” at address (10000010)₁₆ cannot be executed at timing t6 and is executed at timing t7 instead.
[0019]
As a result, nine cycles are required to execute all instructions due to insufficient supply of instructions and register interference.
[0020]
As described above, the conventional VLIW processor aims at higher speed by simplifying the hardware as much as possible, and is therefore premised on all instructions that can be processed in parallel being available so that they are executed simultaneously. When this premise does not hold, there is the problem that sufficient performance cannot be achieved.
[0021]
The present invention solves the above conventional problems, and its object is to provide a processor that can deliver sufficient performance even when instructions are fetched in units smaller than one word length or when instructions are variable length.
[0022]
[Means for Solving the Problems]
The present invention is a VLIW processor that, even when not all of the instructions that can be executed in parallel have been fetched, executes first the instructions that have already been fetched.
[0023]
[Detailed Description of the Invention]
Hereinafter, the present invention will be described in detail with reference to the drawings.
[0024]
(First embodiment)
The present embodiment relates to a processor and the like that can execute instructions efficiently even when instructions are fetched in units smaller than one word length. That is, even in a VLIW processor that can execute four instructions at the same time, decoding and execution start as soon as two instructions are ready, which reduces pipeline interlocks due to register interference as much as possible. In addition, when an instruction executed earlier is an I/O-related instruction, its data can be obtained sooner.
(1) Processor
FIG. 1 is a block diagram of a processor according to the first embodiment of the present invention. Compared with the conventional VLIW processor shown in FIG. 13, it differs in that (a) the data bus 112 is 64 bits wide, which is smaller than one word length, (b) it has position information 124 indicating whether instructions have been stored in the left two of the four instruction registers or in the right two, and (c) it has cancel signals 134 and 135 for outputting NOPs.
[0025]
The processor recognizes from the position information 124 where in the instruction register 122 instructions have been stored, generates the cancel signals 134 and 135 from this information, and outputs NOPs accordingly, so that instructions can be decoded and executed in the order in which they are stored in the instruction register 122.
[0026]
First, the instruction fetch control unit 121 in the instruction supply/issue unit 120 gives the address of the instruction to be executed to the memory 110 via the address bus 111, based on the PC 102 and the clock 101. The memory 110 then supplies a 32-bit instruction to each of the left two instruction registers in the instruction register 122 over the 64-bit data bus 112. The instruction register 122 stores the data supplied from the memory 110 in synchronization with the clock 101. At the same time, the instruction fetch flag 123 is set to “1” to indicate that the instruction fetch has completed, and the position information 124 is set to “0” to indicate that the instructions have been stored in the left two registers of the instruction register 122. At this point, of the four instruction registers 122, the two on the left hold instructions but the two on the right do not. As in the conventional case, when the instruction fetch has not completed, the instruction fetch flag 123 is “0”, so the cancel signals 134 and 135 are both “0” and the NOP generators 137 output NOPs.
[0027]
Next, the decoder 132 in the instruction decoding unit 130 learns from the instruction fetch flag 123 that instructions have been stored in the instruction register 122 and outputs the result of decoding them. At this time, since the position information 124 is “0”, indicating that instructions are stored only in the left two instruction registers of the instruction register 122, the cancel signal generator 131 sets the cancel signal 134 to “1” and the cancel signal 135 to “0”. As a result, of the NOP generators 137 in the decoder 132, the left two output the decoding results of the instructions stored in the instruction register 122 and the right two output NOPs. The register 133 stores the decoding result in synchronization with the clock 101. Each NOP generator 137 is an AND circuit that takes the logical product of the output of the instruction decoder 136 and the corresponding cancel signal. That is, when the cancel signal 134 or 135 is “0”, “0”, which denotes NOP, is output regardless of the output of the instruction decoder 136.
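The gating described above can be illustrated with a minimal Python sketch. It is only a behavioral model under stated assumptions, not the circuit of FIG. 1: decoder outputs are modeled as hypothetical 32-bit integers, the position value “1” for the right-hand pair is an assumption (the text only fixes “0” for the left-hand pair), and the function names are illustrative.

```python
NOP = 0          # the text states that "0" denotes NOP
MASK32 = 0xFFFFFFFF

def cancel_signals(fetch_flag, position):
    """Return (cancel_134, cancel_135) for the left and right register pairs.

    Assumption: position 0 means the left pair was just filled and
    position 1 means the right pair was just filled.
    """
    if fetch_flag == 0:            # fetch not complete: both pairs cancelled
        return (0, 0)
    return (1, 0) if position == 0 else (0, 1)

def nop_generators(decoded, fetch_flag, position):
    """AND each decoder output with its cancel signal (replicated to 32 bits)."""
    c134, c135 = cancel_signals(fetch_flag, position)
    signals = [c134, c134, c135, c135]
    return [d & (MASK32 if s else 0) for d, s in zip(decoded, signals)]

# Hypothetical decoder outputs for the four slots.
decoded = [0x11, 0x22, 0x33, 0x44]
left_only = nop_generators(decoded, fetch_flag=1, position=0)
right_only = nop_generators(decoded, fetch_flag=1, position=1)
not_ready = nop_generators(decoded, fetch_flag=0, position=0)
```

With the left pair filled, only the first two slots carry decoding results and the other two carry NOP; the roles swap at the next fetch.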
[0028]
Finally, the decoding result stored in the register 133 is supplied to the instruction execution unit (not shown), and the instruction is executed.
[0029]
At the next instruction fetch, the fetched instructions are stored in the right two registers of the instruction register 122, the position information 124 is updated accordingly, the cancel signal 134 becomes “0”, and the cancel signal 135 becomes “1”.
[0030]
Next, the flow of the pipeline when the program shown in FIG. 14 is executed will be described with reference to FIG. 2.
[0031]
The pipeline of this processor is a five-stage pipeline consisting of a stage in which the instruction supply/issue unit 120 fetches instructions (hereinafter, IF stage), a stage in which the instruction decoding unit 130 decodes the fetched instructions (hereinafter, DEC stage), an execution stage in which the decoded instructions are executed using the arithmetic units (hereinafter, EX stage), a memory stage in which memory is accessed when the decoded instruction is a memory access instruction (hereinafter, MEM stage), and a write-back stage in which the result of the operation or memory access is reflected in a register (hereinafter, WB stage). Furthermore, the value of a register to which a result computed in the EX stage is written, such as the result of a register-to-register operation, can be bypassed from the EX stage or the MEM stage to the EX stage of a subsequent instruction without waiting for the actual register write in the WB stage, so that an instruction placed immediately after can refer to it.
[0032]
In FIG. 14, the instruction “mov (mem), r0”, which stores the result read from memory in the register r0, is placed at address (10000000)₁₆; the instruction “add #1, r1, r1”, which increments the value of the register r1 by 1, is placed at address (10000004)₁₆; and instructions are placed in the same manner up to address (1000001F)₁₆.
[0033]
In this case, as shown in FIG. 2, the two 32-bit instructions starting at address (10000000)₁₆ are fetched at timing t1. The two instructions are decoded simultaneously at timing t2 and executed at t3. By timing t6 the WB stage has finished and the contents of the register r0 are available for use.
[0034]
Meanwhile, the instruction “add #1, r0, r0” at address (10000018)₁₆ is fetched at timing t4 and enters the EX stage at timing t6. Since the register r0 is available at this point, no pipeline interlock due to register interference occurs. As a result, eight cycles are required to execute all the instructions.
[0035]
When the pipeline flow shown in FIG. 15 is compared with that shown in FIG. 2, the instruction “add #1, r0, r0” at address (10000018)₁₆ enters the EX stage at the same timing t6 in both. However, the instruction “mov (mem), r0” at address (10000000)₁₆ completes the WB stage at timing t6 in FIG. 15 but at timing t5 in FIG. 2. This is because in FIG. 15 the 64-bit instruction fetch is performed twice and decoding and execution begin only once 128 bits have been fetched, whereas in FIG. 2 decoding and execution begin as soon as one 64-bit instruction fetch has been performed, without waiting for the next 64-bit instruction fetch. For this reason, nine cycles are required to execute all the instructions in FIG. 15, whereas execution completes in eight cycles in FIG. 2.
[0036]
In the present embodiment, the instruction word length is 128 bits and the data bus is 64 bits wide, but the present invention is not limited to this. For example, the instruction word length may be 64 bits or 256 bits, and the data bus width may be another power of two, such as 32 bits or 16 bits. That is, it suffices that the data bus is narrower than the instruction word length, so that one word cannot be fetched in a single instruction fetch. In this case, the number of bits of the position information 124 and the number of cancel signals 134, 135 vary according to how many instruction fetches are needed to fetch one word. In the present embodiment, since one word is fetched in two instruction fetches, the position information 124 is 1 bit (which can represent two pieces of information) and two cancel signals are provided. Further, although the VLIW is assumed to execute four instructions simultaneously, the present invention is not limited to this.
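As a rough illustration of how these quantities scale, the following Python sketch computes the number of fetches per word and a plausible size for the position information. The text states only that the numbers vary with the fetch count; the rule of one cancel signal per fetch slot and the minimal binary encoding of the position are assumptions made for this sketch, and the function names are hypothetical.

```python
def fetches_per_word(word_bits, bus_bits):
    """Number of instruction fetches needed to fill one word (ceiling division)."""
    return -(-word_bits // bus_bits)

def position_info_bits(word_bits, bus_bits):
    """Bits needed to tell the fetch slots apart (assumed minimal binary encoding)."""
    return (fetches_per_word(word_bits, bus_bits) - 1).bit_length()

def cancel_signal_count(word_bits, bus_bits):
    """Assumed: one cancel signal per fetch slot, as with signals 134 and 135."""
    return fetches_per_word(word_bits, bus_bits)

# The embodiment: 128-bit word over a 64-bit data bus.
embodiment = (position_info_bits(128, 64), cancel_signal_count(128, 64))
```

For the 128-bit word and 64-bit bus of this embodiment the sketch gives 1 position bit and 2 cancel signals, matching the description.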
[0037]
In the present embodiment, the case where only the memory 110 is connected has been described, but a memory from which instructions are fetched 128 bits at a time may also be connected. For example, an internal memory may be fetched 128 bits at a time for speed, while an external memory is fetched 64 bits at a time for cost reasons, and the memory in use may be switched between them. In this case, performance degradation is prevented as far as possible not only when reading from the memory fetched 128 bits at a time but also when reading from the memory fetched 64 bits at a time.
(2) Program generation device
The processor according to the first embodiment has been described above. When a conventional program generation device for VLIW processors is applied to this processor, the execution result can vary with the state of instruction supply. For example, when one word contains four consecutive “add #1, r0, r0” instructions, the value of the register r0 increases by “1” if instruction supply is sufficient and the instructions in the word are executed simultaneously, whereas it increases by “4” if instruction supply is insufficient and the instructions in the word are executed sequentially, one unit instruction at a time.
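The difference between the two execution modes can be reproduced with a small Python model. This is a sketch under assumptions: unit instructions are represented as hypothetical tuples (op, src1, src2, dst) following the operand order used in this description, only “add” is modeled, and the function name is illustrative.

```python
def execute_word(word, regs, simultaneous):
    """Execute one word of (op, src1, src2, dst) unit instructions.

    simultaneous=True : every unit instruction reads the register values
                        from before the word started (parallel issue).
    simultaneous=False: each unit instruction sees the results of the
                        preceding ones (one-by-one issue).
    """
    result = dict(regs)
    for op, src1, src2, dst in word:
        if op != "add":
            raise ValueError("only 'add' is modeled in this sketch")
        source = regs if simultaneous else result
        a = src1 if isinstance(src1, int) else source[src1]  # int = immediate
        result[dst] = a + source[src2]
    return result

# One word holding four copies of "add #1, r0, r0".
word = [("add", 1, "r0", "r0")] * 4
parallel = execute_word(word, {"r0": 10}, simultaneous=True)
serial = execute_word(word, {"r0": 10}, simultaneous=False)
```

Starting from r0 = 10, the simultaneous case leaves r0 at 11 and the sequential case leaves it at 14, which is the discrepancy the program generation device has to remove.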
[0038]
(Configuration of first program generation device)
FIG. 6 is a block diagram of the first program generation device according to the first embodiment of the present invention.
[0039]
Reference numeral 300 denotes source code storage means, a memory that stores an instruction sequence; 320 denotes avoidance target code detection means that extracts instruction sequences whose execution results differ between the case where the unit instructions within one word are executed simultaneously and the case where they are executed sequentially; 330 denotes sequential execution guarantee code generation means that generates an instruction sequence avoiding the problematic instruction sequence; and 340 denotes instruction sequence storage means that stores the program generated by the sequential execution guarantee code generation means.
[0040]
The operation of the first program generating apparatus according to the first embodiment of the present invention configured as described above will be described below.
[0041]
When the instruction sequence stored in the source code storage means 300 is input to the avoidance target code detection means 320, it extracts, as avoidance target instruction sequences, those instruction sequences whose execution results differ between the case where the unit instructions within one word are executed simultaneously and the case where they are executed sequentially. Specifically, an instruction sequence with differing execution results is a combination of an output instruction and a reference instruction in which a subsequent unit instruction refers to the result output by some unit instruction within the same word; for example, the combination of the instruction “add r0, r1, r1” and the subsequent instruction “add r1, r2, r3” contained in one word.
[0042]
FIG. 7 shows an algorithm in which the avoidance target code detection means generates an avoidance target instruction sequence.
[0043]
Step 401 reads one word from the source program. Step 402 reads the unit instructions of that word one at a time, starting from the head. Step 403 registers the output register of the unit instruction read in step 402. Step 404 reads the subsequent unit instructions one at a time, starting from the head side. Step 405 registers the reference registers of the unit instruction read in step 404. Step 406 determines whether the output register registered in step 403 matches a reference register registered in step 405. Step 407 registers the subsequent unit instruction if step 406 finds a match. Step 408 determines whether a further subsequent unit instruction exists and, if so, repeats from step 404. Step 409 outputs the registered combination of output instruction and reference instruction, if one exists, as avoidance target code. Step 410 determines whether a further unit instruction exists in the word and, if so, repeats from step 402. Step 411 determines whether a subsequent word exists and, if so, repeats from step 401.
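The detection loop of FIG. 7 can be sketched in Python as follows. This is an illustrative reading of the steps, not the patented implementation: the textual instruction format, the operand order (two sources followed by the output register), and the function names are assumptions, and immediates written with “#” are treated as not referencing any register.

```python
def parse(text):
    """Split 'add r0, r1, r0' into (op, [source registers], output register)."""
    op, rest = text.split(None, 1)
    src1, src2, dst = [t.strip() for t in rest.split(",")]
    sources = [s for s in (src1, src2) if not s.startswith("#")]
    return op, sources, dst

def find_avoidance_pairs(word):
    """Return (i, j) index pairs where unit j reads what unit i outputs."""
    units = [parse(u) for u in word.split("&")]
    pairs = []
    for i, (_, _, out_reg) in enumerate(units):      # steps 402 and 403
        for j in range(i + 1, len(units)):           # steps 404 and 408
            if out_reg in units[j][1]:               # steps 405 and 406
                pairs.append((i, j))                 # step 407
    return pairs                                     # step 409

def scan_program(words):
    """Steps 401 and 411: process every word, keeping only words with hits."""
    return {n: p for n, w in enumerate(words) if (p := find_avoidance_pairs(w))}

clean = find_avoidance_pairs(
    "add #1, r0, r0 & add #1, r1, r1 & add #1, r2, r2 & add #1, r3, r3")
hazard = find_avoidance_pairs(
    "add r0, r1, r0 & sub r0, r1, r1 & add #1, r2, r2 & add #1, r3, r3")
```

The first word yields no pairs, while the second yields the pair formed by the “add” that outputs r0 and the “sub” that reads it.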
[0044]
The sequential execution guarantee code generation means 330 uses the information on the avoidance target instruction sequence output from the avoidance target code detection means 320 to convert the instruction sequence stored in the source code storage means 300 into an instruction sequence that behaves the same whether it is executed simultaneously or sequentially. Specifically, it searches for a register not used in the instruction sequence, replaces the output register of the instruction that outputs the problematic register in the problematic instruction sequence with the unused register, and replaces the reference register of the instruction in the subsequent word that refers to the problematic register with the substituted register. For example, suppose that the instruction “add r0, r1, r1” and the subsequent instruction “add r1, r2, r3” exist in one word and the instruction “add #1, r1, r1” exists in the following word (hereinafter written “add r0, r1, r1 & add r1, r2, r3; add #1, r1, r1”, where “&” indicates that the instructions are in the same word and, in the case of sequential execution, are executed from left to right, and “;” indicates the boundary with the following word). If r4 is a register not used in the instruction sequence, the output register of the instruction “add r0, r1, r1”, which outputs the problematic register r1, is replaced with the unused register to give “add r0, r1, r4”, and the reference register of the instruction “add #1, r1, r1” in the following word, which refers to the problematic register, is replaced with the substituted register to give “add #1, r4, r1”. The converted instruction sequence is output to the instruction sequence storage means 340.
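The register substitution in this example can be sketched in Python. It is a simplified illustration under assumptions: unit instructions are hypothetical tuples (op, src1, src2, dst), the unused register is supplied by the caller rather than searched for, and only the single following word is rewritten.

```python
def rename_output(word, following, hazard_index, unused):
    """Redirect the output of word[hazard_index] to an unused register.

    References to the old output register in the following word are
    rewritten to read the substituted register instead.
    """
    op, src1, src2, old = word[hazard_index]
    new_word = list(word)
    new_word[hazard_index] = (op, src1, src2, unused)
    new_following = [
        (o, unused if a == old else a, unused if b == old else b, d)
        for (o, a, b, d) in following
    ]
    return new_word, new_following

# "add r0, r1, r1 & add r1, r2, r3; add #1, r1, r1" with r4 unused.
word = [("add", "r0", "r1", "r1"), ("add", "r1", "r2", "r3")]
following = [("add", 1, "r1", "r1")]
new_word, new_following = rename_output(word, following, 0, "r4")
```

After the substitution the second unit instruction still reads the original r1 in either execution mode, because nothing in the word writes r1 any more.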
[0045]
Instead of searching for an unused register, a register can be secured without any search by inserting save and restore operations to the stack before and after the problematic instruction word. Alternatively, by applying register allocation techniques, the search can be carried out within the basic block or beyond it, and, only when no unused register exists, a register can be secured by inserting save and restore operations to the stack before and after the problematic instruction word.
[0046]
(Operation of instruction sequence generator)
Next, the operation of this instruction sequence generation apparatus when a specific instruction is decoded and executed will be described.
[0047]
FIG. 8A shows an instruction sequence generated by a conventional program generation apparatus for a VLIW processor stored in the source code storage unit 300.
[0048]
First, one word starting from address (10000000)₁₆ is processed.
The avoidance target code detection means 320 receives the one-word instruction sequence “add #1, r0, r0 & add #1, r1, r1 & add #1, r2, r2 & add #1, r3, r3” starting from address (10000000)₁₆ stored in the source code storage means 300, and checks whether it contains an instruction sequence whose execution results differ between simultaneous execution of the word and sequential execution of the unit instructions within the word. Since this instruction sequence contains no problematic instruction sequence, the avoidance target code detection means 320 outputs no avoidance target instruction sequence.
[0049]
Because the avoidance target code detection means 320 outputs no avoidance target instruction sequence, the sequential execution guarantee code generation means 330 outputs the instruction sequence starting from address (10000000)₁₆ stored in the source code storage means 300 to the instruction sequence storage means 340 as it is.
[0050]
Next, the following word, starting from address (10000010)₁₆, is processed.
The avoidance target code detection means 320 receives the one-word instruction sequence “add r0, r1, r0 & sub r0, r1, r1 & add #1, r2, r2 & add #1, r3, r3” starting from address (10000010)₁₆ stored in the source code storage means 300, and checks whether it contains an instruction sequence whose execution results differ between simultaneous execution of the word and sequential execution of the unit instructions within the word. In this instruction sequence, “add r0, r1, r0 & sub r0, r1, r1” is such a sequence.
[0051]
The sequential execution guarantee code generation means 330 uses the avoidance target instruction sequence “add r0, r1, r0 & sub r0, r1, r1” output from the avoidance target code detection means 320 to convert the instruction sequence stored in the source code storage means 300 into an instruction sequence that behaves the same under simultaneous and sequential execution. Referring to the subsequent instruction sequence, it takes the r4 register as an unused register, converts the instruction “add r0, r1, r0” in the avoidance target instruction sequence into “add r0, r1, r4”, searches for the subsequent instruction that refers to r0, converts the instruction “add #1, r0, r0” into “add #1, r4, r0”, and outputs the result to the instruction sequence storage means 340.
[0052]
Similarly, by processing the one-word instruction sequence starting from address (10000020)₁₆, “add r1, r2, r1 & sub r1, r2, r2; add #1, r1, r1” is converted into “add r1, r2, r5 & sub r1, r2, r2; add #1, r5, r1”.
[0053]
Further, when the one-word instruction sequence starting from address (10000030)₁₆ is processed and avoidance target code exists but no unused register exists, a register, for example r6, is secured by the save instruction “push r6” to the stack and restored by the restore instruction “pop r6”, so that “add r2, r3, r2 & sub r2, r3, r3” is converted into “push r6; add r2, r3, r6 & sub r2, r3, r3; mov r6, r2 & pop r6”.
[0054]
Through the above processing, the avoidance target code detection means 320 detects the instruction sequences in the hatched portions shown in FIG. 8B. As shown in FIG. 8C, the sequential execution guarantee code generation means 330 changes the output registers of the hatched instruction sequences output from the avoidance target code detection means 320, changes the reference registers of the instructions in the subsequent words that refer to those output registers (the darkly hatched portions), adds stack access instructions or NOP instructions where needed, and outputs the resulting instruction sequence to the instruction sequence storage means 340.
[0055]
(Configuration of second program generation device)
FIG. 9 is a block diagram of the second program generation device according to the first embodiment of the present invention.
[0056]
Reference numeral 300 denotes source code storage means, a memory that stores an instruction sequence; 310 denotes instruction fetch boundary detection means that detects the instruction fetch boundaries of the processor; 320 denotes avoidance target code detection means that extracts instruction sequences whose execution results differ between the case where the unit instructions within one word are executed simultaneously and the case where they are executed sequentially in units of instruction fetch boundaries; 330 denotes sequential execution guarantee code generation means that generates an instruction sequence avoiding the problematic instruction sequence; and 340 denotes instruction sequence storage means that stores the program generated by the sequential execution guarantee code generation means.
[0057]
The operation of the second program generation device according to the first embodiment of the present invention configured as described above will be described below.
[0058]
When the instruction sequence stored in the source code storage means 300 is input to the instruction fetch boundary detection means 310, it detects where in the instruction sequence the instruction fetch boundaries of the processor lie. In this embodiment, since the instruction fetch width of the processor is 64 bits, the instruction fetch boundaries are addresses whose lowest hexadecimal digit is 0 or 8, such as (10000000)₁₆, (10000008)₁₆, and (10000010)₁₆.
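The boundary rule can be checked with a few lines of Python. This is a sketch only: byte addressing is assumed, and the function name and its parameters are hypothetical rather than taken from the description.

```python
def fetch_boundaries(start, length_bytes, fetch_bits=64):
    """List the instruction fetch boundary addresses in [start, start + length)."""
    step = fetch_bits // 8                      # 64-bit fetch = 8 bytes
    first = (start + step - 1) // step * step   # round up to the next boundary
    return list(range(first, start + length_bytes, step))

# Boundaries inside the first 0x18 bytes of the example program.
boundaries = fetch_boundaries(0x10000000, 0x18)
low_digits = sorted({addr & 0xF for addr in boundaries})
```

For the example addresses this yields (10000000)₁₆, (10000008)₁₆, and (10000010)₁₆, whose lowest hexadecimal digits are 0 or 8.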
[0059]
When the instruction sequence stored in the source code storage means 300 and the instruction fetch boundary information output from the instruction fetch boundary detection means 310 are input to the avoidance target code detection means 320, it extracts those instruction sequences whose execution results differ between the case where the unit instructions within one word are executed simultaneously and the case where they are executed sequentially in units of instruction fetch boundaries. Specifically, an instruction sequence with differing execution results is a combination of an output instruction and a reference instruction, in which a subsequent unit instruction refers to the result output by some unit instruction within the same word, that straddles an instruction fetch boundary; for example, the case where the combination of the instruction “add r0, r1, r1” and the subsequent instruction “add r1, r2, r3” contained in one word straddles an instruction fetch boundary.
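The boundary-aware check can be sketched in Python as below. The sketch assumes fixed 32-bit unit instructions and a 64-bit fetch width, so that unit instructions 0 and 1 share one fetch group and 2 and 3 share the next; the textual instruction format and function names are hypothetical.

```python
UNIT_BYTES = 4    # 32-bit unit instructions (assumed fixed length)
FETCH_BYTES = 8   # 64-bit instruction fetch width, as in this embodiment

def parse(text):
    """Split 'add r1, r2, r1' into ([source registers], output register)."""
    _, rest = text.split(None, 1)
    src1, src2, dst = [t.strip() for t in rest.split(",")]
    return [s for s in (src1, src2) if not s.startswith("#")], dst

def fetch_group(index):
    """Fetch group of the unit instruction at the given slot of a word."""
    return (index * UNIT_BYTES) // FETCH_BYTES

def find_straddling_pairs(word):
    """Return (i, j) pairs where j reads i's output across a fetch boundary."""
    units = [parse(u) for u in word.split("&")]
    return [
        (i, j)
        for i, (_, out_reg) in enumerate(units)
        for j in range(i + 1, len(units))
        if out_reg in units[j][0] and fetch_group(i) != fetch_group(j)
    ]

same_group = find_straddling_pairs(
    "add r0, r1, r0 & sub r0, r1, r1 & add #1, r2, r2 & add #1, r3, r3")
straddling = find_straddling_pairs(
    "add #1, r0, r0 & add r1, r2, r1 & sub r1, r2, r2 & add #1, r3, r3")
```

A dependent pair inside one fetch group is left alone because both instructions still issue together; only the pair split across the boundary is reported.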
[0060]
The sequential execution guarantee code generation means 330 uses the information on the avoidance target instruction sequence output from the avoidance target code detection means 320 to convert the instruction sequence stored in the source code storage means 300 into an instruction sequence that behaves the same whether it is executed simultaneously or sequentially. Specifically, it searches for a register not used in the instruction sequence, replaces the output register of the instruction that outputs the problematic register in the problematic instruction sequence with the unused register, and replaces the reference register of the instruction in the subsequent word that refers to the problematic register with the substituted register. For example, suppose that the instruction “add r0, r1, r1” and the subsequent instruction “add r1, r2, r3” exist in one word and the instruction “add #1, r1, r1” exists in the following word (hereinafter written “add r0, r1, r1 & add r1, r2, r3; add #1, r1, r1”, where “&” indicates that the instructions are in the same word and, in the case of sequential execution, are executed from left to right, and “;” indicates the boundary with the following word). If r4 is a register not used in the instruction sequence, the output register of the instruction “add r0, r1, r1”, which outputs the problematic register r1, is replaced with the unused register to give “add r0, r1, r4”, and the reference register of the instruction “add #1, r1, r1” in the following word, which refers to the problematic register, is replaced with the substituted register to give “add #1, r4, r1”. The converted instruction sequence is output to the instruction sequence storage means 340.
[0061]
(Operation of instruction sequence generator)
Next, the operation of this instruction sequence generation apparatus when a specific instruction is decoded and executed will be described.
[0062]
FIG. 10A shows an instruction sequence generated by a conventional program generation apparatus for a VLIW processor stored in the source code storage unit 300.
[0063]
First, one word starting from address (10000000)₁₆ is processed.
The instruction fetch boundary detection means 310 detects address (10000008)₁₆, which is the instruction fetch boundary within the one-word instruction sequence starting from address (10000000)₁₆ stored in the source code storage means 300.
[0064]
The avoidance target code detection means 320 receives the one-word instruction sequence “add #1, r0, r0 & add #1, r1, r1 & add #1, r2, r2 & add #1, r3, r3” starting from address (10000000)₁₆ stored in the source code storage means 300, and checks whether it contains an instruction sequence whose execution results differ between simultaneous execution of the word and sequential execution of the unit instructions in units delimited by the instruction fetch boundary output from the instruction fetch boundary detection means 310. That is, it checks whether the execution result differs between the case where the whole one-word instruction sequence “add #1, r0, r0 & add #1, r1, r1 & add #1, r2, r2 & add #1, r3, r3” is executed simultaneously and the case where the two unit instructions “add #1, r0, r0 & add #1, r1, r1” and the two unit instructions “add #1, r2, r2 & add #1, r3, r3” are executed one group after the other. Since this instruction sequence contains no problematic instruction sequence, the avoidance target code detection means 320 outputs no avoidance target instruction sequence.
[0065]
Because the avoidance target code detection means 320 outputs no avoidance target instruction sequence, the sequential execution guarantee code generation means 330 outputs the instruction sequence starting from address (10000000)₁₆ stored in the source code storage means 300 to the instruction sequence storage means 340 as it is.
[0066]
Next, the following word, starting from address (10000010)₁₆, is processed.
The instruction fetch boundary detection means 310 detects address (10000018)₁₆, which is the instruction fetch boundary within the one-word instruction sequence starting from address (10000010)₁₆ stored in the source code storage means 300.
[0067]
The avoidance target code detection means 320 receives the one-word instruction sequence “add r0, r1, r0 & sub r0, r1, r1 & add #1, r2, r2 & add #1, r3, r3” starting from address (10000010)₁₆ stored in the source code storage means 300, and checks whether it contains an instruction sequence whose execution results differ between simultaneous execution of the word and sequential execution of the unit instructions in units delimited by the instruction fetch boundary output from the instruction fetch boundary detection means 310. That is, it checks whether the execution result differs between the case where the whole one-word instruction sequence “add r0, r1, r0 & sub r0, r1, r1 & add #1, r2, r2 & add #1, r3, r3” is executed simultaneously and the case where the two unit instructions “add r0, r1, r0 & sub r0, r1, r1” and the two unit instructions “add #1, r2, r2 & add #1, r3, r3” are executed one group after the other. Since this instruction sequence contains no problematic instruction sequence, the avoidance target code detection means 320 outputs no avoidance target instruction sequence.
[0068]
Because the avoidance target code detection means 320 outputs no avoidance target instruction sequence, the sequential execution guarantee code generation means 330 outputs the instruction sequence starting from address (10000010)₁₆ stored in the source code storage means 300 to the instruction sequence storage means 340 as it is.
[0069]
Next, the following word, starting from address (10000020)₁₆, is processed.
The instruction fetch boundary detection means 310 detects address (10000028)₁₆, which is the instruction fetch boundary within the one-word instruction sequence starting from address (10000020)₁₆ stored in the source code storage means 300.
[0070]
The avoidance target code detection means 320 receives the one word "add #1, r0, r0 & add r1, r2, r1 & sub r1, r2, r2 & add #1, r3, r3" of the instruction sequence starting at address (10000020)₁₆ stored in the source code storage means 300. It checks whether the word contains an instruction sequence whose execution result differs between simultaneous execution of the whole word and sequential execution of the unit instructions in units delimited by the instruction fetch boundary output from the instruction fetch boundary detection means 310. That is, it checks whether the result of executing the whole word simultaneously differs from the result of executing the two unit instructions "add #1, r0, r0 & add r1, r2, r1" and then the two unit instructions "sub r1, r2, r2 & add #1, r3, r3". In this case, the instruction sequence "add r1, r2, r1 & sub r1, r2, r2" is the one concerned.
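The check described above can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation: the `Unit` tuple and the function `find_avoidance_targets` are hypothetical names, and the sketch assumes that the last operand of each unit instruction is its output register, as the examples in the text suggest. Only the combination of an output instruction and a later reference instruction is checked.

```python
from typing import NamedTuple

class Unit(NamedTuple):
    op: str       # mnemonic, e.g. "add"
    srcs: tuple   # source operands: register names or immediates such as "#1"
    dst: str      # output register (assumed to be the last operand)

def find_avoidance_targets(groups):
    """Return (writer, reader) pairs within one word.

    `groups` lists the unit instructions of one word, split at the
    instruction fetch boundaries. A pair is reported when a later group
    reads a register that an earlier group writes: executed simultaneously
    the reader sees the old value, executed group by group it sees the new one.
    """
    hits = []
    for i, early in enumerate(groups):
        for late in groups[i + 1:]:
            for writer in early:
                for reader in late:
                    if writer.dst in reader.srcs:
                        hits.append((writer, reader))
    return hits

# Word at (10000020)16 from the text, split at the boundary (10000028)16.
word = [
    [Unit("add", ("#1", "r0"), "r0"), Unit("add", ("r1", "r2"), "r1")],
    [Unit("sub", ("r1", "r2"), "r2"), Unit("add", ("#1", "r3"), "r3")],
]
for writer, reader in find_avoidance_targets(word):
    print(writer, "->", reader)
```

For this word the sketch reports the pair "add r1, r2, r1" and "sub r1, r2, r2", while the word at (10000010)₁₆ yields no pair because its writer and reader sit in the same fetch group.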
[0071]
Using the avoidance target instruction sequence "add r1, r2, r1 & sub r1, r2, r2" output from the avoidance target code detection means 320, the sequential execution guarantee code generation means 330 converts the instruction sequence stored in the source code storage means 300 into an instruction sequence that behaves identically under simultaneous and sequential execution. It refers to the subsequent instruction sequence, uses the r4 register as an unused register, converts the instruction "add r1, r2, r1" in the avoidance target instruction sequence into the instruction "add r1, r2, r5", searches for the subsequent instruction that refers to r1, converts the instruction "add #1, r1, r1" into the instruction "add #1, r5, r1", and then outputs the result to the instruction sequence storage means 340.
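The register replacement just described can be sketched as follows. The names `Unit` and `rename_output` are hypothetical, the replacement register is passed in rather than searched for, and the sketch assumes that renaming stops after the first subsequent word that overwrites the original register; none of this is stated in the text.

```python
from typing import NamedTuple

class Unit(NamedTuple):
    op: str
    srcs: tuple   # source operands
    dst: str      # output register (assumed to be the last operand)

def rename_output(writer, following_words, old, new):
    """Redirect the writer's output from `old` to `new` and patch readers.

    following_words: the words after the one containing the writer, each a
    list of unit instructions. Readers of `old` are switched to `new` until
    a word overwrites `old` (assumption of this sketch).
    """
    new_writer = writer._replace(dst=new)
    patched = []
    live = True   # True while `new` still holds the value readers expect
    for word in following_words:
        if live:
            word = [
                u._replace(srcs=tuple(new if s == old else s for s in u.srcs))
                for u in word
            ]
            if any(u.dst == old for u in word):
                live = False
        patched.append(list(word))
    return new_writer, patched

writer = Unit("add", ("r1", "r2"), "r1")
following = [[Unit("add", ("#1", "r1"), "r1")]]
print(rename_output(writer, following, old="r1", new="r5"))
```

With the values from the text, "add r1, r2, r1" becomes "add r1, r2, r5" and the following "add #1, r1, r1" becomes "add #1, r5, r1".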
[0072]
Thereafter, the one word of the instruction sequence starting at address (10000030)₁₆ has no problem and is therefore output to the instruction sequence storage means 340 unchanged.
[0073]
Through the above processing, the instruction fetch boundary detection means 310 outputs the instruction fetch boundary information indicated by the bold lines in FIG. 10A, and the avoidance target code detection means 320 outputs the instruction sequence in the shaded portion. As shown in FIG. 10B, the sequential execution guarantee code generation means 330 changes the output register of the shaded instruction sequence output from the avoidance target code detection means 320, changes the reference register of the darkly shaded instruction sequence that is contained in the subsequent word and refers to that output register, and outputs the instruction sequence to the instruction sequence storage means 340.
[0074]
In this embodiment, a VLIW processor with an instruction fetch width of 64 bits, a fixed word length of 128 bits, and a maximum of four simultaneously executed instructions is assumed, but the values are not limited to these. For example, the word length of an instruction may be 64 bits or 256 bits, and the width of the data bus may be 16 bits or 32 bits. That is, it suffices that the data bus is narrower than the word length of the instruction.
[0075]
Further, the algorithm described for the sequential execution guarantee code generation means searches for an unused register in the instruction sequence, replaces the output register of the instruction that outputs the register in question in the problematic instruction sequence with the unused register, and, in the subsequent word, replaces the reference register of the instruction that refers to the register in question with the replacement register. Alternatively, an algorithm may be used that transfers the register in question to an unused register in advance and replaces the reference register of the instruction that refers to the register in question with that register. Specifically, in the embodiment, the instruction sequence "add r0, r1, r0 & sub r0, r1, r1; add #1, r0, r0" is changed to "mov r0, r4; add r0, r1, r0 & add r4, r1, r1; add #1, r0, r0".
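The alternative algorithm, which saves the register in question beforehand, can be sketched as follows. The names `Unit` and `save_and_redirect` are hypothetical, and the sketch leaves the opcode of the redirected reader unchanged; it only swaps the register that the reader refers to.

```python
from typing import NamedTuple

class Unit(NamedTuple):
    op: str
    srcs: tuple   # source operands
    dst: str      # output register (assumed to be the last operand)

def save_and_redirect(word, writer_index, spare):
    """Copy the writer's output register to `spare` ahead of the word and
    make the readers that follow the writer in the word use the copy."""
    old = word[writer_index].dst
    save = Unit("mov", (old,), spare)   # executed before the word
    patched = []
    for i, u in enumerate(word):
        if i > writer_index:
            u = u._replace(srcs=tuple(spare if s == old else s for s in u.srcs))
        patched.append(u)
    return save, patched

word = [Unit("add", ("r0", "r1"), "r0"), Unit("sub", ("r0", "r1"), "r1")]
print(save_and_redirect(word, writer_index=0, spare="r4"))
```

Because the reader now takes its input from the saved copy, it sees the old value of r0 whether the word is executed simultaneously or one fetch unit at a time, and the instructions in later words need no change.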
[0076]
The instruction sequence output by the avoidance target code detection means is a combination of an output instruction and a reference instruction, and thus is not limited to two instructions. When there are a plurality of reference instructions, there may be a combination of three or more instructions.
[0077]
The instruction sequence storage means may be a recording medium such as a floppy disk, a tape, a hard disk, or a memory, or may be an input file to an optimization program such as a compiler or an assembler optimizer. It is possible to further optimize the output file by repeating the process using the optimization program.
[0078]
Further, the instruction fetch width recognized by the instruction fetch boundary detection means does not need to be fixed, and for example, a different value may be set for each memory area. In that case, the instruction fetch boundary detecting means determines the instruction fetch width from the address information.
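Determining the instruction fetch width from the address can be sketched as follows. The memory map `REGIONS` and both function names are hypothetical, and the sketch assumes that fetch boundaries are aligned to the fetch width of the area.

```python
# Hypothetical memory map: (start, end_exclusive, fetch_width_bits).
REGIONS = [
    (0x10000000, 0x10010000, 64),
    (0x20000000, 0x20010000, 128),
]

def fetch_width_bits(addr):
    """Look up the instruction fetch width of the area containing `addr`."""
    for start, end, width in REGIONS:
        if start <= addr < end:
            return width
    raise ValueError(f"no fetch width configured for address {addr:#x}")

def next_fetch_boundary(addr):
    """Return the first fetch boundary above `addr` (aligned boundaries assumed)."""
    step = fetch_width_bits(addr) // 8   # fetch width in bytes
    return (addr // step + 1) * step

print(hex(next_fetch_boundary(0x10000010)))
print(hex(next_fetch_boundary(0x20000014)))
```

The same word layout therefore yields different boundaries depending on the area in which the code is placed.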
[0079]
Further, the instruction fetch width information may be incorporated in the program generation apparatus or information may be given from the outside. Specifically, it may be specified as a constant incorporated in a compiler, assembler or linker, or may be specified in the form of an argument or an environment file. Also, the instruction fetch width to be specified may be constant or may be given individually for each space.
[0080]
(Second Embodiment)
The present embodiment relates to a processor or the like that can execute instructions efficiently even for variable-length instructions.
(1) Processor
FIG. 3 is a block diagram of the VLIW processor according to the second embodiment of the present invention. This processor is a VLIW processor that has two kinds of unit instructions, 32 bits and 64 bits long, and can simultaneously execute a variable-length word composed of at most four unit instructions.
[0081]
The basic structure is the same as that of the VLIW processor of FIG. 1. To handle variable-length instructions, it differs in that (a) the instruction supply issue unit 220 has an instruction buffer 225 that holds the instructions fetched from the memory 210 in units of 128 bits and stores them, 32 bits as one unit, in a maximum of eight registers, and (b) selectors 229 are provided for switching between a 32-bit instruction and a 64-bit instruction.
[0082]
Even when the four simultaneously executable instructions become available only after two instruction fetches, this VLIW processor decodes and executes the instructions already supplied without waiting for all four to be fetched. The maximum number of simultaneously executable instructions is four, but boundary information embedded in the instructions can specify a number of simultaneously executable instructions of four or fewer; its description is omitted here.
[0083]
The operation of the processor according to the second embodiment of the present invention configured as described above will be described below.
(Instruction supply unit 220)
First, the instruction fetch control unit 221 in the instruction supply issue unit 220 gives the address of the instruction to be executed, based on the PC 202 and the clock 201, to the memory 210 via the address bus 211. The memory 210 then supplies instructions, 32 bits each, to the four instruction registers in the instruction register 222 via the 128-bit data bus 212. The instruction register 222 stores the data supplied from the memory 210 based on the clock 201. At the same time, the storage flag 223 is set to (00001111)₂ to indicate that instructions are stored in the four instruction registers. The instruction buffer 225 temporarily holds the instructions fetched 128 bits at a time so that instructions of up to 256 bits can be stored in the instruction register 222.
(Instruction decoding unit 230)
Next, the first instruction decoder of the decoder 232 in the instruction decoding unit 230 decodes the output of the leftmost selector 229. During decoding, it recognizes whether the instruction is a 32-bit instruction or a 64-bit instruction, and outputs instruction length information 241 and a decoding result 242. Specifically, as shown in FIG. 4, format information indicating whether the instruction is a 32-bit instruction or a 64-bit instruction is placed at the head of each 32-bit unit, and this information is output as the instruction length information 241 as it is. Each selector 229 always outputs 64 bits of data regardless of whether the instruction is a 32-bit instruction or a 64-bit instruction.
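How the format information separates 32-bit and 64-bit instructions can be sketched as follows. The text says only that the format information sits at the head of each 32-bit unit; the sketch assumes it is the most significant bit, with 0 meaning a 32-bit instruction and 1 meaning a 64-bit instruction. The function name `split_units` is hypothetical.

```python
def split_units(units):
    """Split a list of fetched 32-bit units into (length_bits, value) pairs.

    Assumption of this sketch: bit 31 of the first unit of an instruction is
    the format bit (0 = 32-bit instruction, 1 = 64-bit instruction).
    A 64-bit instruction whose second half has not been fetched is left out.
    """
    instructions = []
    i = 0
    while i < len(units):
        is_64bit = (units[i] >> 31) & 1
        if is_64bit:
            if i + 1 >= len(units):
                break   # second half not supplied yet
            instructions.append((64, (units[i] << 32) | units[i + 1]))
            i += 2
        else:
            instructions.append((32, units[i]))
            i += 1
    return instructions

print(split_units([0x00000001, 0x80000002, 0x12345678, 0x00000003]))
```

A 64-bit instruction that straddles the end of the fetched data is simply not issued in this sketch, which mirrors the situation in which the processor must wait for the next instruction fetch.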
[0084]
The first instruction issuer of the decoder 232 uses the value (00001111)₂ of the storage flag 223 to determine whether an instruction has been supplied. Specifically, when the instruction is a 32-bit instruction, the use flag update unit 240 shifts (00000000)₂ right by 1 bit while inserting "1" from the left, based on the instruction length information 241, and obtains (10000000)₂. This value is ANDed bit by bit with the value (00001111)₂ of the storage flag 223; if the result is (00000000)₂ (all bits "0"), it is determined that the instruction has been supplied, and "1" is output as the cancel signal 234. In the case of a 64-bit instruction, the use flag update unit 240 shifts right by 2 bits while inserting "1" from the left and obtains (11000000)₂. The bitwise AND of this value and the value (00001111)₂ of the storage flag 223 is (00000000)₂, so it is determined that the instruction has been supplied, and "1" is output as the cancel signal 234. Note that the use flag update unit 240 does not shift when the cancel signal 234 is "0", that is, when the instruction supply is insufficient.
[0085]
The leftmost storage flag shifter 239 shifts the storage flag 223 to the left while inserting "1" from the right, based on the instruction length information 241. Specifically, when a 32-bit instruction is decoded by the first instruction decoder, the storage flag 223 (00001111)₂ is shifted 1 bit to the left to give (00011111)₂, which is passed to the second instruction issuer. If the instruction is a 64-bit instruction, the flag is shifted 2 bits to the left to give (00111111)₂, which is passed to the second instruction issuer. For example, when the storage flag 223 is (00001111)₂ and 64-bit instructions are decoded by both the first and second instruction decoders, the third instruction issuer receives (11111111)₂ from the storage flag shifter 239 and determines that the instruction supply is insufficient. At the same time, the selector 229 corresponding to the second instruction decoder switches the instruction register 222 to be selected. The number of bits used by the first to fourth instruction decoders is calculated by the use flag update unit 240 and stored as the use flag 224.
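The interplay of the storage flag, the storage flag shifter, and the use flag can be modelled in software as follows. This is one reading of the description, not a statement of the actual circuit: the function name `issue` is hypothetical, and each issuer is modelled as ANDing a fresh mask with the storage flag handed on by the previous shifter.

```python
def issue(storage_flag, lengths):
    """Model the chain of instruction issuers over an 8-bit storage flag.

    storage_flag: 0 bits (from the left) mark 32-bit registers holding
                  fetched units, as in the value (00001111)2 of the text.
    lengths:      instruction length in bits (32 or 64) seen by each decoder.
    Returns (cancel_signals, use_flag); cancel signal 1 means "supplied".
    """
    cancels = []
    use_flag = 0
    for length in lengths:
        n = length // 32                    # 32-bit units consumed: 1 or 2
        mask = (0xFF << (8 - n)) & 0xFF     # n ones inserted from the left
        supplied = (mask & storage_flag) == 0
        cancels.append(1 if supplied else 0)
        if supplied:                        # no shift when supply is short
            use_flag = (use_flag >> n) | mask
        # storage flag shifter: shift left, inserting ones from the right
        storage_flag = ((storage_flag << n) | ((1 << n) - 1)) & 0xFF
    return cancels, use_flag

cancels, use_flag = issue(0b00001111, [64, 64, 32, 32])
print(cancels, format(use_flag, "08b"))
```

With four units stored and two 64-bit instructions at the front, the third and fourth issuers see (11111111)₂ and report insufficient supply, which matches the example in the text.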
[0086]
Then, the NOP generator 237 outputs the decoding result. The NOP generator 237 is the same as the NOP generator 137 of FIG. 1 and is an AND circuit that calculates the logical product of the output of the decoder 236 and the cancel signal 234. That is, when the cancel signal 234 is “0”, “0” meaning NOP is output regardless of the output of the decoder 236.
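The AND gating performed by the NOP generator can be sketched as follows. The function name `nop_generate` and the 32-bit width of the decoding result are assumptions made for illustration; the text states only that an all-zero output means NOP.

```python
def nop_generate(decoded, cancel, width=32):
    """AND every bit of the decoder output with the cancel signal.

    cancel == 1: the decoding result passes through unchanged.
    cancel == 0: the output is forced to 0, which means NOP.
    `width` (bits of the decoding result) is an assumption of this sketch.
    """
    gate = (1 << width) - 1 if cancel else 0
    return decoded & gate

print(hex(nop_generate(0xABCD, 1)), hex(nop_generate(0xABCD, 0)))
```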
[0087]
Next, the flow of the pipeline when the program of FIG. 16 is executed will be described with reference to FIG. 5.
[0088]
In FIG. 16, the instruction "mov (mem), r0", which stores the value read from the memory in the register r0, is placed at address (10000000)₁₆, the instruction "add #1, r1, r1", which increments the value of the register r1 by 1, is placed at address (10000004)₁₆, and instructions are arranged in the same way up to address (1000001F)₁₆. Among these, the "add #12345678, r3, r3" instruction is a 64-bit unit instruction, and the others are 32-bit unit instructions.
[0089]
In this case, because one of the instructions is a 64-bit instruction, the four instructions become available only after two instruction fetches, at timings t1 and t2, as shown in FIG. 17. In this processor, however, as shown in FIG. 5, the three instructions including the instruction "mov (mem), r0" at address (10000000)₁₆ are decoded and executed without waiting for the second instruction fetch. At timing t6, the register r0 becomes usable.
[0090]
On the other hand, the instruction "add #1, r0, r0" at address (10000020)₁₆ is fetched at timing t3 and enters the EX stage at timing t5. However, since the register r0 is not yet usable, a pipeline interlock occurs due to register interference. The register r0 becomes usable at timing t6, and "add #1, r0, r0" is then executed. As a result, 8 cycles are required to execute all instructions.
[0091]
Comparing the pipeline flow shown in FIG. 17 with the pipeline flow shown in FIG. 5, the instruction "add #1, r0, r0" at address (10000020)₁₆ enters the EX stage at the same timing t5 in both. However, the instruction "mov (mem), r0" at address (10000000)₁₆ completes the WB stage at timing t7 in FIG. 17 but at timing t6 in FIG. 5. This is because in FIG. 17 decoding and execution start only when all four instructions to be executed in parallel are available, whereas in FIG. 5 they start without waiting for the second instruction fetch (that is, without waiting for the fourth instruction to be fetched). For this reason, 9 cycles are required to execute all instructions in FIG. 17 (pipeline interlocks occur at timings t5 and t6), whereas execution completes in 8 cycles in FIG. 5 (a pipeline interlock occurs only at timing t5).
[0092]
At timing t2, the instructions up to address (10000020)₁₆ are fetched together with the instruction "add #12345678, r3, r3" at address (10000010)₁₆. However, since the "add #12345678, r3, r3" instruction marks a boundary of simultaneously executable instructions, only this instruction is executed at timing t3.
[0093]
In this embodiment, it is assumed that four instructions are always supplied to a VLIW processor whose hardware can execute four instructions simultaneously. However, fewer than four instructions may be supplied to the same hardware by using a technique that marks the boundary of simultaneously executable instructions. In that case too, decoding and execution are performed at each instruction fetch even if the number of simultaneously executable instructions has not been reached.
(Program generation device)
(Configuration of first program generation device)
FIG. 6 is a block diagram of the first program generation device according to the second embodiment of the present invention.
[0094]
The basic structure is the same as that of the first program generation device of the first embodiment. It differs in that the bit widths of the unit instruction and of one word are variable, and in that the avoidance target code detection means 320 and the sequential execution guarantee code generation means 330 recognize the parallel execution boundary information 301 and the format information 302 in the unit instructions.
[0095]
(Operation of instruction sequence generator)
The operation of the first program generation device according to the second embodiment of the present invention, configured as described above, is described below for a specific instruction sequence.
[0096]
FIG. 11A shows an instruction sequence, generated by a conventional program generation device for a VLIW processor, that is stored in the source code storage means 300.
[0097]
First, the one word starting at address (10000000)₁₆ is processed.
The avoidance target code detection means 320 receives the one word "add #1, r0, r0 & add #1, r1, r1 & add #1, r2, r2 & add #12345678, r3, r3" of the instruction sequence starting at address (10000000)₁₆ stored in the source code storage means 300, and checks whether the word contains an instruction sequence whose execution result differs between simultaneous execution of the whole word and sequential execution of the unit instructions within the word. Since this word contains no problematic instruction sequence, the avoidance target code detection means 320 outputs none.
[0098]
Because the avoidance target code detection means 320 outputs no avoidance target instruction sequence, the sequential execution guarantee code generation means 330 outputs the instruction sequence starting at address (10000000)₁₆, stored in the source code storage means 300, to the instruction sequence storage means 340 unchanged.
[0099]
Next, the one word starting at address (10000014)₁₆ is processed.
The avoidance target code detection means 320 receives the one word "add r0, r1, r0 & sub #12345678, r0, r1 & add #1, r2, r2 & add #1, r3, r3" of the instruction sequence starting at address (10000014)₁₆ stored in the source code storage means 300, and checks whether the word contains an instruction sequence whose execution result differs between simultaneous execution of the whole word and sequential execution of the unit instructions within the word. In this word, the instruction sequence "add r0, r1, r0 & sub #12345678, r0, r1" is the one concerned.
[0100]
Using the avoidance target instruction sequence "add r0, r1, r0 & sub #12345678, r0, r1" output from the avoidance target code detection means 320, the sequential execution guarantee code generation means 330 converts the instruction sequence stored in the source code storage means 300 into an instruction sequence that behaves identically under simultaneous and sequential execution. It refers to the subsequent instruction sequence, uses the r4 register as an unused register, converts the instruction "add r0, r1, r0" in the avoidance target instruction sequence into the instruction "add r0, r1, r4", searches for the subsequent instruction that refers to r0, converts the instruction "add #1, r0, r0" into the instruction "add #1, r4, r0", and then outputs the result to the instruction sequence storage means 340.
[0101]
Thereafter, by processing the one word of the instruction sequence starting at address (10000028)₁₆, "add r1, r2, r1 & sub #12345678, r1, r2; add #1, r1, r1" is converted into "add r1, r2, r5 & sub #12345678, r1, r2; add #1, r5, r1", and by processing the one word of the instruction sequence starting at address (1000003c)₁₆, "add r2, r3, r2 & sub #12345678, r2, r3" is converted into "add r2, r3, r6 & sub #12345678, r2, r3".
[0102]
Through the above processing, the avoidance target code detection means 320 detects the instruction sequences in the shaded portions, as shown in FIG. 11B. The sequential execution guarantee code generation means 330 changes the output registers of the shaded instruction sequences output from the avoidance target code detection means 320, changes the reference registers of the darkly shaded instruction sequences that are contained in the subsequent words and refer to those output registers, and outputs the instruction sequence to the instruction sequence storage means 340.
[0103]
(Configuration of second program generation device)
FIG. 9 is a block diagram of a second program generation device according to the second embodiment of the present invention.
[0104]
The basic structure is the same as that of the second program generation device of the first embodiment. It differs in that the bit widths of the unit instruction and of one word are variable, in that the avoidance target code detection means 320 and the sequential execution guarantee code generation means 330 recognize the format information 302, in that the avoidance target code detection means 320 handles the case where an instruction fetch boundary falls inside a unit instruction, and in that the instruction fetch width used by the instruction fetch boundary detection means is 128 bits, the instruction fetch width of the target processor.
[0105]
(Operation of instruction sequence generator)
Next, the operation of this program generation device is described for a specific instruction sequence.
[0106]
FIG. 12A shows an instruction sequence, generated by a conventional program generation device for a VLIW processor, that is stored in the source code storage means 300.
[0107]
First, the one word starting at address (10000000)₁₆ is processed.
The instruction fetch boundary detection means 310 detects address (10000010)₁₆ as the instruction fetch boundary within the one word of the instruction sequence starting at address (10000000)₁₆ stored in the source code storage means 300.
[0108]
The avoidance target code detection means 320 receives the one word "add #1, r0, r0 & add #1, r1, r1 & add #1, r2, r2 & add #12345678, r3, r3" of the instruction sequence starting at address (10000000)₁₆ stored in the source code storage means 300, and checks whether the word contains an instruction sequence whose execution result differs between simultaneous execution of the whole word and sequential execution of the unit instructions in units delimited by the instruction fetch boundary output from the instruction fetch boundary detection means 310. Since this word contains no problematic instruction sequence, the avoidance target code detection means 320 outputs none.
[0109]
Because the avoidance target code detection means 320 outputs no avoidance target instruction sequence, the sequential execution guarantee code generation means 330 outputs the instruction sequence starting at address (10000000)₁₆, stored in the source code storage means 300, to the instruction sequence storage means 340 unchanged.
[0110]
Next, the one word starting at address (10000014)₁₆ is processed.
The instruction fetch boundary detection means 310 detects address (10000020)₁₆ as the instruction fetch boundary within the one word of the instruction sequence starting at address (10000014)₁₆ stored in the source code storage means 300.
[0111]
The avoidance target code detection means 320 receives the one word "add r0, r1, r0 & sub #12345678, r0, r1 & add #1, r2, r2 & add #1, r3, r3" of the instruction sequence starting at address (10000014)₁₆ stored in the source code storage means 300, and checks whether the word contains an instruction sequence whose execution result differs between simultaneous execution of the whole word and sequential execution of the unit instructions in units delimited by the instruction fetch boundary output from the instruction fetch boundary detection means 310. That is, it checks whether the result of executing the whole word simultaneously differs from the result of executing the two unit instructions "add r0, r1, r0 & sub #12345678, r0, r1" and then the two unit instructions "add #1, r2, r2 & add #1, r3, r3". Since this word contains no problematic instruction sequence, the avoidance target code detection means 320 outputs none.
[0112]
Because the avoidance target code detection means 320 outputs no avoidance target instruction sequence, the sequential execution guarantee code generation means 330 outputs the instruction sequence starting at address (10000014)₁₆, stored in the source code storage means 300, to the instruction sequence storage means 340 unchanged.
[0113]
Next, the one word starting at address (10000028)₁₆ is processed.
The instruction fetch boundary detection means 310 detects address (10000030)₁₆ as the instruction fetch boundary within the one word of the instruction sequence starting at address (10000028)₁₆ stored in the source code storage means 300.
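The boundary addresses detected in these steps follow from simple address arithmetic, sketched below. The function name `split_by_fetch_boundary` is hypothetical. The sketch assumes fetch boundaries aligned to the fetch width, and assigns a unit instruction that straddles a boundary to the later fetch unit, on the reasoning that it cannot be issued before its last byte arrives; the text does not state this rule explicitly.

```python
def split_by_fetch_boundary(start, lengths, fetch_bytes):
    """Group the unit instructions of one word by fetch block.

    start:       byte address of the first unit instruction of the word
    lengths:     byte length of each unit instruction (4 or 8 in the text)
    fetch_bytes: instruction fetch width in bytes
    Returns (boundaries, groups): the fetch boundary addresses that separate
    the groups, and the instruction indices in each group.
    """
    blocks = {}
    addr = start
    for index, size in enumerate(lengths):
        last_byte = addr + size - 1           # instruction is complete here
        blocks.setdefault(last_byte // fetch_bytes, []).append(index)
        addr += size
    keys = sorted(blocks)
    boundaries = [k * fetch_bytes for k in keys[1:]]
    groups = [blocks[k] for k in keys]
    return boundaries, groups

# Word at (10000014)16: 32-bit, 64-bit, 32-bit, 32-bit; 128-bit fetch width.
bounds, groups = split_by_fetch_boundary(0x10000014, [4, 8, 4, 4], 16)
print([hex(b) for b in bounds], groups)
```

The same function reproduces the first embodiment's boundary (10000018)₁₆ when called with start (10000010)₁₆, four 4-byte instructions, and an 8-byte fetch width.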
[0114]
The avoidance target code detection means 320 receives the one word "add #1, r0, r0 & add r1, r2, r1 & sub #12345678, r1, r2 & add #1, r3, r3" of the instruction sequence starting at address (10000028)₁₆ stored in the source code storage means 300, and checks whether the word contains an instruction sequence whose execution result differs between simultaneous execution of the whole word and sequential execution of the unit instructions in units delimited by the instruction fetch boundary output from the instruction fetch boundary detection means 310. That is, it checks whether the result of executing the whole word simultaneously differs from the result of executing the two unit instructions "add #1, r0, r0 & add r1, r2, r1" and then the two unit instructions "sub #12345678, r1, r2 & add #1, r3, r3". In this case, the instruction sequence "add r1, r2, r1 & sub #12345678, r1, r2" is the one concerned.
[0115]
Using the avoidance target instruction sequence "add r1, r2, r1 & sub #12345678, r1, r2" output from the avoidance target code detection means 320, the sequential execution guarantee code generation means 330 converts the instruction sequence stored in the source code storage means 300 into an instruction sequence that behaves identically under simultaneous and sequential execution. It refers to the subsequent instruction sequence, uses the r4 register as an unused register, converts the instruction "add r1, r2, r1" in the avoidance target instruction sequence into the instruction "add r1, r2, r5", searches for the subsequent instruction that refers to r1, converts the instruction "add #1, r1, r1" into the instruction "add #1, r5, r1", and then outputs the result to the instruction sequence storage means 340.
[0116]
Thereafter, the one word of the instruction sequence starting at address (10000030)₁₆ has no problem and is therefore output to the instruction sequence storage means 340 unchanged.
[0117]
Through the above processing, the instruction fetch boundary detection means 310 outputs the instruction fetch boundary information indicated by the bold lines in FIG. 12A, and the avoidance target code detection means 320 detects the instruction sequence in the shaded portion. As shown in FIG. 12B, the sequential execution guarantee code generation means 330 changes the output register of the shaded instruction sequence output from the avoidance target code detection means 320, changes the reference register of the darkly shaded instruction sequence that refers to that output register, and outputs the instruction sequence to the instruction sequence storage means 340.
[0118]
In this embodiment, a VLIW processor with an instruction fetch width of 128 bits, variable-length unit instructions of 32 bits and 64 bits, and a maximum of four simultaneously executed instructions is assumed, but the values are not limited to these.
[0119]
Further, the algorithm described for the sequential execution guarantee code generation means searches for an unused register in the instruction sequence, replaces the output register of the instruction that outputs the register in question in the problematic instruction sequence with the unused register, and, in the subsequent word, replaces the reference register of the instruction that refers to the register in question with the replacement register. Alternatively, as with the second program generation device in the first embodiment, an algorithm may be used that transfers the register in question to an unused register in advance and replaces the reference register of the instruction that refers to the register in question with that register.
[0120]
The instruction sequence output by the avoidance target code detection means is a combination of an output instruction and a reference instruction, and thus is not limited to two instructions. When there are a plurality of reference instructions, there may be a combination of three or more instructions.
[0121]
The instruction sequence storage means may be a recording medium such as a floppy disk, a tape, a hard disk, or a memory, or may be an input file to an optimization program such as a compiler or an assembler optimizer. It is possible to further optimize the output file by repeating the process using the optimization program.
[0122]
Further, the instruction fetch width recognized by the instruction fetch boundary detection means does not need to be fixed, and for example, a different value may be set for each memory area. In that case, the instruction fetch boundary detecting means determines the instruction fetch width from the address information.
[0123]
Further, the instruction fetch width information may be incorporated in the program generation apparatus or information may be given from the outside. Specifically, it may be specified as a constant incorporated in a compiler, assembler or linker, or may be specified in the form of an argument or an environment file. Also, the instruction fetch width to be specified may be constant or may be given individually for each space.
[0124]
[Effect of the invention]
As described above, according to the present invention, performance degradation can be suppressed even in an environment where instructions cannot be supplied sufficiently, by executing the instructions that have already been supplied.
[Brief description of the drawings]
FIG. 1 is a block configuration diagram of a processor according to a first embodiment of the present invention.
FIG. 2 is a first program example and pipeline diagram according to the first embodiment of the present invention.
FIG. 3 is a second program example and pipeline diagram in the first and second embodiments of the present invention.
FIG. 4 is a block diagram of a first program generation device in the first and second embodiments of the present invention.
FIG. 5 is a program diagram in the first program generation device in the first and second embodiments of the present invention.
FIG. 6 is a block diagram of a first program generation device in the first and second embodiments of the present invention.
FIG. 7 is a diagram showing a detection algorithm of an avoidance target code detection unit in the first program generation device according to the first embodiment of the present invention.
FIG. 8 is a program diagram of the first program generation device in the first embodiment of the present invention.
FIG. 9 is a block diagram of a second program generation device in the first and second embodiments of the present invention.
FIG. 10 is a program diagram of the second program generation device in the first embodiment of the present invention.
FIG. 11 is a program diagram of the first program generation device in the second embodiment of the present invention.
FIG. 12 is a program diagram of a second program generation device according to the second embodiment of the present invention.
FIG. 13 is a block diagram of a processor in the first conventional example.
FIG. 14 is a diagram showing a first program example.
FIG. 15 is a pipeline diagram of a first program example in a conventional example.
FIG. 16 is a diagram showing a second program example.
FIG. 17 is a pipeline diagram of a second program example in the conventional example.
[Explanation of symbols]
101, 201 Clock
102, 202 PC
110, 210 Memory
111, 211 Address bus
112, 212 Data bus
120, 220 Instruction supply issuer
121, 221 Instruction fetch control unit
122, 222 Instruction register
123 Instruction fetch flag
124 Location information
130, 230 Instruction decoding part
131 Cancel signal generator
132, 232 Decoder
133, 233 Registers
134, 135, 234 Cancel signal
136, 236 Decoder
137, 237 NOP signal generator
223 Storage flag
224 Use flag

Claims

A program generation device for generating a program to be executed by a VLIW processor capable of executing a plurality of fixed-length or variable-length instructions in parallel, the VLIW processor comprising: an instruction supply issuing unit that fetches instructions in units smaller than the total number of bits of the instructions executable in parallel and stores them in instruction registers;
an instruction issuing unit that determines which instruction decoders have been supplied with an instruction; and
a NOP generation unit that, based on the determination by the instruction issuing unit, outputs a NOP as the decoding result for each instruction register in which no instruction is stored and outputs the decoding result unchanged for each instruction register in which an instruction is stored,
wherein, when the plurality of instructions executable in parallel include both instructions that have been fetched and instructions that have not been fetched, the VLIW processor executes only the fetched instructions, the program generation device comprising:
source code storage means for storing source code for the VLIW processor, in which one word is composed of a plurality of unit instructions;
avoidance target code detection means for detecting, in the source code stored in the source code storage means, problem code whose execution result differs between the case where the unit instructions in one word are executed simultaneously and the case where the unit instructions in one word are executed sequentially;
sequential execution guarantee code generation means for replacing the problem code detected by the avoidance target code detection means with code whose execution result does not differ between the case where the unit instructions in one word are executed simultaneously and the case where the unit instructions in one word are executed sequentially; and
generated code storage means for storing the generated code generated by the sequential execution guarantee code generation means.
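The distinction this claim draws between simultaneous and sequential execution of one word can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the specification: the `UnitInstr` model, the toy semantics (the destination register receives the sum of the source registers), and the function names are all hypothetical. The detector is a conservative static check that flags a word when a later unit instruction reads a register written by an earlier unit instruction in the same word.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitInstr:
    """Hypothetical unit instruction: dest receives the sum of the srcs."""
    dest: str
    srcs: tuple


def run_simultaneous(word, regs):
    # All unit instructions read the register state from before the word.
    before = dict(regs)
    after = dict(regs)
    for ins in word:
        after[ins.dest] = sum(before[s] for s in ins.srcs)
    return after


def run_sequential(word, regs):
    # Each unit instruction sees the writes of the ones before it.
    state = dict(regs)
    for ins in word:
        state[ins.dest] = sum(state[s] for s in ins.srcs)
    return state


def is_problem_word(word):
    # Conservative check: a later instruction reads an earlier destination.
    written = set()
    for ins in word:
        if written.intersection(ins.srcs):
            return True
        written.add(ins.dest)
    return False


regs = {"r0": 1, "r1": 2, "r2": 0, "r3": 0}
swap = [UnitInstr("r0", ("r1",)), UnitInstr("r1", ("r0",))]
safe = [UnitInstr("r2", ("r0", "r1")), UnitInstr("r3", ("r0",))]
```

In this sketch the register swap is flagged because its two execution orders disagree, whereas the `safe` word gives the same result either way. How the flagged code is then replaced is left open here.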

The program generation device according to claim 1, further comprising instruction fetch boundary detection means for detecting instruction fetch boundaries in the source code stored in the source code storage means and outputting instruction fetch boundary information,
wherein the avoidance target code detection means detects, in the source code stored in the source code storage means, problem code whose execution result differs between the case where the unit instructions in one word are executed simultaneously and the case where the unit instructions in one word are executed sequentially in units of instruction fetch boundaries, and
the sequential execution guarantee code generation means replaces the problem code with code whose execution result does not differ between the case where the unit instructions in one word are executed simultaneously and the case where the unit instructions in one word are executed sequentially in units of instruction fetch boundaries.
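A minimal sketch of detection in units of instruction fetch boundaries follows, under stated assumptions: the `UnitInstr` model, the representation of boundary information as a list of indices at which a new fetch unit starts inside a word, and both function names are hypothetical and not taken from the specification. Unit instructions within one fetch group are treated as executing simultaneously, and the groups as executing one after another, so a word is flagged only when an instruction reads a register written by an earlier fetch group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitInstr:
    """Hypothetical unit instruction: one destination, some sources."""
    dest: str
    srcs: tuple


def split_at_boundaries(word, boundaries):
    # boundaries: indices inside the word where a new fetch unit starts.
    inner = sorted({b for b in boundaries if 0 < b < len(word)})
    cuts = [0] + inner + [len(word)]
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]


def is_problem_word_at_boundaries(word, boundaries):
    written = set()  # registers written by earlier fetch groups
    for group in split_at_boundaries(word, boundaries):
        if any(written.intersection(ins.srcs) for ins in group):
            return True
        written.update(ins.dest for ins in group)
    return False


swap = [UnitInstr("r0", ("r1",)), UnitInstr("r1", ("r0",))]
```

With no boundary inside the word, the swap stays within one fetch group and is not flagged. With a boundary between its two unit instructions, the second instruction would see the first one's write, so the word is flagged.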