KR102915344B1

KR102915344B1 - Processor memory access

Info

Publication number: KR102915344B1
Application number: KR1020217002974A
Authority: KR
Inventors: 칼레드 마알레즈; 트룽-둥 응우엔; 줄리엔 쉬미트; 피에르-엠마뉴엘 베르나르드
Original assignee: 브이소라
Priority date: 2018-06-29
Filing date: 2019-05-21
Publication date: 2026-01-20
Anticipated expiration: 2039-05-21
Also published as: US20210271488A1; CN112602058A; CN112602058B; EP3814893A1; FR3083350B1; KR20210021587A; FR3083350A1; EP3814893C0; EP3814893B1; WO2020002782A1; US11640302B2

Abstract

A computing device is provided, comprising: - a plurality of ALUs (9); - a set of registers (11); - a memory (13); - a memory interface between the registers (11) and the memory (13); - a control unit (5), the control unit (5) controlling the ALUs (9) by generating: - at least one cycle i; and - at least one cycle ii following the at least one cycle i, wherein the at least one cycle i comprises both implementing at least one first computing operation via an arithmetic logic unit (9) and downloading a first dataset (AA4_7; BB4_7) from the memory (13) to at least one register (11), and the at least one cycle ii comprises implementing a second computing operation via an arithmetic logic unit (9), during which at least a part (A4; B4) of the first dataset (AA4_7; BB4_7) forms at least one operand.

Description

Processor memory access

The present invention relates to the field of processors and to the interactions between memory units and processors.

Conventionally, a computing device includes a set of one or more processors. Each processor includes one or more processing units, or PUs. Each PU includes one or more computing units, referred to as arithmetic logic units, or ALUs. To obtain a high-performance computing device, i.e., one that performs computing operations quickly, it is conventional to provide a large number of ALUs. The ALUs can thus process operations in parallel, i.e., simultaneously. The unit of time is then the computing cycle. It is therefore common to quantify the computing power of a computing device in terms of the number of operations it can perform per computing cycle.

However, if the elements of the device that interact with the ALUs are not designed (sized) for the number of ALUs that are required to operate at the same time, then having a large number of ALUs is inappropriate or even unnecessary. In other words, when a large number of ALUs are present, the configuration of the ALUs' environment may be the factor that limits the power of the device. In particular, the device includes a memory assembly, which itself includes one or more memory units, each of which has a fixed number of memory locations in which computing data can be permanently stored. During computing operations, the ALUs receive data from the memory units at their inputs and supply data at their outputs, some of which is in turn stored in the memory units. It should therefore be understood that, in addition to the number of ALUs, the number of memory units is another factor that determines the computing power of the device.

Data is routed between the ALUs and the memory units, in both directions, by a bus of the device. The term "bus" is used herein in its general sense of a system (or interface) for transferring data, including the hardware (interface circuitry) and the protocols that govern the exchanges. The bus carries the data itself, addresses, and control signals. Each bus also has hardware and software limitations of its own, so that the routing of data is restricted. In particular, the bus has a limited number of ports on the memory unit side and a limited number of ports on the ALU side. In the present context, the memory units are considered to be single-port, i.e., read and write operations are implemented during different cycles, in contrast to what are referred to as "dual-port" memories (which are more expensive in terms of surface area and require larger dual control buses for reading and writing). Thus, during a computing cycle, a memory location is accessible via the bus in a single direction (in "read" mode or in "write" mode). Furthermore, during a computing cycle, a memory location is accessible to only a single ALU. As a variant, the proposed technical solutions can be implemented with so-called "dual-port" memories. In such embodiments, read and write operations can be implemented during one and the same computing cycle.

Between the bus and the ALUs, the computing device typically includes a set of registers and local memory units, which can be viewed as memories separate from the aforementioned memory units. For ease of understanding, a distinction is made herein between "registers," which are intended to store data, and "local memory units," which are intended to store memory addresses. Each register is assigned to an ALU of a PU, and a PU is assigned a plurality of registers. The storage capacity of the registers is very limited compared to that of the memory units, but the contents of the registers are directly accessible to the ALUs.

To perform a computing operation, each ALU must generally first obtain the input data for that operation, typically the two operands of an elementary computing operation. A "read" operation is therefore implemented on the corresponding memory location, via the bus, to import each of the two operands into a register. The ALU then performs the computing operation itself on the basis of the data from the register, exporting the result, in the form of an item of data, to a register. Finally, a "write" operation is implemented to record the result of the computing operation in a memory location. During this write operation, the result stored in the register is written into the memory location via the bus. Each of these operations a priori consumes one or more computing cycles.
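By way of illustration only, the read, compute, and write sequence described above can be sketched as a toy model. The class `ToyMachine`, its method names, the register names `r0` to `r2`, and the assumption that every operation costs exactly one cycle with no overlap are hypothetical and are not taken from the present description.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMachine:
    """Hypothetical single-ALU model: one operation per cycle, no overlap."""
    memory: dict
    registers: dict = field(default_factory=dict)
    cycles: int = 0

    def read(self, reg, addr):
        # "Read": memory location -> register, via the bus.
        self.registers[reg] = self.memory[addr]
        self.cycles += 1

    def compute(self, dst, a, b, op):
        # The ALU takes two register operands and exports the result to a register.
        self.registers[dst] = op(self.registers[a], self.registers[b])
        self.cycles += 1

    def write(self, addr, reg):
        # "Write": register -> memory location, via the bus.
        self.memory[addr] = self.registers[reg]
        self.cycles += 1


def add_once(machine, addr_a, addr_b, addr_out):
    """One elementary addition: two reads, one computation, one write."""
    machine.read("r0", addr_a)
    machine.read("r1", addr_b)
    machine.compute("r2", "r0", "r1", lambda x, y: x + y)
    machine.write(addr_out, "r2")
    return machine.cycles


m = ToyMachine(memory={0: 3, 1: 4, 2: None})
total_cycles = add_once(m, 0, 1, 2)
```

Under these assumptions a single elementary operation occupies four cycles, only one of which is spent in the ALU, which is the overhead that the remainder of the description seeks to reduce.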

In known computing devices, it is common to attempt to execute a plurality of operations (or instructions) during one and the same computing cycle in order to reduce the total number of computing cycles and thus increase efficiency. Reference is then made to parallel "processing chains" or "pipelines." However, there are often numerous interdependencies between operations. For example, an elementary computing operation cannot be performed until its operands have been read and are accessible in a register for the ALU. Implementing processing chains therefore involves checking the interdependencies between operations (instructions), which is complex and therefore expensive.

A plurality of independent operations are usually implemented during one and the same computing cycle. In general, for a given ALU and during one and the same computing cycle, it is possible to perform a computing operation and a read or write operation. In contrast, for a given ALU and during one and the same computing cycle, it is not possible (in the case of single-port memory units) to perform a read operation and a write operation simultaneously. Moreover, the memory access operations (the bus) do not make it possible, during one and the same computing cycle and for a given memory location, to perform read or write operations for two separate ALUs.

Therefore, to ensure that each ALU is as active as possible (without losing any computing cycle), it is intuitive to try to achieve a situation in which, in each computing cycle, three memory locations are accessible to each of the ALUs: two intended to supply operands to the two inputs of the ALU (read mode), and one intended to receive the result of the elementary computing operation from the ALU (write mode). The two read operations are thus chosen so as to obtain (and store in a register) the operands required for the elementary computing operation to be implemented during the following computing cycle. To increase computing power, it is therefore intuitive to provide both a large number of ALUs and a proportional number of memory locations (e.g., at least three times as many memory locations as ALUs).

However, increasing the number of ALUs and the number of memory units increases the complexity of the interactions between these two types of elements. Increasing the number of ALUs in the device and the number of memory units that can be connected to them leads to a non-linear increase in the complexity of the bus. Increasing the number of ALUs and the number of memory units is therefore complex and expensive.

The present invention aims to improve this situation.

What is proposed is a computing device comprising:

- a plurality of arithmetic logic units;

- a set of registers able to supply operand-type data to the inputs of said arithmetic logic units and able to receive data from the outputs of said arithmetic logic units;

- a memory;

- a memory interface via which data is transferred and routed between the registers and the memory;

- a control unit configured to control the arithmetic logic units according to a processing chain microarchitecture such that the arithmetic logic units perform computing operations in parallel with one another, the control unit also being designed to control memory access operations via the memory interface. The control operations generate:

- at least one cycle i, which includes both implementing at least one first computing operation via an arithmetic logic unit and downloading a first dataset from the memory to at least one register;

- at least one cycle ii, following the at least one cycle i, which includes implementing a second computing operation via an arithmetic logic unit, at least a portion of the first dataset forming at least one operand for the second computing operation.
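By way of illustration only, the relationship between cycle i and cycle ii can be sketched as follows. The operand names `A0` to `A7` and the dataset made up of `A4` to `A7` echo the reference signs AA4_7 and A4 of the abstract; the helper `load_block`, the stored values, and the choice of an addition are assumptions. Sequential Python statements stand in for actions that the device would perform within one and the same cycle.

```python
# Toy memory: eight locations named after the operands they hold (values are made up).
memory = {f"A{k}": k * 10 for k in range(8)}


def load_block(registers, names):
    """Copy a whole dataset (several memory locations) into registers in one access."""
    for name in names:
        registers[name] = memory[name]


registers = {}
# Assumed to have been loaded before cycle i.
load_block(registers, ["A0", "A1", "A2", "A3"])

# Cycle i: a first computing operation on earlier data AND the download of dataset AA4_7.
first_result = registers["A0"] + registers["A1"]
load_block(registers, ["A4", "A5", "A6", "A7"])

# Cycle ii: a second computing operation; part of AA4_7 (here A4) forms an operand.
second_result = first_result + registers["A4"]
```

In this sketch, A5, A6, and A7 are already in registers at the end of cycle i, so later cycles can consume them without a further memory access.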

Such a device makes it possible, in a single operation, to read a dataset from a memory unit and to store said data temporarily in a register. Not all of the data read in a computing cycle t can be used in the immediately following computing cycle t+1. In at least some cases, some of the data read is not needed during the following computing cycle t+1 but is used during a later computing cycle t+1+n, without any need to perform an additional read operation and thus to consume an additional computing cycle.

In the field of parallel data processing using computing devices, the usual approach is to schedule a dedicated memory access operation to read each item of data required by the next elementary computing operation to be implemented, and only those items. This usual approach may be referred to as "just-in-time memory access." In this approach, reading (and storing in a register) an item of data that is not immediately required is considered unnecessary. Each memory access operation thus necessarily precedes the elementary computing operation itself in time, but each memory access operation is scheduled directly and solely on the basis of the next elementary computing operation. The applicant has departed from the prior art in this field by implementing a different approach.

The applicant therefore proposes an approach in which, at each read operation, the number of data items read is greater than the number strictly necessary to implement the next computing operation. In contrast to the usual approach, this approach may be referred to as "provisional memory access." In this case, an item of data among the data read may be used for a future computing operation other than the one implemented immediately after the read operation. In such cases, the required data has been obtained during a single memory access operation (with an increase in the bandwidth of the memory), whereas the usual approach would have required at least two separate memory access operations. The effect of the approach proposed by the applicant is therefore, at least in some cases, to reduce the number of computing cycles consumed by memory access operations, which makes it possible to improve the efficiency of the device. Over a long period (a plurality of consecutive computing cycles), the number of memory access operations (in read mode and/or in write mode) is reduced.
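By way of illustration only, the difference in the number of read operations between the two approaches can be estimated with a simple count. The function names and the parameter `block_size` are hypothetical, and the sketch assumes the most favorable case, in which every item downloaded in advance is eventually used and none is overwritten in its register beforehand.

```python
import math


def just_in_time_reads(num_operands):
    # One dedicated read per operand, scheduled right before the operation that needs it.
    return num_operands


def provisional_reads(num_operands, block_size):
    # Each read brings in block_size operands at once; later operations
    # reuse them from the registers instead of returning to memory.
    return math.ceil(num_operands / block_size)


reads_jit = just_in_time_reads(16)
reads_provisional = provisional_reads(16, 4)
```

For 16 operands and datasets of four items, the count falls from 16 read operations to 4, at the price of a wider transfer on each access.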

This approach does not rule out losses, in which some of the data read and stored in a register is lost (erased by other data stored in the same register) even before being used in a computing operation. However, over a large number of computing operations and computing cycles, the applicant has observed an improvement in performance, even when the datasets read are not selected. In other words, even without selection (or with a random selection) of the data read, this approach makes it possible to statistically improve the efficiency of the computing device compared to the usual approach.

According to another aspect, what is proposed is a data processing method implemented by a control unit of a computing device, said device comprising:

- a plurality of arithmetic logic units;

- a memory;

- the control unit, which is configured to control the arithmetic logic units according to a processing chain microarchitecture such that the arithmetic logic units perform computing operations in parallel with one another, the control unit also being designed to control memory access operations via a memory interface.

The method comprises at least:

- generating a cycle i, which includes both implementing at least one first computing operation via an arithmetic logic unit and downloading a first dataset from the memory to at least one register;

- generating a cycle ii, following the cycle i, which includes implementing a second computing operation via an arithmetic logic unit, at least a portion of the first dataset forming at least one operand for the second computing operation.

According to another aspect, what is proposed is a computer program, in particular a compilation computer program, comprising instructions for implementing some or all of a method as defined herein when the program is executed by a processor. According to another aspect, what is proposed is a non-transient computer-readable recording medium on which such a program is recorded.

The following features may optionally be implemented. They may be implemented independently of one another or in combination with one another.

- The control unit is also configured, before controlling the arithmetic logic units and the memory access operations, to implement an identification algorithm for identifying the first dataset to be downloaded during the at least one cycle i on the basis of the second computing operation to be implemented during the at least one cycle ii. This makes it possible to adapt the data read to the computing operations to be performed, and thus to improve the relevance of the data in the first dataset that is downloaded.

- The control unit is configured to implement two separate cycles i, such that two separate first datasets are downloaded to at least one register, at least a portion of each of the two first datasets forming an operand for the second computing operation of the at least one cycle ii. This makes it possible to combine computing operations and memory access operations during said at least two cycles. Thus, at the end of the two cycles i, all the operands required for the subsequent computing operations may have been downloaded.

- The control unit is configured to implement a plurality of separate cycles ii, such that the portion of the first dataset that forms at least one operand for the second computing operation differs from one cycle ii of the plurality to another. This makes it possible to implement a plurality of elementary computing operations on a single downloaded dataset.

- The control unit is configured to perform at least two iterations of a series of at least one cycle i and one cycle ii, said two iterations being at least partially superimposed such that at least one cycle ii of the first iteration forms a cycle i of the following iteration. This makes it possible to perform the computing operations and memory access operations over a particularly limited number of computing cycles, and thus to further increase efficiency.

- The control unit is configured to perform, prior to a first cycle i, an initialization phase that comprises downloading, from the memory to at least one register, at least one dataset forming operands for the first computing operation of said first cycle i. This makes it possible to initialize the method so that cycle i and cycle ii can then be repeated as many times as necessary for each of the datasets to be processed.

- The control unit is also designed to control memory access operations via the memory interface such that the control operations generate:

- during cycle i, the implementation of a plurality of first computing operations by a plurality of arithmetic logic units;

- during cycle ii, the implementation of a plurality of second computing operations by a plurality of arithmetic logic units,

the grouping of the data for each dataset to be downloaded being selected to match the distribution of the computing operations assigned to each of the arithmetic logic units, such that said arithmetic logic units operate synchronously, asynchronously, or in a mixed manner. This makes it possible to further improve efficiency by adapting the coordinated operation of the ALUs to the processing operations to be performed and to the available resources.
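By way of illustration only, the initialization phase and the superimposed iterations mentioned among the optional features above can be sketched as a loop in which each step both computes on the dataset already held in registers and downloads the next one. The function `run_overlapped`, the four-lane layout, the accumulate operation, and the one-cycle cost per step are assumptions made for this sketch.

```python
def run_overlapped(blocks, lanes=4):
    """Accumulate each lane over all blocks, overlapping compute and download."""
    accumulators = [0] * lanes
    cycles = 0

    # Initialization phase: download the operands of the very first computing operation.
    in_registers = blocks[0]
    cycles += 1

    for k in range(len(blocks)):
        # Same cycle: every ALU lane computes on the dataset already in registers ...
        for lane in range(lanes):
            accumulators[lane] += in_registers[lane]
        # ... while the next dataset is downloaded for the following cycle.
        if k + 1 < len(blocks):
            in_registers = blocks[k + 1]
        cycles += 1

    return accumulators, cycles


blocks = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
lane_sums, cycle_count = run_overlapped(blocks)
```

Each step of the loop plays the role of cycle ii for the dataset being consumed and of cycle i for the dataset being downloaded, so that three datasets are processed in four cycles under these assumptions.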

Other features, details, and advantages of the present invention will become apparent on reading the detailed description below and on studying the accompanying drawings, in which:
- Figure 1 shows the architecture of a computing device according to the present invention;
- Figure 2 is a partial depiction of the architecture of a computing device according to the present invention;
- Figure 3 shows one example of a memory access operation according to the present invention; and
- Figure 4 is a variant of the example from Figure 3.

The drawings and the description below essentially contain elements of a specific nature. They may therefore serve not only to provide a better understanding of the present invention but also, where appropriate, to contribute to its definition.

Figure 1 shows an example of a computing device (1). The device (1) comprises a set of one or more processors (3) (sometimes referred to as central processing units or CPUs). The set of processor(s) (3) comprises at least one control unit (5) and at least one processing unit (7), or PU (7). Each PU (7) comprises one or more computing units, referred to as arithmetic logic units (9) or ALUs (9). In the example described herein, each PU (7) also comprises a set of registers (11). The device (1) comprises at least one memory (13) capable of interacting with the set of processor(s) (3). For this purpose, the device (1) also comprises a memory interface (15), or "bus".

In the present context, the memory units are considered to be single-port, i.e., read and write operations are implemented during different cycles, in contrast to what are referred to as "dual-port" memories (which are more expensive in terms of surface area and require larger dual control buses for reading and writing). As a variant, the proposed technical solutions can be implemented with so-called "dual-port" memories. In such embodiments, read and write operations can be implemented during one and the same computing cycle.

Figure 1 shows three PUs (7): PU 1, PU X, and PU N. To simplify Figure 1, only the structure of PU X is shown in detail; the structures of the other PUs are similar. In some variants, the number of PUs is different. The device (1) may include a single PU, two PUs, or more than three PUs.

In the example described herein, PU X includes four ALUs: ALU X.0, ALU X.1, ALU X.2, and ALU X.3. In some variants, the PUs may include numbers of ALUs that differ from one another and/or that differ from four, including a single ALU. Each PU includes a set of registers (11), at least one register (11) being assigned to each ALU. In the example described herein, PU X includes a single register (11) per ALU, i.e., four registers, referenced REG X.0, REG X.1, REG X.2, and REG X.3, which are assigned to ALU X.0, ALU X.1, ALU X.2, and ALU X.3, respectively. In some variants, each ALU is assigned a plurality of registers (11).

Each register (11) is able to supply operand data to the inputs of said ALUs (9) and to receive data from the outputs of said ALUs (9). Each register (11) is also able to store data from the memory (13) obtained via the bus (15), through what is referred to as a "read" operation. Each register (11) is also able to transfer stored data to the memory (13) via the bus (15), through what is referred to as a "write" operation. The read and write operations are managed by the control unit (5), which controls the memory access operations.

The control unit (5) imposes on each ALU (9) the manner in which it performs basic computing operations, in particular their order, and assigns to each ALU (9) the operations to be executed. In the example described here, the control unit (5) is configured to control the ALUs (9) according to a processing chain microarchitecture, so that the ALUs (9) perform computing operations in parallel with one another. For example, the device (1) has a single instruction stream, multiple data stream architecture, referred to as SIMD for "Single Instruction Multiple Data", and/or a multiple instruction stream, multiple data stream architecture, referred to as MIMD for "Multiple Instruction Multiple Data". The control unit (5) is also designed to control the memory access operations via the memory interface (15), in this case the read and write operations. The two types of control (computing and memory access) are shown by dashed arrows in Figure 1.

Reference is now made to Figure 2, in which a single ALU Y is shown. Data transfers are shown by solid arrows. Since data is transferred step by step, it should be understood that Figure 2 does not necessarily represent an instant t with simultaneous data transfers. Rather, for an item of data to be transferred from a register (11) to an ALU (9), for example, that item must first have been transferred from the memory (13) to the register (11), in this case via the memory interface (15) (or bus).

In the example of Figure 2, three registers (11), referenced REG Y.0, REG Y.1, and REG Y.2, are assigned to an ALU referenced ALU Y. Each ALU (9) has at least three ports, specifically two inputs and one output. For each operation, at least two operands are received, by the first input and the second input respectively. The result of the computing operation is transmitted via the output. In the example shown in Figure 2, the operands received at the inputs come from register REG Y.0 and register REG Y.2, respectively. The result of the computing operation is written to register REG Y.1. Once written to register REG Y.1, the result (in the form of an item of data) is written to the memory (13) via the memory interface (15). In some variants, at least one ALU may have more than two inputs and may receive more than two operands for a computing operation.
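The data path of Figure 2 can be pictured with a minimal Python sketch. The dictionary names, the memory addresses, the sample values, and the choice of addition as the operation are illustrative assumptions and do not come from the source.

```python
# Hypothetical model of the Figure 2 data path; names and values are assumptions.
memory = {"x": 3, "y": 4, "result": None}   # stands in for memory (13)
registers = {"REG Y.0": None, "REG Y.1": None, "REG Y.2": None}


def read(reg, address):
    """Read operation: copy an item from memory into a register."""
    registers[reg] = memory[address]


def write(reg, address):
    """Write operation: copy an item from a register back to memory."""
    memory[address] = registers[reg]


def alu_y(op_a, op_b):
    """ALU Y with two inputs and one output (addition chosen for illustration)."""
    return op_a + op_b


# Operands must reach the registers before the ALU can use them.
read("REG Y.0", "x")
read("REG Y.2", "y")
# The result goes to REG Y.1 first, then on to memory.
registers["REG Y.1"] = alu_y(registers["REG Y.0"], registers["REG Y.2"])
write("REG Y.1", "result")
```

The sketch only fixes the order of the transfers (memory to register, register to ALU, ALU to register, register to memory); it says nothing about how many cycles each step takes.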

Each ALU (9) can perform:

- integer arithmetic operations on data (addition, subtraction, multiplication, division, etc.);

- floating-point arithmetic operations on data (addition, subtraction, multiplication, division, inversion, square root, logarithms, trigonometry, etc.);

- logic operations (two's complement, "AND", "OR", "exclusive OR", etc.).

The ALUs (9) do not exchange data directly with one another. For example, if the result of a first computing operation performed by a first ALU constitutes an operand for a second computing operation to be performed by a second ALU, the result of the first computing operation must at least be written to a register (11) before it can be used by the second ALU.

In some embodiments, data written to a register (11) is also automatically written to the memory (13) (via the memory interface (15)), even if that item of data was obtained only to serve as an operand and not as a result of the processing operation as a whole.

In some embodiments, data obtained to serve as an operand and having only short-lived relevance (intermediate results that are of no interest at the end of the processing operation as a whole) is not automatically written to the memory (13) and may be stored only temporarily in a register (11). For example, if the result of a first computing operation performed by a first ALU constitutes an operand for a second computing operation to be performed by a second ALU, the result of the first computing operation must be written to a register (11). That item of data is then transferred as an operand directly from the register (11) to the second ALU. It should be understood that the assignment of a register (11) to an ALU (9) may then evolve over time, in particular from one computing cycle to another. This assignment may take the form of addressing data that makes it possible at all times to locate an item of data, whether it is in a register (11) or at a location in the memory (13).

In the following description, the operation of the device (1) is described for a processing operation applied to computing data. The processing operation is decomposed into a set of operations, including computing operations performed in parallel by a plurality of ALUs (9) over a period made up of a series of computing cycles. The ALUs (9) are then said to operate according to a processing chain microarchitecture. However, the processing operation implemented by the device (1) and considered here may itself form part (or a subset) of a broader computing process. In other parts or subsets, this broader process may include computing operations performed in a non-parallel manner by a plurality of ALUs, for example in a serial operating mode or in cascade.

The operating architectures (parallel or serial) may be constant or dynamic, for example imposed (controlled) by the control unit (5). The architecture may vary, for example, depending on the data to be processed and on the current instructions received at the input of the device (1). Such dynamic adaptation of the architecture may be implemented as early as the compilation stage, by adapting the machine instructions generated by the compiler according to the type of data to be processed and the instructions, when these can be deduced from the source code. Such adaptation may also be implemented solely in the device (1), or processor, when it executes conventional machine code and is programmed to implement a set of configuration instructions that depends on the data to be processed and on the instructions currently received.

The memory interface (15), or "bus", transmits and routes data in both directions between the ALUs (9) and the memory (13). The memory interface (15) is controlled by the control unit (5). The control unit (5) thus controls access to the memory (13) of the device (1) via the memory interface (15).

The control unit (5) controls, in a coordinated manner, the memory access operations and the (computing) operations implemented by the ALUs (9). Control by the control unit (5) includes implementing a series of operations decomposed into computing cycles. The control includes generating a first cycle i and a second cycle ii. The first cycle i precedes the second cycle ii in time. As described in more detail in the examples below, the second cycle ii may immediately follow the first cycle i, or the first cycle i and the second cycle ii may be separated in time, for example by intermediate cycles.

The first cycle i includes:

- implementing a first computing operation via at least one ALU (9); and

- downloading a first dataset from the memory (13) to at least one register (11).

The second cycle ii includes implementing a second computing operation via at least one ALU (9). The second computing operation may be implemented by the same ALU (9) as the first computing operation or by a separate ALU (9). At least part of the first dataset downloaded during the first cycle i forms an operand for the second computing operation.
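The relationship between cycle i and cycle ii can be sketched as follows. This is a minimal illustration under stated assumptions: the dataset names follow those used for Figure 3, while the register labels, the sample values, the `run_cycle` helper, and the use of addition are hypothetical.

```python
# Hypothetical contents of memory (13); dataset names follow Figure 3, values are made up.
memory = {
    "AA0_3": [1, 2, 3, 4], "BB0_3": [10, 20, 30, 40],
    "AA4_7": [5, 6, 7, 8], "BB4_7": [50, 60, 70, 80],
}
# Registers already holding the first datasets (assumed loaded in earlier cycles).
registers = {"REG A": list(memory["AA0_3"]), "REG B": list(memory["BB0_3"])}


def run_cycle(operands, load=None):
    """One computing cycle: an ALU addition and, optionally, a dataset read."""
    a, b = operands                      # operands come from registers
    result = a + b                       # computing operation (addition assumed)
    if load is not None:                 # memory access overlapped with the computation
        register, dataset = load
        registers[register] = list(memory[dataset])
    return result


# Cycles i: compute C0 and C1 while AA4_7 and BB4_7 are downloaded.
c0 = run_cycle((registers["REG A"][0], registers["REG B"][0]), load=("REG A'", "AA4_7"))
c1 = run_cycle((registers["REG A"][1], registers["REG B"][1]), load=("REG B'", "BB4_7"))
# Cycle ii: A4 and B4, parts of the datasets loaded during cycles i, are the operands.
c4 = run_cycle((registers["REG A'"][0], registers["REG B'"][0]))
```

The point of the sketch is that the operands of the cycle ii computation were already in registers when that cycle started, because they were downloaded while earlier computations were running.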

Reference is now made to Figure 3. Items of data, or blocks of data, referenced A0 to A15 are stored in the memory (13). In this example, the data A0 to A15 are grouped in fours as follows:

- a dataset referenced AA0_3, consisting of data A0, A1, A2, and A3;

- a dataset referenced AA4_7, consisting of data A4, A5, A6, and A7;

- a dataset referenced AA8_11, consisting of data A8, A9, A10, and A11; and

- a dataset referenced AA12_15, consisting of data A12, A13, A14, and A15.

As a variant, the data may be grouped differently, in particular in groups (or "blocks" or "slots") of two, three, or more than four. A dataset may be seen as a group of data accessible in the memory (13) via a single port of the memory interface (15) during a single read operation. Likewise, the data of a dataset may be written to the memory (13) via a single port of the memory interface (15) during a single write operation.

Thus, during a first cycle i, at least one of the datasets AA0_3, AA4_7, AA8_11, and/or AA12_15 is downloaded to at least one register (11). In the example of the figure, each of the datasets AA0_3, AA4_7, AA8_11, and AA12_15 is downloaded to a respective register (11), that is, to four separate registers (11). Each of these registers (11) is at least temporarily assigned to a respective ALU (9), referenced here ALU 0, ALU 1, ALU 2, and ALU 3. During this same cycle i, the ALUs (9) may have implemented a computing operation.

During a second cycle ii, each ALU (9) implements a computing operation in which at least one of the items of data stored in the corresponding register (11) forms an operand. For example, ALU 0 implements a computing operation in which one of the operands is A0. A1, A2, and A3 may remain unused during the second cycle ii.

Generally speaking, downloading data from the memory (13) to a register (11) consumes less computing time than implementing a computing operation via an ALU (9). In general, it may therefore be considered that a memory access operation (here a read operation) consumes a single computing cycle, whereas implementing a computing operation via an ALU (9) consumes one computing cycle or several consecutive computing cycles (for example, four).

In the example of Figure 3, several registers (11) are assigned to each ALU (9); they are shown as groups of registers (11) referenced REG A, REG B, and REG C. The data downloaded from the memory (13) to the registers (11) correspond to groups REG A and REG B. Group REG C is intended here to store the data obtained through the computing operations implemented by the ALUs (9) (during a write operation).

The registers (11) of groups REG B and REG C may therefore contain datasets referenced in the same way as those of REG A:

- group REG B includes four registers (11) storing, respectively, a dataset BB0_3 consisting of data B0 to B3, a dataset BB4_7 consisting of data B4 to B7, a dataset BB8_11 consisting of data B8 to B11, and a dataset BB12_15 consisting of data B12 to B15;

- group REG C includes four registers (11) storing, respectively, a dataset CC0_3 consisting of data C0 to C3, a dataset CC4_7 consisting of data C4 to C7, a dataset CC8_11 consisting of data C8 to C11, and a dataset CC12_15 consisting of data C12 to C15.

In the example of Figure 3, the data AN and BN constitute the operands of a computing operation implemented by an ALU (9), whereas the item of data CN constitutes its result, where "N" is an integer between 0 and 15. For example, in the case of addition, CN = AN + BN. In this example, the data processing operation implemented by the device (1) corresponds to 16 operations. The 16 operations are independent of one another, in that none of their results is needed to implement any of the other 15 operations.
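The independence of the 16 operations can be checked with a short sketch. The operand values below are assumptions chosen only for illustration; the relation CN = AN + BN is the addition case given in the text.

```python
# Sample operand values are assumptions chosen for illustration.
A = list(range(16))              # A0 .. A15
B = [10 * n for n in range(16)]  # B0 .. B15

# CN = AN + BN: each result depends only on its own pair of operands.
C = [A[n] + B[n] for n in range(16)]

# Because the operations are independent, any evaluation order gives the same results.
C_reversed = [None] * 16
for n in reversed(range(16)):
    C_reversed[n] = A[n] + B[n]
```

Since no CN feeds another operation, the order in which the ALUs process the pairs is free, which is what allows the cycle schedules discussed in the text.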

The implementation of the processing operation (the 16 operations) may therefore be decomposed into 18 cycles, for example as follows.

Example 1:

- Cycle #0: read AA0_3;

- Cycle #1: read BB0_3;

- Cycle #2: compute C0 (from set CC0_3) and read AA4_7 (e.g., forming a cycle i);

- Cycle #3: compute C1 (from set CC0_3) and read BB4_7 (e.g., forming a cycle i);

- Cycle #4: compute C2 (from set CC0_3);

- Cycle #5: compute C3 (from set CC0_3) and write CC0_3;

- Cycle #6: compute C4 (from set CC4_7) and read AA8_11 (e.g., forming a cycle ii);

- Cycle #7: compute C5 (from set CC4_7) and read BB8_11 (e.g., forming a cycle ii);

- Cycle #8: compute C6 (from set CC4_7) (e.g., forming a cycle ii);

- Cycle #9: compute C7 (from set CC4_7) and write CC4_7 (e.g., forming a cycle ii);

- Cycle #10: compute C8 (from set CC8_11) and read AA12_15;

- Cycle #11: compute C9 (from set CC8_11) and read BB12_15;

- Cycle #12: compute C10 (from set CC8_11);

- Cycle #13: compute C11 (from set CC8_11) and write CC8_11;

- Cycle #14: compute C12 (from set CC12_15);

- Cycle #15: compute C13 (from set CC12_15);

- Cycle #16: compute C14 (from set CC12_15);

- Cycle #17: compute C15 (from set CC12_15) and write CC12_15.
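The schedule of Example 1 can be regenerated programmatically. The following is a minimal sketch, not the control unit's actual logic: the function name `build_schedule`, the textual action labels, and the restriction to datasets of at least three items (so that no cycle carries more than one memory access) are assumptions made for the illustration.

```python
def build_schedule(n_items=16, d=4):
    """Return one list of actions per cycle, following the pattern of Example 1.

    n_items: number of independent operations CN = AN + BN.
    d: number of items per dataset (assumed >= 3 in this sketch).
    """
    if d < 3:
        raise ValueError("this sketch assumes d >= 3 (one memory access per cycle)")

    def label(prefix, lo, hi):
        # A lone item is named like "A15"; a dataset like "AA12_14".
        return f"{prefix}{lo}" if lo == hi else f"{prefix}{prefix}{lo}_{hi}"

    sets = [(lo, min(lo + d, n_items) - 1) for lo in range(0, n_items, d)]
    first_lo, first_hi = sets[0]
    # Initialization: one read per operand dataset before any computation.
    schedule = [[f"read {label('A', first_lo, first_hi)}"],
                [f"read {label('B', first_lo, first_hi)}"]]
    for k, (lo, hi) in enumerate(sets):
        nxt = sets[k + 1] if k + 1 < len(sets) else None
        for n in range(lo, hi + 1):
            actions = [f"compute C{n}"]
            if nxt and n == lo:            # prefetch next A dataset
                actions.append(f"read {label('A', *nxt)}")
            elif nxt and n == lo + 1:      # prefetch next B dataset
                actions.append(f"read {label('B', *nxt)}")
            if n == hi:                    # results of the pattern are written back
                actions.append(f"write {label('C', lo, hi)}")
            schedule.append(actions)
    return schedule


schedule = build_schedule(16, 4)
for number, actions in enumerate(schedule):
    print(f"Cycle #{number}: " + " and ".join(actions))
```

Running the sketch prints 18 lines whose compute, read, and write actions match the cycles listed above.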

It should be understood that, apart from the initial cycles #0 and #1, the memory access operations (read and write operations) are then implemented in parallel with the computing operations, without consuming any additional computing cycle. Reading datasets containing several items (or blocks) of data, rather than a single item of data, makes it possible to finish bringing the data from the memory (13) into the registers even before those data are needed as operands for a computing operation.

For the computation performed in cycle #2 above, if only the item of data immediately needed (A0) had been read, rather than the whole set AA0_3 = {A0; A1; A2; A3}, three additional read operations would subsequently have been needed to obtain A1, A2, and A3.

For better understanding, and for comparison, an implementation of the processing operation in which a single item of data is read each time, rather than a dataset containing several items of data, is reproduced below. It can be seen that 48 cycles are required.

Example 0:

- Cycle #0: read A0;

- Cycle #1: read B0;

- Cycle #2: compute C0 and write C0;

- Cycle #3: read A1;

- Cycle #4: read B1;

- Cycle #5: compute C1 and write C1;

- ...

- Cycle #45: read A15;

- Cycle #46: read B15;

- Cycle #47: compute C15 and write C15.
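The 48-cycle count of Example 0 follows directly from three cycles per operation. A minimal sketch, in which the function name and the action labels are assumptions:

```python
def naive_schedule(n_items=16):
    """Item-by-item schedule of Example 0: two reads, then compute and write."""
    schedule = []
    for n in range(n_items):
        schedule.append([f"read A{n}"])
        schedule.append([f"read B{n}"])
        schedule.append([f"compute C{n}", f"write C{n}"])
    return schedule


schedule = naive_schedule(16)
print(len(schedule))  # 3 cycles per operation, 16 operations
```

Here no read is overlapped with a computation, so every operand fetch costs a full cycle of its own.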

Note that in Example 1 (18 cycles), the first two cycles, cycle #0 and cycle #1, constitute initialization cycles. The number of initialization cycles (I) corresponds to the number of operands per computing operation. A pattern of four consecutive cycles is then repeated four times; for example, cycles #2 to #5 together form one pattern. The number of cycles per pattern corresponds to the number of items of data per dataset (D), whereas the number of patterns corresponds to the number of datasets to be processed (E). The total number of cycles can therefore be expressed as I + D*E, here 2 + 4*4 = 18.

Achieving good performance amounts to reducing the total number of cycles to a minimum. Under the conditions considered, that is, 16 independent basic operations each of which can be implemented in a single cycle, the optimal number of cycles is therefore equal to the number of basic operations (16) plus the initialization phase (2 cycles), for a total of 18 cycles.

In one variant, the number of items of data accessible in a single cycle (in read mode or in write mode), that is, the number of items of data per dataset (D), is considered to be three (rather than four), for example because of hardware limitations. The series of cycles may then be decomposed, for example, as follows:

- an initialization phase of 2 cycles; then

- 5 patterns of 3 cycles, covering 15 of the 16 basic computing operations to be performed; then

- a final cycle to compute and record the result of the last basic computing operation.

Example 2:

- Cycle #0: read AA0_2 = {A0; A1; A2};

- Cycle #1: read BB0_2 = {B0; B1; B2};

- Cycle #2: compute C0 (from set CC0_2 = {C0; C1; C2}) and read AA3_5 (e.g., forming a cycle i);

- Cycle #3: compute C1 (from set CC0_2) and read BB3_5 (e.g., forming a cycle i);

- Cycle #4: compute C2 (from set CC0_2) and write CC0_2;

- Cycle #5: compute C3 (from set CC3_5) and read AA6_8 (e.g., forming a cycle ii);

- Cycle #6: compute C4 (from set CC3_5) and read BB6_8 (e.g., forming a cycle ii);

- Cycle #7: compute C5 (from set CC3_5) and write CC3_5 (e.g., forming a cycle ii);

- Cycle #8: compute C6 (from set CC6_8) and read AA9_11;

- Cycle #9: compute C7 (from set CC6_8) and read BB9_11;

- Cycle #10: compute C8 (from set CC6_8) and write CC6_8;

- Cycle #11: compute C9 (from set CC9_11) and read AA12_14;

- Cycle #12: compute C10 (from set CC9_11) and read BB12_14;

- Cycle #13: compute C11 (from set CC9_11) and write CC9_11;

- Cycle #14: compute C12 (from set CC12_14) and read A15 (e.g., forming a cycle i);

- Cycle #15: compute C13 (from set CC12_14) and read B15 (e.g., forming a cycle i);

- Cycle #16: compute C14 (from set CC12_14) and write CC12_14;

- Cycle #17: compute C15 (an isolated item of data) and write C15 (e.g., forming a cycle ii).
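The decomposition used for Example 2 (initialization, full patterns, remaining operations) can be written as a small calculation. This is a sketch under the assumptions stated in the text (one basic operation per cycle, memory accesses fully overlapped after initialization); the function name and return layout are hypothetical.

```python
def decompose(n_ops=16, d=3, operands=2):
    """Split the schedule into initialization, full patterns, and leftover cycles.

    Assumes one basic operation per cycle and memory accesses that are
    fully overlapped with computation after the initialization phase.
    """
    init = operands                               # one read per operand dataset
    full_patterns, remainder = divmod(n_ops, d)   # patterns of d cycles, then leftovers
    total = init + full_patterns * d + remainder
    return init, full_patterns, remainder, total


print(decompose(16, 3))  # Example 2: datasets of three items
print(decompose(16, 4))  # Example 1: datasets of four items
```

For D = 3 the result is 2 initialization cycles, 5 patterns of 3 cycles, and 1 final cycle, that is, 18 cycles, the same total as for D = 4.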

It can be seen that in Example 2 every cycle includes a memory access operation (in read mode or in write mode). It should therefore be understood that if the number of items of data (D) accessible within a single cycle is strictly less than three, additional cycles will be needed to perform the memory access operations. The optimum of 18 cycles for 16 basic operations will then no longer be achieved. However, even if the optimum is not achieved, the number of cycles remains significantly lower than the number required in Example 0. An embodiment in which the datasets contain two items of data therefore still represents an improvement over what currently exists.
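A rough lower bound makes the threshold at D = 3 visible. The sketch below assumes at most one memory access (one dataset read or written) per cycle, one basic operation per cycle, and one read per operand dataset plus one write per result dataset; these assumptions and the function name `min_cycles` are not taken from the source, and the value obtained for D = 2 is only a bound, not a figure stated in the text.

```python
import math


def min_cycles(n_ops=16, d=4, operands=2):
    """Lower bound on the cycle count under the assumptions stated above."""
    datasets = math.ceil(n_ops / d)
    # Each dataset of results needs `operands` reads and one write.
    accesses = (operands + 1) * datasets
    # Initialization reads plus one cycle per basic operation.
    compute_bound = operands + n_ops
    return max(compute_bound, accesses)


for d in (4, 3, 2, 1):
    print(d, min_cycles(16, d))
```

Under these assumptions the bound is 18 cycles for D = 4 and D = 3, rises to 24 for D = 2, and reaches 48 for D = 1, which matches the count of Example 0.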

In Example 1, if cycle #2 and/or cycle #3 corresponds, for example, to a cycle i as defined above, then each of cycles #6, #7, #8, and #9 corresponds to a cycle ii. This can of course be transposed from pattern to pattern. In Example 2, if cycle #2 and/or cycle #3 corresponds, for example, to a cycle i as defined above, then each of cycles #5, #6, and #7 corresponds to a cycle ii. Here too, this can be transposed from pattern to pattern.

In the examples described so far, and in Examples 1 and 2 in particular, a particularly low total number of cycles is achieved because as many memory access operations as possible are implemented per dataset containing several items of data, rather than item by item, and in parallel with the computing operations. Thus, for some parts of the process (all parts, in the optimized examples), the read operations for all the necessary operands can be completed even before the preceding basic computing operation has finished. Preferably, computing power is also saved by performing a computing operation and recording its result (write operation) in a common computing cycle (for example, cycle #5 in Example 1).

In the examples, reading the operand data in advance is implemented throughout the process (repeated from one pattern to the next). The operands needed for the computing operations performed during a pattern are automatically obtained (read) during the pattern that precedes it in time. Note that in degraded embodiments, reading in advance is implemented only partially (for only two consecutive patterns). Although degraded compared with the preceding examples, such a mode still yields better results than existing methods.

In the examples described so far, it has been assumed that the data are read before serving as operands. In some embodiments, the data read in advance are read at random, or at least independently of the future computing operations to be performed. As a result, at least some of the data read in advance among the datasets actually correspond to operands for subsequent computing operations, whereas other read data are not operands for subsequent computing operations. For example, at least some of the read data may subsequently be erased from the registers (11) without having been used by the ALUs (9), typically by being overwritten with other data subsequently written to the registers (11). Some data are therefore read unnecessarily (and written unnecessarily to the registers (11)). However, it is sufficient for at least some of the data from the read datasets to actually be operands in order to save computing cycles, and the situation is thus improved compared with what currently exists. Accordingly, depending on the number of data to be processed and the number of cycles, it is likely (in the mathematical sense of the term) that at least some of the prefetched data can actually be used as operands in a computing operation performed by an ALU (9) in a subsequent cycle.

In some embodiments, the data read in advance is preselected according to the computing operations to be performed. This makes it possible to improve the relevance of the pre-fetched data. Specifically, in the preceding examples with 16 basic computing operations, each of the 16 basic computing operations requires a pair of operands at its input: A0 and B0; A1 and B1; ...; A15 and B15, respectively. If the data is read randomly, the first two cycles may correspond to reading AA0_3 and BB4_7. In that case, no complete operand pair is available in the registers (11) at the end of the first two cycles. The ALUs (9) therefore cannot implement any basic computing operation in the subsequent cycle. One or more additional cycles must then be spent on memory access operations before the basic computing operations can begin, which increases the total number of cycles and is detrimental to efficiency.
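The difference between a random read and a preselected read can be checked with a short Python sketch. The function `ready_operations` and the `(name, first, last)` encoding of a dataset are hypothetical conventions introduced for illustration only; a basic computing operation k is considered ready when both A_k and B_k have been read.

```python
def ready_operations(loaded_groups):
    """Return the indices k for which both A_k and B_k are in the registers.

    loaded_groups -- iterable of (name, first, last), e.g. ("A", 0, 3) for AA0_3
    """
    loaded = {"A": set(), "B": set()}
    for name, first, last in loaded_groups:
        loaded[name].update(range(first, last + 1))
    return sorted(loaded["A"] & loaded["B"])


random_read = [("A", 0, 3), ("B", 4, 7)]    # AA0_3 and BB4_7
selected_read = [("A", 0, 3), ("B", 0, 3)]  # AA0_3 and BB0_3
print(ready_operations(random_read))    # no complete operand pair
print(ready_operations(selected_read))  # operations 0 to 3 can start
```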

Relying on chance and probability for the data obtained by reading to be as relevant as possible is not entirely satisfactory, although it is sufficient to improve on the existing situation. The situation can be improved further.

Implementing a prefetch algorithm makes it possible to obtain all the operands of the next computing operation to be performed as early as possible. In the preceding example, reading AA0_3 and BB0_3 during the first two cycles makes, for example, all the operands needed for the first four basic computing operations available in the registers (11).

Such an algorithm receives, as input parameters, information data relating to the computing operations to be performed subsequently by the ALUs (9), and in particular relating to the operands required. At its output, the algorithm makes it possible to select the data to be read (per set) in anticipation of the future computing operations to be performed. The algorithm is implemented, for example, by the control unit (5) when controlling memory access operations.

According to a first approach, the algorithm imposes an organization on the data from the moment it is written to the memory (13). For example, the data intended to form a dataset is juxtaposed and/or ordered so that the entire dataset can be retrieved by a single request. For example, if the addresses of data A0, A1, A2 and A3 are referenced @A0, @A1, @A2 and @A3, respectively, the memory interface (15) may be configured so that, in response to a read request relating to @A0, it also automatically reads the data at the three following addresses @A1, @A2 and @A3.
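The first approach can be sketched in Python as follows. The class `MemoryInterface`, its `burst` parameter, and the use of list indices as addresses are assumptions made for illustration; the description does not specify how the memory interface (15) is implemented.

```python
class MemoryInterface:
    """Toy model: a read request on one address also returns the following ones."""

    def __init__(self, memory, burst=4):
        self.memory = memory  # data stored contiguously, in dataset order
        self.burst = burst    # number of consecutive addresses read per request

    def read(self, address):
        # A single request on `address` returns the whole juxtaposed dataset.
        return self.memory[address:address + self.burst]


memory = [f"A{k}" for k in range(16)]  # A0..A15 written in order
interface = MemoryInterface(memory)
print(interface.read(0))  # request on @A0 also returns A1, A2 and A3
```

Because the data was ordered when it was written, the request does not need to name @A1, @A2 or @A3 explicitly.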

According to a second approach, the prefetch algorithm provides, at its output, memory access requests that are adapted on the basis of the computing operations to be performed subsequently by the ALUs (9), and in particular of the operands required. In the preceding examples, the algorithm may identify, for example, that the data to be read with the highest priority is that of AA0_3 and BB0_3, so that the basic computing operations giving the result CC0_3 (that is, computing C0 from operands A0 and B0, C1 from operands A1 and B1, C2 from operands A2 and B2, and C3 from operands A3 and B3) become possible as early as the following cycle. The algorithm therefore provides, at its output, memory access requests configured to trigger the reading of AA0_3 and BB0_3.
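A minimal Python sketch of the second approach is given below. The function `plan_requests`, the assumption that datasets are aligned groups of four consecutive indices, and the string form of the requests are all hypothetical; only the dataset names AA0_3, BB0_3, AA4_7 and BB4_7 come from the description.

```python
def plan_requests(pending_ops, group_size=4):
    """Return the datasets to read first for the earliest pending operation.

    pending_ops -- indices of the basic computing operations, in execution order
    group_size  -- number of operands per dataset (assumed aligned groups)
    """
    if not pending_ops:
        return []
    first = pending_ops[0]
    low = (first // group_size) * group_size
    high = low + group_size - 1
    # One request per operand family, so that complete pairs become available.
    return [f"AA{low}_{high}", f"BB{low}_{high}"]


print(plan_requests(list(range(16))))     # operations C0..C15 pending
print(plan_requests(list(range(4, 16))))  # C0..C3 already computed
```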

The two approaches may optionally be combined: the algorithm identifies the data to be read, and from this the control unit (5) derives the memory access requests to the memory interface (15) needed to obtain said data, the requests being adapted on the basis of the characteristics (structure and protocol) of the memory interface (15).

In the preceding examples, in particular cases 1 and 2 above, the number of ALUs assigned to the basic computing operations is not defined. A single ALU (9) may perform all the basic computing operations, cycle after cycle. The basic computing operations to be performed may also be distributed over a plurality of (for example four) ALUs (9) of a PU. In such cases, coordinating the distribution of the computing operations over the ALUs with the way in which the data to be read in each read operation is grouped together may make it possible to improve efficiency further. Two approaches can be distinguished.

In the first approach, the data read in one operation forms operands in computing operations implemented by one and the same ALU (9). For example, the groups AA0_3 and BB0_3 of data A0, A1, A2, A3, B0, B1, B2 and B3 are read first, and a first ALU is responsible for computing CC0_3 (C0, C1, C2 and C3). The groups AA4_7 (A4, A5, A6 and A7) and BB4_7 (B4, B5, B6 and B7) are read next, and a second ALU is responsible for computing CC4_7 (C4, C5, C6 and C7). It will be understood that, in this case, the first ALU can begin implementing its computing operations before the second ALU can begin its own, because the operands needed by the first ALU are available in the registers (11) before the operands needed by the second ALU. In this case, the ALUs (9) of the PU operate in parallel and asynchronously.

In the second approach, the data read in one operation forms operands in computing operations each implemented by a different ALU (9) (for example, four ALUs). For example, two groups of data comprising A0, A4, A8 and A12, and B0, B4, B8 and B12, respectively, are read first. A first ALU is responsible for computing C0, a second ALU for computing C4, a third ALU for computing C8, and a fourth ALU for computing C12. It will be understood that, in this case, the four ALUs can begin implementing their respective computing operations substantially simultaneously, because the operands they need become available in the registers (11) at the same time, having been downloaded in a common operation. The ALUs (9) of the PU operate in parallel and synchronously. Depending on the types of computing operations to be performed, the accessibility of the data in memory, and the available resources, one or the other of the two approaches may be preferred. The two approaches may also be combined, with the ALUs organized into subgroups (the ALUs of one subgroup operating synchronously, and the subgroups operating asynchronously relative to one another).
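The effect of the two groupings on ALU start times can be compared with a short Python sketch. It assumes 16 operations, four ALUs, and an assignment in which ALU j computes C(4j) to C(4j+3). The contents of the read groups after the first one in the second approach ({1, 5, 9, 13} and so on) are also an assumption, since the description gives only the first group. All function names are hypothetical.

```python
N_OPS, N_ALUS = 16, 4
PER_ALU = N_OPS // N_ALUS


def alu_of(k):
    """ALU assumed to compute result C_k (ALU 0: C0..C3, ALU 1: C4..C7, ...)."""
    return k // PER_ALU


def row_groups():
    """First approach: each read operation feeds a single ALU."""
    return [list(range(g * PER_ALU, (g + 1) * PER_ALU)) for g in range(N_ALUS)]


def column_groups():
    """Second approach: each read operation feeds every ALU once."""
    return [list(range(g, N_OPS, PER_ALU)) for g in range(PER_ALU)]


def first_usable_read(groups):
    """For each ALU, index of the first read group holding one of its operands."""
    first = {}
    for position, group in enumerate(groups):
        for k in group:
            first.setdefault(alu_of(k), position)
    return [first[alu] for alu in range(N_ALUS)]


print(first_usable_read(row_groups()))     # ALUs start one after another
print(first_usable_read(column_groups()))  # all ALUs start after the first read
```

Staggered start indices correspond to the asynchronous operation of the first approach, and identical start indices to the synchronous operation of the second.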

To impose synchronized, asynchronous, or mixed operation of the ALUs, the grouping of the data to be read per read operation must be chosen to correspond to the distribution of the assignments of computing operations to the various ALUs.

In the preceding examples, the basic computing operations are independent of one another. The order in which they are performed therefore has, a priori, no importance. In some applications in which at least some of the computing operations depend on one another, the order of the computing operations may matter. This situation typically arises with recursive computing operations. In such cases, the algorithm may be configured to identify the data to be obtained (read) with the highest priority. For example, if:

- the result C1 is obtained through a computing operation in which one of the operands is C0, C0 itself being obtained from operands A0 and B0,

- the result C5 is obtained through a computing operation in which one of the operands is C4, C4 itself being obtained from operands A4 and B4,

- the result C9 is obtained through a computing operation in which one of the operands is C8, C8 itself being obtained from operands A8 and B8, and

- the result C13 is obtained through a computing operation in which one of the operands is C12, C12 itself being obtained from operands A12 and B12,

then the algorithm may be configured to read, during the first two initialization cycles #0 and #1, the datasets defined as follows:

- {A0; A4; A8; A12}, and

- {B0; B4; B8; B12}.

The datasets thus defined are shown in Fig. 4. Figuratively speaking, the data is grouped together "in rows" in the embodiment shown in Fig. 3 and "in columns" in the embodiment shown in Fig. 4. Implementing the algorithm thus makes it possible to read the operands that are useful for the highest-priority basic computing operations and to make them available in the registers (11). In other words, implementing the algorithm makes it possible to increase the short-term relevance of the read data compared with a random read operation.
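The selection of the "column" datasets in the recursive case can be sketched in Python as follows. The function `first_datasets` and the `depends_on` mapping are hypothetical: the mapping records, for each dependent result index, the index of the result it uses as an operand, and the operations to serve first are taken to be those that other operations depend on and that have no dependency of their own.

```python
def first_datasets(depends_on):
    """Return the A and B datasets to read during the initialization cycles.

    depends_on -- dict mapping a result index to the index of the result it
                  needs as an operand, e.g. {1: 0} because C1 uses C0
    """
    # Roots: results needed by another operation that need no earlier result.
    roots = sorted({src for src in depends_on.values() if src not in depends_on})
    return [f"A{k}" for k in roots], [f"B{k}" for k in roots]


dependencies = {1: 0, 5: 4, 9: 8, 13: 12}  # C1<-C0, C5<-C4, C9<-C8, C13<-C12
a_set, b_set = first_datasets(dependencies)
print(a_set)
print(b_set)
```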

The invention is not limited to the examples of processing units and methods described above, which are given solely by way of example, but encompasses all variants that a person skilled in the art may envisage within the scope of the protection sought. The invention also relates to a set of processor-implementable machine instructions for obtaining such a computing device, such as a processor or a set of processors, to the implementation of such a set of machine instructions on a processor, to a processor architecture management method implemented by the processor, to a computer program comprising the corresponding set of machine instructions, and to a recording medium on which such a set of machine instructions is recorded in computer-readable form.

Claims

A computing device (1),
the computing device (1) comprising:
- a plurality of arithmetic logic units (9);
- a set of registers (11),
the registers (11) being able to supply operand-type data to the inputs of the arithmetic logic units (9) and to receive data from the outputs of the arithmetic logic units (9);
- a memory (13);
- a memory interface (15),
through which data (A0, A15) is transmitted and routed between the registers (11) and the memory (13); and
- a control unit (5),
the control unit (5) being configured to control the arithmetic logic units (9) according to a processing chain microarchitecture such that the arithmetic logic units (9) perform computing operations in parallel with one another,
the control unit (5) also being designed to control the memory access operations via the memory interface (15),
the control operations performed by the control unit (5) generating:
- at least one cycle i; and
- at least one cycle ii subsequent to the at least one cycle i,
wherein said at least one cycle i comprises, in parallel, implementing at least one first computing operation via the arithmetic logic unit (9) and downloading a first dataset (AA4_7; BB4_7) from the memory (13) to at least one register (11), wherein at least a part (A4; B4) of said first dataset (AA4_7; BB4_7) is not used by any computing operation of the arithmetic logic units (9) during said cycle i,
the computing device being characterized in that said at least one cycle ii comprises implementing a second computing operation via an arithmetic logic unit (9), wherein during said second computing operation at least said part (A4; B4) of said first dataset (AA4_7; BB4_7) forms at least one operand.

The computing device according to claim 1,
characterized in that the control unit (5) is also configured to implement, before controlling the arithmetic logic units and the memory access operations, an identification algorithm for identifying the first dataset (AA4_7; BB4_7) to be downloaded during the at least one cycle i on the basis of the second computing operation to be implemented during the at least one cycle ii.

The computing device according to claim 1,
wherein the control unit (5) is configured to implement two separate cycles i, so that two separate first datasets (AA4_7, BB4_7) are downloaded to at least one register (11),
characterized in that at least a part (A4, B4) of each of said two first datasets (AA4_7, BB4_7) forms an operand for said second computing operation of said at least one cycle ii.

The computing device according to claim 1,
characterized in that the control unit (5) is configured to implement a plurality of separate cycles ii, such that the part (A4; A5; A6; A7) of the first dataset (AA4_7) forming at least one operand for the second computing operation of a cycle ii differs from one cycle ii to another cycle ii of the plurality of cycles ii.

The computing device according to claim 1,
wherein the control unit (5) is configured to perform, in series, at least two iterations of at least one cycle i and one cycle ii,
characterized in that the two iterations are at least partially superimposed such that at least one cycle ii of the first iteration forms a cycle i of the subsequent iteration.

The computing device according to claim 1,
characterized in that the control unit (5) is configured to perform, prior to the first cycle i, an initialization phase comprising downloading at least one dataset (AA0_3; BB0_3) forming operands for the first computing operation of the first cycle i from the memory (13) to at least one register (11).

The computing device according to claim 1,
wherein the control unit (5) is also designed to control the memory access operations via the memory interface (15) such that the control operations cause:
- during cycle i, the implementation of a plurality of first computing operations by a plurality of arithmetic logic units (9); and
- during cycle ii, the implementation of a plurality of second computing operations by a plurality of arithmetic logic units (9),
characterized in that the grouping of data for each dataset to be downloaded is selected to match the distribution of the assignments of computing operations to each of the plurality of arithmetic logic units (9), such that the arithmetic logic units (9) have synchronized, asynchronous, or mixed operation.

A data processing method implemented by a control unit (5) of a computing device (1),
the device (1) comprising:
- a plurality of arithmetic logic units (9);
- a set of registers (11),
the registers (11) being able to supply operand-type data to the inputs of the arithmetic logic units (9) and to receive data from the outputs of the arithmetic logic units (9);
- a memory (13);
- a memory interface (15),
through which data (A0, A15) is transmitted and routed between the registers (11) and the memory (13); and
- the control unit (5),
the control unit (5) being configured to control the arithmetic logic units (9) according to a processing chain microarchitecture such that the arithmetic logic units (9) perform computing operations in parallel with one another,
the control unit (5) also being designed to control the memory access operations via the memory interface (15),
the method comprising at least:
- generating a cycle i; and
- generating a cycle ii subsequent to the cycle i,
wherein the cycle i comprises, in parallel, implementing at least one first computing operation via the arithmetic logic unit (9) and downloading a first dataset (AA4_7; BB4_7) from the memory (13) to at least one register (11), wherein at least a part (A4; B4) of the first dataset (AA4_7; BB4_7) is not used by any computing operation of the arithmetic logic units (9) during the cycle i,
the data processing method being characterized in that the cycle ii comprises implementing a second computing operation via an arithmetic logic unit (9), during which at least said part (A4; B4) of the first dataset (AA4_7; BB4_7) forms at least one operand.

A computer program, characterized in that it comprises instructions for implementing the method according to claim 8 when the computer program is executed by a processor.

A non-transitory computer-readable recording medium on which a program is recorded, characterized in that the program implements the method according to claim 8 when the program is executed by a processor.