CN118284883A - Using buffer structures and signaling to record burst error information in dynamic random access memory (DRAM) - Google Patents

Publication number
CN118284883A
Authority
CN
China
Prior art keywords
buffer
error
interrupt
read
error information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280077041.3A
Other languages
Chinese (zh)
Inventor
T·宋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rambus Inc
Original Assignee
Rambus Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Rambus Inc

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1048 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using arrangements adapted for a specific error detection or correction feature
    • G06F11/1052 Bypassing or disabling error detection or correction
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0772 Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1048 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using arrangements adapted for a specific error detection or correction feature

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Techniques for storing burst error information in a buffer structure and signaling to prevent overflow and overwrite of the buffer structure are described. A controller device includes error detection logic, a buffer, and buffer control logic. The error detection logic detects errors in read operations associated with a memory device coupled to the controller device. The buffer stores error information associated with the errors. The buffer control logic generates and outputs a first signal in response to the buffer being full.

Description

Recording burst error information for Dynamic Random Access Memory (DRAM) using buffer structure and signaling
Background
Modern computer systems typically include data storage devices, such as memory components or memory devices. The memory component may be, for example, a Random Access Memory (RAM) device or a Dynamic Random Access Memory (DRAM) device. The memory device includes a bank of memory cells that are accessed by a memory controller or memory client through a command interface and a data interface within the memory device. The memory controller may include an Error Correction Code (ECC) engine that may detect errors in data read from the DRAM device. The ECC engine may record the error until it is analyzed by another entity. However, in some cases, such as a word line driver failure, consecutive read responses from the DRAM may contain multiple errors, which is referred to as burst error detection. Because the interrupt routine may take multiple clock cycles to read an error from the ECC engine, the previous error information may be overwritten by the next error information, resulting in a loss of error information.
Drawings
The present disclosure is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
FIG. 1A is a block diagram of a memory system having a controller and a memory device according to one implementation.
FIG. 1B is a timing diagram of a plurality of errors detected by an ECC engine during error processing time of an interrupt handling routine, according to one implementation.
FIG. 2 is a block diagram of a memory system having a memory device and a controller with a buffer structure in accordance with at least one embodiment.
FIG. 3 is a block diagram of a controller having an ECC engine, processor, and buffer structure in accordance with at least one embodiment.
FIG. 4 is a flow diagram of a method of reading burst error information from multiple entries of a first-in-first-out (FIFO) buffer, according to at least one embodiment.
FIG. 5 is a block diagram of an integrated circuit having an error reporting engine with a FIFO buffer in accordance with at least one embodiment.
FIG. 6 is a flow diagram of a method of operating an integrated circuit for recording burst error information for a memory device in accordance with at least one embodiment.
Detailed Description
The following description sets forth numerous specific details, such as examples of specific systems, components, methods, etc., in order to provide a good understanding of several embodiments of the present disclosure. It will be apparent, however, to one skilled in the art that at least some embodiments of the present disclosure may be practiced without these specific details. In other instances, well-known components or methods have not been described in detail or presented in simple block diagram format in order to avoid unnecessarily obscuring the present disclosure. Therefore, the specific details set forth are merely exemplary. The particular implementations may vary from these exemplary details and still be considered to be within the scope of the present disclosure.
FIG. 1A is a block diagram of a memory system 100 having a controller 102 and a memory device 104 according to one implementation. The controller 102 includes an ECC engine 106 and a processor 108 (also referred to as a management processor). During operation, ECC engine 106 may detect errors in data read from memory device 104 (e.g., a DRAM device) (101). The ECC engine 106 may record the error until it is analyzed by the processor 108. The error may be recorded in a designated register or memory location of the ECC engine 106. In response to the detection of the error (101), the ECC engine 106 asserts an interrupt to the processor 108 (103), causing the processor 108 to read the saved error information from the designated register (105) and, once the interrupt has been processed, clear the interrupt (107). Asserting the interrupt (103) may trigger an interrupt handling routine on the processor 108 to read error information (105) from the ECC engine 106 and clear the interrupt (107). The interrupt handling routine may take several clock cycles, such as tens to hundreds of clock cycles, to read the error information (105) and clear the interrupt (107). Asserting the interrupt (103) may also trigger a demand scrub option to find the error type of the detected error. To manage the memory device 104, all error information should be recorded and analyzed by the processor 108. The processor 108 may enable post-package repair (PPR), perform page offlining, perform health monitoring, replace failed memory devices, and/or perform other management processes based on the error information.
There are cases where multiple errors may occur in less time than the interrupt handling routine of the processor 108 needs to read the error information (105) and clear the interrupt (107), so that subsequent error information overwrites the previous error information before it is read. The time it takes for the interrupt handling routine to read the error information (105) and clear the interrupt (107) is referred to as error handling time 159. As illustrated in FIG. 1B, burst error detection occurs when multiple error detections occur in a time shorter than the error handling time.
FIG. 1B is a timing diagram 150 of a plurality of errors detected by an ECC engine during an error handling time of an interrupt handling routine, according to one implementation. As described above, in response to ECC engine 106 detecting error 151 (101), ECC engine 106 asserts interrupt 153 (103) and stores error information 155. Asserting interrupt 153 (103) triggers interrupt handling routine 157 to read error information 155 (105) and clear interrupt 153 (107). Since only one error 151 is detected within the error handling time 159, the error information 155 may be read from the ECC engine 106 without losing information.
However, as illustrated by timing diagram 150, there are cases where multiple errors may be detected within the error handling time 159. For example, a word line driver may have a fault that causes consecutive read responses from the memory device 104 to contain errors, thereby causing burst error detection 160. In particular, burst error detection 160 may begin with detection of a first error 161. In response to the ECC engine 106 detecting the first error 161 (101), the ECC engine 106 asserts the first interrupt 163 (103) and stores the first error information 165. Asserting the first interrupt 163 (103) triggers the interrupt handling routine 157 to read the first error information 165 (105) and clear the first interrupt 163 (107). The problem is that the interrupt handling routine 157 spends a first error handling time 171 reading the first error information 165 and clearing the first interrupt 163, and the second error 167 and the third error 169 are detected within the first error handling time 171. Since two further errors are detected within the first error handling time 171, the first error information 165 may be overwritten with the second error information from the second error 167 and/or the third error information from the third error 169, resulting in a loss of error information. In some cases, the second error information of the second error 167 is read, and the first error information 165 and the third error information of the third error 169 are lost. That is, the error information from a previous error may be overwritten by the error information of the next error.
As shown in FIG. 1B, burst error detection caused by a word line failure may not be properly managed. Errors may be detected as often as once every two clock cycles, an interval much shorter than the error handling time. Either the error information is overwritten, in which case old information is lost, or an overflow occurs, in which case new information is lost.
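The loss mechanism described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the behavior of any particular ECC engine: it assumes a single error register, and an interrupt routine that samples whatever the register holds once the handling time has elapsed. The function name `simulate_single_register` and its parameters are hypothetical.

```python
def simulate_single_register(error_cycles, handling_time):
    """Model a single-slot error register (assumed behavior, for illustration).

    error_cycles: ascending clock cycles at which errors are detected.
    handling_time: cycles the interrupt routine needs before it reads the register.
    Returns (read, lost): error indices that were read and that were overwritten.
    """
    read, lost = [], []
    register = None   # index of the error currently held in the register
    read_at = None    # cycle at which the pending routine samples the register
    for err_id, cycle in enumerate(error_cycles):
        # A pending read completes if its handling time elapsed before this error.
        if read_at is not None and cycle >= read_at:
            read.append(register)
            register, read_at = None, None
        if register is not None:
            lost.append(register)   # unread record is overwritten by the new error
        register = err_id
        if read_at is None:
            read_at = cycle + handling_time
    if register is not None:
        read.append(register)       # the last pending read eventually completes
    return read, lost


# Three errors two cycles apart, with a 50-cycle handling time (assumed values).
read, lost = simulate_single_register([0, 2, 4], handling_time=50)
print(read, lost)
```

With errors spaced farther apart than the handling time, the same function reports no lost records, which matches the single-error case of FIG. 1B.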
Aspects of the present disclosure overcome the above and other drawbacks by providing a buffer structure and signaling that prevent overflow and overwriting of the buffer structure. The buffer structure may include a buffer, such as a first-in-first-out (FIFO) buffer, and buffer control logic. The FIFO buffer may include a plurality of entries to hold error information for a plurality of errors. The buffer control logic may generate and output a first signal in response to the FIFO buffer being full, to prevent overflow and overwriting. In another embodiment, the buffer control logic outputs a second signal in response to the FIFO buffer meeting a fill condition that is less than the FIFO buffer being full. The second signal may escalate the interrupt priority if the FIFO buffer reaches a threshold level. Aspects of the present disclosure may provide various benefits, including better reliability. The buffer structure and signaling described herein may improve the reliability of memory management of a memory device by a management processor because all error information may be reported and analyzed without loss. The buffer structure can effectively process DRAM burst error information while preventing overwriting of error information or overflow of the FIFO buffer. Since all error information is reported and analyzed, the management processes (e.g., PPR, page offlining) can be reliably triggered when the memory device needs them. Aspects of the present disclosure may provide signaling (e.g., backpressure signals) to block read responses to the ECC engine, and may escalate the interrupt priority so that error information is read before the FIFO buffer becomes full. Aspects of the present disclosure also provide a mechanism to look up the corresponding Device Physical Address (DPA) from the returned Read Identifier (RID).
FIG. 2 is a block diagram of a memory system 200, the memory system 200 having a memory device 204 and a controller device 202 with a buffer structure 210, in accordance with at least one embodiment. The controller device 202 may communicate with the memory device 204 using a cache-coherent interconnect protocol (e.g., the Compute Express Link™ (CXL™) protocol). The controller device 202 may be a device that implements the CXL™ standard. The CXL™ protocol can be built on the PCI Express (PCIe) standard physical and electrical interfaces, where the protocol establishes coherency, simplifies the software stack, and maintains compatibility with existing standards. The controller device 202 includes error detection logic 206 and a processor 208 (also referred to as a management processor). The controller device 202 may be part of a single-host memory expansion integrated circuit, a multi-host memory pool integrated circuit, or the like.
In at least one embodiment, the controller device 202 includes error detection logic 206. The error detection logic 206 may detect errors in read operations associated with the memory device 204 coupled to the controller device 202. The error detection logic 206 may be part of an ECC engine. Alternatively, other types of error detection circuitry may be used to detect errors in data read from memory device 204. In at least one embodiment, the memory device 204 is a DRAM device.
In one embodiment, the buffer structure 210 may include a buffer to store error information associated with an error and buffer control logic to generate and output a first signal in response to the buffer being full. The buffer may be a FIFO buffer having a plurality of entries. Each entry may store an identifier, a device physical address, an error type, and error information. In other embodiments, the buffer control logic may monitor the buffer and generate and send the second signal in response to the buffer meeting a fill condition that is less than buffer full (e.g., less than 5% of remaining space or X remaining entries, etc.). In at least one embodiment, the first signal is a backpressure signal and the second signal is an interrupt. The backpressure signal may be indicative of data accumulation in the buffer. The backpressure signal may be sent when the buffer is full and unable to receive additional data. The backpressure signal may cause the error detection logic 206 (or ECC engine) to cease receiving read data from the memory device 204 to prevent the possibility of detecting additional errors and storing error information for those errors in the buffer. No additional data is transferred until the buffer is emptied or a specified condition is reached, such as a specified level of available space in the buffer.
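The full-buffer backpressure described above can be sketched in a few lines. This is a minimal illustration, assuming a software model in which a `ready` flag stands in for the first signal. The class `ErrorFifo`, the helper `run_reads`, and the entry layout are hypothetical names introduced here, not taken from the disclosure.

```python
from collections import deque


class ErrorFifo:
    """Toy model of a FIFO error buffer with a backpressure (ready) signal."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = deque()

    @property
    def full(self):
        return len(self.entries) >= self.depth

    @property
    def ready(self):
        # Backpressure: ready is de-asserted while the buffer is full.
        return not self.full

    def push(self, entry):
        if self.full:
            raise RuntimeError("push attempted while backpressure is asserted")
        self.entries.append(entry)

    def pop(self):
        return self.entries.popleft()


def run_reads(fifo, responses):
    """Feed (rid, has_error) read responses; return the RIDs held back by backpressure."""
    stalled = []
    for rid, has_error in responses:
        if not fifo.ready:
            stalled.append(rid)   # response is not accepted, so no error can be logged
            continue
        if has_error:
            fifo.push({"rid": rid})
    return stalled


fifo = ErrorFifo(depth=2)
stalled = run_reads(fifo, [(0, True), (1, True), (2, True), (3, False)])
print(stalled, [e["rid"] for e in fifo.entries])
```

In this model nothing is overwritten or dropped: once the buffer is full, later read responses simply wait until an entry is popped and `ready` is asserted again.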
In another embodiment, the buffer control logic may generate and output a first interrupt in response to the buffer meeting a first fill condition that is less than the buffer being full. The first interrupt may be associated with a first priority. The buffer control logic may generate and output a second interrupt in response to the buffer meeting a second fill condition that lies between the first fill condition and the buffer being full. The second interrupt may be associated with a second priority that is greater than the first priority. In this way, the buffer control logic may escalate the priority of interrupts when the buffer is nearly full, preventing overflow or overwriting of the buffer.
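The two fill conditions and the full condition can be pictured as a simple threshold ladder. The following sketch is illustrative only: the function name `signal_for_occupancy`, the string labels, and the threshold values are assumptions chosen for the example.

```python
def signal_for_occupancy(occupancy, depth, first_threshold, second_threshold):
    """Return the signal a buffer controller might raise for a given occupancy.

    Assumes 0 < first_threshold < second_threshold < depth (all in entries).
    """
    if not 0 < first_threshold < second_threshold < depth:
        raise ValueError("thresholds must satisfy 0 < first < second < depth")
    if occupancy >= depth:
        return "backpressure"        # buffer full: block further read responses
    if occupancy >= second_threshold:
        return "interrupt_high"      # second fill condition: higher priority
    if occupancy >= first_threshold:
        return "interrupt_low"       # first fill condition: lower priority
    return None


# Example with a 16-entry buffer and thresholds at 8 and 14 entries (assumed values).
for occupancy in (0, 8, 14, 16):
    print(occupancy, signal_for_occupancy(occupancy, 16, 8, 14))
```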
During operation, error detection logic 206 may detect (201) errors in data read from memory device 204 (e.g., a DRAM device). The error detection logic 206 may record the error until it is analyzed by the processor 208. The error detection logic 206 may store the error information in the buffer structure 210 (205). The buffer structure 210 may include a buffer and buffer control logic. The buffer may be a FIFO buffer and may include a plurality of entries, each storing error information associated with one error detected by the error detection logic 206. In response to detecting the error (201), the error detection logic 206 asserts an interrupt to the processor 208 (203) such that the processor 208 reads the saved error information from the buffer structure 210 (207) and clears the interrupt once the interrupt is processed (209). Asserting an interrupt (203) may trigger an interrupt handling routine on the processor 208 to read error information (207) from the buffer structure 210 and clear the interrupt (209). The interrupt handling routine may take multiple clock cycles (such as tens to hundreds of clock cycles) to read the error information (207) and clear the interrupt (209). Asserting an interrupt (203) may also trigger a demand scrub option to find the error type of the detected error. To manage the memory device 204, all error information should be recorded and analyzed by the processor 208. Processor 208 may enable PPR, perform page offlining, perform health monitoring, replace failed memory devices, and/or perform other management processes based on the error information.
As described above, there are cases where multiple errors can occur in a time shorter than the time it takes for the interrupt handling routine of the processor 208 to read the error information (207) and clear the interrupt (209). In this case, however, subsequent error information may be written into a subsequent buffer entry, which prevents it from overwriting the previous error information. The time it takes for the interrupt handling routine to read the error information (207) and clear the interrupt (209) is referred to as the error handling time. Burst error detection occurs when multiple error detections occur in a time shorter than the error handling time. Using the buffer structure 210, burst error information may be recorded in, and read from, the buffer structure 210 without losing information due to overwriting or overflow, as described in more detail below.
FIG. 3 is a block diagram of a controller 302 having an ECC engine 306, a processor 308, and a buffer structure 310, according to at least one embodiment. The buffer structure 310 includes buffer control logic and an error log FIFO structure 312, which has a FIFO buffer 318 with a plurality of entries coupled between a multiplexer 320 and a demultiplexer 322. The buffer control logic may include backpressure signal logic 316 and match logic 314. Since only the Request Identifier (RID) is returned with the read data, not the Device Physical Address (DPA), the match logic 314 may match the RID with the physical address of the read operation, as described in more detail below.
When an error is detected by ECC engine 306, match logic 314 uses the RID-DPA map in buffer 332 to provide the DPA for the corresponding request. The ECC engine 306 may also output an error signal to the match logic 314. The ECC engine 306 may also output error information associated with the error to the error log FIFO structure 312 simultaneously with the identifier and physical address output by the match logic 314. The ECC engine 306 may detect a plurality of errors caused by word line faults in the memory device 304. For example, ECC engine 306 may detect an error once every two clock cycles, an interval shorter than the error handling time. The RID, the DPA, the error type (e.g., Uncorrectable Error (UE) or Correctable Error (CE)), and the error location may be saved into a free entry in FIFO buffer 318. In other embodiments, other error information may be stored in error log FIFO structure 312. For example, ECC engine 306 may provide an identifier of the DRAM containing the errors (multi-hot encoded) and the Bit Line (BL) locations (multi-hot encoded). Accordingly, error log FIFO structure 312 holds error location information, including the malfunctioning DRAMs and BLs associated with the errors. The error log FIFO structure 312 may assert an interrupt signal on the interrupt pin to trigger an interrupt handling routine of the processor 308.
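As a small illustration of multi-hot encoded location fields, the sketch below decodes a bit mask into the indices it marks. The bit ordering (bit i corresponds to DRAM or bit line i), the field names, and the example values are assumptions made for the example, since the text does not specify a layout.

```python
def decode_multi_hot(mask):
    """Return the indices of all set bits in a multi-hot mask (bit 0 is index 0)."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# Hypothetical FIFO entry: field names and values are illustrative only.
entry = {
    "rid": 7,
    "dpa": 0x1000,
    "error_type": "CE",
    "dram_mask": 0b00100100,   # DRAMs 2 and 5 reported errors
    "bl_mask": 0b0101,         # bit lines 0 and 2 reported errors
}

print(decode_multi_hot(entry["dram_mask"]), decode_multi_hot(entry["bl_mask"]))
```

Unlike a one-hot field, a multi-hot field can flag several devices or bit lines in a single entry, which suits a word line fault that affects more than one location.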
In at least one embodiment, the controller 302 may be coupled to the memory device 304 using an address register bus 342 (AR bus) and a read bus 344 (R bus). The AR bus 342 may send read commands to the memory device 304, and the R bus 344 may receive read data and a Request Identifier (RID) associated with the read data from the memory device 304. Each read command includes an identifier, such as an AR identifier (ArID) and an AR device physical address (ArADDR). Typically, the read response from the memory controller does not have address information, so the match logic 314 may save the DPA for each request from the host Central Processing Unit (CPU). The match logic 314 is coupled to an AR bus 342 and an R bus 344. The match logic 314 receives ArID and ArADDR for each read operation on the AR bus 342. The match logic 314 may include a buffer 332 with a plurality of entries that store each of ArID and ArADDR for each read operation. Multiplexer 334 may be used to select the entries stored in buffer 332 for respective ArID and ArADDR.
Similarly, a demultiplexer 336 may be used to read the corresponding entry from the buffer 332. In at least one embodiment, a second demultiplexer 338 may be used to select between entries in buffer 332 and addresses provided by patrol scrub logic 340 operating in scrub mode. The demultiplexer 336 (and the second demultiplexer 338) may be enabled by a gate that is activated when the error signal is received from the ECC engine 306 together with the RID on the R bus 344. The match logic 314 is coupled to the error log FIFO structure 312. ECC engine 306 is coupled to R bus 344 and error log FIFO structure 312.
During operation, the match logic 314 stores an identifier and associated physical address for each of the read commands sent on the AR bus 342. ECC engine 306 receives the read data via R bus 344. The match logic 314 receives the respective identifiers corresponding to the read data via the R bus 344 and receives the error signal from the ECC engine 306. The match logic 314 locates the associated physical address of the corresponding identifier received from the R bus 344 and outputs the identifier and the associated physical address to the error log FIFO structure 312 in response to the error signal. The ECC engine 306 also outputs error information to be stored in the error log FIFO structure 312 along with the identifier and associated physical address. The write pointer may control the multiplexer 320 to store the error information, the identifier, and the physical address in a specified entry of the FIFO buffer 318. The read pointer may be used by the processor 308 to control the demultiplexer 322 to read a specified entry in the error log FIFO structure 312.
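The identifier-to-address matching described above amounts to a lookup table that is filled on each read command and consulted on each read response. The following is a minimal sketch under stated assumptions: the class `MatchLogic` and its method names are hypothetical, and retiring the mapping when the response arrives (one outstanding read per identifier) is a simplification not stated in the text.

```python
class MatchLogic:
    """Toy model of RID-to-DPA matching between the AR bus and the R bus."""

    def __init__(self):
        self._rid_to_dpa = {}

    def on_read_command(self, ar_id, ar_addr):
        # Record the identifier and device physical address seen on the AR bus.
        self._rid_to_dpa[ar_id] = ar_addr

    def on_read_response(self, rid, error):
        # Look up (and, as an assumption, retire) the address for this identifier.
        dpa = self._rid_to_dpa.pop(rid)
        # Only responses flagged by the error signal are forwarded to the error log.
        return (rid, dpa) if error else None

    @property
    def outstanding(self):
        return len(self._rid_to_dpa)


match = MatchLogic()
match.on_read_command(ar_id=3, ar_addr=0x1000)
match.on_read_command(ar_id=4, ar_addr=0x2000)
print(match.on_read_response(rid=4, error=True))
print(match.on_read_response(rid=3, error=False))
```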
In at least one embodiment, the interrupt register 328 of the error log FIFO structure 312 may be used to assert an interrupt signal to the processor 308. In at least one embodiment, the error log FIFO structure 312 may send two interrupt signals, including a first interrupt signal indicating that a valid entry exists and a second interrupt signal indicating that the queue occupancy of the FIFO buffer 318 exceeds a threshold (or meets a threshold condition). In at least one embodiment, error log FIFO structure 312 may include a full register 324. The full register 324 may store a value indicating whether the FIFO buffer 318 has a free entry. When that value is de-asserted (i.e., no free entry remains), the ready signal 301 of the ECC engine 306 is de-asserted. As a result, no read response is transferred from the memory device 304 to the ECC engine 306 on the R bus 344, which prevents overflow and overwriting of entries in the FIFO buffer 318. In at least one embodiment, error log FIFO structure 312 may include a next valid register 326, which may store a value indicating that processor 308 may read a plurality of entries as part of a set of errors. The next valid register 326 may indicate that the FIFO buffer 318 has another valid error record in the next entry. In general, when a controller accesses the same row having the same physical address, multiple errors may occur in the read data. In this case, FIFO buffer 318 may store multiple error events associated with the same physical address. Independent of the interrupt handling of each entry, the processor 308 may read all error event record entries until the value in the next valid register 326 indicates that it is the last entry of a set of error events (e.g., next_valid=0, instead of next_valid=1), such as illustrated in FIG. 4. In another embodiment, error log FIFO structure 312 may include an overflow register 330.
In at least one embodiment, the buffer control logic provides a first signal (e.g., the backpressure signal or ready signal 301) via the R bus 344 in response to the FIFO buffer 318 being full. When the overflow register 330 stores a specified value, the buffer control logic does not generate and output the first signal (e.g., the backpressure signal or ready signal 301), so that subsequent read responses on the R bus 344 are not blocked. If errors are detected in subsequent read responses, then error information associated with those errors will overflow FIFO buffer 318 (or alternatively overwrite entries in FIFO buffer 318).
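The effect of the overflow setting can be sketched as a choice between stalling and dropping when the buffer is full. This is an illustrative model only: the function `record_error`, its return labels, and the decision to drop the newest record on overflow (rather than overwrite an older one) are assumptions.

```python
def record_error(fifo, depth, entry, overflow_allowed):
    """Try to log an error entry; return what happened to it.

    fifo: list used as the FIFO buffer (oldest entry first).
    overflow_allowed: models the overflow register holding its specified value.
    """
    if len(fifo) < depth:
        fifo.append(entry)
        return "stored"
    if overflow_allowed:
        # No backpressure: reads keep flowing and the new record is lost.
        return "dropped"
    # Backpressure asserted: the read response is held back until space frees up.
    return "stalled"


fifo = ["e0", "e1"]
print(record_error(fifo, 2, "e2", overflow_allowed=False))
print(record_error(fifo, 2, "e2", overflow_allowed=True))
```

The trade-off in this model is between complete error logs (stalling reads) and uninterrupted read traffic (accepting lost records).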
FIG. 4 is a flow diagram of a method 400 of reading burst error information from multiple entries of a FIFO buffer, according to at least one embodiment. The method 400 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processing device to perform hardware simulation), or a combination thereof. In one embodiment, method 400 is performed by processor 208 of FIG. 2 or processor 308 of FIG. 3.
Referring to FIG. 4, method 400 begins with processing logic detecting an interrupt (block 402). In response to detecting the interrupt at block 402, processing logic reads error information in the single entry (block 404). Processing logic checks whether the value in the next valid register indicates that the FIFO buffer has another valid error record in the next entry (block 406) (e.g., next_valid=1). If the value in the next valid register indicates another valid error record, then at block 404 processing logic reads the error information in the next entry. At block 406, processing logic continues to read error information in the entries of the FIFO buffer until the value in the next valid register indicates that there is no other valid error record in the next entry. In response, processing logic clears the interrupt (block 408).
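The flow of method 400 can be sketched as a loop that keeps reading entries while the next valid indication is set. This is a minimal software model, assuming each FIFO entry carries its own next_valid flag and that the last entry of a group has next_valid equal to 0. The function `drain_error_group` and the `state` dictionary are hypothetical.

```python
def drain_error_group(state):
    """Read one group of error records after an interrupt, then clear the interrupt.

    state: dict with "fifo" (list of (error_info, next_valid) tuples, oldest first)
    and "interrupt" (bool). Returns the error records that were read.
    """
    records = []
    while True:
        error_info, next_valid = state["fifo"].pop(0)   # block 404: read one entry
        records.append(error_info)
        if next_valid == 0:                             # block 406: last entry of the group
            break
    state["interrupt"] = False                          # block 408: clear the interrupt
    return records


state = {
    "fifo": [("err0", 1), ("err1", 1), ("err2", 0), ("err3", 0)],
    "interrupt": True,                                  # block 402: interrupt detected
}
print(drain_error_group(state), state["fifo"])
```

Reading the whole group within a single interrupt avoids paying the interrupt entry and exit cost once per error record.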
FIG. 5 is a block diagram of an integrated circuit 500 having an error reporting engine 508 with a FIFO buffer 510, according to at least one embodiment. In at least one embodiment, integrated circuit 500 is a memory expansion chip coupled to a single host system through a cache coherence interconnect. In another embodiment, integrated circuit 500 is a multi-host memory pool chip coupled to multiple host systems through multiple cache coherence interconnects.
In the illustrated embodiment, integrated circuit 500 includes a first interface 502 coupled to one or more host systems (not shown in FIG. 5) and a second interface 504 coupled to one or more memory devices (not shown in FIG. 5). Integrated circuit 500 includes ECC engine 506, error reporting engine 508, and management processor 512. The ECC engine 506 may detect burst error information in the data 501 read from the one or more memory devices. The error reporting engine 508 includes a FIFO buffer 510, the FIFO buffer 510 to store burst error information and to set one or more interrupts to the management processor 512. The management processor 512 is coupled to the ECC engine 506 and the error reporting engine 508. The management processor 512 may read the burst error information from the FIFO buffer 510 and clear one or more interrupts. The error reporting engine 508 may use signaling to the ECC engine 506 and the management processor 512 to prevent overwriting of burst error information or overflow in the FIFO buffer 510.
In other embodiments, integrated circuit 500 includes memory controller 514. The error reporting engine 508 may send a signal to the memory controller 514 in response to the FIFO buffer 510 being full to prevent overwriting or overflowing in the FIFO buffer 510. In another embodiment, memory controller 514 is coupled to integrated circuit 500 and error reporting engine 508 sends a signal to memory controller 514.
In at least one embodiment, the error reporting engine 508 sends a first interrupt to the management processor 512 in response to the burst error information detected by the ECC engine 506. The error reporting engine 508 sends a second interrupt to the management processor 512 in response to the FIFO buffer 510 meeting a fill condition that is less than the FIFO buffer 510 being full. The second interrupt may have a higher priority than the first interrupt.
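The two interrupt levels can be sketched as a function of FIFO occupancy. This is an illustrative assumption, not the disclosed circuit: the function name pending_interrupts, the interrupt names, the numeric priorities, and the use of an entry-count threshold as the fill condition are all hypothetical.

```python
def pending_interrupts(occupancy, depth, threshold):
    """Return the interrupts asserted for a given FIFO occupancy.

    Hypothetical convention: a larger priority number is more urgent.
    The fill condition is modeled as an entry-count threshold below depth.
    """
    if not 0 < threshold < depth:
        raise ValueError("threshold must be below the full depth")
    asserted = []
    if occupancy >= 1:
        # First interrupt: at least one error record has been logged.
        asserted.append(("error_logged", 1))
    if occupancy >= threshold:
        # Second interrupt: fill condition met before the FIFO is full.
        asserted.append(("fifo_nearly_full", 2))
    return asserted
```

For example, with a depth of 8 and a threshold of 6, an occupancy of 1 asserts only the first interrupt, while an occupancy of 6 asserts both.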
In another embodiment, the management processor 512 includes an interrupt handling routine to read burst error information from the FIFO buffer 510 and clear one or more interrupts during a first amount of time. The first amount of time may be an error handling time of the interrupt handling routine. In at least one embodiment, the burst error information includes error information regarding at least two errors detected in a second amount of time less than the first amount of time.
In another embodiment, error reporting engine 508 includes a FIFO buffer 510 having a set of entries and matching logic having a buffer to store a set of read identifiers and corresponding Device Physical Addresses (DPAs). The error reporting engine 508 includes buffer control logic to send a signal 503 to the memory controller 514 in response to the FIFO buffer 510 being full to prevent overwriting or overflowing in the FIFO buffer 510. In at least one embodiment, error reporting engine 508 includes a first register to store a first indication that FIFO buffer 510 is full. The first indication may be a value, a status bit, or a multi-bit field in the first register that causes error reporting engine 508 to send signal 503 to memory controller 514. In another embodiment, error reporting engine 508 includes a second register to store a second indication of one or more interrupts. The second indication may be a value, a status bit, or a multi-bit field in the second register that causes error reporting engine 508 to send interrupt signal 505 to management processor 512.
Error reporting engine 508 may provide a structure that can efficiently handle DRAM burst error information. The error reporting engine 508 may use an error recording FIFO module to prevent overwriting of error information or to prevent overflow. The error reporting engine 508 may generate a backpressure signal to hold off read responses to the ECC engine 506. Error reporting engine 508 can use a lookup table that matches corresponding DPAs using a Request Identifier (RID) returned from a memory device. The error reporting engine 508 may escalate the interrupt priority to cause the management processor 512 to read the error information before the FIFO buffer 510 becomes full. The error reporting engine 508 may thereby support more reliable memory management operations such as PPR, offlining, etc.
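The RID-to-DPA lookup performed by the matching logic can be sketched as a small table keyed by request identifier. The class MatchingLogic and its method names are hypothetical, and the sketch assumes that each outstanding read has a unique RID and that the table entry is released when the read response returns.

```python
class MatchingLogic:
    """Toy model of the RID-to-DPA lookup table (hypothetical names)."""

    def __init__(self):
        self._dpa_by_rid = {}

    def on_read_command(self, rid, dpa):
        # Remember which device physical address each outstanding read targets.
        self._dpa_by_rid[rid] = dpa

    def on_read_response(self, rid, error_signal, error_info=None):
        # Assumption: the entry is released once the response returns.
        dpa = self._dpa_by_rid.pop(rid)
        if not error_signal:
            return None  # no error, nothing to log
        # On an error, emit a record that pairs the error with its address.
        return {"rid": rid, "dpa": dpa, "error": error_info}
```

A record returned by on_read_response is what would be written into the error FIFO in this model; reads that complete without an error signal produce no record.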
In another embodiment, integrated circuit 500 is a processor implementing the CXL™ standard and includes matching logic and a FIFO buffer. The output of the matching logic passes through the FIFO buffer, and a backpressure signal is generated when the FIFO buffer becomes full. In other embodiments, the processor may escalate the interrupt level if the FIFO buffer reaches a threshold level or other fill condition that is less than the FIFO buffer being full.
In at least one embodiment, to prevent overwriting of error information when burst errors are detected in a time shorter than the interrupt handling time, an error record FIFO buffer (e.g., 510) of the error reporting engine 508 is interposed between the ECC engine 506 and the management processor 512. The error record FIFO buffer may hold a plurality of error messages until the management processor 512 has read all of them. When the number of entries in the FIFO buffer exceeds a predefined threshold level, the error reporting engine 508 asserts an additional interrupt signal to indicate an emergency to the management processor 512. This interrupt has the highest priority, so the management processor 512 should read and invalidate the entries before the FIFO buffer overflows or is overwritten. When the error record FIFO buffer is full, the error reporting engine 508 sends a backpressure signal (e.g., 503) to the memory controller 514 to hold the read operation. Using this backpressure signal, all error information may be passed to management processor 512 without any loss of error information.
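The effect of the backpressure signal can be illustrated with a small cycle-based simulation. This is a sketch under stated assumptions, not the disclosed hardware: the class ErrorRecordFifo, the function run, and the parameter drain_every (how often the management processor reads one entry) are hypothetical, and a held read is modeled simply as retrying the same error record on the next cycle.

```python
from collections import deque


class ErrorRecordFifo:
    """Toy bounded FIFO that raises backpressure when full (hypothetical)."""

    def __init__(self, depth):
        self.depth = depth
        self._entries = deque()

    def __len__(self):
        return len(self._entries)

    @property
    def backpressure(self):
        # Asserted while the FIFO is full.
        return len(self._entries) >= self.depth

    def push(self, record):
        if self.backpressure:
            return False  # the read must be held; nothing is overwritten
        self._entries.append(record)
        return True

    def pop(self):
        return self._entries.popleft()


def run(errors, depth, drain_every):
    """Offer one error per cycle; the reader drains one entry every N cycles."""
    fifo = ErrorRecordFifo(depth)
    pending = deque(errors)
    delivered, stalls, cycle = [], 0, 0
    while pending or len(fifo):
        cycle += 1
        if pending:
            if fifo.push(pending[0]):
                pending.popleft()
            else:
                stalls += 1  # backpressure holds this read for another cycle
        if cycle % drain_every == 0 and len(fifo):
            delivered.append(fifo.pop())
    return delivered, stalls
```

Calling run(list(range(6)), depth=2, drain_every=3) delivers all six records in order even though the producer is faster than the reader; the cost is stalled reads rather than lost error information.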
FIG. 6 is a flow diagram of a method 600 of operating an integrated circuit for recording burst error information for a memory device in accordance with at least one embodiment. Method 600 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processing device to perform hardware simulation), or a combination thereof. In one embodiment, the method 600 is performed by the controller device 202 of FIG. 2. In one embodiment, method 600 is performed by buffer structure 310 of FIG. 3. In one embodiment, method 600 is performed by buffer control logic as described herein.
Referring to FIG. 6, method 600 begins by processing logic detecting burst error information in data read from one or more memory devices (block 602). The burst error information includes error information regarding at least two errors detected in a first amount of time. Processing logic stores the burst error information in a buffer (block 604). Processing logic generates an interrupt to the management processor to read the burst error information and clear the interrupt (block 606). The management processor reads the burst error information and clears the interrupt in a second amount of time (an interrupt processing time or error processing time) that is greater than the first amount of time. Processing logic prevents the buffer from being overwritten or overflowed (block 608), and the method 600 returns to block 602 or ends.
In at least one embodiment, at block 608, processing logic prevents the buffer from being overwritten or overflowed by sending a signal to the memory controller in response to the buffer being full. In another embodiment, at block 608, processing logic prevents the buffer from being overwritten or overflowed by: transmitting a signal to the memory controller in response to the buffer being full; transmitting a first interrupt to the management processor in response to detecting the burst error information; and sending a second interrupt to the management processor in response to the buffer meeting a fill condition less than full of the buffer, wherein the second interrupt includes a higher priority than the first interrupt.
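The timing relationship behind block 608 can be made concrete with a short calculation. The sketch below is an illustrative model only: the function records_lost_without_protection, the nanosecond units, and the assumption that the handler reads nothing until one full handling time after the first error are hypothetical and are not taken from the disclosure.

```python
def records_lost_without_protection(arrival_times_ns, handler_time_ns, capacity):
    """Count error records lost when storage has `capacity` slots and
    nothing holds off new reads (no backpressure, no early interrupt).

    Assumptions: arrival times are sorted, and the handler reads nothing
    until handler_time_ns after the first error arrives.
    """
    if not arrival_times_ns:
        return 0
    deadline = arrival_times_ns[0] + handler_time_ns
    # Errors that arrive before the handler has read anything.
    in_window = [t for t in arrival_times_ns if t < deadline]
    return max(0, len(in_window) - capacity)
```

With four errors arriving 50 ns apart and a 1000 ns handling time, a single-entry register would lose three records in this model, whereas a four-entry buffer loses none. Signaling of the kind described at block 608 is what covers bursts longer than the buffer depth.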
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other implementations will be apparent to those of skill in the art upon reading and understanding the above description. The scope of the disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
In the above description, numerous details are set forth. It will be apparent, however, to one skilled in the art that the aspects of the present disclosure may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present disclosure.
Some portions of the detailed descriptions above are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithms describe and represent the means by which those skilled in the data processing arts most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, considered to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as "receiving," "determining," "selecting," "storing," "setting," or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear as described in the specification. In addition, aspects of the present disclosure are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the disclosure as described herein.
Aspects of the present disclosure may be provided as a computer program product or software which may include a machine-readable medium having stored thereon instructions which may be used to program a computer system (or other electronic device) to perform a process according to the present disclosure. A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable (e.g., computer-readable) medium includes a machine (e.g., a computer) readable storage medium (e.g., read only memory ("ROM"), random access memory ("RAM"), magnetic disk storage media, optical storage media, flash memory devices, etc.).

Claims (20)

1. A controller device, comprising:
error detection logic to detect an error in a read operation associated with a memory device coupled to the controller device;
a buffer to store error information associated with the error; and
buffer control logic to generate and output a first signal in response to the buffer being full.
2. The controller device of claim 1, wherein the buffer control logic is to generate and output a second signal in response to the buffer meeting a fill condition that is less than the buffer being full.
3. The controller device of claim 2, wherein the first signal is a backpressure signal and the second signal is an interrupt.
4. The controller device of claim 1, wherein the buffer control logic is to:
generate and output a first interrupt in response to the buffer meeting a first fill condition that is less than the buffer being full, wherein the first interrupt is associated with a first priority; and
generate and output a second interrupt in response to the buffer meeting a second fill condition between the first fill condition and the buffer being full, wherein the second interrupt is associated with a second priority that is greater than the first priority.
5. The controller device of claim 1, wherein the buffer is a first-in first-out (FIFO) buffer.
6. The controller device of claim 1, wherein the controller device communicates with the memory device using a cache coherence interconnect protocol.
7. The controller device of claim 1, further comprising matching logic to output an identifier of the read operation and a physical address of the read operation to the buffer, wherein the error detection logic is an Error Correction Code (ECC) engine, wherein the ECC engine is to:
detect the error from the read data;
output an error signal to the matching logic; and
output the error information associated with the error to the buffer concurrently with the identifier and the physical address output by the matching logic.
8. The controller device of claim 7, further comprising:
an Address Register (AR) bus coupled to the matching logic, the AR bus to send read commands to the memory device, each read command including an identifier and an associated physical address; and
a read bus coupled to the matching logic, the ECC engine, and the buffer control logic, the read bus to receive read data and an associated identifier from the memory device, wherein:
the matching logic is to store the identifier and associated physical address of each of the read commands sent on the bus;
the ECC engine is to receive the read data via the read bus;
the matching logic is to receive, via the read bus, a respective identifier corresponding to the read data, and to receive the error signal from the ECC engine;
the matching logic is to locate the associated physical address of the respective identifier received from the read bus and output the identifier and the associated physical address to the buffer in response to the error signal; and
the buffer control logic is to provide the first signal via the read bus in response to the buffer being full.
9. The controller device of claim 1, wherein the error detection logic is to detect a plurality of errors caused by word line faults in the memory device.
10. An integrated circuit, comprising:
a first interface coupled to one or more host systems;
a second interface coupled to one or more memory devices;
an Error Correction Code (ECC) engine to detect burst error information in data read from the one or more memory devices;
an error reporting engine comprising a first-in first-out (FIFO) buffer coupled to the ECC engine to store the burst error information and to set one or more interrupts; and
a management processor coupled to the ECC engine and the error reporting engine, the management processor to read the burst error information from the FIFO buffer and clear the one or more interrupts, wherein the error reporting engine is to prevent overwriting or overflow of the burst error information in the FIFO buffer.
11. The integrated circuit of claim 10, further comprising a memory controller, wherein the error reporting engine is to send a signal to the memory controller to prevent the overwriting or overflowing in the FIFO buffer in response to the FIFO buffer being full.
12. The integrated circuit of claim 11, wherein the error reporting engine is further to:
send a first interrupt to the management processor in response to the burst error information being detected by the ECC engine; and
send a second interrupt to the management processor in response to the FIFO buffer meeting a fill condition less than the FIFO buffer being full, wherein the second interrupt comprises a higher priority than the first interrupt.
13. The integrated circuit of claim 10, wherein the management processor comprises an interrupt handling routine to read the burst error information from the FIFO buffer and clear the one or more interrupts during a first amount of time, wherein the burst error information includes error information regarding at least two errors detected in a second amount of time, the second amount of time being less than the first amount of time.
14. The integrated circuit of claim 10, wherein the integrated circuit is a memory expansion chip coupled to a single host system through a cache coherence interconnect.
15. The integrated circuit of claim 10, wherein the integrated circuit is a multi-host memory pool chip coupled to a plurality of host systems through a plurality of cache coherence interconnects.
16. The integrated circuit of claim 10, further comprising a memory controller, wherein the error reporting engine comprises:
the FIFO buffer comprising a set of entries;
matching logic including a buffer to store a set of read identifiers and corresponding Device Physical Addresses (DPAs); and
buffer control logic to send a signal to the memory controller to prevent the overwriting or overflowing in the FIFO buffer in response to the FIFO buffer being full.
17. The integrated circuit of claim 10, wherein the error reporting engine comprises:
a first register to store a first indication that the FIFO buffer is full; and
a second register to store a second indication of the one or more interrupts.
18. A method of an integrated circuit, the method comprising:
detecting burst error information in data read from one or more memory devices, wherein the burst error information includes error information regarding at least two errors detected in a first amount of time;
storing the burst error information in a buffer;
generating an interrupt to a management processor to read the burst error information and clear the interrupt for a second amount of time, the second amount of time being greater than the first amount of time; and
preventing the buffer from being overwritten or overflowed.
19. The method of claim 18, wherein preventing the buffer from being overwritten or overflowed comprises: sending a signal to a memory controller in response to the buffer being full.
20. The method of claim 18, wherein preventing the buffer from being overwritten or overflowed comprises:
transmitting a signal to a memory controller in response to the buffer being full;
transmitting a first interrupt to the management processor in response to the burst error information being detected; and
sending a second interrupt to the management processor in response to the buffer meeting a fill condition less than the buffer being full, wherein the second interrupt includes a higher priority than the first interrupt.
CN202280077041.3A 2021-11-22 2022-11-14 Using buffer structures and signaling to record burst error information in dynamic random access memory (DRAM) Pending CN118284883A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202163282110P 2021-11-22 2021-11-22
US63/282,110 2021-11-22
PCT/US2022/049846 WO2023091377A1 (en) 2021-11-22 2022-11-14 Logging burst error information of a dynamic random access memory (dram) using a buffer structure and signaling

Publications (1)

Publication Number Publication Date
CN118284883A true CN118284883A (en) 2024-07-02

Family

ID=86397677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280077041.3A Pending CN118284883A (en) 2021-11-22 2022-11-14 Using buffer structures and signaling to record burst error information in dynamic random access memory (DRAM)

Country Status (4)

Country Link
US (1) US20240427661A1 (en)
EP (1) EP4437417A4 (en)
CN (1) CN118284883A (en)
WO (1) WO2023091377A1 (en)

Also Published As

Publication number Publication date
US20240427661A1 (en) 2024-12-26
WO2023091377A1 (en) 2023-05-25
EP4437417A4 (en) 2026-01-21
EP4437417A1 (en) 2024-10-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination