Research on Interconnection Network Architecture and Performance Analysis for Large-Scale Many-Core Microprocessors
Published: 2018-07-16 11:16
[Abstract]: High-performance processors based on multi-core and many-core designs are an enabling technology for future exascale high-performance computers. An interconnection network with high bandwidth, low latency, low power consumption, and strong scalability is essential to unlocking the parallel computing power of the cores and improving many-core processor performance. Among the design challenges of many-core systems, interconnect communication is increasingly the bottleneck that limits system performance. Emerging 3D integration technology and silicon photonic devices offer distinct advantages in chip functionality, integration density, and power consumption, and their maturation brings new opportunities for resolving the interconnect bottleneck in many-core systems.
Starting from the interconnect bottleneck of many-core systems, this dissertation explores novel interconnection network architectures for many-core microprocessors and uses network calculus to model and analyze these networks. The research comprises four parts:
(1) On-chip inter-core interconnection network architecture for many-core systems
Messages exchanged between cores are mainly control messages, which have very strict real-time requirements. As the number of compute cores grows, transmission latency becomes the primary factor limiting the performance of inter-core networks in large-scale many-core processors. Simple low-dimensional on-chip topologies such as Mesh are easy to wire, but their hop count grows in proportion to the scale of the system, which makes it hard to meet the low-latency requirements of large-scale many-core chips. Using 3D integration, this dissertation proposes a 3D flattened butterfly topology for electrical inter-core message transmission in large-scale many-core processors. With an integer linear programming model, we overcome the wiring challenges posed by the high-radix routers and long interconnect wires of the butterfly network and embed the flattened butterfly into a 3D stack. The flattened butterfly is a high-dimensional, highly scalable topology that is particularly suited to interconnecting large numbers of compute cores. The 3D butterfly network preserves Mesh connectivity while adding extra shortcut links, and it exploits fast vertical interconnects to deliver inter-core messages quickly. Experimental results show that the 3D butterfly network effectively reduces inter-core latency and significantly improves many-core processor performance.
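The hop-count argument above can be illustrated with a small sketch. It is not the dissertation's 3D design: it assumes a single k x k layer, models the mesh by Manhattan distance, and models a flattened butterfly as a network in which every router links directly to all routers in its row and column. The function names are hypothetical.

```python
from itertools import product


def mesh_hops(a, b):
    """Minimal hops between two routers in a 2D mesh (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def flattened_butterfly_hops(a, b):
    """Minimal hops when each router links to every router in its row and
    column (assumed 2D flattened butterfly): one hop per differing coordinate."""
    return int(a[0] != b[0]) + int(a[1] != b[1])


def hop_stats(k, hop_fn):
    """Return (average, maximum) hop count over all ordered pairs of
    distinct routers in a k x k layer."""
    nodes = list(product(range(k), repeat=2))
    hops = [hop_fn(a, b) for a in nodes for b in nodes if a != b]
    return sum(hops) / len(hops), max(hops)


if __name__ == "__main__":
    for k in (4, 8, 16):
        mesh_avg, mesh_max = hop_stats(k, mesh_hops)
        fb_avg, fb_max = hop_stats(k, flattened_butterfly_hops)
        print(f"k={k}: mesh avg={mesh_avg:.2f} max={mesh_max}, "
              f"flattened butterfly avg={fb_avg:.2f} max={fb_max}")
```

Under these assumptions the mesh diameter grows as 2(k - 1), while the flattened butterfly stays at two hops, at the cost of higher-radix routers and longer wires, which is the wiring problem the integer linear programming model is said to address.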
(2) Optical memory-access network architecture for many-core microprocessors
Memory-access interconnect is critical for many-core processors: if data cannot be fetched and stored quickly, the parallel computing power of the cores is hard to exploit. As more cores are integrated on a single chip, the demand for memory-access bandwidth rises sharply. Traditional processor-memory interconnects based on electrical I/O pins run into difficulty in large-scale many-core chips, because electrical interconnects can hardly deliver sufficient memory bandwidth to the on-chip cores within a strict power budget. Using emerging silicon optoelectronic devices and 3D integration, we propose a high-bandwidth, low-power optical memory-access network for communication between the many-core processor and DRAM. Based on an optical burst switching protocol, this network replaces electrical I/O pins with optical interfaces and provides high-bandwidth, seamless interconnection between the processor and memory. Beyond the bandwidth advantage, the new scheme greatly improves wavelength resource utilization compared with earlier optical memory-access networks, which further improves the power efficiency of memory-access communication. Experimental results show that the power efficiency of the memory-access network based on optical burst switching is improved by nearly 2 times over an optical circuit-switched memory-access network and by 6 times over the electrical-interface scheme.
(3) Congestion avoidance in the electrical control layer of chip-scale optical networks
Because optical buffers and optical logic devices are not available, most hybrid optical-electrical networks rely on an electrical control layer for resource arbitration and link control. In our study of chip-scale optical burst switching networks, we found that the large number of fine-grained optical bursts, the strict latency constraints, and the moderate network operating frequency limit the processing capacity of the electrical control layer and can easily cause severe network congestion. We therefore propose a traffic shaping scheme to address congestion in the electrical control layer. Before injection into the network, all message flows in the system are globally coordinated and shaped so that the aggregate rate of control messages at any intermediate node does not exceed that node's maximum processing capacity, which relieves control-layer congestion. An optimization algorithm selects the shaping parameters of each flow's shaper (for example, the flow rate and the burstiness). To some extent, this congestion control scheme reserves resources for the end-to-end transmission of each flow and provides a basic quality-of-service guarantee on bandwidth, which effectively mitigates the optical burst loss caused by control-layer congestion. Experiments with synthetic traffic and real application traces show that the method effectively avoids control-layer congestion, lowers the message loss rate, and improves the system performance of chip-scale optical burst switching networks.
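The shaping idea described above can be sketched with a token-bucket shaper plus a check that the summed flow rates stay within each node's capacity. This is only an illustration: the abstract does not say which shaper the dissertation uses, and the class name, function name, units, and example values below are all assumptions.

```python
class TokenBucketShaper:
    """Token-bucket shaper with a sustained rate and a burst allowance
    (assumed stand-ins for the 'flow rate' and 'burstiness' parameters)."""

    def __init__(self, rate, burst):
        self.rate = rate          # tokens added per time unit
        self.burst = burst        # bucket depth (maximum burst size)
        self.tokens = burst       # start with a full bucket
        self.last_time = 0.0

    def try_send(self, now, size=1.0):
        """Return True and consume tokens if a message may be injected now."""
        elapsed = now - self.last_time
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_time = now
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False


def overloaded_nodes(flow_rates, flow_paths, node_capacity):
    """List nodes whose aggregate shaped rate exceeds their capacity.

    flow_rates:    {flow: sustained rate}
    flow_paths:    {flow: [nodes the flow's control messages traverse]}
    node_capacity: {node: maximum control-message processing rate}
    """
    load = {}
    for flow, rate in flow_rates.items():
        for node in flow_paths[flow]:
            load[node] = load.get(node, 0.0) + rate
    return sorted(n for n, total in load.items() if total > node_capacity[n])


if __name__ == "__main__":
    rates = {"f1": 0.4, "f2": 0.5, "f3": 0.3}
    paths = {"f1": ["A", "B"], "f2": ["B", "C"], "f3": ["C"]}
    capacity = {"A": 1.0, "B": 0.8, "C": 1.0}
    print(overloaded_nodes(rates, paths, capacity))
```

An optimizer of the kind the abstract mentions would adjust the per-flow rates and burst sizes until `overloaded_nodes` returns an empty list; the optimization itself is not shown here.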
(4) Performance analysis of chip-scale optical interconnection networks
Designing a chip-scale optical interconnection network requires balancing many factors, including network latency, throughput, energy consumption, and silicon area. The choice of these system-level interconnect parameters directly affects the performance of the whole chip, so performance analysis of the on-chip network is important to system design. To this end, we carried out analytical modeling of chip-scale optical networks. Using stochastic network calculus, we built a model of the storage resource requirements of the optical burst switching network and an estimation model of the wavelength resource requirements of the optical devices. Simulation experiments and numerical analysis show that the bounds computed by these analytical models are fairly tight. With these stochastic network calculus models, we can quickly evaluate system-level design parameters of optical interconnection networks for many-core systems, such as memory resource requirements, transmission latency, and optical device resource requirements. Modeling and analyzing network performance early in the design also reduces design risk ahead of time. Overall, our analytical models capture the relationship between system performance, network load, and architecture, help identify the key performance factors and design bottlenecks quickly, and promote convergence of the design space.
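As a reminder of what such bounds look like, the following is the standard deterministic network calculus result, shown as a simplified counterpart of the stochastic models the dissertation describes (its own models are not given in the abstract). The symbols $\sigma$, $\rho$, $R$, and $T$ are assumptions introduced here: a flow constrained by a token-bucket arrival curve crosses a node that offers a rate-latency service curve.

```latex
% Arrival curve (burst \sigma, sustained rate \rho) and
% rate-latency service curve (rate R, latency T), with \rho \le R:
\alpha(t) = \sigma + \rho t \quad (t > 0), \qquad
\beta(t) = R \, \max(0,\; t - T)

% Backlog bound (buffer requirement): maximum vertical deviation
B_{\max} = \sup_{t \ge 0} \bigl[ \alpha(t) - \beta(t) \bigr] = \sigma + \rho T

% Delay bound: maximum horizontal deviation
D_{\max} = T + \frac{\sigma}{R}
```

In this deterministic setting the backlog bound plays the role of a storage requirement and the delay bound that of a worst-case transmission latency; a stochastic analysis replaces these with bounds that hold with a given violation probability.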
In summary, this dissertation studies the interconnect bottleneck of many-core systems, proposes new network architectures, and performs analytical modeling and performance analysis of these architectures based on network calculus. The work couples theory closely with practice, offers new solutions to the interconnect bottleneck of many-core processors, contributes to the development of high-performance processor technology, and further extends the application domain of network calculus.
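[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree conferred]: 2012
[Classification number]: TP332
Article ID: 2126215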
Article link: http://www.lk138.cn/kejilunwen/jisuanjikexuelunwen/2126215.html