Kyushu University Academic Staff Educational and Research Activities Database
List of Papers
Koji Inoue    Last modified date: 2021.06.08

Professor / Advanced Information and Communication Technology / Department of Advanced Information Technology / Faculty of Information Science and Electrical Engineering


Papers
1. Koki Ishida, Ilkwon Byun, Ikki Nagaoka, Kousuke Fukumitsu, Masamitsu Tanaka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Jangwoo Kim, and Koji Inoue, Superconductor Computing for Neural Networks, IEEE Micro, Volume 41, Issue 3, 2021.06.
2. F. Mehdipour, H. Noori, H. Honda, K. Inoue, and K. Murakami, A Gravity-Directed Temporal Partitioning Approach, IEICE Electronics Express, Vol. 5, No. 10, pp.366-373, 2008.05.
4. H. Noori, F. Mehdipour, K. Murakami, K. Inoue, and M. S. Zamani, A Reconfigurable Functional Unit with Conditional Execution for Multi-Exit Custom Instructions, IEICE Transactions on Electronics, vol. E91-C, no.4, pp.497-508, 2008.04.
5. H. Noori, M. Goudarzi, K. Inoue, and K. Murakami, Temperature-Aware Configurable Cache to Reduce Energy in Embedded Systems, IEICE Transactions on Electronics, vol. E91-C, no.4, pp.418-431, 2008.04.
6. N. Takagi, K. Murakami, A. Fujimaki, N. Yoshikawa, K. Inoue, and H. Honda, Proposal of a Desk-Side Supercomputer with Reconfigurable Data-Paths Using Rapid Single-Flux-Quantum Circuits, IEICE Transactions on Electronics, 2008.07.
7. F. Mehdipour, H. Noori, M. S. Zamani, K. Inoue, and K. Murakami, Improving Performance and Energy Saving in a Reconfigurable Processor via Accelerating Control Data Flow Graphs, IEICE Transactions on Electronics, 2007.12.
8. M. Sakamoto, A. Katsuno, G. Sugizaki, T. Yoshida, A. Inoue, K. Inoue, and K. Murakami, A Next-Generation Enterprise Server System with Advanced Cache Coherence Chips, IEICE Transactions on Electronics, vol. E90-C, no.10, 2007.10.
9. K. Inoue, Energy-Security Tradeoff in a Secure Cache Architecture Against Buffer Overflow Attacks, Workshop on Architectural Support for Security and Anti-Virus (WASSA), 2004.10.
10. Hideki Miwa, Ryutaro Susukita, Hidetomo Shibamura, Tomoya Hirao, Jun Maki, Makoto Yoshida, Takayuki Kando, Yuichiro Ajima, Ikuo Miyoshi, Toshiyuki Shimizu, Yuji Oinaga, Hisashige Ando, Yuichi Inadomi, Koji Inoue, Mutsumi Aoyagi, Kazuaki Murakami, NSIM: An Interconnection Network Simulator for Extreme-Scale Parallel Computers, IEICE Transactions on Information and Systems, 2011.12.
11. Shinya Ueno, Shinya Hashiguchi, Naoto Fukumoto, Koji Inoue, Kazuaki Murakami, A 3D-Stacked SRAM/DRAM Hybrid Cache, IPSJ Transactions on Advanced Computing Systems (ACS), 2012.01.
12. Naoto Fukumoto, Koji Inoue, Kazuaki Murakami, An On-Chip Memory Lending Technique for Multicores Considering the Compute/Memory Performance Balance, IPSJ Transactions on Advanced Computing Systems (ACS), 2011.05.
13. Son-Truong Nguyen, Masaaki Kondo, Tomoya Hirao, Koji Inoue, A Prototype System for Many-core Architecture SMYLEref with FPGA Evaluation Boards, IEICE Transactions on Information and Systems, 2013.08.
14. Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Hiroshi Sasaki, Enhanced Dependence Graph Model for Critical Path Analysis on Modern Out-of-Order Processors, IEEE Computer Architecture Letters, 2017.03.
15. Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Dependence Graph Model for Accurate Critical Path Analysis on Out-of-Order Processors, Journal of Information Processing, 2017.12.
16. Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Koji Inoue, Towards Ultra High-Speed Cryogenic Single-Flux-Quantum Computing, IEICE Transactions on Electronics, 2018.05.
17. Satoshi Kawakami, Takatsugu Ono, Toshiyuki Ohtsuka, Koji Inoue, Parallel Precomputation with Input Value Prediction for Model Predictive Control Systems, IEICE Transactions on Information and Systems, 2018.12.
18. Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, and Masaya Notomi, An Integrated Nanophotonic Parallel Adder, ACM Journal on Emerging Technologies in Computing Systems (JETC), 2018.07.
19. Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa, Evaluating Energy-Efficiency of DRAM Channel Interleaving Schemes for Multithreaded Programs, IEICE Transactions on Information and Systems, 2018.09.
20. Yusuke Inoue, Takatsugu Ono, Koji Inoue, Real-time Frame-Rate Control for Energy-Efficient On-Line Object Tracking, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2018.12.
21. Mihiro Sonoyama, Takatsugu Ono, Haruichi Kanaya, Osamu Muta, Smruti Sarangi, Koji Inoue, Radio Propagation Characteristics-Based Spoofing Attack Prevention on Wireless Connected Devices, IPSJ ACS, 2019.01.
22. Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Critical Path based Microarchitectural Bottleneck Analysis for Out-of-Order Execution, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2019.06.
23. Ken-ichi Kitayama, Masaya Notomi, Makoto Naruse, Koji Inoue, Satoshi Kawakami, Atsushi Uchida, Novel frontier of photonics for data processing—Photonic accelerator, APL Photonics, https://doi.org/10.1063/1.5108912, 4, 090901, 2019.09.
24. Koki Ishida, Masamitsu Tanaka, Ikki Nagaoka, Takatsugu Ono, Satoshi Kawakami, Teruo Tanimoto, Akira Fujimaki, Koji Inoue, 32 GHz 6.5 mW Gate-Level-Pipelined 4-Bit Processor using Superconductor Single-Flux-Quantum Logic, 2020 IEEE Symposium on VLSI Circuits - Proceedings, 10.1109/VLSICircuits18222.2020.9162826, 2020.06, A Single-Flux-Quantum (SFQ) 4-bit throughput-oriented processor has successfully been demonstrated at up to 32 GHz with a measured power consumption of 6.5 mW. This is the first implementation of a gate-level-pipelined processor, and it achieves 2.5 Tera-Operations Per Watt (TOPS/W) through circuit and architectural optimizations.
25. Keitaro Oka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Enhancing a manycore-oriented compressed cache for GPGPU, Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 22-31, 2020.01, GPUs can achieve high performance by exploiting massive-thread parallelism. However, some factors limit performance on GPUs, one of which is the negative effect of L1 cache misses. In some applications, GPUs are likely to suffer from L1 cache conflicts because a large number of cores share a small L1 cache capacity. A cache architecture based on data compression is a strong candidate for solving this problem, as it can reduce the number of cache misses. Unlike previous studies, our data compression scheme attempts to exploit the value locality existing not only within cache lines but also across cache lines. We enhance the structure of a last-level compression cache proposed for general-purpose manycore processors to optimize it for the shared L1 caches on GPUs. The experimental results reveal that our proposal outperforms the other compression cache for GPUs by 11 points on average.
26. Susumu Mashimo, Koji Inoue, Ryota Shioya, Akifumi Fujita, Reoma Matsuo, Seiya Akaki, Akifumi Fukuda, Toru Koizumi, Junichiro Kadomoto, Hidetsugu Irie, Masahiro Goshima, An open source FPGA-optimized out-of-order RISC-V soft processor, Proceedings - 2019 International Conference on Field-Programmable Technology (ICFPT 2019), 10.1109/ICFPT47387.2019.00016, 63-71, 2019.12, High-performance soft processors in field-programmable gate arrays (FPGAs) have become increasingly important as recent large FPGA systems rely on soft processors to run many complex workloads, such as a network software stack. An out-of-order (OoO) superscalar approach is a good candidate for improving performance in such cases, as evidenced by studies of OoO hard processors. Recent studies have revealed, however, that conventional OoO processor components do not fit well in an FPGA, and it is thus important to carefully design such components for FPGA characteristics. Hence, we propose the RSD processor: a new, open-source OoO RISC-V soft processor optimized for an FPGA. The RSD supports many aggressive OoO execution features, such as speculative scheduling, OoO memory instruction execution and disambiguation, a memory dependence predictor, and a non-blocking cache. While the RSD supports such aggressive features, it also leverages FPGA characteristics, and therefore consumes fewer FPGA resources than existing OoO soft processors, which do not support such aggressive features well. We first introduce the end result of the RSD microarchitecture design and then describe several novel optimization techniques. The RSD achieves up to 2.5-times higher Dhrystone MIPS while using 60% fewer registers and 64% fewer lookup tables (LUTs) compared to state-of-the-art, open-source OoO processors.
27. Giorgis Georgakoudis, Nikhil Jain, Takatsugu Ono, Koji Inoue, Shinobu Miwa, Abhinav Bhatele, Evaluating the Impact of Energy Efficient Networks on HPC Workloads, Proceedings - 26th IEEE International Conference on High Performance Computing (HiPC 2019), 10.1109/HiPC.2019.00044, 301-310, 2019.12, Interconnection networks grow larger as supercomputers include more nodes and require higher bandwidth for performance. This scaling significantly increases the fraction of power consumed by the network by increasing the number of network components (links and switches). Typically, network links consume power continuously once they are turned on. However, recent proposals for energy-efficient interconnects have introduced low-power operation modes for periods when network links are idle. Low-power operation can increase messaging time when switching a link from low-power to active operation. We extend the TraceR-CODES network simulator with power modeling to evaluate the impact of energy-efficient networking on power and performance. Our evaluation presents the first study of both single-job and multi-job execution to realistically simulate power consumption and performance under congestion for a large-scale HPC network. Results on several workloads consisting of HPC proxy applications show that single-job and multi-job execution favor different modes of low-power operation, achieving significant power savings at the cost of minimal performance degradation.
28. Ken-ichi Kitayama, Masaya Notomi, Makoto Naruse, Koji Inoue, Satoshi Kawakami, Atsushi Uchida, Novel frontier of photonics for data processing-Photonic accelerator, APL Photonics, 10.1063/1.5108912, 4, 9, 2019.09, In the emerging Internet-of-Things, cyber-physical-system-embedded society, big data analytics needs huge computing capability with better energy efficiency. With the end of Moore's law for electronic integrated circuits approaching, and with parallel processing facing the throughput limitation governed by Amdahl's law, there is strong motivation to explore a novel frontier of data processing in the post-Moore era. Optical fiber transmission has made remarkable advances over the last three decades. The record aggregate transmission capacity of wavelength-division-multiplexing systems over a single-mode fiber has reached 115 Tbit/s over 240 km. It is time to turn our attention from data transport by photons to data processing by photons. A photonic accelerator (PAXEL) is a special class of processor placed at the front end of a digital computer, optimized to perform a specific function faster and with less power consumption than an electronic general-purpose processor. It can process images or time-serial data, either in an analog or digital fashion, on a real-time basis. With maturing optoelectronic-device manufacturing technology and a diverse array of computing architectures at hand, prototyping PAXEL becomes feasible by leveraging, e.g., cutting-edge miniature and power-efficient nanostructured silicon photonic devices. In this article, first the bottleneck and the paradigm shift of digital computing are reviewed. Next, we review an array of PAXEL architectures and applications, including artificial neural networks, reservoir computing, pass-gate logic, decision making, and compressed sensing. We assess the potential advantages and challenges of each of these PAXEL approaches to highlight the scope for future work toward practical implementation.
29. Ikki Nagaoka, Masamitsu Tanaka, Kyosuke Sano, Taro Yamashita, Akira Fujimaki, Koji Inoue, Demonstration of an Energy-Efficient, Gate-Level-Pipelined 100 TOPS/W Arithmetic Logic Unit Based on Low-Voltage Rapid Single-Flux-Quantum Logic, 2019 International Superconductive Electronics Conference (ISEC 2019), 10.1109/ISEC46533.2019.8990905, 2019.07, We report the successful operation of an energy-efficient 8-bit arithmetic logic unit (ALU) based on bit-parallel, gate-level-pipelined, low-voltage rapid single-flux-quantum (LV-RSFQ) approaches. We implemented the ALU using a 10-kA/cm2 Nb process. The bias voltage was optimized to obtain high energy efficiency. Although the lowered bias voltage complicates timing design, we solved this problem through precise timing control. The operating frequency reached 30 GHz. Thanks to these high-throughput and low-energy technologies, we realized highly energy-efficient operation of over 100 tera-operations per second per watt (TOPS/W).
30. Omar M. Saad, Ahmed Shalaby, Koji Inoue, Mohammed S. Sayed, Hardware friendly algorithm for earthquakes discrimination based on wavelet filter bank and support vector machine, Proceedings of the 2018 Japan-Africa Conference on Electronics, Communications, and Computations (JAC-ECC 2018), 10.1109/JEC-ECC.2018.8679531, 115-118, 2019.04, Discrimination between earthquakes and explosions is one of the main challenges in the field of seismology. In some cases, explosions are recorded as earthquakes or vice versa, which can contaminate the seismic catalog. Rapid discrimination is required to support real-time seismic applications. The discrimination algorithm is based on a wavelet filter bank to extract the discriminative features and a support vector machine (SVM) as a classifier. We therefore propose to optimize the hardware implementation of the discrimination algorithm on a Field Programmable Gate Array (FPGA). First, we implement the wavelet filter bank using an optimized lifting scheme. Then, we utilize a linear classifier to implement the SVM. Finally, we optimize the hardware resources of the discrimination algorithm for a low-cost FPGA, the TE0711 board (Xilinx Artix-7). The implemented design utilizes 1.2% and 39.8% of the FPGA's Look-Up Table (LUT) and register resources, respectively.
31. Takatsugu Ono, Zhe Chen, Koji Inoue, Improving lifetime in MLC phase change memory using slow writes, Proceedings of the 2018 Japan-Africa Conference on Electronics, Communications, and Computations (JAC-ECC 2018), 10.1109/JEC-ECC.2018.8679540, 65-68, 2019.04, This paper reports the performance and endurance impacts of a slow-write approach for multi-level cell (MLC) phase change memory (PCM). An MLC improves the density of PCM, but endurance is a critical issue. A slow-write approach is one technique used to extend the lifetime of the cell; however, it increases program execution time because each write takes longer. In this paper, we discuss three types of slow-write approach for MLC and quantitatively evaluate their endurance and performance to understand their effectiveness. Our evaluation results show that one of the approaches enhances the endurance of MLC PCM 1.57 times with a 1.41% performance degradation on average compared with the conventional write operation.
32. Koji Inoue, Message from Prof. Koji Inoue, Proceedings of the 2018 Japan-Africa Conference on Electronics, Communications, and Computations (JAC-ECC 2018), 10.1109/JEC-ECC.2018.8679541, IV, 2019.04.
33. Ikki Nagaoka, Masamitsu Tanaka, Koji Inoue, Akira Fujimaki, 29.3 A 48GHz 5.6mW Gate-Level-Pipelined Multiplier Using Single-Flux Quantum Logic, 2019 IEEE International Solid-State Circuits Conference (ISSCC 2019), 10.1109/ISSCC.2019.8662351, 460-462, 2019.03, A multiplier based on superconductor single-flux-quantum (SFQ) logic is demonstrated at up to 48 GHz with a measured power consumption of 5.6 mW. The multiplier performs 8 × 8-bit signed multiplication every clock cycle. The design is based on a bit-parallel, gate-level-pipelined structure that exploits the ultra-high-throughput capability of SFQ logic. The test chip, fabricated using a 1.0-μm, 9-layer process, consists of 20,251 Nb/AlOx/Nb Josephson junctions (JJs). The correctness of operation is verified by on-chip high-speed testing.
34. Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Critical path based microarchitectural bottleneck analysis for out-of-order execution, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 10.1587/transfun.E102.A.758, E102A, 6, 758-766, 2019.01, Correctly understanding microarchitectural bottlenecks is important for optimizing the performance and energy of OoO (Out-of-Order) processors. Although the CPI (Cycles Per Instruction) stack has been utilized for this purpose, it stacks architectural events heuristically by counting how many times the events occur, and the order of stacking affects the result, which may be misleading because the CPI stack does not consider the execution path of dynamic instructions. Critical path analysis (CPA) is a well-known method for identifying the critical execution path of dynamic instruction execution on OoO processors. The critical path consists of the sequence of events that determines the execution time of a program on a certain processor. We develop a novel representation, the CPCI stack (Cycles Per Critical Instruction stack), which is a CPI stack based on CPA. The main challenge in constructing the CPCI stack is how to analyze a large number of paths, because CPA often results in numerous critical paths. In this paper, we show that there are more than 10^10 critical paths in the execution of only one thousand instructions in 35 of 48 benchmarks from SPEC CPU2006. We then propose a statistical method to analyze all the critical paths and show a case study using the benchmarks.
35. Keiichiro Fukazawa, Masatsugu Ueda, Yuichi Inadomi, Mutsumi Aoyagi, Takayuki Umeda, Koji Inoue, Performance Analysis of CPU and DRAM Power Constrained Systems with Magnetohydrodynamic Simulation Code, Proceedings - 20th IEEE International Conference on High Performance Computing and Communications, 16th International Conference on Smart City, and 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS 2018), 10.1109/HPCC/SmartCity/DSS.2018.00113, 626-631, 2019.01, Power consumption has become a critical issue in developing exascale supercomputer systems. On the other hand, application developers rarely consider the power-consumption characteristics of their applications, because their main interest is how fast their applications run. In this study, we examine and evaluate the power-consumption behavior of our magnetohydrodynamic simulation code, which solves the planetary magnetosphere, under constrained CPU and DRAM power on an x86 computer system. As a result, we found some regions in the simulation code that decrease calculation performance under power capping and others whose performance is unaffected by it. This indicates the possibility of power optimization without performance degradation by dynamically adjusting power caps while running the application. In addition, we identified specific combinations of CPU and DRAM power that greatly affect calculation performance.
36. Mihiro Sonoyama, Takatsugu Ono, Haruichi Kanaya, Osamu Muta, Smruti R. Sarangi, Koji Inoue, Radio propagation characteristics-based spoofing attack prevention on wireless connected devices, Journal of Information Processing, 10.2197/ipsjjip.27.322, 27, 322-334, 2019, A spoofing attack, in which a malicious transmitter outside a system attempts to pass as genuine, is a critical issue in wireless communication. As a countermeasure, we propose a device-authentication method based on position identification using radio-propagation characteristics (RPCs). Because it does not depend on information processing such as encryption, this method can be applied to sensing devices and other devices with tight resource restrictions. We call the space from which attacks can succeed the "attack space." To confine the attack space to the inside of the target system and thereby prevent spoofing attacks from the outside, the relationship between combinations of transceivers and the attack space must be formulated. In this research, we consider two RPCs, the received signal strength ratio (RSSR) and the time difference of arrival (TDoA), and construct an attack-space model that uses both RPCs simultaneously. We take a tire pressure monitoring system (TPMS) as a case study and perform a security evaluation based on radio-wave propagation simulation. The simulation results, assuming multiple noise environments, all indicate that it is possible to eliminate the possibility of attack from a distant location.
37. Satoshi Kawakami, Takatsugu Ono, Toshiyuki Ohtsuka, Koji Inoue, Parallel precomputation with input value prediction for model predictive control systems, IEICE Transactions on Information and Systems, 10.1587/transinf.2018PAP0003, E101D, 12, 2864-2877, 2018.12, We propose a parallel precomputation method for real-time model predictive control. The key idea is to use predicted input values produced by model predictive control to solve an optimal control problem in advance. It is well known that control systems are not well suited to multi- or many-core processors, because feedback-loop control systems are inherently based on sequential operations. However, since the proposed method does not rely on conventional thread-/data-level parallelism, it can easily be applied to such control systems without changing the algorithm in applications. A practical evaluation using three real-world model predictive control system simulation programs demonstrates that the proposed method achieves drastic performance improvement without degrading control quality.
38. Masaaki Kondo, Ikuo Miyoshi, Koji Inoue, Shinobu Miwa, Power management framework for post-petascale supercomputers, Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project, 10.1007/978-981-13-1924-2_13, 249-269, 2018.12, Power consumption is a first-class design constraint for developing future exascale computing systems. To achieve exascale system performance with a realistic power provisioning of 20-30 MW, we need to improve power-performance efficiency significantly compared with today's supercomputer systems. To maximize effective performance within a power constraint, it is necessary to investigate how to optimize power resource allocation to each hardware component and each job submitted to the system. We have been conducting research and development on a software framework for code optimization and system power management for power-constraint-adaptive systems. We briefly introduce our research efforts on maximizing application performance under a given power constraint, a power-aware resource manager, and a power-performance simulation and analysis framework for future supercomputer systems.
39. Yusuke Inoue, Takatsugu Ono, Koji Inoue, Real-time frame-rate control for energy-efficient on-line object tracking, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 10.1587/transfun.E101.A.2297, E101A, 12, 2297-2307, 2018.12, On-line object tracking (OLOT) is a core technology in computer vision, and its importance has been increasing rapidly. Because this technology is used in battery-operated products, energy consumption must be minimized. This paper describes a method of adaptive frame-rate optimization to satisfy that requirement. An energy trade-off occurs between image capturing and object tracking. The method therefore optimizes the frame rate based on the continuously changing object speed, minimizing total energy while taking the trade-off into account. Simulation results show a maximum energy reduction of 50.0% and an average reduction of 35.9% without serious degradation of tracking accuracy.
40. Omar M. Saad, Koji Inoue, Ahmed Shalaby, Lotfy Samy, Mohammed S. Sayed, Automatic Arrival Time Detection for Earthquakes Based on Stacked Denoising Autoencoder, IEEE Geoscience and Remote Sensing Letters, 10.1109/LGRS.2018.2861218, 15, 11, 1687-1691, 2018.11, The accurate detection of P-wave arrival time is imperative for determining the hypocenter location of an earthquake. However, precise detection of onset time becomes more difficult when the signal-to-noise ratio (SNR) of the seismic data is low, such as during microearthquakes. In this letter, a stacked denoising autoencoder (SDAE) is proposed to smooth the background noise. The SDAE acts as a denoising filter for the seismic data. In the proposed algorithm, the SDAE is utilized to reduce background noise such that the onset time becomes clearer and sharper. Afterward, a hard decision with one threshold is used to detect the onset time of the event. The proposed algorithm is evaluated on both synthetic and field seismic data. As a result, the proposed algorithm outperforms the short-time average/long-time average and the Akaike information criterion algorithms. The proposed algorithm accurately picks the onset time for 94.1% of 407 field seismic waveforms with a standard deviation error of 0.10 s. In addition, the results indicate that the proposed algorithm can pick arrival times accurately for weak-SNR seismic data with an SNR higher than -14 dB.
41. Omar M. Saad, Koji Inoue, Ahmed Shalaby, Lotfy Samy, Mohammed S. Sayed, Autoencoder based Features Extraction for Automatic Classification of Earthquakes and Explosions, Proceedings - 17th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2018), 10.1109/ICIS.2018.8466464, 445-450, 2018.09, Monitoring illegal explosions is mandatory for the safety of human life and the environment, and for protecting important structures such as the High Dam in Egypt. This kind of monitoring can be accomplished by detecting and identifying the explosions. If an illegal explosion such as a quarry blast happens, an alarm should be reported to the government so that it can take immediate action. The main problem, however, is that many recorded explosion signals are similar in shape to earthquakes, and the two cannot easily be differentiated. Incorrect classification may also distort the real seismicity of the region. This problem motivates us to search for unique discriminating features that distinguish earthquakes from explosions with high accuracy. Therefore, in this paper, we propose to extract discriminative features with an autoencoder from the first few seconds after the P-wave arrival time of the event. The discriminative features are found to lie in the first 60 samples after the P-wave arrival time. Thus, the first stage of the proposed algorithm extracts the discriminative features via the autoencoder; then, a softmax classifier labels the event based on these extracted features. The proposed algorithm achieves a classification accuracy of 98.55% when applied to 900 earthquake and quarry-blast waveforms recorded by the Egyptian National Seismic Network (ENSN).
42. Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa, Evaluating energy-efficiency of DRAM channel interleaving schemes for multithreaded programs, IEICE Transactions on Information and Systems, 10.1587/transinf.2017EDP7296, E101D, 9, 2247-2257, 2018.09, The power consumption of server platforms has been increasing as the amount of hardware resources equipped on them grows. In particular, the capacity of DRAM continues to grow, and it is not rare for DRAM to consume more power than the processors on modern servers. Therefore, reducing DRAM energy consumption is a critical challenge for reducing system-level energy consumption. Although it is well known that improving row buffer locality (RBL) and bank-level parallelism (BLP) is effective for reducing DRAM energy consumption, our preliminary evaluation on a real server demonstrates that RBL is generally low across 15 multithreaded benchmarks. In this paper, we investigate the memory access patterns of these benchmarks using a simulator and observe that cache-line-grained channel interleaving schemes, which are widely applied to modern servers with multiple memory channels, hurt the RBL that each of the benchmarks potentially possesses. To address this problem, we focus on a row-grained channel interleaving scheme and compare it with three cache-line-grained schemes. Our evaluation shows that it reduces DRAM energy consumption by 16.7%, 12.3%, and 5.5% on average (up to 34.7%, 28.2%, and 12.0%) compared to the other schemes, respectively.
43. Ryuichi Sakamoto, Tapasya Patki, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Daniel Ellsworth, Barry Rountree, Martin Schulz, Analyzing resource trade-offs in hardware overprovisioned supercomputers, Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium (IPDPS 2018), 10.1109/IPDPS.2018.00062, 526-535, 2018.08, Hardware overprovisioned systems have recently been proposed as a viable alternative for a power-efficient design of next-generation supercomputers. A key challenge for such systems is to determine the degree of overprovisioning, which refers to the number of extra nodes that need to be installed under a given power constraint. In this paper, we first show that the degree of overprovisioning depends on dynamic parameters, such as the job mix as well as the global power constraint, and that static decisions can result in limited system throughput. We then study an exhaustive combination of adaptive resource management strategies that span three job scheduling algorithms, four power capping techniques, and three node boot-up mechanisms to understand the trade-off space involved. We then draw conclusions about how these strategies can adaptively control the degree of overprovisioning and analyze their impact on job throughput and power utilization.
44. Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, Masaya Notomi, An integrated nanophotonic parallel adder, ACM Journal on Emerging Technologies in Computing Systems, 10.1145/3178452, 14, 2, 2018.07, Integrated optical circuits with nanophotonic devices have attracted significant attention due to their low power dissipation and light-speed operation. With light interference and resonance phenomena, the nanophotonic device works as a voltage-controlled optical pass-gate like a pass-transistor. This article first introduces the concept of optical pass-gate logic and then proposes a parallel adder circuit based on optical pass-gate logic. Experimental results obtained with an optoelectronic circuit simulator show the advantages of our optical parallel adder circuit over a traditional CMOS-based parallel adder circuit.
45. Akihiko Shinya, Kengo Nozaki, Masaya Notomi, Tohru Ishihara, Koji Inoue, Ultralow-latency optical circuit based on optical pass gate logic, NTT Technical Review, 16, 7, 33-38, 2018.07, A novel light speed computing technology has been developed by NTT, Kyoto University, and Kyushu University that employs nanophotonic technology in critical paths and thus overcomes the problem of operational latency that is the chief limiting factor in conventional electronic circuits. The ultimate objective of this work is to develop an ultrahigh-speed optoelectronic arithmetic processor. This article provides an overview of our recent work and describes the successful implementation of this novel optical computing technology..
46. Susumu Mashimo, Ryota Shioya, Koji Inoue, VMOR: Microarchitectural Support for Operand Access in an Interpreter, IEEE Computer Architecture Letters, 10.1109/LCA.2018.2866243, 17, 2, 217-220, 2018.07, Dynamic scripting languages have become very popular because of their high productivity. However, many of these languages have significant runtime overheads because they employ interpreter-based virtual machines. One of the major overheads for the interpreter derives from operand accesses, which significantly increase memory accesses. We propose VMOR, microarchitectural support for operand accesses in the interpreter. VMOR remaps operand values into floating-point physical registers, which are rarely used in the interpreter, and thus effectively reduces the memory accesses..
47. Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Koji Inoue, Towards ultra-high-speed cryogenic single-flux-quantum computing, IEICE Transactions on Electronics, 10.1587/transele.E101.C.359, E101C, 5, 359-369, 2018.05, CMOS microprocessors are limited in their capacity for clock speed improvement because of increasing power consumption, i.e., they face a power-wall problem. Single-flux-quantum (SFQ) circuits offer a solution with their ultra-fast-speed and ultra-low-power natures. This paper introduces our contributions towards ultra-high-speed cryogenic SFQ computing. The first step is to design SFQ microprocessors. From qualitatively and quantitatively evaluating past-designed SFQ microprocessors, we have found that revisiting the architecture of SFQ microprocessors and on-chip caches is the first critical challenge. On the basis of cross-layer discussions and analysis, we came to the conclusion that a bit-parallel gate-level pipeline architecture is the best solution for SFQ designs. This paper summarizes our current research results targeting SFQ microprocessors and on-chip cache architectures..
48. Teruo Tanimoto, Takatsugu Ono, Koji Inoue, CPCI Stack: Metric for Accurate Bottleneck Analysis on OoO Microprocessors, 5th International Symposium on Computing and Networking, CANDAR 2017 Proceedings - 2017 5th International Symposium on Computing and Networking, CANDAR 2017, 10.1109/CANDAR.2017.60, 166-172, 2018.04, Correctly understanding microarchitectural bottlenecks is important for optimizing the performance and energy of OoO (Out-of-Order) processors. Although the CPI (Cycles Per Instruction) stack has been utilized for this purpose, it stacks architectural events heuristically by counting how many times the events occur, and the order of stacking affects the result, which may be misleading. This is because the CPI stack does not consider the execution path of dynamic instructions. Critical path analysis (CPA) is a well-known method to identify the critical execution path of dynamic instruction execution on OoO processors. The critical path consists of the sequence of events that determines the execution time of a program on a certain processor. We develop a novel representation called the CPCI stack (Cycles Per Critical Instruction stack), which is a CPI stack based on CPA. The main challenge in constructing the CPCI stack is how to analyze a large number of paths, because CPA often results in numerous critical paths. In this paper, we show that there are more than 10^10 critical paths in the execution of only one thousand instructions in 35 of the 48 benchmarks from SPEC CPU2006. We then propose a statistical method to analyze all the critical paths and show a case study using these benchmarks..
49. Mihiro Sonoyama, Takatsugu Ono, Osamu Muta, Haruichi Kanaya, Inoue Koji, Wireless Spoofing-Attack Prevention Using Radio-Propagation Characteristics, Proceedings - 2017 IEEE 15th International Conference on Dependable, Autonomic and Secure Computing, 2017 IEEE 15th International Conference on Pervasive Intelligence and Computing, 2017 IEEE 3rd International Conference on Big Data Intelligence and Computing and 2017 IEEE Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec 2017, 10.1109/DASC-PICom-DataCom-CyberSciTec.2017.94, 502-510, 2018.03, A spoofing attack, in which a malicious transmitter outside a system attempts to pass as genuine, is a critical issue in wireless communication for embedded systems. As a countermeasure, we propose a device-authentication method based on position identification using radio-propagation characteristics (RPCs). Since RPCs are natural phenomena, this method does not depend on information processing such as encryption technology. We call the space from which attacks can succeed the "attack space". By formulating the relationship between combinations of transceivers and the attack space, this method can be used in embedded systems. In this research, we consider two RPCs, the received signal strength ratio (RSSR) and the time difference of arrival (TDoA), and construct an attack-space model that uses these RPCs simultaneously to prevent wireless spoofing attacks. We present the results of a validity evaluation of the proposed model based on radio-wave-propagation simulation assuming free space and a noisy environment..
50. A. Shinya, T. Ishihara, K. Inoue, K. Nozaki, S. Kita, M. Notomi, Low-latency optical parallel adder based on a binary decision diagram with wavelength division multiplexing scheme, Optical Data Science: Trends Shaping the Future of Photonics 2018 Optical Data Science Trends Shaping the Future of Photonics, 10.1117/12.2296842, 2018.01, We propose an optical parallel adder based on a binary decision diagram that can calculate simply by propagating light through electrically controlled optical pass gates. The CARRY and inverted-CARRY operations are multiplexed in one circuit by a wavelength division multiplexing scheme to reduce the number of optical elements, and only a single gate constitutes the critical path for one digit calculation. The processing time reaches picoseconds per digit when we use 100-μm-long optical pass gates, which is ten times faster than a CMOS circuit..
51. Masaya Notomi, Kengo Nozaki, Shota Kita, Akihiko Shinya, Tohru Ishihara, Inoue Koji, Ultralow latency computation based on integrated nanophotonics, JSAP-OSA Joint Symposia, JSAP 2018 JSAP-OSA Joint Symposia, JSAP 2018, 2018.01, Moore's law for CMOS computers is still continuing, but its near-future saturation is now being discussed. One of the most serious limitations concerns latency. The computation delay of a CMOS transistor is already saturated above 10 ps, which will be problematic when an ultralow-latency response is required for broadband data streams, even with parallelization or pipelined processing. We believe that optical circuits may serve as ultralow-latency computation circuits if they are small enough and tightly combined with electronic circuits. The former requires nanophotonic devices/circuits, and the latter requires OE/EO conversion with ultrasmall capacitance..
52. Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Dependence graph model for accurate critical path analysis on out-of-order processors, Journal of information processing, 10.2197/ipsjjip.25.983, 25, 983-992, 2017.12, The dependence graph model of out-of-order (OoO) instruction execution is a powerful representation used for critical path analysis. However, most, if not all, of the previous models are out-of-date and lack enough detail to model modern OoO processors, or are too specific and complicated, which limits their generality and applicability. In this paper, we propose an enhanced dependence graph model which remains simple but greatly improves the accuracy over prior models. The evaluation results using the gem5 simulator with configurations similar to Intel's Haswell and Silvermont architectures show that the proposed enhanced model achieves CPI errors of 2.1% and 4.4%, which are 90.3% and 77.1% improvements over the state-of-the-art model..
53. Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Hiroshi Sasaki, Enhanced dependence graph model for critical path analysis on modern out-of-order processors, IEEE Computer Architecture Letters, 10.1109/LCA.2017.2684813, 16, 2, 111-114, 2017.07, The dependence graph model of out-of-order (OoO) instruction execution is a powerful representation used for critical path analysis. However, most, if not all, of the previous models are out-of-date and lack enough detail to model modern OoO processors, or are too specific and complicated, which limits their generality and applicability. In this paper, we propose an enhanced dependence graph model which remains simple but greatly improves the accuracy over prior models. The evaluation results using the gem5 simulator show that the proposed enhanced model achieves a CPI error of 2.1 percent, which is a 90.3 percent improvement over the state-of-the-art model..
54. Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Tapasya Patki, Daniel Ellsworth, Barry Rountree, Martin Schulz, Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework, 31st IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017 Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017, 10.1109/IPDPS.2017.107, 957-966, 2017.06, Limited power budgets will be one of the biggest challenges for deploying future exascale supercomputers. One of the promising ways to deal with this challenge is hardware overprovisioning, that is, installing more hardware resources than can be fully powered under a given power limit, coupled with software mechanisms to steer the limited power to where it is needed most. Prior research has demonstrated the viability of this approach, but could only rely on small-scale simulations of the software stack. While such research is useful to understand the boundaries of performance benefits that can be achieved, it does not cover any deployment or operational concerns of using overprovisioning on production systems. This paper is the first to present an extensible power-aware resource management framework for production-sized overprovisioned systems based on the widely established SLURM resource manager. Our framework provides flexible plugin interfaces and APIs for power management that can be easily extended to implement site-specific strategies and to compare different power management techniques. We demonstrate our framework on a 965-node HA8000 production system at Kyushu University. Our results indicate that it is indeed possible to safely overprovision hardware in production. We also find that the power consumption of idle nodes, which depends on the degree of overprovisioning, can become a bottleneck. Using real-world data, we then draw conclusions about the impact of the total number of nodes provided in an overprovisioned environment..
55. Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa, Power-Efficient Breadth-First Search with DRAM Row Buffer Locality-Aware Address Mapping, 2016 High Performance Graph Data Management and Processing, HPGDMP 2016 Proceedings of HPGDMP 2016 High Performance Graph Data Management and Processing - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis, 10.1109/HPGDMP.2016.010, 17-24, 2017.01, Graph analysis applications have been widely used in real services such as road-traffic analysis and social network services. Breadth-first search (BFS) is one of the most representative algorithms for such applications; therefore, many researchers have tuned it to maximize performance. On the other hand, owing to the strict power constraints of modern HPC systems, it is necessary to improve power efficiency (i.e., performance per watt) when executing BFS. In this work, we focus on the power efficiency of DRAM and investigate the memory access pattern of a state-of-the-art BFS implementation using a cycle-accurate processor simulator. The results reveal that the conventional address mapping schemes of modern memory controllers do not efficiently exploit row buffers in DRAM. Thus, we propose a new scheme called per-row channel interleaving and improve the DRAM power efficiency by 30.3% compared to a conventional scheme for a certain simulator setting. Moreover, we demonstrate that this proposed scheme is effective for various configurations of memory controllers..
56. Jens Knoop, Wolfgang Karl, Martin Schulz, Koji Inoue, Preface, 30th International Conference on Architecture of Computing Systems, ARCS 2017 Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10172 LNCS, 2017.01.
57. Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, Masaya Notomi, An integrated optical parallel adder as a first step towards light speed data processing, 13th International SoC Design Conference, ISOCC 2016 ISOCC 2016 - International SoC Design Conference Smart SoC for Intelligent Things, 10.1109/ISOCC.2016.7799721, 123-124, 2016.12, Integrated optical circuits with nanophotonic devices have attracted significant attention due to their low power dissipation and light-speed operation. With light interference and resonance phenomena, the nanophotonic device works as a voltage-controlled optical pass-gate like a pass-transistor. This paper first introduces the concept of optical pass-gate logic, and then proposes a parallel adder circuit based on optical pass-gate logic. Experimental results obtained with an optoelectronic circuit simulator show the advantages of our optical parallel adder circuit over a traditional CMOS-based parallel adder circuit..
58. Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Koji Inoue, Single-flux-quantum cache memory architecture, 13th International SoC Design Conference, ISOCC 2016 ISOCC 2016 - International SoC Design Conference Smart SoC for Intelligent Things, 10.1109/ISOCC.2016.7799755, 105-106, 2016.12, Single-flux-quantum (SFQ) logic is a promising technology for realizing a microprocessor that operates over 100 GHz thanks to its ultra-fast-speed and ultra-low-power natures. Although previous work has demonstrated a prototype SFQ microprocessor, the SFQ-based L1 cache memory has not been well optimized: it suffers from a large access latency and strictly limited scalability. This paper proposes a novel SFQ cache architecture to support fast accesses. The sub-arrayed structure applied to the cache provides better scalability in terms of capacity. Evaluation results show that the proposed cache achieves 1.8X faster access speed..
59. Yoshihiro Tanaka, Keitaro Oka, Takatsugu Ono, Koji Inoue, Accuracy analysis of machine learning-based performance modeling for microprocessors, 4th International Japan-Egypt Conference on Electronic, Communication and Computers, JEC-ECC 2016 Proceedings of the 2016 4th International Japan-Egypt Conference on Electronic, Communication and Computers, JEC-ECC 2016, 10.1109/JEC-ECC.2016.7518973, 83-86, 2016.07, This paper analyzes the accuracy of performance models generated by a machine learning-based empirical modeling methodology. Although the accuracy strongly depends on the quality of the learning procedure, it is not clear what kind of learning algorithms and training data set (or features) should be used. This paper comprehensively explores the learning space of processor performance modeling as a case study. We focus on static architectural parameters, such as cache size and clock frequency, as the training data set. Experimental results show that tree-based non-linear regression modeling is superior to stepwise linear regression modeling. Another observation is that clock frequency is the most important feature for improving prediction accuracy..
60. Satoshi Matsuoka, Hideharu Amano, Kengo Nakajima, Koji Inoue, Tomohiro Kudoh, Naoya Maruyama, Kenjiro Taura, Takeshi Iwashita, Takahiro Katagiri, Toshihiro Hanawa, Toshio Endo, From FLOPS to BYTES: Disruptive change in high-performance computing towards the post-Moore era, ACM International Conference on Computing Frontiers, CF 2016 2016 ACM International Conference on Computing Frontiers - Proceedings, 10.1145/2903150.2906830, 274-281, 2016.05, The slowdown and inevitable end of exponential scaling in processor performance, the end of the so-called "Moore's Law", is predicted to occur around the 2025-2030 timeframe. Because CMOS semiconductor voltage is also approaching its limits, logic transistor power will become constant, and as a result, system FLOPS will cease to improve, with serious consequences for IT in general and supercomputing in particular. Existing attempts to overcome the end of Moore's law are rather limited in their future outlook or applicability. We claim that data-oriented parameters, such as bandwidth and capacity, or BYTES, are the new parameters that will allow continued performance gains even after computing performance or FLOPS ceases to improve, due to continued advances in storage device technologies, optics, and manufacturing technologies including 3-D packaging. Such a transition from FLOPS to BYTES will lead to disruptive changes in overall systems, from applications, algorithms, and software to architecture, as to what parameter to optimize for in order to achieve continued performance growth over time. We are launching a new set of research efforts to investigate and devise new technologies enabling such disruptive changes from FLOPS to BYTES in the Post-Moore era, focusing on HPC, where there is extreme sensitivity to performance, and expect the results to disseminate to the rest of IT..
61. Satoshi Imamura, Keitaro Oka, Yuichiro Yasui, Yuichi Inadomi, Katsuki Fujisawa, Toshio Endo, Koji Ueno, Keiichiro Fukazawa, Nozomi Hata, Yuta Kakibuka, Koji Inoue, Takatsugu Ono, Evaluating the impacts of code-level performance tunings on power efficiency, 4th IEEE International Conference on Big Data, Big Data 2016 Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016, 10.1109/BigData.2016.7840624, 362-369, 2016.01, As the power consumption of HPC systems will be a primary constraint for exascale computing, a main objective in the HPC community has recently become to maximize power efficiency (i.e., performance per watt) rather than performance alone. Although programmers have spent considerable effort improving performance by tuning HPC programs at the code level, tunings for improving power efficiency are now required as well. In this work, we select two representative HPC programs (Graph500 and SDPARA) and evaluate how traditional code-level performance tunings applied to these programs affect power efficiency. We also investigate the impacts of the tunings on power efficiency at various operating frequencies of CPUs and/or GPUs. The results show that the tunings significantly improve power efficiency, and that different types of tunings exhibit different trends in power efficiency as CPU frequency varies. Finally, the scalability and power efficiency of state-of-the-art Graph500 implementations are explored on both a single-node platform and a 960-node supercomputer. With their high scalability, they achieve 27.43 MTEPS/Watt with 129.76 GTEPS on the single-node system and 4.39 MTEPS/Watt with 1,085.24 GTEPS on the supercomputer..
62. Yuichi Inadomi, Tapasya Patki, Koji Inoue, Mutsumi Aoyagi, Barry Rountree, Martin Schulz, David Lowenthal, Yasutaka Wada, Keiichiro Fukazawa, Masatsugu Ueda, Masaaki Kondo, Ikuo Miyoshi, Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing, International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015 Proceedings of SC 2015 The International Conference for High Performance Computing, Networking, Storage and Analysis, 10.1145/2807591.2807638, 2015.11, A key challenge in next-generation supercomputing is to effectively schedule limited power resources. Modern processors suffer from increasingly large power variations due to the chip manufacturing process. These variations lead to power inhomogeneity in current systems and manifest into performance inhomogeneity in power constrained environments, drastically limiting supercomputing performance. We present a first-of-its-kind study on manufacturing variability on four production HPC systems spanning four microarchitectures, analyze its impact on HPC applications, and propose a novel variation-aware power budgeting scheme to maximize effective application performance. Our low-cost and scalable budgeting algorithm strives to achieve performance homogeneity under a power constraint by deriving application-specific, module-level power allocations. Experimental results using a 1,920 socket system show up to 5.4X speedup, with an average speedup of 1.8X across all benchmarks when compared to a variation-unaware power allocation scheme..
63. José Ayala, Fumio Arakawa, Inoue Koji, Message from the IEEE MCSoC-15 Program Co-Chairs, 9th IEEE International Symposium on Embedded Multicore/Manycore SoCs, MCSoC 2015 Proceedings - IEEE 9th International Symposium on Embedded Multicore/Manycore SoCs, MCSoC 2015, 10.1109/MCSoC.2015.5, xi, 2015.11.
64. Keitaro Oka, Wenhao Jia, Margaret Martonosi, Koji Inoue, Characterization and cross-platform analysis of high-throughput accelerators, 2015 15th IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2015 ISPASS 2015 - IEEE International Symposium on Performance Analysis of Systems and Software, 10.1109/ISPASS.2015.7095797, 161-162, 2015.04, Today's computer systems often employ high-throughput accelerators (such as Intel Xeon Phi coprocessors and NVIDIA Tesla GPUs) to improve the performance of some applications or portions of applications. While such accelerators are useful for suitable applications, it remains challenging to predict which workloads will run well on these platforms and to predict the resulting performance trends for varying input..
65. Takeshi Soga, Hiroshi Sasaki, Tomoya Hirao, Masaaki Kondo, Koji Inoue, A flexible hardware barrier mechanism for many-core processors, 2015 20th Asia and South Pacific Design Automation Conference, ASP-DAC 2015 20th Asia and South Pacific Design Automation Conference, ASP-DAC 2015, 10.1109/ASPDAC.2015.7058982, 61-68, 2015.03, This paper proposes a new hardware barrier mechanism which offers the flexibility to select which cores should join the synchronization, allowing multiple multi-threaded applications to execute by dividing a many-core processor into several groups. Experimental results based on an RTL simulation show that our hardware barrier achieves a 66-fold reduction in latency over typical software-based implementations, with a processor hardware overhead of only 1.8%. Additionally, we demonstrate that the proposed mechanism is sufficiently flexible to cover a variety of core groups with minimal hardware overhead..
66. Satoshi Imamura, Hiroshi Sasaki, Koji Inoue, Dimitrios S. Nikolopoulos, Power-capped DVFS and thread allocation with ANN models on modern NUMA systems, 32nd IEEE International Conference on Computer Design, ICCD 2014 2014 32nd IEEE International Conference on Computer Design, ICCD 2014, 10.1109/ICCD.2014.6974701, 324-331, 2014.12, Power capping is an essential function for efficient power budgeting and cost management on modern server systems. Contemporary server processors operate under power caps by using dynamic voltage and frequency scaling (DVFS). However, these processors are often deployed in non-uniform memory access (NUMA) architectures, where thread allocation between cores may significantly affect performance and power consumption. This paper proposes a method which maximizes performance under power caps on NUMA systems by dynamically optimizing two knobs: DVFS and thread allocation. The method selects the optimal combination of the two knobs with models based on artificial neural networks (ANNs) that capture the nonlinear effect of thread allocation on performance. We implement the proposed method as a runtime system and evaluate it with twelve multithreaded benchmarks on a real AMD Opteron-based NUMA system. The evaluation results show that our method outperforms a naive technique optimizing only DVFS by up to 67.1% under a power cap..
67. Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji Inoue, Masato Edahiro, Martin Peres, Power and performance characterization and modeling of GPU-accelerated systems, 28th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2014 Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS 2014, 10.1109/IPDPS.2014.23, 113-122, 2014.01, Graphics processing units (GPUs) provide an order-of-magnitude improvement on peak performance and performance-per-watt as compared to traditional multicore CPUs. However, GPU-accelerated systems currently lack a generalized method of power and performance prediction, which prevents system designers from an ultimate goal of dynamic power and performance optimization. This is due to the fact that their power and performance characteristics are not well captured across architectures, and as a result, existing power and performance modeling approaches are only available for a limited range of particular GPUs. In this paper, we present power and performance characterization and modeling of GPU-accelerated systems across multiple generations of architectures. Characterization and modeling both play a vital role in optimization and prediction of GPU-accelerated systems. We quantify the impact of voltage and frequency scaling on each architecture with a particularly intriguing result that a cutting-edge Kepler-based GPU achieves energy saving of 75% by lowering GPU clocks in the best scenario, while Fermi- and Tesla-based GPUs achieve no greater than 40% and 13%, respectively. Considering these characteristics, we provide statistical power and performance modeling of GPU-accelerated systems simplified enough to be applicable for multiple generations of architectures. One of our findings is that even simplified statistical models are able to predict power and performance of cutting-edge GPUs within errors of 20% to 30% for any set of voltage and frequency pair..
68. Keiichiro Fukazawa, Masatsugu Ueda, Mutsumi Aoyagi, Tomonori Tsuhata, Kyohei Yoshida, Aruta Uehara, Masakazu Kuze, Yuichi Inadomi, Koji Inoue, Power consumption evaluation of an MHD simulation with CPU power capping, 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2014 Proceedings - 14th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2014, 10.1109/CCGrid.2014.47, 612-617, 2014.01, Power consumption has recently become an important issue in achieving exaflops-class next-generation computer systems. On the other hand, the power-consumption characteristics of application programs have received little attention so far. In this study, we examine the power characteristics of our magnetohydrodynamic (MHD) simulation code for the global magnetosphere, evaluating the code's power consumption behavior under CPU power capping on a parallel computer system. As a result, we confirm that the MHD simulation code contains distinct parts whose execution performance either decreases or remains unchanged under CPU power capping. This indicates the potential for performance optimization using power capping..
69. Hiroshi Sasaki, Satoshi Imamura, Koji Inoue, Coordinated power-performance optimization in manycores, 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT 2013 PACT 2013 - Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 10.1109/PACT.2013.6618803, 51-61, 2013.11, Optimizing performance in multiprogrammed environments, especially for workloads composed of multi-threaded programs, is a desired feature of the runtime management system in future manycore processors. At the same time, power capping capability is required in order to improve the reliability of microprocessor chips while reducing the costs of power supply and thermal budgeting. This paper presents a sophisticated runtime coordinated power-performance management system called C-3PO, which optimizes the performance of manycore processors under a power constraint by controlling two software knobs: thread packing, and dynamic voltage and frequency scaling (DVFS). The proposed solution distributes the power budget to each program by executing the workload threads with an appropriate number of cores and operating frequency. The power budget is distributed carefully in different forms (number of allocated cores or operating frequency) depending on the power-performance characteristics of the workload, so that each program can effectively convert power into performance. The proposed system is based on a heuristic algorithm which relies on runtime prediction of power and performance via hardware performance monitoring units. Empirical results on a 64-core platform show that C-3PO consistently outperforms traditional counterparts across various PARSEC workload mixes..
70. Junya Kaida, Yuko Hara-Azumi, Takuji Hieda, Ittetsu Taniguchi, Hiroyuki Tomiyama, Koji Inoue, Static mapping of multiple data-parallel applications on embedded many-core SoCs, IEICE Transactions on Information and Systems, 10.1587/transinf.E96.D.2268, E96-D, 10, 2268-2271, 2013.10, This paper studies the static mapping of multiple applications on embedded many-core SoCs. The mapping techniques proposed in this paper take into account both inter-application and intra-application parallelism in order to fully utilize the potential parallelism of the manycore architecture. Two approaches are proposed for static mapping: one approach is based on integer linear programming and the other is based on a greedy algorithm. Experiments show the effectiveness of the proposed techniques..
71. Keitarou Oka, Hiroshi Sasaki, Koji Inoue, Line sharing cache: Exploring cache capacity with frequent line value locality, 2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013, 10.1109/ASPDAC.2013.6509677, 669-674, 2013.05, This paper proposes a new last-level cache architecture called the line sharing cache (LSC), which can reduce the number of cache misses without increasing the size of the cache memory. It stores lines that contain identical values in a single line entry, which makes it possible to store a greater number of lines. Evaluation results show performance improvements of up to 35% across a set of SPEC CPU2000 benchmarks..
72. Koji Inoue, SMYLE project: Toward high-performance, low-power computing on manycore-processor SoCs, 2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013, 10.1109/ASPDAC.2013.6509655, 558-560, 2013.05, This paper introduces a manycore research project called SMYLE (Scalable ManYcore for Low Energy computing). The aims of this project are: 1) proposing a manycore SoC architecture and developing a suitable programming and execution environment, 2) designing a domain-specific manycore system for emerging video mining applications, and 3) releasing developed software tools and FPGA emulation environments to accelerate manycore research and development in the community. The project started in December 2010 with full support from the New Energy and Industrial Technology Development Organization (NEDO)..
73. M. Kondo, S. T. Nguyen, T. Hirao, T. Soga, H. Sasaki, K. Inoue, SMYLEref: A reference architecture for manycore-processor SoCs, 2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013, 10.1109/ASPDAC.2013.6509656, 561-564, 2013.05, Nowadays, the trend of developing microprocessors with tens of cores brings a promising prospect for embedded systems. Realizing a high-performance and low-power many-core processor is becoming a primary technical challenge. We are currently developing a many-core processor architecture for embedded systems as part of a NEDO project. This paper introduces the many-core architecture called SMYLEref along with the concept of the Virtual Accelerator on Many-core, in which many cores on a chip are utilized as a hardware platform for realizing multiple virtual accelerators. We are developing its prototype system with off-the-shelf FPGA evaluation boards. In this paper, we introduce the architecture of SMYLEref and the details of the prototype system. In addition, several initial experiments with the prototype system are also presented..
74. Son Truong Nguyen, Masaaki Kondo, Tomoya Hirao, Koji Inoue, A prototype system for many-core architecture SMYLEref with FPGA evaluation boards, IEICE Transactions on Information and Systems, 10.1587/transinf.E96.D.1645, E96-D, 8, 1645-1653, 2013.01, Nowadays, the trend of developing microprocessors with hundreds of cores brings a promising prospect for embedded systems. Realizing a high-performance and low-power many-core processor is becoming a primary technical challenge. Generally, three major issues must be resolved: 1) realizing efficient massively parallel processing, 2) reducing dynamic power consumption, and 3) improving software productivity. To deal with these issues, we propose a solution that uses many low-performance but small and very low-power cores to obtain very high performance, and develop a referential many-core architecture and a program development environment. This paper introduces a many-core architecture named SMYLEref and its prototype system with off-the-shelf FPGA evaluation boards. The initial evaluation results of several SPLASH-2 benchmark programs conducted on our developed 128-core platform are also presented and discussed in this paper..
75. Lovic Gauthier, Shinya Ueno, Inoue Koji, Hybrid compile and run-time memory management for a 3D-stacked reconfigurable accelerator, 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES 2013, 10.1109/CASES.2013.6662514, 2013.01, This paper presents a hybrid compile- and run-time memory management technique for a 3D-stacked reconfigurable accelerator including a memory layer composed of multiple memory units whose parallel access allows a very high bandwidth. The technique inserts allocation, free and data transfer operations into the code for using the memory layer and avoids memory overflows by adding a limited number of additional copies to and from the host memory. When compile-time information is lacking, the technique relies on run-time decisions for controlling these memory operations. Experiments show that, compared to a pessimistic approach, the overhead for avoiding overflows can be cut on average by 27%, 45% and 63% when the size of each memory unit is respectively 1kB, 128kB and 1MB..
76. Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji Inoue, Masato Edahiro, Martin Peres, Power and performance of GPU-accelerated systems: A closer look, Proceedings - 2013 IEEE International Symposium on Workload Characterization, IISWC 2013, 10.1109/IISWC.2013.6704675, 109-110, 2013.01.
77. Satoshi Kawakami, Akihito Iwanaga, Inoue Koji, Many-core acceleration for model predictive control systems, 1st International Workshop on Many-Core Embedded Systems, MES 2013 - In Conjunction with the 40th Annual IEEE/ACM International Symposium on Computer Architecture, ISCA 2013, 10.1145/2489068.2489071, 17-24, 2013, This paper proposes a novel many-core execution strategy for real-time model predictive control. The key idea is to exploit predicted input values, which are produced by the model predictive control itself, to speculatively solve an optimal control problem. It is well known that control applications are not suitable for multi- or many-core processors, because feedback-loop systems inherently stand on sequential operations. Since the proposed scheme does not rely on conventional thread-/data-level parallelism, it can be easily applied to such control systems. An analytical evaluation using a real application demonstrates the potential of performance improvement achieved by the proposed speculative executions..
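The speculation idea in the abstract above can be sketched as follows (a toy scalar example under my own assumptions, not the paper's controller): the predicted input from one solve gives a predicted next state, so the next optimal-control problem can be solved ahead of time on a spare core and reused if the prediction turns out to be correct.

```python
# Hedged sketch of speculative execution for model predictive control (MPC).
# `solve_ocp` is a hypothetical stand-in for a real optimal-control solver,
# using a toy scalar plant x' = x + u and control law u = -0.5 * x.

def solve_ocp(state, horizon=3):
    """Toy optimal-control solve: drive a scalar state toward zero."""
    inputs, x = [], state
    for _ in range(horizon):
        u = -0.5 * x                 # stand-in for solving the real problem
        x = x + u                    # toy plant model
        inputs.append(u)
    return inputs

def speculative_step(state):
    """Solve now, then speculatively solve the *next* step from the prediction."""
    inputs = solve_ocp(state)
    predicted_next = state + inputs[0]        # model-predicted next state
    speculative = solve_ocp(predicted_next)   # precomputed on a spare core
    return inputs, predicted_next, speculative

def next_inputs(actual_next, predicted_next, speculative):
    # Reuse the precomputed solution only if the prediction matched the
    # measured state; otherwise fall back to a fresh solve.
    if abs(actual_next - predicted_next) < 1e-9:
        return speculative
    return solve_ocp(actual_next)
```

Because the speculative solve needs no thread- or data-level parallelism inside the solver, idle cores can be put to work even though the feedback loop itself is sequential.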
78. Farhad Mehdipour, Krishna C. Nunna, Koji Inoue, Kazuaki J. Murakami, A three-dimensional integrated accelerator, Proceedings - 15th Euromicro Conference on Digital System Design, DSD 2012, 10.1109/DSD.2012.15, 148-151, 2012.12, We propose a three-dimensional (3D) reconfigurable data-path accelerator which is capable of running partitioned large data flow graphs (DFGs) on the layers of a 3D stack, while inter-layer connections are implemented by means of through-silicon vias (TSVs). A tool for mapping data flow graphs has been developed, and a key 3D-specific problem, namely routing nets on the 3D architecture, is discussed in detail as well. Conducted experiments demonstrate a smaller footprint area and higher performance for the 3D accelerator compared with its 2D counterpart..
79. Hiroshi Sasaki, Koji Inoue, Teruo Tanimoto, Hiroshi Nakamura, Scalability-based manycore partitioning, 21st International Conference on Parallel Architectures and Compilation Techniques, PACT 2012, 10.1145/2370816.2370833, 107-116, 2012.10, Multicore processors have been popular for years, and the industry is gradually shifting towards the era of manycore processors. Single-thread performance of microprocessors is not growing at its historical rate, but the existence of a number of active processes in the computer system and the continuing development of multi-threaded applications benefit from the growing core counts to sustain system throughput. This trend brings us a situation where a number of parallel applications are simultaneously executed on a single system. Since multi-threaded applications try to maximize their throughput by utilizing the whole system, each of them usually creates a number of threads equal to or larger than the underlying logical core count. This introduces a much greater number of threads to be co-scheduled in the entire system. However, each program has different characteristics (or scalability) and contends with the others for shared resources, namely the CPU cores and memory hierarchies. Therefore, it is clear that OS thread scheduling will play a major role in achieving high system performance under such conditions. We develop a sophisticated scheduler that (1) dynamically predicts the scalability of programs via the use of hardware performance monitoring units, (2) decides the optimal number of cores to be allocated for each program, and (3) allocates the cores to programs while maximizing the system utilization to achieve fair and maximum performance. The evaluation results on a 48-core AMD Opteron system show improvements over the Linux scheduler for a variety of multiprogramming workloads..
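The core-allocation step described in the abstract above can be sketched with a simple greedy rule (my own simplification, not the authors' scheduler): given each program's predicted speedup curve, repeatedly hand the next core to whichever program gains the most from one more core.

```python
# Illustrative sketch of scalability-based core partitioning. The marginal-
# gain greedy rule is an assumption for illustration; the paper's scheduler
# additionally uses hardware performance counters at run time.

def partition_cores(speedup_curves, total_cores):
    """speedup_curves[p][n] = predicted speedup of program p on n cores."""
    alloc = [1] * len(speedup_curves)          # every program gets one core first
    remaining = total_cores - len(speedup_curves)
    for _ in range(remaining):
        def gain(p):
            n = alloc[p]
            curve = speedup_curves[p]
            if n + 1 >= len(curve):
                return 0.0                     # curve exhausted: no more benefit
            return curve[n + 1] - curve[n]     # marginal speedup of one extra core
        best = max(range(len(alloc)), key=gain)
        alloc[best] += 1
    return alloc
```

For concave speedup curves this maximizes total throughput: a scalable program soaks up the extra cores while a poorly scaling one keeps only what it can use.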
80. Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki Murakami, Improving performance and energy efficiency of embedded processors via post-fabrication instruction set customization, Journal of Supercomputing, 10.1007/s11227-010-0505-0, 60, 2, 196-222, 2012.05, Encapsulating critical computation subgraphs as application-specific instruction set extensions is an effective technique to enhance the performance and energy efficiency of embedded processors. However, the addition of custom functional units to the base processor is required to support the execution of custom instructions. Although automated tools have been developed to reduce the long design time needed to produce a new extensible processor for each application, short time-to-market and significant non-recurring engineering and design costs remain issues. To address these concerns, we introduce an adaptive extensible processor in which custom instructions are generated and added after chip fabrication. To support this feature, custom functional units (CFUs) are replaced by a reconfigurable functional unit (RFU). The proposed RFU is based on a matrix of functional units which is multi-cycle with the capability of conditional execution. To generate more effective custom instructions, they are extended over basic blocks; hence, multi-exit custom instructions and the intuition behind them are introduced. Conditional execution capability has been added to the RFU to support the multi-exit feature of custom instructions. Because the proposed RFU has limitations on hardware resources (i.e., connections and processing elements), an integrated mapping-temporal partitioning framework is proposed to guarantee that the generated custom instructions can be mapped on the RFU (mappable custom instructions). Experimental results show that multi-exit custom instructions enhance performance and energy efficiency by an average of 32% and 3%, respectively, compared to custom instructions limited to one basic block. A maximum speedup of 4.9 compared to a single-issue embedded processor, and an average speedup of 1.9, was achieved on the MiBench benchmark suite. The maximum and average energy savings are 56% and 22%, respectively. These performance and energy gains are obtained at the cost of a 30% area overhead..
82. Junya Kaida, Takuji Hieda, Ittetsu Taniguchi, Hiroyuki Tomiyama, Yuko Hara-Azumi, Koji Inoue, Task mapping techniques for embedded many-core SoCs, ISOCC 2012 - 2012 International SoC Design Conference, 10.1109/ISOCC.2012.6407075, 204-207, 2012, This paper proposes static task mapping techniques for embedded many-core SoCs. The proposed techniques take into account both the task and data parallelism of the tasks in order to efficiently utilize the potential parallelism of the many-core architecture. Two approaches are proposed for static mapping: one approach is based on integer linear programming and the other is based on a greedy algorithm. In addition, a static mapping technique considering dynamic task switching is proposed. Experimental results show the effectiveness of the proposed techniques..
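A minimal greedy mapping in the spirit of the second approach mentioned above (the details here are my assumption, not the paper's algorithm): place tasks longest-first, always on the currently least-loaded core, to balance the makespan across the SoC.

```python
# Illustrative greedy task-to-core mapping (longest processing time first).
# Function and variable names are hypothetical stand-ins.

def greedy_map(task_costs, num_cores):
    """task_costs: {task_name: estimated cost}. Returns (mapping, makespan)."""
    loads = [0.0] * num_cores
    mapping = {}
    # Longest tasks first: placing big tasks early keeps cores balanced.
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        core = min(range(num_cores), key=lambda c: loads[c])
        mapping[task] = core
        loads[core] += cost
    return mapping, max(loads)
```

The ILP approach in the paper would solve the same assignment exactly; this greedy rule trades optimality for speed, which is the usual motivation for offering both.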
83. Hideki Miwa, Ryutaro Susukita, Hidetomo Shibamura, Tomoya Hirao, Jun Maki, Makoto Yoshida, Takayuki Kando, Yuichiro Ajima, Ikuo Miyoshi, Toshiyuki Shimizu, Yuji Oinaga, Hisashige Ando, Yuichi Inadomi, Koji Inoue, Mutsumi Aoyagi, Kazuaki Murakami, NSIM: An interconnection network simulator for extreme-scale parallel computers, IEICE Transactions on Information and Systems, 10.1587/transinf.E94.D.2298, E94-D, 12, 2298-2308, 2011.12, In the near future, interconnection networks of massively parallel computer systems will connect more than a hundred thousand computing nodes. Performance evaluation of these interconnection networks can provide real insights to help the development of efficient communication libraries. Hence, to evaluate the performance of such interconnection networks, simulation tools that model the networks in sufficient detail, support a user-friendly interface for describing communication patterns, provide the users with enough performance information, and complete simulations within a reasonable time are a real necessity. This paper introduces a novel interconnection network simulator, NSIM, for evaluating the performance of extreme-scale interconnection networks. The simulator implements a simplified simulation model so as to run faster without any loss of accuracy. Unlike existing simulators, NSIM is built on the execution-driven simulation approach. The simulator also provides an MPI-compatible programming interface. Thus, the simulator can emulate parallel program execution and correctly simulate point-to-point and collective communications that are dynamically changed by network congestion. The experimental results in this paper show sufficient accuracy by comparing the simulator against a real machine. We also confirmed that the simulator is capable of evaluating ultra-large-scale interconnection networks, consumes a smaller memory area, and runs faster than an existing simulator. This paper also introduces a simulation service built on a cloud environment. Without installing NSIM, users can simulate interconnection networks with various configurations by using a web browser..
84. Takaaki Hanada, Hiroshi Sasaki, Koji Inoue, Kazuaki Murakami, Performance evaluation of 3D stacked multi-core processors with temperature consideration, 2011 IEEE International 3D Systems Integration Conference, 3DIC 2011, 10.1109/3DIC.2012.6263025, 2011.12, The 3D stacked multi-core processor is one of the applications of 3D integration technology. It achieves high-bandwidth access to the last-level cache and allows the number of cores to increase while maintaining the package area. However, 3D multi-core temperature rises with the number of stacked dies because of the escalating power density and thermal resistivity. Therefore, 3D multi-cores require lower clock frequencies to keep the temperature under a safe constraint, so performance is not always improved. In this paper, we evaluate the performance of 3D stacked multi-cores running under temperature constraints, and we show that there is a trade-off between clock frequency and parallel capability..
85. Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, Kazuaki Murakami, 3D implemented SRAM/DRAM hybrid cache architecture for high-performance and low power consumption, 54th IEEE International Midwest Symposium on Circuits and Systems, MWSCAS 2011, 10.1109/MWSCAS.2011.6026484, 2011.10, This paper introduces our research status focusing on 3D-implemented microprocessors. 3D-IC is one of the most interesting techniques to achieve high-performance, low-power VLSI systems. Stacking multiple dies makes it possible to implement microprocessor cores and large caches (or DRAM) in the same chip. Although this kind of integration has great potential to bring a breakthrough in computer systems, its efficiency strongly depends on the characteristics of target application programs. Unfortunately, applying die-stacking implementation causes performance degradation for some programs. To tackle this issue, we introduce a novel cache architecture consisting of a small but fast SRAM and a stacked large DRAM. The cache attempts to adapt to the varying behavior of application programs in order to compensate for the negative impact of the die stacking approach..
86. Naehyuck Chang, Hiroshi Nakamura, Kenichi Osada, Massimo Poncino, Koji Inoue, Message from the chairs, 17th IEEE/ACM International Symposium on Low Power Electronics and Design, ISLPED 2011 Proceedings of the International Symposium on Low Power Electronics and Design, 10.1109/ISLPED.2011.5993616, iii-iv, 2011.09.
87. Farhad Mehdipour, Hiroaki Honda, Koji Inoue, Hiroshi Kataoka, Kazuaki Murakami, A design scheme for a reconfigurable accelerator implemented by single-flux quantum circuits, Journal of Systems Architecture, 10.1016/j.sysarc.2010.07.009, 57, 1, 169-179, 2011.01, A large-scale reconfigurable data-path processor (LSRDP) implemented by single-flux quantum (SFQ) circuits is introduced, which is integrated with a general-purpose processor to accelerate data flow graphs (DFGs) extracted from scientific applications. A number of applications are discovered and analyzed throughout the LSRDP design procedure. Various design steps, and particularly the DFG mapping process, are discussed, and our techniques for optimizing the area of the accelerator are presented as well. Different design alternatives are examined through exploring the LSRDP design space, and an appropriate architecture is determined for the accelerator. Preliminary experiments demonstrate the capability of the designed architecture to achieve performance of up to 210 Gflops for the attempted applications..
88. Farhad Mehdipour, Krishna Chaitanya Nunna, Lovic Gauthier, Koji Inoue, Kazuaki Murakami, A thermal-aware mapping algorithm for reducing peak temperature of an accelerator deployed in a 3D stack, 2011 IEEE International 3D Systems Integration Conference, 3DIC 2011, 10.1109/3DIC.2012.6263034, 2011, Thermal management is one of the main concerns in three-dimensional integration due to the difficulty of dissipating heat through the stack of the integrated circuit. In a 3D stack comprising a data-path accelerator, a base processor and memory components, peak temperature reduction is targeted in this paper. A mapping algorithm has been devised to distribute operations of data flow graphs evenly over the processing elements of the target accelerator in two steps: thermal-aware partitioning of input data flow graphs, and thermal-aware mapping of the partitions onto the processing elements. The efficiency of the proposed technique in reducing peak temperature is demonstrated throughout the experiments..
89. Farhad Mehdipour, Hiroaki Honda, Koji Inoue, Kazuaki Murakami, Hardware and software requirements for implementing a high-performance superconductivity circuits-based accelerator, Proceedings of the 3rd Asia Symposium on Quality Electronic Design, ASQED 2011, 10.1109/ASQED.2011.6111751, 229-235, 2011, The single-flux-quantum-based large-scale reconfigurable data-path processor (SFQ-LSRDP) is a reconfigurable computing system implemented by means of superconductivity circuits. The SFQ-LSRDP is capable of accelerating data flow graphs (DFGs) extracted from scientific applications. Using an alternative technology instead of CMOS circuits for implementing such hardware entails considering particular constraints and conditions from the architecture and tool development perspectives. In this paper, we introduce the hardware specifications of the LSRDP and the tool chain developed for implementing applications. Placing and routing data flow graphs is fundamental to developing applications on the SFQ-LSRDP. Algorithms for placing DFG operations and routing the nets corresponding to the edges of data flow graphs are discussed in more detail. These algorithms have been applied to a number of data flow graphs, and the results demonstrate their efficiency. Further, simulation results demonstrate remarkable performance numbers in the range of hundreds of Gflops for the proposed architecture..
90. Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki Murakami, Routing architecture and algorithms for a superconductivity circuits-based computing hardware, 2011 Canadian Conference on Electrical and Computer Engineering, CCECE 2011, 10.1109/CCECE.2011.6030605, 977-980, 2011, Dedicated tools for placing and routing data flow graphs extracted from computation-intensive applications are basic requirements for developing applications on a large-scale reconfigurable data-path processor (LSRDP) implemented by superconductivity circuits. Using an alternative technology instead of CMOS circuits for implementing such hardware entails considering particular constraints and conditions from the architecture and tools development perspectives. The main contribution of this work is to introduce an operand routing network (ORN) architecture as well as algorithms for routing the nets corresponding to the edges of the data flow graphs. Further, a micro-routing algorithm is proposed for routing and configuring the ORNs internally. These algorithms have been applied on a number of data flow graphs from target applications and the results demonstrate their efficacy..
91. Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Irina Kataeva, Kazuaki Murakami, Hiroyuki Akaike, Akira Fujimaki, Mapping scientific applications on a large-scale data-path accelerator implemented by Single-Flux Quantum (SFQ) circuits, DATE 10 - Design, Automation and Test in Europe, 10.1109/date.2010.5456902, 993-996, 2010, To overcome issues originating from the CMOS technology, a large-scale reconfigurable data-path (LSRDP) processor based on single-flux quantum circuits is introduced. The LSRDP is attached to a general-purpose processor to accelerate the execution of data flow graphs (DFGs) extracted from scientific applications. The procedure of mapping large DFGs onto the LSRDP is discussed, and our proposed techniques for reducing the area of the accelerator within the design procedure are introduced as well..
92. Inoue Koji, Hamid Noori, Farhad Mehdipour, Takaaki Hanada, Kazuaki Murakami, ALU-array based reconfigurable accelerator for energy efficient executions, 2009 International SoC Design Conference, ISOCC 2009, 10.1109/SOCDC.2009.5423898, 157-160, 2009.12, This paper introduces an energy-efficient acceleration technique for embedded microprocessors. By means of supporting an ALU-array based coarse-grain reconfigurable functional unit, well-customized special instructions are identified and executed for each application program. Since the reconfigurable functional unit can execute several dependent instructions (a sequence of instructions) simultaneously, the performance of the base microprocessor can be dramatically improved. In addition, this kind of direct execution is very energy efficient because it reduces the activation counts of hardware components such as the instruction cache, branch predictor, and register file..
93. Takatsugu Ono, Inoue Koji, Kazuaki Murakami, Adaptive cache-line size management on 3D integrated microprocessors, 2009 International SoC Design Conference, ISOCC 2009, 10.1109/SOCDC.2009.5423920, 472-475, 2009.12, The memory bandwidth can be dramatically improved by stacking the main memory (DRAM) on processor cores and connecting them by wide on-chip buses composed of through-silicon vias (TSVs). 3D stacking makes it possible to reduce the cache miss penalty because a large amount of data can be transferred from the main memory to the cache at a time. If a large cache line size is employed, we can expect a prefetching effect. However, it might worsen system performance if programs do not have enough spatial locality of memory references. To solve this problem, we introduce a software-controllable variable line-size cache scheme. In this paper, we apply it to an L1 data cache with a 3D stacked DRAM organization. In our evaluation, it is observed that our approach reduces the L1 data cache and stacked DRAM energy consumption by up to 75%, compared to a conventional cache..
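The variable line-size idea can be sketched in a few lines (an illustrative assumption, not the paper's exact policy): software picks a fetch size per program phase from profiled miss ratios, and a miss then transfers a line of exactly that size from the stacked DRAM.

```python
# Hedged sketch of a software-controllable variable line-size (VLS) cache
# policy. The profile-driven selection rule below is my own illustrative
# choice; the paper inserts line-size change instructions per function.

def pick_line_size(profile):
    """profile: {line_size_bytes: miss_ratio} gathered for a program phase."""
    return min(profile, key=profile.get)      # the size with the lowest miss ratio

def fetch_on_miss(addr, line_size):
    """Return the [start, end) address range transferred from DRAM on a miss."""
    base = addr - (addr % line_size)          # align to the current line size
    return base, base + line_size
```

A wide line acts as a prefetch when spatial locality is high; when it is low, a narrow line avoids activating DRAM banks for data that will never be used, which is where the energy saving comes from.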
94. Naoto Fukumoto, Kenichi Imazato, Inoue Koji, Kazuaki Murakami, Performance balancing: Software-based on-chip memory management for effective CMP executions, Proceedings of the 10th MEDEA Workshop on MEmory Performance: DEaling with Applications, Systems and Architecture, MEDEA '09, held in conjunction with the PACT 2009 Conference, 10.1145/1621960.1621966, 28-34, 2009.12, This paper proposes the concept of performance balancing, and reports its performance impact on a chip multiprocessor (CMP). Integrating multiple processor cores into a single chip, i.e. a CMP, can achieve higher peak performance by exploiting thread-level parallelism. However, the off-chip memory bandwidth, which does not scale with the number of cores, tends to limit the potential of CMPs. To solve this issue, the technique proposed in this paper attempts to strike a good balance between computation and memory performance. Unlike conventional parallel executions, this approach exploits some cores to improve memory performance: these cores devote their on-chip memory hardware resources to the remaining cores executing the parallelized threads. In our evaluation, it is observed that our approach can achieve a 31% performance improvement compared to a conventional parallel execution model for the evaluated program..
95. Farhad Mehdipour, Hamid Noori, Inoue Koji, Kazuaki Murakami, Rapid design space exploration of a reconfigurable instruction-set processor, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 10.1587/transfun.E92.A.3182, E92-A, 12, 3182-3192, 2009.12, The multitude of parameters in the design process of a reconfigurable instruction-set processor (RISP) may lead to a large design space and remarkable complexity. The quantitative design approach uses data collected from applications to satisfy design constraints and optimize design goals while considering the applications' characteristics; however, it highly depends on designer observations and analyses. Exploring the design space can be considered an effective technique to find a proper balance among various design parameters. Indeed, this approach would be computationally expensive if the performance evaluation of design points were accomplished with the synthesis-and-simulation technique. A combined analytical and simulation-based model (CAnSO) is proposed and validated for performance evaluation of a typical RISP. The proposed model consists of an analytical core that incorporates statistics collected from cycle-accurate simulation to make a reasonable evaluation and provide valuable insight. CAnSO has clear speed advantages and can therefore be used to ease the cumbersome design space exploration of a RISP and quickly evaluate the performance of slightly modified architectures..
96. Irina Kataeva, Hiroyuki Akaike, Akira Fujimaki, Nobuyuki Yoshikawa, Naofumi Takagi, Koji Inoue, Hiroaki Honda, Kazuaki Murakami, An operand routing network for an SFQ reconfigurable Data-Paths processor, IEEE Transactions on Applied Superconductivity, 10.1109/TASC.2009.2018534, 19, 3, 665-669, 2009.06, We report progress in the development of an Operand Routing Network (ORN) for an SFQ Reconfigurable Data-Paths processor (SFQ-RDP). The SFQ-RDP is implemented as a two-dimensional array of floating-point units (FPUs), the outputs of which can be connected to the inputs of one or more FPUs in the next row via the ORN. We have considered two architectures for the ORN: one based on NDRO switches and the other on crossbar switches. The comparison shows that the crossbar-based ORN has better performance due to its regular pipelined structure. We have designed a crossbar switch with a multicasting function and a 1-to-2 ORN prototype for a 2.5 kA/cm2 process. The circuits have been experimentally tested at frequencies up to 36 GHz..
97. Takatsugu Ono, Koji Inoue, Kazuaki Murakami, Kenji Yoshida, Reducing On-Chip DRAM energy via data transfer size optimization, IEICE Transactions on Electronics, 10.1587/transele.E92.C.433, E92-C, 4, 433-443, 2009.01, This paper proposes a software-controllable variable line-size (SC-VLS) cache architecture for low-power embedded systems. High bandwidth between logic and a DRAM is realized by means of advanced integration technology. System-in-Silicon is one of the architectural frameworks to realize this high bandwidth: an ASIC and a specific DRAM are mounted onto a silicon interposer, and each chip is connected to the silicon interposer by eutectic solder bumps. In this framework, it is important to reduce the DRAM energy consumption. The specific DRAM needs a small cache memory to improve performance, and we exploit this cache to reduce the DRAM energy consumption. During application program execution, the cache line size that produces the lowest cache miss ratio varies because the amount of spatial locality of memory references changes. If we employ a large cache line size, we can expect a prefetching effect; however, the DRAM energy consumption is larger than with a small line size because a huge number of banks are accessed. The SC-VLS cache is able to change the line size to an adequate one at runtime with small area and power overheads. We analyze the adequate line size and insert line-size change instructions at the beginning of each function of a target program before executing the program. In our evaluation, it is observed that the SC-VLS cache reduces the DRAM energy consumption by up to 88%, compared to a conventional cache with fixed 256 B lines..
98. Farhad Mehdipour, Hamid Noori, Bahman Javadi, Hiroaki Honda, Koji Inoue, Kazuaki Murakami, A combined analytical and simulation-based model for performance evaluation of a reconfigurable instruction set processor, Proceedings of the ASP-DAC 2009 Asia and South Pacific Design Automation Conference, 10.1109/ASPDAC.2009.4796540, 564-569, 2009, Performance evaluation is a serious challenge in designing or optimizing reconfigurable instruction set processors. Conventional approaches based on synthesis and simulation are very time consuming and need considerable design effort. A combined analytical and simulation-based model (CAnSO) is proposed and validated for performance evaluation of a typical reconfigurable instruction set processor. The proposed model consists of an analytical core that incorporates statistics gathered from cycle-accurate simulation to make a reasonable evaluation and provide valuable insight. Compared to cycle-accurate simulation results, CAnSO shows only about 2% deviation in speedup measurements..
99. Koji Inoue, Koji Kai, Fumio Arakawa, Akihiko Inoue, Yoshio Hirose, Shorin Kyo, Keiji Kimura, Morihiro Kuga, Masaaki Kondo, Toshinori Sato, Makoto Satoh, Hiroyuki Tomiyama, Hiroshi Nakamura, Hiroo Hayashi, Masanori Hariyama, Hiroki Matsutani, Kunio Uchiyama, Foreword: Special section on hardware and software technologies on advanced microprocessors, IEICE Transactions on Electronics, 10.1587/transele.E92.C.1231, E92-C, 10, 2009.
100. Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami, Analyzing the impact of data prefetching on chip multiprocessors, Proceedings of the 13th IEEE Asia-Pacific Computer Systems Architecture Conference (ACSAC 2008), 10.1109/APCSAC.2008.4625454, 2008.11, Data prefetching is a well-known approach to compensating for poor memory performance and has been employed in commercial processor chips. Although a number of prefetching techniques have been proposed, in many cases they assume single-core architectures. In chip multiprocessor (CMP) chips, some resources, such as L2 caches and buses, are shared; therefore, the effect of prefetching on CMPs differs from that on traditional single-core processors. In this paper, we analyze the effect of prefetching on CMP performance. We first classify the impact of prefetches issued during program execution, and then quantitatively discuss the effect of prefetching on memory performance. The experimental results show that the negative effect of invalidating prefetched cache blocks is very small. In addition, we observe that current prefetch algorithms do not effectively exploit a key feature of CMPs, namely on-chip cache-to-cache data transfer.
101. Hamid Noori, Farhad Mehdipour, Kazuaki Murakami, Koji Inoue, Morteza Saheb Zamani, An architecture framework for an adaptive extensible processor, Journal of Supercomputing, 10.1007/s11227-008-0174-4, 45, 3, 313-340, 2008.09, To improve the performance of embedded processors, an effective technique is collapsing critical computation subgraphs as application-specific instruction set extensions and executing them on custom functional units. The problem with this approach is the immense cost and long time required to design a new processor for each application. As a solution, we propose an adaptive extensible processor in which custom instructions (CIs) are generated and added after chip fabrication. To support this feature, custom functional units are replaced by a reconfigurable matrix of functional units (FUs). A systematic quantitative approach is used to determine the appropriate structure of the reconfigurable functional unit (RFU). We also introduce an integrated framework for generating CIs that are mappable on the RFU. Using this architecture, performance is improved by up to 1.33, with an average improvement of 1.16, compared to a 4-issue in-order RISC processor. By partitioning the configuration memory, detecting similar/subset CIs, and merging small CIs, the size of the configuration memory is reduced by 40%.
102. Junpei Zushi, Gang Zeng, Hiroyuki Tomiyama, Hiroaki Takada, Koji Inoue, Improved policies for Drowsy caches in embedded processors, Proceedings of the 4th IEEE International Symposium on Electronic Design, Test and Applications (DELTA 2008), 10.1109/DELTA.2008.70, 362-367, 2008.09, In the design of embedded systems, especially battery-powered systems, it is important to reduce energy consumption. Caches are now used not only in general-purpose processors but also in embedded processors. As feature sizes shrink, leakage energy has come to account for a significant portion of total energy consumption. To reduce the leakage energy of caches, the Drowsy cache was proposed, in which cache lines are periodically moved to a low-leakage mode without loss of their contents. However, when a cache line in the low-leakage mode is accessed, one or more clock cycles are required to transition the line back to the normal mode before its contents can be accessed. These penalty cycles may significantly degrade cache performance, especially in embedded processors without out-of-order execution. In this paper, we propose four mode-transition policies that aim at high energy reduction with minimum performance degradation, and we compare them with existing policies in the context of embedded processors. Experimental results demonstrate the effectiveness of the proposed policies.
103. Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki Murakami, Improving energy efficiency of configurable caches via temperature-aware configuration selection, Proceedings of the IEEE Computer Society Annual Symposium on VLSI: Trends in VLSI Technology and Design (ISVLSI 2008), 10.1109/ISVLSI.2008.24, 363-368, 2008.09, Active power used to be the primary contributor to the total power dissipation of CMOS designs, but with technology scaling, the share of leakage in the total power consumption of digital systems continues to grow. Temperature is another factor that exponentially increases the leakage current. In this paper, we show the effect of temperature on the optimal (minimum-energy-consuming) cache configuration for low-energy embedded systems. Our results show that for a given application and technology, the optimal cache size moves toward smaller caches at higher temperatures due to the larger leakage. Using a Temperature-Aware Configurable Cache (TACC), up to 61% of energy can be saved for the instruction cache and 77% for the data cache compared to a configurable cache configured only for the corner-case temperature (100°C). The TACC also enhances performance by up to 28% and 17% for the instruction and data caches, respectively.
104. Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Koji Inoue, Kazuaki Murakami, Design space exploration for a coarse grain accelerator, Proceedings of the 2008 Asia and South Pacific Design Automation Conference (ASP-DAC 2008), 10.1109/ASPDAC.2008.4484039, 685-690, 2008.08, In the design process of a reconfigurable accelerator employed in an embedded system, the multitude of parameters can result in considerable complexity and a large design space. Design space exploration, as an alternative to a purely quantitative approach, can be employed to find the right balance between the different design parameters. In this paper, a hybrid approach is introduced to analytically explore the design space for a coarse grain accelerator and determine a sound design point by quantitatively exploiting data extracted from applications. It also provides the flexibility to take into account new design constraints as well as new characteristics of applications. Furthermore, this methodological approach reduces the design time and results in a point that satisfies the design goals.
105. Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki Murakami, A gravity-directed temporal partitioning approach, IEICE Electronics Express, 10.1587/elex.5.366, 5, 10, 366-373, 2008.05, Reconfiguration latency has a significant impact on system performance in reconfigurable systems. A temporal partitioning approach is introduced for partitioning data flow graphs for a reconfigurable system comprising partially programmable fine-grained hardware. Residing eligibility, inspired by the law of universal gravitation, is introduced to describe the eligibility of a node to stay in succeeding configurations (partitions) and to prevent it from being swapped in/out. Partitioning based on residing eligibility causes fewer nodes with different functionalities to be assigned to subsequent partitions. Thus, the reconfiguration overhead time and the unused hardware space decrease, owing to the common parts in consecutive configurations.
106. Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki Murakami, A reconfigurable functional unit with conditional execution for multi-exit custom instructions, IEICE Transactions on Electronics, 10.1093/ietele/e91-c.4.497, E91-C, 4, 497-508, 2008.04, Encapsulating critical computation subgraphs as application-specific instruction set extensions is an effective technique for enhancing the performance of embedded processors. However, the addition of custom functional units to the base processor is required to support the execution of these custom instructions. Although automated tools have been developed to reduce the long design time needed to produce a new extensible processor for each application, short time-to-market and significant non-recurring engineering and design costs remain issues. To address these concerns, we introduce an adaptive extensible processor in which custom instructions are generated and added after chip fabrication. To support this feature, custom functional units (CFUs) are replaced by a reconfigurable functional unit (RFU). The proposed RFU is based on a matrix of functional units; it is multi-cycle and capable of conditional execution. A quantitative approach is utilized to propose an efficient architecture for the RFU and fix its constraints. To generate more effective custom instructions, they are extended over basic blocks, and hence multi-exit custom instructions are proposed. Conditional execution has been added to the RFU to support the multi-exit feature of custom instructions. Experimental results show that multi-exit custom instructions enhance performance by an average of 67% compared to custom instructions limited to one basic block. A maximum speedup of 4.7, compared to a general embedded processor, and an average speedup of 1.85 were achieved on the MiBench benchmark suite.
107. Naofumi Takagi, Kazuaki Murakami, Akira Fujimaki, Nobuyuki Yoshikawa, Koji Inoue, Hiroaki Honda, Proposal of a desk-side supercomputer with reconfigurable data-paths using rapid single-flux-quantum circuits, IEICE Transactions on Electronics, 10.1093/ietele/e91-c.3.350, E91-C, 3, 350-355, 2008.03, We propose a desk-side supercomputer with large-scale reconfigurable data-paths (LSRDPs) using superconducting rapid single-flux-quantum (RSFQ) circuits. It has several computing units, each consisting of a general-purpose microprocessor, an LSRDP, and a memory. An LSRDP consists of a large number (e.g., a few thousand) of floating-point units (FPUs) and operand routing networks (ORNs) that connect the FPUs. We reconfigure the LSRDP to fit a computation, i.e., a group of floating-point operations that appears in a 'for' loop of numerical programs, by setting the routes in the ORNs before the execution of the loop. We propose to implement the LSRDPs with RSFQ circuits, while the processors and the memories can be implemented with semiconductor technology. We expect that a 10 TFLOPS supercomputer, together with its refrigerating engine, will be housed in a desk-side rack using a near-future RSFQ process technology such as a 0.35 µm process.
108. Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki Murakami, Temperature-aware configurable cache to reduce energy in embedded systems, IEICE Transactions on Electronics, 10.1093/ietele/e91-c.4.418, E91-C, 4, 418-431, 2008.01, Energy consumption is a major concern in embedded computing systems. Several studies have shown that cache memories account for 40% or more of the total energy consumed in these systems. Active power used to be the primary contributor to the total power dissipation of CMOS designs, but with technology scaling, the share of leakage in the total power consumption of digital systems continues to grow. Moreover, temperature is another factor that exponentially increases the leakage current. In this paper, we show the effect of temperature on the optimal (minimum-energy-consuming) cache configuration for low-energy embedded systems. Our results show that for a given application and technology, the optimal cache size moves toward smaller caches at higher temperatures due to the larger leakage. Consequently, a Temperature-Aware Configurable Cache (TACC) is an effective way to save energy in finer technologies when the embedded system is used at different temperatures. Our results show that using a TACC, up to 61% of energy can be saved for the instruction cache and 77% for the data cache compared to a configurable cache configured only for the corner-case temperature (100°C). Furthermore, the TACC also enhances performance by up to 28% for the instruction cache and up to 17% for the data cache.
109. Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki Murakami, Enhancing energy efficiency of processor-based embedded systems through post-fabrication ISA extension, Proceedings of the 13th ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED'08), 10.1145/1393921.1393987, 241-246, 2008, Application-specific instruction set extension is an effective technique for reducing accesses to components such as on- and off-chip memories and the register file, thereby enhancing energy efficiency. However, the addition of custom functional units to the base processor is required to support custom instructions, which is becoming an issue due to the increase of manufacturing and design costs in new nanometer-scale technologies and the shorter time-to-market. To address these issues, our proposed approach uses an optimized reconfigurable functional unit instead, and instruction set customization is done after chip fabrication. Therefore, while maintaining the flexibility of a conventional microprocessor, the low-energy benefit of customization still applies. Experimental results show maximum and average energy savings of 67% and 22%, respectively, for our proposed architecture framework.
110. Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki Murakami, Performance evaluation of a reconfigurable instruction set processor, Proceedings of the 2008 International SoC Design Conference (ISOCC 2008), 10.1109/SOCDC.2008.4815603, I184-I187, 2008, Performance evaluation is a serious challenge in designing and optimizing reconfigurable instruction set processors. A combined analytical and simulation-based model (CAnSO) is proposed and validated for performance evaluation of a typical reconfigurable instruction set processor. The proposed model consists of an analytical core that incorporates statistics gathered from cycle-accurate simulation to make a reasonable evaluation. CAnSO has clear speed advantages and, compared to cycle-accurate simulation, shows only about 2% variation in the speedup measurement.
111. Ryutaro Susukita, Yasunori Kimura, Hisashige Ando, Hidemi Komatsu, Mutsumi Aoyagi, Motoyoshi Kurokawa, Hiroaki Honda, Kazuaki J. Murakami, Yuichi Inadomi, Hidetomo Shibamura, Koji Inoue, Shuji Yamamura, Shigeru Ishizuki, Yunqing Yu, Performance prediction of large-scale parallel system and application using macro-level simulation, Proceedings of SC 2008 - International Conference for High Performance Computing, Networking, Storage and Analysis, 10.1109/SC.2008.5220091, 2008, Predicting application performance on an HPC system is an important technology for designing computing systems and developing applications. However, accurate prediction is a challenge, particularly for a future system with higher performance. In this paper, we present a new method for predicting application performance on HPC systems. The method combines modeling of sequential performance on a single processor with macro-level simulation of applications for parallel performance on the entire system. In the simulation, the execution flow is traced, but kernel computations are omitted to reduce the execution time. Validation on a real terascale system showed that the predicted and measured performance agreed within 10% to 20%. We employed the method in designing a hypothetical petascale system of 32768 SIMD-extended processor cores. For predicting application performance on the petascale system, the macro-level simulation required several hours.
112. Kazuaki J. Murakami, Feng Long Gu, Mutsumi Aoyagi, Takeshi Nanri, Koji Inoue, At the cutting edge of a petascale computing world: An overview of the Petascale System Interconnect project, 5th International Conference on Computational Methods in Science and Engineering (ICCMSE 2007), 10.1063/1.2827008, 23-38, 2007.12, This talk presents an overview of the Petascale System Interconnect (PSI) project. The PSI project is one of the national projects on "Fundamental Technologies for the Next Generation Supercomputing" of MEXT (Ministry of Education, Culture, Sports, Science and Technology), Japan. The goal of the PSI project is to develop technologies enabling petascale supercomputing systems with hundreds of thousands of computing nodes. The PSI project consists of three subprojects tackling three fundamental technologies: subproject 1 addresses small and efficient optical packet switches; subproject 2 addresses low-cost, high-performance MPI communications; and subproject 3 addresses methodologies for evaluating and estimating the performance of petascale systems. With the successful completion of the PSI project, the Japan Next-Generation Supercomputer R&D Center (NSC) will take the technologies to build Japan's next-generation supercomputer, which is expected to be over 70 times faster than the current fastest supercomputers.
113. Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Koji Inoue, Kazuaki Murakami, Improving performance and energy saving in a reconfigurable processor via accelerating control data flow graphs, IEICE Transactions on Information and Systems, 10.1093/ietisy/e90-d.12.1956, E90-D, 12, 1956-1966, 2007.12, Extracting frequently executed (hot) portions of an application and executing their corresponding data flow graphs (DFGs) on a hardware accelerator brings about speedup and energy saving for embedded systems comprising a base processor integrated with a tightly coupled accelerator. Extending DFGs to support control instructions and using control DFGs (CDFGs) instead of DFGs results in more of the application code being accelerated and hence more speedup and energy saving. In this paper, the motivations for extending DFGs to CDFGs and handling control instructions are introduced. In addition, basic requirements for an accelerator with conditional execution support are proposed. Then, two algorithms are presented for temporal partitioning of CDFGs considering the target accelerator's architectural constraints. To demonstrate the effectiveness of the proposed ideas, they are applied to the accelerator of a reconfigurable processor called AMBER. Experimental results confirm the effectiveness of covering control instructions and using CDFGs instead of DFGs in terms of performance and energy reduction.
114. S. Iwasaki, M. Tanaka, Y. Yamanashi, H. Park, H. Akaike, A. Fujimaki, N. Yoshikawa, N. Takagi, K. Murakami, H. Honda, K. Inoue, Design of a reconfigurable data-path prototype in the single-flux-quantum circuit, Superconductor Science and Technology, 10.1088/0953-2048/20/11/S06, 20, 11, S328-S331, 2007.11, We have designed a reconfigurable data-path (RDP) prototype based on the single-flux-quantum (SFQ) circuit. The RDP serves as an accelerator for a high-performance computer and is composed of many stages of arrays of floating-point processing units (FPUs) connected by reconfigurable operand routing networks (ORNs). The FPU array usually includes shift registers (SRs) so that data can be forwarded to the next stage without calculation. The data-path is reconfigured to reflect a long repeated instruction appearing in large-scale calculations. We can implement parallel and pipelined processing without memory access in such calculations, reducing the required bandwidth between memory and microprocessor. When the RDP is built from SFQ circuits, the high-speed SFQ network switches and bit-serial/slice FPUs reduce the circuit area and power consumption compared to semiconductor devices. As a first step in the development of the SFQ-RDP, we designed a 2 × 2 RDP prototype composed of double arrays of dual arithmetic logic units (ALUs). The prototype also has dual SRs in each array and four ORNs. We use bit-serial ALUs designed to operate at 25 GHz. Each ORN behaves like a 4 × 2 crossbar switch. We have demonstrated reconfiguration in the RDP prototype, made up of 15 050 Josephson junctions, although only some of the ALU functions are available.
115. Hamid Noori, Farhad Mehdipour, Kazuaki Murakami, Koji Inoue, Maziar Goudarzi, Generating and executing multi-exit custom instructions for an adaptive extensible processor, Proceedings of the 2007 Design, Automation and Test in Europe Conference and Exhibition (DATE 2007), 10.1109/DATE.2007.364612, 325-330, 2007, To improve the performance of embedded processors, an effective technique is collapsing critical computation subgraphs as application-specific instruction set extensions and executing them on custom functional units. The problems with this approach are the immense cost and long design time. To address these issues, we propose an adaptive extensible processor in which custom instructions (CIs) are generated and added after chip fabrication. To support this feature, custom functional units are replaced by a reconfigurable matrix of functional units with the capability of conditional execution. Unlike previously proposed CIs, ours can include multiple exits. Experimental results show that multi-exit CIs enhance performance by 46% on average compared to CIs limited to one basic block. A maximum speedup of 2.89, compared to a 4-issue in-order RISC processor, and an average speedup of 1.66 were achieved on the MiBench benchmark suite.
116. Hamid Noori, Farhad Mehdipour, Morteza Saheb Zamani, Koji Inoue, Kazuaki Murakami, Handling control data flow graphs for a tightly coupled reconfigurable accelerator, Embedded Software and Systems - Third International Conference (ICESS 2007) Proceedings, 10.1007/978-3-540-72685-2_24, 249-260, 2007, In an embedded system comprising a base processor integrated with a tightly coupled accelerator, extracting frequently executed portions of the code (hot portions) and executing their corresponding data flow graphs (DFGs) on the accelerator brings about speedup. In this paper, we present our motivations for handling control instructions in DFGs and extending them to control DFGs (CDFGs). In addition, basic requirements for an accelerator with conditional execution support are proposed. Moreover, some algorithms are presented for temporal partitioning of CDFGs considering the target accelerator's architectural specifications. To show the effectiveness of the proposed ideas, we applied them to the accelerator of an extensible processor called AMBER. Experimental results demonstrate the effectiveness of covering control instructions and using CDFGs instead of DFGs.
117. Hiroaki Honda, Tetsuo Hayashi, Yuichi Inadomi, Koji Inoue, Kazuaki J. Murakami, Implementation and evaluation of Fock matrix calculation program on the Cell processor, Proceedings of the International Conference on Computational Methods in Science and Engineering 2007 (ICCMSE 2007), 10.1063/1.2836167, 64-67, 2007, Various processor architectures have been proposed to date, and their performance has improved remarkably. Recently, chip multiprocessors (CMPs), which integrate many processor cores onto a chip, have been proposed for further performance improvement. The Cell processor is one such CMP and shows high computational performance. Although this processor is designed for multimedia, its high performance can be utilized for molecular orbital calculations. In this study, we implemented a Fock matrix construction program on the Cell processor and evaluated its computational performance. We found two main kinds of stalls, caused by branch prediction and data alignment, both of which are handled by software mechanisms to simplify the Cell processor hardware. It would be possible to improve performance by about 30% if the branch-prediction hit ratio were improved to 99%. As for the data-alignment stalls, the portion originating in the data-shuffle pipeline could be reduced by providing a hardware data-alignment mechanism.
118. Toshiya Takami, Jun Maki, Jun'ichi Ooba, Yuuichi Inadomi, Hiroaki Honda, Ryutaro Susukita, Koji Inoue, Taizo Kobayashi, Rie Nogita, Mutsumi Aoyagi, Multi-physics extension of OpenFMO framework, Proceedings of the International Conference on Computational Methods in Science and Engineering 2007 (ICCMSE 2007), 10.1063/1.2835969, 122-125, 2007, The OpenFMO framework, an open-source software (OSS) platform for the Fragment Molecular Orbital (FMO) method, is extended to multi-physics simulations (MPS). After reviewing several FMO implementations on distributed computing environments, the subsequent development plan for MPS is presented. We discuss which form should be selected for scientific software: a lightweight, reconfigurable form or a large, self-contained form.
119. Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki Murakami, The effect of temperature on cache size tuning for low energy embedded systems, Proceedings of the 2007 ACM Great Lakes Symposium on VLSI (GLSVLSI'07), 10.1145/1228784.1228891, 453-456, 2007, Energy consumption is a major concern in embedded computing systems. Several studies have shown that cache memories account for about 40% or more of the total energy consumed in these systems. In older technology nodes, active power was the primary contributor to the total power dissipation of a CMOS design. However, with the scaling of feature sizes, the share of leakage in the total power consumption of digital systems continues to grow. Temperature is a factor that exponentially increases the leakage current. In this paper, we show the effects of temperature on the selection of the optimal cache size for low-energy embedded systems. Our results show that for a given application, the optimal cache size selection is affected by temperature. Our experiments were done for 100 nm technology. Our study reveals that the cache size selection at different temperatures depends on the rate at which cache misses increase when reducing the cache size. When the miss rate increases sharply, the optimal point is the same for all examined temperatures; however, when the increase is more gradual, the optimal points for different temperatures begin to diverge.
120. Koji Inoue, Return address protection on cache memories, IEICE Transactions on Electronics, 10.1093/ietele/e89-c.12.1937, E89-C, 12, 1937-1947, 2006.12, The present paper proposes a novel cache architecture, called SCache, to detect buffer overflow attacks at run time, and evaluates the energy-security efficiency of the proposed architecture. On a return-address store, SCache generates one or more copies of the return address value and saves them as read-only in the cache area. The number of copies generated strongly affects both energy consumption and vulnerability. When the return address is loaded (or popped), the cache compares the value loaded from the memory stack with the corresponding copy existing in the cache. If they are not the same, then return-address corruption has occurred. In the present study, the proposed approach is shown to protect more than 99.5% of return-address loads from the threat of buffer overflow attacks, while increasing the total cache-energy consumption by, at worst, approximately 23%, compared to a well-known low-power cache. Furthermore, we explore the tradeoff between energy consumption and security, and our experimental results show that an energy-aware SCache model provides relatively higher security with only a 10% increase in energy consumption.
121. Hidetoshi Onodera, Makoto Ikeda, Tohru Ishihara, Tsuyoshi Isshiki, Koji Inoue, Kenichi Okada, Seiji Kajihara, Mineo Kaneko, Hiroshi Kawaguchi, Shinji Kimura, Morihiro Kuga, Atsushi Kurokawa, Takashi Sato, Toshiyuki Shibuya, Yoichi Shiraishi, Kazuyoshi Takagi, Atsushi Takahashi, Yoshinori Takeuchi, Nozomu Togawa, Hiroyuki Tomiyama, Yuichi Nakamura, Kiyoharu Hamaguchi, Yukiya Miura, Shin Ichi Minato, Ryuichi Yamaguchi, Masaaki Yamada, Yasushi Yuminaka, Takayuki Watanabe, Masanori Hashimoto, Masayuki Miyazaki, Special section on VLSI Design and CAD Algorithms, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 10.1093/ietfec/e89-a.12.3377, E89-A, 12, 3377, 2006.12.
122. Hamid Noori, Farhad Mehdipour, Kazuaki Murakami, Koji Inoue, Morteza Saheb Zamani, A reconfigurable functional unit for an adaptive dynamic extensible processor, Proceedings of the 2006 International Conference on Field Programmable Logic and Applications (FPL), 10.1109/FPL.2006.311313, 781-784, 2006, This paper presents a reconfigurable functional unit (RFU) for an adaptive dynamic extensible processor. The processor can tune its extended instructions to the target applications after chip fabrication. The custom instructions (CIs) are generated from hot basic blocks during the training mode; in the normal mode, CIs are executed on the RFU. A quantitative approach was used for designing the RFU, which is a matrix of functional units with 8 inputs and 6 outputs. Performance is enhanced by up to 1.25 using the proposed RFU for 22 applications of MiBench. This processor needs no extra opcodes for CIs, no new compiler, and no source code modification or recompilation.
123. Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Mehdi Sedighi, Koji Inoue, An integrated temporal partitioning and mapping framework for handling custom instructions on a reconfigurable functional unit, Advances in Computer Systems Architecture - 11th Asia-Pacific Conference (ACSAC 2006) Proceedings, 10.1007/11859802_18, 219-230, 2006, Extensible processors allow customization for an application by extending the core instruction set architecture. Extracting appropriate custom instructions is an important phase in implementing an application on an extensible processor with a reconfigurable functional unit (RFU). Custom instructions (CIs) are usually extracted from critical portions of applications. This paper presents approaches for CI generation that respect the RFU constraints in order to improve the speedup of the extensible processor. First, our proposed RFU architecture for an adaptive dynamic extensible processor called AMBER is described. Then, an integrated temporal partitioning and mapping framework is presented to partition and map the CIs on the RFU. In this framework, a mapping-aware temporal partitioning algorithm is used to generate CIs that are mappable on the RFU. Temporal partitioning iterates and modifies partitions incrementally to generate CIs. In addition, a mapping algorithm is presented that supports CIs with a critical path longer than the RFU depth.
124. Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Koji Inoue, Mehdi Sedighi, Custom instruction generation using temporal partitioning techniques for a reconfigurable functional unit, International Conference on Embedded and Ubiquitous Computing, EUC 2006 Embedded and Ubiquitous Computing - International Conference, EUC 2006, Proceedings, 10.1007/11802167_73, 722-731, 2006, Extracting appropriate custom instructions is an important phase of implementing an application on an extensible processor with a reconfigurable functional unit (RFU). Custom instructions (CIs) are usually extracted from critical portions of applications. It may not be possible to meet all of the RFU constraints when CIs are generated. This paper addresses the generation of mappable CIs on an RFU. First, our proposed RFU architecture for an adaptive dynamic extensible processor is described. Then, an integrated framework for temporal partitioning and mapping is presented to partition and map the CIs on the RFU. In this framework, two mapping-aware temporal partitioning algorithms are used to generate CIs. Temporal partitioning iterates and modifies partitions incrementally to generate CIs. Using this framework brings about greater speedup for the extensible processor..
125. Koji Inoue, Lock and unlock: A data management algorithm for a security-aware cache, ICECS 2006 - 13th IEEE International Conference on Electronics, Circuits and Systems, 10.1109/ICECS.2006.379629, 1093-1096, 2006, This paper proposes an efficient cache-line management algorithm for a security-aware cache architecture (SCache). SCache attempts to detect the corruption of return-address values at runtime. When a return-address store is executed, the cache generates a replica of the return address. This copied data is treated as read-only. Subsequently, when the corresponding return-address load is performed, the cache verifies the return-address value loaded from the memory stack by comparing it with the replica data. Unfortunately, since the replica data is also a candidate for cache-line replacement, SCache does not work well for application programs that cause higher cache miss rates. To resolve this issue, a lock-and-unlock data management algorithm is proposed to improve the security of SCache. The experimental results show that the proposed SCache model can protect about 99% of return-address loads from the threat of buffer-overflow attacks, while it worsens processor performance by only 1% compared with a non-secure conventional cache..
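The lock-and-unlock behavior summarized in the abstract above can be illustrated with a short model. This is an editor's sketch, not the paper's hardware: the replica store, its capacity, and the eviction policy are assumptions; only the lock-on-store, verify-and-unlock-on-load behavior follows the abstract.

```python
# Sketch: replicas of return addresses are locked in the cache so ordinary
# replacements cannot evict them, and each return-address load is checked
# against its replica to detect corruption of the memory stack.

class SCacheModel:
    def __init__(self, capacity=4):
        self.capacity = capacity          # number of replica slots (assumed)
        self.replicas = {}                # stack address -> (value, locked)

    def on_return_address_store(self, addr, value):
        """Create a locked, read-only replica of the stored return address."""
        if len(self.replicas) >= self.capacity:
            # Evict only an unlocked replica; locked ones are protected.
            victim = next((a for a, (_, lk) in self.replicas.items() if not lk), None)
            if victim is None:
                return                    # every slot locked: no replica made
            del self.replicas[victim]
        self.replicas[addr] = (value, True)   # lock on creation

    def on_return_address_load(self, addr, loaded_value):
        """Verify the loaded value; unlock (free) the replica after use."""
        if addr not in self.replicas:
            return None                   # no replica available: cannot verify
        value, _ = self.replicas.pop(addr)
        return loaded_value == value

# Usage: an overwrite of the stacked return address is caught on the load.
cache = SCacheModel()
cache.on_return_address_store(0x7ffc0010, 0x00401234)
assert cache.on_return_address_load(0x7ffc0010, 0x00401234) is True
cache.on_return_address_store(0x7ffc0020, 0x00405678)
assert cache.on_return_address_load(0x7ffc0020, 0xdeadbeef) is False  # corrupted
```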
126. Koji Inoue, Supporting a dynamic program signature: An intrusion detection framework for microprocessors, ICECS 2006 - 13th IEEE International Conference on Electronics, Circuits and Systems, 10.1109/ICECS.2006.379744, 160-163, 2006, To address computer security issues, a hardware-based intrusion detection technique is proposed. It uses dynamic program execution behavior for authentication. Based on secret key information, an execution behavior is determined. Next, a secure compiler constructs object code that generates the predetermined execution behavior at runtime. During program execution, a secure profiler monitors the execution behavior. If the profiler cannot detect the expected behavior, it sends an alarm signal to the microprocessor to terminate program execution. Since attack code cannot anticipate the required execution behavior, malicious attacks can be detected and prohibited at the start of program execution..
127. Hidekazu Tanaka, Koji Inoue, Adaptive mode control for low-power caches based on way-prediction accuracy, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 10.1093/ietfec/e88-a.12.3274, E88-A, 12, 3274-3281, 2005.12, This paper proposes a novel cache architecture for low power consumption, called the "Adaptive Way-Predicting Cache (AWP cache)." The AWP cache has multiple operation modes and dynamically adapts its operation mode based on the accuracy of way-prediction results. A confidence counter for way prediction is implemented for each cache set. To analyze the effectiveness of the AWP cache, we perform an SRAM design using 0.18 μm CMOS technology and cycle-accurate processor simulations. As a result, for one benchmark program (179.art), it is observed that a performance-aware AWP cache reduces the 49% performance overhead caused by an original way-predicting cache to 17%. Furthermore, an energy-aware AWP cache achieves a 73% energy reduction, whereas that obtained from the original way-predicting scheme is only 38%, compared to a non-optimized conventional cache. Considering energy-performance efficiency, we see that the energy-aware AWP cache produces better results; the energy-delay product of the conventional organization is reduced to only 35% on average, which is 6% better than the original way-predicting scheme..
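The per-set adaptation described in this abstract can be sketched as a saturating confidence counter that selects the operation mode. The counter width and thresholds below are illustrative assumptions, not the paper's exact parameters.

```python
# Sketch: each cache set keeps a saturating confidence counter for its way
# prediction; when confidence is high the set operates in the way-predicting
# (low-energy) mode, otherwise it falls back to conventional parallel access.

class AdaptiveSet:
    MAX, THRESHOLD = 3, 2        # 2-bit saturating counter (assumed)

    def __init__(self):
        self.confidence = self.MAX

    def mode(self):
        return "way-predict" if self.confidence >= self.THRESHOLD else "normal"

    def update(self, prediction_correct):
        """Adjust confidence after each access's prediction outcome."""
        if prediction_correct:
            self.confidence = min(self.MAX, self.confidence + 1)
        else:
            self.confidence = max(0, self.confidence - 1)

s = AdaptiveSet()
assert s.mode() == "way-predict"
s.update(False); s.update(False)     # two mispredictions in a row
assert s.mode() == "normal"          # set falls back to parallel access
s.update(True); s.update(True)
assert s.mode() == "way-predict"     # confidence recovered
```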
128. Vasily G. Moshnyaga, Koji Inoue, Low-power cache design, Low-Power Processors and Systems on Chips, 8-1-8-21, 2005.01, Cache memories are the most area- and energy-consuming units in today’s microprocessors. As the speed disparity between processor and external memory increases, designers try to put large multilevel caches on a chip to reduce the number of external memory accesses and thus boost the system performance. (See Table 8.1 for a survey of the on-die caches for several recent high-end microprocessors.) On-chip data and instruction caches are implemented using arrays of densely packed static RAM cells. The device count for the caches often exceeds the number of transistors devoted to the processor’s datapath and controller. For example, the Alpha21364 [3] and PA-RISC Maco [5] microprocessors have over 90% of their transistors in RAM, with most of them dedicated to caches; the Itanium2 [1] has 80% in caches, the IBM G5 [7] has 72%, the PowerPC [8] has 71%, and Strong-ARM110 [9] has 70%. Due to the large load capacitance and high access rate, these caches account for a significant portion of the overall power dissipation (e.g., 35% in Itanium2 [1]; 43% in Strong-ARM [9]). Therefore, optimizing caches for power is increasingly important. Although much work on energy reduction has taken place in the circuit and technology domains [10,11], interest in cache design for power efficiency at the architectural level continues to increase. Architecture is the entry point in the cache design hierarchy, and decisions taken at this level can drastically affect the efficiency of the design..
129. Shigeharu Matsusaka, Koji Inoue, A cost effective spatial redundancy with data-path partitioning, 3rd International Conference on Information Technology and Applications, ICITA 2005 Proceedings - 3rd International Conference on Information Technology and Applications, ICITA 2005, 10.1109/ICITA.2005.7, 51-56, 2005, To maintain the high reliability of a computer system, it is necessary to detect a failure before it leads to a fault. In general, faults can be detected by exploiting time redundancy or spatial redundancy. However, this negatively affects either hardware cost or processor performance. To solve this cost-performance issue, in this paper we propose a cost-effective approach to achieving spatial redundancy for dependable processors. In addition, we perform a preliminary evaluation of the impact of our method on processor performance..
130. Reiko Komiya, Koji Inoue, Vasily G. Moshnyaga, Kazuaki Murakami, Quantitative evaluation of state-preserving leakage reduction algorithm for L1 data caches, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 10.1093/ietfec/e88-a.4.862, E88-A, 4, 862-868, 2005, As transistor feature sizes and threshold voltages shrink, leakage energy consumption has become an inevitable issue for high-performance microprocessor designs. Since on-chip caches are major contributors to the leakage, a number of researchers have proposed efficient leakage reduction techniques. However, it is still not clear 1) what kinds of algorithms can be considered and 2) how much impact they have on energy and performance. To answer these questions, we explore runtime cache management algorithms and evaluate the energy-performance efficiency of several alternatives..
131. Vasily G. Moshnyaga, Inoue Koji, Low-power cache design, Low-Power Electronics Design, 25-1-25-21, 2004.01, Cache memories are the most area- and energy-consuming units in today’s microprocessors. As the speed disparity between processor and external memory increases, designers try to put large multilevel caches on a chip to reduce the number of external memory accesses and thus boost the system performance. (See Table 25.1 for a survey of the on-die caches for several recent high-end microprocessors.) On-chip data and instruction caches are implemented using arrays of densely packed static RAM cells. The device count for the caches often exceeds the number of transistors devoted to the processor’s datapath and controller. For example, the Alpha21364 [3] and PA-RISC Maco [5] microprocessors have over 90% of their transistors in RAM, with most of them dedicated to caches; the Itanium2 [1] has 80% in caches, the IBM G5 [7] has 72%, the PowerPC [8] has 71%, and Strong-ARM110 [9] has 70%. Due to the large load capacitance and high access rate, these caches account for a significant portion of the overall power dissipation (e.g., 35% in Itanium2 [1]; 43% in Strong-ARM [9]). Therefore, optimizing caches for power is increasingly important. Although much work on energy reduction has taken place in the circuit and technology domains [10,11], interest in cache design for power efficiency at the architectural level continues to increase. Architecture is the entry point in the cache design hierarchy, and decisions taken at this level can drastically affect the efficiency of the design..
132. Koji Inoue, Hidekazu Tanaka, Vasily G. Moshnyaga, Kazuaki Murakami, A low-power I-cache design with tag-comparison reuse, 2004 International Symposium on System-on-Chip Proceedings, 61-67, 2004, This paper reports design and evaluation results for a low-energy I-cache architecture, called the history-based tag-comparison (HBTC) cache. The HBTC cache attempts to reuse tag-comparison results to detect and eliminate unnecessary memory-array activations. We have performed cycle-accurate simulations and have designed an SRAM core based on a 0.18 μm CMOS technology. As a result, it has been observed that the HBTC approach can achieve a 60% energy reduction with only 0.3% performance degradation, compared to a conventional cache. Furthermore, we have also evaluated the potential of the HBTC cache by combining it with other low-energy techniques..
133. Kenichi Tanamachi, Inoue Koji, Vasily G. Moshnyaga, Designing a TCP/IP core for power consumption analysis, Proceedings of 2004 IEEE Asia-Pacific Conference on Advanced System Integrated Circuits, 412-413, 2004, The design of a low-power TCP/IP hard core for pervasive computing is discussed. To implement the TCP/IP operations in hardware, the TCP and IP functions were partitioned into four modules: port_ctr, data_ctr, window_ctr, and checksum. It was found that the data_ctr consumed about 22-30% of total power. Comparing the power consumption of the TCP core and the IP core showed that the power consumed by the TCP core was almost double that of the IP core..
134. Hiroshi Takamura, Koji Inoue, Vasily G. Moshnyaga, Reducing access count to register-files through operand reuse, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2823, 112-121, 2003.12, This paper proposes an approach for reducing the access count to register files based on operand data reuse. The key idea is to compare the source and destination operands of the current instruction with the corresponding operands of the previous instructions and, if they are the same, skip the register-file activation during the operand fetch, thus saving energy. Simulations show that this technique can decrease the total number of register-file accesses by up to 62% at peak and by 39% on average in comparison to a conventional approach, with only 3% processor area overhead..
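The operand-comparison idea in this abstract lends itself to a small counting sketch. This is an assumed simplification for illustration: real operand reuse involves bypass latches and pipeline timing, which the model below ignores.

```python
# Sketch: if a source register of the current instruction matches a source or
# destination register of the previous instruction, its value is already
# available from a bypass latch and the register-file read can be skipped.

def count_skippable_reads(instructions):
    """instructions: list of (dest, src1, src2) register numbers.
    Returns (skippable reads, total reads)."""
    skipped = total = 0
    prev = None
    for dest, src1, src2 in instructions:
        for src in (src1, src2):
            total += 1
            if prev is not None and src in prev:
                skipped += 1        # operand reusable from the bypass latch
        prev = {dest, src1, src2}   # operands visible to the next instruction
    return skipped, total

# r3 = r1 + r2 ; r4 = r3 + r1 ; r5 = r4 + r6
skipped, total = count_skippable_reads([(3, 1, 2), (4, 3, 1), (5, 4, 6)])
assert (skipped, total) == (3, 6)   # reads of r3, r1, r4 can be skipped
```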
135. Koji Inoue, Vasily G. Moshnyaga, Kazuaki Murakami, Instruction encoding for reducing power consumption of I-ROMs based on execution locality, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E86-A, 4, 799-805, 2003.04, In this paper, we propose an instruction encoding scheme to reduce the power consumption of instruction ROMs. The power consumption of an instruction ROM strongly depends on the switching activity of bit-lines due to their large load capacitance. In our approach, the binary patterns to be assigned as op-codes are determined based on the frequency of instructions in order to reduce the number of bit-line discharges. Simulation results show that our approach can reduce bit-line switching by 40% compared with a conventional organization..
136. Y. Nishida, Inoue Koji, V. G. Moshnyaga, A zero-value prediction technique for fast DCT computation, 2003 IEEE Workshop on Signal Processing Systems, SIPS 2003 2003 IEEE Workshop on Signal Processing Systems Design and Implementation, SIPS 2003, 10.1109/SIPS.2003.1235663, 2003-January, 165-170, 2003.01, The paper proposes a new computationally efficient technique for DCT operation. Unlike related research, the technique reduces the number of computations by predicting the effect of quantization on DCT and avoiding calculations of those DCT values which lead to zero elements in the block after quantization. Experimental evaluation on a number of video benchmarks shows that our method is able to reduce the total number of computations by 29% for DCT and by 59% for quantization while maintaining high image quality..
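The prediction idea in this abstract can be caricatured with a toy test: skip the AC portion of the DCT when the block is nearly flat. The energy proxy and threshold below are invented for illustration and are not the paper's predictor.

```python
# Toy sketch of zero-value prediction: if a block's AC energy is small
# relative to the quantizer step, all AC coefficients are predicted to
# quantize to zero, so their DCT computation can be skipped entirely and
# only the DC term produced.

def predict_all_zero_ac(block, qstep):
    """block: flat list of pixel values; True if the AC DCT is skippable."""
    mean = sum(block) / len(block)
    sad = sum(abs(x - mean) for x in block)   # crude proxy for AC energy
    return sad < qstep * len(block) / 2       # assumed conservative bound

flat = [128] * 64                             # perfectly flat 8x8 block
assert predict_all_zero_ac(flat, qstep=16) is True
edge = [0] * 32 + [255] * 32                  # strong edge: AC terms needed
assert predict_all_zero_ac(edge, qstep=16) is False
```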
137. Koji Inoue, Vasily Moshnyaga, Kazuaki Murakami, Dynamic tag-check omission: A low power instruction cache architecture exploiting execution footprints, 2nd International Workshop on Power-Aware Computer Systems, PACS 2002 Power-Aware Computer Systems - 2nd International Workshop, PACS 2002, Revised Papers, 10.1007/3-540-36612-1_2, 18-32, 2003, This paper proposes an architecture for low-power direct-mapped instruction caches, called the "history-based tag-comparison (HBTC) cache". The HBTC cache attempts to detect and omit unnecessary tag checks at run time. Execution footprints are recorded in an extended BTB (Branch Target Buffer) and are used to determine the cache residence of target instructions before starting a cache access. In our simulation, it is observed that our approach can reduce the total count of tag checks by 90%, resulting in a 15% cache-energy reduction with less than 0.5% performance degradation..
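The footprint mechanism shared by the HBTC entries above can be sketched functionally. This is a rough editor's model: the real scheme tracks footprints per BTB entry with more precise invalidation, whereas the sketch conservatively clears everything on any miss.

```python
# Sketch: the BTB records a "footprint" bit per branch target meaning "this
# target's cache lines were resident the last time this path executed"; while
# no cache refill has invalidated the footprints, tag checks can be omitted.

class HBTCModel:
    def __init__(self):
        self.footprints = {}     # branch target -> footprint valid bit

    def fetch(self, target, cache_hit):
        """Return True if the tag check was omitted for this fetch."""
        if self.footprints.get(target):
            return True                    # residence known: skip tag check
        # Normal access: perform the tag check; record a footprint on a hit.
        self.footprints[target] = cache_hit
        if not cache_hit:
            self.footprints.clear()        # refill invalidates all footprints
        return False

btb = HBTCModel()
assert btb.fetch(0x1000, cache_hit=True) is False   # first visit: tags checked
assert btb.fetch(0x1000, cache_hit=True) is True    # revisit: tag check omitted
btb.fetch(0x2000, cache_hit=False)                  # miss clears the footprints
assert btb.fetch(0x1000, cache_hit=True) is False   # must check tags again
```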
138. Inoue Koji, Vasily G. Moshnyaga, Kazuaki Murakami, Omitting cache look-up for high-performance, low-power microprocessors, IEICE Transactions on Electronics, E85-C, 2, 279-287, 2002.02, In this paper, we propose a novel architecture for low-power direct-mapped instruction caches, called "history-based tag-comparison (HBTC) cache". The cache attempts to reuse tag-comparison results for avoiding unnecessary tag checks. Execution footprints are recorded into an extended BTB (Branch Target Buffer). In our evaluation, it is observed that the energy for tag comparison can be reduced by more than 90% in many applications..
139. Inoue Koji, Vasily G. Moshnyaga, Kazuaki Murakami, Trends in high-performance, low-power cache memory architectures, IEICE Transactions on Electronics, E85-C, 2, 304-314, 2002.02, One uncompromising requirement of portable computing is energy efficiency, because it directly affects battery life. On the other hand, portable computing will target more demanding applications, such as moving pictures, so higher performance is still required. Cache memories have been employed as one of the most important components of computer systems. In this paper, we briefly survey architectural techniques for high-performance, low-power cache memories..
140. Koji Inoue, Vasily G. Moshnyaga, Kazuaki Murakami, A low energy set-associative I-cache with extended BTB, Proceedings-IEEE International Conference on Computer Design: VLSI in Computers and Processors, 10.1109/ICCD.2002.1106768, 187-192, 2002.01, This paper proposes a low-energy instruction-cache architecture, called history-based tag-comparison (HBTC) cache. The HBTC cache attempts to re-use tag-comparison results for avoiding unnecessary way activation in set-associative caches. The cache records tag-comparison results in an extended BTB, and re-uses them for directly selecting only the hit-way which includes the target instruction. In our simulation, it is observed that the HBTC cache can achieve 62% of energy reduction, with less than 1% performance degradation, compared with a conventional cache..
141. Jun Ni Ohban, V. G. Moshnyaga, K. Inoue, Multiplier energy reduction through bypassing of partial products, Asia-Pacific Conference on Circuits and Systems, APCCAS 2002 Proceedings - APCCAS 2002 Asia-Pacific Conference on Circuits and Systems, 10.1109/APCCAS.2002.1115097, 13-17, 2002.01, The design of portable battery operated multimedia devices requires energy-efficient multiplication circuits. This paper presents a novel approach to reduce power consumption of digital multiplier based on dynamic bypassing of partial products. The bypassing elements incorporated into the multiplier hardware eliminate redundant signal transitions, which appear within the carry-save adders when the partial product is zero. Simulations on the real-life DCT data show that the proposed approach can improve power saving of related methods by 12%, while jointly with them, it reduces the power consumption of a 16x16 digital CMOS multiplier by 31%, with 25% area overhead and less than 4% performance degradation in the worst case. The circuit implementation is outlined..
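The bypassing opportunity described in this abstract comes from zero partial products, which a short count makes concrete. This is an illustrative estimate, not the paper's circuit: it only counts the adder rows that a bypassing multiplier could skip.

```python
# Sketch: in an array multiplier, each '0' bit of the multiplier produces an
# all-zero partial product, so the corresponding carry-save adder row only
# performs redundant signal transitions. Counting those rows estimates how
# much work dynamic bypassing can eliminate for a given operand.

def zero_partial_products(multiplier, width=16):
    """Number of adder rows whose partial product is zero (bypassable)."""
    return sum(1 for i in range(width) if not (multiplier >> i) & 1)

assert zero_partial_products(0x0000) == 16   # every row bypassable
assert zero_partial_products(0xFFFF) == 0    # no bypassing possible
assert zero_partial_products(0x00FF) == 8    # half the rows bypassable
```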
142. Hiroshi Takamura, Koji Inoue, Vasily G. Moshnyaga, Register file energy reduction by operand data reuse, 12th International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS 2002 Integrated Circuit Design Power and Timing Modeling, Optimization and Simulation - 12th International Workshop, PATMOS 2002, Proceedings, 10.1007/3-540-45716-x_28, 278-288, 2002.01, This paper presents an experimental study of register file utilization in a conventional RISC-type datapath architecture to determine the benefits that can be achieved by eliminating unnecessary register file reads and writes. Our analysis shows that operand bypassing, enhanced for operand reuse, can discard up to 65% of register file accesses at peak and 39% on average for the tested benchmark programs..
143. K. Inoue, V. G. Moshnyaga, K. Murakami, Reducing power consumption of instruction ROMs by exploiting instruction frequency, Asia-Pacific Conference on Circuits and Systems, APCCAS 2002 Proceedings - APCCAS 2002 Asia-Pacific Conference on Circuits and Systems, 10.1109/APCCAS.2002.1115094, 1-6, 2002, This paper proposes a new approach to reducing the power consumption of instruction ROMs for embedded systems. The power consumption of instruction ROMs strongly depends on the switching activity of bit-lines. If a read bit-value indicates '0', the precharged bitline is discharged. In this scenario, a bit-line switching takes place and consumes power. Otherwise, the precharged bit-line level is maintained until the next access, thus no bit-line switching occurs. In our approach, the binary-patterns to be assigned to op-codes are determined based on the frequency of instructions for reducing the bit-line switching activity. Application programs are analyzed in advance, and then binary-patterns including many '1's' are assigned to the most frequently referenced instructions. In our evaluation, it is observed that the proposed approach can reduce bit-line switching by 40%..
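The frequency-based assignment in this abstract (entries 135 and 143 describe the same scheme) can be sketched directly: with precharged bit-lines, every '0' bit in a fetched opcode discharges a line, so the codes with the most '1' bits go to the most frequently executed instructions. The instruction profile below is invented for illustration.

```python
# Sketch: assign op-code bit patterns so that frequent instructions get codes
# with many '1's, minimizing the expected number of bit-line discharges.

def assign_opcodes(freq, width=8):
    """Map instructions (most frequent first) to codes with the most 1-bits."""
    codes = sorted(range(2 ** width), key=lambda c: -bin(c).count("1"))
    ranked = sorted(freq, key=freq.get, reverse=True)
    return {insn: codes[i] for i, insn in enumerate(ranked)}

def expected_discharges(freq, encoding, width=8):
    """Average number of '0' bits (bit-line discharges) per fetched opcode."""
    total = sum(freq.values())
    return sum(freq[i] * (width - bin(encoding[i]).count("1"))
               for i in freq) / total

profile = {"add": 500, "load": 300, "store": 150, "jump": 50}  # assumed counts
enc = assign_opcodes(profile)
assert enc["add"] == 0xFF                  # most frequent gets the all-ones code
assert expected_discharges(profile, enc) < 2.0
```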
144. Koji Inoue, Koji Kai, Kazuaki Murakami, Performance/energy efficiency of variable line-size caches for intelligent memory systems, 2nd International Workshop on Intelligent Memory Systems, IMS 2000 Intelligent Memory Systems - 2nd International Workshop, IMS 2000, Revised Papers, 10.1007/3-540-44570-6_13, 169-178, 2001.
145. Inoue Koji, Koji Kai, Kazuaki Murakami, A high-performance/low-power on-chip memory-path architecture with variable cache-line size, IEICE Transactions on Electronics, E83-C, 11, 1716-1722, 2000, This paper proposes an on-chip memory-path architecture employing the dynamically variable line-size (D-VLS) cache for high performance and low energy consumption. The D-VLS cache exploits the high on-chip memory bandwidth attainable on merged DRAM/logic LSIs by replacing a whole large cache line in one cycle. At the same time, it attempts to avoid frequent evictions by decreasing the cache-line size when programs have poor spatial locality. Activating only on-chip DRAM subarrays corresponding to a replaced cache-line size produces a significant energy reduction. In our simulation, it is observed that our proposed on-chip memory-path architecture, which employs a direct-mapped D-VLS cache, improves the ED (Energy Delay) product by more than 75% over a conventional memory-path model..
146. Koji Inoue, Koji Kai, Kazuaki Murakami, Dynamically variable line-size cache architecture for merged DRAM/Logic LSIs, IEICE Transactions on Information and Systems, E83-D, 5, 1048-1057, 2000, This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, called the dynamically variable line-size cache (D-VLS cache). The D-VLS cache can optimize its line size according to the characteristics of programs, and attempts to improve performance by appropriately exploiting the high on-chip memory bandwidth of merged DRAM/logic LSIs. In our evaluation, it is observed that the average memory-access time improvement achieved by a direct-mapped D-VLS cache is about 20% compared to a conventional direct-mapped cache with fixed 32-byte lines. This performance improvement is better than that of a double-size conventional direct-mapped cache..
147. Koji Hashimoto, Hiroto Tomita, Inoue Koji, Katsuhiko Metsugi, Kazuaki Murakami, Shinjiro Inabata, So Yamada, Nobuaki Miyakawa, Hajime Takashima, Kunihiro Kitamura, Shigeru Obara, Takashi Amisaki, Kazutoshi Tanabe, Umpei Nagashima, MOE: A special-purpose parallel computer for high-speed, large-scale molecular orbital calculation, 1999 ACM/IEEE Conference on Supercomputing, SC 1999, 10.1109/SC.1999.10000, 1999.01, We are constructing a high-performance, special-purpose parallel machine for ab initio Molecular Orbital calculations, called MOE (Molecular Orbital calculation Engine). The sequential execution time is O(N^4), where N is the number of basis functions, and most of the time is spent on the calculation of electron repulsion integrals (ERIs). The calculation of ERIs has a great deal of parallelism of O(N^4), and MOE therefore tries to exploit that parallelism. This paper discusses the MOE architecture and examines important aspects of the architecture design required to calculate ERIs according to the "Obara method". We conclude that n-way parallelization is the most cost-effective, hence we designed the MOE prototype system with a host computer and many processing nodes. Each processing node includes a 76-bit floating-point MULTIPLY-and-ADD unit, internal memory, etc., and performs ERI computations efficiently. We estimate that the prototype system with 100 processing nodes can calculate the energy of proteins in a few days..
148. Koji Inoue, Koji Kai, Kazuaki Murakami, High bandwidth, variable line-size cache architecture for merged DRAM/Logic LSIs, IEICE Transactions on Electronics, E81-C, 9, 1438-1447, 1998, Merged DRAM/logic LSIs could provide high on-chip memory bandwidth by interconnecting logic portions and DRAM with wider on-chip buses. For merged DRAM/logic LSIs with a memory hierarchy including cache memory, we can exploit such high on-chip memory bandwidth by replacing a whole cache line (or cache block) at a time on cache misses. This approach tends to increase the cache-line size if we attempt to improve the attainable memory bandwidth. Larger cache lines, however, might worsen the system performance if programs running on the LSIs do not have enough spatial locality of references and cache misses frequently take place. This paper describes a novel cache architecture suitable for merged DRAM/logic LSIs, called the variable line-size cache or VLS cache, for resolving the above-mentioned dilemma. The VLS cache can make good use of the high on-chip memory bandwidth by means of larger cache lines and, at the same time, alleviate the negative effects of larger cache-line sizes by partitioning each large cache line into multiple sub-lines and allowing every sub-line to work as an independent cache line. The number of sub-lines involved when a cache replacement occurs can be determined depending on the characteristics of programs. This paper also evaluates the cost/performance improvements attainable by the VLS cache and compares them with those of conventional cache architectures. As a result, it is observed that a VLS cache reduces the average memory-access time by 16.4% while it increases the hardware cost by only 13%, compared to a conventional direct-mapped cache with fixed 32-byte lines..
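The sub-line mechanism common to the VLS and D-VLS entries above (144-148) can be sketched as an address calculation. The 32-byte sub-line and the example line sizes are illustrative; the papers evaluate several configurations.

```python
# Sketch: each large cache line is split into fixed-size sub-lines, and a
# miss fetches only as many adjacent sub-lines as the currently selected
# line size dictates, exploiting the wide on-chip DRAM bus when spatial
# locality is good and shrinking the transfer when it is poor.

SUBLINE = 32                       # bytes per sub-line (independent cache line)

def sublines_to_fetch(miss_addr, line_size):
    """Addresses of the sub-lines replaced on a miss at `miss_addr`."""
    assert line_size % SUBLINE == 0
    base = miss_addr - (miss_addr % line_size)   # align to the chosen line size
    return [base + i * SUBLINE for i in range(line_size // SUBLINE)]

# Good spatial locality: replace a whole 128-byte line in one wide transfer.
assert sublines_to_fetch(0x1050, 128) == [0x1000, 0x1020, 0x1040, 0x1060]
# Poor locality: shrink the effective line to a single 32-byte sub-line.
assert sublines_to_fetch(0x1050, 32) == [0x1040]
```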