Kyushu University Academic Staff Educational and Research Activities Database
List of Presentations
Inoue Koji Last modified date:2021.06.08

Professor / Advanced Information and Communication Technology / Department of Advanced Information Technology / Faculty of Information Science and Electrical Engineering


Presentations
1. Susumu Mashimo, Ryota Shioya, Koji Inoue, Energy Efficient Runahead Execution on a Tightly Coupled Heterogeneous Core, International Conference on High Performance Computing in Asia-Pacific Region, 2020.01.
2. Keitaro Oka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Enhancing a manycore-oriented compressed cache for GPGPU, International Conference on High Performance Computing in Asia-Pacific Region, 2020.01.
3. Koki Ishida, Masamitsu Tanaka, Ikki Nagaoka, Takatsugu Ono, Satoshi Kawakami, Teruo Tanimoto, Akira Fujimaki, Koji Inoue, 32 GHz 6.5 mW Gate-Level-Pipelined 4-bit Processor using Superconductor Single-Flux-Quantum Logic, 2020 Symposia on VLSI Technology and Circuits, 2020.06.
4. Teruo Tanimoto, Shuhei Matsuo, Satoshi Kawakami, Yutaka Tabuchi, Masao Hirokawa, and Koji Inoue, Practical error modeling toward realistic NISQ simulation, The First International Workshop on Quantum Computing: Circuits Systems Automation and Applications, 2020.07.
5. Teruo Tanimoto, Shuhei Matsuo, Satoshi Kawakami, Yutaka Tabuchi, Masao Hirokawa, and Koji Inoue, How many trials do we need for reliable NISQ computing?, The First International Workshop on Quantum Computing: Circuits Systems Automation and Applications, 2020.07.
6. Koki Ishida, Ilkwon Byun, Ikki Nagaoka, Kousuke Fukumitsu, Masamitsu Tanaka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Jangwoo Kim, and Koji Inoue, SuperNPU: An Extremely Fast Neural Processing Unit Using Superconducting Logic Devices, IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020.10, Superconductor single-flux-quantum (SFQ) logic family has been recognized as a highly promising solution for the post-Moore's era, thanks to its ultra-fast and low-power switching characteristics. Therefore, researchers have made a tremendous amount of effort in various aspects to promote the technology and automate its circuit design process (e.g., low-cost fabrication, design tool development). However, there has been no progress in designing a convincing SFQ-based architectural unit due to the architects' lack of understanding of the technology's potentials and limitations at the architecture level. In this paper, we present how to architect an SFQ-based architectural unit by providing design principles with an extreme-performance neural processing unit (NPU). To achieve the goal, we first implement an architecture-level simulator to model an SFQ-based NPU accurately. We validate this model using our die-level prototypes, design tools, and logic cell library. This simulator accurately measures the NPU's performance, power consumption, area, and cooling overheads. Next, driven by the modeling, we identify key architectural challenges for designing a performance-effective SFQ-based NPU (e.g., expensive on-chip data movements and buffering). Lastly, we present SuperNPU, our example SFQ-based NPU architecture, which effectively resolves the challenges. Our evaluation shows that the proposed design outperforms a conventional state-of-the-art NPU by 23 times. With free cooling provided as done in quantum computing, the performance per chip power increases up to 490 times. 
Our methodology can also be applied to other architecture designs with SFQ-friendly characteristics.
7. G Georgakoudis, N Jain, T Ono, K Inoue, S Miwa, A Bhatele, Evaluating the Impact of Energy Efficient Networks on HPC Workloads, 26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 2020.01.
8. Keitaro Oka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Enhancing a manycore-oriented compressed cache for GPGPU, International Conference on High Performance Computing in Asia-Pacific Region, 2020.01.
9. Susumu Mashimo, Ryota Shioya, Koji Inoue, Energy Efficient Runahead Execution on a Tightly Coupled Heterogeneous Core, International Conference on High Performance Computing in Asia-Pacific Region, 2020.01.
10. Susumu Mashimo, Akifumi Fujita, Reoma Matsuo, Seiya Akaki, Akifumi Fukuda, Toru Koizumi, Junichiro Kadomoto, Hidetsugu Irie, Masahiro Goshima, Koji Inoue, Ryota Shioya, An Open Source FPGA-Optimized Out-of-Order RISC-V Soft Processor, IEEE International Conference on Field Programmable Technology, 2019.12.
11. Ikki Nagaoka, Masamitsu Tanaka, Koji Inoue, Akira Fujimaki, A 48GHz 5.6mW gate-level-pipelined multiplier using single-flux quantum logic, IEEE International Solid-State Circuits Conference (ISSCC 2019), 2019.02.
12. Takatsugu Ono, Zhe Chen and Koji Inoue, Improving Lifetime in MLC Phase Change Memory using Slow Writes, International Japan-Africa Conference on Electronics, Communication and Computations, 2018.12.
13. Yusuke Inoue, Takatsugu Ono and Koji Inoue, Situation-Based Dynamic Frame-Rate Control for On-Line Object Tracking, International Japan-Africa Conference on Electronics, Communication and Computations, 2018.12.
14. Masamitsu Tanaka, Yuki Hatanaka, Yuichi Matsui, Ikki Nagaoka, Koki Ishida, Kyosuke Sano, Taro Yamashita, Takatsugu Ono, Koji Inoue, Akira Fujimaki, 30-GHz Operation of Datapath for Bit-Parallel, Gate-Level-Pipelined Rapid Single-Flux-Quantum Microprocessors, Applied Superconductivity Conference, 2018.10.
15. Omar M. Saad, K. Inoue, Ahmed Shalaby, Lotf Samy, and Mohammed S. Sayed, Autoencoder based Features Extraction for Automatic Classification of Earthquakes and Explosions, the 17th IEEE/ACIS International Conference on Computer and Information Science, 2018.06.
16. Ryuichi Sakamoto, Tapasya Patki, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Daniel Ellsworth, Barry Rountree, Martin Schulz, Analyzing Resource Trade-offs in Hardware-overprovisioned Supercomputers, the 32nd International Parallel and Distributed Processing Symposium, 2018.05.
17. Mihiro Sonoyama, Takatsugu Ono, Osamu Muta, Haruichi Kanaya, Koji Inoue, Wireless Spoofing-Attack Prevention Using Radio-Propagation Characteristics, IEEE International Conference on Dependable, Autonomic and Secure Computing, 2017.11.
18. Teruo Tanimoto, Takatsugu Ono, Koji Inoue, CPCI Stack: Metric for Accurate Bottleneck Analysis on OoO Microprocessors, International Symposium on Computing and Networking, 2017.11.
19. Masamitsu Tanaka, Ryo Sato, Yuki Hatanaka, Yuichi Matsui, Hiroyuki Akaike, Akira Fujimaki, Koki Ishida, Takatsugu Ono, Koji Inoue, High-Throughput Bit-Parallel Arithmetic Logic Unit Using Rapid Single-Flux-Quantum Logic, International Superconductive Electronics Conference, 2017.06.
20. Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Tapasya Patki, Daniel Ellsworth, Barry Rountree, and Martin Schulz, Production Hardware Overprovisioning: Real-world Performance Optimization using an Extensible Power-aware Resource Management Framework, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2017), 2017.05.
21. Satoshi Imamura, Keitaro Oka, Yuichiro Yasui, Yuichi Inadomi, Katsuki Fujisawa, Toshio Endo, Koji Ueno, Keiichiro Fukazawa, Nozomi Hata, Yuta Kakibuka, Inoue Koji, Takatsugu Ono, Evaluating the Impacts of Code-Level Performance Tunings on Power Efficiency, IEEE International Conference on Big Data, 2016.12.
22. Satoshi Imamura, Yuichiro Yasui, Inoue Koji, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa, Power-Efficient Breadth-First Search with DRAM Row Buffer Locality-Aware Address Mapping, the 1st High Performance Graph Data Management and Processing workshop, 2016.11.
23. Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Inoue Koji, Single-Flux-Quantum Cache Memory Architecture, International SoC Design Conference, 2016.10.
24. Yuichi Inadomi, Tapasya Patki, Inoue Koji, Mutsumi Aoyagi, Barry Rountree, Martin Schulz, David Lowenthal, Yasutaka Wada, Keiichiro Fukazawa, Masatsugu Ueda, Masaaki Kondo, Ikuo Miyoshi, Analyzing and Mitigating the Impact of Manufacturing Variability in Power-Constrained Supercomputing, The International Conference for High Performance Computing, Networking, Storage and Analysis, 2015.11.
25. Takeshi Soga, Hiroshi Sasaki, Tomoya Hirao, Masaaki Kondo, Inoue Koji, A flexible hardware barrier mechanism for many-core processors, Asia and South Pacific Design Automation Conference, 2015.01.
26. Satoshi Imamura, Hiroshi Sasaki, Inoue Koji, Dimitrios S. Nikolopoulos, Power-capped DVFS and thread allocation with ANN models on modern NUMA systems, IEEE International Conference on Computer Design, 2014.10.
27. Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Inoue Koji, Masato Edahiro, Martin Peres, Power and Performance Characterization and Modeling of GPU-accelerated Systems, the 28th IEEE International Parallel & Distributed Processing Symposium, 2014.05.
28. Keiichiro Fukazawa, Tomonori Tsuhata, Kyohei Yoshida, Masakazu Kuze, Masatsugu Ueda, Yuichi Inadomi, Inoue Koji, Performance and Power Consumption Evaluation of MHD Simulation for Magnetosphere on Parallel Computer System with CPU Power Capping, Extreme Green & Energy Efficiency in Large Scale Distributed Systems, 2014.05.
29. Hiroshi Sasaki, Satoshi Imamura, Inoue Koji, Coordinated Power-Performance Optimization in Manycores, the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013.09.
30. Satoshi Kawakami, Akihito Iwanaga, Inoue Koji, Many-core Acceleration for Model Predictive Control Systems, Int’l Workshop on Manycore Embedded Systems, 2013.06.
31. Keitaro Oka, Hiroshi Sasaki, Inoue Koji, Line Sharing Cache: Exploring Cache Capacity with Frequent Line Value Locality, Asia and South Pacific Design Automation Conference, 2013.01.
32. Inoue Koji, SMYLE Project: Toward High-Performance, Low-Power Computing on Manycore-Processor SoCs, Asia and South Pacific Design Automation Conference (ASP-DAC), 2013.01.
33. Masaaki Kondo, Son Truong Nguyen, Takeshi Soga, Tomoya Hirao, Hiroshi Sasaki, Inoue Koji, SMYLEref: A Reference Architecture for Manycore-Processor SoCs, Asia and South Pacific Design Automation Conference (ASP-DAC), 2013.01.
34. Junya Kaida, Takuji Hieda, Ittetsu Taniguchi, Hiroyuki Tomiyama, Yuko Hara-Azumi, Inoue Koji, Task Mapping Techniques for Embedded Many-core SoCs, International SoC Design Conference, 2012.11.
35. Yuki Abe, Hiroshi Sasaki, Martin Peres, Inoue Koji, Kazuaki Murakami, Shinpei Kato, Power and Performance Analysis of GPU-Accelerated Systems, Workshop on Power-Aware Computing and Systems, 2012.10.
36. Farhad Mehdipour, Krishna Chaitanya Nunna, Inoue Koji, Kazuaki Murakami, A Three-Dimensional Integrated Accelerator, Euromicro Conference on Digital System Design, 2012.09.
37. Yuki Abe, Hiroshi Sasaki, Inoue Koji, Kazuaki Murakami, Shinpei Kato, On the Power and Performance Analysis of GPU-Accelerated Systems, Poster session, 2012 USENIX Annual Technical Conference, 2012.06.
38. Hiroaki Honda, Farhad Mehdipour, Hiroshi Kataoka, Inoue Koji, Kazuaki J. Murakami, Performance evaluations of finite difference applications realized on a single flux quantum circuits-based reconfigurable accelerator, Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2011, APSIPA ASC 2011, 2011.12, Hardware accelerators integrating to general purpose processors are increasingly employed to achieve lower power consumption and higher processing speed, however, energy consumption of high performance accelerators has become a great issue on large scale parallel computer system. We have investigated the applicability of Single-Flux-Quantum (SFQ) circuits as a part of superconductivity technology in high-performance computing systems. Although it is possible to develop extraordinary low power processor by SFQ devices, conditional branch and loop back controls are difficult to be implemented by current SFQ technology. Therefore, we have proposed Reconfigurable Data-Path (RDP) accelerator which is avoiding those limitations of SFQ technology, while trying to get benefits of these circuits. In this research, we have implemented two-dimensional Heat (2D-Heat) and Finite Difference Time Domain (2D-FDTD) applications for investigating efficiency of using SFQ-RDP accelerator. According to performance evaluation results for above applications, execution times are 50.6 and 79.0 times smaller than those of the general purpose processor, and comparable with ones reported for GPU (Graphics Processing Units).
39. This paper reports design and evaluation results of a low-energy I-cache architecture, called history-based tag-comparison (HBTC) cache. The HBTC cache attempts to re-use tag-comparison results to detect and eliminate unnecessary memory-array activations. We have performed cycle accurate simulations, and have designed an SRAM core based on a 0.18 um CMOS technology. As a result, it has been observed that the HBTC approach can achieve 60% of energy reduction, with only 0.3% performance degradation, compared to a conventional cache. Furthermore, we have also evaluated the potential of the HBTC cache by combining with other low-energy techniques.
40. In this paper, we propose a cache architecture, called SCache, to detect buffer-overflow attacks at run time. Furthermore, the energy-security efficiency of SCache is discussed. SCache generates replica cache lines on each return-address store, and compares the original value loaded from the memory stack to the replica one on the corresponding return-address load. The number and the placement policy of the replica line strongly affect both energy and vulnerability. In our evaluation, it is observed that SCache can protect more than 99.3% of return-address loads from buffer-overflow attacks, while it increases total cache energy consumption by about 23%, compared to a well-known low-power cache.
41. A number of techniques to reduce cache leakage have so far been proposed. However, it is not clear that 1) what kind of algorithm can be considered and 2) how much they have impact on energy and performance. To answer the questions, this paper classifies cache-leakage reduction techniques and evaluates their energy-performance efficiency. As a result, we have found that an approach employed by the Drowsy cache [?] achieves the best energy-performance efficiency with low complexity. Moreover, we investigate the potential of the approach on multi-thread program executions.
42. Vasily G. Moshnyaga, Koji Inoue, Mizuka Fukagawa, Reducing energy consumption of video memory by bit-width compression, Proceedings of the 2002 International Symposium on Low Power Electronics and Design, 2002.01, A new architectural technique to reduce energy dissipation of video memory is proposed. Unlike existing approaches, the technique exploits the pixel correlation in video sequences, dynamically adjusting the memory bit-width to the number of bits changed per pixel. Instead of treating the data bits independently, we group the most significant bits together, activating the corresponding group of bit-lines adaptively to data variation. The method is not restricted to the specific bit-patterns nor depends on the storage phase. It works equally well on read and write accesses, as well as during precharging. Simulation results show that using this method we can reduce the total energy consumption of video memory by 20% without affecting the picture quality.
43. Koji Inoue, V. G. Moshnyaga, K. Murakami, A history-based i-cache for low-energy multimedia applications, Proceedings of the 2002 International Symposium on Low Power Electronics and Design, 2002, This paper proposes a history-based tag-comparison scheme for reducing energy consumption of direct-mapped instruction caches. The proposed cache efficiently exploits program-execution footprints recorded in the Branch Target Buffer (BTB), and attempts to detect and eliminate unnecessary tag checks at run time. Simulation results show that our approach can eliminate up to 95% of tag checks, saving the cache energy by 17%, while affecting the processor performance by only 0.2%.
44. Inoue Koji, Koji Kai, Kazuaki Murakami, Dynamically variable line-size cache exploiting high on-chip memory bandwidth of merged DRAM/logic LSIs, Proceedings of the 1999 5th International Symposium on High-Performance Computer Architecture, HPCA, 1999.01, This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, which is called "dynamically variable line-size cache (D-VLS cache)". The D-VLS cache can optimize its line-size according to the characteristic of programs, and attempts to improve the performance by exploiting the high on-chip memory bandwidth. In our evaluation, it is observed that the performance improvement achieved by a direct-mapped D-VLS cache is about 27%, compared to a conventional direct-mapped cache with fixed 32-byte lines.
45. Koji Inoue, Tohru Ishihara, Kazuaki Murakami, Way-predicting set-associative cache for high performance and low energy consumption, Proceedings of the 1999 International Conference on Low Power Electronics and Design (ISLPED), 1999, This paper proposes a new approach using way prediction for achieving high performance and low energy consumption of set-associative caches. By accessing only a single cache way predicted, instead of accessing all the ways in a set, the energy consumption can be reduced. This paper shows that the way-predicting set-associative cache improves the ED (energy-delay) product by 60-70% compared to a conventional set-associative cache.