Faculty Profiles - KOJI INOUE

Information

KOJI INOUE

Organization

Faculty of Information Science and Electrical Engineering Department of Advanced Information Technology Professor
Center for Japan-Egypt Cooperation in Science and Technology(E-JUST Center) （Concurrent）
Research Institute for Information Technology （Concurrent）
System LSI Research Center （Concurrent）
Joint Graduate School of Mathematics for Innovation （Concurrent）
School of Sciences Department of Physics（Concurrent）
School of Engineering Department of Electrical Engineering and Computer Science（Concurrent）
Graduate School of Information Science and Electrical Engineering Department of Information Science and Technology（Concurrent）

Contact information

Tel

0928023793

Profile

In next-generation social infrastructure built on advanced information technology, microprocessor systems will permeate our daily lives, for example through electronic government, electronic money, and ubiquitous computing. To achieve a stable social environment, we explore architectural support for high-performance, low-energy, secure computing. We also design real VLSI chips to evaluate our ideas.

Homepage

https://sites.google.com/view/kojiinoue-en/

Research Areas

Manufacturing Technology (Mechanical Engineering, Electrical and Electronic Engineering, Chemical Engineering) / Control and system engineering

Degree

Engineering

Research History

Yokogawa Electric Corporation (April 1996 - December 1996)
Fukuoka University (April 2001 - August 2004)

Education

Kyushu Institute of Technology Graduate School, Division of Information Engineering

- 1996

Kyushu Institute of Technology Graduate School of Information Engineering, Information Science

- 1996

Country：Japan

Kyushu Institute of Technology Faculty of Computer Science and Systems Engineering

- 1994

Kyushu Institute of Technology School of Computer Science and Systems Engineering Department of Artificial Intelligence

- 1994

Country：Japan

Research Interests・Research Keywords

Research theme： Next-Generation Computer System Architecture

Keyword： Superconductor Computing, Quantum Computing, Photonic Computing, Processor, Multi-Core, Many-Core, Memory Architecture, SOC, High-Performance, Low-Power, Dependable

Research period： 2004.9

Awards

Design Contest Award Honorable Mention

2017.8 IEEE The 23rd International Symposium on Low Power Electronics and Design (ISLPED)

1.6-mW, 56-GHz Arithmetic Logic Unit Based on Superconductor Single-Flux-Quantum Logic Circuit
Best Paper Award, 2011 Symposium on High Performance Computing and Computational Science

2011.1
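Young Scientists' Prize, FY2008 Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology

2008.4 Ministry of Education, Culture, Sports, Science and Technology (MEXT)
Encouragement Award, 15th Workshop on Circuits and Systems (Karuizawa)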

2003.1

Young Researcher Encouragement Award
Challenge Award, 4th LSI IP Design Award

2002.1

LSI IP Design Award Challenge Award
Information Processing Society of Japan (IPSJ) 40th Anniversary Commemorative Paper Award

2001.1

Information Processing Society of Japan (IPSJ) 40th Anniversary Commemorative Paper Award

Papers

SuperCore: An Ultra-Fast Superconducting Processor for Cryogenic Applications Reviewed International journal

Junhyuk Choi, Ilkwon Byun, Juwon Hong, Dongmoon Min, Junpyo Kim, Jungmin Cho, Hyeonseong Jeong, Masamitsu Tanaka, Koji Inoue, and Jangwoo Kim

Proceedings of the Annual International Symposium on Microarchitecture 1532 - 1547 2024.11 （ ISSN:10724451 ISBN:9798350350579 ）

Language：English Publishing type：Research paper (other academic) Publisher：Proceedings of the Annual International Symposium on Microarchitecture Micro

Superconductor single-flux-quantum (SFQ) logic family has been recognized as a promising technology for cryogenic applications (e.g., quantum computing, astronomy, metrology) thanks to its ultra-fast and low-energy characteristics. Therefore, recent efforts in SFQ-based computing have focused on developing fast and low-power SFQ processors for cryogenic applications. However, there still has been little progress toward a convincing SFQ processor design due to the critical performance challenges originating from its extremely deep pipeline. In this paper, we propose a super-fast and low-power in-order SFQ processor by tackling the challenges from the deep pipeline. First, we develop a minimal-depth SFQ processor pipeline with novel architecture-level ideas. Next, we conduct in-depth performance analyses and identify three real performance bottlenecks in the deeply pipelined SFQ processors (i.e., stall/flush logic, RAW stall, fetch unit). Finally, we propose SuperCore, our super-fast SFQ-based processor architecture, with three SFQ-friendly solutions that effectively resolve the identified bottlenecks. With our solutions applied, SuperCore achieves 11 times speed-up over the SFQ processor baseline. In addition, SuperCore achieves six times speed-up and consumes up to 193 times less power compared to in-order CMOS processors running at 4K.

DOI： 10.1109/MICRO61859.2024.00112

QIsim: Architecting 10+K Qubit QC Interfaces Toward Quantum Supremacy Reviewed

Dongmoon Min, Junpyo Kim, Junhyuk Choi, Ilkwon Byun, Masamitsu Tanaka, Koji Inoue, Jangwoo Kim

Proceedings of the 50th Annual International Symposium on Computer Architecture 1 - 16 2023.6 （ ISSN:10636897 ISBN:9798400700958 ）

Language：Others Publishing type：Research paper (other academic) Publisher：ACM

Exploration and proposal of interface architectures between qubits and classical processing in quantum computers.

DOI： 10.1145/3579371.3589036

Q3DE: A fault-tolerant quantum computer architecture for multi-bit burst errors by cosmic rays. Reviewed

Yasunari Suzuki, Takanori Sugiyama, Tomochika Arai, Wang Liao, Koji Inoue, Teruo Tanimoto

MICRO 2022-October 1110 - 1125 2022.10 （ ISSN:10724451 ISBN:9781665462723 ）

Language：Others Publishing type：Research paper (other academic) Publisher：IEEE

Analyzes the impact of cosmic rays on the error tolerance of qubits and proposes an error-correction algorithm and architecture that address this problem.

DOI： 10.1109/MICRO56248.2022.00079

Other Link： https://dblp.uni-trier.de/db/conf/micro/micro2022.html#SuzukiSALIT22
XQsim: modeling cross-technology control processors for 10+K qubit quantum computers.

Ilkwon Byun, Junpyo Kim, Dongmoon Min, Ikki Nagaoka, Kosuke Fukumitsu, Iori Ishikawa, Teruo Tanimoto, Masamitsu Tanaka, Koji Inoue, Jangwoo Kim

ISCA 366 - 382 2022.6 （ ISSN:10636897 ISBN:9781450386104 ）

Language：Others Publishing type：Research paper (other academic) Publisher：ACM

Proposal for the exploration and improvement of quantum error correction architectures.

DOI： 10.1145/3470496.3527417

Other Link： https://dblp.uni-trier.de/db/conf/isca/isca2022.html#ByunKMNFITTIK22
Superconductor Computing for Neural Networks.

Koki Ishida, Ilkwon Byun, Ikki Nagaoka, Kosuke Fukumitsu, Masamitsu Tanaka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Jangwoo Kim, Koji Inoue

IEEE Micro 41 ( 3 ) 19 - 26 2021.5

Language：Others Publishing type：Research paper (scientific journal)

Proposal of an AI accelerator architecture using superconducting single-flux-quantum circuits.

DOI： 10.1109/MM.2021.3070488
SuperNPU: An Extremely Fast Neural Processing Unit Using Superconducting Logic Devices.

Koki Ishida, Ilkwon Byun, Ikki Nagaoka, Kosuke Fukumitsu, Masamitsu Tanaka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Jangwoo Kim, Koji Inoue

53rd Annual IEEE/ACM International Symposium on Microarchitecture(MICRO) 58 - 72 2020.10

Language：Others Publishing type：Research paper (other academic)

Proposal of an AI accelerator architecture using superconducting single-flux-quantum circuits.

DOI： 10.1109/MICRO50266.2020.00018
Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing.

Yuichi Inadomi, Tapasya Patki, Koji Inoue, Mutsumi Aoyagi, Barry Rountree, Martin Schulz 0001, David K. Lowenthal, Yasutaka Wada, Keiichiro Fukazawa, Masatsugu Ueda, Masaaki Kondo, Ikuo Miyoshi

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis(SC) 78 - 12 2015.11

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1145/2807591.2807638
Performance prediction of large-scale parallell system and application using macro-level simulation.

Ryutaro Susukita, Hisashige Ando, Mutsumi Aoyagi, Hiroaki Honda, Yuichi Inadomi, Koji Inoue, Shigeru Ishizuki, Yasunori Kimura, Hidemi Komatsu, Motoyoshi Kurokawa, Kazuaki J. Murakami, Hidetomo Shibamura, Shuji Yamamura, Yunqing Yu

Proceedings of the ACM/IEEE Conference on High Performance Computing(SC) 20 - 20 2008.11

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/SC.2008.5220091
Design and Implementation of Opto-Electrical Hybrid Floating-Point Multipliers

Inaba T., Ono T., Inoue K., Kawakami S.

IEICE Transactions on Information and Systems E108.D ( 1 ) 2 - 11 2025.1 （ ISSN:09168532 ）

Publisher：IEICE Transactions on Information and Systems

The performance improvement delivered by CMOS circuit technology is reaching its limits. Many researchers have been studying computing technologies that use emerging devices to address this issue. Nanophotonic technology is a promising candidate due to its ultra-low latency, high bandwidth, and low power characteristics. Although previous research has developed hardware accelerators that exploit nanophotonic circuits for AI inference applications, the acceleration of training, which requires complex Floating-Point (FP) operations, has not been considered. In particular, the design balance between optical and electrical circuits has a critical impact on the latency, energy, and accuracy of the arithmetic system, and thus requires careful consideration. In this study, we design three types of Opto-Electrical Floating-point Multipliers (OEFMs): accuracy-oriented (Ao-OEFM), latency-oriented (Lo-OEFM), and energy-oriented (Eo-OEFM). Based on our evaluation, we confirm that Ao-OEFM has high noise resistance, and Lo-OEFM and Eo-OEFM still have sufficient calculation accuracy. Compared to conventional electrical circuits, Lo-OEFM achieves an 87% reduction in latency, and Eo-OEFM reduces energy consumption by 42%.

DOI： 10.1587/transinf.2024PAP0003

Approximate SFQ-based Computing Architecture Modeling with Device-level Guidelines

Mundhe P., Hano Y., Kawakami S., Tanimoto T., Tanaka M., Inoue K., Byun I.

IEEE Computer Architecture Letters 24 ( 2 ) 253 - 256 2025 （ ISSN:15566056 ）

Publisher：IEEE Computer Architecture Letters

Single-flux-quantum (SFQ) logic has emerged as a promising post-Moore technology thanks to its ultra-fast and low-energy operation. However, despite progress in various fields, its feasibility is questionable due to the prohibitive cooling cost. Proven conventional ideas, such as approximate computing, may help to resolve this challenge. However, introducing such ideas has been impossible due to the complex performance, power, and error trade-offs originating from the unique SFQ device characteristics. This work introduces approximate SFQ-based computing (AxSFQ) with an architecture modeling framework and essential design guidelines. Our optimized device-level AxSFQ showcases 30 to 100 times energy efficiency improvement, which motivates further circuit and architecture-level exploration.

DOI： 10.1109/LCA.2025.3573740

C3-VQA: Cryogenic Counter-Based Coprocessor for Variational Quantum Algorithms

Yosuke Ueno, Satoshi Imamura, Yuna Tomida, Teruo Tanimoto, Masamitsu Tanaka, Yutaka Tabuchi, Koji Inoue, Hiroshi Nakamura

IEEE Transactions on Quantum Engineering 6 1 - 17 2025 （ eISSN:2689-1808 ）

Publishing type：Research paper (scientific journal) Publisher：Institute of Electrical and Electronics Engineers (IEEE)

Cryogenic quantum computers play a leading role in demonstrating quantum advantage. Given the severe constraints on the cooling capacity in cryogenic environments, thermal design is crucial for the scalability of these computers. The sources of heat dissipation include passive inflow via intertemperature wires and the power consumption of components located in the cryostat, such as wire amplifiers and quantum-classical interfaces. Thus, a critical challenge is to reduce the number of wires by reducing the required intertemperature bandwidth while maintaining minimal additional power consumption in the cryostat. One solution to address this challenge is near-data processing using ultralow-power computational logic within the cryostat. Based on the workload analysis and domain-specific system design focused on variational quantum algorithms (VQAs), we propose the cryogenic counter-based coprocessor for VQAs (C3-VQA) to enhance the design scalability of cryogenic quantum computers under the thermal constraint. The C3-VQA utilizes single-flux-quantum logic, which is an ultralow-power superconducting digital circuit that operates at the 4 K environment. The C3-VQA precomputes a part of the expectation value calculations for VQAs and buffers intermediate values using simple bit operation units and counters in the cryostat, thereby reducing the required intertemperature bandwidth with small additional power consumption. Consequently, the C3-VQA reduces the number of wires, leading to a reduction in the total heat dissipation in the cryostat. Our evaluation shows that the C3-VQA reduces the total heat dissipation at the 4 K stage by 30% and 81% under sequential-shot and parallel-shot execution scenarios, respectively. Furthermore, a case study in quantum chemistry shows that the C3-VQA reduces total heat dissipation by 87% with a 10 000-qubit system.

DOI： 10.1109/tqe.2024.3521442

Data-Pattern-Driven LUT for Efficient In-Cache Computing in CNNs Acceleration

Fei Z., Lyu M., Kawakami S., Inoue K.

IEEE Computer Architecture Letters 24 ( 1 ) 81 - 84 2025 （ ISSN:15566056 ）

Publisher：IEEE Computer Architecture Letters

The lookup table (LUT)-based Processing-in-Memory (PIM) solutions perform computations by looking up precomputed results stored in LUTs, providing exceptional efficiency for complex operations such as multiplication, making them highly suitable for energy- and latency-efficient Convolutional Neural Network (CNN) inference tasks. However, including all possible results in the LUT naively demands exponential hardware resources, significantly limiting parallelism and increasing hardware area, latency, and power overhead. While decomposition and compression techniques can reduce the LUT size, they also introduce considerable memory access overhead and additional operations. To address these challenges, we conduct an extensive analysis to identify which data portions significantly impact accuracy in CNNs. Based on the insight that key data is concentrated in a small range, we propose a data-pattern-driven (DPD) optimization strategy, which approximates less critical data to drastically reduce LUT size while preserving computational efficiency with acceptable accuracy loss.

DOI： 10.1109/LCA.2025.3548080

Exploring Volatile FPGAs Potential for Accelerating Energy-Harvesting IoT Applications

Babai A.M.A., Inoue K.

IEEE Computer Architecture Letters 24 ( 1 ) 137 - 140 2025 （ ISSN:15566056 ）

Publisher：IEEE Computer Architecture Letters

Low-power volatile FPGAs (VFPGAs) naturally meet the intertwined processing and flexibility demands of IoT devices. However, as IoT devices shift toward Energy Harvesting (EH) for self-sustained operation, VFPGAs are overlooked because they struggle under harvested power. Their volatile SRAM configuration memory cells frequently lose their data, causing high reconfiguration penalties. These penalties grow with FPGAs’ resource usage, limiting it under EH. Still, advances in low-power FPGAs and energy-buffering systems’ efficiency motivate us to explore EH-powered FPGAs. Thus, we analyze the interplay of their resources, performance, and reconfiguration; simulate their operation under different EH conditions; and show how they can be utilized up to an application- and EH-dependent threshold.

DOI： 10.1109/LCA.2025.3563105

SPDID: A Secure and Privacy-Preserving Decentralized Identity utilizing Blockchain and PUF Reviewed International journal

He Y., Fan W., Inoue K.

Proceedings of the IEEE International Conference on Trust Security and Privacy in Computing and Communications Trustcom ( 2024 ) 1622 - 1623 2024.12 （ ISSN:2324898X ISBN:9798331506209 ）

Language：English Publishing type：Research paper (other academic) Publisher：Proceedings of the IEEE International Conference on Trust Security and Privacy in Computing and Communications Trustcom

For internet users, using digital identities has become increasingly essential. Traditional systems often rely on centralized control, which no longer meets privacy demands. Decentralized identity systems offer a more transparent and resilient solution by giving users control over their identity data through distributed ledgers, decentralized identifiers, and verifiable credentials. However, existing schemes are often based on assumptions such as the presence of a reliable anonymous credential issuer, making them challenging to implement in real-world scenarios. Furthermore, they overlook handling user credentials securely and struggle with ensuring user authentication while providing privacy. These systems are unable to provide public audits of issued credentials without jeopardizing sensitive data. This paper introduces SPDID, a novel decentralized identity system based on blockchain technology. First, it transforms legacy documents into anonymous credentials without interaction or alteration using zero-knowledge proofs and Pedersen commitments. SPDID also employs the physical unclonable function (PUF) to design a secure key management system resilient to physical attacks. Additionally, it suggests a user authentication method that is unlinkable and resistant to Sybil attacks without the need for new trusted third parties. Furthermore, SPDID uses Merkle trees to construct a credential issuance list that is publicly auditable on the blockchain. The system's security and practical performance are demonstrated through security analysis and a prototype implementation on Hyperledger Fabric.

DOI： 10.1109/TrustCom63139.2024.00223

Multithreaded Edge-Assisted Visual SLAM with Keyframe Backup Mechanism Reviewed International journal

Xia C., Wang Y., Inoue K.

Proceedings 2024 12th International Symposium on Computing and Networking Candar 2024 259 - 265 2024.11 （ ISBN:9798331528362 ）

Language：English Publishing type：Research paper (other academic) Publisher：Proceedings 2024 12th International Symposium on Computing and Networking Candar 2024

Simultaneous Localization and Mapping (SLAM) has rapidly advanced on mobile devices, particularly camera-based Visual SLAM, which enables spatial perception and autonomous positioning by processing continuous image data. Due to its high memory and processing demands, it is challenging to deploy and execute continuously for a long time on mobile devices. The edge-assisted architecture that mitigates resource constraints by offloading heavy tasks to an edge server becomes optimal for settling the problem. However, existing studies suffer from high data synchronization delay, which is handled in the tracking module, resulting in prolonged tracking interruption and poor system robustness and accuracy. Based on a typical edge-assisted Visual SLAM system, we analyze the impact of the data synchronization process and propose a new multithreaded tracking solution with a keyframe backup mechanism. Verifying through two standard datasets, we evaluate our system's robustness and localization accuracy. The results show that the proposed system reduces tracking interruption by up to 88.1% and significantly improves the coverage, a critical robustness metric of the SLAM system, by up to 25.5%. Additionally, our proposed solution significantly improves localization accuracy, especially in rotation scenarios, by up to 26.7%.

DOI： 10.1109/CANDAR64496.2024.00041

Performance evaluation of all intra Kvazaar and x265 HEVC encoders on embedded system Nvidia Jetson platform

James R., Abo-Zahhad M., Inoue K., Sayed M.S.

Journal of Real-Time Image Processing 21 ( 3 ) 2024.5 （ ISSN:18618200 ）

Publisher：Journal of Real-Time Image Processing

The growing demand for high-quality video requires complex coding techniques that consume more resources and increase encoding time, which poses a challenge for real-time processing on embedded systems. Kvazaar and x265 are two efficient implementations of the High Efficiency Video Coding (HEVC) standard. In this paper, the performance of the All Intra Kvazaar and x265 encoders on the Nvidia Jetson platform was evaluated using two coding configurations: a high-speed preset and a high-quality preset. We used two scenarios. In the first, the two encoders were run on the CPU; based on the average encoding time, Kvazaar proved to be 65.44% and 69.4% faster than x265, with 1.88% and 0.6% BD-rate improvement over x265, at the high-speed and high-quality presets, respectively. In the second scenario, the two encoders were run on the GPU of the Nvidia Jetson, and the results show that the average encoding time under each preset is reduced by half compared with the CPU-based scenario. In addition, Kvazaar is 54.5% and 56.70% faster, with 1.93% and 0.45% BD-rate improvement over x265, at the high-speed and high-quality presets, respectively. Regarding scalability, the two encoders on the CPU scale linearly up to four threads, and speed remains constant afterward. On the GPU, the two encoders scale linearly with the number of threads. The results confirm that Kvazaar is more efficient and that it can be used on embedded systems for real-time video applications due to its higher speed and performance compared with the x265 HEVC encoder.

DOI： 10.1007/s11554-024-01429-5

TinyEmergencyNet: a hardware-friendly ultra-lightweight deep learning model for aerial scene image classification

Mogaka O.M., Zewail R., Inoue K., Sayed M.S.

Journal of Real-Time Image Processing 21 ( 2 ) 2024.4 （ ISSN:18618200 ）

Publisher：Journal of Real-Time Image Processing

In the context of emergency response applications, real-time situational awareness is vital. Unmanned aerial vehicles (UAVs) with imagers have emerged as crucial tools for providing timely information in such scenarios. Convolutional neural networks (CNNs) are effective in image processing. However, the deployment of CNN models in UAVs faces significant challenges. CNN models involve a large number of parameters and energy-costly floating-point computations beyond the memory and power available on board the UAVs. To address these challenges, we propose a co-design optimization approach for deploying the EmergencyNet CNN model on resource-constrained UAVs. Our strategy includes channel-wise pruning to reduce the size and optimize the network architecture. Additionally, we apply additive powers-of-two (APoT) quantization to further compress the model and enhance computational efficiency. Using channel-wise network pruning, we derive TinyEmergencyNet, which is only 155KB in memory size and 50% smaller than EmergencyNet. The proposed approach is evaluated on the Aerial Image Disaster Event Recognition (AIDER) dataset. We achieve an F1-score of 93.6% with 4-bit APoT quantization, which closely approaches the full-precision (32-bit) accuracy of 94%. Furthermore, the hardware-friendly bit-shifting operations that result from APoT quantization present an added advantage in hardware accelerator implementations. This work pioneers the joint application of channel-wise pruning and non-uniform APoT quantization on EmergencyNet, presenting a suitable solution tailored for UAV-based emergency response applications.

DOI： 10.1007/s11554-024-01430-y

CFChain: A Crowdfunding Platform that Supports Identity Authentication, Privacy Protection, and Efficient Audit

Yueyue He, Jiageng Chen, Koji Inoue

International Conference on Algorithms and Architectures for Parallel Processing 14493 LNCS 146 - 167 2024.3 （ ISSN:0302-9743 ISBN:9789819708611 eISSN:1611-3349 ）

Language：Others Publishing type：Research paper (other academic) Publisher：Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Charity crowdfunding is a technique for raising funds that involves collecting modest contributions from a vast number of individuals or groups via established crowdfunding platforms or other digital avenues. The objective is to provide support for charitable organizations, social welfare initiatives, or personal requirements. The widespread adoption of the Internet and the rapid advancement of digital technology have facilitated the global dissemination and promotion of charity crowdfunding. However, crowdfunding platforms have recently experienced a decline in credibility due to various factors such as fraudulent donations, inadequate fund management, and other forms of disorder. The blockchain’s decentralization and anti-tampering features exhibit a high degree of compatibility with the requirements of a crowdfunding platform. Most current state-of-the-art techniques do not ensure the non-linkability of user identities in the face of sybil attacks, nor do they offer a streamlined auditing mechanism for crowdsourcing modest donations that simultaneously preserves transactional privacy. This paper presents a novel crowdfunding system called CFChain based on blockchain technology. Initially, the distributed identity and BLS signature are employed to establish a user authentication mechanism, enabling CFChain to withstand sybil attacks while preserving the non-linkability of user identities. Subsequently, a crowdfunding mechanism is constructed utilizing zero-knowledge proofs to facilitate streamlined auditing procedures while safeguarding donations’ confidentiality. Additionally, a security analysis of CFChain is presented. The system prototype is subsequently implemented on the Hyperledger Fabric. Empirical evidence indicates that the efficiency of CFChain is viable.

DOI： 10.1007/978-981-97-0862-8_10

Inter-Temperature Bandwidth Reduction in Cryogenic QAOA Machines

Yosuke Ueno, Yuna Tomida, Teruo Tanimoto, Masamitsu Tanaka, Yutaka Tabuchi, Koji Inoue, Hiroshi Nakamura

IEEE Computer Architecture Letters 23 ( 1 ) 1 - 4 2024.1 （ ISSN:1556-6056 eISSN:1556-6064 ）

Language：Others Publishing type：Research paper (scientific journal) Publisher：Institute of Electrical and Electronics Engineers (IEEE)

The bandwidth limit between cryogenic and room-temperature environments is a critical bottleneck in superconducting noisy intermediate-scale quantum computers. This paper presents the first trial of algorithm-aware system-level optimization to solve this issue by targeting the quantum approximate optimization algorithm. Our counter-based cryogenic architecture using single-flux quantum logic shows exponential bandwidth reduction and decreases heat inflow and peripheral power consumption of inter-temperature cables, which contributes to the scalability of superconducting quantum computers.

DOI： 10.1109/lca.2023.3322700

Late Breaking Results: Single Flux Quantum Based Brownian Circuits for Ultra-Low-Power Computing

Kawakami S., Ohtusbo Y., Inoue K., Tanaka M.

Proceedings -Design, Automation and Test in Europe, DATE 2024 （ ISSN:15301591 ISBN:9798350348590 ）

Publisher：Proceedings -Design, Automation and Test in Europe, DATE

This paper proposes a random walk circuit implementation with single flux quantum devices, essential for Brownian circuits, to reduce processing energy consumption dramatically. SPICE-based simulation demonstrates its functional operation, and the Shapiro-Wilk test confirms that random walks can be achieved. Furthermore, we developed a Monte Carlo simulator for Brownian circuits, enabling functionality verification and computation step distribution analysis. Latency/energy evaluation using a half-adder as a case study revealed that the proposed circuits could reduce energy consumption by 1/1260 and offer an opportunity for low-power computing systems.

CrowdChain: A privacy-preserving crowdfunding system based on blockchain and PUF

He Y., Inoue K.

Peer-to-Peer Networking and Applications 17 ( 6 ) 3669 - 3687 2024 （ ISSN:19366442 ）

Publisher：Peer-to-Peer Networking and Applications

Crowdfunding refers to the online collection of certain capital from a vast number of individuals or groups that each contribute a relatively small amount. Recently, the credibility of crowdfunding platforms has been undermined by fraudulent projects, inadequate fund management, and other forms of disorder. The decentralization and anti-tampering features of blockchain provide the possibility to solve the above problems, and many studies have proposed blockchain-based crowdfunding schemes. However, the existing state-of-the-art methods do not provide user authentication, transaction auditing, and identity management in a privacy-preserving way. Accordingly, this paper presents a novel blockchain-based crowdfunding system called CrowdChain. Initially, the distributed identity and BLS signature are employed to establish a user authentication mechanism, enabling CrowdChain to withstand Sybil attacks while preserving the non-linkability of user identities. Secondly, the physically unclonable function (PUF) is used to generate keys associated with digital identities that are not stored in external devices to resist physical attacks. Subsequently, a crowdfunding mechanism is constructed utilizing zero-knowledge proofs to facilitate streamlined auditing procedures while safeguarding the confidentiality of transactions. Additionally, the formal security analysis proves the security of the CrowdChain scheme. The system prototype is implemented on the Hyperledger Fabric. Empirical evidence indicates the viable efficiency of CrowdChain.

DOI： 10.1007/s12083-024-01785-w

2A20 Semiconductor Human Resource Development Education for Junior High and High School Students

INOUE Koji, YAMADA Junji, KIMOTO Kanae, KOMAZAWA Toshiaki

Proceedings of Annual Conference of Japanese Society for Engineering Education 2024 ( 0 ) 126 - 127 2024 （ ISSN:21898928 eISSN:24241458 ）

Language：Japanese Publisher：Japanese Society for Engineering Education

DOI： 10.20549/jseeja.2024.0_126

Empirical Power-performance Analysis of Layer-wise CNN Inference on Single Board Computers

Kuan Yi Ng, Aalaa M.A. Babai, Teruo Tanimoto, Satoshi Kawakami, Koji Inoue

Journal of Information Processing 31 478 - 494 2023.7 （ eISSN:1882-6652 ）

Language：Others Publishing type：Research paper (scientific journal) Publisher：Information Processing Society of Japan

This paper analyzes the impact of input sparsity and DFS/DVFS configurations for single-board computers on the execution time, power, and energy of each VGG16 layer as the first step towards efficient CNN inference on single-board computers. For this purpose, we first develop a power and execution time measurement environment and perform experiments using Raspberry Pi 4 and NVIDIA Jetson Nano. Our results show that clock frequency strongly correlates with execution time and power. Inversely, input sparsity has a weak correlation with execution time and power. Then, we show that a coarse-grained DVFS model can explain over 96% of the variations in the power of each VGG16 layer even when sets of clock frequency and voltage on the single-board computer are unavailable.

DOI： 10.2197/ipsjjip.31.478

50-GFLOPS Floating-Point Adder and Multiplier Using Gate-Level-Pipelined Single-Flux-Quantum Logic With Frequency-Increased Clock Distribution Reviewed

Ikki Nagaoka, Ryota Kashima, Masamitsu Tanaka, Satoshi Kawakami, Teruo Tanimoto, Taro Yamashita, Koji Inoue, Akira Fujimaki

IEEE Transactions on Applied Superconductivity 33 ( 4 ) 1 - 11 2023.6 （ ISSN:1051-8223 eISSN:1558-2515 ）

Language：Others Publishing type：Research paper (scientific journal) Publisher：Institute of Electrical and Electronics Engineers (IEEE)

We demonstrate the functioning of a high-throughput, gate-level-pipelined floating-point adder and multiplier over 50 GHz. The gate-level-pipelined floating-point adder and multiplier requires dedicated circuit blocks to wait until other circuit blocks complete calculations because of the dependence between their sign, exponent, and significand parts. We revealed that the resultant delay difference of the waiting circuit blocks hinders high-frequency operation if the predesigned circuit blocks with the fixed clock distribution are connected in a simple manner. We showed that clock distribution needs to synchronize with every pipeline stage regardless of the circuit blocks to minimize the delay difference between the circuit blocks for circuits containing the waiting circuit blocks (e.g., the floating-point adder and multiplier). We designed a 5-bit floating-point adder and multiplier to demonstrate the effectiveness of the clock distribution experimentally. The test chips were fabricated using AIST 10-kA/cm² Advanced Process 2. We verified the high-speed operation at over 50 GHz in the floating-point adder and multiplier. The maximum clock frequency and throughput of the floating-point adder were 56 GHz and 56 GFLOPS, respectively. The corresponding values for the floating-point multiplier were 63 GHz and 63 GFLOPS, respectively.

DOI： 10.1109/tasc.2023.3250614

Scopus

CiNii Research

researchmap
A High-Throughput Multiply-Accumulate Unit With Long Feedback Loop Using Low-Voltage Rapid Single-Flux Quantum Circuits Reviewed

Ikki Nagaoka, Ryota Kashima, Koki Ishida, Masamitsu Tanaka, Taro Yamashita, Takatsugu Ono, Koji Inoue, Akira Fujimaki

IEEE Transactions on Applied Superconductivity 33 ( 3 ) 1 - 8 2023.4 （ ISSN:1051-8223 eISSN:1558-2515 ）

Language：Others Publishing type：Research paper (scientific journal) Publisher：Institute of Electrical and Electronics Engineers (IEEE)

In this article, we demonstrate a high-throughput gate-level-pipelined 8-bit multiply-accumulate (MAC) unit with a long feedback loop using low-voltage rapid single-flux quantum (LV-RSFQ) logic. The long feedback loop in the MAC unit is an obstacle to high-throughput operation because the logic gates must wait for the delayed inputs from the feedback loop. LV-RSFQ logic makes high-frequency operation even more difficult because its feedback delay is larger and more variable. We design the feedback loop using counter-flow clocking and add many D flip-flops to divide the long feedback loop into shorter paths. The target clock frequency of the MAC unit with a feedback loop was set to 30 GHz based on the experimental results of the MAC unit without a feedback loop. We model the clock frequency and its circuit overhead in a feedback loop to design a feedback loop for the MAC unit that achieves 30 GHz with minimum overhead. The test chips are fabricated using the National Institute of Advanced Industrial Science and Technology (AIST) 10-kA/cm² Advanced Process 2. Using the model-based design, we successfully obtained high-throughput 30-GHz operation in the LV-RSFQ MAC unit with a long feedback loop. The maximum operating frequency of the MAC unit reaches 40 GHz.

DOI： 10.1109/tasc.2023.3239329

Scopus

CiNii Research

researchmap
Next Generation Cryogenic Superconductor Computing: From Classical to Quantum

Inoue Koji

2023.4

Language：English Publisher：IEEE

Moore’s Law, doubling the number of transistors in a chip every two years, has so far contributed to the evolution of computer systems. Unfortunately, we cannot expect sustainable transistor shrinking anymore, marking the beginning of the so-called post-Moore era. Therefore, it has become essential to explore emerging devices, and superconductor single-flux-quantum (SFQ) logic that operates in a 4.2-kelvin environment is a promising candidate. Josephson junctions (JJs) are used as switching elements in SFQ logic to compose a superconductor ring (SFQ ring) that can store (or trap) and transfer a single magnetic flux quantum. It fundamentally operates with the voltage pulse-driven nature that makes it possible to achieve extremely low-latency and low-energy JJ switching. This talk shares the history of our SFQ research, e.g., revisiting microarchitecture and demonstrating over 30 GHz microprocessors, AI accelerator designs, and recently targeting quantum computers. Then, the role of computer architecture for such emerging device computing is discussed.

CiNii Research
A Hybrid Opto-Electrical Floating-point Multiplier Reviewed

Takumi Inaba, Takatsugu Ono, Koji Inoue, Satoshi Kawakami

2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) 313 - 320 2022.12 （ ISBN:9781665464994 ）

Language：Others Publishing type：Research paper (other academic) Publisher：IEEE

The performance improvement offered by CMOS circuit technology is reaching its limits. Many researchers have been studying computing technologies that use emerging devices to address this critical issue. Nanophotonic technology is a promising candidate due to its ultra-low latency, high bandwidth, and low power. Current nanophotonic computing research focuses on designing hardware accelerators for AI inference applications, whereas nanophotonic accelerators for AI training applications have received little attention. The main reason is that state-of-the-art nanophotonic AI accelerators involve integer operations, whereas floating-point (FP) sum-of-products operations dominate the training process. To the best of the authors' knowledge, there are no optical circuits that target floating-point arithmetic units. This study proposes a novel Opto-Electrical Floating-point Multiplier (OEFM) toward an ultra-low-latency, power-efficient nanophotonic accelerator for AI training applications. We design a microarchitecture of OEFM, including a novel optical integer multiplier and other electrical components. Based on our evaluation framework, we analyze the calculation accuracy of the proposed multiplier and OEFM. Experimental results show that OEFM achieves a 56% reduction in latency and a 41% reduction in energy consumption compared with a conventional electrical circuit.

DOI： 10.1109/mcsoc57363.2022.00057

Scopus

CiNii Research

researchmap
Implementation of Edge-cloud Cooperative CNN Inference on an IoT Platform Reviewed

Yuan Wang, Hidetomo Shibamura, KuanYi Ng, Koji Inoue

2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) 337 - 344 2022.12 （ ISBN:9781665464994 ）

Language：Others Publishing type：Research paper (other academic) Publisher：IEEE

Since the Internet of Things (IoT) has become more widely used in various industrial situations, Artificial Intelligence (AI) programs, particularly Convolutional Neural Network (CNN) applications, are projected to be implemented on edge devices to meet high-accuracy and huge industry computing needs. Offloading computing-intensive workloads to the cloud is a promising solution for compact energy-constrained edge devices, but it tends to incur significant costs in total execution latency. For flexible and fine-grained offloading, this paper aims to design and implement an edge-cloud cooperative CNN inference framework on an IoT platform by targeting TensorFlow Lite. We have confirmed the feasibility and accuracy of the implementation by running LeNet, AlexNet, and VGGNet. To enable high-performance edge-cloud AI execution on the presented IoT platform, we evaluate the performance overhead (total execution latency) of the implementation and identify the current bottlenecks of the target platform.
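
The latency trade-off described in the abstract above (less edge computation versus more data transfer) can be illustrated with a simple cost model. This is only a sketch under stated assumptions, not the framework from the paper: the function `best_split`, the per-layer timing lists, and the single fixed bandwidth are all hypothetical, and the time to return the result to the edge device is ignored.

```python
from typing import List, Tuple


def best_split(
    edge_ms: List[float],           # hypothetical per-layer latency on the edge device
    cloud_ms: List[float],          # hypothetical per-layer latency in the cloud
    output_bytes: List[int],        # size of each layer's output activation
    input_bytes: int,               # size of the network input
    bandwidth_bytes_per_ms: float,  # assumed constant uplink bandwidth
) -> Tuple[int, float]:
    """Return (k, latency): layers [0, k) run on the edge, layers [k, n) in the cloud."""
    n = len(edge_ms)
    best_k, best_latency = 0, float("inf")
    for k in range(n + 1):
        if k == n:
            sent = 0                    # everything runs locally, nothing is uploaded
        elif k == 0:
            sent = input_bytes          # the raw input is uploaded
        else:
            sent = output_bytes[k - 1]  # the activation at the split point is uploaded
        latency = sum(edge_ms[:k]) + sent / bandwidth_bytes_per_ms + sum(cloud_ms[k:])
        if latency < best_latency:
            best_k, best_latency = k, latency
    return best_k, best_latency
```

A split point of `k = 0` corresponds to full offloading and `k = n` to purely local inference; intermediate values model the layer-wise cooperative execution the abstract refers to.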

DOI： 10.1109/mcsoc57363.2022.00060

Scopus

researchmap
Design and Analysis of a Nano-photonic Processing Unit for Low-Latency Recurrent Neural Network Applications Reviewed

Eito Sato, Koji Inoue, Satoshi Kawakami

2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) 321 - 329 2022.12 （ ISBN:9781665464994 ）

Language：Others Publishing type：Research paper (other academic) Publisher：IEEE

Recurrent neural networks (RNNs) have achieved high performance in inference processing that handles time-series data. Hardware acceleration of RNNs is helpful for tasks where real-time performance is essential, such as speech recognition and stock market prediction. The nano-photonic neural network accelerator is an approach that takes advantage of the high speed, high parallelism, and low power consumption of light to achieve high performance in neural network processing. However, existing methods are inefficient for RNNs due to significant overhead caused by the absence of recursive paths and the immaturity of the model to be designed. Therefore, architectural considerations that take advantage of RNN characteristics are essential for low latency. This paper proposes a fast and low-power processing unit for RNNs that introduces activation functions and recursion processing using optical devices. We clarified the impact of noise on the proposed circuit's calculation accuracy and inference accuracy. The calculation accuracy deteriorated significantly in proportion to the number of recursions, but the effect on inference accuracy was negligible. We also compared the performance of the proposed circuit to an all-electrical design and a hybrid design that processes the vector-matrix product optically and the recursion electrically. The proposed circuit improves latency by 467x and reduces power consumption by 93.0% compared with the all-electrical design, and improves latency by 7.3x and reduces power consumption by 58.6% compared with the hybrid design.

DOI： 10.1109/mcsoc57363.2022.00058

Scopus

CiNii Research

researchmap
A 57.2GHz 11.2mW 8-bit General Purpose Superconductor Microprocessor with Dual-Clocking Scheme Reviewed

Ikki Nagaoka, Ryota Kashima, Tomoki Nakano, Masamitsu Tanaka, Taro Yamashita, Koji Inoue, Akira Fujimaki

2022 IEEE Asian Solid-State Circuits Conference (A-SSCC) 1 - 3 2022.11 （ ISBN:9781665471435 ）

Language：Others Publishing type：Research paper (other academic) Publisher：IEEE

A superconductor single-flux-quantum (SFQ) logic 8-bit microprocessor is demonstrated up to 57.2 GHz with a measured power consumption of 11.2 mW. The microprocessor has an ultradeep, gate-level pipelining containing many feedback paths and communications between components. The arrival clock timings at all the logic gates are ultra-precisely tuned using two different clocking schemes, called 'concurrent-flow' and 'counter-flow,' to achieve extremely high clock frequency operation over 50 GHz. Low-temperature circumstances enable us to conduct super delay-intensive layout design by controlling delays of all waveguide interconnects in the order of sub-picosecond precision.

DOI： 10.1109/a-sscc56115.2022.9980802

Scopus

CiNii Research

researchmap
An Edge Autonomous Lamp Control with Camera Feedback Reviewed

Satoshi Matsushita, Teruo Tanimoto, Satoshi Kawakami, Takatsugu Ono, Koji Inoue

2022 IEEE 8th World Forum on Internet of Things (WF-IoT) 1 - 7 2022.10 （ ISBN:9781665491532 ）

Language：Others Publishing type：Research paper (other academic) Publisher：IEEE

Recently, IoT edge devices have become more diverse and less expensive. In addition, the computing performance of small low-power single-board computers has increased significantly. These conditions make it possible to process data locally without communicating with the cloud. Since the advantages of in-edge processing are security and privacy, we applied in-edge IoT to smart homes, which hold rich private information that must be secured. In-edge processing cannot rely on conventional cloud-managed abnormality monitoring and system maintenance. We developed a lamp control system with in-edge processing. It detects failures using camera image processing and recovers from them. Abnormalities in the image processing itself are detected by monitoring the cyclic change in outdoor brightness observed through windows captured with the same camera. We have developed a prototype system using Python with OpenCV, FastAPI, and other libraries on top of a PHP-based lamp timer control, while keeping the source code small and easy to validate. The camera detectors work at 10 FPS on Python with as few as 1607 total source code lines (three times the line count of the original lamp control timer).

DOI： 10.1109/wf-iot54382.2022.10152281

Scopus

CiNii Research

researchmap
Q3DE: A fault-tolerant quantum computer architecture for multi-bit burst errors by cosmic rays

Suzuki Yasunari, Sugiyama Takanori, Arai Tomochika, Liao Wang, Inoue Koji, Tanimoto Teruo

IEEE/ACM International Symposium on Microarchitecture (MICRO) 2022 1110 - 1125 2022.10

Language：English Publisher：Institute of Electrical and Electronics Engineers (IEEE)

Demonstrating small error rates by integrating quantum error correction (QEC) into an architecture of quantum computing is the next milestone towards scalable fault-tolerant quantum computing (FTQC). Encoding logical qubits with superconducting qubits and surface codes is considered a promising candidate for FTQC architectures. In this paper, we propose an FTQC architecture, which we call Q3DE, that enhances the tolerance to multi-bit burst errors (MBBEs) by cosmic rays with moderate changes and overhead. There are three core components in Q3DE: in-situ anomaly DEtection, dynamic code DEformation, and optimized error DEcoding. In this architecture, MBBEs are detected only from syndrome values for error correction. The effect of MBBEs is immediately mitigated by dynamically increasing the encoding level of logical qubits and re-estimating probable recovery operation with the rollback of the decoding process. We investigate the performance and overhead of the Q3DE architecture with quantum-error simulators and demonstrate that Q3DE effectively reduces the period of MBBEs by 1000 times and halves the size of their region. Therefore, Q3DE significantly relaxes the requirement of qubit density and qubit chip size to realize FTQC. Our scheme is versatile for mitigating MBBEs, i.e., temporal variations of error properties, on a wide range of physical devices and FTQC architectures since it relies only on the standard features of topological stabilizer codes.

CiNii Research
Design of Variable Bit-Width Arithmetic Unit Using Single Flux Quantum Device

Iori Ishikawa, Ikki Nagaoka, Ryota Kashima, Koki Ishida, Kosuke Fukumitsu, Keitaro Oka, Masamitsu Tanaka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Akira Fujimaki, Koji Inoue

2022 IEEE International Symposium on Circuits and Systems (ISCAS) 2022-May 3547 - 3551 2022.5 （ ISSN:02714310 ISBN:9781665484855 ）

Publisher：IEEE

This paper presents the design of an ultra-high-speed, low-power arithmetic unit that supports variable bit-width operations with single flux quantum (SFQ) technology. Because of the high-speed nature of superconductor devices, we can achieve extremely high power-performance efficiency that cannot be achieved by state-of-the-art CMOS devices. To implement the complex function to support the variable bit-width feature, we introduce a novel circuit architecture to maintain the high-speed operation over 50GHz. Our prototype chip design successfully demonstrated 53.5GHz 1.59mW operations.

DOI： 10.1109/iscas48785.2022.9937317

Scopus

CiNii Research
Next Generation Superconductor Computer Architecture

Koji INOUE

TEION KOGAKU (Journal of Cryogenics and Superconductivity Society of Japan) 57 ( 6 ) 382 - 383 2022 （ ISSN:03892441 eISSN:18800408 ）

Language：Japanese Publisher：CRYOGENICS AND SUPERCONDUCTIVITY SOCIETY OF JAPAN

DOI： 10.2221/jcsj.57.382

CiNii Research
Fast Screen Content Coding in HEVC Using Machine Learning.

Emad Badry, Koji Inoue, Mohammed Sharaf Sayed

IEEE Access 9 154659 - 154666 2021.11

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1109/ACCESS.2021.3125697
Demonstration of a 52-GHz Bit-Parallel Multiplier Using Low-Voltage Rapid Single-Flux-Quantum Logic Reviewed International journal

Ikki Nagaoka, Koki Ishida, Masamitsu Tanaka, Kyosuke Sano, Taro Yamashita, Takatsugu Ono, Koji Inoue, Akira Fujimaki

IEEE Transactions on Applied Superconductivity 31 ( 5 ) 1 - 5 2021.8

Language：English Publishing type：Research paper (scientific journal)

DOI： 10.1109/tasc.2021.3071996
Decision Tree Models and Early Splitting Termination in Screen Content Extension of High Efficiency Video Coding.

Emad Badry, Koji Inoue, Mohammed Sharaf Sayed

IEEE Access 8 143437 - 143452 2020.8

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1109/ACCESS.2020.3014163
How Many Trials Do We Need for Reliable NISQ Computing?

Teruo Tanimoto, Shuhei Matsuo, Satoshi Kawakami, Yutaka Tabuchi, Masao Hirokawa, Koji Inoue

2020 IEEE Computer Society Annual Symposium on VLSI(ISVLSI) 288 - 290 2020.7

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ISVLSI49217.2020.00059
Practical Error Modeling Toward Realistic NISQ Simulation.

Teruo Tanimoto, Shuhei Matsuo, Satoshi Kawakami, Yutaka Tabuchi, Masao Hirokawa, Koji Inoue

2020 IEEE Computer Society Annual Symposium on VLSI(ISVLSI) 291 - 293 2020.7

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ISVLSI49217.2020.00060
32 GHz 6.5 mW Gate-Level-Pipelined 4-Bit Processor using Superconductor Single-Flux-Quantum Logic

Koki Ishida, Masamitsu Tanaka, Ikki Nagaoka, Takatsugu Ono, Satoshi Kawakami, Teruo Tanimoto, Akira Fujimaki, Koji Inoue

2020 IEEE Symposium on VLSI Circuits, VLSI Circuits 2020 - Proceedings 2020.6

Language：English Publishing type：Research paper (other academic)

A Single-Flux-Quantum (SFQ) 4-bit throughput-oriented processor has successfully been demonstrated at up to 32 GHz with the measured power consumption of 6.5 mW. This is the first implementation of the gate-level-pipelined processor, and it achieves 2.5 Tera-Operations Per Watt (TOPS/W) by circuit and architectural optimizations.

DOI： 10.1109/VLSICircuits18222.2020.9162826
32 GHz 6.5 mW Gate-Level-Pipelined 4-Bit Processor using Superconductor Single-Flux-Quantum Logic.

Koki Ishida, Masamitsu Tanaka, Ikki Nagaoka, Takatsugu Ono, Satoshi Kawakami, Teruo Tanimoto, Akira Fujimaki, Koji Inoue

IEEE Symposium on VLSI Circuits 1 - 2 2020.6

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/VLSICircuits18222.2020.9162826
Enhancing a manycore-oriented compressed cache for GPGPU

Keitaro Oka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Inoue Koji

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region 22 - 31 2020.1

Language：English Publishing type：Research paper (other academic)

GPUs can achieve high performance by exploiting massive-thread parallelism. However, some factors limit performance on GPUs, one of which is the negative effects of L1 cache misses. In some applications, GPUs are likely to suffer from L1 cache conflicts because a large number of cores share a small L1 cache capacity. A cache architecture that is based on data compression is a strong candidate for solving this problem as it can reduce the number of cache misses. Unlike previous studies, our data compression scheme attempts to exploit the value locality existing within not only intra cache lines but also inter cache lines. We enhance the structure of a last-level compression cache proposed for general purpose manycore processors to optimize against shared L1 caches on GPUs. The experimental results reveal that our proposal outperforms the other compression cache for GPUs by 11 points on average.
Enhancing a manycore-oriented compressed cache for GPGPU.

Keitaro Oka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Koji Inoue

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region 22 - 31 2020.1

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1145/3368474.3368491
Energy Efficient Runahead Execution on a Tightly Coupled Heterogeneous Core.

Susumu Mashimo, Ryota Shioya, Koji Inoue

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region 207 - 216 2020.1

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1145/3368474.3368496
An open source FPGA-optimized out-of-order RISC-V soft processor

Susumu Mashimo, Koji Inoue, Ryota Shioya, Akifumi Fujita, Reoma Matsuo, Seiya Akaki, Akifumi Fukuda, Toru Koizumi, Junichiro Kadomoto, Hidetsugu Irie, Masahiro Goshima

Proceedings of the 18th International Conference on Field-Programmable Technology, ICFPT 2019 63 - 71 2019.12

Language：English Publishing type：Research paper (other academic)

High-performance soft processors in field-programmable gate arrays (FPGAs) have become increasingly important as recent large FPGA systems have relied on soft processors to run many complex workloads, like a network software stack. An out-of-order (OoO) superscalar approach is a good candidate to improve performance in such cases, as evidenced from OoO hard processor studies. Recent studies have revealed, however, that conventional OoO processor components do not fit well in an FPGA, and it is thus important to carefully design such components for FPGA characteristics. Hence, we propose the RSD processor: a new, open-source OoO RISC-V soft processor optimized for an FPGA. The RSD supports many aggressive OoO execution features, like speculative scheduling, OoO memory instruction execution and disambiguation, a memory dependence predictor, and a non-blocking cache. While the RSD supports such aggressive features, it also leverages FPGA characteristics. Therefore, it consumes fewer FPGA resources than are consumed by existing OoO soft processors, which do not support such aggressive features well. We first introduce the end result of the RSD microarchitecture design and then describe several novel optimization techniques. The RSD achieves up to 2.5-times higher Dhrystone MIPS while using 60% fewer registers and 64% fewer lookup tables (LUTs) as compared to state-of-the-art, open-source OoO processors.

DOI： 10.1109/ICFPT47387.2019.00016
Evaluating the Impact of Energy Efficient Networks on HPC Workloads.

Giorgis Georgakoudis, Nikhil Jain, Takatsugu Ono, Koji Inoue, Shinobu Miwa, Abhinav Bhatele

26th IEEE International Conference on High Performance Computing, Data, and Analytics(HiPC) 301 - 310 2019.12

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/HiPC.2019.00044
An Open Source FPGA-Optimized Out-of-Order RISC-V Soft Processor.

Susumu Mashimo, Koji Inoue, Ryota Shioya, Akifumi Fujita, Reoma Matsuo, Seiya Akaki, Akifumi Fukuda, Toru Koizumi, Junichiro Kadomoto, Hidetsugu Irie, Masahiro Goshima

International Conference on Field-Programmable Technology(FPT) 63 - 71 2019.12

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ICFPT47387.2019.00016
Evaluating the Impact of Energy Efficient Networks on HPC Workloads

Giorgis Georgakoudis, Nikhil Jain, Takatsugu Ono, Koji Inoue, Shinobu Miwa, Abhinav Bhatele

Proceedings of the 26th Annual IEEE International Conference on High Performance Computing, HiPC 2019 301 - 310 2019.12

Language：English Publishing type：Research paper (other academic)

Interconnection networks grow larger as supercomputers include more nodes and require higher bandwidth for performance. This scaling significantly increases the fraction of power consumed by the network, by increasing the number of network components (links and switches). Typically, network links consume power continuously once they are turned on. However, recent proposals for energy efficient interconnects have introduced low-power operation modes for periods when network links are idle. Low-power operation can increase messaging time when switching a link from low-power to active operation. We extend the TraceR-CODES network simulator for power modeling to evaluate the impact of energy efficient networking on power and performance. Our evaluation presents the first study on both single-job and multi-job execution to realistically simulate power consumption and performance under congestion for a large-scale HPC network. Results on several workloads consisting of HPC proxy applications show that single-job and multi-job execution favor different modes of low power operation to have significant power savings at the cost of minimal performance degradation.

DOI： 10.1109/HiPC.2019.00044
Novel frontier of photonics for data processing—Photonic accelerator Reviewed International journal

Ken Ichi Kitayama, Masaya Notomi, Makoto Naruse, Koji Inoue, Satoshi Kawakami, Atsushi Uchida

APL Photonics 4 ( 090901 ) 2019.9

Language：English

DOI： 10.1063/1.5108912
Novel frontier of photonics for data processing-Photonic accelerator Reviewed International journal

Kitayama, Ken-ichi; Notomi, Masaya; Naruse, Makoto; Inoue, Koji; Kawakami, Satoshi; Uchida, Atsushi

APL PHOTONICS 4 ( 9 ) 2019.9

Language：English

DOI： 10.1063/1.5108912
Novel frontier of photonics for data processing-Photonic accelerator Reviewed

Ken Ichi Kitayama, Masaya Notomi, Makoto Naruse, Koji Inoue, Satoshi Kawakami, Atsushi Uchida

APL Photonics 4 ( 9 ) 2019.9

Language：English

In the emerging Internet of things cyber-physical system-embedded society, big data analytics needs huge computing capability with better energy efficiency. Coming to the end of Moore's law of the electronic integrated circuit and facing the throughput limitation in parallel processing governed by Amdahl's law, there is a strong motivation behind exploring a novel frontier of data processing in post-Moore era. Optical fiber transmissions have been making a remarkable advance over the last three decades. A record aggregated transmission capacity of the wavelength division multiplexing system per a single-mode fiber has reached 115 Tbit/s over 240 km. It is time to turn our attention to data processing by photons from the data transport by photons. A photonic accelerator (PAXEL) is a special class of processor placed at the front end of a digital computer, which is optimized to perform a specific function but does so faster with less power consumption than an electronic general-purpose processor. It can process images or time-serial data either in an analog or digital fashion on a real-time basis. Having had maturing manufacturing technology of optoelectronic devices and a diverse array of computing architectures at hand, prototyping PAXEL becomes feasible by leveraging on, e.g., cutting-edge miniature and power-efficient nanostructured silicon photonic devices. In this article, first the bottleneck and the paradigm shift of digital computing are reviewed. Next, we review an array of PAXEL architectures and applications, including artificial neural networks, reservoir computing, pass-gate logic, decision making, and compressed sensing. We assess the potential advantages and challenges for each of these PAXEL approaches to highlight the scope for future work toward practical implementation.
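
For reference, Amdahl's law, which the abstract above cites as the limit on parallel-processing throughput, is commonly written as follows. The symbols are not taken from the paper: p denotes the fraction of the work that can be parallelized, N the number of processing units, and S(N) the resulting speedup.

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

The limit shows why adding processing units alone cannot remove the bottleneck: with p = 0.95, for example, the speedup never exceeds 20 regardless of N.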

DOI： 10.1063/1.5108912
Efficient Autoencoder-Based Human Body Communication Transceiver for WBAN.

Abdelhay Ali, Koji Inoue, Ahmed Shalaby, Mohammed Sharaf Sayed, Sabah Mohamed Ahmed

IEEE Access 7 117196 - 117205 2019.8

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1109/ACCESS.2019.2936796
Demonstration of an Energy-Efficient, Gate-Level-Pipelined 100 TOPS/W Arithmetic Logic Unit Based on Low-Voltage Rapid Single-Flux-Quantum Logic

Ikki Nagaoka, Masamitsu Tanaka, Kyosuke Sano, Taro Yamashita, Akira Fujimaki, Koji Inoue

17th IEEE International Superconductive Electronics Conference, ISEC 2019 2019.7

Language：English Publishing type：Research paper (other academic)

We report the successful operation of an energy-efficient 8-bit arithmetic logic unit (ALU) based on bit-parallel, gate-level-pipelining, and low-voltage rapid single-flux-quantum (LV-RSFQ) approaches. We implemented the ALU using a 10-kA/cm² Nb process. The bias voltage was optimized to obtain high energy efficiency. Although a lowered bias voltage makes timing design difficult, we solved the problem by precise timing control. The operating frequency reached 30 GHz. Thanks to these high-throughput and low-energy technologies, we realized highly energy-efficient operation over 100 tera-operations per second per watt (TOPS/W).

DOI： 10.1109/ISEC46533.2019.8990905
Critical Path Based Microarchitectural Bottleneck Analysis for Out-of-Order Execution.

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 102-A ( 6 ) 758 - 766 2019.6

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1587/transfun.E102.A.758
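An Integrated Evaluation Environment for Nanophotonic Neural Network Accelerators Reviewed International journal

Satoshi Kawakami, Takatsugu Ono, Koji Inoue, Masaya Notomi

IEICE Transactions (Japanese Edition) J102-A ( 6 ) 2019.6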

Language：Japanese Publishing type：Research paper (scientific journal)
Critical Path based Microarchitectural Bottleneck Analysis for Out-of-Order Execution Reviewed International journal

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

IEICE Transactions 2019.6

Language：English Publishing type：Research paper (scientific journal)
Hardware friendly algorithm for earthquakes discrimination based on wavelet filter bank and support vector machine

Omar M. Saad, Ahmed Shalaby, Inoue Koji, Mohammed S. Sayed

Proceedings of the 2018 Japan-Africa Conference on Electronics, Communications, and Computations, JAC-ECC 2018 115 - 118 2019.4

Language：English Publishing type：Research paper (other academic)

Discrimination between earthquakes and explosions is one of the main challenges in the field of seismology. In some cases, explosions are recorded as earthquakes or vice versa, which can contaminate the seismic catalog. Rapid discrimination is required to support real-time seismic applications. The discrimination algorithm is based on a wavelet filter bank to extract the discriminative features and a support vector machine (SVM) as a classifier. We therefore propose to optimize the hardware implementation of the discrimination algorithm on a Field Programmable Gate Array (FPGA). First, we implement the wavelet filter bank using an optimized lifting scheme. Then, we use a linear classifier to implement the SVM classifier. Finally, we optimize the hardware resources of the discrimination algorithm so that it fits on a low-cost FPGA board, the TE0711 (Xilinx Artix7). The implemented design utilizes 1.2% and 39.8% of the FPGA's Look Up Table (LUT) and register resources, respectively.
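
The pipeline named in the abstract above (a lifting-scheme wavelet filter bank feeding a linear SVM) can be sketched in software as follows. This is an illustrative sketch only: the abstract does not state which wavelet, features, or weights the authors use, so the Haar wavelet, the band-energy features, and the names `haar_lifting_step`, `band_energies`, and `linear_svm_predict` are all assumptions.

```python
from typing import List, Sequence, Tuple


def haar_lifting_step(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One level of the Haar wavelet transform expressed as lifting steps."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    evens = signal[0::2]
    odds = signal[1::2]
    # Predict step: the detail is the error of predicting each odd sample
    # from its even neighbour.
    detail = [o - e for e, o in zip(evens, odds)]
    # Update step: correct the even samples so they hold the pairwise mean.
    approx = [e + d / 2 for e, d in zip(evens, detail)]
    return approx, detail


def band_energies(signal: Sequence[float], levels: int) -> List[float]:
    """Energy of the detail band at each level (a hypothetical feature choice)."""
    features = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_lifting_step(current)
        features.append(sum(d * d for d in detail))
    return features


def linear_svm_predict(features: Sequence[float], weights: Sequence[float], bias: float) -> int:
    """Linear SVM decision: the sign of w . x + b, returned as +1 or -1."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1
```

Lifting needs only additions, subtractions, and a halving per sample pair, and the linear decision function is a single dot product followed by a sign check, which is consistent with the "hardware friendly" framing of the title.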

DOI： 10.1109/JEC-ECC.2018.8679531
Message from the Prof. Koji Inoue

Koji Inoue

2018 Japan-Africa Conference on Electronics, Communications, and Computations, JAC-ECC 2018 2018 Proceedings of the Japan-Africa Conference on Electronics, Communications, and Computations, JAC-ECC 2018 IV 2019.4

Language：English

DOI： 10.1109/JEC-ECC.2018.8679541
Improving lifetime in MLC phase change memory using slow writes

Takatsugu Ono, Zhe Chen, Koji Inoue

2018 Japan-Africa Conference on Electronics, Communications, and Computations, JAC-ECC 2018 2018 Proceedings of the Japan-Africa Conference on Electronics, Communications, and Computations, JAC-ECC 2018 65 - 68 2019.4

Language：English Publishing type：Research paper (other academic)

This paper reports the performance and endurance impacts of a slow-write approach for multi-level cell (MLC) phase change memory (PCM). An MLC improves the density of PCM, but endurance is a critical issue. A slow-write approach is one of the techniques used to extend the lifetime of the cell. However, the slow-write approach increases program execution time because each write takes longer. In this paper, we discuss three types of slow-write approach for MLC and evaluate their endurance and performance quantitatively to understand the effectiveness of our approach. Our evaluation results show that one of the approaches enhances the endurance of MLC PCM by 1.57 times with a 1.41% performance degradation on average compared with the conventional write operation.

DOI： 10.1109/JEC-ECC.2018.8679540
29.3 A 48GHz 5.6mW Gate-Level-Pipelined Multiplier Using Single-Flux Quantum Logic

Ikki Nagaoka, Masamitsu Tanaka, Koji Inoue, Akira Fujimaki

2019 IEEE International Solid-State Circuits Conference, ISSCC 2019 2019 IEEE International Solid-State Circuits Conference, ISSCC 2019 460 - 462 2019.3

Language：English Publishing type：Research paper (other academic)

A multiplier based on superconductor single-flux-quantum (SFQ) logic is demonstrated up to 48GHz with the measured power consumption of 5.6 mW. The multiplier performs 8 × 8-bit signed multiplication every clock cycle. The design is based on a bit-parallel, gate-level-pipelined structure that exploits ultimately high-throughput performance of SFQ logic. The test chip fabricated using a 1.0-μm, 9-layer process consists of 20,251 Nb/AlOx/Nb Josephson junctions (JJs). The correctness of operation is verified by on-chip high-speed testing.

DOI： 10.1109/ISSCC.2019.8662351
A 48GHz 5.6mW Gate-Level-Pipelined Multiplier Using Single-Flux Quantum Logic.

Ikki Nagaoka, Masamitsu Tanaka, Koji Inoue, Akira Fujimaki

IEEE International Solid-State Circuits Conference (ISSCC) 460 - 462 2019.2

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ISSCC.2019.8662351
Radio propagation characteristics-based spoofing attack prevention on wireless connected devices

Mihiro Sonoyama, Takatsugu Ono, Haruichi Kanaya, Osamu Muta, Smruti R. Sarangi, Koji Inoue

Journal of Information Processing 27 322 - 334 2019.1

Language：Others Publishing type：Research paper (scientific journal)

A spoofing attack is a critical issue in wireless communication in which a malicious transmitter outside a system attempts to pose as genuine. As a countermeasure, we propose a device-authentication method based on position identification using radio-propagation characteristics (RPCs). Because it does not depend on information processing such as encryption technology, this method can be applied to devices such as sensors, which commonly have many resource restrictions. We call the space from which attacks can succeed the “attack space.” To confine the attack space inside the target system and thereby prevent spoofing attacks from the outside, the relationship between combinations of transceivers and the attack space must be formulated. In this research, we consider two RPCs, the received signal strength ratio (RSSR) and the time difference of arrival (TDoA), and construct an attack-space model that uses these RPCs simultaneously. We take a tire pressure monitoring system (TPMS) as a case study of this method and perform a security evaluation based on radio-wave-propagation simulation. The simulation results, which assume multiple noise environments, all indicate that it is possible to eliminate the possibility of attack from a distant location.

DOI： 10.2197/ipsjjip.27.322
Radio Propagation Characteristics-Based Spoofing Attack Prevention on Wireless Connected Devices Reviewed International journal

Mihiro Sonoyama, Takatsugu Ono, Haruichi Kanaya, Osamu Muta, Smruti Sarangi, Koji Inoue

IPSJ ACS 2019.1

Language：English Publishing type：Research paper (scientific journal)
Performance Analysis of CPU and DRAM Power Constrained Systems with Magnetohydrodynamic Simulation Code

Keiichiro Fukazawa, Masatsugu Ueda, Yuichi Inadomi, Mutsumi Aoyagi, Takayuki Umeda, Koji Inoue

20th International Conference on High Performance Computing and Communications, 16th IEEE International Conference on Smart City and 4th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018 Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018 626 - 631 2019.1

Language：English Publishing type：Research paper (other academic)

The power consumption of supercomputer systems has become a critical issue in developing exascale supercomputers. On the other hand, application developers pay little attention to the power consumption characteristics of their applications because their main interest is how fast the applications run. In this study, we examine and evaluate the power consumption behavior of our magnetohydrodynamic simulation code, which solves the planetary magnetosphere, under constrained CPU and DRAM power on an x86 computer system. As a result, we found that, under power capping, some regions of the simulation code lose calculation performance while others are unaffected. This indicates that power can be optimized without performance degradation by applying dynamic power capping while the application runs. In addition, we identified the specific combinations of CPU and DRAM power consumption that greatly affect the calculation performance.

DOI： 10.1109/HPCC/SmartCity/DSS.2018.00113
Critical path based microarchitectural bottleneck analysis for out-of-order execution Reviewed

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E102A ( 6 ) 758 - 766 2019.1

Language：English Publishing type：Research paper (scientific journal)

Correctly understanding microarchitectural bottlenecks is important to optimize performance and energy of OoO (Out-of-Order) processors. Although CPI (Cycles Per Instruction) stack has been utilized for this purpose, it stacks architectural events heuristically by counting how many times the events occur, and the order of stacking affects the result, which may be misleading. It is because CPI stack does not consider the execution path of dynamic instructions. Critical path analysis (CPA) is a well-known method to identify the critical execution path of dynamic instruction execution on OoO processors. The critical path consists of the sequence of events that determines the execution time of a program on a certain processor. We develop a novel representation of CPCI stack (Cycles Per Critical Instruction stack), which is CPI stack based on CPA. The main challenge in constructing CPCI stack is how to analyze a large number of paths because CPA often results in numerous critical paths. In this paper, we show that there are more than ten to the tenth power critical paths in the execution of only one thousand instructions in 35 benchmarks out of 48 from SPEC CPU2006. Then, we propose a statistical method to analyze all the critical paths and show a case study using the benchmarks.

DOI： 10.1587/transfun.E102.A.758
Radio propagation characteristics-based spoofing attack prevention on wireless connected devices Reviewed

Mihiro Sonoyama, Takatsugu Ono, Haruichi Kanaya, Osamu Muta, Smruti R. Sarangi, Koji Inoue

Journal of information processing 27 322 - 334 2019

Language：English Publishing type：Research paper (scientific journal)

A spoofing attack is a critical issue in wireless communication in which a malicious transmitter outside a system attempts to pose as genuine. As a countermeasure, we propose a device-authentication method based on position identification using radio-propagation characteristics (RPCs). Because it does not depend on information processing such as encryption technology, this method can be applied to devices such as sensors, which commonly have many resource restrictions. We call the space from which attacks can succeed the “attack space.” To confine the attack space inside the target system and thereby prevent spoofing attacks from the outside, the relationship between combinations of transceivers and the attack space must be formulated. In this research, we consider two RPCs, the received signal strength ratio (RSSR) and the time difference of arrival (TDoA), and construct an attack-space model that uses these RPCs simultaneously. We take a tire pressure monitoring system (TPMS) as a case study of this method and perform a security evaluation based on radio-wave-propagation simulation. The simulation results, which assume multiple noise environments, all indicate that it is possible to eliminate the possibility of attack from a distant location.

DOI： 10.2197/ipsjjip.27.322
Parallel Precomputation with Input Value Prediction for Model Predictive Control Systems Reviewed International journal

Satoshi Kawakami, Takatsugu Ono, Toshiyuki Ohtsuka, Koji Inoue

2018.12

Language：English Publishing type：Research paper (scientific journal)
Situation-Based Dynamic Frame-Rate Control for On-Line Object Tracking Reviewed

Yusuke Inoue, Takatsugu Ono, Koji Inoue

International Japan-Africa Conference on Electronics, Communications and Computations 129 - 132 2018.12

Language：English Publishing type：Research paper (other academic)

DOI： 10.1109/jec-ecc.2018.8679545
Improving Lifetime in MLC Phase Change Memory Using Slow Writes Reviewed

Takatsugu Ono, Zhe Chen, Koji Inoue

2018 International Japan-Africa Conference on Electronics, Communications and Computations (JAC-ECC) 65 - 68 2018.12

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/jec-ecc.2018.8679540
Parallel Precomputation with Input Value Prediction for Model Predictive Control Systems.

Satoshi Kawakami, Takatsugu Ono, Toshiyuki Ohtsuka, Koji Inoue

IEICE Transactions on Information & Systems 101-D ( 12 ) 2864 - 2877 2018.12

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1587/transinf.2018PAP0003
Real-Time Frame-Rate Control for Energy-Efficient On-Line Object Tracking

Yusuke INOUE, Takatsugu ONO, Koji INOUE

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E101.A ( 12 ) 2297 - 2307 2018.12

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1587/transfun.e101.a.2297
Real-Time frame-rate control for energy-efficient on-line object tracking Reviewed

Yusuke Inoue, Takatsugu Ono, Koji Inoue

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E101A ( 12 ) 2297 - 2307 2018.12

Language：English Publishing type：Research paper (scientific journal)

On-line object tracking (OLOT) has been a core technology in computer vision, and its importance has been increasing rapidly. Because this technology is utilized in battery-operated products, energy consumption must be minimized. This paper describes a method of adaptive frame-rate optimization to satisfy that requirement. An energy trade-off occurs between image capturing and object tracking. Therefore, the method optimizes the frame rate based on the continually changing object speed to minimize the total energy while taking the trade-off into account. Simulation results show a maximum energy reduction of 50.0% and an average reduction of 35.9% without serious degradation of tracking accuracy.

DOI： 10.1587/transfun.E101.A.2297
Power management framework for post-petascale supercomputers

Masaaki Kondo, Ikuo Miyoshi, Koji Inoue, Shinobu Miwa

Advanced Software Technologies for Post-Peta Scale Computing The Japanese Post-Peta CREST Research Project 249 - 269 2018.12

Language：English

Power consumption is a first-class design constraint for developing future exascale computing systems. To achieve exascale system performance with realistic power provisioning of 20-30 MW, we need to improve power-performance efficiency significantly compared to today's supercomputer systems. To maximize effective performance within a power constraint, it is necessary to investigate how to optimize the allocation of power resources to each hardware component or each job submitted to the system. We have been conducting research and development on a software framework for code optimization and system power management for power-constraint adaptive systems. We briefly introduce our research efforts on maximizing application performance under a given power constraint, a power-aware resource manager, and a power-performance simulation and analysis framework for future supercomputer systems.

DOI： 10.1007/978-981-13-1924-2_13
Parallel precomputation with input value prediction for model predictive control systems Reviewed

Satoshi Kawakami, Takatsugu Ono, Toshiyuki Ohtsuka, Koji Inoue

IEICE Transactions on Information and Systems E101D ( 12 ) 2864 - 2877 2018.12

Language：English Publishing type：Research paper (scientific journal)

We propose a parallel precomputation method for real-time model predictive control. The key idea is to use predicted input values produced by model predictive control to solve an optimal control problem in advance. It is well known that control systems are not suitable for multi- or many-core processors because feedback-loop control systems are inherently based on sequential operations. However, since the proposed method does not rely on conventional thread-/data-level parallelism, it can be easily applied to such control systems without changing the algorithm in applications. A practical evaluation using three real-world model predictive control system simulation programs demonstrates that the proposed method offers drastic performance improvement without degrading control quality.

DOI： 10.1587/transinf.2018PAP0003
Real-time Frame-Rate Control for Energy-Efficient On-Line Object Tracking Invited Reviewed International journal

Yusuke Inoue, Takatsugu Ono, Koji Inoue

IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences 2018.12

Language：English Publishing type：Research paper (scientific journal)
Automatic Arrival Time Detection for Earthquakes Based on Stacked Denoising Autoencoder Reviewed

Omar M. Saad, Koji Inoue, Ahmed Shalaby, Lotfy Samy, Mohammed S. Sayed

IEEE Geoscience and Remote Sensing Letters 15 ( 11 ) 1687 - 1691 2018.11

Language：English Publishing type：Research paper (scientific journal)

The accurate detection of P-wave arrival time is imperative for determining the hypocenter location of an earthquake. However, precise detection of the onset time becomes more difficult when the signal-to-noise ratio (SNR) of the seismic data is low, such as during microearthquakes. In this letter, a stacked denoising autoencoder (SDAE) is proposed to smooth the background noise. The SDAE acts as a denoising filter for the seismic data. In the proposed algorithm, the SDAE is utilized to reduce background noise so that the onset time becomes clearer and sharper. Afterward, a hard decision with one threshold is used to detect the onset time of the event. The proposed algorithm is evaluated on both synthetic and field seismic data. As a result, the proposed algorithm outperforms the short-time average/long-time average and the Akaike information criterion algorithms. The proposed algorithm accurately picks the onset time for 94.1% of 407 field seismic waveforms with a standard deviation error of 0.10 s. In addition, the results indicate that the proposed algorithm can pick arrival times accurately for weak-SNR seismic data with SNR higher than -14 dB.

DOI： 10.1109/LGRS.2018.2861218
Evaluating Energy-Efficiency of DRAM Channel Interleaving Schemes for Multithreaded Programs Invited Reviewed International journal

Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

IEICE Transactions on Information and Systems 2018.9

Language：English Publishing type：Research paper (scientific journal)
Optical WDM parallel adder based on optical pass gate logic (2) ~ Experimental study using thermo-optic switch ~

Akihiko Shinya, Tohru Ishihara, Kengo Nozaki, Shota Kita, Koji Inoue, Cong Guangwei, Koji Yamada, Masaya Notomi

Extended Abstracts of the Japan Society of Applied Physics (JSAP) Meeting 2018.2 934 - 934 2018.9

Language：Japanese

DOI： 10.11470/jsapmeeting.2018.2.0_934
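Design of Component Circuits for a 50 GHz Bit-Parallel Microprocessor Based on Superconducting Single-Flux-Quantum Circuits Invited Reviewed

Masamitsu Tanaka, Ryo Sato, Koki Ishida, Yuki Hatanaka, Yuichi Matsui, Takatsugu Ono, Koji Inoue, Akira Fujimaki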

2018.9

Language：Japanese
Evaluating Energy-Efficiency of DRAM Channel Interleaving Schemes for Multithreaded Programs

Satoshi IMAMURA, Yuichiro YASUI, Koji INOUE, Takatsugu ONO, Hiroshi SASAKI, Katsuki FUJISAWA

IEICE Transactions on Information and Systems E101.D ( 9 ) 2247 - 2257 2018.9

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1587/transinf.2017edp7296
Evaluating energy-efficiency of DRAM channel interleaving schemes for multithreaded programs Reviewed

Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

IEICE Transactions on Information and Systems E101D ( 9 ) 2247 - 2257 2018.9

Language：English Publishing type：Research paper (scientific journal)

The power consumption of server platforms has been increasing as the amount of hardware resources equipped on them is increased. Especially, the capacity of DRAM continues to grow, and it is not rare that DRAM consumes higher power than processors on modern servers. Therefore, a reduction in the DRAM energy consumption is a critical challenge to reduce the system-level energy consumption. Although it is well known that improving row buffer locality (RBL) and bank-level parallelism (BLP) is effective to reduce the DRAM energy consumption, our preliminary evaluation on a real server demonstrates that RBL is generally low across 15 multithreaded benchmarks. In this paper, we investigate the memory access patterns of these benchmarks using a simulator and observe that cache line-grained channel interleaving schemes, which are widely applied to modern servers including multiple memory channels, hurt the RBL each of the benchmarks potentially possesses. In order to address this problem, we focus on a row-grained channel interleaving scheme and compare it with three cache line-grained schemes. Our evaluation shows that it reduces the DRAM energy consumption by 16.7%, 12.3%, and 5.5% on average (up to 34.7%, 28.2%, and 12.0%) compared to the other schemes, respectively.

DOI： 10.1587/transinf.2017EDP7296
Autoencoder based Features Extraction for Automatic Classification of Earthquakes and Explosions

Omar M. Saad, Koji Inoue, Ahmed Shalaby, Lotfy Samy, Mohammed S. Sayed

17th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2018 Proceedings - 17th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2018 445 - 450 2018.9

Language：English Publishing type：Research paper (other academic)

Monitoring illegal explosions is essential for the safety of human life and the environment and for protecting important structures such as the High Dam in Egypt. This kind of monitoring can be accomplished by detecting and identifying explosions. If an illegal explosion such as a quarry blast occurs, an alarm should be reported to the government so that it can take immediate action. However, the main problem is that many signals measured from explosions are similar in shape to those of earthquakes, and the two cannot easily be differentiated from each other. In addition, incorrect classification may distort the real seismicity of the region. This problem motivates us to search for unique discriminating features that distinguish earthquakes from explosions with high accuracy. Therefore, in this paper, we propose to extract the discriminative features with an autoencoder from the first few seconds after the P-wave arrival time of the event. The discriminative features are found to be in the first 60 samples after the arrival time of the P-wave. Thus, the first stage of the proposed algorithm extracts the discriminative features via the autoencoder. Then, a softmax classifier classifies the event based on these extracted features. The proposed algorithm achieves a classification accuracy of 98.55% when applied to 900 earthquake and quarry blast waveforms recorded by the Egyptian National Seismic Network (ENSN).

DOI： 10.1109/ICIS.2018.8466464
Analyzing resource trade-offs in hardware overprovisioned supercomputers

Ryuichi Sakamoto, Tapasya Patki, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Daniel Ellsworth, Barry Rountree, Martin Schulz

32nd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018 Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018 526 - 535 2018.8

Language：English Publishing type：Research paper (other academic)

Hardware overprovisioned systems have recently been proposed as a viable alternative for a power-efficient design of next-generation supercomputers. A key challenge for such systems is to determine the degree of overprovisioning, which refers to the number of extra nodes that need to be installed under a given power constraint. In this paper, we first show that the degree of overprovisioning depends on dynamic parameters, such as the job mix as well as the global power constraint, and that static decisions can result in limited system throughput. We then study an exhaustive combination of adaptive resource management strategies that span three job scheduling algorithms, four power capping techniques, and three node boot-up mechanisms to understand the trade-off space involved. We then draw conclusions about how these strategies can adaptively control the degree of overprovisioning and analyze their impact on job throughput and power utilization.

DOI： 10.1109/IPDPS.2018.00062
Automatic Arrival Time Detection for Earthquakes Based on Stacked Denoising Autoencoder.

Omar M. Saad, Koji Inoue, Ahmed Shalaby, Lotfy Samy, Mohammed Sharaf Sayed

IEEE Geoscience and Remote Sensing Letters 15 ( 11 ) 1687 - 1691 2018.8

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1109/LGRS.2018.2861218
VMOR: Microarchitectural Support for Operand Access in an Interpreter.

Susumu Mashimo, Ryota Shioya, Koji Inoue

IEEE Computer Architecture Letters 17 ( 2 ) 217 - 220 2018.8

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1109/LCA.2018.2866243
An Integrated Nanophotonic Parallel Adder Reviewed International journal

Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, and Masaya Notomi

ACM Journal on Emerging Technologies in Computing Systems (JETC) 2018.7

Language：English Publishing type：Research paper (scientific journal)
VMOR: Microarchitectural Support for Operand Access in an Interpreter Reviewed International journal

Mashimo, Susumu; Shioya, Ryota; Inoue, Koji

IEEE COMPUTER ARCHITECTURE LETTERS 17 ( 2 ) 217 - 220 2018.7

Language：English Publishing type：Research paper (scientific journal)

DOI： 10.1109/LCA.2018.2866243
Ultralow-latency optical circuit based on optical pass gate logic Reviewed

Akihiko Shinya, Kengo Nozaki, Masaya Notomi, Tohru Ishihara, Koji Inoue

NTT Technical Review 16 ( 7 ) 33 - 38 2018.7

Language：English

A novel light speed computing technology has been developed by NTT, Kyoto University, and Kyushu University that employs nanophotonic technology in critical paths and thus overcomes the problem of operational latency that is the chief limiting factor in conventional electronic circuits. The ultimate objective of this work is to develop an ultrahigh-speed optoelectronic arithmetic processor. This article provides an overview of our recent work and describes the successful implementation of this novel optical computing technology.
An Integrated Nanophotonic Parallel Adder.

Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, Masaya Notomi

ACM Journal on Emerging Technologies in Computing Systems 14 ( 2 ) 26:1 - 26:20 2018.7

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1145/3178452
An Integrated Nanophotonic Parallel Adder Reviewed International journal

Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, and Masaya Notomi

ACM Journal on Emerging Technologies in Computing Systems (JETC) Volume 14 ( Issue 2, Article No. 26 ) 26:1 - 26:20 2018.6

Language：English Publishing type：Research paper (scientific journal)
Performance Analysis of CPU and DRAM Power Constrained Systems with Magnetohydrodynamic Simulation Code.

Keiichiro Fukazawa, Masatsugu Ueda, Yuichi Inadomi, Mutsumi Aoyagi, Takayuki Umeda, Koji Inoue

20th IEEE International Conference on High Performance Computing and Communications; 16th IEEE International Conference on Smart City; 4th IEEE International Conference on Data Science and Systems (HPCC/SmartCity/DSS) 626 - 631 2018.6

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/HPCC/SmartCity/DSS.2018.00113
Autoencoder based Features Extraction for Automatic Classification of Earthquakes and Explosions.

Omar M. Saad, Koji Inoue, Ahmed Shalaby, Lotfy Samy, Mohammed Sharaf Sayed

17th IEEE/ACIS International Conference on Computer and Information Science (ICIS) 445 - 450 2018.6

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ICIS.2018.8466464
Towards Ultra High-Speed Cryogenic Single-Flux-Quantum Computing Invited Reviewed International journal

Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Koji Inoue

IEICE Transactions on Electronics 2018.5

Language：English Publishing type：Research paper (scientific journal)
Analyzing Resource Trade-offs in Hardware Overprovisioned Supercomputers.

Ryuichi Sakamoto, Tapasya Patki, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Daniel A. Ellsworth, Barry Rountree, Martin Schulz

2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 526 - 535 2018.5

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/IPDPS.2018.00062
Towards Ultra-High-Speed Cryogenic Single-Flux-Quantum Computing Invited Reviewed International journal

Ishida, Koki; Tanaka, Masamitsu; Ono, Takatsugu; Inoue, Koji

IEICE TRANSACTIONS ON ELECTRONICS E101C ( 5 ) 359 - 369 2018.5

Language：English Publishing type：Research paper (scientific journal)

DOI： 10.1587/transele.E101.C.359
Towards Ultra-High-Speed Cryogenic Single-Flux-Quantum Computing.

Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Koji Inoue

IEICE Transactions on Electronics 101-C ( 5 ) 359 - 369 2018.5

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1587/transele.E101.C.359
CPCI Stack Metric for Accurate Bottleneck Analysis on OoO Microprocessors

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

5th International Symposium on Computing and Networking, CANDAR 2017 Proceedings - 2017 5th International Symposium on Computing and Networking, CANDAR 2017 166 - 172 2018.4

Language：English Publishing type：Research paper (other academic)

Correctly understanding microarchitectural bottlenecks is important to optimize performance and energy of OoO (Out-of-Order) processors. Although CPI (Cycles Per Instruction) stack has been utilized for this purpose, it stacks architectural events heuristically by counting how many times the events occur, and the order of stacking affects the result, which may be misleading. It is because CPI stack does not consider the execution path of dynamic instructions. Critical path analysis (CPA) is a well-known method to identify the critical execution path of dynamic instruction execution on OoO processors. The critical path consists of the sequence of events that determines the execution time of a program on a certain processor. We develop a novel representation of CPCI stack (Cycles Per Critical Instruction stack), which is CPI stack based on CPA. The main challenge in constructing CPCI stack is how to analyze a large number of paths because CPA often results in numerous critical paths. In this paper, we show that there are more than ten to the tenth power critical paths in the execution of only one thousand instructions in 35 benchmarks out of 48 from SPEC CPU2006. Then, we propose a statistical method to analyze all the critical paths and show a case study using the benchmarks.

DOI： 10.1109/CANDAR.2017.60
Wireless Spoofing-Attack Prevention Using Radio-Propagation Characteristics

Mihiro Sonoyama, Takatsugu Ono, Osamu Muta, Haruichi Kanaya, Koji Inoue

15th IEEE International Conference on Dependable, Autonomic and Secure Computing, 2017 IEEE 15th International Conference on Pervasive Intelligence and Computing, 2017 IEEE 3rd International Conference on Big Data Intelligence and Computing and 2017 IEEE Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec 2017 Proceedings - 2017 IEEE 15th International Conference on Dependable, Autonomic and Secure Computing, 2017 IEEE 15th International Conference on Pervasive Intelligence and Computing, 2017 IEEE 3rd International Conference on Big Data Intelligence and Computing and 2017 IEEE Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec 2017 502 - 510 2018.3

Language：English Publishing type：Research paper (other academic)

A spoofing attack is a critical issue in wireless communication in embedded systems in which a malicious transmitter outside a system attempts to pose as genuine. As a countermeasure, we propose a device-authentication method based on position identification using radio-propagation characteristics (RPCs). Since RPCs are natural phenomena, this method does not depend on information processing such as encryption technology. We call the space from which attacks can succeed the "attack space". By formulating the relationship between combinations of transceivers and the attack space, this method can be used in embedded systems. In this research, we consider two RPCs, the received signal strength ratio (RSSR) and the time difference of arrival (TDoA), and construct an attack-space model that uses these RPCs simultaneously to prevent wireless spoofing attacks. We explain the results of a validity evaluation of the proposed model based on radio-wave-propagation simulation assuming free space and a noisy environment.

DOI： 10.1109/DASC-PICom-DataCom-CyberSciTec.2017.94
Wireless Spoofing-Attack Prevention Using Radio-Propagation Characteristics

Mihiro Sonoyama, Takatsugu Ono, Osamu Muta, Haruichi Kanaya, Koji Inoue

Proceedings - 2017 IEEE 15th International Conference on Dependable, Autonomic and Secure Computing, 2017 IEEE 15th International Conference on Pervasive Intelligence and Computing, 2017 IEEE 3rd International Conference on Big Data Intelligence and Computing and 2017 IEEE Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec 2017 2018-January 502 - 510 2018.3

Language：Others Publishing type：Research paper (other academic)

© 2017 IEEE. A spoofing attack, in which a malicious transmitter outside a system attempts to pass as genuine, is a critical issue for wireless communication in embedded systems. As a countermeasure, we propose a device-authentication method based on position identification using radio-propagation characteristics (RPCs). Since RPCs are natural phenomena, this method does not depend on information processing such as encryption technology. We call the space from which attacks succeed the "attack space". By formulating the relationship between combinations of transceivers and the attack space, this method can be used in embedded systems. In this research, we consider two RPCs, the received signal strength ratio (RSSR) and the time difference of arrival (TDoA), and construct an attack-space model that uses these RPCs simultaneously to prevent wireless spoofing attacks. We explain the results of a validity evaluation of the proposed model based on radio-wave-propagation simulation assuming free space and a noisy environment.

DOI： 10.1109/DASC-PICom-DataCom-CyberSciTec.2017.94
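The two RPCs named in the abstract can be illustrated with a minimal free-space sketch. It assumes received power falls off as 1/d² and that signals travel at the speed of light; the function names, the two-receiver setup, and the tolerance parameters are hypothetical and are not taken from the paper.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def rssr(tx, rx_a, rx_b):
    """Received signal strength ratio P_a / P_b, assuming free-space power ~ 1/d^2."""
    return (math.dist(tx, rx_b) / math.dist(tx, rx_a)) ** 2


def tdoa(tx, rx_a, rx_b):
    """Time difference of arrival t_a - t_b in seconds."""
    return (math.dist(tx, rx_a) - math.dist(tx, rx_b)) / C


def in_attack_space(candidate, genuine, rx_a, rx_b, rssr_tol, tdoa_tol):
    """True if a transmitter at `candidate` matches `genuine` in both RPCs.

    The tolerances are hypothetical stand-ins for measurement noise.
    """
    same_rssr = abs(rssr(candidate, rx_a, rx_b) - rssr(genuine, rx_a, rx_b)) <= rssr_tol
    same_tdoa = abs(tdoa(candidate, rx_a, rx_b) - tdoa(genuine, rx_a, rx_b)) <= tdoa_tol
    return same_rssr and same_tdoa
```

In this toy model, the attack space is the set of candidate positions for which `in_attack_space` returns `True`; requiring both RPCs to match can only shrink that set compared with checking either one alone.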
Low-latency optical parallel adder based on a binary decision diagram with wavelength division multiplexing scheme

A. Shinya, T. Ishihara, K. Inoue, K. Nozaki, S. Kita, M. Notomi

Optical Data Science: Trends Shaping the Future of Photonics 2018 Optical Data Science Trends Shaping the Future of Photonics 2018.1

Language：English Publishing type：Research paper (other academic)

We propose an optical parallel adder based on a binary decision diagram that can calculate simply by propagating light through electrically controlled optical pass gates. The CARRY and CARRY operations are multiplexed in one circuit by a wavelength division multiplexing scheme to reduce the number of optical elements, and only a single gate constitutes the critical path for one digit calculation. The processing time reaches picoseconds per digit when we use 100-μm-long optical pass gates, which is ten times faster than a CMOS circuit.

DOI： 10.1117/12.2296842
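As a software-only illustration of addition driven by a binary decision diagram, the sketch below evaluates each digit's carry by branching on the input bits: equal inputs decide the carry directly, and unequal inputs let the incoming carry pass through, loosely analogous to an open pass gate. It does not model the optical circuit or the wavelength multiplexing, and all function names are hypothetical.

```python
def bdd_carry(a, b, carry_in):
    """Carry-out of one digit, evaluated as decisions on a, then b."""
    if a == 1:
        return 1 if b == 1 else carry_in
    return carry_in if b == 1 else 0


def bdd_sum(a, b, carry_in):
    """Sum bit of one digit: equal inputs pass the carry, unequal inputs invert it."""
    return carry_in if a == b else 1 - carry_in


def bdd_add(x, y, width):
    """Add two `width`-bit integers digit by digit; returns (sum, carry_out)."""
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        result |= bdd_sum(a, b, carry) << i
        carry = bdd_carry(a, b, carry)
    return result, carry
```

The carry takes a single decision step per digit once the input bits have set the branches, which mirrors the abstract's point that only one gate sits on the critical path per digit.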
Ultralow latency computation based on integrated nanophotonics

Masaya Notomi, Kengo Nozaki, Shota Kita, Akihiko Shinya, Tohru Ishihara, Koji Inoue

JSAP-OSA Joint Symposia, JSAP 2018 JSAP-OSA Joint Symposia, JSAP 2018 2018.1

Language：English Publishing type：Research paper (other academic)

Moore's law for CMOS computers still holds, but its near-future saturation is now being discussed. One serious saturation concerns latency. The computation delay of a CMOS transistor has already saturated above 10 ps, which will be problematic when an ultralow-latency response is required for broadband data streams, even with parallelization or pipeline processing. We believe that optical circuits may serve as ultralow-latency computation circuits if they are small enough and tightly combined with electronic circuits. The former requires nanophotonic devices/circuits and the latter requires OE/EO conversion with ultrasmall capacitance.
Dependence Graph Model for Accurate Critical Path Analysis on Out-of-Order Processors

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

Journal of Information Processing 2017.12

Language：English
Dependence Graph Model for Accurate Critical Path Analysis on Out-of-Order Processors.

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

Journal of Information Processing 25 983 - 992 2017.12

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.2197/ipsjjip.25.983
CPCI Stack: Metric for Accurate Bottleneck Analysis on OoO Microprocessors

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

2017 Fifth International Symposium on Computing and Networking (CANDAR) 166 - 172 2017.11

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/candar.2017.60
Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework

Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Tapasya Patki, Daniel Ellsworth, Barry Rountree, Martin Schulz

31st IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017 Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017 957 - 966 2017.6

Language：English Publishing type：Research paper (other academic)

Limited power budgets will be one of the biggest challenges for deploying future exascale supercomputers. One of the promising ways to deal with this challenge is hardware overprovisioning, that is, installing more hardware resources than can be fully powered under a given power limit, coupled with software mechanisms to steer the limited power to where it is needed most. Prior research has demonstrated the viability of this approach, but could only rely on small-scale simulations of the software stack. While such research is useful to understand the boundaries of performance benefits that can be achieved, it does not cover any deployment or operational concerns of using overprovisioning on production systems. This paper is the first to present an extensible power-aware resource management framework for production-sized overprovisioned systems based on the widely established SLURM resource manager. Our framework provides flexible plugin interfaces and APIs for power management that can be easily extended to implement site-specific strategies and for comparison of different power management techniques. We demonstrate our framework on a 965-node HA8000 production system at Kyushu University. Our results indicate that it is indeed possible to safely overprovision hardware in production. We also find that the power consumption of idle nodes, which depends on the degree of overprovisioning, can become a bottleneck. Using real-world data, we then draw conclusions about the impact of the total number of nodes provided in an overprovisioned environment.

DOI： 10.1109/IPDPS.2017.107
Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework.

Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Tapasya Patki, Daniel A. Ellsworth, Barry Rountree, Martin Schulz

2017 IEEE International Parallel and Distributed Processing Symposium(IPDPS) 957 - 966 2017.5

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/IPDPS.2017.107
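Architecture Exploration of Microprocessors for Single-Flux-Quantum Circuits (単一磁束量子回路向けマイクロプロセッサのアーキテクチャ探索)

Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Koji Inoue

IPSJ Journal (情報処理学会論文誌) 2017.3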

Language：Japanese
Enhanced Dependence Graph Model for Critical Path Analysis on Modern Out-of-Order Processors.

Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Hiroshi Sasaki

IEEE Computer Architecture Letters 16 ( 2 ) 111 - 114 2017.3

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1109/LCA.2017.2684813
Enhanced Dependence Graph Model for Critical Path Analysis on Modern Out-of-Order Processors

Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Hiroshi Sasaki

IEEE Computer Architecture Letters 2017.3

Language：English
Power-Efficient Breadth-First Search with DRAM Row Buffer Locality-Aware Address Mapping

Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

2016 High Performance Graph Data Management and Processing, HPGDMP 2016 Proceedings of HPGDMP 2016 High Performance Graph Data Management and Processing - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis 17 - 24 2017.1

Language：English Publishing type：Research paper (other academic)

Graph analysis applications have been widely used in real services such as road-traffic analysis and social network services. Breadth-first search (BFS) is one of the most representative algorithms for such applications; therefore, many researchers have tuned it to maximize performance. On the other hand, owing to the strict power constraints of modern HPC systems, it is necessary to improve power efficiency (i.e., performance per watt) when executing BFS. In this work, we focus on the power efficiency of DRAM and investigate the memory access pattern of a state-of-the-art BFS implementation using a cycle-accurate processor simulator. The results reveal that the conventional address mapping schemes of modern memory controllers do not efficiently exploit row buffers in DRAM. Thus, we propose a new scheme called per-row channel interleaving and improve the DRAM power efficiency by 30.3% compared to a conventional scheme for a certain simulator setting. Moreover, we demonstrate that this proposed scheme is effective for various configurations of memory controllers.

DOI： 10.1109/HPGDMP.2016.010
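To make the idea of an address mapping scheme concrete, the sketch below contrasts two ways of choosing the DRAM channel from a physical address. It is one illustrative reading of the name "per-row channel interleaving" and not the paper's exact bit layout; the line size, row size, channel count, and function names are all assumptions, and banks and ranks are ignored.

```python
LINE_BYTES = 64       # assumed cache-line size
LINES_PER_ROW = 128   # assumed 8 KiB DRAM row
NUM_CHANNELS = 2      # assumed channel count


def line_interleaved(addr):
    """Channel changes on every cache line (a common conventional choice)."""
    line = addr // LINE_BYTES
    channel = line % NUM_CHANNELS
    per_channel_line = line // NUM_CHANNELS
    return channel, per_channel_line // LINES_PER_ROW, per_channel_line % LINES_PER_ROW


def row_interleaved(addr):
    """Channel changes only after a whole row's worth of lines."""
    line = addr // LINE_BYTES
    row_block = line // LINES_PER_ROW
    return row_block % NUM_CHANNELS, row_block // NUM_CHANNELS, line % LINES_PER_ROW


def rows_touched(mapping, start, size):
    """Distinct (channel, row) pairs opened by a contiguous region."""
    return {mapping(addr)[:2] for addr in range(start, start + size, LINE_BYTES)}
```

With these assumed parameters, a contiguous 8 KiB region opens rows in both channels under `line_interleaved` but stays within a single row of one channel under `row_interleaved`, which is the kind of row-buffer locality the abstract refers to.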
Preface Reviewed

Jens Knoop, Wolfgang Karl, Martin Schulz, Koji Inoue

30th International Conference on Architecture of Computing Systems, ARCS 2017 Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 10172 LNCS 2017.1

Language：English
An integrated optical parallel adder as a first step towards light speed data processing

Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, Masaya Notomi

13th International SoC Design Conference, ISOCC 2016 ISOCC 2016 - International SoC Design Conference Smart SoC for Intelligent Things 123 - 124 2016.12

Language：English Publishing type：Research paper (other academic)

Integrated optical circuits with nanophotonic devices have attracted significant attention due to their low power dissipation and light-speed operation. With light interference and resonance phenomena, the nanophotonic device works as a voltage-controlled optical pass-gate like a pass-transistor. This paper first introduces a concept of the optical pass-gate logic, and then proposes a parallel adder circuit based on the optical pass-gate logic. Experimental results obtained with an optoelectronic circuit simulator show advantages of our optical parallel adder circuit over a traditional CMOS-based parallel adder circuit.

DOI： 10.1109/ISOCC.2016.7799721
Evaluating the impacts of code-level performance tunings on power efficiency

Satoshi Imamura, Keitaro Oka, Yuichiro Yasui, Yuichi Inadomi, Katsuki Fujisawa, Toshio Endo, Koji Ueno, Keiichiro Fukazawa, Nozomi Hata, Yuta Kakibuka, Koji Inoue, Takatsugu Ono

2016 IEEE International Conference on Big Data (Big Data) 362 - 369 2016.12

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/bigdata.2016.7840624
Single-flux-quantum cache memory architecture

Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Koji Inoue

13th International SoC Design Conference, ISOCC 2016 ISOCC 2016 - International SoC Design Conference Smart SoC for Intelligent Things 105 - 106 2016.12

Language：English Publishing type：Research paper (other academic)

Single-flux-quantum (SFQ) logic is a promising technology for realizing microprocessors that operate at over 100 GHz, owing to its ultra-high-speed and ultra-low-power nature. Although previous work has demonstrated a prototype of an SFQ microprocessor, the SFQ-based L1 cache memory has not been well optimized: it suffers from a large access latency and strictly limited scalability. This paper proposes a novel SFQ cache architecture to support fast accesses. The sub-arrayed structure applied to the cache provides better scalability in terms of capacity. Evaluation results show that the proposed cache achieves 1.8X faster access.

DOI： 10.1109/ISOCC.2016.7799755
Power-Efficient Breadth-First Search with DRAM Row Buffer Locality-Aware Address Mapping

Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

2016 High Performance Graph Data Management and Processing Workshop (HPGDMP) 17 - 24 2016.11

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/hpgdmp.2016.010
Accuracy analysis of machine learning-based performance modeling for microprocessors

Yoshihiro Tanaka, Keitaro Oka, Takatsugu Ono, Koji Inoue

4th International Japan-Egypt Conference on Electronic, Communication and Computers, JEC-ECC 2016 Proceedings of the 2016 4th International Japan-Egypt Conference on Electronic, Communication and Computers, JEC-ECC 2016 83 - 86 2016.7

Language：English Publishing type：Research paper (other academic)

This paper analyzes the accuracy of performance models generated by a machine learning-based empirical modeling methodology. Although the accuracy strongly depends on the quality of the learning procedure, it is not clear what kind of learning algorithms and training data set (or features) should be used. This paper inclusively explores the learning space of processor performance modeling as a case study. We focus on static architectural parameters, such as cache size and clock frequency, as the training data set. Experimental results show that tree-based non-linear regression modeling is superior to stepwise linear regression modeling. Another observation is that clock frequency is the most important feature for improving prediction accuracy.

DOI： 10.1109/JEC-ECC.2016.7518973
An integrated optical parallel adder as a first step towards light speed data processing.

Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, Masaya Notomi

International SoC Design Conference(ISOCC) 123 - 124 2016.7

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ISOCC.2016.7799721
From FLOPS to BYTES: Disruptive change in high-performance computing towards the post-Moore era

Satoshi Matsuoka, Hideharu Amano, Kengo Nakajima, Koji Inoue, Tomohiro Kudoh, Naoya Maruyama, Kenjiro Taura, Takeshi Iwashita, Takahiro Katagiri, Toshihiro Hanawa, Toshio Endo

ACM International Conference on Computing Frontiers, CF 2016 2016 ACM International Conference on Computing Frontiers - Proceedings 274 - 281 2016.5

Language：English Publishing type：Research paper (other academic)

The slowdown and inevitable end of exponential scaling of processor performance, the end of the so-called "Moore's Law", is predicted to occur around the 2025-2030 timeframe. Because CMOS semiconductor voltage is also approaching its limits, this means that logic transistor power will become constant, and as a result, the system FLOPS will cease to improve, resulting in serious consequences for IT in general, especially supercomputing. Existing attempts to overcome the end of Moore's law are rather limited in their future outlook or applicability. We claim that data-oriented parameters, such as bandwidth and capacity, or BYTES, are the new parameters that will allow continued performance gains for periods even after computing performance or FLOPS ceases to improve, due to continued advances in storage device technologies and optics, and manufacturing technologies including 3-D packaging. Such a transition from FLOPS to BYTES will lead to disruptive changes in the overall systems, from applications, algorithms, and software to architecture, as to what parameter to optimize for in order to achieve continued performance growth over time. We are launching a new set of research efforts to investigate and devise new technologies to enable such disruptive changes from FLOPS to BYTES in the Post-Moore era, focusing on HPC, where there is extreme sensitivity to performance, and expect the results to disseminate to the rest of IT.

DOI： 10.1145/2903150.2906830
From FLOPS to BYTES: disruptive change in high-performance computing towards the post-moore era.

Satoshi Matsuoka, Hideharu Amano, Kengo Nakajima, Koji Inoue, Tomohiro Kudoh, Naoya Maruyama, Kenjiro Taura, Takeshi Iwashita, Takahiro Katagiri, Toshihiro Hanawa, Toshio Endo

Proceedings of the ACM International Conference on Computing Frontiers 274 - 281 2016.5

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1145/2903150.2906830
Accuracy analysis of machine learning-based performance modeling for microprocessors

Yoshihiro Tanaka, Keitaro Oka, Takatsugu Ono, Koji Inoue

2016 Fourth International Japan-Egypt Conference on Electronics, Communications and Computers (JEC-ECC) 2016.5

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/jec-ecc.2016.7518973
Evaluating the impacts of code-level performance tunings on power efficiency

Satoshi Imamura, Keitaro Oka, Yuichiro Yasui, Yuichi Inadomi, Katsuki Fujisawa, Toshio Endo, Koji Ueno, Keiichiro Fukazawa, Nozomi Hata, Yuta Kakibuka, Koji Inoue, Takatsugu Ono

4th IEEE International Conference on Big Data, Big Data 2016 Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016 362 - 369 2016.1

Language：English Publishing type：Research paper (other academic)

As the power consumption of HPC systems will be a primary constraint for exascale computing, maximizing power efficiency (i.e., performance per watt) rather than performance has recently become a main objective in HPC communities. Although programmers have spent considerable effort to improve performance by tuning HPC programs at the code level, tunings for improving power efficiency are now required. In this work, we select two representative HPC programs (Graph500 and SDPARA) and evaluate how traditional code-level performance tunings applied to these programs affect power efficiency. We also investigate the impacts of the tunings on power efficiency at various operating frequencies of CPUs and/or GPUs. The results show that the tunings significantly improve power efficiency, and different types of tunings exhibit different trends in power efficiency as the CPU frequency varies. Finally, the scalability and power efficiency of state-of-the-art Graph500 implementations are explored on both a single-node platform and a 960-node supercomputer. With their high scalability, they achieve 27.43 MTEPS/Watt with 129.76 GTEPS on the single-node system and 4.39 MTEPS/Watt with 1,085.24 GTEPS on the supercomputer.

DOI： 10.1109/BigData.2016.7840624
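Since power efficiency is defined here as performance per watt, the power draw implied by a pair of reported figures follows by simple division. The helper below is a hypothetical illustration of that arithmetic; the resulting wattages are inferred from the quoted numbers and are not reported in the abstract.

```python
def implied_power_watts(gteps, mteps_per_watt):
    """Power implied by throughput (GTEPS) and efficiency (MTEPS/Watt)."""
    return gteps * 1000.0 / mteps_per_watt  # 1 GTEPS = 1000 MTEPS


single_node = implied_power_watts(129.76, 27.43)
supercomputer = implied_power_watts(1085.24, 4.39)
```

Under this reading, the single-node figures correspond to roughly 4.7 kW and the supercomputer figures to roughly 247 kW.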
Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing

Yuichi Inadomi, Tapasya Patki, Koji Inoue, Mutsumi Aoyagi, Barry Rountree, Martin Schulz, David Lowenthal, Yasutaka Wada, Keiichiro Fukazawa, Masatsugu Ueda, Masaaki Kondo, Ikuo Miyoshi

International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015 Proceedings of SC 2015 The International Conference for High Performance Computing, Networking, Storage and Analysis 2015.11

Language：English Publishing type：Research paper (other academic)

A key challenge in next-generation supercomputing is to effectively schedule limited power resources. Modern processors suffer from increasingly large power variations due to the chip manufacturing process. These variations lead to power inhomogeneity in current systems and manifest into performance inhomogeneity in power constrained environments, drastically limiting supercomputing performance. We present a first-of-its-kind study on manufacturing variability on four production HPC systems spanning four microarchitectures, analyze its impact on HPC applications, and propose a novel variation-aware power budgeting scheme to maximize effective application performance. Our low-cost and scalable budgeting algorithm strives to achieve performance homogeneity under a power constraint by deriving application-specific, module-level power allocations. Experimental results using a 1,920 socket system show up to 5.4X speedup, with an average speedup of 1.8X across all benchmarks when compared to a variation-unaware power allocation scheme.

DOI： 10.1145/2807591.2807638
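The contrast between variation-unaware and variation-aware power allocation can be sketched with a toy model. It assumes each module's power is linear in frequency, `power = static + slope * freq`, with per-module coefficients that differ because of manufacturing variation; this model, the function names, and the numbers are hypothetical and are not the paper's budgeting algorithm. Frequency limits are ignored.

```python
def uniform_budget_freqs(modules, total_power):
    """Give every module the same power share; each reaches a different frequency."""
    share = total_power / len(modules)
    return [(share - static) / slope for static, slope in modules]


def variation_aware_freq(modules, total_power):
    """Common frequency all modules can reach when power is split unevenly."""
    total_static = sum(static for static, _ in modules)
    total_slope = sum(slope for _, slope in modules)
    return (total_power - total_static) / total_slope


# (static watts, watts per GHz) for an efficient and a less efficient module
modules = [(20.0, 10.0), (30.0, 15.0)]
uniform = uniform_budget_freqs(modules, 100.0)
common = variation_aware_freq(modules, 100.0)
```

With these made-up coefficients, a uniform split leaves the less efficient module at about 1.33 GHz, which bounds a tightly synchronized application, while the uneven split lets both modules run at 2.0 GHz within the same 100 W budget.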
Message from the IEEE MCSoC-15 Program Co-Chairs Reviewed

José Ayala, Fumio Arakawa, Koji Inoue

9th IEEE International Symposium on Embedded Multicore/Manycore SoCs, MCSoC 2015 Proceedings - IEEE 9th International Symposium on Embedded Multicore/Manycore SoCs, MCSoC 2015 xi 2015.11

Language：English

DOI： 10.1109/MCSoC.2015.5
Characterization and cross-platform analysis of high-throughput accelerators

Keitaro Oka, Wenhao Jia, Margaret Martonosi, Koji Inoue

2015 15th IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2015 ISPASS 2015 - IEEE International Symposium on Performance Analysis of Systems and Software 161 - 162 2015.4

Language：English Publishing type：Research paper (other academic)

Today's computer systems often employ high-throughput accelerators (such as Intel Xeon Phi coprocessors and NVIDIA Tesla GPUs) to improve the performance of some applications or portions of applications. While such accelerators are useful for suitable applications, it remains challenging to predict which workloads will run well on these platforms and to predict the resulting performance trends for varying input.

DOI： 10.1109/ISPASS.2015.7095797
Characterization and cross-platform analysis of high-throughput accelerators.

Keitaro Oka, Wenhao Jia, Margaret Martonosi, Koji Inoue

2015 IEEE International Symposium on Performance Analysis of Systems and Software(ISPASS) 161 - 162 2015.4

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ISPASS.2015.7095797
A flexible hardware barrier mechanism for many-core processors

Takeshi Soga, Hiroshi Sasaki, Tomoya Hirao, Masaaki Kondo, Koji Inoue

2015 20th Asia and South Pacific Design Automation Conference, ASP-DAC 2015 20th Asia and South Pacific Design Automation Conference, ASP-DAC 2015 61 - 68 2015.3

Language：English Publishing type：Research paper (other academic)

This paper proposes a new hardware barrier mechanism which offers the flexibility to select which cores should join the synchronization, allowing for executing multiple multi-threaded applications by dividing a many-core processor into several groups. Experimental results based on an RTL simulation show that our hardware barrier achieves a 66-fold reduction in latency over typical software based implementations, with a hardware overhead of the processor of only 1.8%. Additionally, we demonstrate that the proposed mechanism is sufficiently flexible to cover a variety of core groups with minimal hardware overhead.

DOI： 10.1109/ASPDAC.2015.7058982
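The idea of selecting which cores join a synchronization can be modeled in software with a membership bitmask. The class below is a hypothetical behavioral sketch, not the RTL design evaluated in the paper: the barrier releases once every core in the mask has arrived, and cores outside the mask are rejected.

```python
class GroupBarrier:
    """Behavioral model of a barrier over a selectable group of cores."""

    def __init__(self, member_mask):
        self.member_mask = member_mask  # bit i set means core i participates
        self.arrived = 0

    def arrive(self, core_id):
        """Record an arrival; return True when the whole group has arrived."""
        bit = 1 << core_id
        if not bit & self.member_mask:
            raise ValueError(f"core {core_id} is not in this barrier group")
        self.arrived |= bit
        if self.arrived == self.member_mask:
            self.arrived = 0  # reset so the barrier can be reused
            return True
        return False
```

Several `GroupBarrier` instances with disjoint masks would correspond to dividing a many-core processor into independent groups, each running its own multi-threaded application.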
A flexible hardware barrier mechanism for many-core processors.

Takeshi Soga, Hiroshi Sasaki, Tomoya Hirao, Masaaki Kondo, Koji Inoue

The 20th Asia and South Pacific Design Automation Conference(ASP-DAC) 61 - 68 2015.1

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ASPDAC.2015.7058982
Power-capped DVFS and thread allocation with ANN models on modern NUMA systems

Satoshi Imamura, Hiroshi Sasaki, Koji Inoue, Dimitrios S. Nikolopoulos

32nd IEEE International Conference on Computer Design, ICCD 2014 2014 32nd IEEE International Conference on Computer Design, ICCD 2014 324 - 331 2014.12

Language：English Publishing type：Research paper (other academic)

Power capping is an essential function for efficient power budgeting and cost management on modern server systems. Contemporary server processors operate under power caps by using dynamic voltage and frequency scaling (DVFS). However, these processors are often deployed in non-uniform memory access (NUMA) architectures, where thread allocation between cores may significantly affect performance and power consumption. This paper proposes a method that maximizes performance under power caps on NUMA systems by dynamically optimizing two knobs: DVFS and thread allocation. The method selects the optimal combination of the two knobs with models based on an artificial neural network (ANN) that captures the nonlinear effect of thread allocation on performance. We implement the proposed method as a runtime system and evaluate it with twelve multithreaded benchmarks on a real AMD Opteron based NUMA system. The evaluation results show that our method outperforms a naive technique optimizing only DVFS by up to 67.1% under a power cap.

DOI： 10.1109/ICCD.2014.6974701
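Selecting the best combination of the two knobs under a power cap can be sketched as an exhaustive search over predicted outcomes. In the sketch below, `predict` is a hypothetical stand-in for the ANN models, and `toy_predict` with its coefficients is made up purely for illustration.

```python
from itertools import product


def best_config(frequencies, allocations, predict, power_cap):
    """Return (performance, frequency, allocation) with the highest predicted
    performance whose predicted power stays within the cap, or None."""
    best = None
    for freq, alloc in product(frequencies, allocations):
        perf, power = predict(freq, alloc)
        if power <= power_cap and (best is None or perf > best[0]):
            best = (perf, freq, alloc)
    return best


def toy_predict(freq_ghz, alloc):
    """Made-up model: spreading threads is faster but costs more power."""
    spread = alloc == "spread"
    perf = freq_ghz * (1.0 if spread else 0.8)
    power = 20.0 * freq_ghz + (15.0 if spread else 5.0)
    return perf, power


choice = best_config([1.0, 1.5, 2.0], ["packed", "spread"], toy_predict, 45.0)
```

In this toy setting, a packed allocation at the highest frequency beats a spread allocation at a lower frequency under the 45 W cap, which illustrates why optimizing DVFS alone can miss the best operating point.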
Power-capped DVFS and thread allocation with ANN models on modern NUMA systems.

Satoshi Imamura, Hiroshi Sasaki, Koji Inoue, Dimitrios S. Nikolopoulos

32nd IEEE International Conference on Computer Design(ICCD) 324 - 331 2014.12

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ICCD.2014.6974701
Power Consumption Evaluation of an MHD Simulation with CPU Power Capping.

Keiichiro Fukazawa, Masatsugu Ueda, Mutsumi Aoyagi, Tomonori Tsuhata, Kyohei Yoshida, Aruta Uehara, Masakazu Kuze, Yuichi Inadomi, Koji Inoue

14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing(CCGRID) 612 - 617 2014.7

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/CCGrid.2014.47
Power and Performance Characterization and Modeling of GPU-Accelerated Systems.

Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji Inoue, Masato Edahiro, Martin Peres

2014 IEEE 28th International Parallel and Distributed Processing Symposium(IPDPS) 113 - 122 2014.5

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/IPDPS.2014.23
Power and performance characterization and modeling of GPU-accelerated systems

Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji Inoue, Masato Edahiro, Martin Peres

28th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2014 Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS 2014 113 - 122 2014.1

Language：English Publishing type：Research paper (other academic)

Graphics processing units (GPUs) provide an order-of-magnitude improvement on peak performance and performance-per-watt as compared to traditional multicore CPUs. However, GPU-accelerated systems currently lack a generalized method of power and performance prediction, which prevents system designers from an ultimate goal of dynamic power and performance optimization. This is due to the fact that their power and performance characteristics are not well captured across architectures, and as a result, existing power and performance modeling approaches are only available for a limited range of particular GPUs. In this paper, we present power and performance characterization and modeling of GPU-accelerated systems across multiple generations of architectures. Characterization and modeling both play a vital role in optimization and prediction of GPU-accelerated systems. We quantify the impact of voltage and frequency scaling on each architecture with a particularly intriguing result that a cutting-edge Kepler-based GPU achieves energy saving of 75% by lowering GPU clocks in the best scenario, while Fermi- and Tesla-based GPUs achieve no greater than 40% and 13%, respectively. Considering these characteristics, we provide statistical power and performance modeling of GPU-accelerated systems simplified enough to be applicable for multiple generations of architectures. One of our findings is that even simplified statistical models are able to predict power and performance of cutting-edge GPUs within errors of 20% to 30% for any set of voltage and frequency pair.

DOI： 10.1109/IPDPS.2014.23
Power consumption evaluation of an MHD simulation with CPU power capping

Keiichiro Fukazawa, Masatsugu Ueda, Mutsumi Aoyagi, Tomonori Tsuhata, Kyohei Yoshida, Aruta Uehara, Masakazu Kuze, Yuichi Inadomi, Koji Inoue

14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2014 Proceedings - 14th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2014 612 - 617 2014.1

Language：English Publishing type：Research paper (other academic)

Power consumption has recently become an important issue in achieving next-generation exaflops computer systems. However, the power consumption characteristics of application programs have received little attention so far. In this study, we examine the power characteristics of our magnetohydrodynamic (MHD) simulation code for the global magnetosphere to evaluate its power consumption behavior under CPU power capping on a parallel computer system. We confirm that the MHD simulation code contains parts with different power consumption behavior: in some parts, execution performance decreases under CPU power capping, while in others it does not change. This indicates that performance can be optimized under power capping.

DOI： 10.1109/CCGrid.2014.47
Coordinated power-performance optimization in manycores

Hiroshi Sasaki, Satoshi Imamura, Koji Inoue

22nd International Conference on Parallel Architectures and Compilation Techniques, PACT 2013 PACT 2013 - Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques 51 - 61 2013.11

Language：English Publishing type：Research paper (other academic)

Optimizing the performance in multiprogrammed environments, especially for workloads composed of multi-threaded programs, is a desired feature of runtime management systems in future manycore processors. At the same time, power capping capability is required in order to improve the reliability of microprocessor chips while reducing the costs of power supply and thermal budgeting. This paper presents a sophisticated runtime coordinated power-performance management system called C-3PO, which optimizes the performance of manycore processors under a power constraint by controlling two software knobs: thread packing, and dynamic voltage and frequency scaling (DVFS). The proposed solution distributes the power budget to each program by controlling the workload threads to be executed with an appropriate number of cores and operating frequency. The power budget is distributed carefully in different forms (number of allocated cores or operating frequency) depending on the power-performance characteristics of the workload so that each program can effectively convert the power into performance. The proposed system is based on a heuristic algorithm which relies on runtime prediction of power and performance via hardware performance monitoring units. Empirical results on a 64-core platform show that C-3PO clearly outperforms traditional counterparts across various PARSEC workload mixes.

DOI： 10.1109/PACT.2013.6618803
Hybrid compile and run-time memory management for a 3D-stacked reconfigurable accelerator.

Lovic Gauthier, Shinya Ueno, Koji Inoue

International Conference on Compilers, Architecture and Synthesis for Embedded Systems(CASES) 10 - 10 2013.11

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/CASES.2013.6662514
Static Mapping of Multiple Data-Parallel Applications on Embedded Many-Core SoCs.

Junya Kaida, Yuko Hara-Azumi, Takuji Hieda, Ittetsu Taniguchi, Hiroyuki Tomiyama, Koji Inoue

IEICE Transactions on Information & Systems 96-D ( 10 ) 2268 - 2271 2013.10

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1587/transinf.E96.D.2268
Coordinated power-performance optimization in manycores.

Hiroshi Sasaki, Satoshi Imamura, Koji Inoue

Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques(PACT) 51 - 61 2013.10

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/PACT.2013.6618803
A Prototype System for Many-core Architecture SMYLEref with FPGA Evaluation Boards

Son-Truong NGUYEN, Masaaki KONDO, Tomoya HIRAO, Koji INOUE

IEICE Transactions on Information and Systems 2013.8

Language：English
A Prototype System for Many-Core Architecture SMYLEref with FPGA Evaluation Boards.

Son-Truong Nguyen, Masaaki Kondo, Tomoya Hirao, Koji Inoue

IEICE Transactions on Information & Systems 96-D ( 8 ) 1645 - 1653 2013.8

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1587/transinf.E96.D.1645
Many-core acceleration for model predictive control systems.

Satoshi Kawakami, Akihito Iwanaga, Koji Inoue

Proceedings of the 1st International Workshop on Many-core Embedded Systems 2013(MES) 17 - 24 2013.6

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1145/2489068.2489071
Line sharing cache: Exploring cache capacity with frequent line value locality

Keitarou Oka, Hiroshi Sasaki, Koji Inoue

2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 669 - 674 2013.5

Language：English Publishing type：Research paper (other academic)

This paper proposes a new last-level cache architecture called line sharing cache (LSC), which can reduce the number of cache misses without increasing the size of the cache memory. It stores lines that contain identical values in a single line entry, which allows the cache to hold a greater number of lines. Evaluation results show performance improvements of up to 35% across a set of SPEC CPU2000 benchmarks.

DOI： 10.1109/ASPDAC.2013.6509677
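
The line-sharing idea in the abstract above can be pictured with a small software model. This is only a sketch under stated assumptions, not the LSC hardware design: the class name `LineSharingCache`, the LRU policy, the read-only access path, and the reference counting are all invented here for illustration. Tags map addresses to line values, and tags whose lines hold the same value share one data entry, so capacity is limited by the number of distinct values rather than the number of addresses.

```python
from collections import OrderedDict


class LineSharingCache:
    """Toy read-only model: identical line values occupy a single data entry."""

    def __init__(self, data_entries):
        if data_entries < 1:
            raise ValueError("need at least one data entry")
        self.data_entries = data_entries
        self.tags = OrderedDict()   # address -> line value, kept in LRU order
        self.refs = {}              # line value -> number of tags sharing it

    def access(self, addr, memory):
        """Return True on a hit, False on a miss (the line is then filled)."""
        if addr in self.tags:
            self.tags.move_to_end(addr)
            return True
        value = memory[addr]
        # Evict least recently used tags until the new value has a data entry.
        while value not in self.refs and len(self.refs) >= self.data_entries:
            self._evict_lru()
        self.tags[addr] = value
        self.refs[value] = self.refs.get(value, 0) + 1
        return False

    def _evict_lru(self):
        _, old_value = self.tags.popitem(last=False)
        self.refs[old_value] -= 1
        if self.refs[old_value] == 0:
            del self.refs[old_value]


if __name__ == "__main__":
    zero_line = bytes(64)
    other_line = bytes([1]) * 64
    memory = {a: zero_line for a in range(6)}
    memory[4] = other_line
    cache = LineSharingCache(data_entries=2)
    print([cache.access(a, memory) for a in range(6)])  # first pass
    print([cache.access(a, memory) for a in range(6)])  # second pass
```

In the usage example, six addresses hold only two distinct line values, so all six stay resident in a cache with two data entries; a conventional two-line cache could keep only two of them.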
SMYLEref: A reference architecture for manycore-processor SoCs

M. Kondo, S. T. Nguyen, T. Hirao, T. Soga, H. Sasaki, K. Inoue

2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 561 - 564 2013.5

Language：English Publishing type：Research paper (other academic)

Nowadays, the trend of developing microprocessors with tens of cores brings a promising prospect for embedded systems. Realizing a high-performance and low-power many-core processor is becoming a primary technical challenge. We are currently developing a many-core processor architecture for embedded systems as part of a NEDO project. This paper introduces the many-core architecture called SMYLEref along with the concept of Virtual Accelerator on Many-core, in which many cores on a chip are utilized as a hardware platform for realizing multiple virtual accelerators. We are developing its prototype system with off-the-shelf FPGA evaluation boards. In this paper, we introduce the architecture of SMYLEref and the details of the prototype system. In addition, several initial experiments with the prototype system are also presented.

DOI： 10.1109/ASPDAC.2013.6509656
SMYLE project: Toward high-performance, low-power computing on manycore-processor SoCs

Koji Inoue

2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 558 - 560 2013.5

Language：English Publishing type：Research paper (other academic)

This paper introduces a manycore research project called SMYLE (Scalable ManYcore for Low Energy computing). The aims of this project are: 1) proposing a manycore SoC architecture and developing a suitable programming and execution environment, 2) designing a domain specific manycore system for emerging video mining applications, and 3) releasing developed software tools and FPGA emulation environments to accelerate manycore research and development in the community. The project started in December 2010 with full support from the New Energy and Industrial Technology Development Organization (NEDO).

DOI： 10.1109/ASPDAC.2013.6509655
Hybrid compile and run-time memory management for a 3D-stacked reconfigurable accelerator

Lovic Gauthier, Shinya Ueno, Koji Inoue

2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES 2013 2013.1

Language：English Publishing type：Research paper (other academic)

This paper presents a hybrid compile and run-time memory management technique for a 3D-stacked reconfigurable accelerator that includes a memory layer composed of multiple memory units whose parallel access allows a very high bandwidth. The technique inserts allocation, free, and data-transfer operations into the code to use the memory layer, and avoids memory overflows by adding a limited number of additional copies to and from the host memory. When compile-time information is lacking, the technique relies on run-time decisions for controlling these memory operations. Experiments show that, compared to a pessimistic approach, the overhead for avoiding overflows can be cut on average by 27%, 45% and 63% when the size of each memory unit is 1 kB, 128 kB and 1 MB, respectively.

DOI： 10.1109/CASES.2013.6662514
SMYLEref: A reference architecture for manycore-processor SoCs.

Masaaki Kondo, Son Truong Nguyen, Tomoya Hirao, Takeshi Soga, Hiroshi Sasaki, Koji Inoue

18th Asia and South Pacific Design Automation Conference(ASP-DAC) 561 - 564 2013.1

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ASPDAC.2013.6509656
SMYLE Project: Toward high-performance, low-power computing on manycore-processor SoCs.

Koji Inoue

18th Asia and South Pacific Design Automation Conference(ASP-DAC) 558 - 560 2013.1

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ASPDAC.2013.6509655
Power and performance of GPU-accelerated systems: A closer look.

Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji Inoue, Masato Edahiro, Martin Peres

Proceedings of the IEEE International Symposium on Workload Characterization(IISWC) 109 - 110 2013.1

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/IISWC.2013.6704675
Line sharing cache: Exploring cache capacity with frequent line value locality.

Keitarou Oka, Hiroshi Sasaki, Koji Inoue

18th Asia and South Pacific Design Automation Conference(ASP-DAC) 669 - 674 2013.1

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ASPDAC.2013.6509677
Power and performance of GPU-accelerated systems: A closer look

Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji Inoue, Masato Edahiro, Martin Peres

2013 IEEE International Symposium on Workload Characterization, IISWC 2013 Proceedings - 2013 IEEE International Symposium on Workload Characterization, IISWC 2013 109 - 110 2013.1

Language：English Publishing type：Research paper (other academic)

DOI： 10.1109/IISWC.2013.6704675
Many-core acceleration for model predictive control systems

Satoshi Kawakami, Akihito Iwanaga, Koji Inoue

1st International Workshop on Many-Core Embedded Systems, MES 2013, in Conjunction with the 40th Annual IEEE/ACM International Symposium on Computer Architecture, ISCA 2013 1st International Workshop on Many-Core Embedded Systems, MES 2013 - In Conjunction with the 40th Annual IEEE/ACM International Symposium on Computer Architecture, ISCA 2013 17 - 24 2013

Language：English Publishing type：Research paper (other academic)

This paper proposes a novel many-core execution strategy for real-time model predictive controls. The key idea is to exploit predicted input values, which are produced by the model predictive control itself, to speculatively solve an optimal control problem. It is well known that control applications are not suitable for multi- or many-core processors, because feedback-loop systems inherently stand on sequential operations. Since the proposed scheme does not rely on conventional thread-/data-level parallelism, it can be easily applied to such control systems. An analytical evaluation using a real application demonstrates the potential of performance improvement achieved by the proposed speculative executions.

DOI： 10.1145/2489068.2489071
A three-dimensional integrated accelerator

Farhad Mehdipour, Krishna C. Nunna, Koji Inoue, Kazuaki J. Murakami

15th Euromicro Conference on Digital System Design, DSD 2012 Proceedings - 15th Euromicro Conference on Digital System Design, DSD 2012 148 - 151 2012.12

Language：English Publishing type：Research paper (other academic)

We propose a three-dimensional (3D) reconfigurable data-path accelerator that is capable of running partitioned large data flow graphs (DFGs) on the layers of a 3D stack, with inter-layer connections implemented by means of through-silicon vias (TSVs). A tool for mapping data flow graphs has been developed, and a key 3D-specific problem, namely routing nets on a 3D architecture, is discussed in detail as well. The experiments demonstrate a smaller footprint area and higher performance for the 3D accelerator compared with its 2D counterpart.

DOI： 10.1109/DSD.2012.15
A Three-Dimensional Integrated Accelerator.

Farhad Mehdipour, Krishna Chaitanya Nunna, Koji Inoue, Kazuaki J. Murakami

15th Euromicro Conference on Digital System Design(DSD) 148 - 151 2012.12

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/DSD.2012.15
Improving performance and energy efficiency of embedded processors via post-fabrication instruction set customization.

Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki J. Murakami

The Journal of Supercomputing 60 ( 2 ) 196 - 222 2012.11

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1007/s11227-010-0505-0
Scalability-based manycore partitioning Reviewed

Hiroshi Sasaki, Koji Inoue, Teruo Tanimoto, Hiroshi Nakamura

21st International Conference on Parallel Architectures and Compilation Techniques, PACT 2012 Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT 107 - 116 2012.10

Language：English Publishing type：Research paper (scientific journal)

Multicore processors have been popular for years, and the industry is gradually shifting towards the era of manycore processors. Single-thread performance of microprocessors is not growing at a historical rate, but the existence of a number of active processes in the computer system and the continuing development of multi-threaded applications benefit from the growing core counts to sustain system throughput. This trend brings us a situation where a number of parallel applications are executed simultaneously on a single system. Since multi-threaded applications try to maximize their throughput by utilizing the whole system, each of them usually creates an equal or larger number of threads compared to the underlying logical core count. This introduces a much greater number of threads to be co-scheduled in the entire system. However, each program has different characteristics (or scalability) and contends with the others for shared resources, namely the CPU cores and memory hierarchies. Therefore, it is clear that OS thread scheduling will play a major role in achieving high system performance under such conditions. We develop a sophisticated scheduler that (1) dynamically predicts the scalability of programs via the use of hardware performance monitoring units, (2) decides the optimal number of cores to be allocated for each program, and (3) allocates the cores to programs while maximizing the system utilization to achieve fair and maximum performance. The evaluation results on a 48-core AMD Opteron system show improvements over the Linux scheduler for a variety of multiprogramming workloads.

DOI： 10.1145/2370816.2370833
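
Step (2) of the scheduler described above, deciding how many cores each program gets, can be sketched as follows. This is a hedged illustration rather than the paper's method: it assumes each program's scalability is summarized by an Amdahl-style parallel fraction (in the paper the prediction comes from hardware performance counters), and the function names `speedup` and `partition` are hypothetical. Cores are handed out one at a time to whichever program gains the most predicted speedup from its next core.

```python
import heapq


def speedup(cores, parallel_frac):
    # Assumed scalability model: Amdahl's law.
    return 1.0 / ((1.0 - parallel_frac) + parallel_frac / cores)


def marginal_gain(cores, parallel_frac):
    # Predicted speedup gained by going from `cores` to `cores + 1`.
    return speedup(cores + 1, parallel_frac) - speedup(cores, parallel_frac)


def partition(parallel_fracs, total_cores):
    """Give every program one core, then assign the rest by largest marginal gain."""
    if total_cores < len(parallel_fracs):
        raise ValueError("fewer cores than programs")
    alloc = {name: 1 for name in parallel_fracs}
    # Max-heap via negated gains; the name breaks ties deterministically.
    heap = [(-marginal_gain(1, f), name) for name, f in parallel_fracs.items()]
    heapq.heapify(heap)
    for _ in range(total_cores - len(parallel_fracs)):
        _, name = heapq.heappop(heap)
        alloc[name] += 1
        gain = marginal_gain(alloc[name], parallel_fracs[name])
        heapq.heappush(heap, (-gain, name))
    return alloc


if __name__ == "__main__":
    print(partition({"a": 0.99, "b": 0.5, "c": 0.0}, total_cores=48))
```

Because the assumed speedup curve has diminishing returns, this greedy choice maximizes the sum of predicted speedups; fairness, which the abstract also targets, is not modeled here.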
Power and Performance Analysis of GPU-Accelerated Systems.

Yuki Abe, Hiroshi Sasaki, Martin Peres, Koji Inoue, Kazuaki J. Murakami, Shinpei Kato

2012 Workshop on Power-Aware Computing Systems(HotPower) 2012.10

Language：Others Publishing type：Research paper (other academic)
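A proposal of a technique for improving many-core processor performance by dynamically changing the number of cores and the operating frequency Reviewed

Satoshi Imamura, Hiroshi Sasaki, Naoto Fukumoto, Koji Inoue, Kazuaki Murakami

IPSJ Transactions on Advanced Computing Systems (ACS) 5 ( 4 ) 24 - 35 2012.8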

Language：Japanese
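A line sharing cache exploiting data value locality Reviewed

Keitarou Oka, Hiroshi Sasaki, Yuki Abe, Koji Inoue, Kazuaki Murakami

IPSJ Transactions on Advanced Computing Systems (ACS) 5 ( 4 ) 36 - 47 2012.8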

Language：Japanese
Scalability-based manycore partitioning.

Hiroshi Sasaki, Teruo Tanimoto, Koji Inoue, Hiroshi Nakamura

International Conference on Parallel Architectures and Compilation Techniques(PACT) 107 - 116 2012.2

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1145/2370816.2370833
An SRAM/DRAM hybrid cache architecture for 3D-stacked LSIs Reviewed

Shinya Ueno, Shinya Hashiguchi, Naoto Fukumoto, Koji Inoue, Kazuaki Murakami

IPSJ Transactions on Advanced Computing Systems (ACS) 5 ( 1 ) 41 - 52 2012.1

Language：English

3D-stacked SRAM/DRAM hybrid cache
Task mapping techniques for embedded many-core SoCs.

Junya Kaida, Takuji Hieda, Ittetsu Taniguchi, Hiroyuki Tomiyama, Yuko Hara-Azumi, Koji Inoue

International SoC Design Conference(ISOCC) 204 - 207 2012.1

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ISOCC.2012.6407075
Task mapping techniques for embedded many-core SoCs

Junya Kaida, Takuji Hieda, Ittetsu Taniguchi, Hiroyuki Tomiyama, Yuko Hara-Azumi, Koji Inoue

2012 International SoC Design Conference, ISOCC 2012 ISOCC 2012 - 2012 International SoC Design Conference 204 - 207 2012

Language：English Publishing type：Research paper (other academic)

This paper proposes static task mapping techniques for embedded many-core SoCs. The proposed techniques take into account both task and data parallelisms of the tasks in order to efficiently utilize the potential parallelism of the many-core architecture. Two approaches are proposed for static mapping: one approach is based on integer linear programming and the other is based on a greedy algorithm. In addition, a static mapping technique considering dynamic task switching is proposed. Experimental results show the effectiveness of the proposed techniques.

DOI： 10.1109/ISOCC.2012.6407075
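
As a rough picture of what a greedy static mapping can look like, the sketch below assigns tasks to cores using the classic longest-processing-time-first rule. This is a swapped-in textbook heuristic, not the greedy algorithm from the paper: it ignores data parallelism and task switching, and the function name `greedy_map` and the cost values are assumptions made for the example.

```python
import heapq


def greedy_map(task_costs, num_cores):
    """Map each task to the currently least-loaded core, longest tasks first.

    task_costs maps a task name to its estimated execution time.
    Returns (mapping of task -> core id, resulting makespan).
    """
    if num_cores < 1:
        raise ValueError("need at least one core")
    cores = [(0.0, core_id) for core_id in range(num_cores)]  # (load, core id)
    heapq.heapify(cores)
    mapping = {}
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, core_id = heapq.heappop(cores)      # least-loaded core so far
        mapping[task] = core_id
        heapq.heappush(cores, (load + cost, core_id))
    makespan = max(load for load, _ in cores)
    return mapping, makespan


if __name__ == "__main__":
    print(greedy_map({"t1": 4, "t2": 3, "t3": 3, "t4": 2}, num_cores=2))
```

A heuristic like this runs quickly but gives no optimality guarantee in general, which is one reason to pair it with an exact formulation such as the integer linear programming approach the abstract mentions.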
NSIM: An Interconnection Network Simulator for Extreme-Scale Parallel Computers Reviewed

Hideki MIWA, Ryutaro SUSUKITA, Hidetomo SHIBAMURA, Tomoya HIRAO, Jun MAKI, Makoto YOSHIDA, Takayuki KANDO, Yuichiro AJIMA, Ikuo MIYOSHI, Toshiyuki SHIMIZU, Yuji OINAGA, Hisashige ANDO, Yuichi INADOMI, Koji INOUE, Mutsumi AOYAGI, Kazuaki MURAKAMI

IEICE TRANSACTIONS on Information and Systems 2011.12

Language：English Publishing type：Research paper (scientific journal)
Performance evaluation of 3D stacked multi-core processors with temperature consideration

Takaaki Hanada, Hiroshi Sasaki, Koji Inoue, Kazuaki Murakami

2011 IEEE International 3D Systems Integration Conference, 3DIC 2011 2011.12

Language：English Publishing type：Research paper (other academic)

The 3D stacked multi-core processor is one of the applications of 3D integration technology. It achieves high-bandwidth access to the last-level cache and allows the number of cores to be increased while maintaining the package area. However, the temperature of a 3D multi-core increases with the number of stacked dies because of the escalating power density and thermal resistivity. Therefore, 3D multi-cores require lower clock frequencies to keep the temperature under a safe constraint, so performance is not always improved. In this paper, we evaluate the performance of 3D stacked multi-cores running under temperature constraints, and we show that there is a trade-off between clock frequency and parallel capability.

DOI： 10.1109/3DIC.2012.6263025
NSIM: An Interconnection Network Simulator for Extreme-Scale Parallel Computers.

Hideki Miwa, Ryutaro Susukita, Hidetomo Shibamura, Tomoya Hirao, Jun Maki, Makoto Yoshida, Takayuki Kando, Yuichiro Ajima, Ikuo Miyoshi, Toshiyuki Shimizu, Yuji Oinaga, Hisashige Ando, Yuichi Inadomi, Koji Inoue, Mutsumi Aoyagi, Kazuaki J. Murakami

IEICE Transactions on Information & Systems 94-D ( 12 ) 2298 - 2308 2011.12

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1587/transinf.E94.D.2298
3D implemented SRAM/DRAM hybrid cache architecture for high-performance and low power consumption

Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, Kazuaki Murakami

54th IEEE International Midwest Symposium on Circuits and Systems, MWSCAS 2011 2011.10

Language：English Publishing type：Research paper (other academic)

This paper introduces our research status focusing on 3D-implemented microprocessors. 3D-IC is one of the most interesting techniques to achieve high-performance, low-power VLSI systems. Stacking multiple dies makes it possible to implement microprocessor cores and large caches (or DRAM) into the same chip. Although this kind of integration has a great potential to bring a breakthrough in computer systems, its efficiency strongly depends on the characteristics of target application programs. Unfortunately, applying die stacking implementation causes performance degradation for some programs. To tackle this issue, we introduce a novel cache architecture consisting of a small but fast SRAM and a stacked large DRAM. The cache attempts to adapt to varying behavior of application programs in order to compensate for the negative impact of the die stacking approach.

DOI： 10.1109/MWSCAS.2011.6026484
Message from the chairs Reviewed

Naehyuck Chang, Hiroshi Nakamura, Kenichi Osada, Massimo Poncino, Koji Inoue

17th IEEE/ACM International Symposium on Low Power Electronics and Design, ISLPED 2011 Proceedings of the International Symposium on Low Power Electronics and Design iii - iv 2011.9

Language：English

DOI： 10.1109/ISLPED.2011.5993616
Routing architecture and algorithms for a superconductivity circuits-based computing hardware.

Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki J. Murakami

Proceedings of the 24th Canadian Conference on Electrical and Computer Engineering(CCECE) 977 - 980 2011.9

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/CCECE.2011.6030605
A thermal-aware mapping algorithm for reducing peak temperature of an accelerator deployed in a 3D stack.

Farhad Mehdipour, Krishna Chaitanya Nunna, Lovic Gauthier, Koji Inoue, Kazuaki J. Murakami

2011 IEEE International 3D Systems Integration Conference (3DIC) 1 - 4 2011.8

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/3DIC.2012.6263034
Performance evaluation of 3D stacked multi-core processors with temperature consideration.

Takaaki Hanada, Hiroshi Sasaki, Koji Inoue, Kazuaki J. Murakami

2011 IEEE International 3D Systems Integration Conference (3DIC) 1 - 5 2011.8

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/3DIC.2012.6263025
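An on-chip memory lending technique for multicores considering the balance between computation and memory performance Reviewed

Naoto Fukumoto, Koji Inoue, Kazuaki Murakami

IPSJ Transactions on Advanced Computing Systems (ACS) 2011.5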

Language：English
A design scheme for a reconfigurable accelerator implemented by single-flux quantum circuits.

Farhad Mehdipour, Hiroaki Honda, Koji Inoue, Hiroshi Kataoka, Kazuaki J. Murakami

Journal of Systems Architecture - Embedded Systems Design 57 ( 1 ) 169 - 179 2011.1

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1016/j.sysarc.2010.07.009
A thermal-aware mapping algorithm for reducing peak temperature of an accelerator deployed in a 3D stack

Farhad Mehdipour, Krishna Chaitanya Nunna, Lovic Gauthier, Koji Inoue, Kazuaki Murakami

2011 IEEE International 3D Systems Integration Conference, 3DIC 2011 2011

Language：English Publishing type：Research paper (other academic)

Thermal management is one of the main concerns in three-dimensional integration due to the difficulty of dissipating heat through the stack of the integrated circuit. This paper targets peak temperature reduction in a 3D stack involving a data-path accelerator, a base processor and memory components. A mapping algorithm has been devised to distribute the operations of data flow graphs evenly over the processing elements of the target accelerator in two steps: thermal-aware partitioning of the input data flow graphs, and thermal-aware mapping of the partitions onto the processing elements. The efficiency of the proposed technique in reducing peak temperature is demonstrated through experiments.

DOI： 10.1109/3DIC.2012.6263034
Routing architecture and algorithms for a superconductivity circuits-based computing hardware

Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki Murakami

2011 Canadian Conference on Electrical and Computer Engineering, CCECE 2011 977 - 980 2011

Language：English Publishing type：Research paper (other academic)

Dedicated tools for placing and routing data flow graphs extracted from computation-intensive applications are basic requirements for developing applications on a large-scale reconfigurable data-path processor (LSRDP) implemented by superconductivity circuits. Using an alternative technology instead of CMOS circuits for implementing such hardware entails considering particular constraints and conditions from the architecture and tools development perspectives. The main contribution of this work is to introduce an operand routing network (ORN) architecture as well as algorithms for routing the nets corresponding to the edges of the data flow graphs. Further, a micro-routing algorithm is proposed for routing and configuring the ORNs internally. These algorithms have been applied to a number of data flow graphs from target applications, and the results demonstrate their efficacy.

DOI： 10.1109/CCECE.2011.6030605
Hardware and software requirements for implementing a high-performance superconductivity circuits-based accelerator

Farhad Mehdipour, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

3rd Asia Symposium on Quality Electronic Design, ASQED 2011 Proceedings of the 3rd Asia Symposium on Quality Electronic Design, ASQED 2011 229 - 235 2011

Language：English Publishing type：Research paper (other academic)

The Single-Flux Quantum based large-scale data-path processor (SFQ-LSRDP) is a reconfigurable computing system implemented by means of superconductivity circuits. The SFQ-LSRDP is capable of accelerating data flow graphs (DFGs) extracted from scientific applications. Using an alternative technology instead of CMOS circuits for implementing such hardware entails considering particular constraints and conditions from the architecture and tools development perspectives. In this paper, we introduce the hardware specifications of the LSRDP and the tool chain developed for implementing applications. Placing and routing data flow graphs is a fundamental part of developing applications on the SFQ-LSRDP. Algorithms for placing DFG operations and routing the nets corresponding to the edges of data flow graphs are discussed in more detail. These algorithms have been applied to a number of data flow graphs, and the results demonstrate their efficiency. Further, simulation results demonstrate remarkable performance numbers in the range of hundreds of Gflops for the proposed architecture.

DOI： 10.1109/ASQED.2011.6111751
Mapping scientific applications on a large-scale data-path accelerator implemented by single-flux quantum (SFQ) circuits.

Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Irina Kataeva, Kazuaki J. Murakami, Hiroyuki Akaike, Akira Fujimaki

Design, Automation and Test in Europe(DATE) 993 - 996 2010.4

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/DATE.2010.5456902
Mapping scientific applications on a large-scale data-path accelerator implemented by Single-Flux Quantum (SFQ) circuits

Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Irina Kataeva, Kazuaki Murakami, Hiroyuki Akaike, Akira Fujimaki

Design, Automation and Test in Europe Conference and Exhibition, DATE 2010 DATE 10 - Design, Automation and Test in Europe 993 - 996 2010

Language：English Publishing type：Research paper (other academic)

To overcome issues originating from CMOS technology, a large-scale reconfigurable data-path (LSRDP) processor based on single-flux quantum circuits is introduced. The LSRDP is attached to a general-purpose processor to accelerate the execution of data flow graphs (DFGs) extracted from scientific applications. The procedure for mapping large DFGs onto the LSRDP is discussed, and our proposed techniques for reducing the area of the accelerator within the design procedure are introduced as well.

DOI： 10.1109/date.2010.5456902
Rapid design space exploration of a reconfigurable instruction-set processor Reviewed

Farhad Mehdipour, Hamid Noori, Koji Inoue, Kazuaki Murakami

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E92-A ( 12 ) 3182 - 3192 2009.12

Language：English Publishing type：Research paper (scientific journal)

The multitude of parameters in the design process of a reconfigurable instruction-set processor (RISP) may lead to a large design space and remarkable complexity. A quantitative design approach uses data collected from applications to satisfy design constraints and optimize the design goals while considering the applications' characteristics; however, it depends heavily on the designer's observations and analyses. Design space exploration can be considered an effective technique for finding a proper balance among various design parameters. However, this approach is computationally expensive when the performance evaluation of the design points is based on the synthesis-and-simulation technique. A combined analytical and simulation-based model (CAnSO) is proposed and validated for performance evaluation of a typical RISP. The proposed model consists of an analytical core that incorporates statistics collected from cycle-accurate simulation to make a reasonable evaluation and provide valuable insight. CAnSO has clear speed advantages, and therefore it can be used to ease the cumbersome design space exploration of a RISP and to quickly evaluate the performance of slightly modified architectures.

DOI： 10.1587/transfun.E92.A.3182
Performance balancing: Software-based on-chip memory management for effective CMP executions

Naoto Fukumoto, Kenichi Imazato, Koji Inoue, Kazuaki Murakami

10th MEDEA Workshop on MEmory Performance: DEaling with Applications, Systems and Architecture, MEDEA '09, held in conjunction with the Int. Conf. on Parallel Architectures and Compilation Techniques, PACT 2009 Proceedings of the 10th MEDEA Workshop on MEmory Performance DEaling with Applications, Systems and Architecture, MEDEA '09, held in conjunction with the PACT 2009 Conference 28 - 34 2009.12

Language：English Publishing type：Research paper (other academic)

This paper proposes the concept of performance balancing and reports its performance impact on a chip multiprocessor (CMP). Integrating multiple processor cores into a single chip, as in CMPs, can achieve higher peak performance by exploiting thread-level parallelism. However, the off-chip memory bandwidth, which does not scale with the number of cores, tends to limit the potential of CMPs. To solve this issue, the technique proposed in this paper attempts to strike a good balance between computation and memorization. Unlike conventional parallel executions, this approach uses some cores to improve memory performance. These cores devote their on-chip memory hardware resources to the remaining cores, which execute the parallelized threads. In our evaluation, our approach achieves a 31% performance improvement over a conventional parallel execution model for the specified program.

DOI： 10.1145/1621960.1621966
Adaptive cache-line size management on 3D integrated microprocessors

Takatsugu Ono, Koji Inoue, Kazuaki Murakami

2009 International SoC Design Conference, ISOCC 2009 472 - 475 2009.12

Language：English Publishing type：Research paper (other academic)

Memory bandwidth can be improved dramatically by stacking the main memory (DRAM) on processor cores and connecting them by wide on-chip buses composed of through-silicon vias (TSVs). 3D stacking makes it possible to reduce the cache miss penalty because a large amount of data can be transferred from the main memory to the cache at once. If a large cache line size is employed, we can expect a prefetching effect. However, it might worsen system performance if programs do not have sufficient spatial locality in their memory references. To solve this problem, we introduce a software-controllable variable line-size cache scheme. In this paper, we apply it to an L1 data cache with a 3D stacked DRAM organization. In our evaluation, our approach reduces the energy consumption of the L1 data cache and stacked DRAM by up to 75% compared to a conventional cache.

DOI： 10.1109/SOCDC.2009.5423920
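
The trade-off behind a variable line size can be illustrated with a tiny trace-driven model. This is only a sketch under stated assumptions, not the paper's cache or its selection method: it models a direct-mapped cache of a fixed, made-up capacity, counts misses only, and the names `count_misses` and `pick_line_size` are hypothetical. Ties are broken toward the smaller line, used here as a simple stand-in for lower transfer energy.

```python
def count_misses(trace, line_size, cache_bytes=1024):
    """Count misses of a direct-mapped cache for a list of byte addresses."""
    num_lines = cache_bytes // line_size
    tags = [None] * num_lines          # block number held by each cache line
    misses = 0
    for addr in trace:
        block = addr // line_size
        index = block % num_lines
        if tags[index] != block:
            tags[index] = block
            misses += 1
    return misses


def pick_line_size(trace, candidates=(32, 64, 128)):
    """Pick the candidate with the fewest misses, preferring smaller lines on ties."""
    return min(candidates, key=lambda size: (count_misses(trace, size), size))


if __name__ == "__main__":
    sequential = list(range(0, 4096, 4))                 # high spatial locality
    strided = [i * 128 for i in range(32)] * 2           # little spatial locality
    print(pick_line_size(sequential), pick_line_size(strided))
```

In this toy model a sequential trace favors the largest line, since each fill prefetches neighboring data, while a widely strided trace gains nothing from larger lines, so the smallest candidate is chosen.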
ALU-array based reconfigurable accelerator for energy efficient executions

Koji Inoue, Hamid Noori, Farhad Mehdipour, Takaaki Hanada, Kazuaki Murakami

2009 International SoC Design Conference, ISOCC 2009 157 - 160 2009.12

Language：English Publishing type：Research paper (other academic)

This paper introduces an energy-efficient acceleration technique for embedded microprocessors. By supporting an ALU-array based coarse-grain reconfigurable functional unit, well-customized special instructions are identified and executed for each application program. Since the reconfigurable functional unit can execute several dependent instructions (a sequence of instructions) simultaneously, the performance of the base microprocessor can be improved dramatically. In addition, this kind of direct execution is very energy efficient because it reduces the activation counts of hardware components such as the instruction cache, branch predictor, and register file.

DOI： 10.1109/SOCDC.2009.5423898
Rapid Design Space Exploration of a Reconfigurable Instruction-Set Processor.

Farhad Mehdipour, Hamid Noori, Koji Inoue, Kazuaki J. Murakami

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 92-A ( 12 ) 3182 - 3192 2009.12

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1587/transfun.E92.A.3182
Adaptive cache-line size management on 3D integrated microprocessors

Takatsugu Ono, Koji Inoue, Kazuaki Murakami

2009 International SoC Design Conference (ISOCC) 2009.11

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/socdc.2009.5423920
Performance balancing: software-based on-chip memory management for effective CMP executions.

Naoto Fukumoto, Kenichi Imazato, Koji Inoue, Kazuaki J. Murakami

MEDEA@PACT 28 - 34 2009.9

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1145/1621960.1621966
An Operand Routing Network for an SFQ Reconfigurable Data-Paths Processor Reviewed

I. Kataeva, H. Akaike, A. Fujimaki, N. Yoshikawa, N. Takagi, K. Inoue, H. Honda, and K. Murakami

IEEE Transactions on Applied Superconductivity 2009.6

Language：English
An operand routing network for an SFQ reconfigurable Data-Paths processor Reviewed

Irina Kataeva, Hiroyuki Akaike, Akira Fujimaki, Nobuyuki Yoshikawa, Naofumi Takagi, Koji Inoue, Hiroaki Honda, Kazuaki Murakami

IEEE Transactions on Applied Superconductivity 19 ( 3 ) 665 - 669 2009.6

Language：English

We report progress in the development of an Operand Routing Network (ORN) for an SFQ Reconfigurable Data-Paths processor (SFQ-RDP). The SFQ-RDP is implemented as a two-dimensional array of Floating-Point Units (FPUs), the outputs of which can be connected to the inputs of one or more FPUs in the next row via the ORN. We have considered two ORN architectures: one based on NDRO switches and the other on crossbar switches. The comparison shows that the crossbar-based ORN has better performance due to its regular pipelined structure. We have designed a crossbar switch with a multicasting function and a 1-to-2 ORN prototype for a 2.5 kA/cm² process. The circuits have been experimentally tested at frequencies up to 36 GHz.

DOI： 10.1109/TASC.2009.2018534
Reducing On-Chip DRAM Energy via Data Transfer Size Optimization

Takatsugu ONO, Koji INOUE, Kazuaki MURAKAMI, Kenji YOSHIDA

IEICE Transactions on Electronics E92-C ( 4 ) 433 - 443 2009.4

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1587/transele.e92.c.433
Reducing On-Chip DRAM energy via data transfer size optimization Reviewed

Takatsugu Ono, Koji Inoue, Kazuaki Murakami, Kenji Yoshida

IEICE Transactions on Electronics E92-C ( 4 ) 433 - 443 2009.1

Language：English Publishing type：Research paper (scientific journal)

This paper proposes a software-controllable variable line-size (SC-VLS) cache architecture for low-power embedded systems. High bandwidth between logic and DRAM is realized by means of advanced integration technology. System-in-Silicon is one of the architectural frameworks that realize this high bandwidth: an ASIC and a specific SRAM are mounted onto a silicon interposer, and each chip is connected to the interposer by eutectic solder bumps. In this framework, it is important to reduce the DRAM energy consumption. The specific DRAM needs a small cache memory to improve performance, and we exploit this cache to reduce the DRAM energy consumption. During program execution, the adequate cache line size, that is, the one that produces the lowest cache miss ratio, varies because the amount of spatial locality in memory references changes. A large cache line size offers a prefetching effect, but it consumes more DRAM energy than a small line size because a large number of banks are accessed. The SC-VLS cache can change its line size to an adequate one at runtime with small area and power overheads. We analyze the adequate line size and insert line-size change instructions at the beginning of each function of a target program before executing the program. In our evaluation, the SC-VLS cache reduces the DRAM energy consumption by up to 88% compared to a conventional cache with fixed 256 B lines.
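
As a rough illustration of the per-function line-size analysis described in this abstract, the following Python sketch picks, for each function, the candidate line size with the lowest miss ratio. It is not the authors' tool: the direct-mapped cache model, the function names `miss_ratio` and `pick_line_sizes`, the 1 KB cache size, and the two synthetic address traces are all assumptions made for the example, and the sketch ignores DRAM energy entirely.

```python
def miss_ratio(addresses, cache_bytes, line_bytes):
    """Miss ratio of a direct-mapped cache for a list of byte addresses."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # block number currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block containing this address
        index = block % num_lines      # direct-mapped placement
        if tags[index] != block:
            misses += 1
            tags[index] = block
    return misses / len(addresses)


def pick_line_sizes(traces, cache_bytes, candidates):
    """For each function, pick the line size with the lowest miss ratio.

    Ties are broken in favor of the smaller line size.
    """
    return {
        func: min(candidates,
                  key=lambda ls: (miss_ratio(addresses, cache_bytes, ls), ls))
        for func, addresses in traces.items()
    }


# Hypothetical traces: one with high spatial locality, one with conflicts.
traces = {
    "stream": list(range(0, 4096, 4)),   # sequential word accesses
    "scatter": [0, 1056] * 50,           # two addresses that collide with lines >= 64 B
}

print(pick_line_sizes(traces, 1024, [32, 64, 128, 256]))
# {'stream': 256, 'scatter': 32}
```

In this toy setting the sequential function prefers the largest line, while the conflicting one prefers the smallest, which is the kind of per-function variation the abstract refers to.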

DOI： 10.1587/transele.E92.C.433
A combined analytical and simulation-based model for performance evaluation of a reconfigurable instruction set processor.

Farhad Mehdipour, Hamid Noori, Bahman Javadi, Hiroaki Honda, Koji Inoue, Kazuaki J. Murakami

Proceedings of the 14th Asia South Pacific Design Automation Conference(ASP-DAC) 564 - 569 2009.1

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ASPDAC.2009.4796540
A combined analytical and simulation-based model for performance evaluation of a reconfigurable instruction set processor

Farhad Mehdipour, Hamid Noori, Bahman Javadi, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

Asia and South Pacific Design Automation Conference 2009, ASP-DAC 2009 Proceedings of the ASP-DAC 2009 Asia and South Pacific Design Automation Conference 2009 564 - 569 2009

Language：English Publishing type：Research paper (other academic)

Performance evaluation is a serious challenge in designing or optimizing reconfigurable instruction set processors. Conventional approaches based on synthesis and simulation are very time-consuming and require considerable design effort. A combined analytical and simulation-based model (CAnSO) is proposed and validated for performance evaluation of a typical reconfigurable instruction set processor. The proposed model consists of an analytical core that incorporates statistics gathered from cycle-accurate simulation to make a reasonable evaluation and provide valuable insight. Compared to cycle-accurate simulation results, CAnSO shows a variation of almost 2% in the speedup measurement.

DOI： 10.1109/ASPDAC.2009.4796540
Foreword Special section on hardware and software technologies on advanced microprocessors Reviewed

Koji Inoue, Koji Kai, Fumio Arakawa, Akihiko Inoue, Yoshio Hirose, Shorin Kyo, Keiji Kimura, Morihiro Kuga, Masaaki Kondo, Toshinori Sato, Makoto Satoh, Hiroyuki Tomiyama, Hiroshi Nakamura, Hiroo Hayashi, Masanori Hariyama, Hiroki Matsutani, Kunio Uchiyama

IEICE Transactions on Electronics E92-C ( 10 ) 2009

Language：English Publishing type：Research paper (scientific journal)

DOI： 10.1587/transele.E92.C.1231
Analyzing the impact of data prefetching on chip multiprocessors

Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami

13th IEEE Asia-Pacific Computer Systems Architecture Conference, ACSAC 2008 13th IEEE Asia-Pacific Computer Systems Architecture Conference, ACSAC 2008 2008.11

Language：English Publishing type：Research paper (other academic)

Data prefetching is a well-known approach to compensating for poor memory performance and has been employed in commercial processor chips. Although a number of prefetching techniques have been proposed so far, in many cases they assume single-core architectures. Chip multiprocessors (CMPs) have shared resources such as L2 caches and buses, so the effect of prefetching on a CMP should differ from that on a traditional single-core processor. In this paper, we analyze the effect of prefetching on CMP performance. We first classify the impact of prefetches issued during program execution and then quantitatively discuss the effect of prefetching on memory performance. The experimental results show that the negative effect of the invalidation of prefetched cache blocks is very small. In addition, we observe that current prefetch algorithms do not effectively exploit a feature of CMPs, namely cache-to-cache on-chip data transfer.

DOI： 10.1109/APCSAC.2008.4625454
Improved policies for Drowsy caches in embedded processors

Junpei Zushi, Gang Zeng, Hiroyuki Tomiyama, Hiroaki Takada, Koji Inoue

4th IEEE International Symposium on Electronic Design, Test and Applications, DELTA 2008 Proceedings - 4th IEEE International Symposium on Electronic Design, Test and Applications, DELTA 2008 362 - 367 2008.9

Language：English Publishing type：Research paper (other academic)

In the design of embedded systems, especially battery-powered systems, it is important to reduce energy consumption. Caches are now used not only in general-purpose processors but also in embedded processors. As feature sizes shrink, leakage energy has come to account for a significant portion of total energy consumption. To reduce the leakage energy of caches, the Drowsy cache was proposed, in which cache lines are periodically moved to a low-leakage mode without loss of their content. However, when a cache line in the low-leakage mode is accessed, one or more clock cycles are required to transition the line back to the normal mode before its content can be accessed. These penalty cycles may significantly degrade cache performance, especially in embedded processors without out-of-order execution. In this paper, we propose four mode transition policies that aim at high energy reduction with minimum performance degradation. We also compare our policies with existing policies in the context of embedded processors. Experimental results demonstrate the effectiveness of the proposed policies.
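
The trade-off described in this abstract can be made concrete with a small sketch of the simplest periodic policy, in which every line is put into the low-leakage mode at a fixed interval. The code below is only an illustration under stated assumptions, not one of the four policies proposed in the paper: the function name `count_wake_penalties`, the one-cycle wake-up penalty, and the access trace are hypothetical, and the model does not shift later accesses to account for stall cycles.

```python
def count_wake_penalties(accesses, num_lines, window, wake_penalty=1):
    """Count penalty cycles under a periodic "all lines drowsy" policy.

    accesses: list of (cycle, line_index) pairs sorted by cycle.
    window:   every `window` cycles, all lines are put into drowsy mode.
    """
    drowsy = [True] * num_lines      # assumption: all lines start drowsy
    next_sweep = window
    penalty = 0
    for cycle, line in accesses:
        # Apply every periodic sweep that happened up to this cycle.
        while cycle >= next_sweep:
            drowsy = [True] * num_lines
            next_sweep += window
        if drowsy[line]:
            penalty += wake_penalty  # line must be woken before the access
            drowsy[line] = False
    return penalty


# Hypothetical access trace: (cycle, cache line index).
trace = [(0, 0), (1, 0), (2, 1), (10, 0), (11, 1), (12, 1)]
print(count_wake_penalties(trace, num_lines=2, window=10))   # 4
print(count_wake_penalties(trace, num_lines=2, window=100))  # 2
```

A shorter window keeps lines in the low-leakage mode for more of the time but triggers more wake-ups, which is the tension a mode transition policy has to balance.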

DOI： 10.1109/DELTA.2008.70
Analyzing the impact of data prefetching on Chip MultiProcessors.

Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki J. Murakami

13th Asia-Pacific Computer Systems Architecture Conference(ACSAC) 2008.9

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/APCSAC.2008.4625454
Improving energy efficiency of configurable caches via temperature-aware configuration selection

Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki Murakami

IEEE Computer Society Annual Symposium on VLSI: Trends in VLSI Technology and Design, ISVLSI 2008 Proceedings - IEEE Computer Society Annual Symposium on VLSI Trends in VLSI Technology and Design, ISVLSI 2008 363 - 368 2008.9

Language：English Publishing type：Research paper (other academic)

Active power used to be the primary contributor to the total power dissipation of CMOS designs, but with technology scaling, the share of leakage in the total power consumption of digital systems continues to grow. Temperature is another factor, one that exponentially increases the leakage current. In this paper, we show the effect of temperature on the optimal (minimum-energy-consuming) cache configuration for low-energy embedded systems. Our results show that, for a given application and technology, the optimal cache size moves toward smaller caches at higher temperatures due to the larger leakage. Using a Temperature-Aware Configurable Cache (TACC), up to 61% of energy can be saved for the instruction cache and 77% for the data cache compared to a configurable cache that has been configured only for the corner-case temperature (100°C). The TACC also enhances performance by up to 28% and 17% for the instruction and data caches, respectively.

DOI： 10.1109/ISVLSI.2008.24
Design space exploration for a coarse grain accelerator

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Koji Inoue, Kazuaki Murakami

2008 Asia and South Pacific Design Automation Conference, ASP-DAC 2008 Asia and South Pacific Design Automation Conference, ASP-DAC 685 - 690 2008.8

Language：English Publishing type：Research paper (other academic)

In the design process of a reconfigurable accelerator employed in an embedded system, a multitude of parameters may result in remarkable complexity and a large design space. Design space exploration, as an alternative to the quantitative approach, can be employed to find the right balance between the different design parameters. In this paper, a hybrid approach is introduced that analytically explores the design space for a coarse-grain accelerator and determines a sound design point by quantitatively exploiting data extracted from applications. It also provides flexibility for taking into account new design constraints as well as new characteristics of applications. Furthermore, the approach is methodical: it reduces the design time and yields a design point that satisfies the design goals.

DOI： 10.1109/ASPDAC.2008.4484039
Enhancing energy efficiency of processor-based embedded systems through post-fabrication ISA extension.

Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki J. Murakami

Proceedings of the 2008 International Symposium on Low Power Electronics and Design(ISLPED) 241 - 246 2008.8

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1145/1393921.1393987
Proposal of a Desk-Side Supercomputer with Reconfigurable Data-Paths Using Rapid Single-Flux-Quantum Circuits

N. Takagi, K. Murakami, A. Fujimaki, N. Yoshikawa, K. Inoue, and H. Honda

IEICE Transactions on Electronics 2008.7

Language：English
A Gravity-Directed Temporal Partitioning Approach Reviewed

F. Mehdipour, H. Noori, H. Honda, K. Inoue, and K. Murakami

IEICE Electronics Express, Vol. 5, No. 10, pp.366-373 2008.5

Language：English Publishing type：Research paper (scientific journal)
A gravity-directed temporal partitioning approach Reviewed

Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

IEICE Electronics Express 5 ( 10 ) 366 - 373 2008.5

Language：English Publishing type：Research paper (scientific journal)

Reconfiguration latency has a significant impact on system performance in reconfigurable systems. A temporal partitioning approach is introduced for partitioning data flow graphs for a reconfigurable system comprising partially programmable fine-grained hardware. Residing eligibility, inspired by the law of universal gravitation, is introduced to express the eligibility of a node to stay in succeeding configurations (partitions) and to prevent it from being swapped in and out. Partitioning based on residing eligibility causes fewer nodes with different functionalities to be assigned to subsequent partitions. Thus, both the reconfiguration overhead time and the unused hardware space decrease because consecutive configurations share common parts.

DOI： 10.1587/elex.5.366
A gravity-directed temporal partitioning approach.

Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki J. Murakami

IEICE Electronics Express 5 ( 10 ) 366 - 373 2008.5

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1587/elex.5.366
Temperature-Aware Configurable Cache to Reduce Energy in Embedded Systems International journal

H. Noori, M. Goudarzi, K. Inoue, and K. Murakami

IEICE Transactions on Electronics 2008.4

Language：English Publishing type：Research paper (scientific journal)
Improving Energy Efficiency of Configurable Caches via Temperature-Aware Configuration Selection.

Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki J. Murakami

IEEE Computer Society Annual Symposium on VLSI(ISVLSI) 363 - 368 2008.4

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ISVLSI.2008.24
Temperature-Aware Configurable Cache to Reduce Energy in Embedded Systems.

Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki J. Murakami

IEICE Transactions on Electronics 91-C ( 4 ) 418 - 431 2008.4

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1093/ietele/e91-c.4.418
A reconfigurable functional unit with conditional execution for multi-exit custom instructions Reviewed

Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki Murakami

IEICE Transactions on Electronics E91-C ( 4 ) 497 - 508 2008.4

Language：English Publishing type：Research paper (scientific journal)

Encapsulating critical computation subgraphs as application-specific instruction set extensions is an effective technique to enhance the performance of embedded processors. However, custom functional units must be added to the base processor to support the execution of these custom instructions. Although automated tools have been developed to reduce the long design time needed to produce a new extensible processor for each application, short time-to-market and significant non-recurring engineering and design costs remain issues. To address these concerns, we introduce an adaptive extensible processor in which custom instructions are generated and added after chip fabrication. To support this feature, custom functional units (CFUs) are replaced by a reconfigurable functional unit (RFU). The proposed RFU is based on a multi-cycle matrix of functional units with the capability of conditional execution. A quantitative approach is used to derive an efficient architecture for the RFU and fix its constraints. To generate more effective custom instructions, they are extended across basic blocks, and hence multi-exit custom instructions are proposed. Conditional execution has been added to the RFU to support the multi-exit feature of custom instructions. Experimental results show that multi-exit custom instructions enhance performance by an average of 67% compared to custom instructions limited to one basic block. A maximum speedup of 4.7 and an average speedup of 1.85, compared to a general embedded processor, were achieved on the MiBench benchmark suite.

DOI： 10.1093/ietele/e91-c.4.497
Proposal of a Desk-Side Supercomputer with Reconfigurable Data-Paths Using Rapid Single-Flux-Quantum Circuits.

Naofumi Takagi, Kazuaki J. Murakami, Akira Fujimaki, Nobuyuki Yoshikawa, Koji Inoue, Hiroaki Honda

IEICE Transactions on Electronics 91-C ( 3 ) 350 - 355 2008.3

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1093/ietele/e91-c.3.350
Improved Policies for Drowsy Caches in Embedded Processors.

Junpei Zushi, Gang Zeng, Hiroyuki Tomiyama, Hiroaki Takada, Koji Inoue

4th IEEE International Symposium on Electronic Design, Test and Applications(DELTA) 362 - 367 2008.3

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/DELTA.2008.70
Design space exploration for a coarse grain accelerator.

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Koji Inoue, Kazuaki J. Murakami

Proceedings of the 13th Asia South Pacific Design Automation Conference(ASP-DAC) 685 - 690 2008.3

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ASPDAC.2008.4484039
An architecture framework for an adaptive extensible processor.

Hamid Noori, Farhad Mehdipour, Kazuaki J. Murakami, Koji Inoue, Morteza Saheb Zamani

The Journal of Supercomputing 45 ( 3 ) 313 - 340 2008.2

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1007/s11227-008-0174-4
A Reconfigurable Functional Unit with Conditional Execution for Multi-Exit Custom Instructions.

Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki J. Murakami

IEICE Transactions on Electronics 91-C ( 4 ) 497 - 508 2008.1

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1093/ietele/e91-c.4.497
Enhancing energy efficiency of processor-based embedded systems through post-fabrication ISA extension

Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki Murakami

ISLPED'08: 13th ACM/IEEE International Symposium on Low Power Electronics and Design ISLPED'08 Proceedings of the 2008 International Symposium on Low Power Electronics and Design 241 - 246 2008

Language：English Publishing type：Research paper (other academic)

Application-specific instruction set extension is an effective technique for reducing accesses to components such as on- and off-chip memories and the register file, and for enhancing energy efficiency. However, supporting custom instructions requires adding custom functional units to the base processor, which is becoming an issue due to the increasing manufacturing and design costs of new nanometer-scale technologies and shorter time-to-market. To address these issues, our proposed approach uses an optimized reconfigurable functional unit instead, and instruction set customization is done after chip fabrication. Therefore, the low-energy benefit of customization is obtained while the flexibility of a conventional microprocessor is maintained. Experimental results show that the maximum and average energy savings are 67% and 22%, respectively, for our proposed architecture framework.

DOI： 10.1145/1393921.1393987
Performance prediction of large-scale parallel system and application using macro-level simulation

Ryutaro Susukita, Yasunori Kimura, Hisashige Ando, Hidemi Komatsu, Mutsumi Aoyagi, Motoyoshi Kurokawa, Hiroaki Honda, Kazuaki J. Murakami, Yuichi Inadomi, Hidetomo Shibamura, Koji Inoue, Shuji Yamamura, Shigeru Ishizuki, Yunqing Yu

2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008 2008

Language：English Publishing type：Research paper (other academic)

Predicting application performance on an HPC system is an important technology for designing the computing system and developing applications. However, accurate prediction is a challenge, particularly for a future system with higher performance. In this paper, we present a new method for predicting application performance on HPC systems. The method combines modeling of sequential performance on a single processor with macro-level simulation of applications for parallel performance on the entire system. In the simulation, the execution flow is traced, but kernel computations are omitted to reduce the execution time. Validation on a real terascale system showed that the predicted and measured performance agreed within 10% to 20%. We employed the method in designing a hypothetical petascale system of 32768 SIMD-extended processor cores. For predicting application performance on the petascale system, the macro-level simulation required several hours.

DOI： 10.1109/SC.2008.5220091
Performance evaluation of a reconfigurable instruction set processor

Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

2008 International SoC Design Conference, ISOCC 2008 2008 International SoC Design Conference, ISOCC 2008 I184 - I187 2008

Language：English Publishing type：Research paper (other academic)

Performance evaluation is a serious challenge in designing or optimizing reconfigurable instruction set processors. A combined analytical and simulation-based model (CAnSO) is proposed and validated for performance evaluation of a typical reconfigurable instruction set processor. The proposed model consists of an analytical core that incorporates statistics gathered from cycle-accurate simulation to make a reasonable evaluation. CAnSO has speed advantages and, compared to cycle-accurate simulation, shows a variation of almost 2% in the speedup measurement.

DOI： 10.1109/SOCDC.2008.4815603
Improving Performance and Energy Saving in a Reconfigurable Processor via Accelerating Control Data Flow Graphs

F. Mehdipour, H. Noori, M. S. Zamani, K. Inoue, and K. Murakami

IEICE Transactions on Electronics 2007.12

Language：English
Improving Performance and Energy Saving in a Reconfigurable Processor via Accelerating Control Data Flow Graphs.

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Koji Inoue, Kazuaki J. Murakami

IEICE Transactions on Information & Systems 90-D ( 12 ) 1956 - 1966 2007.12

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1093/ietisy/e90-d.12.1956
At the cutting edge of a petascale computing world: An overview of Petascale System Interconnect project

Kazuaki J. Murakami, Feng Long Gu, Mutsumi Aoyagi, Takeshi Nanri, Koji Inoue

5th International Conference on Computational Methods in Science and Engineering, ICCMSE 2007 Computational Methods in Science and Engineering - Theory and Computation Old Problems and New Challenges, Lectures Presented at the Int. Conf. Computational Methods in Sci. Eng. 2007 ICCMSE 2007 23 - 38 2007.12

Language：English Publishing type：Research paper (other academic)

This talk presents an overview of the Petascale System Interconnect (PSI) project. The PSI project is one of the national projects on "Fundamental Technologies for the Next Generation Supercomputing" of MEXT (Ministry of Education, Culture, Sports, Science and Technology), Japan. The goal of the PSI project is to develop technologies enabling petascale supercomputing systems with hundreds of thousands of computing nodes. The PSI project consists of three subprojects that tackle three fundamental technologies: subproject 1 covers small and efficient optical packet switches; subproject 2 covers low-cost, high-performance MPI communications; and subproject 3 covers methodologies for evaluating and estimating the performance of petascale systems. With the successful completion of the PSI project, the Japan Next-Generation Supercomputer R&D Center (NSC) will take the technologies to build Japan's next-generation supercomputer, which is expected to be over 70 times faster than the current fastest supercomputers.

DOI： 10.1063/1.2827008
Design of a reconfigurable data-path prototype in the single-flux-quantum circuit Reviewed

S. Iwasaki, M. Tanaka, Y. Yamanashi, H. Park, H. Akaike, A. Fujimaki, N. Yoshikawa, N. Takagi, K. Murakami, H. Honda, K. Inoue

Superconductor Science and Technology 20 ( 11 ) S328 - S331 2007.11

Language：English

We have designed a reconfigurable data-path (RDP) prototype based on the single-flux-quantum (SFQ) circuit. The RDP serves as an accelerator for a high-performance computer and is composed of many stages of arrays of floating-point units (FPUs) connected by reconfigurable operand routing networks (ORNs). The FPU array usually includes shift registers (SRs) so that data can be forwarded to the next stage without calculation. The data-path is reconfigured to reflect a long, repeated instruction appearing in large-scale calculations. We can implement parallel and pipelined processing without memory access in such calculations, reducing the required bandwidth between memory and the microprocessor. When the RDP is built from SFQ circuits, the SFQ high-speed network switches and bit-serial/slice FPUs reduce circuit area and power consumption compared to semiconductor devices. As a first step in the development of the SFQ-RDP, we design a 2 × 2 RDP prototype composed of two arrays of dual arithmetic logic units (ALUs). The prototype also has dual SRs in each array and four ORNs. We use bit-serial ALUs designed to operate at 25 GHz. Each ORN behaves like a 4 × 2 crossbar switch. We have demonstrated reconfiguration in the RDP prototype, which is made up of 15 050 Josephson junctions, though only some of the ALU functions are available.

DOI： 10.1088/0953-2048/20/11/S06
A Next-Generation Enterprise Server System with Advanced Cache Coherence Chips

M. Sakamoto, A. Katsuno, G. Sugizaki, T. Yoshida, A. Inoue, K. Inoue, and K. Murakami

IEICE Transactions on Electronics 2007.10

Language：English
A Next-Generation Enterprise Server System with Advanced Cache Coherence Chips.

Mariko Sakamoto, Akira Katsuno, Go Sugizaki, Toshio Yoshida, Aiichiro Inoue, Koji Inoue, Kazuaki J. Murakami

IEICE Transactions on Electronics 90-C ( 10 ) 1972 - 1982 2007.10

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1093/ietele/e90-c.10.1972
Multi-physics Extension of OpenFMO Framework

Toshiya Takami, Jun Maki, Jun-ichi Ooba, Yuichi Inadomi, Hiroaki Honda, Ryutaro Susukita, Koji Inoue, Taizo Kobayashi, Rie Nogita, Mutsumi Aoyagi

CoRR abs/0707.2630 2007.9

Language：Others Publishing type：Research paper (scientific journal)
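Fast, Accurate Memory Architecture Simulation Technique Using Memory Access Characteristics

Takatsugu Ono, Koji Inoue, Kazuaki Murakami

IPSJ Transactions on Advanced Computing Systems 2007.8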

Language：Japanese
Fast, Accurate Memory Architecture Simulation Technique Using Memory Access Characteristics

Takatsugu Ono, Koji Inoue, Kazuaki Murakami

IPSJ Transactions on Advanced Computing Systems (ACS) 48 ( 13 ) 203 - 213 2007.8

Language：Japanese

This paper proposes a fast and accurate memory architecture simulation technique. The first steps in designing a memory architecture commonly involve trace-driven simulation. However, evaluation time increases as the design space expands. Simulation can be made faster by reducing the trace size, but this reduces simulation accuracy. Our approach reduces the simulation time while maintaining the accuracy of the simulation results. To evaluate the validity of the proposed technique, we measured the cache miss ratio. In our evaluation, the proposed technique reduces the trace size by 98.8%, and the cache miss ratio differs by 0.067 percentage points on average.

DOI： 10.15017/8308
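An MPI Rank Placement Optimization Technique for Reducing Contention with Communication Timing Taken into Account

Yoshiyuki Morie, Naoki Sueyasu, Toru Matsumoto, Takeshi Nanri, Hiroaki Ishihata, Koji Inoue, Kazuaki Murakami

IPSJ Transactions on Advanced Computing Systems 2007.8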

Language：Japanese
Handling Control Data Flow Graphs for a Tightly Coupled Reconfigurable Accelerator.

Hamid Noori, Farhad Mehdipour, Morteza Saheb Zamani, Koji Inoue, Kazuaki J. Murakami

Embedded Software and Systems(ICESS) 249 - 260 2007.5

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1007/978-3-540-72685-2_24
Interactive presentation: Generating and executing multi-exit custom instructions for an adaptive extensible processor.

Hamid Noori, Farhad Mehdipour, Kazuaki J. Murakami, Koji Inoue, Maziar Goudarzi

2007 Design, Automation and Test in Europe Conference and Exposition(DATE) 325 - 330 2007.4

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/DATE.2007.364612
The effect of temperature on cache size tuning for low energy embedded systems.

Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki J. Murakami

Proceedings of the 17th ACM Great Lakes Symposium on VLSI 2007 453 - 456 2007.3

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1145/1228784.1228891
The Effect of Nanometer-Scale Technologies on the Cache Size Selection for Low Energy Embedded Systems.

Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki J. Murakami

Proceedings of the 2007 International Conference on Embedded Systems & Applications(ESA) 169 - 176 2007.1

Language：Others Publishing type：Research paper (other academic)
Generating and executing multi-exit custom instructions for an adaptive extensible processor

Hamid Noori, Farhad Mehdipour, Kazuaki Murakami, Koji Inoue, Maziar Goudarzi

2007 Design, Automation and Test in Europe Conference and Exhibition Proceedings - 2007 Design, Automation and Test in Europe Conference and Exhibition, DATE 2007 325 - 330 2007

Language：English Publishing type：Research paper (other academic)

An effective technique for improving the performance of embedded processors is to collapse critical computation subgraphs into application-specific instruction set extensions and execute them on custom functional units. The problems with this approach are the immense cost and long design time. To address these issues, we propose an adaptive extensible processor in which custom instructions (CIs) are generated and added after chip fabrication. To support this feature, custom functional units are replaced by a reconfigurable matrix of functional units with the capability of conditional execution. Unlike previously proposed CIs, ours can include multiple exits. Experimental results show that multi-exit CIs enhance performance by 46% on average compared to CIs limited to one basic block. A maximum speedup of 2.89 and an average speedup of 1.66, compared to a 4-issue in-order RISC processor, were achieved on the MiBench benchmark suite.

DOI： 10.1109/DATE.2007.364612
The effect of temperature on cache size tuning for low energy embedded systems

Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki Murakami

17th Great Lakes Symposium on VLSI, GLSVLSI'07 GLSVLSI'07 Proceedings of the 2007 ACM Great Lakes Symposium on VLSI 453 - 456 2007

Language：English Publishing type：Research paper (other academic)

Energy consumption is a major concern in embedded computing systems. Several studies have shown that cache memories account for about 40% or more of the total energy consumed in these systems. In older technology nodes, active power was the primary contributor to the total power dissipation of a CMOS design. However, with the scaling of feature sizes, the share of leakage in the total power consumption of digital systems continues to grow. Temperature is a factor that exponentially increases the leakage current. In this paper, we show the effects of temperature on the selection of the optimal cache size for low-energy embedded systems. Our results show that, for a given application, the optimal cache size selection is affected by temperature. Our experiments were done for a 100 nm technology. Our study reveals that the cache size selection for different temperatures depends on the rate at which the cache miss rate increases as the cache size is reduced. When the miss rate increases sharply, the optimal point is the same for all examined temperatures; when the increase is more gradual, the optimal points for different temperatures move farther apart.

DOI： 10.1145/1228784.1228891
Multi-physics extension of OpenFMO framework

Toshiya Takami, Jun Maki, Jun'ichi Ooba, Yuuichi Inadomi, Hiroaki Honda, Ryutaro Susukita, Koji Inoue, Taizo Kobayashi, Rie Nogita, Mutsumi Aoyagi

International Conference on Computational Methods in Science and Engineering 2007, ICCMSE 2007 Computation in Modern Science and Engineering - Proceedings of the International Conference on Computational Methods in Science and Engineering 2007 (ICCMSE 2007) 122 - 125 2007

Language：English Publishing type：Research paper (other academic)

The OpenFMO framework, an open-source software (OSS) platform for the Fragment Molecular Orbital (FMO) method, is extended to multi-physics simulations (MPS). After reviewing several FMO implementations on distributed computing environments, we present the subsequent development plan for MPS. We also discuss which form scientific software should take: lightweight and reconfigurable, or large and self-contained.

DOI： 10.1063/1.2835969
Implementation and evaluation of Fock matrix calculation program on the Cell processor

Hiroaki Honda, Tetsuo Hayashi, Yuichi Inadomi, Koji Inoue, Kazuaki J. Murakami

International Conference on Computational Methods in Science and Engineering 2007, ICCMSE 2007 Computation in Modern Science and Engineering - Proceedings of the International Conference on Computational Methods in Science and Engineering 2007 (ICCMSE 2007) 64 - 67 2007

Language：English Publishing type：Research paper (other academic)

Various processor architectures have been proposed to date, and performance has improved remarkably. Recently, chip multiprocessors (CMPs), which integrate many processor cores on a single chip, have been proposed for further performance improvement. The Cell processor is one such CMP and shows high computational performance. Although this processor is designed for multimedia workloads, its high performance can also be applied to molecular orbital calculations. In this study, we implemented a Fock matrix construction program on the Cell processor and evaluated its computational performance. We found two main kinds of stalls, caused by branch prediction and data alignment, both of which are controlled by software in order to simplify the Cell processor hardware. Performance could be improved by about 30% if the branch prediction hit ratio were raised to 99%. As for data alignment, the portion of stalls originating in the data shuffle pipeline could be reduced by providing a hardware data alignment mechanism.

DOI： 10.1063/1.2836167
Handling control data flow graphs for a tightly coupled reconfigurable accelerator

Hamid Noori, Farhad Mehdipour, Morteza Saheb Zamani, Koji Inoue, Kazuaki Murakami

3rd International Conference on Embedded Software and Systems, ICESS 2007 Embedded Software and Systems - Third International Conference, ICESS 2007, Proceedings 249 - 260 2007

Language：English Publishing type：Research paper (other academic)

In an embedded system that integrates a base processor with a tightly coupled accelerator, extracting frequently executed portions of the code (hot portions) and executing their corresponding data flow graphs (DFGs) on the accelerator brings about greater speedup. In this paper, we present our motivations for handling control instructions in DFGs and extending them to Control DFGs (CDFGs). In addition, basic requirements for an accelerator with conditional execution support are proposed. Moreover, some algorithms are presented for temporal partitioning of CDFGs that consider the architectural specifications of the target accelerator. To show the effectiveness of the proposed ideas, we applied them to the accelerator of an extensible processor called AMBER. Experimental results demonstrate the effectiveness of covering control instructions and using CDFGs instead of DFGs.

DOI： 10.1007/978-3-540-72685-2_24
Return Address Protection on Cache Memories.

Koji Inoue

IEICE Transactions on Electronics 89-C ( 12 ) 1937 - 1947 2006.12

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1093/ietele/e89-c.12.1937
Supporting A Dynamic Program Signature: An Intrusion Detection Framework for Microprocessors.

Koji Inoue

13th IEEE International Conference on Electronics, Circuits, and Systems(ICECS) 160 - 163 2006.12

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ICECS.2006.379744
Lock and Unlock: A Data Management Algorithm for A Security-Aware Cache.

Koji Inoue

13th IEEE International Conference on Electronics, Circuits, and Systems(ICECS) 1093 - 1096 2006.12

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ICECS.2006.379629
Special section on VLSI Design and CAD Algorithms Reviewed

Hidetoshi Onodera, Makoto Ikeda, Tohru Ishihara, Tsuyoshi Isshiki, Koji Inoue, Kenichi Okada, Seiji Kajihara, Mineo Kaneko, Hiroshi Kawaguchi, Shinji Kimura, Morihiro Kuga, Atsushi Kurokawa, Takashi Sato, Toshiyuki Shibuya, Yoichi Shiraishi, Kazuyoshi Takagi, Atsushi Takahashi, Yoshinori Takeuchi, Nozomu Togawa, Hiroyuki Tomiyama, Yuichi Nakamura, Kiyoharu Hamaguchi, Yukiya Miura, Shin Ichi Minato, Ryuichi Yamaguchi, Masaaki Yamada, Yasushi Yuminaka, Takayuki Watanabe, Masanori Hashimoto, Masayuki Miyazaki

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E89-A ( 12 ) 3377 2006.12

Language：English

DOI： 10.1093/ietfec/e89-a.12.3377
An Integrated Temporal Partitioning and Mapping Framework for Handling Custom Instructions on a Reconfigurable Functional Unit.

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki J. Murakami, Mehdi Sedighi, Koji Inoue

Advances in Computer Systems Architecture 219 - 230 2006.9

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1007/11859802_18
Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit.

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki J. Murakami, Koji Inoue, Mehdi Sedighi

Embedded and Ubiquitous Computing(EUC) 722 - 731 2006.8

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1007/11802167_73
A Reconfigurable Functional Unit for an Adaptive Dynamic Extensible Processor.

Hamid Noori, Farhad Mehdipour, Kazuaki J. Murakami, Koji Inoue, Morteza Saheb Zamani

Proceedings of the 2006 International Conference on Field Programmable Logic and Applications (FPL) 1 - 4 2006.8

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/FPL.2006.311313
A reconfigurable functional unit for an adaptive dynamic extensible processor

Hamid Noori, Farhad Mehdipour, Kazuaki Murakami, Koji Inoue, Morteza Sahebzamani

2006 International Conference on Field Programmable Logic and Applications, FPL Proceedings - 2006 International Conference on Field Programmable Logic and Applications, FPL 781 - 784 2006

Language：English Publishing type：Research paper (other academic)

This paper presents a reconfigurable functional unit (RFU) for an adaptive dynamic extensible processor. The processor can tune its extended instructions to the target applications after chip fabrication. The custom instructions (CIs) are generated from the hot basic blocks during the training mode. In the normal mode, CIs are executed on the RFU. A quantitative approach was used for designing the RFU. The RFU is a matrix of functional units with 8 inputs and 6 outputs. Using the proposed RFU, a speedup of up to 1.25 is obtained for 22 applications of MiBench. This processor needs no extra opcodes for CIs, no new compiler, and no source code modification or recompilation.

DOI： 10.1109/FPL.2006.311313
Supporting a dynamic program signature: An intrusion detection framework for microprocessors

Koji Inoue

ICECS 2006 - 13th IEEE International Conference on Electronics, Circuits and Systems ICECS 2006 - 13th IEEE International Conference on Electronics, Circuits and Systems 160 - 163 2006

Language：English Publishing type：Research paper (other academic)

To address computer security issues, a hardware-based intrusion detection technique is proposed. This uses the dynamic program execution behavior for authentication. Based on secret key information, an execution behavior is determined. Next, a secure compiler constructs object code which generates the predetermined execution behavior at runtime. During program execution, a secure profiler monitors the execution behavior. If the profiler cannot detect the expected behavior, it sends an alarm signal to the microprocessor for terminating program execution. Since attack code cannot anticipate the execution behavior required, malicious attacks can be detected and prohibited at the start of program execution.

DOI： 10.1109/ICECS.2006.379744
Lock and unlock: A data management algorithm for a security-aware cache

Koji Inoue

ICECS 2006 - 13th IEEE International Conference on Electronics, Circuits and Systems ICECS 2006 - 13th IEEE International Conference on Electronics, Circuits and Systems 1093 - 1096 2006

Language：English Publishing type：Research paper (other academic)

This paper proposes an efficient cache line management algorithm for a security-aware cache architecture (SCache). SCache attempts to detect the corruption of return address values at runtime. When a return address store is executed, the cache generates a replica of the return address. This copied data is treated as read only. Subsequently, when the corresponding return address load is performed, the cache verifies the return address value loaded from the memory stack by means of comparing it with the replica data. Unfortunately, since the replica data is also a candidate for cache line replacements, SCache does not work well for application programs that cause higher cache miss rates. To resolve this issue, a lock and unlock data management algorithm is proposed in order to improve the security of SCache. The experimental results show that a proposed SCache model can protect about 99% of return address loads from the threat of buffer overflow attacks, while it worsens the processor performance by only 1%, compared with a non-secure conventional cache.

DOI： 10.1109/ICECS.2006.379629
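The replica check described in the abstract above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class and method names (`ReplicaCheckedStack`, `store_return_address`, `evict_replica`, and so on) are hypothetical, dictionaries stand in for the memory stack and the cache, and the lock and unlock algorithm itself is not modeled because the abstract does not give its details. The sketch shows only the baseline behavior: a replica is created on a return address store, compared on the matching load, and protection is lost if the replica has been evicted.

```python
class ReplicaCheckedStack:
    """Toy model of return address verification with read-only replicas.

    Hypothetical names throughout; this is not the SCache hardware design.
    """

    def __init__(self):
        self.memory = {}    # stands in for the memory stack
        self.replicas = {}  # stands in for replica lines held in the cache

    def store_return_address(self, addr, value):
        # A return address store writes memory and creates a replica.
        self.memory[addr] = value
        self.replicas[addr] = value

    def store_data(self, addr, value):
        # An ordinary store (for example an overflowing buffer write)
        # changes memory but cannot change the read-only replica.
        self.memory[addr] = value

    def evict_replica(self, addr):
        # Models a cache line replacement that discards the replica.
        self.replicas.pop(addr, None)

    def load_return_address(self, addr):
        # Compare the loaded value with the replica, if one survives.
        value = self.memory[addr]
        replica = self.replicas.pop(addr, None)
        if replica is None:
            return value, "unverified"
        if replica != value:
            return value, "corrupted"
        return value, "ok"


stack = ReplicaCheckedStack()
stack.store_return_address(0x1000, 0x400)
stack.store_data(0x1000, 0xDEAD)          # simulated overwrite
print(stack.load_return_address(0x1000))  # status is "corrupted"
```

The "unverified" outcome corresponds to the weakness the abstract mentions: when the replica is replaced before the return address load, the load cannot be checked.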
Custom instruction generation using temporal partitioning techniques for a reconfigurable functional unit

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Koji Inoue, Mehdi Sedighi

International Conference on Embedded and Ubiquitous Computing, EUC 2006 Embedded and Ubiquitous Computing - International Conference, EUC 2006, Proceedings 722 - 731 2006

Language：English Publishing type：Research paper (other academic)

Extracting appropriate custom instructions is an important phase for implementing an application on an extensible processor with a reconfigurable functional unit (RFU). Custom instructions (CIs) are usually extracted from critical portions of applications. It may not be possible to meet all of the RFU constraints when CIs are generated. This paper addresses the generation of mappable CIs on an RFU. In this paper, our proposed RFU architecture for an adaptive dynamic extensible processor is described. Then, an integrated framework for temporal partitioning and mapping is presented to partition and map the CIs on RFU. In this framework, two mapping aware temporal partitioning algorithms are used to generate CIs. Temporal partitioning iterates and modifies partitions incrementally to generate CIs. Using this framework brings about more speedup for the extensible processor.

DOI： 10.1007/11802167_73
An integrated temporal partitioning and mapping framework for handling custom instructions on a reconfigurable functional unit

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Mehdi Sedighi, Koji Inoue

11th Asia-Pacific Conference on Advances in Computer Systems Architecture, ACSAC 2006 Advances in Computer Systems Architecture - 11th Asia-Pacific Conference, ACSAC 2006, Proceedings 219 - 230 2006

Language：English Publishing type：Research paper (other academic)

Extensible processors allow customization for an application by extending the core instruction set architecture. Extracting appropriate custom instructions is an important phase for implementing an application on an extensible processor with a reconfigurable functional unit. Custom instructions (CIs) usually are extracted from critical portions of applications. This paper presents approaches for CI generation with respect to the RFU constraints to improve speedup of the extensible processor. First, our proposed RFU architecture for an adaptive dynamic extensible processor called AMBER is described. Then, an integrated temporal partitioning and mapping framework is presented to partition and map the CIs on the RFU. In this framework, a mapping aware temporal partitioning algorithm is used to generate CIs which are mappable on the RFU. Temporal partitioning iterates and modifies partitions incrementally to generate CIs. In addition, a mapping algorithm is presented which supports CIs with critical path length more than the RFU depth.

DOI： 10.1007/11859802_18
Adaptive Mode Control for Low-Power Caches Based on Way-Prediction Accuracy.

Hidekazu Tanaka, Koji Inoue

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 88-A ( 12 ) 3274 - 3281 2005.12

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1093/ietfec/e88-a.12.3274
A Cost Effective Spacial Redundancy with Data-Path Partitioning.

Shigeharu Matsusaka, Koji Inoue

Third International Conference on Information Technology and Applications (ICITA 2005) 51 - 56 2005.7

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ICITA.2005.7
Quantitative Evaluation of State-Preserving Leakage Reduction Algorithm for L1 Data Caches.

Reiko Komiya, Koji Inoue, Vasily G. Moshnyaga, Kazuaki J. Murakami

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 88-A ( 4 ) 862 - 868 2005.4

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1093/ietfec/e88-a.4.862
Energy-security tradeoff in a secure cache architecture against buffer overflow attacks.

Koji Inoue

SIGARCH Computer Architecture News 33 ( 1 ) 81 - 89 2005.3

Language：Others Publishing type：Research paper (scientific journal)

DOI： 10.1145/1055626.1055638
Low-power cache design

Vasily G. Moshnyaga, Koji Inoue

Low-Power Processors and Systems on Chips 8 - 1-8-21 2005.1

Language：English

Cache memories are the most area- and energy-consuming units in today’s microprocessors. As the speed disparity between processor and external memory increases, designers try to put large multilevel caches on a chip to reduce the number of external memory accesses and thus boost the system performance. (See Table 8.1 for a survey of the on-die caches for several recent high-end microprocessors.) On-chip data and instruction caches are implemented using arrays of densely packed static RAM cells. The device count for the caches often exceeds the number of transistors devoted to the processor’s datapath and controller. For example, the Alpha21364 [3] and PA-RISC Maco [5] microprocessors have over 90% of their transistors in RAM, with most of them dedicated to caches; the Itanium2 [1] has 80% in caches, the IBM G5 [7] has 72%, the PowerPC [8] has 71%, and Strong-ARM110 [9] has 70%. Due to the large load capacitance and high access rate, these caches account for a significant portion of the overall power dissipation (e.g., 35% in Itanium2 [1]; 43% in Strong-ARM [9]). Therefore, optimizing caches for power is increasingly important. Although much work on energy reduction has taken place in the circuit and technology domains [10,11], interest in cache design for power efficiency at the architectural level continues to increase. Architecture is the entry point in the cache design hierarchy, and decisions taken at this level can drastically affect the efficiency of the design.
A cost effective spatial redundancy with data-path partitioning

Shigeharu Matsusaka, Koji Inoue

3rd International Conference on Information Technology and Applications, ICITA 2005 Proceedings - 3rd International Conference on Information Technology and Applications, ICITA 2005 51 - 56 2005

Language：English Publishing type：Research paper (other academic)

In order to maintain the high reliability of a computer system, it is necessary to detect the failure leading to a fault. In general, faults can be detected by exploiting time redundancy or spatial redundancy. However, these approaches negatively affect either hardware cost or processor performance. To solve this cost-performance issue, we propose a cost-effective approach to achieving spatial redundancy for dependable processors. In addition, we perform a preliminary evaluation of the impact of our method on processor performance.

DOI： 10.1109/ICITA.2005.7
A low-power I-cache design with tag-comparison reuse.

Koji Inoue, Hidekazu Tanaka, Vasily G. Moshnyaga, Kazuaki J. Murakami

Proceedings of the 2004 International Symposium on System-on-Chip(SoC) 61 - 67 2004.11

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ISSOC.2004.1411147
Low-power cache design

Vasily G. Moshnyaga, Inoue Koji

Low-Power Electronics Design 25 - 1-25-21 2004.1

Language：English

Cache memories are the most area- and energy-consuming units in today’s microprocessors. As the speed disparity between processor and external memory increases, designers try to put large multilevel caches on a chip to reduce the number of external memory accesses and thus boost the system performance. (See Table 25.1 for a survey of the on-die caches for several recent high-end microprocessors.) On-chip data and instruction caches are implemented using arrays of densely packed static RAM cells. The device count for the caches often exceeds the number of transistors devoted to the processor’s datapath and controller. For example, the Alpha21364 [3] and PA-RISC Maco [5] microprocessors have over 90% of their transistors in RAM, with most of them dedicated to caches; the Itanium2 [1] has 80% in caches, the IBM G5 [7] has 72%, the PowerPC [8] has 71%, and Strong-ARM110 [9] has 70%. Due to the large load capacitance and high access rate, these caches account for a significant portion of the overall power dissipation (e.g., 35% in Itanium2 [1]; 43% in Strong-ARM [9]). Therefore, optimizing caches for power is increasingly important. Although much work on energy reduction has taken place in the circuit and technology domains [10,11], interest in cache design for power efficiency at the architectural level continues to increase. Architecture is the entry point in the cache design hierarchy, and decisions taken at this level can drastically affect the efficiency of the design.
A low-power I-cache design with tag-comparison reuse

Koji Inoue, Hidekazu Tanaka, Vasily G. Moshnyaga, Kazuaki Murakami

2004 International Symposium on System-on-Chip 2004 International Symposium on System-on-Chip Proceedings 61 - 67 2004

Language：English Publishing type：Research paper (other academic)

This paper reports design and evaluation results of a low-energy I-cache architecture, called history-based tag-comparison (HBTC) cache. The HBTC cache attempts to re-use tag-comparison results to detect and eliminate unnecessary memory-array activations. We have performed cycle accurate simulations, and have designed an SRAM core based on a 0.18 μm CMOS technology. As a result, it has been observed that the HBTC approach can achieve 60% of energy reduction, with only 0.3% performance degradation, compared to a conventional cache. Furthermore, we have also evaluated the potential of the HBTC cache by combining with other low-energy techniques.
Designing a TCP/IP core for power consumption analysis

Kenichi Tanamachi, Inoue Koji, Vasily G. Moshnyaga

Proceedings of 2004 IEEE Asia-Pacific Conference on Advanced System Integrated Circuits Proceedings of 2004 IEEE Asia-Pacific Conference on Advanced System Integrated Circuits 412 - 413 2004

Language：English Publishing type：Research paper (other academic)

The design of a low-power TCP/IP hardcore for pervasive computing was discussed. In order to implement the TCP/IP operations in hardware, the TCP and IP functions were partitioned into four modules: port_ctr, data_ctr, window_ctr, and checksum. It was found that data_ctr consumed about 22-30% of the total power. The power consumption of the TCP core and the IP core were compared, and it was found that the power consumed by the TCP core was almost double that of the IP core.
Reducing Access Count to Register-Files through Operand Reuse.

Hiroshi Takamura, Koji Inoue, Vasily G. Moshnyaga

Advances in Computer Systems Architecture 112 - 121 2003.9

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1007/978-3-540-39864-6_10
Instruction Encoding for Reducing Power Consumption of I-ROMs Based on Execution Locality.

Koji Inoue, Vasily G. Moshnyaga, Kazuaki J. Murakami

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 86-A ( 4 ) 799 - 805 2003.4

Language：Others Publishing type：Research paper (scientific journal)
A zero-value prediction technique for fast DCT computation

Y. Nishida, Inoue Koji, V. G. Moshnyaga

2003 IEEE Workshop on Signal Processing Systems, SIPS 2003 2003 IEEE Workshop on Signal Processing Systems Design and Implementation, SIPS 2003 2003-January 165 - 170 2003.1

Language：English Publishing type：Research paper (other academic)

The paper proposes a new computationally efficient technique for DCT operation. Unlike related research, the technique reduces the number of computations by predicting the effect of quantization on DCT and avoiding calculations of those DCT values which lead to zero elements in the block after quantization. Experimental evaluation on a number of video benchmarks shows that our method is able to reduce the total number of computations by 29% for DCT and by 59% for quantization while maintaining high image quality.

DOI： 10.1109/SIPS.2003.1235663
Dynamic tag-check omission: A low power instruction cache architecture exploiting execution footprints

Koji Inoue, Vasily Moshnyaga, Kazuaki Murakami

2nd International Workshop on Power-Aware Computer Systems, PACS 2002 Power-Aware Computer Systems - 2nd International Workshop, PACS 2002, Revised Papers 18 - 32 2003

Language：English Publishing type：Research paper (other academic)

This paper proposes an architecture for low-power direct-mapped instruction caches, called the “history-based tag-comparison (HBTC) cache”. The HBTC cache attempts to detect and omit unnecessary tag checks at run time. Execution footprints are recorded in an extended BTB (Branch Target Buffer) and are used to determine whether target instructions reside in the cache before a cache access starts. In our simulation, our approach reduces the total count of tag checks by 90%, resulting in a 15% cache-energy reduction with less than 0.5% performance degradation.

DOI： 10.1007/3-540-36612-1_2
Multiplier energy reduction through bypassing of partial products.

Jun-ni Ohban, Vasily G. Moshnyaga, Koji Inoue

IEEE Asia Pacific Conference on Circuits and Systems 2002 13 - 17 2002.10

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/APCCAS.2002.1115097
Reducing power consumption of instruction ROMs by exploiting instruction frequency.

Koji Inoue, Vasily G. Moshnyaga, Kazuaki J. Murakami

IEEE Asia Pacific Conference on Circuits and Systems 2002 1 - 6 2002.10

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/APCCAS.2002.1115094
A history-based I-cache for low-energy multimedia applications.

Koji Inoue, Vasily G. Moshnyaga, Kazuaki J. Murakami

Proceedings of the 2002 International Symposium on Low Power Electronics and Design(ISLPED) 148 - 153 2002.8

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1145/566408.566447
Reducing energy consumption of video memory by bit-width compression.

Vasily G. Moshnyaga, Koji Inoue, Mizuka Fukagawa

Proceedings of the 2002 International Symposium on Low Power Electronics and Design(ISLPED) 142 - 147 2002.8

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1145/566408.566446
Omitting cache look-up for high-performance, low-power microprocessors Reviewed

K Inoue, VG Moshnyaga, K Murakami

IEICE TRANSACTIONS ON ELECTRONICS E85C ( 2 ) 279 - 287 2002.2

Language：English Publishing type：Research paper (scientific journal)

In this paper, we propose a novel architecture for low-power direct-mapped instruction caches, called "history-based tag-comparison (HBTC) cache." The cache attempts to reuse tag-comparison results for avoiding unnecessary tag checks. Execution footprints are recorded into an extended BTB (Branch Target Buffer). In our evaluation, it is observed that the energy for tag comparison can be reduced by more than 90% in many applications.
Dynamic Tag-Check Omission: A Low Power Instruction Cache Architecture Exploiting Execution Footprints.

Koji Inoue, Vasily G. Moshnyaga, Kazuaki J. Murakami

Power-Aware Computer Systems(PACS) 18 - 32 2002.2

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1007/3-540-36612-1_2
Trends in high-performance, low-power cache memory architectures Reviewed

K Inoue, VG Moshnyaga, K Murakami

IEICE TRANSACTIONS ON ELECTRONICS E85C ( 2 ) 304 - 314 2002.2

Language：English Publishing type：Research paper (scientific journal)

One of the uncompromising requirements of portable computing is energy efficiency, because it directly affects battery life. On the other hand, portable computing will target more demanding applications, such as moving pictures, so higher performance is still required. Cache memories have been employed as one of the most important components of computer systems. In this paper, we briefly survey architectural techniques for high-performance, low-power cache memories.
A Low Energy Set-Associative I-Cache with Extended BTB.

Koji Inoue, Vasily G. Moshnyaga, Kazuaki J. Murakami

20th International Conference on Computer Design (ICCD 2002), VLSI in Computers and Processors(ICCD) 187 2002.1

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1109/ICCD.2002.1106768
Register File Energy Reduction by Operand Data Reuse.

Hiroshi Takamura, Koji Inoue, Vasily G. Moshnyaga

Integrated Circuit Design. Power and Timing Modeling, Optimization and Simulation(PATMOS) 278 - 288 2002.1

Language：Others Publishing type：Research paper (other academic)

DOI： 10.1007/3-540-45716-X_28
Register file energy reduction by operand data reuse

Hiroshi Takamura, Koji Inoue, Vasily G. Moshnyaga

12th International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS 2002 Integrated Circuit Design Power and Timing Modeling, Optimization and Simulation - 12th International Workshop, PATMOS 2002, Proceedings 278 - 288 2002.1

Language：English Publishing type：Research paper (other academic)

This paper presents an experimental study of register file utilization in a conventional RISC-type data path architecture to determine the benefits that can be expected from eliminating unnecessary register file reads and writes. Our analysis shows that operand bypassing enhanced for operand reuse can eliminate up to 65% of register file accesses at peak, and 39% on average, for the tested benchmark programs.

DOI： 10.1007/3-540-45716-x_28
Multiplier energy reduction through bypassing of partial products

Jun Ni Ohban, V. G. Moshnyaga, K. Inoue

Asia-Pacific Conference on Circuits and Systems, APCCAS 2002 Proceedings - APCCAS 2002 Asia-Pacific Conference on Circuits and Systems 13 - 17 2002.1

Language：English Publishing type：Research paper (other academic)

The design of portable battery-operated multimedia devices requires energy-efficient multiplication circuits. This paper presents a novel approach to reducing the power consumption of a digital multiplier based on dynamic bypassing of partial products. The bypassing elements incorporated into the multiplier hardware eliminate redundant signal transitions, which appear within the carry-save adders when the partial product is zero. Simulations on real-life DCT data show that the proposed approach can improve the power saving of related methods by 12%, while jointly with them, it reduces the power consumption of a 16x16 digital CMOS multiplier by 31%, with 25% area overhead and less than 4% performance degradation in the worst case. The circuit implementation is outlined.

DOI： 10.1109/APCCAS.2002.1115097
Reducing power consumption of instruction ROMs by exploiting instruction frequency

K. Inoue, V. G. Moshnyaga, K. Murakami

Asia-Pacific Conference on Circuits and Systems, APCCAS 2002 Proceedings - APCCAS 2002 Asia-Pacific Conference on Circuits and Systems 1 - 6 2002

Language：English Publishing type：Research paper (other academic)

This paper proposes a new approach to reducing the power consumption of instruction ROMs for embedded systems. The power consumption of instruction ROMs strongly depends on the switching activity of bit-lines. If a read bit-value is '0', the precharged bit-line is discharged. In this scenario, a bit-line switching takes place and consumes power. Otherwise, the precharged bit-line level is maintained until the next access, and thus no bit-line switching occurs. In our approach, the binary patterns to be assigned to op-codes are determined based on the frequency of instructions in order to reduce the bit-line switching activity. Application programs are analyzed in advance, and then binary patterns including many '1's are assigned to the most frequently referenced instructions. In our evaluation, it is observed that the proposed approach can reduce bit-line switching by 40%.

DOI： 10.1109/APCCAS.2002.1115094
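The frequency-based op-code assignment described in the abstract above can be sketched in a few lines. This is an illustrative model rather than the paper's procedure: the function names `assign_opcodes` and `zero_bits_read`, the instruction mix, and the 2-bit op-code width are all assumptions, and the number of '0' bits read is used as a simple proxy for bit-line discharges.

```python
def assign_opcodes(freq, width):
    """Give the patterns with the most '1' bits to the most frequent instructions."""
    if len(freq) > 2 ** width:
        raise ValueError("not enough op-code patterns for the instruction set")
    # Patterns ordered by descending number of '1' bits (ties by value).
    patterns = sorted(range(2 ** width), key=lambda p: (-bin(p).count("1"), p))
    # Instructions ordered by descending reference count (ties by name).
    ranked = sorted(freq, key=lambda op: (-freq[op], op))
    return {op: patterns[i] for i, op in enumerate(ranked)}


def zero_bits_read(freq, encoding, width):
    """Count '0' op-code bits read over the whole profile (discharge proxy)."""
    return sum(n * (width - bin(encoding[op]).count("1")) for op, n in freq.items())


# Hypothetical profile: how often each instruction is fetched.
profile = {"add": 50, "load": 30, "branch": 15, "mul": 5}
tuned = assign_opcodes(profile, width=2)
naive = {"add": 0, "load": 1, "branch": 2, "mul": 3}

print(zero_bits_read(profile, tuned, 2), zero_bits_read(profile, naive, 2))
```

With this made-up profile the tuned encoding reads fewer '0' bits than the naive sequential one; the size of the saving depends entirely on the instruction mix and should not be compared with the figure reported in the abstract.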
Performance/energy efficiency of variable line-size caches for intelligent memory systems

Koji Inoue, Koji Kai, Kazuaki Murakami

2nd International Workshop on Intelligent Memory Systems, IMS 2000 Intelligent Memory Systems - 2nd International Workshop, IMS 2000, Revised Papers 169 - 178 2001

Language：English Publishing type：Research paper (other academic)

DOI： 10.1007/3-540-44570-6_13
A high-performance/low-power on-chip memory-path architecture with variable cache-line size

K Inoue, K Kai, K Murakami

IEICE TRANSACTIONS ON ELECTRONICS E83C ( 11 ) 1716 - 1723 2000.11

Language：English

This paper proposes an on-chip memory-path architecture employing the dynamically variable line-size (D-VLS) cache for high performance and low energy consumption. The D-VLS cache exploits the high on-chip memory bandwidth attainable on merged DRAM/logic LSIs by replacing a whole large cache line in one cycle. At the same time, it attempts to avoid frequent evictions by decreasing the cache-line size when programs have poor spatial locality. Activating only on-chip DRAM subarrays corresponding to a replaced cache-line size produces a significant energy reduction. In our simulation, it is observed that our proposed on-chip memory-path architecture, which employs a direct-mapped D-VLS cache, improves the ED (Energy Delay) product by more than 75% over a conventional memory-path model.
Dynamically variable line-size cache architecture for merged DRAM/Logic LSIs

K Inoue, K Kai, K Murakami

IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E83D ( 5 ) 1048 - 1057 2000.5

　More details

Language：English

This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, which is called "dynamically variable line-size cache (D-VLS cache)." The D-VLS cache can optimize its line-size according to the characteristics of programs, and attempts to improve the performance by appropriately exploiting the high on-chip memory bandwidth on merged DRAM/logic LSIs. In our evaluation, it is observed that the average memory access time improvement achieved by a direct-mapped D-VLS cache is about 20% compared to a conventional direct-mapped cache with fixed 32-byte lines. This performance improvement is better than that of a doubled-size conventional direct-mapped cache.
A high-performance and low-power cache architecture with speculative way-selection

K Inoue, T Ishihara, K Murakami

IEICE TRANSACTIONS ON ELECTRONICS E83C ( 2 ) 186 - 194 2000.2

　More details

Language：English

This paper proposes a new approach to achieving high performance and low energy consumption for set-associative caches. The cache, called way-predicting set-associative cache, speculatively selects a single way, which is likely to contain the data desired by the processor, from the set designated by a memory address, before it starts a normal cache access. By accessing only the single way predicted, instead of accessing all the ways in a set, energy consumption can be reduced. In order for the way-predicting cache to perform well, accuracy of way prediction is important. This paper shows that the accuracy of an MRU (most recently used)-based way prediction is higher than 90% for most of the benchmark programs. The proposed way-predicting cache improves the ED (energy-delay) product by 60-70% compared to the conventional set-associative cache.
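The MRU-based way prediction described in the abstract above can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the class name, the counters, and the replacement policy (filling the way next to the predicted one) are hypothetical and are not taken from the paper.

```python
class WayPredictingCache:
    """Behavioral sketch of a set-associative cache with MRU way prediction."""

    def __init__(self, num_sets, num_ways):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.mru = [0] * num_sets  # predicted (most recently used) way per set
        self.way_probes = 0        # ways activated; a rough proxy for energy

    def access(self, block_addr):
        s = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        pred = self.mru[s]
        self.way_probes += 1       # probe only the predicted way first
        if self.tags[s][pred] == tag:
            return "first-hit"
        # Misprediction: fall back to probing the remaining ways.
        self.way_probes += self.num_ways - 1
        for w in range(self.num_ways):
            if w != pred and self.tags[s][w] == tag:
                self.mru[s] = w
                return "second-hit"
        # Miss: assumed policy, fill the way after the predicted one.
        victim = (pred + 1) % self.num_ways
        self.tags[s][victim] = tag
        self.mru[s] = victim
        return "miss"

cache = WayPredictingCache(num_sets=4, num_ways=4)
results = [cache.access(a) for a in (0, 0, 4, 0, 0)]
```

A correct prediction activates one way instead of all four, which is where the energy saving in this model comes from; a misprediction costs an extra probe of the remaining ways.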
MOE: A special-purpose parallel computer for high-speed, large-scale molecular orbital calculation

Koji Hashimoto, Hiroto Tomita, Inoue Koji, Katsuhiko Metsugi, Kazuaki Murakami, Shinjiro Inabata, So Yamada, Nobuaki Miyakawa, Hajime Takashima, Kunihiro Kitamura, Shigeru Obara, Takashi Amisaki, Kazutoshi Tanabe, Umpei Nagashima

1999 ACM/IEEE Conference on Supercomputing, SC 1999 ACM/IEEE SC 1999 Conference, SC 1999 1999.1

　More details

Language：English Publishing type：Research paper (other academic)

We are constructing a high-performance, special-purpose parallel machine for ab initio Molecular Orbital calculations, called MOE (Molecular Orbital calculation Engine). The sequential execution time is O(N⁴), where N is the number of basis functions, and most of the time is spent on the calculation of electron repulsion integrals (ERIs). The calculation of ERIs has a large amount of parallelism, O(N⁴), and MOE therefore tries to exploit this parallelism. This paper discusses the MOE architecture and examines important aspects of the architecture design required to calculate ERIs according to the "Obara method". We conclude that n-way parallelization is the most cost-effective; hence, we designed the MOE prototype system with a host computer and many processing nodes. Each processing node includes a 76-bit floating-point MULTIPLY-and-ADD unit, internal memory, etc., and performs ERI computations efficiently. We estimate that the prototype system with 100 processing nodes can calculate the energy of proteins in a few days.

DOI： 10.1109/SC.1999.10000
High bandwidth, variable line-size cache architecture for merged DRAM/logic LSIs Reviewed

K Inoue, K Kai, K Murakami

IEICE TRANSACTIONS ON ELECTRONICS E81C ( 9 ) 1438 - 1447 1998.9

　More details

Language：English

Merged DRAM/logic LSIs could provide high on-chip memory bandwidth by interconnecting logic portions and DRAM with wider on-chip buses. For merged DRAM/logic LSIs with a memory hierarchy including cache memory, we can exploit such high on-chip memory bandwidth by replacing a whole cache line (or cache block) at a time on cache misses. This approach tends to increase the cache-line size if we attempt to improve the attainable memory bandwidth. Larger cache lines, however, might worsen the system performance if programs running on the LSIs do not have enough spatial locality of references and cache misses take place frequently. This paper describes a novel cache architecture suitable for merged DRAM/logic LSIs, called variable line-size cache or VLS cache, for resolving the above-mentioned dilemma. The VLS cache can make good use of the high on-chip memory bandwidth by means of larger cache lines and, at the same time, alleviate the negative effects of larger cache-line size by partitioning each large cache line into multiple sub-lines and allowing every sub-line to work as an independent cache line. The number of sub-lines involved when a cache replacement occurs can be determined depending on the characteristics of programs. This paper also evaluates the cost/performance improvements attainable by the VLS cache and compares them with those of conventional cache architectures. As a result, it is observed that a VLS cache reduces the average memory-access time by 16.4% while it increases the hardware cost by only 13%, compared to a conventional direct-mapped cache with fixed 32-byte lines.
Efficient Autoencoder-Based Human Body Communication Transceiver for WBAN Reviewed International journal

Ali, Abdelhay; Inoue, Koji; Shalaby, Ahmed; Sayed, Mohammed Sharaf; Ahmed, Sabah Mohamed

IEEE ACCESS 7 117196 - 117205 2019

　More details

Language：English Publishing type：Research paper (scientific journal)

DOI： 10.1109/ACCESS.2019.2936796
Decision Tree Models and Early Splitting Termination in Screen Content Extension of High Efficiency Video Coding Reviewed International journal

Badry, Emad; Inoue, Koji; Sayed, Mohammed Sharaf

IEEE ACCESS 8 143437 - 143452 2020

　More details

Language：English Publishing type：Research paper (scientific journal)

DOI： 10.1109/ACCESS.2020.3014163


Books

Low-Power Electronics Design (Low-Power Cache Design: Chap. 25)

V. Moshnyaga and K. Inoue（Role：Joint author）

CRC PRESS 2004.1

　More details

Language：English Book type：Scholarly book

Presentations

SuperNPU: An Extremely Fast Neural Processing Unit Using Superconducting Logic Devices International conference

Koki Ishida, Ilkwon Byun, Ikki Nagaoka, Kousuke Fukumitsu, Masamitsu Tanaka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Jangwoo Kim, and Koji Inoue

IEEE/ACM International Symposium on Microarchitecture (MICRO) 2020.10

　More details

Event date： 2020.10

Language：English Presentation type：Oral presentation (general)

Venue：online Country：Japan

Superconductor single-flux-quantum (SFQ) logic family has been recognized as a highly promising solution for the post-Moore's era, thanks to its ultra-fast and low-power switching characteristics. Therefore, researchers have made a tremendous amount of effort in various aspects to promote the technology and automate its circuit design process (e.g., low-cost fabrication, design tool development). However, there has been no progress in designing a convincing SFQ-based architectural unit due to the architects' lack of understanding of the technology's potentials and limitations at the architecture level. In this paper, we present how to architect an SFQ-based architectural unit by providing design principles with an extreme-performance neural processing unit (NPU). To achieve the goal, we first implement an architecture-level simulator to model an SFQ-based NPU accurately. We validate this model using our die-level prototypes, design tools, and logic cell library. This simulator accurately measures the NPU's performance, power consumption, area, and cooling overheads. Next, driven by the modeling, we identify key architectural challenges for designing a performance-effective SFQ-based NPU (e.g., expensive on-chip data movements and buffering). Lastly, we present SuperNPU, our example SFQ-based NPU architecture, which effectively resolves the challenges. Our evaluation shows that the proposed design outperforms a conventional state-of-the-art NPU by 23 times. With free cooling provided as done in quantum computing, the performance per chip power increases up to 490 times. Our methodology can also be applied to other architecture designs with SFQ-friendly characteristics.
Performance Prediction of Large-scale Parallel System and Application using Macro-level Simulation International conference

R. Susukita, H. Ando, M. Aoyagi, H. Honda, Y. Inadomi, K. Inoue, S. Ishizuki, Y. Kimura, H. Komatsu, M. Kurokawa, K. Murakami, H. Shibamura, S. Yamamura, Y. Yu

the International Conference for High Performance Computing, Networking, Storage and Analysis (SC08) 2008.11

　More details

Event date： 2008.11

Language：Others Presentation type：Oral presentation (general)

Venue：Austin Country：Other
Analyzing and Mitigating the Impact of Manufacturing Variability in Power-Constrained Supercomputing International conference

Yuichi Inadomi, Tapasya Patki, Inoue Koji, Mutsumi Aoyagi, Barry Rountree, Martin Schulz, David Lowenthal, Yasutaka Wada, Keiichiro Fukazawa, Masatsugu Ueda, Masaaki Kondo, Ikuo Miyoshi

The International Conference for High Performance Computing, Networking, Storage and Analysis 2015.11

　More details

Language：English Presentation type：Oral presentation (general)

Country：United States
Generating and Executing Multi-Exit Custom Instructions for an Adaptive Extensible Processor International conference

H. Noori, F. Mehdipour, K. Murakami, K. Inoue, and M. Goudarzi

The European Event for Electronic System Design & Test (DATE'07) 2007.4

　More details

Language：Others Presentation type：Oral presentation (general)

Country：France
How many trials do we need for reliable NISQ computing? International conference

Teruo Tanimoto, Shuhei Matsuo, Satoshi Kawakami, Yutaka Tabuchi, Masao Hirokawa, and Koji Inoue

The First International Workshop on Quantum Computing: Circuits Systems Automation and Applications 2020.7

　More details

Event date： 2021.6

Language：English Presentation type：Oral presentation (general)

Venue：online Country：Japan
32 GHz 6.5 mW Gate-Level-Pipelined 4-bit Processor using Superconductor Single-Flux-Quantum Logic International conference

Koki Ishida, Masamitsu Tanaka, Ikki Nagaoka, Takatsugu Ono, Satoshi Kawakami, Teruo Tanimoto, Akira Fujimaki, Koji Inoue

2020 Symposia on VLSI Technology and Circuits 2020.6

　More details

Event date： 2021.6

Language：English Presentation type：Oral presentation (general)

Venue：online Country：Japan
Practical error modeling toward realistic NISQ simulation International conference

Teruo Tanimoto, Shuhei Matsuo, Satoshi Kawakami, Yutaka Tabuchi, Masao Hirokawa, and Koji Inoue

The First International Workshop on Quantum Computing: Circuits Systems Automation and Applications 2020.7

　More details

Event date： 2021.6

Language：English Presentation type：Oral presentation (general)

Venue：online Country：Japan
Enhancing a manycore-oriented compressed cache for GPGPU International conference

Keitaro Oka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Koji Inoue

International Conference on High Performance Computing in Asia-Pacific Region 2020.1

　More details

Event date： 2020.1

Language：English Presentation type：Oral presentation (general)

Venue：Fukuoka, Japan Country：Japan
Energy Efficient Runahead Execution on a Tightly Coupled Heterogeneous Core International conference

Susumu Mashimo, Ryota Shioya, Koji Inoue

International Conference on High Performance Computing in Asia-Pacific Region 2020.1

　More details

Event date： 2020.1

Language：English Presentation type：Oral presentation (general)

Venue：Fukuoka, Japan Country：Japan
Evaluating the Impact of Energy Efficient Networks on HPC Workloads International conference

G Georgakoudis, N Jain, T Ono, K Inoue, S Miwa, A Bhatele

26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC) 2020.1

　More details

Event date： 2019.12

Language：English Presentation type：Oral presentation (general)

Venue：Hyderabad, India Country：India
An Open Source FPGA-Optimized Out-of-Order RISC-V Soft Processor International conference

Susumu Mashimo, Akifumi Fujita, Reoma Matsuo, Seiya Akaki, Akifumi Fukuda, Toru Koizumi, Junichiro Kadomoto, Hidetsugu Irie, Masahiro Goshima, Koji Inoue, Ryota Shioya

IEEE International Conference on Field Programmable Technology 2019.12

　More details

Event date： 2019.12

Language：English Presentation type：Oral presentation (general)

Venue：Tianjin, China Country：China
A 48GHz 5.6mW gate-level-pipelined multiplier using single-flux quantum logic International conference

Ikki Nagaoka, Masamitsu Tanaka, Koji Inoue, Akira Fujimaki

IEEE International Solid-State Circuits Conference (ISSCC 2019) 2019.2

　More details

Event date： 2019.2

Language：English

Venue：San Francisco Country：United States
Improving Lifetime in MLC Phase Change Memory using Slow Writes International conference

Takatsugu Ono, Zhe Chen and Koji Inoue

International Japan-Africa Conference on Electronics, Communication and Computations 2018.12

　More details

Event date： 2018.12

Language：English

Country：Egypt
Situation-Based Dynamic Frame-Rate Control for On-Line Object Tracking International conference

Yusuke Inoue, Takatsugu Ono and Koji Inoue

International Japan-Africa Conference on Electronics, Communication and Computations 2018.12

　More details

Event date： 2018.12

Language：English

Country：Japan
30-GHz Operation of Datapath for Bit-Parallel, Gate-Level-Pipelined Rapid Single-Flux-Quantum Microprocessors Invited International conference

Masamitsu Tanaka, Yuki Hatanaka, Yuichi Matsui, Ikki Nagaoka, Koki Ishida, Kyosuke Sano, Taro Yamashita, Takatsugu Ono, Koji Inoue, Akira Fujimaki

Applied Superconductivity Conference 2018.10

　More details

Event date： 2018.10

Language：English

Country：Japan
Autoencoder based Features Extraction for Automatic Classification of Earthquakes and Explosions International conference

Omar M. Saad, K. Inoue, Ahmed Shalaby, Lotf Samy, and Mohammed S. Sayed

the 17th IEEE/ACIS International Conference on Computer and Information Science 2018.6

　More details

Event date： 2018.6

Language：English

Country：Japan
Analyzing Resource Trade-offs in Hardware-overprovisioned Supercomputers International conference

Ryuichi Sakamoto, Tapasya Patki, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Daniel Ellsworth, Barry Rountree, Martin Schulz

the 32nd International Parallel and Distributed Processing 2018.5

　More details

Event date： 2018.5

Language：English

Country：Japan
Power-capped DVFS and thread allocation with ANN models on modern NUMA systems International conference

Satoshi Imamura, Hiroshi Sasaki, Inoue Koji, Dimitrios S. Nikolopoulos

IEEE International Conference on Computer Design 2014.10

　More details

Event date： 2014.10

Language：English

Country：Korea, Republic of
Performance evaluations of finite difference applications realized on a single flux quantum circuits-based reconfigurable accelerator

Hiroaki Honda, Farhad Mehdipour, Hiroshi Kataoka, Inoue Koji, Kazuaki J. Murakami

Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2011, APSIPA ASC 2011 2011.12

　More details

Event date： 2011.10

Language：English

Venue：Xi'an Country：China

Hardware accelerators integrated with general-purpose processors are increasingly employed to achieve lower power consumption and higher processing speed; however, the energy consumption of high-performance accelerators has become a major issue in large-scale parallel computer systems. We have investigated the applicability of Single-Flux-Quantum (SFQ) circuits, a superconductivity technology, to high-performance computing systems. Although it is possible to develop extraordinarily low-power processors with SFQ devices, conditional branches and loop-back controls are difficult to implement with current SFQ technology. Therefore, we have proposed a Reconfigurable Data-Path (RDP) accelerator that avoids those limitations of SFQ technology while retaining the benefits of these circuits. In this research, we have implemented two-dimensional Heat (2D-Heat) and Finite Difference Time Domain (2D-FDTD) applications to investigate the efficiency of the SFQ-RDP accelerator. According to the performance evaluation results for these applications, execution times are 50.6 and 79.0 times shorter than those of the general-purpose processor, and comparable with those reported for GPUs (Graphics Processing Units).
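For readers unfamiliar with the kind of kernel a 2D-Heat application computes, the following is a generic sketch of one explicit finite-difference time step. It is not the authors' code or their accelerator mapping; the function name, the `alpha` coefficient, and the fixed-boundary treatment are assumptions chosen for illustration.

```python
def heat_step(grid, alpha=0.1):
    """One explicit finite-difference update of the 2D heat equation.
    Boundary cells are held fixed (an assumed boundary condition)."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Five-point stencil: discrete Laplacian of the temperature field.
            lap = (grid[i - 1][j] + grid[i + 1][j]
                   + grid[i][j - 1] + grid[i][j + 1]
                   - 4 * grid[i][j])
            new[i][j] = grid[i][j] + alpha * lap
    return new

# A single hot cell in the middle of a cold 3x3 plate (assumed toy input).
plate = [[0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0]]
after = heat_step(plate)
```

The per-cell update is a fixed sequence of additions and multiplications with no data-dependent branch, which is the kind of computation a data-path style accelerator is plausibly suited to.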
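Optimization of All-to-All Communication by Packet Pacing and Its Simulation-Based Evaluation

Hidetomo Shibamura, Hideki Miwa, Ryutaro Susukita, Tomoya Hirao, Yuichiro Ajima, Ikuo Miyoshi, Toshiyuki Shimizu, Hiroaki Ishihata, Koji Inoue

Symposium on High Performance Computing and Computational Science 2011.1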

　More details

Event date： 2011.1

Language：Others Presentation type：Oral presentation (general)

Venue：Tsukuba Country：Japan
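An On-Chip Memory Lending Technique for Multicores Considering the Balance between Computation and Memory Performance

Naoto Fukumoto, Koji Inoue, Kazuaki Murakami

Symposium on High Performance Computing and Computational Science 2011.1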

　More details

Event date： 2011.1

Language：Others Presentation type：Oral presentation (general)

Venue：Tsukuba Country：Japan
Reducing Preprocessing Overhead Times in a Reconfigurable Accelerator of Finite Difference Applications International conference

H. Kataoka, H. Honda, F. Mehdipour, K. Inoue, and K. Murakami

In Proc. Symp. on Application Accelerators in High Performance Computing (SAAHPC'10) 2010.7

　More details

Event date： 2010.7

Language：Others

Venue：Tennessee Country：Other
A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor International conference

Farhad Mehdipour, Hamid Noori, Bahman Javadi, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

The 14th Asia and South-Pacific Design Automation Conference (ASP-DAC 2009) 2009.1

　More details

Event date： 2009.1

Language：Others Presentation type：Oral presentation (general)

Venue：Yokohama Country：Japan
Analyzing the Impact of Data Prefetching on Chip MultiProcessors International conference

N. Fukumoto, T. Mihara, K. Inoue, and K. Murakami

IEEE Asia-Pacific Computer Systems Architecture Conference (ACSAC'08) 2008.8

　More details

Event date： 2008.8

Language：Others

Country：Japan
Energy Efficiency of Configurable Caches via Temperature-Aware Configuration Selection International conference

H. Noori, M. Goudarzi, K. Inoue, and K. Murakami

International Symposium on VLSI (ISVLSI'08) 2008.8

　More details

Event date： 2008.8

Language：Others

Country：France
Enhancing Energy Efficiency of Processor-Based Embedded Systems through Post-Fabrication ISA Extension International conference

H. Noori, F. Mehdipour, K. Inoue, and K. Murakami

International Symposium on Low Power Electronics and Design (ISLPED'08) 2008.8

　More details

Event date： 2008.8

Language：Others

Country：India
Design Space Exploration for a Coarse Grain Accelerator International conference

F. Mehdipour, H. Noori, M. S. Zamani, K. Inoue, and K. Murakami

Asia and South Pacific Design Automation Conference (ASPDAC'08) 2008.1

　More details

Event date： 2008.1

Language：Others

Country：Korea, Republic of
Improved Policies for Drowsy Caches in Embedded Processors International conference

J. Zushi, G. Zeng, H. Tomiyama, H. Takada, and K. Inoue

International Symposium on Electronics Design, Test & Applications 2008.1

　More details

Event date： 2008.1

Language：Others

Country：Taiwan, Province of China
Energy Consumption Evaluation of an Adaptive Extensible Processor International conference

H. Noori, F. Mehdipour, M. Goudarzi, S. Yamaguchi, K. Inoue, and K. Murakami

Reconfigurable and Adaptive Architecture Workshop 2007.12

　More details

Event date： 2007.12

Language：Others

Country：United States
Adaptive Management of Cache Block Replication for High-Performance CMP International conference

T. Mihara, K. Inoue, and K. Murakami

Workshop on Chip MultiProcessor: Processor Architecture and Memory Hierarchy related Issues 2007.9

　More details

Event date： 2007.9

Language：Others

Country：Romania
One-sided Communication Implementation in FMO Method International conference

J. Maki, Y. Inadomi, T. Takami, R. Susukita, H. Honda, J. Ooba, T. Kobayashi, R. Nogita, K. Inoue and M. Aoyagi

International Conference on High Performance Computing, Grid and e-Science in Asia Pacific Region 2007.9

　More details

Event date： 2007.9

Language：Others

Country：Greece
Multi-physics Extension of OpenFMO Framework International conference

T. Takami, J. Maki, J. Ooba, Y. Inadomi, H. Honda, R. Susukita, K. Inoue, T. Kobayashi, R. Nogita, and M. Aoyagi

International Conference of Computational Method in Sciences and Engineering 2007.9

　More details

Event date： 2007.9

Language：Others

Country：Greece
Implementation and Evaluation of Fock Matrix Calculation Program on the Cell Processor International conference

H. Honda, T. Hayashi, Y. Inadomi, K. Inoue, and K. Murakami

International Conference of Computational Method in Sciences and Engineering 2007.9

　More details

Event date： 2007.9

Language：Others

Country：Greece
The Effect of Nanometer-Scale Technologies on the Cache Size Selection for Low Energy Embedded Systems International conference

H. Noori, M. Goudarzi, K. Inoue, and K. Murakami

International Conference on Embedded Systems and Applications 2007.6

　More details

Event date： 2007.6

Language：Others

Country：United States
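A Fast and Accurate Memory Architecture Simulation Technique Exploiting Memory Access Characteristics

Takatsugu Ono, Koji Inoue, Kazuaki Murakami

Symposium on Advanced Computing Systems and Infrastructures 2007.5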

　More details

Event date： 2007.5

Language：Others Presentation type：Oral presentation (general)

Country：Japan
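An MPI Rank Placement Optimization Technique for Reducing Contention Considering Communication Timing

Yoshiyuki Morie, Naoki Sueyasu, Toru Matsumoto, Takeshi Nanri, Hiroaki Ishihata, Koji Inoue, Kazuaki Murakami

Symposium on Advanced Computing Systems and Infrastructures 2007.5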

　More details

Event date： 2007.5

Language：Others Presentation type：Oral presentation (general)

Country：Japan
Dynamic Management Technique to Mitigate Performance Degradation for Low-Leakage Caches International conference

R. Komiya, K. Inoue, and K. Murakami

The 10th IEEE Symposium on Low-Power and High-Speed Chips 2007.4

　More details

Event date： 2007.4

Language：Others Presentation type：Oral presentation (general)

Venue：Yokohama Country：Japan
Reducing energy consumption of video memory by bit-width compression

Vasily G. Moshnyaga, Koji Inoue, Mizuka Fukagawa

Proceedings of the 2002 International Symposium on Low Power Electronics and Design 2002.1

　More details

Event date： 2002.8

Language：English

Venue：Monterey, CA Country：United States

A new architectural technique to reduce the energy dissipation of video memory is proposed. Unlike existing approaches, the technique exploits the pixel correlation in video sequences, dynamically adjusting the memory bit-width to the number of bits changed per pixel. Instead of treating the data bits independently, we group the most significant bits together, activating the corresponding group of bit-lines adaptively according to data variation. The method is neither restricted to specific bit-patterns nor dependent on the storage phase. It works equally well on read and write accesses, as well as during precharging. Simulation results show that using this method we can reduce the total energy consumption of video memory by 20% without affecting the picture quality.
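The idea of activating the most-significant-bit group only when it changes can be illustrated with a rough counting model. This sketch is an assumption-laden simplification, not the circuit from the paper: the 4/4 split of an 8-bit pixel, the function name, and the rule "skip the MSB group when it equals the previous pixel's" are all hypothetical.

```python
def active_bitlines(pixels, width=8, msb_bits=4):
    """Count bit-lines activated over a pixel sequence when the MSB group
    is activated only if it differs from the previous pixel's MSB group."""
    lsb_bits = width - msb_bits
    total = 0
    prev_msb = None
    for p in pixels:
        msb = p >> lsb_bits
        if msb == prev_msb:
            total += lsb_bits   # MSB group unchanged: only LSB bit-lines are used
        else:
            total += width      # MSB group changed: all bit-lines are used
        prev_msb = msb
    return total

# Neighbouring pixels usually differ only in their low-order bits (toy data).
row = [0x80, 0x81, 0x83, 0x90]
compressed = active_bitlines(row)
full_width = len(row) * 8
```

For this toy row the model activates fewer bit-lines than a full-width access to every pixel, which mirrors the source of the energy reduction reported in the abstract.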
A history-based i-cache for low-energy multimedia applications

Koji Inoue, V. G. Moshnyaga, K. Murakami

Proceedings of the 2002 International Symposium on Low Power Electronics and Design

　More details

Event date： 2002.8

Language：English

Venue：Monterey, CA Country：United States

This paper proposes a history-based tag-comparison scheme for reducing energy consumption of direct-mapped instruction caches. The proposed cache efficiently exploits program-execution footprints recorded in the Branch Target Buffer (BTB), and attempts to detect and eliminate unnecessary tag checks at run time. Simulation results show that our approach can eliminate up to 95% of tag checks, saving the cache energy by 17%, while affecting the processor performance by only 0.2%.
Way-predicting set-associative cache for high performance and low energy consumption

Koji Inoue, Tohru Ishihara, Kazuaki Murakami

Proceedings of the 1999 International Symposium on Low Power Electronics and Design (ISLPED)

　More details

Event date： 1999.8

Language：English

Venue：San Diego, CA, USA Country：Other

This paper proposes a new approach using way prediction for achieving high performance and low energy consumption of set-associative caches. By accessing only a single cache way predicted, instead of accessing all the ways in a set, the energy consumption can be reduced. This paper shows that the way-predicting set-associative cache improves the ED (energy-delay) product by 60-70% compared to a conventional set-associative cache.
Dynamically variable line-size cache exploiting high on-chip memory bandwidth of merged DRAM/logic LSIs

Inoue Koji, Koji Kai, Kazuaki Murakami

Proceedings of the 1999 5th International Symposium on High-Performance Computer Architecture, HPCA 1999.1

　More details

Event date： 1999.1

Language：English

Venue：Orlando, FL, USA Country：Other

This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, which is called "dynamically variable line-size cache (D-VLS cache)". The D-VLS cache can optimize its line-size according to the characteristics of programs, and attempts to improve the performance by exploiting the high on-chip memory bandwidth. In our evaluation, it is observed that the performance improvement achieved by a direct-mapped D-VLS cache is about 27%, compared to a conventional direct-mapped cache with fixed 32-byte lines.
Quantitative Evaluation of Leakage Reduction Algorithm for L1 Data Caches International conference

R. Komiya, K. Inoue, V. Moshnyaga, K. Murakami

The International SoC Design Conference (ISOCC) 2004.10

　More details

Language：Others

Venue：the Convention and Exhibition Center (COEX) Country：Korea, Republic of
Energy-Security Tradeoff in a Secure Cache Architecture Against Buffer Overflow Attacks International conference

Koji Inoue

Workshop on Architectural Support for Security and Anti-Virus (WASSA) 2004.10

　More details

Language：Others Presentation type：Oral presentation (general)

Venue：Park Plaza Hotel Country：United States
A Low Power I-Cache Design with Tag-Comparison Reuse International conference

K. Inoue, H. Tanaka, V. Moshnyaga, K. Murakami

The International Symposium on System-On-Chip 2004.11

　More details

Language：Others Presentation type：Oral presentation (general)

Venue：Tampere Country：Finland
3D memory architecture Invited International conference

Koji Inoue

D43D: 3rd Design for 3D Silicon Integration Workshop 2011.6

　More details

Language：Others Presentation type：Oral presentation (general)

Country：France
Adaptive Execution on 3D Microprocessors Invited International conference

Koji Inoue

11th International Forum on Embedded MPSoC and Multicore 2011.7

　More details

Language：Others Presentation type：Oral presentation (general)

Country：France
Performance Evaluation of 3D Stacked Multi-Core Processors with Temperature Consideration International conference

T. Hanada, H. Sasaki, K. Inoue and K. Murakami

International 3D System Integration Conference 2012.1

　More details

Language：Others

Country：Japan
A Thermal-Aware Mapping Algorithm for Reducing Peak Temperature of an Accelerator Deployed in a 3D Stack International conference

F. Mehdipour, K. C. Nunna, L. Gauthier, K. Inoue and K. Murakami

International 3D System Integration Conference 2012.1

　More details

Language：Others

Country：Japan
Efficient Barrier Synchronization for 2D Meshed NoC-based Many-core Processors International conference

Lovic Gauthier, Farhad Mehdipour, Koji Inoue, Shinya Ueno, Hiroshi Sasaki

The 17th Workshop on Synthesis And System Integration of Mixed Information technologies 2012.3

　More details

Language：Others

Country：Japan
Optimizing Power-Performance Trade-off for Parallel Applications through Dynamic Core-count and Frequency Scaling International conference

Satoshi Imamura, Hiroshi Sasaki, Naoto Fukumoto, Koji Inoue, and Kazuaki Murakami

2nd Workshop on Runtime Environments/Systems, Layering, and Virtualized Environments (RESoLVE '12) 2012.3

　More details

Language：Others Presentation type：Oral presentation (general)

Country：United Kingdom
On the Power and Performance Analysis of GPU-Accelerated Systems International conference

Yuki Abe, Hiroshi Sasaki, Inoue Koji, Kazuaki Murakami, Shinpei Kato

Poster session, 2012 USENIX Annual Technical Conference 2012.6

　More details

Language：English Presentation type：Symposium, workshop panel (public)

Country：United States
SMYLE: Scalable Many-core for Low-Energy computing (Invited) Invited International conference

Koji Inoue and Masaaki Kondo

12th International Forum on Embedded MPSoC and Multicore 2012.7

　More details

Language：Others

Country：Japan
A Three-Dimensional Integrated Accelerator International conference

Farhad Mehdipour, Krishna Chaitanya Nunna, Inoue Koji, Kazuaki Murakami

Euromicro Conference on Digital System Design 2012.9

　More details

Language：English Presentation type：Symposium, workshop panel (public)

Country：Turkey
Scalability-Based Manycore Partitioning International conference

Hiroshi Sasaki, Teruo Tanimoto, Koji Inoue, and Hiroshi Nakamura

International Conference on Parallel Architectures and Compilation Techniques 2012.9

　More details

Language：Others Presentation type：Symposium, workshop panel (public)

Country：United States
Power and Performance Analysis of GPU-Accelerated Systems International conference

Yuki Abe, Hiroshi Sasaki, Martin Peres, Inoue Koji, Kazuaki Murakami, Shinpei Kato

Workshop on Power-Aware Computing and Systems 2012.10

　More details

Language：English Presentation type：Symposium, workshop panel (public)

Country：United States
Task Mapping Techniques for Embedded Many-core SoCs International conference

Junya Kaida, Takuji Hieda, Ittetsu Taniguchi, Hiroyuki Tomiyama, Yuko Hara-Azumi, Inoue Koji

International SoC Design Conference 2012.11

　More details

Language：English Presentation type：Symposium, workshop panel (public)

Country：Korea, Republic of
SMYLEref: A Reference Architecture for Manycore-Processor SoCs International conference

Masaaki Kondo, Son Truong Nguyen, Takeshi Soga, Tomoya Hirao, Hiroshi Sasaki, Inoue Koji

Asia and South Pacific Design Automation Conference (ASP-DAC) 2013.1

　More details

Language：English

Country：Japan

SMYLE Project: Toward High-Performance, Low-Power Computing on Manycore-Processor SoCs

Inoue Koji

Asia and South Pacific Design Automation Conference (ASP-DAC) 2013.1

　More details

Language：English

Country：Japan
Line Sharing Cache: Exploring Cache Capacity with Frequent Line Value Locality International conference

Keitaro Oka, Hiroshi Sasaki, Inoue Koji

Asia and South Pacific Design Automation Conference 2013.1

　More details

Language：English Presentation type：Symposium, workshop panel (public)

Country：Japan
メニーコアプロセッサにおける実時間モデル予測制御のための投機実行法

川上哲志, 岩永明人, 井上弘士

先進的計算基盤システムシンポジウム論文集 2013.5

　More details

Language：Japanese

Country：Japan

Speculative Execution for Real-time Model Predictive Control on Manycore Processor
Many-core Acceleration for Model Predictive Control Systems International conference

Satoshi Kawakami, Akihito Iwanaga, Inoue Koji

Int’l Workshop on Manycore Embedded Systems 2013.6

　More details

Language：English

Country：Japan
Coordinated Power-Performance Optimization in Manycores International conference

Hiroshi Sasaki, Satoshi Imamura, Inoue Koji

the 22nd International Conference on Parallel Architectures and Compilation Techniques 2013.9

　More details

Language：English

Country：Japan
フレームレートの動的最適化に基づく低消費エネルギー物体追跡システムの提案 (集積回路デザインガイア2013 : VLSI設計の新しい大地)

江川瀬里奈, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 2013.11

　More details

Language：Japanese

Country：Japan

Low Energy Tracking System with Dynamic Frame-Rate Optimization
Real-time object tracking, which estimates the position coordinates of a target object in each frame, is used in a wide range of applications, such as obstacle tracking and driver drowsiness detection. Because it is often deployed in embedded systems with a limited power supply and thermal constraints, its energy consumption must be reduced. This paper proposes a frame-rate optimization method to meet that requirement. The scheme adjusts the frame rate based on the moving speed of the target object. The evaluation shows that the proposed approach can reduce total system energy by more than 70%.
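As an illustrative sketch only (not the authors' implementation), adjusting the frame rate to the target's moving speed might look like the following Python. The function name `choose_frame_rate`, the candidate rates, and the `max_shift_px` threshold are hypothetical assumptions that do not come from the paper.

```python
def choose_frame_rate(speed_px_per_s, rates=(5, 10, 15, 30), max_shift_px=20.0):
    """Pick the lowest frame rate that keeps per-frame displacement small.

    Hypothetical policy: the target may move at most `max_shift_px` pixels
    between consecutive frames, so slower targets allow lower frame rates.
    """
    for fps in sorted(rates):
        # Displacement between two consecutive frames at this rate.
        if speed_px_per_s / fps <= max_shift_px:
            return fps
    # No candidate is fast enough: fall back to the highest available rate.
    return max(rates)
```

Under these assumed parameters, a stationary target is tracked at the lowest rate and a fast target at the highest, which is where the energy saving would come from.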
Performance and Power Consumption Evaluation of MHD Simulation for Magnetosphere on Parallel Computer System with CPU Power Capping International conference

FUKAZAWA Keiichiro, Tomonori Tsuhata, Kyohei Yoshida, Masakazu Kuze, Masatsugu Ueda, Yuichi Inadomi, Inoue Koji

Extreme Green & Energy Efficiency in Large Scale Distributed Systems 2014.5

　More details

Language：English

Country：Netherlands
Power and Performance Characterization and Modeling of GPU-accelerated Systems International conference

Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Inoue Koji, Masato Edahiro, Martin Peres

the 28th IEEE International Parallel & Distributed Processing Symposium 2014.5

　More details

Language：English

Country：Japan
A flexible hardware barrier mechanism for many-core processors International conference

Takeshi Soga, Hiroshi Sasaki, Tomoya Hirao, Masaaki Kondo, Inoue Koji

Asia and South Pacific Design Automation Conference 2015.1

　More details

Language：English Presentation type：Oral presentation (general)

Country：Japan
物体追跡システムの低消費エネルギー化を目的とした動的フレームレート制御法 (集積回路)

井上優良, 小野貴継, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 2015.12

　More details

Language：Japanese

Country：Japan

Dynamic Frame-rate Optimization for Low Energy Object Tracking
物体追跡システムの低消費エネルギー化を目的とした動的フレームレート制御法 (電子部品・材料)

井上優良, 小野貴継, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 2015.12

　More details

Language：Japanese

Country：Japan

Dynamic Frame-rate Optimization for Low Energy Object Tracking
モデル予測制御を対象としたメニーコアプロセッサ向け投機実行法の制御性能評価 (VLSI設計技術)

藤井卓, 小野貴継, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 2016.1

　More details

Language：Japanese

Country：Japan
光パスゲート論理に基づく並列加算回路の提案と光電混載回路シミュレータによる動作検証 (回路とシステム)

石原亨, 新家昭彦, 井上弘士, 野崎謙悟, 納富雅也

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 2016.6

　More details

Language：Japanese

Country：Japan

A Parallel Adder Circuit based on Optical Pass-gate Logic and Its Evaluation with Optoelectronic Circuit Simulator
受信信号強度を用いたデバイス認証方式における攻撃可能条件の定式化 (コンピュータシステム)

藤井達也, 小野貴継, 金谷晴一, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 2016.8

　More details

Language：Japanese

Country：Japan

Formulating Attack Condition on Received Signal Strength Indicator based Device Authentication
Single-Flux-Quantum Cache Memory Architecture International conference

Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Inoue Koji

International SoC Design Conference 2016.10

　More details

Language：English Presentation type：Oral presentation (general)

Country：Korea, Republic of
単一磁束量子回路を用いたシフトレジスタ型キャッシュメモリ・アーキテクチャの提案 (電子部品・材料) -- (デザインガイア2016 : VLSI設計の新しい大地)

石田浩貴, 田中雅光, 小野貴継, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 2016.11

　More details

Language：Japanese

Country：Japan

Shift-Register-Based Single-Flux-Quantum Cache Memory Architecture
Power-Efficient Breadth-First Search with DRAM Row Buffer Locality-Aware Address Mapping International conference

Satoshi Imamura, Yuichiro Yasui, Inoue Koji, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

the 1st High Performance Graph Data Management and Processing workshop 2016.11

　More details

Language：English Presentation type：Oral presentation (general)

Country：United States
Evaluating the Impacts of Code-Level Performance Tunings on Power Efficiency International conference

Satoshi Imamura, Keitaro Oka, Yuichiro Yasui, Yuichi Inadomi, Katsuki Fujisawa, Toshio Endo, Koji Ueno, Keiichiro Fukazawa, Nozomi Hata, Yuta Kakibuka, Inoue Koji, Takatsugu Ono

IEEE International Conference on Big Data 2016.12

　More details

Language：English Presentation type：Oral presentation (general)

Country：United States
Production Hardware Overprovisioning: Real-world Performance Optimization using an Extensible Power-aware Resource Management Framework International conference

Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Tapasya Patki, Daniel Ellsworth, Barry Rountree, and Martin Schulz

IEEE International Parallel & Distributed Processing Symposium (IPDPS 2017) 2017.5

　More details

Language：English Presentation type：Oral presentation (general)

Country：United States
High-Throughput Bit-Parallel Arithmetic Logic Unit Using Rapid Single-Flux-Quantum Logic International conference

Masamitsu Tanaka, Ryo Sato, Yuki Hatanaka, Yuichi Matsui, Hiroyuki Akaike, Akira Fujimaki, Koki Ishida, Takatsugu Ono, Koji Inoue

International Superconductive Electronics Conference 2017.6

　More details

Language：English Presentation type：Oral presentation (general)

Country：Italy
単一磁束量子ゲートレベルパイプラインマイクロプロセッサに向けた要素回路設計 (超伝導エレクトロニクス)

畑中湧貴, 松井裕一, 田中雅光, 佐野京佑, 藤巻朗, 石田浩貴, 小野貴継, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 2017.8

　More details

Language：Japanese

Country：Japan

Design of Component Circuits for Rapid Single-Flux-Quantum Gate-Level-Pipelined Microprocessors
CPCI Stack: Metric for Accurate Bottleneck Analysis on OoO Microprocessors International conference

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

International Symposium on Computing and Networking 2017.11

　More details

Language：English Presentation type：Oral presentation (general)

Country：Japan
Wireless Spoofing-Attack Prevention Using Radio-Propagation Characteristics International conference

Mihiro Sonoyama, Takatsugu Ono, Osamu Muta, Haruichi Kanaya, Koji Inoue

IEEE International Conference on Dependable, Autonomic and Secure Computing 2017.11

　More details

Language：English Presentation type：Oral presentation (general)

Country：United States


MISC

Way-Predicting Set-Associative Cache for High Performance and Low Energy Consumption Reviewed

Koji Inoue, Tohru Ishihara, Kazuaki J. Murakami

Proceedings of International Symposium on Low Power Electronics and Design (ISLPED'99) 1999.8

　More details

Language：Others

DOI： 10.1145/313817.313948
Dynamically variable line-size cache exploiting high on-chip memory bandwidth of merged DRAM/Logic LSIs Reviewed

K Inoue, K Kai, K Murakami

FIFTH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, PROCEEDINGS 1999.1

　More details

Language：English

This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, called the "dynamically variable line-size cache (D-VLS cache)". The D-VLS cache can optimize its line size according to the characteristics of programs, and attempts to improve performance by exploiting the high on-chip memory bandwidth. In our evaluation, a direct-mapped D-VLS cache improves performance by about 27% compared to a conventional direct-mapped cache with fixed 32-byte lines.

DOI： 10.1109/HPCA.1999.744366
RTL Design of Surface Code Decoder for Fault-Tolerant Quantum Computers Targeting Cryogenic Non-volatile FPGAs

中村, 徹舟, 宮村, 信, 井上, 弘士, 川上, 哲志, 阪本, 利司, 多田, 宗弘, 谷本, 輝夫

情報処理学会論文誌コンピューティングシステム（ACS） 17 ( 1 ) 13 - 25 2024.3 （ ISSN:1882-7829 ）

　More details

Language：Japanese

量子ハードウェアは高いエラー率を示すため，量子誤り訂正技術の実現が不可欠である．特に，表面符号は高いエラー訂正性能を持つ誤り訂正符号として注目されている．本研究では，極低温環境で動作可能なNanoBridge-FPGAへの実装を目指し，iterative greedyアルゴリズムを用いた表面符号復号器のRTL設計を行った．設計した復号器は，先行研究と同じ誤りシミュレータを用いて動作検証を行い，レイテンシ・使用リソース量の評価も行った．さらに，NanoBridge-FPGAへの論理合成・配置配線も行い，使用リソース量を確認した．
Since the error rates of existing quantum devices are high, it is essential to realize quantum error correction (QEC) techniques. In particular, surface code (SC) has attracted attention as one of the most promising error-correcting codes. In this study, we have designed an RTL surface code decoder using the iterative greedy algorithm to implement on NanoBridge-FPGA that can operate in cryogenic environments. The designed decoder was verified using the same error simulator as in the previous study, and latency and resource usage were also evaluated. In addition, we performed logic synthesis, placement and routing targeting NanoBridge-FPGA and confirmed the resource usage.

CiNii Books

CiNii Research

researchmap
極低温不揮発FPGAを対象とした誤り耐性量子コンピュータ向け表面符号復号器のRTL設計

中村徹舟, 宮村信, 井上弘士, 川上哲志, 阪本利司, 多田宗弘, 谷本輝夫

情報処理学会研究報告(Web) 2023 ( ARC-252 ) 2023

　More details

J-GLOBAL

researchmap
通信量に着目したQAOA向け極低温NISQコンピューティングのアーキテクチャ検討

富田祐永, 上野洋典, 上野洋典, 谷本輝夫, 田中雅光, 井上弘士, 中村宏

情報処理学会研究報告(Web) 2022 ( ARC-250 ) 2022

　More details

J-GLOBAL

researchmap
Demonstration of Gate-Level-Pipelined Floating-Point Units Using Single-Flux-Quantum Circuits

長岡一起, 加島亮太, 田中雅光, 川上哲志, 谷本輝夫, 山下太郎, 井上弘士, 藤巻朗

電子情報通信学会大会講演論文集(CD-ROM) 2022 2022 （ ISSN:1349-144X ）

　More details

J-GLOBAL

researchmap
単一磁束量子プロセッサ向けキャッシュメモリ構成法の検討と定量的評価

鴨志田圭吾, 石川伊織, 羽野祐太, 川上哲志, 谷本輝夫, 小野貴継, 田中雅光, 藤巻朗, 井上弘士

情報処理学会研究報告(Web) 2022 ( ARC-249 ) 2022

　More details

J-GLOBAL

researchmap
光パスゲート論理に基づく超低遅延光回路—特集集積ナノフォトニクス研究の最前線

新家昭彦, 石原亨, 井上弘士, 野崎謙悟, 納富雅也

NTT技術ジャーナル / 日本電信電話株式会社編 2018.5

　More details

Language：Japanese
アウトオブオーダ命令実行の依存グラフ表現に関する考察

谷本輝夫, 佐々木広, 小野貴継, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 2016.8

　More details

Language：Japanese
マルチスケールフィルタ向けアクセラレータ・アーキテクチャの提案

上野伸也, Gauthier Lovic Eric, 井上弘士, 村上和彰

研究報告システムLSI設計技術（SLDM） 2012.10

　More details

Language：Japanese

Accelerator Architecture for Multi Scale Filter Operation
キャッシュウェイ割り当てとコード配置の同時最適化によるメモリアクセスエネルギーの削減

高田純司, 石原亨, 井上弘士

研究報告システムLSI設計技術（SLDM） 2011.10

　More details

Language：Japanese

Simultaneous Optimization of Cache Way Selection and Code Placement for Reducing the Memory Access Energy Consumption
キャッシュウェイ割り当てとコード配置の同時最適化によるメモリアクセスエネルギーの削減

高田純司, 石原亨, 井上弘士

電子情報通信学会技術研究報告. ICD, 集積回路 2011.10

　More details

Language：Japanese

Simultaneous Optimization of Cache Way Selection and Code Placement for Reducing the Memory Access Energy Consumption
This paper proposes a technique that simultaneously finds the optimal cache way allocation and code placement for multiple given tasks running on a single-core processor. It reduces the energy consumption of a set-associative cache by activating only a single cache way at a time and deactivating the remaining ways. The technique also reduces the number of cache misses by changing the code placement in main memory, which reduces both the energy consumption of the main memory and the total execution time. Experiments using a commercial embedded processor demonstrate that the technique reduces the total energy consumption of the target processor system by 17% in the best case, compared to a system that does not apply the technique.
マルチコア向けオンチップメモリ貸与法における実行コード生成法の改善 (集積回路)

福本尚人, 今里賢一, 井上弘士

電子情報通信学会技術研究報告 2010.1

　More details

Language：Japanese

Improving execution code generation for on-chip memory lending on multicores
3次元DRAM-プロセッサ積層実装を対象としたオンチップ・メモリ・アーキテクチャの提案と評価

橋口慎哉, 小野貴継, 井上弘士, 村上和彰

研究報告システムソフトウェアとオペレーティング・システム（OS） 2009.4

　More details

Language：Japanese

On-Chip Memory Architecture for DRAM Stacking Microprocessors
演算／メモリ性能バランスを考慮した Cell／B.E. 向けオンチップ・メモリ活用法とその評価

林徹生, 福本尚人, 今里賢一, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2008.5

　More details

Language：Japanese

Performance Balancing: An Implementation of Efficient On-chip Memory Hierarchy on Cell/B.E.
We have proposed the concept of Performance Balancing to improve the CMP performance. This approach attempts to exploit the on-chip cores not only for executing the parallelized threads, but also for improving the memory performance. In this technique, it is very important to decide an appropriate number of cores dedicated to memory performance improvements. In this paper, we propose an algorithm to solve this problem and implement it on a Cell/B.E. processor. In our evaluation, it is observed that our approach can achieve 14% performance improvement in the best case compared to a conventional CMP model.
チップマルチプロセッサにおけるメモリ負荷変動の定量的解析

山口光章, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2008.5

　More details

Language：Japanese

Quantitative Analysis of Memory Workload on Chip-Multiprocessors
Integrating multiple processor cores into a single chip, as in chip-multiprocessors (CMPs), is one of the most promising approaches to achieving high performance and low power consumption at the same time. In CMPs employing a shared L2 cache, conflict misses may increase because all of the cores share the limited cache resource. To address this problem, this paper quantitatively analyzes the memory workload on CMPs. By observing the transition of a CPI stack, we can discuss the memory behavior in detail. The analysis shows that, both within a program and across programs, there are time periods in which conflicts take place frequently.
チップマルチプロセッサにおけるメモリ負荷変動の定量的解析

山口光章, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2008.5

　More details

Language：Japanese

Quantitative Analysis of Memory Workload on Chip-Multiprocessors
Integrating multiple processor cores into a single chip, as in chip-multiprocessors (CMPs), is one of the most promising approaches to achieving high performance and low power consumption at the same time. In CMPs employing a shared L2 cache, conflict misses may increase because all of the cores share the limited cache resource. To address this problem, this paper quantitatively analyzes the memory workload on CMPs. By observing the transition of a CPI stack, we can discuss the memory behavior in detail. The analysis shows that, both within a program and across programs, there are time periods in which conflicts take place frequently.
トランザクショナルメモリにおける並列実行トランザクション数動的制御法の提案とその評価

武田進, 島崎慶太, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2008.5

　More details

Language：Japanese

Adaptive Management of Parallelism on Transactional Memories
This paper proposes a technique to improve the performance of CMPs by managing the number of transactions executed in parallel. In parallel computing, shared data must be managed to ensure exclusive access. Transactional memories allow threads to access the shared data, resulting in higher performance, because thread-level parallelism can be exploited aggressively. However, when a conflict takes place in the transactional memory, the associated thread execution must be aborted to guarantee correct execution results. This abort operation degrades CMP performance. To solve this issue, we propose an adaptive management mechanism that throttles or un-throttles the thread-level parallelism. In our evaluation, we observe a speedup of 1.6x in the best case.
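The throttle/un-throttle idea in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the mechanism evaluated in the paper: the function `next_parallelism`, the per-interval abort and commit counters, and the `high`/`low` abort-ratio thresholds are all hypothetical.

```python
def next_parallelism(current, aborts, commits, max_threads, high=0.5, low=0.1):
    """Return how many transactions may run concurrently in the next interval.

    Hypothetical policy: throttle by one when the abort ratio is high,
    un-throttle by one when it is low, otherwise keep the current level.
    """
    total = aborts + commits
    if total == 0:
        return current  # nothing observed in this interval
    abort_ratio = aborts / total
    if abort_ratio > high:
        return max(1, current - 1)            # throttle, but keep one running
    if abort_ratio < low:
        return min(max_threads, current + 1)  # un-throttle up to the core count
    return current
```

A runtime could call such a function once per sampling interval; the step size of one and the thresholds are arbitrary choices made for illustration.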
トランザクショナルメモリにおける並列実行トランザクション数動的制御法の提案とその評価

武田進, 島崎慶太, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2008.5

　More details

Language：Japanese

Adaptive Management of Parallelism on Transactional Memories
This paper proposes a technique to improve the performance of CMPs by managing the number of transactions executed in parallel. In parallel computing, shared data must be managed to ensure exclusive access. Transactional memories allow threads to access the shared data, resulting in higher performance, because thread-level parallelism can be exploited aggressively. However, when a conflict takes place in the transactional memory, the associated thread execution must be aborted to guarantee correct execution results. This abort operation degrades CMP performance. To solve this issue, we propose an adaptive management mechanism that throttles or un-throttles the thread-level parallelism. In our evaluation, we observe a speedup of 1.6x in the best case.
演算/メモリ性能バランスを考慮した Cell/B.E. 向けオンチップ・メモリ活用法とその評価

林徹生, 福本尚人, 今里賢一, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2008.5

　More details

Language：Japanese

Performance Balancing : An Implementation of Efficient On-chip Memory Hierarchy on Cell/B.E.
We have proposed the concept of Performance Balancing to improve the CMP performance. This approach attempts to exploit the on-chip cores not only for executing the parallelized threads, but also for improving the memory performance. In this technique, it is very important to decide an appropriate number of cores dedicated to memory performance improvements. In this paper, we propose an algorithm to solve this problem and implement it on a Cell/B.E. processor. In our evaluation, it is observed that our approach can achieve 14% performance improvement in the best case compared to a conventional CMP model.
演算/メモリ性能バランスを考慮したCMP向けヘルパースレッド実行方式の提案と評価

今里賢一, 福本尚人, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2008.5

　More details

Language：Japanese

Performance Balancing : An Efficient Helper-Thread Execution on CMPs
Conventional CMPs attempt to exploit thread-level parallelism (TLP) by using all of the cores integrated on a chip. However, this straightforward approach does not always achieve the best performance, because the memory-wall problem becomes more critical in CMPs, resulting in poor performance in spite of high TLP. To solve this issue, we propose an efficient thread management technique called performance balancing. We deliberately throttle the TLP in order to execute software prefetchers as helper threads. Our experimental results show a 47% speedup in the best case compared with conventional parallel execution.
演算／メモリ性能バランスを考慮した CMP 向けへルパースレッド実行方式の提案と評価

今里賢一, 福本尚人, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2008.5

　More details

Language：Japanese

Performance Balancing: An Efficient Helper-Thread Execution on CMPs
Conventional CMPs attempt to exploit thread-level parallelism (TLP) by using all of the cores integrated on a chip. However, this straightforward approach does not always achieve the best performance, because the memory-wall problem becomes more critical in CMPs, resulting in poor performance in spite of high TLP. To solve this issue, we propose an efficient thread management technique called performance balancing. We deliberately throttle the TLP in order to execute software prefetchers as helper threads. Our experimental results show a 47% speedup in the best case compared with conventional parallel execution.
通信衝突削減のためのタスク配置最適化の評価

森江善之, 南里豪志, 石畑宏明, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2008.3

　More details

Language：Japanese

Evaluation of optimization of task allocation for reducing contentions
In this paper, we evaluate task-allocation optimization for avoiding contention, which is a key factor in communication performance degradation. We applied a task-allocation optimization that considers message timing to avoid contention on a tree topology, and showed that it was effective. On the other hand, other task-allocation optimizations for reducing contention use an evaluation function based on the number of hops; those optimizations have been effective on mesh and torus topologies. For a 3D mesh network topology, we experimentally investigated the difference between task-allocation optimization whose evaluation function is the number of contentions and optimization whose evaluation function is the number of hops, and we discuss the results.
高信頼マイクロプロセッサ・アーキテクチャ

井上弘士

日本信頼性学会誌 : 信頼性 = The journal of Reliability Engineering Association of Japan 2008.1

　More details

Language：Japanese

Reliable Microprocessor Architectures
A hybrid design space exploration approach for a coarse-grained reconfigurable accelerator (システムLSI設計技術)

Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

情報処理学会研究報告システムLSI設計技術（SLDM） 2008.1

　More details

Language：English

A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator
The multitude of parameters involved in designing a reconfigurable accelerator for embedded systems results in considerable complexity and a large design space. One effective technique is design space exploration, which can find the right balance among the different design parameters. A quantitative design approach, which uses data collected from applications, is an alternative; however, it is time consuming, depends heavily on the designer's observations and analyses, and might not lead to an optimal design. In this paper, a hybrid approach is introduced that uses an analytical approach to explore the design space for a reconfigurable accelerator and determine a suitable design point based on quantitative data collected from the targeted applications. It also provides flexibility for applying new design constraints as well as new application characteristics. Furthermore, it is a methodical approach that reduces design time and yields a design that satisfies the design goals. Experimental results show the efficacy of the hybrid approach.
演算/メモリ性能バランスを考慮したCMP向けオンチップ・メモリ貸与法の提案

林徹生, 今里賢一, 井上弘士, 村上和彰

情報処理学会研究報告組込みシステム（EMB） 2008.1

　More details

Language：Japanese

Execution/Memory Performance Balancing: An On-chip Memory Management Technique for High Performance CMP
This paper proposes performance balancing, a core management technique that focuses on the trade-off between computation and memory performance. In CMPs, high performance is achieved by exploiting TLP. However, resource sharing among the cores lowers memory performance, which is already low compared with that of the processor cores. Thus, we have to consider not only scalability but also the performance obtained when an ideal memory subsystem is assumed. The proposed technique attempts to select the more effective approach: exploiting scalability or improving memory performance. We also focus on software-controllable on-chip memory. By lending the local memory of some cores to others, we improve memory performance and thereby try to improve processor performance. Our experimental results show a 13% speedup in the best case, compared with conventional parallel processing on the Cell Broadband Engine.
情報社会を支えるディペンダブル･プロセッサ

井上弘士

情報処理学会研究報告システムLSI設計技術（SLDM） 2007.10

　More details

Language：Japanese

Dependable Processors for Social Information Infrastructures
This paper introduces architectural support to improve the efficiency of computer security. In social information infrastructures, we face security problems such as computer viruses and information leaks. Although a number of techniques that focus on network and system software components have been proposed to improve security efficiency, many threats still exist. Since the 1970s, microprocessors have made incredible progress in terms of performance. In addition, since the 1990s, many techniques to reduce power or energy consumption have been developed. However, there has been little discussion of computer security at the processor level. Now is the time to start considering how we can improve security efficiency by providing architectural support.
PSI-NSIM : 大規模並列システムの性能解析に向けた並列相互結合網シミュレータ

柴村英智, 薄田竜太郎, 本田宏明, 稲富雄一, 于雲青, 井上弘士, 青柳睦

電子情報通信学会技術研究報告. CPSY, コンピュータシステム 2007.10

　More details

Language：Japanese

PSI-NSIM : A Parallel Interconnection Network Simulator for Performance Analysis of Large-scale Parallel Systems
This paper presents an interconnection network simulator, PSI-NSIM, for the design and performance analysis of large-scale parallel systems. PSI-NSIM simulates a desired interconnection network based on a configuration file that specifies the target network and a communication profile generated from an application execution. Furthermore, the simulator not only provides various information for performance evaluation but also estimates overall system performance quickly and with good accuracy. Information for performance analysis and visualization of application execution is also provided. This paper reports the implementation of PSI-NSIM and the results of a performance evaluation of an existing cluster system.
通信タイミングを考慮した衝突削減のための MPI ランク配置最適化技術

森江善之, 末安直樹, 松本透, 南里豪志, 石畑宏明, 井上弘士, 村上和彰

情報処理学会論文誌コンピューティングシステム（ACS） 2007.8

　More details

Language：Japanese

Optimization of MPI Rank Allocation Considering Communication Timing for Reducing Contention
This paper proposes a rank-allocation optimization technique that avoids communication contention, a key factor in communication performance degradation. It proposes an objective function for high-quality optimization of MPI rank allocation that avoids communication contention by considering the communication timing of each message. In the evaluation experiment, we examine how much this objective function reduces communication time. The communication pattern of the recursive doubling algorithm and the communication patterns of applications such as CG and umt2000 are used in the evaluation. In the experiment, the communication time is reduced by up to 45% relative to rank-order allocation and by up to 24% relative to the rank allocation of previous work.
次世代スーパーコンピュータの設計開発に向けたシステム性能評価環境 PSI-SIM

柴村英智, 薄田竜太郎, 本田宏明, 稲富雄一, 于雲青, 井上弘士, 青柳睦

情報処理学会研究報告ハイパフォーマンスコンピューティング（HPC） 2007.8

　More details

Language：Japanese

PSI-SIM: A System Performance Evaluation Environment Toward Next Generation Supercomputer Development
This paper presents a system performance evaluation environment, PSI-SIM, for the development of next-generation petascale supercomputers. The environment estimates the performance of a desired interconnect and system based on a communication profile generated from the execution of a practical parallel application, and supports easy application analysis and visualization. We propose a program code abstraction method for fast communication profile generation. Furthermore, PSI-SIM is used to simulate applications and an existing cluster system, and the elapsed simulation times and the estimation error rates are discussed.
負荷ばらつきを考慮した MPI ブロードキャスト通信の動的最適化に関する研究

栗原康志, 曽我武史, Hyacinthe Nzigou Mamadou, 南里豪志, 末安直樹, 松本透, 井上弘士, 村上和彰

情報処理学会研究報告ハイパフォーマンスコンピューティング（HPC） 2007.8

　More details

Language：Japanese

Dynamic optimization of broadcast communication according to the load balance
This work focuses on the problem that load imbalance can degrade the performance of broadcast communication. To avoid the problem, the authors proposed an optimization technique that adjusts the order of communications in a broadcast at runtime. In this technique, the delay of each rank relative to the root rank is used to decide the optimal order. In this paper, a prototype of the technique was implemented on a PC cluster, and the optimization was shown to achieve a reduction of up to 40%. In addition, it was confirmed that applying the proposed broadcast to a sparse matrix calculation can reduce the communication time by up to about 25%.
高速かつ正確なキャッシュシミュレーション法とその評価

小野貴継, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2007.6

　More details

Language：Japanese

Fast, Accurate Cache Simulation
This paper proposes a fast, accurate cache simulation technique for efficient design space exploration, and shows its efficiency by comparing it with a related approach. Trace-driven simulation is a well-known methodology for measuring memory-system performance, e.g., cache hit rates. One advantage of this method is its high simulation speed. However, as microprocessor chips such as CMPs become increasingly complex, much faster simulation is strongly required without sacrificing the accuracy of performance prediction. The proposed approach first characterizes the memory-access patterns, and then generates a small but well-constructed memory-access trace as a stimulus for cache simulators. In our evaluation, the proposed technique reduces the trace size by 81.7% while improving the accuracy of cache miss rates by 34.6%, compared with the SimPoint approach.
大規模再構成可能データパスにおけるオンチップ・ネットワーク・アーキテクチャの検討

島崎慶太, 長野孝昭, 本田宏明, Farhad Mehdipour, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2007.6

　More details

Language：Japanese

On-chip Network Architecture for Large Scale Reconfigurable Datapath
Large Scale Reconfigurable Data Path (LSRDP) is a data-path-type processor accelerator. In the LSRDP, a very large number of floating-point units (FPUs) are arranged in a two-dimensional array, and each FPU and the FPU network are reconfigurable. There is a trade-off in area between the number of FPUs and the network configuration of the LSRDP. In this research, the LSRDP area is estimated under the conditions that the initial integral part of the quantum chemistry two-electron integral calculation is implemented and that crossbar switches implement the network connecting the FPU arrays. The result shows that the LSRDP area is minimized when each FPU in an array is connected to nine FPUs in the next array.
高速かつ正確なキャッシュシミュレーション法とその評価

小野貴継, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2007.5

Language：Japanese

Fast, Accurate Cache Simulation
This paper proposes a fast, accurate cache simulation technique for efficient design space exploration and demonstrates its efficiency by comparison with a related approach. Trace-driven simulation is a well-known methodology for measuring memory-system performance, e.g. cache hit rates. One advantage of this method is its high simulation speed. However, as microprocessor chips such as CMPs grow more complex, much faster simulation is required without sacrificing the accuracy of performance prediction. The proposed approach first characterizes the memory-access patterns and then generates a small but well-constructed memory-access trace as a stimulus for cache simulators. In our evaluation, the proposed technique reduces the trace size by 81.7% while improving the accuracy of cache miss rates by 34.6% compared with the SimPoint approach.
The potential of temperature-aware configurable cache on energy reduction (計算機アーキテクチャ)

Hamid Noori, Maziar Goudarzi, Koji INOUE, Kazuaki MURAKAMI

情報処理学会研究報告計算機アーキテクチャ（ARC） 2007.5

Language：English

The Potential of Temperature-Aware Configurable Cache on Energy Reduction
Active power used to be the primary contributor to the total power dissipation of CMOS designs, but with technology scaling, the share of leakage in the total power consumption of digital systems continues to grow. Moreover, leakage current increases exponentially with temperature. In this paper, we show the effects of temperature and technology node on the optimal (minimum-energy-consuming) cache configuration for low-energy embedded systems. We show that a temperature-aware configurable cache is an effective way to save energy in finer technologies when the embedded system may be used at different temperatures. Our results show that a temperature-aware configurable cache can save up to 66% of energy with only a 1% performance penalty for the instruction cache, and 74% of energy with a 4.7% performance loss for the data cache.
チップマルチプロセッサにおけるデータ・プリフェッチ効果の分析

福本尚人, 三原智伸, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2007.5

Language：Japanese

Effect of Data Prefetching on Chip Multiprocessor
Chip multiprocessors (CMPs) can achieve higher performance by exploiting thread-level parallelism. Increasing the number of processor cores on a chip dramatically improves peak performance. However, since memory bandwidth does not scale with the number of cores, the negative impact of the memory-wall problem becomes more critical. Data prefetching is a well-known approach to compensating for poor memory performance and has been employed in commercial processor chips. Although a number of prefetching techniques have been proposed, in many cases they assume a single processor core per chip. CMP chips contain shared resources such as L2 caches and buses. Therefore, the effect of prefetching on CMPs should differ from that on single-core processors. In this paper, we analyze the effect of prefetching on CMP performance. We first classify the impact of prefetch operations issued during program execution. Then, we discuss, qualitatively and quantitatively, the effect of prefetching on memory performance. The experimental results show that the negative effect of invalidating prefetched data is very small. In addition, about 5% of prefetch operations improve the cache hit rates of other cores.
チップマルチプロセッサにおけるデータ・プリフェッチ効果の分析

福本尚人, 三原智伸, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2007.5

Language：Japanese

Effect of Data Prefetching on Chip MultiProcessor
Chip multiprocessors (CMPs) can achieve higher performance by exploiting thread-level parallelism. Increasing the number of processor cores on a chip dramatically improves peak performance. However, since memory bandwidth does not scale with the number of cores, the negative impact of the memory-wall problem becomes more critical. Data prefetching is a well-known approach to compensating for poor memory performance and has been employed in commercial processor chips. Although a number of prefetching techniques have been proposed, in many cases they assume a single processor core per chip. CMP chips contain shared resources such as L2 caches and buses. Therefore, the effect of prefetching on CMPs should differ from that on single-core processors. In this paper, we analyze the effect of prefetching on CMP performance. We first classify the impact of prefetch operations issued during program execution. Then, we discuss, qualitatively and quantitatively, the effect of prefetching on memory performance. The experimental results show that the negative effect of invalidating prefetched data is very small. In addition, about 5% of prefetch operations improve the cache hit rates of other cores.
動的再構成可能プロセッサ Vulcan2 とそのソフトウェア開発環境ISAccに関する研究

平木哲夫, 門内伸吾, 山崎陽介, 神戸隆行, GAUTHIER Lovic, MAURO GOULART FERREIRA Victor, TROUVE Antoine, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. RECONF, リコンフィギャラブルシステム : IEICE technical report 2007.5

Language：Japanese

A Study of the Dynamic Reconfigurable Processor Vulcan2 and Its Development Tool ISAcc
Application-specific extensions of a processor provide higher performance. In this paper, the authors propose "Vulcan2", an application-specific processor with a dynamically reconfigurable datapath, and "ISAcc", a development tool for Vulcan2, and demonstrate the efficiency of the proposed processor.
大規模再構成可能データパスにおけるオンチップ・ネットワーク・アーキテクチャの検討

島崎慶太, 長野孝昭, 本田宏明, メディプーファラハド, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2007.5

Language：Japanese

On-chip Network Architecture for Large Scale Reconfigurable Datapath
The Large Scale Reconfigurable Data Path (LSRDP) is a data-path-type processor accelerator. On the LSRDP, a very large number of floating-point units (FPUs) are arranged as a 2-dimensional array, and both the FPUs and the network connecting them are reconfigurable. For the LSRDP area, there is a trade-off between the number of FPUs and the network configuration. In this research, the LSRDP area is estimated under the conditions that the initial integral part of the two-electron integral calculation in quantum chemistry is implemented and that crossbar switches implement the network connecting the FPU arrays. The results show that the LSRDP area is minimized when each FPU in an array is connected to nine FPUs in the next array.
通信タイミングを考慮したランク配置最適化技術

森江善之, 末安直樹, 松本透, 南里豪志, 石畑宏明, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2007.3

Language：Japanese

Optimization of rank allocation considering communication timing
This paper proposes a rank allocation optimization technique that avoids communication contention, a key factor in the deterioration of communication performance. The proposed method avoids contention by allocating ranks with the communication timing of each message taken into account. The method incurs overhead because synchronization must be added. The evaluation examines how much the method reduces communication time, including that overhead. The experiments use the recursive doubling communication pattern and the communication patterns of real applications such as CG and umt2000. In the experiments, communication time is reduced by up to 45% relative to in-order rank allocation and by up to 24% relative to the rank allocation of previous work.
単一磁束量子回路による再構成可能な大規模データパスをもつプロセッサ

高木直史, 村上和彰, 藤巻朗, 吉川信行, 井上弘士, 本田宏明

電子情報通信学会技術研究報告. SCE, 超伝導エレクトロニクス 2007.1

Language：Japanese

A processor with a large-scale reconfigurable data-path using rapid single flux quantum circuits
A processor with a large-scale reconfigurable data-path using rapid single flux quantum circuits is proposed for a 10TFLOPS desk-side superconductive computer.
Drowsyキャッシュにおけるモード切替アルゴリズムの評価

図子純平, 冨山宏之, 高田広章, 井上弘士

情報処理学会研究報告計算機アーキテクチャ（ARC） 2006.11

Language：Japanese

Evaluation of Algorithm to Change Cache Line Mode in Drowsy Caches
In the design of embedded systems, especially battery-powered systems, it is important to reduce energy consumption. These days, cache memories are used not only in general-purpose processors but also in processors for embedded systems. The static (leakage) energy consumed in caches has been increasing as the feature size decreases. The Drowsy cache is one technique for reducing the leakage energy of caches: it switches cache lines into a low-leakage mode. When a cache line in the low-leakage mode is accessed, it must first be switched back to the normal mode, which takes one or more clock cycles. These penalty cycles may significantly degrade cache performance. In this paper, we propose three kinds of Way-Prediction Drowsy Cache that achieve a large energy reduction with minimal performance overhead. Experimental results demonstrate the effectiveness of the proposed cache architectures.
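
The wake-up penalty described in this abstract can be illustrated with a small model. The sketch below is only an assumption-laden toy, not one of the algorithms evaluated in the paper: it assumes a hypothetical policy in which all lines are switched to the low-leakage mode every `window` accesses, and it counts the penalty cycles paid when a drowsy line is touched. The names `drowsy_penalty`, `window`, and `wake_cycles` are illustrative.

```python
def drowsy_penalty(line_accesses, window=4, wake_cycles=1):
    """Count wake-up penalty cycles for a sequence of cache-line indices.

    Hypothetical policy: all lines start in the low-leakage (drowsy) mode,
    and every `window` accesses all lines are switched back to it. Touching
    a drowsy line costs `wake_cycles` before it can be read in normal mode.
    """
    awake = set()
    penalty = 0
    for i, line in enumerate(line_accesses):
        if i > 0 and i % window == 0:
            awake.clear()            # periodic switch to the low-leakage mode
        if line not in awake:
            penalty += wake_cycles   # line must return to normal mode first
            awake.add(line)
    return penalty


if __name__ == "__main__":
    accesses = [1, 1, 2, 1, 1, 2, 2, 3]
    print(drowsy_penalty(accesses, window=4, wake_cycles=1))
```

A shorter window keeps more lines in the low-leakage mode but raises the penalty count, which is the trade-off a mode-switching algorithm has to balance.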
メモリ・アーキテクチャ・ベンチマーキング手法の提案

小野貴継, 井上弘士, 村上和彰

情報処理学会研究報告システム評価（EVA） 2006.8

Language：Japanese

A Methodology for Memory Architecture Benchmarking
To select a memory architecture from among many design candidates, trace-driven simulation is commonly used. Although it is a common approach for evaluating memory architectures, it is time-consuming. In this paper, we propose a Memory Architecture Benchmarking technique that reduces simulation time while maintaining simulation accuracy. To evaluate the validity of the proposed technique, we measured the cache hit ratio. In our evaluation, the proposed technique reduces simulation time by about 77.6%, with cache hit ratio prediction errors of about 4.2% on average.
近似文字列照合プログラム実行の特徴解析と高速化に関する検討

柴田圭, 馬場謙介, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. DC, ディペンダブルコンピューティング 2006.7

Language：Japanese

Analyzing the Characteristic of Approximate String Matching for Processor Performance Improvement
In this paper, we analyze the characteristics of a bit-parallel algorithm for approximate string matching. Current virus scanning based on exact matching cannot discover variant viruses made by altering a known virus program. Approximate string matching is a candidate solution to this problem. To realize fast, highly functional virus scanning, we analyze the execution characteristics of an approximate string matching program. First, we analyze the memory capacity required during program execution and the execution frequency of each instruction. The results indicate that the L1 cache capacity of present processors is sufficient. In addition, we found a bias in the execution frequency of sequences of consecutively executed instructions. Moreover, we found that a performance improvement of 14% can be expected by taking advantage of a reconfigurable functional unit.
チップマルチプロセッサにおけるキャッシュメモリの特性解析

三原智伸, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2006.7

Language：Japanese

Analysis of Cache Memory Characteristics on a Chip MultiProcessor
Chip multiprocessors (CMPs) are attracting attention as a way to achieve higher performance. Because of limited bus bandwidth and the memory wall problem, a memory system suited to CMPs must be designed. In such a system, the on-chip cache architecture has a large impact on performance, and whether an on-chip cache is shared among multiple processor cores or dedicated to each core is an important design choice. In this paper, we study the performance difference between a shared-cache CMP and a dedicated-cache CMP. We analyze, qualitatively and quantitatively, the factors that affect memory access time, and show that the L2 cache miss rate accounts for the largest gap between the two.
キャッシュメモリ中の衰退ラインを利用したメモリ整合性検証の高速化

坂口高宏, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. DC, ディペンダブルコンピューティング 2006.7

Language：Japanese

Mitigating The Performance Impact of Memory Integrity Verification by Exploiting Cache Decay Lines
We focus on Memory Integrity Verification, which detects corruption of loaded data by maintaining the state of the protected memory region in a secure storage area. Because it consumes on-chip cache capacity and memory bandwidth, Memory Integrity Verification adversely affects processor performance. One cause of the performance loss is contention in the cache between the data the processor needs for program execution and the data used for Memory Integrity Verification. The resulting cache misses cause memory accesses, which lower processor performance. To solve this problem, the data for Memory Integrity Verification is placed in cache lines with low access frequency, called decay lines, which prevents the eviction of data needed for program execution. Compared with a previous method, the proposed method reduces the performance overhead by 23.8% on average.
演算結果再利用による高信頼かつ低消費電力なプロセッサに関する検討

橋口陽祐, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2006.6

Language：Japanese

A Low-Power, Reliable Datapath by Reusing Execution Results
The declining soft-error tolerance of processors is becoming a problem. A soft error is a phenomenon in which a circuit malfunctions temporarily because of noise. Memories can be protected with parity and ECC to improve reliability, but it is difficult to add error detection/correction codes to combinational circuits. Errors can instead be detected by executing the program redundantly, but this increases energy consumption. In this research, we investigate a reliable datapath that reuses execution results. Instead of re-executing an identical instruction, the datapath keeps the result in a table and obtains it without using the ALU. Protecting the table with ECC provides reliability. The energy consumption depends on the table organization. By examining the table organization, the increase in energy consumption can be held to 6.3%.
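
The result-reuse idea in this abstract resembles memoization of ALU operations. The following is a minimal sketch under stated assumptions, not the authors' design: the class `ReuseTable`, its FIFO eviction, its capacity, and the two-operation stand-in ALU are all hypothetical, and ECC protection of the table is not modeled.

```python
import operator

# Stand-in ALU with two operations; a real datapath supports many more.
ALU_OPS = {"add": operator.add, "sub": operator.sub}


class ReuseTable:
    """Toy reuse table mapping (opcode, operand1, operand2) to a stored result."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}          # insertion-ordered dict used as a FIFO
        self.alu_executions = 0
        self.reuses = 0

    def execute(self, op, a, b):
        key = (op, a, b)
        if key in self.entries:
            self.reuses += 1       # result obtained without using the ALU
            return self.entries[key]
        self.alu_executions += 1
        result = ALU_OPS[op](a, b)
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))
            del self.entries[oldest]   # evict the oldest entry
        self.entries[key] = result
        return result


if __name__ == "__main__":
    table = ReuseTable()
    for op, a, b in [("add", 1, 2), ("add", 1, 2), ("sub", 5, 3), ("add", 1, 2)]:
        table.execute(op, a, b)
    print(table.alu_executions, table.reuses)
```

The capacity and organization of such a table determine how often the ALU is bypassed, which is why the abstract ties energy consumption to the table organization.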
プログラムの実行経路の偏りに着目した分岐予測法

築地孝典, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2006.6

Language：Japanese

A Case for Hot-Path-based Branch Prediction
Modern high-performance processors employ branch predictors. The accuracy of branch prediction influences processor performance because the processor executes wrong instructions when a misprediction occurs. To improve the accuracy of branch prediction, large and complex branch predictors have been proposed. However, the energy consumed by branch predictors has been increasing. In addition, when a misprediction occurs, total chip energy increases because of the execution of invalid instructions. Therefore, achieving high prediction accuracy and reducing the energy consumption of the branch predictor are both very important. We propose a new method to address these issues. It is well known that a small number of instruction paths, called hot paths, are executed frequently during program execution. In a hot path, branch instructions tend to produce the same execution results, i.e. the same branch direction and the same target address. Moreover, a few hot paths account for the majority of the total execution time. The proposed method predicts branches by accessing a small memory that holds the branch instruction addresses and branch target addresses of hot paths. Compared with the Gshare predictor, the misprediction rate increases by 2.2 points, but the energy consumption is reduced by 40%.
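
As a rough illustration of predicting from a small table of hot-path branches, the sketch below counts mispredictions over a branch trace. It is a toy under explicit assumptions and not the predictor from the paper: the table is assumed to be filled in advance (for example from a profile), branches absent from the table are assumed to be predicted not taken, and the names `hot_path_mispredictions` and `hot_table` are hypothetical.

```python
def hot_path_mispredictions(hot_table, branch_trace):
    """Count mispredictions using a small table of hot-path branches.

    hot_table: dict mapping a branch address to the target address recorded
               for that branch on a hot path (assumed taken to that target).
    branch_trace: list of (branch_pc, actual_next_pc, fallthrough_pc) tuples.
    """
    mispredictions = 0
    for branch_pc, actual_next, fallthrough in branch_trace:
        # Branches absent from the table are predicted not taken.
        predicted = hot_table.get(branch_pc, fallthrough)
        if predicted != actual_next:
            mispredictions += 1
    return mispredictions


if __name__ == "__main__":
    hot_table = {0x100: 0x80}
    trace = [
        (0x100, 0x80, 0x104),   # hot branch, taken as recorded
        (0x100, 0x80, 0x104),
        (0x100, 0x104, 0x104),  # hot branch leaves the hot path
        (0x200, 0x204, 0x204),  # cold branch, not taken
        (0x200, 0x300, 0x204),  # cold branch, taken
    ]
    print(hot_path_mispredictions(hot_table, trace))
```

A table this small is cheap to access, which reflects the energy motivation in the abstract, at the cost of mispredicting branches that deviate from the recorded hot path.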
プログラムの実行経路の偏りに着目した分岐予測法

築地孝典, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2006.6

Language：Japanese

A Case for Hot-Path-based Branch Prediction
Modern high-performance processors employ branch predictors. The accuracy of branch prediction influences processor performance because the processor executes wrong instructions when a misprediction occurs. To improve the accuracy of branch prediction, large and complex branch predictors have been proposed. However, the energy consumed by branch predictors has been increasing. In addition, when a misprediction occurs, total chip energy increases because of the execution of invalid instructions. Therefore, achieving high prediction accuracy and reducing the energy consumption of the branch predictor are both very important. We propose a new method to address these issues. It is well known that a small number of instruction paths, called hot paths, are executed frequently during program execution. In a hot path, branch instructions tend to produce the same execution results, i.e. the same branch direction and the same target address. Moreover, a few hot paths account for the majority of the total execution time. The proposed method predicts branches by accessing a small memory that holds the branch instruction addresses and branch target addresses of hot paths. Compared with the Gshare predictor, the misprediction rate increases by 2.2 points, but the energy consumption is reduced by 40%.
演算結果再利用による高信頼かつ低消費電力なプロセッサに関する検討

橋口陽祐, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2006.6

Language：Japanese

A Low-Power, Reliable Datapath by Reusing Execution Results
The declining soft-error tolerance of processors is becoming a problem. A soft error is a phenomenon in which a circuit malfunctions temporarily because of noise. Memories can be protected with parity and ECC to improve reliability, but it is difficult to add error detection/correction codes to combinational circuits. Errors can instead be detected by executing the program redundantly, but this increases energy consumption. In this research, we investigate a reliable datapath that reuses execution results. Instead of re-executing an identical instruction, the datapath keeps the result in a table and obtains it without using the ALU. Protecting the table with ECC provides reliability. The energy consumption depends on the table organization. By examining the table organization, the increase in energy consumption can be held to 6.3%.
A Reconfigurable Functional Unit for Adaptable Custom Instructions(集積回路技術とアーキテクチャ技術の協調・融合へ向けた,プロセッサ,並列処理,システムLSIアーキテクチャ及び一般)

Noori Hamid, Mehdipour Farhad, Murakami Kazuaki, INOUE Koji, SAHEBZAMANI Morteza

電子情報通信学会技術研究報告. ICD, 集積回路 2006.6

Language：English

A Reconfigurable Functional Unit for Adaptable Custom Instructions
This paper presents a reconfigurable functional unit (RFU) for an adaptive dynamic extensible processor. The processor can tune its extended instructions to the target applications after chip fabrication, which provides more flexibility. Custom instructions (CIs) are generated from the hot basic blocks during the training mode. In the normal mode, CIs are executed on the RFU. A quantitative approach was used to design the RFU, which is a matrix of functional units with 8 inputs and 6 outputs. For 22 MiBench applications, the proposed RFU improves performance by a factor of up to 1.5. The size of the configuration memory is reduced by 40% by making the RFU partially reconfigurable, finding subsets of CIs, and merging small CIs into one configuration. This processor needs no extra opcodes for CIs, no new compiler, no source code modification, and no recompilation.
新世代マイクロプロセッサアーキテクチャ（後編）:2.新しいデザインバランス 2.信頼性・安全性とプロセッサ

井上弘士

情報処理 2005.11

Language：Japanese

New Generation Microprocessor Architecture (2):Security and Reliability of Advanced Microprocessors
待機ラインへの参照密度に基づく低リーク・キャッシュの動的制御

小宮礼子, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2005.8

Language：Japanese

Dynamic Performance Optimization for Low-Leakage Caches based on Increase-Miss Density
A number of techniques to reduce cache leakage energy have so far been proposed. However, in these techniques, flushing the data of a line that is turned off causes a new cache miss, and these additional misses degrade processor performance. We have analyzed cache-access behavior in detail and have found that there is locality in accesses to the turned-off lines. Based on this observation, we propose a cache management technique to alleviate the negative effect of low-leakage caches. In our approach, cache lines with a high degree of increase-miss locality are forced to stay in the high-speed but high-leakage mode. In our evaluation, the proposed scheme worsens performance by only 5.0% while achieving the same degree of energy reduction as the Cache decay approach.
実行振舞いを鍵情報とする不正プログラムの動的検出方式

井上弘士, 岩佐崇史

情報処理学会研究報告計算機アーキテクチャ（ARC） 2005.8

Language：Japanese

Run-Time Instruction Detection based on Dynamic Execution Behavior
To address the security problem, we propose a hardware-based intrusion detection technique that regards the dynamic program-execution behavior as a certification key. An execution behavior is determined from secret key information. A secure compiler then constructs object code that generates the determined execution behavior at run time. During program execution, a secure profiler monitors the execution behavior. If the secure profiler does not observe the determined behavior, it signals the microprocessor to terminate the current program execution. Since viruses do not know the behavior required to continue executing on the microprocessor, malicious attacks can be detected and blocked at the beginning of their execution.
待機状態ラインに対する参照局所性を考慮した低リーク・キャッシュの性能低下抑制方式

小宮礼子, 井上弘士, モシニャガ・ワシリー, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2004.12

Language：Japanese

A Cache Management Technique via Sleep - Hit Locality to Alleviate Performance Impact of Low - Leakage Caches
A number of techniques to reduce cache leakage energy have so far been proposed. However, in these techniques, low-speed accesses to standby-mode lines degrade processor performance. We have analyzed cache-access behavior in detail and have found that there is locality in accesses to the standby-mode lines. Based on this observation, we propose a cache management technique to alleviate the negative effect of low-leakage caches. In our approach, cache lines with a high degree of sleep-hit locality are forced to stay in the high-speed but high-leakage mode. In our evaluation, the Drowsy cache achieves an 84% leakage reduction with a 15% performance degradation, while the proposed scheme worsens performance by only 8-11% with the same degree of energy reduction as the Drowsy approach.
キャッシュ・ミス頻発命令とその特徴解析

堂後靖博, 三輪英樹, ヴィクトル・マウロ・グラール・フェヘイラ, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2004.12

Language：Japanese

Characteristic Analysis of Delinquent Memory Access Instructions
Recent remarkable advances in VLSI technology have dramatically increased processor speed and DRAM capacity. However, these advances have also introduced a large and growing performance gap between the processor and DRAM. This problem, referred to as the "Memory Wall", results in poor total system performance in spite of higher processor performance. To solve this problem, researchers have proposed high-performance techniques to alleviate the effect of delinquent memory-access instructions. In this paper, we investigate the behavior of delinquent memory-access instructions in detail. The results presented in this paper will be useful for developing new approaches to the memory wall problem.
キャッシュ・ミス頻発命令を考慮したメモリ・システムの高性能化

三輪英樹, 堂後靖博, ヴィクトルM グラールフェヘイラ, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2004.12

Language：Japanese

A Method for Improving Memory System Performance Exploring Delinquent Loads
In recent years, the performance gap between microprocessor speed and main memory latency has been increasing. This problem limits throughput improvements and is well known in the literature as the Memory-Wall Problem (MWP). This paper proposes a new method to minimize the MWP effect by means of re-computation. The basic idea is to replace loads that frequently miss in the cache (delinquent loads) with a piece of code that regenerates the missed value (recomputation code). This method can reduce the number of main memory accesses and consequently alleviate the MWP. In the experiments, computation time is reduced by up to 45.3% for SPEC2000 benchmark programs.
デｰタパス分割に基づく高信頼プロセッサの提案とその予備評価

松坂茂治, 井上弘士

情報処理学会研究報告. SLDM, [システムLSI設計技術] 2004.12

Language：Japanese

A Dependable Processor Architecture with Data-Path Partitioning
To maintain the high reliability of a computer system, it is necessary to detect faults that lead to failures. In general, faults are detected by using time redundancy or spatial redundancy. However, realizing these redundancies requires additional hardware or increases execution time. In this paper, we propose data-path partitioning, which realizes spatial redundancy without large hardware changes. Moreover, we propose a scheme that reduces the execution time overhead by taking into account the minimum bit width needed for each operation. In our evaluation, execution time grew by a factor of 1.62 on average (redundancy: two) and 3.09 on average (redundancy: four). Therefore, this technique is very effective.
不正プログラムの実行防止を目的とするオンチップ・キャッシュ・アーキテクチャ

井上弘士

情報処理学会研究報告計算機アーキテクチャ（ARC） 2004.7

Language：Japanese

A Cache Architecture to Prevent Malicious Code Executions
This paper proposes an architectural support to improve computer security, called Secure Cache (SCache), and evaluates its energy/security efficiency. A number of malicious codes attempt to hijack program-execution flow by causing stack smashing that corrupts the return address stored in a stack. In order to avoid the return address corruption, SCache generates a replica data in the cache area. In our evaluation, for many benchmarks, it is observed that more than 99.7% of return-address loads can be protected.
オペランド再利用によるレジスタ・ファイルの低消費電力化

高村拓志, 井上弘士, Moshnyaga Vasily G.

情報処理学会研究報告計算機アーキテクチャ（ARC） 2002.8

Language：Japanese

Reducing Power Consumption of Register Files through Operand Reuse
This paper proposes an energy reduction technique for register files. The proposed approach reuses operand data read from the register file in order to reduce the number of register-file accesses. If sequentially executed instructions i and j specify the same source operand, the operand data read from the register file by instruction i is reused for instruction j. In this case, the operand fetch for instruction j can be performed without activating the register file, saving energy. In addition to read operations, register-file write accesses can be eliminated by exploiting the forwarding unit, which is used to resolve RAW pipeline hazards. In our simulation, the proposed approach reduces the total number of register-file accesses by 62% compared with a conventional model.
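
The read-side reuse rule in this abstract can be sketched by counting register-file reads over an instruction sequence. This is an illustrative toy, not the simulator used in the paper: the tuple format `(dest, src1, src2)`, the function name `count_register_reads`, and the extra condition that the previous instruction must not overwrite the reused register are assumptions, and the elimination of write accesses through forwarding is not modeled.

```python
def count_register_reads(instructions):
    """Return (reads_without_reuse, reads_with_reuse).

    Each instruction is a tuple (dest, src1, src2) of register names.
    Identical source registers within one instruction count as one read.
    """
    baseline = with_reuse = 0
    prev_srcs, prev_dest = set(), None
    for dest, src1, src2 in instructions:
        srcs = {src1, src2}
        baseline += len(srcs)
        for src in srcs:
            # Reuse the latched operand only if the previous instruction read
            # the same register and did not overwrite it (assumed condition).
            reusable = src in prev_srcs and src != prev_dest
            if not reusable:
                with_reuse += 1
        prev_srcs, prev_dest = srcs, dest
    return baseline, with_reuse


if __name__ == "__main__":
    program = [
        ("r3", "r1", "r2"),
        ("r4", "r1", "r5"),   # r1 was read by the previous instruction
        ("r1", "r4", "r5"),   # r5 reused; r4 was just written, so it is read
        ("r6", "r1", "r5"),   # r5 reused; r1 was just overwritten
    ]
    print(count_register_reads(program))
```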
低消費電力メディア・アプリケーション向けヒストリ・ベース・タグ比較キャッシュの評価

井上弘士, Moshnyaga Vasily G., 村上和彰

電子情報通信学会技術研究報告. DC, ディペンダブルコンピューティング 2002.4

Language：Japanese

Energy Efficiency of History-Based Tag-Comparison Cache for Media Applications
We have proposed a history-based tag-comparison cache (HBTC cache) architecture for low energy consumption. In conventional direct-mapped caches, tag checks are performed to examine whether the current reference hits the cache. In contrast, the HBTC cache attempts to reduce the frequency of tag checks by exploiting execution footprints. In this paper, we improve the proposed HBTC cache and evaluate its performance/energy efficiency using media applications. Simulation results show that this approach can eliminate up to 95% of tag checks, saving 17% of the cache energy while affecting processor performance by 0.2%.
低消費電力メディア・アプリケーション向けヒストリ・ベース・タグ比較キャッシュの評価

井上弘士, Moshnyaga Vasily G., 村上和彰

電子情報通信学会技術研究報告. CPSY, コンピュータシステム 2002.4

Language：Japanese

Energy Efficiency of History-Based Tag-Comparison Cache for Media Applications
We have proposed a history-based tag-comparison cache (HBTC cache) architecture for low energy consumption. In conventional direct-mapped caches, tag checks are performed to examine whether the current reference hits the cache. In contrast, the HBTC cache attempts to reduce the frequency of tag checks by exploiting execution footprints. In this paper, we improve the proposed HBTC cache and evaluate its performance/energy efficiency using media applications. Simulation results show that this approach can eliminate up to 95% of tag checks, saving 17% of the cache energy while affecting processor performance by 0.2%.
二電源電圧を用いた命令発行メモリの低消費電力化手法

辻寛司, 井上弘士, モシニャガワシリー

情報処理学会研究報告システムLSI設計技術（SLDM） 2001.11

Language：Japanese

Reducing Energy Dissipation of Complexity Adaptive Issue Queue by Dual Voltage Supply
This paper presents a novel architectural technique to reduce energy dissipation of adaptive issue queue, whose functionality is dynamically adjusted at runtime to match the changing computational demands of instruction stream. In contrast to existing schemes, the technique exploits a new freedom in queue design, namely the voltage per access. Since loading capacitance operated in the adaptive queue varies in time, the clock cycle budget becomes inefficiently exploited. We propose to trade-off the unused cycle time with supply voltage, lowering the voltage level when the queue functionality is reduced and increasing it with the activation of resources in the queue. Experiments show that the approach can save up to 36% of the issue queue energy without large performance and area overhead.
タグ比較結果の再利用によるキャッシュメモリの低消費電力化

井上弘士, Moshnyaga Vasily G., 村上和彰

情報処理学会研究報告システムLSI設計技術（SLDM） 2001.11

Language：Japanese

A Low Power Cache Memory Architecture based on Tag Compare Reuse
This paper proposes a novel architecture for low-power instruction caches called "history-based look-up cache (HBL cache)". In conventional n-way set-associative caches, there are n locations where a cache line can be placed in the cache space, and all ways are activated on every cache access because of the parallel search strategy. On the other hand, the HBL cache attempts to reuse the tag comparison results, and reduces the cache-access energy by avoiding the unnecessary way activations. The tag-comparison results are recorded in an extended BTB (Branch Target Buffer) for branch prediction. In our evaluation, it is observed that the HBL cache reduces the energy consumption by about 72%, while it degrades the performance by only 0.2%, compared with a conventional set-associative cache.
データ圧縮による画像処理用メモリの低消費電力化手法とその評価

深川瑞香, 井上弘士, Moshnyaga Vasily G.

情報処理学会研究報告システムLSI設計技術（SLDM） 2001.11

Language：Japanese

Reducing power consumption of Video memory through data compression
This paper proposes an idea for reducing the power consumption of video memories through data compression. Video memory systems use in-order-access memories, e.g., frame memory. In a conventional memory, all bitlines are activated for reading or writing. In contrast, our approach compresses the read (or write) data and activates only the bitlines corresponding to the bits that differ between successively accessed data. As a result, the power consumed by memory accesses is reduced by reducing the total amount of bitline switching. In our simulation, our approach reduces the power consumption of frame memory by 11%-16% for many video sequences.
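
The difference-bit idea can be illustrated by counting how many bitlines would be activated when only the bits that change between successive words are driven. The sketch below is a simplified model under stated assumptions, not the evaluated design: the function name `bitline_activations`, the 8-bit word width, and the all-zero initial state are illustrative choices.

```python
def bitline_activations(words, width=8):
    """Return (conventional, differential) bitline activation counts.

    conventional: every bitline is activated on every access.
    differential: only bitlines whose bit differs from the previously
                  accessed word are activated.
    """
    conventional = width * len(words)
    differential = 0
    previous = 0                      # assumed initial state: all zeros
    for word in words:
        differential += bin(previous ^ word).count("1")  # changed bits only
        previous = word
    return conventional, differential


if __name__ == "__main__":
    pixels = [0b10101010, 0b10101011, 0b10101111, 0b10101111]
    print(bitline_activations(pixels))
```

Neighboring pixels in a frame often have similar values, so the XOR between successive words tends to have few set bits, which is the property this kind of scheme relies on.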
A low-power instruction cache architecture exploiting program execution footprints

INOUE K.

Work-in-Progress Session in the 7th International Symposium on High-Performance Computer Architecture, Included in CD Proc. 2001.1

Language：Others
Performance/Energy Efficiency of Variable Line-Size Caches on Intelligent Memory Systems

Koji Inoue, Koji Kai, Kazuaki J. Murakami

Proc. of the 2nd Workshop on Intelligent Memory Systems 2000.11

Language：Others

DOI： 10.1007/3-540-44570-6_13
A High-Performance and Low-Power Cache Architecture with Speculative Way-Selection

INOUE Koji, ISHIHARA Tohru, MURAKAMI Kazuaki

IEICE transactions on electronics 2000.2

Language：English

This paper proposes a new approach to achieving high performance and low energy consumption for set-associative caches. The cache, called the way-predicting set-associative cache, speculatively selects a single way, which is likely to contain the data desired by the processor, from the set designated by a memory address, before it starts a normal cache access. By accessing only the single predicted way, instead of accessing all the ways in a set, energy consumption can be reduced. For the way-predicting cache to perform well, the accuracy of way prediction is important. This paper shows that the accuracy of an MRU (most recently used)-based way prediction is higher than 90% for most of the benchmark programs. The proposed way-predicting cache improves the ED (energy-delay) product by 60-70% compared to the conventional set-associative cache.
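The MRU-based way prediction described in the abstract above can be sketched as a toy Python model. It is an illustration under stated assumptions, not the authors' design: the class name, the default cache geometry, the LRU replacement policy, and the two-step probe (predicted way first, remaining ways on a misprediction) are all hypothetical. The model counts how many ways are activated, which is the quantity the abstract relates to energy.

```python
class WayPredictingCache:
    """Toy set-associative cache that probes the MRU way of a set first."""

    def __init__(self, num_sets=2, num_ways=2, line_size=16):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.line_size = line_size
        # tags[s][w] holds the tag stored in way w of set s (None = empty).
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        # order[s] lists the ways of set s from most to least recently used.
        self.order = [list(range(num_ways)) for _ in range(num_sets)]
        self.ways_activated = 0
        self.first_probe_hits = 0

    def access(self, address):
        """Return True on a cache hit, False on a miss."""
        line = address // self.line_size
        index = line % self.num_sets
        tag = line // self.num_sets
        order = self.order[index]

        predicted = order[0]              # the MRU way is the prediction
        self.ways_activated += 1          # only the predicted way is probed
        if self.tags[index][predicted] == tag:
            self.first_probe_hits += 1
            return True

        # Misprediction: activate the remaining ways in a second step.
        self.ways_activated += self.num_ways - 1
        for way in order[1:]:
            if self.tags[index][way] == tag:
                order.remove(way)
                order.insert(0, way)      # this way becomes the MRU way
                return True

        victim = order.pop()              # assumed policy: evict the LRU way
        self.tags[index][victim] = tag
        order.insert(0, victim)
        return False
```

A conventional cache in this model would activate `num_ways` ways on every access, so the saving depends directly on how often the first probe hits.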
MOE: A special-purpose parallel computer for high-speed, large-scale molecular orbital calculation Reviewed

Koji Hashimoto, Hiroto Tomita, Koji Inoue, Katsuhiko Metsugi, Kazuaki Murakami, Shinjiro Inabata, So Yamada, Nobuaki Miyakawa, Hajime Takashima, Kunihiro Kitamura, Shigeru Obara, Takashi Amisaki, Kazutoshi Tanabe, Umpei Nagashima

ACM/IEEE SC 1999 Conference, SC 1999 1999.11

Language：English

We are constructing a high-performance, special-purpose parallel machine for ab initio molecular orbital calculations, called MOE (Molecular Orbital calculation Engine). The sequential execution time is O(N^4), where N is the number of basis functions, and most of the time is spent on the calculation of electron repulsion integrals (ERIs). The calculation of ERIs has a large amount of parallelism, of O(N^4), and MOE tries to exploit it. This paper discusses the MOE architecture and examines important aspects of the architecture design required to calculate ERIs according to the "Obara method". We conclude that n-way parallelization is the most cost-effective, and hence we designed the MOE prototype system with a host computer and many processing nodes. Each processing node includes a 76-bit floating-point multiply-and-add unit, internal memory, and other components, and performs ERI computations efficiently. We estimate that the prototype system with 100 processing nodes can calculate the energy of proteins in a few days.

DOI： 10.1109/SC.1999.10000

Industrial property rights

Patent	Number of applications: 1	Number of registrations: 0
Utility model	Number of applications: 0	Number of registrations: 0
Design	Number of applications: 0	Number of registrations: 0
Trademark	Number of applications: 0	Number of registrations: 0

Professional Memberships

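Information Processing Society of Japan (IPSJ)
Institute of Electronics, Information and Communication Engineers (IEICE)
IEEE
ACM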

Committee Memberships

ACM SIGMICRO Executive Committee Members Foreign country

2023.7 - 2026.6
IPSJ Special Interest Group on System Architecture (SIG-ARC) Chair Domestic

2018.3 - 2022.3
Secretary Foreign country

2015.1 - 2016.12
Organizer Domestic

2012.4 - 2013.3

Academic Activities

TPC International contribution

International Symposium on Microarchitecture (MICRO) （ Austin ） 2024.11

Type：Competition, symposium, etc.
TPC International contribution

International Symposium on Computer Architecture (ISCA) （ Argentina ） 2024.6 - 2023.7

Type：Competition, symposium, etc.
Program Officer, Research Center for Science Systems

Role(s)： Review, evaluation

2024.4 - 2027.3

Type：Scientific advice/Review
TPC International contribution

International Symposium on High-Performance Computer Architecture (HPCA) （ Edinburgh ） 2024.3

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Microarchitecture (MICRO) （ Others ） 2023.10

Type：Competition, symposium, etc.
Associate Member, Science Council of Japan

Role(s)： Review, evaluation

2023.10 - Present

Type：Scientific advice/Review
TPC International contribution

IEEE Micro Top Picks （ Others ） 2023.6

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Computer Architecture (ISCA) （ Others ） 2023.6

Type：Competition, symposium, etc.
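Research Area Advisor, JST PRESTO/CREST research area on pioneering co-creative frontiers through the interdisciplinary fusion of quantum and classical technologies

Role(s)： Review, evaluation

2023.6 - 2032.3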

Type：Scientific advice/Review
Other International contribution

International Symposium on High-Performance Computer Architecture (HPCA) （ Others ） 2023.2

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Microarchitecture (MICRO) （ Others ） 2022.10

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Computer Architecture (ISCA) （ Others ） 2022.6

Type：Competition, symposium, etc.
Other International contribution

International Symposium on High-Performance Computer Architecture （ Others ） 2022.2

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Microarchitecture （ Others ） 2021.10

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Computer Architecture (ISCA) （ Others ） 2021.5 - 2021.6

Type：Competition, symposium, etc.
Other International contribution

International Symposium on High-Performance Computer Architecture （ Others ） 2021.2

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Microarchitecture （ Others ） 2020.10

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Computer Architecture （ Spain ） 2020.5 - 2020.6

Type：Competition, symposium, etc.
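Member, Working Group on Next-Generation Computing Infrastructure

Role(s)： Review, evaluation

Ministry of Education, Culture, Sports, Science and Technology (MEXT) 2020.4 - 2021.3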

Type：Scientific advice/Review
Other International contribution

International Symposium on High-Performance Computer Architecture （ Others ） 2020.2

Type：Competition, symposium, etc.
Research Area Advisor, JST PRESTO research area on creating innovative technological foundations for quantum information processing

Role(s)： Review, evaluation

2019.5 - 2025.3

Type：Scientific advice/Review
Other International contribution

International Symposium on Microarchitecture （ Japan ） 2018.10

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Computer Architecture （ United States of America ） 2018.6

Type：Competition, symposium, etc.
Research Supervisor, JST PRESTO research area "Pioneering Innovative Computing Technologies"

Role(s)： Review, evaluation

JST 2018.4 - 2023.3

Type：Scientific advice/Review
Other International contribution

International Symposium on High-Performance Computer Architecture （ United States of America ） 2018.2

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Computer Architecture （ United States of America ） 2017.6

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Microarchitecture （ Taiwan ） 2016.10

Type：Competition, symposium, etc.
Other International contribution

18th Asia and South Pacific Design Automation Conference （ Japan ） 2013.1

Type：Competition, symposium, etc.
Other International contribution

The 41st International Conference on Parallel Processing （ United States of America ） 2012.9

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Low Power Electronics and Design 2012 （ United States of America ） 2012.7 - 2012.8

Type：Competition, symposium, etc.
Other International contribution

International Conference for High Performance Computing, Networking, Storage and Analysis （ United States of America ） 2011.12

Type：Competition, symposium, etc.
Other International contribution

The 19th Annual IFIP/IEEE Conference on Very Large Scale Integration 2011 （ Hong Kong ） 2011.10

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Low Power Electronics and Design 2011 （ Japan ） 2011.8

Type：Competition, symposium, etc.
Other International contribution

The 6th IEEE International Conference on Networking, Architecture, and Storage （ China ） 2011.7

Type：Competition, symposium, etc.
Other International contribution

11th International Forum on Embedded MPSoC and Multicore 2011 （ France ） 2011.7

Type：Competition, symposium, etc.
Other International contribution

The IEEE International Symposium on VLSI 2011 （ India ） 2011.7

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Low Power Electronics and Design （ Austin, United States of America ） 2010.8

Type：Competition, symposium, etc.
Other International contribution

IEEE Computer Society Annual Symposium on VLSI （ Lixouri, Kefalonia, Greece ） 2010.7

Type：Competition, symposium, etc.
Other International contribution

International Forum on Embedded MPSoC and Multicore （ Japan ） 2010.6 - 2010.7

Type：Competition, symposium, etc.
Other International contribution

The IEEE Symposium on Low-Power and High-Speed Chips （ Yokohama, Japan ） 2010.4

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Embedded Multicore Systems-on-Chip （ Vienna, Austria ） 2009.9

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Low Power Electronics and Design 2009 （ San Francisco, United States of America ） 2009.8

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Low Power Electronics and Design （ San Francisco, California, United States of America ） 2009.8

Type：Competition, symposium, etc.
Other International contribution

IEEE Computer Society Annual Symposium on VLSI （ Tampa, United States of America ） 2009.5

Type：Competition, symposium, etc.
Other International contribution

The IEEE Symposium on Low-Power and High-Speed Chips 2009 （ Yokohama, Japan ） 2009.4

Type：Competition, symposium, etc.
Other International contribution

The Workshop on Synthesis And System Integration of Mixed Information technologies 2009 （ Okinawa, Japan ） 2009.3

Type：Competition, symposium, etc.
Other International contribution

14th Asia and South Pacific Design Automation Conference 2009 （ Yokohama, Japan ） 2009.1

Type：Competition, symposium, etc.
Other International contribution

International Conference on Field-Programmable Technology 2008 （ Taipei, Taiwan ） 2008.12

Type：Competition, symposium, etc.
Other International contribution

MEDEA Workshop (MEmory performance: DEaling with Applications, systems and architecture) （ Toronto, Canada ） 2008.10

Type：Competition, symposium, etc.
Other International contribution

International Conference on Field Programmable Logic and Applications （ Heidelberg, Germany ） 2008.9

Type：Competition, symposium, etc.
Other International contribution

International Symposium on Low Power Electronics and Design 2008 （ Bangalore, India ） 2008.8

Type：Competition, symposium, etc.
Other International contribution

The IEEE Symposium on Low-Power and High-Speed Chips 2008 （ Yokohama, Japan ） 2008.4

Type：Competition, symposium, etc.
Other International contribution

13th Asia and South Pacific Design Automation Conference 2008 （ Korea ） 2008.1

Type：Competition, symposium, etc.
Other

57th Joint Conference of Electrical and Electronics Engineers in Kyushu （ Japan ） 2004.9

Type：Competition, symposium, etc.
Other

17th Workshop on Circuits and Systems in Karuizawa （ Japan ） 2004.4

Type：Competition, symposium, etc.
English-language Transactions A, April 2005 special issue "Special Section on Selected Papers from the 17th Workshop on Circuits and Systems in Karuizawa" International contribution

2004.1

Type：Academic society, research group, etc.

Research Projects

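Creation of Fundamental Technologies for Vertical Semiconductor Nanowire-Array Quantum Integrated Circuits

2023.10 - 2029.3

JST

Authorship：Coinvestigator(s)

This research aims to realize ultra-low-power electronics that dramatically reduce the power consumption of integrated circuits based on current Si-MOSFETs, by establishing the fundamental technologies and basic principles of nanowire-array quantum integrated circuits. In particular, it assumes a three-dimensional structure in which devices with a new structure are integrated in a 3D mesh, and explores new computer architectures for that structure.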
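Creation and Development of Superconducting Computing Technology for the Post-Moore Era

2022.6 - 2027.3

KAKENHI Grant-in-Aid for Scientific Research (S)

Authorship：Principal investigator

About 30 years ago, device research toward superconducting computers became active worldwide, and the field then entered a winter period. That situation is now changing significantly. In addition to advances in materials and circuit technology, research in computer engineering has progressed rapidly over the last few years, and innovative architectures have emerged one after another. The semiconductor scaling that has sustained computer performance growth will come to an end around 2030. Against this background, superconducting computing is again attracting attention as a leading candidate for the next-generation computing platform, and the winter period is about to end. The goal of this research is to raise our state-of-the-art basic research, which has continued to lead this field, to the system level and to establish it, ahead of the rest of the world, as cryogenic superconducting general-purpose computing technology. To that end, we carry out research that cuts across the system layers from devices to architecture, and create computer architectures that exploit novel devices. This constitutes a new direction for computer engineering in the post-Moore era, based on device diversity.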
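Creation and Development of Superconducting Computing Technology for the Post-Moore Era

Grant number：22H00518 2022.4 - 2026.3

Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (A)

井上弘士, 田中雅光, 川上哲志, 谷本輝夫, 廣川真男, 小野貴継

Grant type：Scientific research funding

The goal of this research is to raise our state-of-the-art basic research, which has continued to lead architecture for single-flux-quantum circuits, to the system level and to establish it, ahead of the rest of the world, as cryogenic superconducting general-purpose computing technology. In the first two years, we develop the elemental technologies: construction of the underlying theories, chip prototyping for proof of principle, conceptual architecture design, and device modeling. In the third year, we integrate these and carry out microarchitecture exploration, and in the final year we perform detailed design and an overall evaluation.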
JST Moonshot: Realization of a fault-tolerant universal quantum computer that will revolutionize economy, industry, and security by 2050

2022.2 - 2026.3

Authorship：Coinvestigator(s)
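Development of Integration Technologies for Superconducting Quantum Circuits

2022.2 - 2026.3

JST

Authorship：Coinvestigator(s)

The aim is to explore a "multi-stage, heterogeneous quantum control architecture inside the refrigerator" for superconducting quantum computers. Specifically, we (1) define and design the circuit architecture for error-correcting codes, (2) build, evaluate, and analyze a system-level environment for exploring quantum computer architectures, and (3) formulate guidelines, based on quantitative evaluation, for cooperative operation across the stages inside the refrigerator (in particular, the mK and 4 K stages).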
Creation and development of superconducting computing technology for post-Moore era

Grant number：22H05000 2022 - 2026

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (S)

井上弘士, 田中雅光, 中村宏, 川上哲志, 板垣奈穂, 谷本輝夫, 浜屋宏平

Authorship：Principal investigator Grant type：Scientific research funding

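The aim of this research is to create new computing principles premised on the use of superconducting devices and to pioneer innovative computing technologies. Starting from our basic research to date, which is at the forefront worldwide, we aim to (1) derive an information representation best suited to SFQ circuits and cryogenic arithmetic mechanisms based on it, (2) explore new cryogenic memory and communication schemes that combine different kinds of novel devices, and (3) create a cryogenic superconducting general-purpose computer architecture based on these.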
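Development of Integration Technologies for Superconducting Quantum Circuits

2022 - 2025

Promotion of Strategic Research and Development, Moonshot Research and Development Program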

Authorship：Coinvestigator(s) Grant type：Contract research
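Development and Demonstration Project for Energy-Efficient, Brain-Inspired Artificial Intelligence Technologies

2021.10 - 2024.3

Ministry of Internal Affairs and Communications (MIC)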

Authorship：Coinvestigator(s)
Study of an Evolvable Computer Architecture That Controls Approximate Computing Techniques

2019.4 - 2020.3

Joint research

Authorship：Principal investigator Grant type：Other funds from industry-academia collaboration
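Research and Development of the My-IoT Development Platform

2019.1 - 2022.3

Cabinet Office

Authorship：Principal investigator

This research proposes the "My-IoT platform concept", an edge-centric IoT system architecture that lets users easily build their own IoT systems and use them as simply as the personal computers they use every day in the field. Under this concept, we not only make use of existing IoT assets but also develop innovative technologies that allow an IoT system to be used like a local PC. By enabling users themselves, without relying on IoT developers, to design, develop, and operate IoT systems that are easy to learn and simple to deploy, we substantially reduce development cost and remove barriers to IoT adoption. In addition, by providing an "IoT store" where not only the platform provider but also platform users can register design assets they have created, we build an ecosystem with a sharing element, in which developers and users can share know-how on IoT system development and use, free of charge or for a fee. To realize this concept, we carry out research and development on virtualized system architecture, next-generation edge computing, environment-adaptive edge actuation, and an environment for automatically building and developing edge platforms. We also conduct demonstration experiments based on assumed use cases, form a community centered on companies in the Kyushu region, and work to disseminate the research results.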
100-GHz-Class Spatio-Temporal Superconducting Computing to Support the Post-Moore Era

Grant number：19H01105 2019 - 2021

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (A)

Authorship：Principal investigator Grant type：Scientific research funding
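Innovative Superconducting Computing for a Low-Carbon AI Processing Platform

2018.10 - 2023.3

JST

Authorship：Principal investigator

The goal of this research is to develop SFNuro, an AI processing engine that will be a key component of a cryogenic computing platform supporting the coming AI society, with practical deployment in mind, and to demonstrate its feasibility and its effect on reducing CO2 emissions as information-processing infrastructure. SFNuro is a neural-network processing engine for deep learning built from single-flux-quantum (SFQ) circuits, and is positioned as a foundation for computing in cryogenic environments. Superconducting circuits that use single flux quanta, such as RSFQ and its derivatives (energy-efficient RSFQ, RQL, AQFP, HSTP), are collectively called "SFQ circuits". They can operate at very high speed and low power that conventional MOSFETs cannot achieve, and are a promising computing substrate for the post-Moore era. Research results on SFQ have been reported before, but (1) architecture-level exploration and (2) application-aware optimization were insufficient. In addition, (3) the pursuit of fully correct operation forced designers to secure operating margins, which limited power efficiency. Issues (1) to (3) stem from the fact that earlier work adopted architectures imitating existing general-purpose CMOS processors. Resolving them requires fundamentally rebuilding the system architecture so that it makes the most of the strengths of SFQ devices and circuits while hiding their weaknesses. This research therefore derives such a system organization through cross-layer optimization spanning the circuit, architecture, and algorithm layers.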
Study of an Evolvable Computer Architecture That Controls Approximate Computing Techniques

2018.4 - 2019.3

Joint research

Authorship：Principal investigator Grant type：Other funds from industry-academia collaboration
Innovative Superconducting Computing for a Low-Carbon AI Processing Platform

2018 - 2022

JST Strategic Basic Research Program (Ministry of Education, Culture, Sports, Science and Technology)

Authorship：Principal investigator Grant type：Contract research
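Research and Development of the My-IoT Development Platform

2018 - 2022

Cross-ministerial Strategic Innovation Promotion Program (SIP), Phase 2 / Physical Space Digital Data Processing Infrastructure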

Authorship：Principal investigator Grant type：Contract research
Cybersecurity Technology Based on Physical Event Space

Grant number：17K19984 2017 - 2018

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Challenging Research(Exploratory)

Authorship：Principal investigator Grant type：Scientific research funding
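Research on a 100-GHz-Class Superconducting Processor Architecture That Surpasses the Limits of Silicon

2016.4 - 2019.3

Japan Society for the Promotion of Science (JSPS)

Authorship：Principal investigator

As an elemental computing technology for the post-silicon era, this research develops, ahead of the rest of the world, a superconducting processor architecture with very high performance and low power: about 5 W of power consumption at a 100-GHz-class operating frequency. Through chip prototyping of the main components and system-level simulation, we demonstrate its effectiveness and feasibility. This is interdisciplinary research spanning computer engineering and superconducting engineering, in which the architecture and circuits are co-designed on the premise of using superconducting devices. In this way, we show how to build processors using a new device that replaces silicon, and establish the superconducting circuit design technology needed to realize them.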
Research on a 100-GHz-Class Superconducting Processor Architecture That Surpasses the Limits of Silicon

Grant number：16H02796 2016 - 2018

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (B)

Authorship：Principal investigator Grant type：Scientific research funding
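Research on Ultra-Low-Latency Optical Computing Technology Using Integrated Nanophotonics

2015.12 - 2021.3

JST

Authorship：Coinvestigator(s)

To fundamentally solve this problem, this research proposes a new optical computing technology that makes full use of precise control techniques in nanophotonics, aiming to bring about disruptive innovation in information processing. Optical computers were actively studied in the 1980s and 1990s, but are regarded as a technology that later declined because no advantage over CMOS could be found. Building on an analysis of the optical computer research of that time, we propose new computing techniques with the aim of eliminating the latency bottleneck expected 10 to 20 years from now.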
Research on Ultra-Low-Latency Optical Computing Technology Using Integrated Nanophotonics

2015 - 2020

JST CREST

Authorship：Coinvestigator(s) Grant type：Contract research
Research on Superconducting Processor Architecture Toward Computing in Outer Space

Grant number：26540022 2014 - 2015

Grants-in-Aid for Scientific Research Grant-in-Aid for Exploratory Research

Authorship：Principal investigator Grant type：Scientific research funding
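Development of a Power Management Framework for Post-Petascale Systems

2012.10 - 2018.3

JST

Authorship：Coinvestigator(s)

In post-petascale high-performance computing systems, the conventional design philosophy, which provisions hardware resources within the supplied-power or thermal-design-power constraint and guarantees that peak power during operation never exceeds it, makes it difficult to scale applications to future large-scale systems. This project instead takes the view that post-petascale systems should deliberately allow peak power to exceed the constraint and keep the effective power below it by optimizing the hardware's power-performance knobs, and adopts this as its architectural concept. In such a power-constraint-adaptive system, the key is not to use up all available hardware resources as before, but to distribute the limited power budget adaptively among applications, and within each application among computation, memory, and communication, so as to optimize performance and system power efficiency. With such adaptive power control, a single system can meet a wide range of hardware resource demands simply by adjusting the power-performance knobs, and can serve many applications. To achieve high performance and high power efficiency on such a system, power control and power management matched to application characteristics and operating conditions become one of the most important roles of system software. At present, however, sufficient software assets do not exist, and the requirements on the system architecture and on each software layer are unclear. This research therefore aims to optimize control of the hardware power-performance knobs according to application characteristics and operating conditions, improving both application performance and overall system power efficiency. We research and develop three elemental technologies: (1) power-performance knob optimization matched to application characteristics and operating conditions, (2) power-performance behavior prediction for large-scale applications, and (3) a system architecture in which system software can control the power-performance knobs effectively. For (1) we develop system software, including libraries and middleware, and performance optimization tools; for (2) a set of power prediction tools; and for (3) a power-performance knob abstraction that frees software from hardware-dependent optimization. The ultimate goal is to create, as a power management framework for the post-petascale era, a computing environment that uses power resources effectively.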
Development of a Power Management Framework for Post-Petascale Systems

2012 - 2017

JST CREST

Authorship：Coinvestigator(s) Grant type：Contract research
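SMYLE Project

2010.12 - 2012.3

New Energy and Industrial Technology Development Organization (NEDO), Japan

Authorship：Principal investigator

In realizing low-power many-core processors, the most important issues are thoroughly raising the utilization of a large number of small cores and greatly reducing the power they consume while operating. There is an urgent need to develop technologies unique to many-core processors: performance that scales with the number of cores (more cores yield higher performance) and power reduction that scales with the number of cores (more cores yield lower power consumption). This project therefore proposes "virtual accelerators, and many-core processors as their execution platform" as the form that low-power many-core processors for embedded systems should take, and develops the architecture that makes this possible, defines the associated APIs, and develops an application development environment including a compiler. We also demonstrate effectiveness through simulation and prototyping, survey application domains for the proposed many-core processor, and indicate a direction toward practical use. In the proposed approach, the hardware is made flexible so that the compiler can determine the architecture. This widens the choice of automatic parallelization strategies and achieves high performance that scales with the number of cores even in embedded systems, where a wide variety of applications is expected. In addition, the problems that arise when operating at very low voltages of about 0.5 to 0.6 V are solved by thoroughly exploiting the abundant hardware resources of the many-core processor, which enables power reduction that scales with the number of cores.
The project is carried out under a novel and effective organization unconstrained by conventional assumptions. Specifically, young researchers at Kyushu University (overall coordination and architecture), Ritsumeikan University (compiler), and the University of Electro-Communications (low-power techniques) collaborate closely with two fast-growing venture companies, Fixstars (programming and compiler) and TOPS Systems (processor development and applications), forming a five-organization team. The project sites are the Inoue Laboratory, Faculty of Information Science and Electrical Engineering, Kyushu University; the Tomiyama Laboratory, Department of Electronic and Information Design, College of Science and Engineering, Ritsumeikan University; the Kondo Laboratory, Graduate School of Information Systems, the University of Electro-Communications; the Fixstars Corporation head office (Osaki); and the TOPS Systems Corporation head office (Tsukuba).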
SMYLE Many-Core

2010.12 - 2012.3

New Energy and Industrial Technology Development Organization (NEDO), Japan

Authorship：Principal investigator
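"Development of Ultra-Low-Power Circuit and System Technologies (Green IT Project)", R&D Item 7: "Architecture and Compiler Technologies for Low-Power Many-Core Processors"

2010 - 2012

New Energy and Industrial Technology Development Organization (NEDO)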

Authorship：Principal investigator Grant type：Contract research
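Research on Many-Core Processors That Enable On-Chip Supercomputing

2009.4 - 2013.3

Japan Society for the Promotion of Science (JSPS), Japan

Authorship：Principal investigator

This research develops a "new-era 3D many-core processor" that enables on-chip supercomputing, as one of the fundamental technologies supporting the next-generation information society. We also carry out prototyping and simulation to demonstrate the effectiveness and feasibility of the proposed processor. Specifically, several hundred processor cores (hereafter, cores) integrated in three dimensions on a single LSI chip cooperate adaptively to achieve performance comparable to a mid-scale supercomputer, while also reducing power consumption as an environmental measure and improving reliability and safety for stable and secure operation. The aim is practical use in the high-performance backbone servers that will support the information society of the near future.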
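Development of Core Orchestration Technology for Maximizing the Effective Performance of Multicore Processors

2009.4 - 2012.3

Semiconductor Technology Academic Research Center (STARC), Japan

Authorship：Principal investigator

The goal of this research is to establish core orchestration technology, in which multiple cores execute cooperatively and adaptively (that is, help one another as needed), in order to draw out the full latent capability of multicore processors. With almost no increase in hardware cost or power consumption, we aim for a performance improvement of at least 60% over conventional parallel execution (a target set on the basis of preliminary experimental results). We also demonstrate the feasibility of the proposed approach through test-chip fabrication and prototyping.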
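Research on an Adaptive 3D Microprocessor Architecture for Maximizing Energy Efficiency

2009.1 - 2012.12

New Energy and Industrial Technology Development Organization (NEDO), Grant for Young Researchers, Japan

Authorship：Principal investigator

This research combines 3D integration technology for semiconductor devices with architecture technology to develop a new microprocessor that maximizes energy efficiency. Specifically, we propose an adaptive next-generation microprocessor architecture in which multiple processor cores, a dynamically reconfigurable accelerator, and large-capacity memory are stacked in three dimensions. We also establish the cooperative execution scheme and compilation techniques needed to draw out its full potential, demonstrate the effectiveness of the proposed approach, and demonstrate its feasibility through prototyping with practical use in view.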
Research on Many-Core Processors That Enable On-Chip Supercomputing

Grant number：21680005 2009 - 2012

Grants-in-Aid for Scientific Research Grant-in-Aid for Young Scientists (A)

Authorship：Principal investigator Grant type：Scientific research funding
Research on an Adaptive 3D Microprocessor Architecture for Maximizing Energy Efficiency

2008 - 2012

New Energy and Industrial Technology Development Organization (NEDO), Grant for Young Researchers

Authorship：Principal investigator Grant type：Contract research
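A Reconfigurable Low-Power, High-Performance Processor Based on Single-Flux-Quantum Circuits

2006.9

Authorship：Coinvestigator(s)

Aiming to realize a desk-side computer with about 10 TFLOPS of computing power using a processor with a large-scale reconfigurable data path (RDP) built from superconducting single-flux-quantum (SFQ) circuits, we conduct research ranging from architecture and arithmetic circuits down to devices. Compared with a parallel-processor implementation in current CMOS integrated circuit technology, power consumption is expected to fall to 1/10,000 or less for the processor part, about 1/400 for the computer as a whole, and about 1/100 when air conditioning and the cryocooler are included. Researchers with track records in computer architecture, arithmetic circuits, and SFQ circuits work together to establish RDP architecture technology, develop methods for building reconfigurable circuits from SFQ circuits, and develop floating-point unit organizations suited to the SFQ-RDP, thereby establishing the fundamental technologies for a 10-TFLOPS computer with a large-scale SFQ-RDP.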
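Development of Petascale System Interconnect Technology

2005.4 - 2008.3

Ministry of Education, Culture, Sports, Science and Technology (MEXT)

Authorship：Coinvestigator(s)

The PSI project targets the system interconnect technology that links thousands to hundreds of thousands of high-speed compute nodes in supercomputer systems exceeding the petaflops class. Aiming for an order-of-magnitude better cost-performance than current systems, it develops three elemental technologies for simultaneously achieving higher performance, richer functionality, and lower cost: (1) optical packet switches and ultra-compact optical links, (2) MPI acceleration through dynamic communication optimization, and (3) comprehensive performance evaluation of system interconnects.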
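Research on Environment-Adaptive Processors for Achieving Both High Reliability and Low Power Consumption

2005.4 - 2007.3

Japan Society for the Promotion of Science (JSPS), Japan

Authorship：Principal investigator

This research develops an environment-adaptive processor system that achieves both improved fault tolerance and lower energy consumption, as a fundamental technology supporting the next-generation information society. Assuming use in personal portable electronic devices, we develop a dependable processor that takes into account not only improved fault tolerance but also safety. We also analyze the trade-off between reliability and energy consumption.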
Research on Environment-Adaptive Processors for Achieving Both High Reliability and Low Power Consumption

Grant number：17680005 2005 - 2007

Grants-in-Aid for Scientific Research Grant-in-Aid for Young Scientists (A)

Authorship：Principal investigator Grant type：Scientific research funding
Research on Secure and Low-Energy Processors

2004.9 - 2005.3

Research commissions

Authorship：Principal investigator Grant type：Other funds from industry-academia collaboration
Research on Secure and Low-Energy Processors

2003.9 - 2007.3

Japan Science and Technology Agency (JST)

Authorship：Principal investigator

In next-generation social infrastructure based on advanced information technology, microprocessor systems will deeply penetrate our daily lives, for example in electronic government, electronic money, and ubiquitous computing. To achieve a stable social environment, microprocessors have to address at least two issues: improving safety and further reducing energy consumption. In this study, we propose a novel processor system to solve the computer-virus problem, and analyze the trade-off between safety and energy consumption. Our processor system exploits program-execution behavior as a confidential key to achieve on-line program certification.
Research on Secure and Low-Energy Processors

2003 - 2006

Japan Science and Technology Agency (JST), PRESTO (individual research)

Authorship：Principal investigator Grant type：Contract research
Development of High-Performance, Low-Power Memory Systems Based on Prediction Techniques

2002.4 - 2005.3

Japan Society for the Promotion of Science (JSPS), Japan

Authorship：Principal investigator

We develop dynamic optimization techniques to achieve high performance and low energy consumption at the same time. Our approach monitors memory-access behavior and attempts to eliminate unnecessary operations.
Development of High-Performance, Low-Power Memory Systems Using Prediction Techniques

Grant number：14702064 2002 - 2004

Grants-in-Aid for Scientific Research Grant-in-Aid for Young Scientists (A)

Authorship：Principal investigator Grant type：Scientific research funding

Educational Activities

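In master's and undergraduate education, the emphasis is on acquiring problem-solving skills, and students are taught the entire process from devising a new idea to demonstrating its effectiveness. For doctoral students, guidance additionally centers on acquiring the ability to identify problems. International education is also emphasized through joint research with overseas universities and research institutions, and study abroad by doctoral students is encouraged. Through world-leading research, the aim is to train people who will sustain next-generation computer architecture technology.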

Award for Educational Activities

Kyushu University Engineering Lecture Award

2021.10 Kyushu University

Award-winner：井上弘士

Class subject

Study in Information Science and Technology

2023.4 - 2024.3 Full year
Seminar in Information Science and Technology

2023.4 - 2024.3 Full year
Research in Information Science and Technology I

2023.4 - 2024.3 Full year
Discussion in Information Science and Technology I

2023.4 - 2023.9 First semester
Technical Writing in Information Science and Technology I

2023.4 - 2023.9 First semester
Reading in Information Science and Technology

2023.4 - 2023.9 First semester
Presentation Methods in Information Science and Technology

2023.4 - 2023.9 First semester
Advanced Computer System Architecture

2023.4 - 2023.6 Spring quarter
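Computer Architecture II

2022.10 - 2023.3 Second semester
Introduction to Integrated Circuit Engineering B

2022.6 - 2022.8 Summer quarter
Computer Architecture I (EC)

2022.6 - 2022.8 Summer quarter
Computer Architecture I (B)

2022.6 - 2022.8 Summer quarter
Study in Information Science and Technology

2022.4 - 2023.3 Full year
Research in Information Science and Technology I

2022.4 - 2023.3 Full year
Seminar in Information Science and Technology

2022.4 - 2023.3 Full year
Discussion in Information Science and Technology I

2022.4 - 2022.9 First semester
Introduction to Integrated Circuit Engineering

2022.4 - 2022.9 First semester
Seminar in Advanced Information Technology II

2022.4 - 2022.9 First semester
Study in Advanced Information Technology II

2022.4 - 2022.9 First semester
Reading in Information Science and Technology

2022.4 - 2022.9 First semester
Technical Writing in Information Science and Technology I

2022.4 - 2022.9 First semester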
Advanced Computer System Architecture

2022.4 - 2022.6 Spring quarter
Introduction to Integrated Circuit Engineering A

2022.4 - 2022.6 Spring quarter
(IUPE)Computer Architecture I

2021.12 - 2022.2 Winter quarter
Study in Advanced Information Technology I

2021.10 - 2022.3 Second semester
Computer Architecture II

2021.10 - 2022.3 Second semester
Presentation in Information Science and Technology

2021.10 - 2022.3 Second semester
Seminar in Advanced Information Technology III

2021.10 - 2022.3 Second semester
Study in Advanced Information Technology III

2021.10 - 2022.3 Second semester
Seminar in Advanced Information Technology I

2021.10 - 2022.3 Second semester
Advanced Embedded Software

2021.6 - 2021.8 Summer quarter
Computer Architecture I

2021.6 - 2021.8 Summer quarter
Computer Architecture I (A first half, B)

2021.6 - 2021.8 Summer quarter
Introduction to Integrated Circuit Engineering B

2021.6 - 2021.8 Summer quarter
Advanced Embedded Systems

2021.6 - 2021.8 Summer quarter
[M2][Communications/Society track] Advanced Embedded Systems

2021.6 - 2021.8 Summer quarter
[M2][Computer track] Advanced Embedded Systems

2021.6 - 2021.8 Summer quarter
Advanced Seminar in Social Information Systems Engineering

2021.4 - 2022.3 Full year
情報理工学研究Ⅰ

2021.4 - 2022.3 Full year
情報理工学演習

2021.4 - 2022.3 Full year
国際演示技法

2021.4 - 2022.3 Full year
知的財産技法

2021.4 - 2022.3 Full year
ティーチング演習

2021.4 - 2022.3 Full year
先端プロジェクト管理技法

2021.4 - 2022.3 Full year
Scientific English Presentation

2021.4 - 2022.3 Full year
Intellectual Property Management

2021.4 - 2022.3 Full year
Exercise in Teaching

2021.4 - 2022.3 Full year
Advanced Project Management Technique

2021.4 - 2022.3 Full year
計算機構特別講究

2021.4 - 2022.3 Full year
Advanced Research in Computer Systems and Applications

2021.4 - 2022.3 Full year
情報知能工学特別講究第一

2021.4 - 2022.3 Full year
情報知能工学特別講究第二

2021.4 - 2022.3 Full year
知的情報システム工学特別演習

2021.4 - 2022.3 Full year
社会情報システム工学特別演習

2021.4 - 2022.3 Full year
Advanced Research in Advanced Information Technology I

2021.4 - 2022.3 Full year
Advanced Research in Advanced Information Technology II

2021.4 - 2022.3 Full year
Adv Semi in Intelligent Information Systems Engineering

2021.4 - 2022.3 Full year
[M2]Exercise in Embedded System

2021.4 - 2021.9 First semester
集積回路工学通論

2021.4 - 2021.9 First semester
組込みシステム演習

2021.4 - 2021.9 First semester
[M2]組込みシステム演習

2021.4 - 2021.9 First semester
情報理工学読解

2021.4 - 2021.9 First semester
[M2]情報知能工学演習第二

2021.4 - 2021.9 First semester
[M2]情報知能工学講究第二

2021.4 - 2021.9 First semester
Exercise in Embedded System

2021.4 - 2021.9 First semester
[M2]Advanced Computer System Architecture

2021.4 - 2021.6 Spring quarter
集積回路工学通論A

2021.4 - 2021.6 Spring quarter
コンピュータシステム・アーキテクチャ特論

2021.4 - 2021.6 Spring quarter
[M2]コンピュータシステム・アーキテクチャ特論

2021.4 - 2021.6 Spring quarter
Advanced Computer System Architecture

2021.4 - 2021.6 Spring quarter
(IUPE)Computer Architecture I

2020.12 - 2021.2 Winter quarter
コンピュータアーキテクチャⅡ

2020.10 - 2021.3 Second semester
電気情報工学入門Ⅱ

2020.10 - 2021.3 Second semester
コンピュータアーキテクチャⅡ

2020.10 - 2021.3 Second semester
情報知能工学演習第一

2020.10 - 2021.3 Second semester
情報知能工学演習第三

2020.10 - 2021.3 Second semester
情報知能工学講究第一

2020.10 - 2021.3 Second semester
情報知能工学講究第三

2020.10 - 2021.3 Second semester
集積回路工学通論B

2020.6 - 2020.8 Summer quarter
コンピュータアーキテクチャⅠ

2020.6 - 2020.8 Summer quarter
コンピュータアーキテクチャⅠ（B）

2020.6 - 2020.8 Summer quarter
Advanced Seminar in Social Information Systems Engineering

2020.4 - 2021.3 Full year
国際演示技法

2020.4 - 2021.3 Full year
知的財産技法

2020.4 - 2021.3 Full year
ティーチング演習

2020.4 - 2021.3 Full year
先端プロジェクト管理技法

2020.4 - 2021.3 Full year
Scientific English Presentation

2020.4 - 2021.3 Full year
Intellectual Property Management

2020.4 - 2021.3 Full year
Exercise in Teaching

2020.4 - 2021.3 Full year
Advanced Project Management Technique

2020.4 - 2021.3 Full year
計算機構特別講究

2020.4 - 2021.3 Full year
Advanced Research in Computer Systems and Applications

2020.4 - 2021.3 Full year
情報知能工学特別講究第一

2020.4 - 2021.3 Full year
情報知能工学特別講究第二

2020.4 - 2021.3 Full year
知的情報システム工学特別演習

2020.4 - 2021.3 Full year
社会情報システム工学特別演習

2020.4 - 2021.3 Full year
Advanced Research in Advanced Information Technology I

2020.4 - 2021.3 Full year
Advanced Research in Advanced Information Technology II

2020.4 - 2021.3 Full year
Adv Semi in Intelligent Information Systems Engineering

2020.4 - 2021.3 Full year
情報知能工学講究第二

2020.4 - 2020.9 First semester
電気情報工学入門Ⅰ

2020.4 - 2020.9 First semester
コンピュータシステム・アーキテクチャ特論

2020.4 - 2020.9 First semester
情報知能工学演習第二

2020.4 - 2020.9 First semester
集積回路工学通論

2020.4 - 2020.6 Spring quarter
集積回路工学通論A

2020.4 - 2020.6 Spring quarter
(IUPE)Computer Architecture I

2019.12 - 2020.2 Winter quarter
情報知能工学講究第三

2019.10 - 2020.3 Second semester
コンピュータアーキテクチャⅡ

2019.10 - 2020.3 Second semester
コンピュータアーキテクチャⅡ

2019.10 - 2020.3 Second semester
情報知能工学演習第一

2019.10 - 2020.3 Second semester
情報知能工学演習第三

2019.10 - 2020.3 Second semester
情報知能工学講究第一

2019.10 - 2020.3 Second semester
コンピュータ・アーキテクチャⅠ

2019.6 - 2019.8 Summer quarter
コンピュータアーキテクチャⅠ（B）

2019.6 - 2019.8 Summer quarter
集積回路工学通論B

2019.6 - 2019.8 Summer quarter
集積回路工学通論A/B

2019.4 - 2019.9 First semester
集積回路工学通論

2019.4 - 2019.9 First semester
コンピュータアーキテクチャ特論

2019.4 - 2019.9 First semester
コンピュータシステム・アーキテクチャ特論

2019.4 - 2019.9 First semester
情報知能工学演習第二

2019.4 - 2019.9 First semester
情報知能工学講究第二

2019.4 - 2019.9 First semester
集積回路工学通論A

2019.4 - 2019.6 Spring quarter
コンピュータ・アーキテクチャⅡ

2018.10 - 2019.3 Second semester
コンピュータアーキテクチャⅡ

2018.10 - 2019.3 Second semester
コンピュータアーキテクチャⅡ

2018.10 - 2019.3 Second semester
情報知能工学演習第一

2018.10 - 2019.3 Second semester
情報知能工学演習第三

2018.10 - 2019.3 Second semester
情報知能工学講究第一

2018.10 - 2019.3 Second semester
情報知能工学講究第三

2018.10 - 2019.3 Second semester
コンピュータ・アーキテクチャⅠ

2018.6 - 2018.8 Summer quarter
コンピュータアーキテクチャⅠ

2018.6 - 2018.8 Summer quarter
コンピュータシステム・アーキテクチャ特論

2018.4 - 2018.9 First semester
コンピュータアーキテクチャ特論

2018.4 - 2018.9 First semester
コンピュータシステム・アーキテクチャ特論

2018.4 - 2018.9 First semester
情報知能工学演習第二

2018.4 - 2018.9 First semester
情報知能工学講究第二

2018.4 - 2018.9 First semester
情報知能工学講究第三

2017.10 - 2018.3 Second semester
コンピュータアーキテクチャⅡ

2017.10 - 2018.3 Second semester
コンピュータアーキテクチャⅡ

2017.10 - 2018.3 Second semester
情報知能工学演習第一

2017.10 - 2018.3 Second semester
情報知能工学演習第三

2017.10 - 2018.3 Second semester
情報知能工学講究第一

2017.10 - 2018.3 Second semester
コンピュータアーキテクチャⅠ

2017.6 - 2017.8 Summer quarter
Advanced Research in Computer Systems and Applications

2017.4 - 2018.3 Full year
国際演示技法

2017.4 - 2018.3 Full year
知的財産技法

2017.4 - 2018.3 Full year
ティーチング演習

2017.4 - 2018.3 Full year
先端プロジェクト管理技法

2017.4 - 2018.3 Full year
Overseas Internship

2017.4 - 2018.3 Full year
Scientific English Presentation

2017.4 - 2018.3 Full year
Intellectual Property Management

2017.4 - 2018.3 Full year
Exercise in Teaching

2017.4 - 2018.3 Full year
Advanced Project Management Technique

2017.4 - 2018.3 Full year
情報知能工学特別講究第一

2017.4 - 2018.3 Full year
情報知能工学特別講究第二

2017.4 - 2018.3 Full year
Advanced Research in Advanced Information Technology I

2017.4 - 2018.3 Full year
Advanced Research in Advanced Information Technology II

2017.4 - 2018.3 Full year
知的情報システム工学特別演習

2017.4 - 2018.3 Full year
社会情報システム工学特別演習

2017.4 - 2018.3 Full year
Adv Semi in Intelligent Information Systems Engineering

2017.4 - 2018.3 Full year
Advanced Seminar in Social Information Systems Engineering

2017.4 - 2018.3 Full year
計算機構特別講究

2017.4 - 2018.3 Full year
コンピュータ・アーキテクチャⅠ

2017.4 - 2017.9 First semester
プログラミング演習

2017.4 - 2017.9 First semester
コンピュータアーキテクチャ特論

2017.4 - 2017.9 First semester
コンピュータシステム・アーキテクチャ特論

2017.4 - 2017.9 First semester
情報知能工学演習第二

2017.4 - 2017.9 First semester
情報知能工学講究第二

2017.4 - 2017.9 First semester
コンピュータシステム・アーキテクチャ特論

2017.4 - 2017.9 First semester
コンピュータ・アーキテクチャⅡ

2017.4 - 2017.9 First semester
コンピュータ・アーキテクチャⅠ

2016.4 - 2016.9 First semester
コンピュータ・アーキテクチャⅡ

2016.4 - 2016.9 First semester
コンピュータシステム・アーキテクチャ特論

2016.4 - 2016.9 First semester
ハードウェア設計論特論

2015.10 - 2016.3 Second semester
コンピュータ・アーキテクチャⅠ

2015.4 - 2015.9 First semester
コンピュータシステム・アーキテクチャ特論

2015.4 - 2015.9 First semester
回路理論Ⅰ

2015.4 - 2015.9 First semester
ハードウェア設計論特論

2014.10 - 2015.3 Second semester
コンピュータアーキテクチャ特論

2014.4 - 2014.9 First semester
コンピュータ・アーキテクチャⅠ

2014.4 - 2014.9 First semester
情報処理演習I

2013.10 - 2014.3 Second semester
コンピュータアーキテクチャ特論

2013.4 - 2013.9 First semester
コンピュータ・アーキテクチャⅠ

2013.4 - 2013.9 First semester
コンピュータ・アーキテクチャⅠ

2012.4 - 2012.9 First semester
コンピュータアーキテクチャ特論

2012.4 - 2012.9 First semester
コンピュータアーキテクチャ特論

2011.4 - 2011.9 First semester
コンピュータ・アーキテクチャⅠ

2011.4 - 2011.9 First semester
コンピュータアーキテクチャ特論

2010.4 - 2010.9 First semester
コンピュータ・アーキテクチャⅠ

2010.4 - 2010.9 First semester
コンピュータ・アーキテクチャⅠ

2009.4 - 2009.9 First semester
コンピュータアーキテクチャ特論

2009.4 - 2009.9 First semester
計算機構成論Ⅰ

2008.10 - 2009.3 Second semester
システム・アーキテクチャ特論

2008.10 - 2009.3 Second semester
情報論理学

2008.10 - 2009.3 Second semester
コンピュータ・アーキテクチャⅠ

2008.4 - 2008.9 First semester
システム・アーキテクチャ特論

2007.10 - 2008.3 Second semester
情報論理学

2007.10 - 2008.3 Second semester
コンピュータ・アーキテクチャⅠ

2007.4 - 2007.9 First semester
システムアーキテクチャ特論

2006.10 - 2007.3 Second semester
システム・アーキテクチャ特論

2005.10 - 2006.3 Second semester
情報科学講究

2005.10 - 2006.3 Second semester
情報論理学

2005.10 - 2006.3 Second semester
情報理学演習第一

2005.4 - 2006.3 Full year
情報科学特別研究

2005.4 - 2006.3 Full year
基礎情報学特別演習

2005.4 - 2006.3 Full year
基礎情報学特別講究

2005.4 - 2006.3 Full year
情報理学特別演習第一

2005.4 - 2006.3 Full year
情報理学特別講究第一

2005.4 - 2006.3 Full year
情報理学特別研究

2005.4 - 2006.3 Full year
情報理学講究第二

2005.4 - 2006.3 Full year
情報理学講究第一

2005.4 - 2006.3 Full year

FD Participation

2024.5 Title: Recent Trends in KAKENHI (Grants-in-Aid for Scientific Research)

Visiting, concurrent, or part-time lecturers at other universities, institutions, etc.

2023 National Institute of Informatics Classification: Affiliate faculty Domestic/International Classification: Japan
2022 National Institute of Informatics Classification: Affiliate faculty Domestic/International Classification: Japan
2021 National Institute of Informatics Classification: Affiliate faculty Domestic/International Classification: Japan
2020 National Institute of Informatics Classification: Affiliate faculty Domestic/International Classification: Japan

2013 The University of Kitakyushu Classification: Part-time lecturer Domestic/International Classification: Overseas
Semester, Day Time or Duration: First semester

2012 The University of Kitakyushu Classification: Part-time lecturer Domestic/International Classification: Overseas
Semester, Day Time or Duration: First semester

2011 The University of Kitakyushu Classification: Part-time lecturer Domestic/International Classification: Japan
Semester, Day Time or Duration: First semester

2010 The University of Kitakyushu Classification: Part-time lecturer Domestic/International Classification: Japan
Semester, Day Time or Duration: First semester, every other week

2009 The University of Kitakyushu Classification: Part-time lecturer Domestic/International Classification: Japan
Semester, Day Time or Duration: First semester, intensive course

2007 The University of Kitakyushu Classification: Part-time lecturer Domestic/International Classification: Japan
Semester, Day Time or Duration: First semester, intensive course

2007 Fukuoka University Classification: Part-time lecturer Domestic/International Classification: Japan
Semester, Day Time or Duration: Second semester, 4th period

2006 The University of Kitakyushu Classification: Part-time lecturer Domestic/International Classification: Japan
Semester, Day Time or Duration: First semester, intensive course

2006 Fukuoka University Classification: Part-time lecturer Domestic/International Classification: Japan
Semester, Day Time or Duration: Second semester

2005 Fukuoka University Classification: Part-time lecturer Domestic/International Classification: Japan
Semester, Day Time or Duration: Second semester, Tuesdays, 4th period

Participation in international educational events, etc.

UPWARDS

Other educational activity and Special note

2023 Class Teacher (undergraduate)
2022 Class Teacher (undergraduate)
2021 Class Teacher (undergraduate)
2020 Class Teacher (undergraduate)
2013 Class Teacher (undergraduate)
2012 Class Teacher (undergraduate)
2011 Class Teacher (undergraduate)

Outline of Social Contribution and International Cooperation activities

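Activities include research briefings for companies and service as an officer of international conferences. Joint research with Seoul National University is also under way.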

Social Activities

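At the "Summer Science Class" for junior high school students, runs a hands-on experiment course on understanding how a computer works, using robots as the subject matter.
Faculty of Information Science and Electrical Engineering, Kyushu University, Ito Campus 2009.8
Audience: General, Scientific, Company, Civic organization, Governmental agency
Type: Seminar, workshop

At the "Summer Science Class" for junior high school students, runs a hands-on experiment course on understanding how a computer works, using robots as the subject matter.
2009.8
Audience: Infants, Schoolchildren, Junior students, High school students
Type: Seminar, workshop

At the "Summer Science Class" for junior high school students, runs a hands-on experiment course on understanding how a computer works, using robots as the subject matter.
Faculty of Information Science and Electrical Engineering, Kyushu University, Ito Campus 2009.8
Type: Seminar, workshop

At the "Summer Science Class" for junior high school students, runs a hands-on experiment course on understanding how a computer works, using robots as the subject matter.
2009.8
Type: Science cafe

At the "Summer Science Class" for junior high school students, runs a hands-on experiment course on understanding how a computer works, using robots as the subject matter.
Faculty of Information Science and Electrical Engineering, Kyushu University, Ito Campus 2008.8
Audience: General, Scientific, Company, Civic organization, Governmental agency
Type: Seminar, workshop

At the "Summer Science Class" for junior high school students, runs a hands-on experiment course on understanding how a computer works, using robots as the subject matter.
2008.8
Audience: Infants, Schoolchildren, Junior students, High school students
Type: Seminar, workshop

At the "Summer Science Class" for junior high school students, runs a hands-on experiment course on understanding how a computer works, using robots as the subject matter.
Faculty of Information Science and Electrical Engineering, Kyushu University, Ito Campus 2008.8
Type: Seminar, workshop

At the "Summer Science Class" for junior high school students, runs a hands-on experiment course on understanding how a computer works, using robots as the subject matter.
2008.8
Type: Science cafe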
