Researcher Details - Koji Inoue

イノウエ　コウジ

井上弘士

KOJI INOUE

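Affiliation

Professor, Department of Advanced Information Technology, Faculty of Information Science and Electrical Engineering
Center for Japan-Egypt Cooperation in Science and Technology (concurrent)
Research Institute for Information Technology (concurrent)
System LSI Research Center (concurrent)
Joint Graduate School of Mathematics for Innovation (concurrent)
Department of Physics, School of Science (concurrent)
Department of Electrical Engineering and Computer Science, School of Engineering (concurrent)
Department of Information Science and Technology, Graduate School of Information Science and Electrical Engineering (concurrent)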

Contact

Phone number

0928023793

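Profile

To realize a safe and stable information society, it is extremely important not only to raise the performance and lower the power consumption of computer systems but also to improve their security and reliability. We therefore conduct research on "computer system architectures that can adjust the balance among performance, energy consumption, security, and reliability according to the usage environment and usage conditions" as a foundational technology for the next-generation information society. In addition to simulation in virtual environments, we also design actual VLSI chips.

Homepage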

https://sites.google.com/view/kojiinoue-jp/

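Research field

Manufacturing technology (mechanical, electrical/electronic, and chemical engineering) / Control and systems engineering

Degree

Doctor of Engineering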

Career

Yokogawa Electric Corporation (April 1996 to December 1996)
Fukuoka University (April 2001 to August 2004)

Education

Kyushu Institute of Technology, Graduate School, Division of Information Engineering (Information Science)

- 1996

Country: Japan

Kyushu Institute of Technology, Faculty of Computer Science and Systems Engineering, Department of Artificial Intelligence

- 1994

Country: Japan

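Research themes and keywords

Research theme: Research on next-generation computer system architecture

Research keywords: processor/memory architecture; high-performance, power-aware, secure, and highly reliable computing; emerging-device computing; superconducting computing; quantum computing

Research period: September 2004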

Awards

Design Contest Award Honorable Mention

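August 2017, IEEE The 23rd International Symposium on Low Power Electronics and Design (ISLPED)

1.6-mW, 56-GHz Arithmetic Logic Unit Based on Superconductor Single-Flux-Quantum Logic Circuit

Best Paper Award, 2011 Symposium on High Performance Computing and Computational Science

January 2011

The Young Scientists' Prize, FY2008 Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology

April 2008, Ministry of Education, Culture, Sports, Science and Technology (MEXT)

Encouragement Award, 15th Karuizawa Workshop on Circuits and Systems

January 2003

Young Researcher Encouragement Award

Challenge Award, 4th LSI IP Design Award

January 2002

Information Processing Society of Japan (IPSJ) 40th Anniversary Paper Award

January 2001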

Papers

SuperCore: An Ultra-Fast Superconducting Processor for Cryogenic Applications (peer-reviewed, international journal)

Choi J., Byun I., Hong J., Min D., Kim J., Cho J., Jeong H., Tanaka M., Inoue K., Kim J.

Proceedings of the Annual International Symposium on Microarchitecture Micro 1532 - 1547 November 2024 ( ISSN:10724451 ISBN:9798350350579 )

Language: English; Publication type: Research paper (other academic conference materials, etc.); Publisher: Proceedings of the Annual International Symposium on Microarchitecture Micro

Superconductor single-flux-quantum (SFQ) logic family has been recognized as a promising technology for cryogenic applications (e.g., quantum computing, astronomy, metrology) thanks to its ultra-fast and low-energy characteristics. Therefore, recent efforts in SFQ-based computing have focused on developing fast and low-power SFQ processors for cryogenic applications. However, there still has been little progress toward a convincing SFQ processor design due to the critical performance challenges originating from its extremely deep pipeline. In this paper, we propose a super-fast and low-power in-order SFQ processor by tackling the challenges from the deep pipeline. First, we develop a minimal-depth SFQ processor pipeline with novel architecture-level ideas. Next, we conduct in-depth performance analyses and identify three real performance bottlenecks in the deeply pipelined SFQ processors (i.e., stall/flush logic, RAW stall, fetch unit). Finally, we propose SuperCore, our super-fast SFQ-based processor architecture, with three SFQ-friendly solutions that effectively resolve the identified bottlenecks. With our solutions applied, SuperCore achieves 11 times speed-up over the SFQ processor baseline. In addition, SuperCore achieves six times speed-up and consumes up to 193 times less power compared to in-order CMOS processors running at 4K.

DOI： 10.1109/MICRO61859.2024.00112

QIsim: Architecting 10+K Qubit QC Interfaces Toward Quantum Supremacy (peer-reviewed)

Min D., Byun I., Kim J., Tanaka M., Kim J., Choi J., Inoue K.

Proceedings - International Symposium on Computer Architecture 1 - 16 June 2023 ( ISSN:10636897 ISBN:9798400700958 )

Language: Other; Publication type: Research paper (other academic conference materials, etc.); Publisher: Proceedings - International Symposium on Computer Architecture

Exploration and proposal of interface architectures between qubits and classical processing in quantum computers.

DOI： 10.1145/3579371.3589036

Q3DE: A fault-tolerant quantum computer architecture for multi-bit burst errors by cosmic rays (peer-reviewed)

Yasunari Suzuki, Takanori Sugiyama, Tomochika Arai, Wang Liao, Koji Inoue, Teruo Tanimoto

2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO) 2022-October 1110 - 1125 October 2022 ( ISSN:10724451 ISBN:9781665462723 )

Language: Other; Publication type: Research paper (other academic conference materials, etc.); Publisher: IEEE

Analyzes the impact of cosmic rays on the error tolerance of qubits and proposes an error-correction algorithm and architecture that address this problem.

DOI： 10.1109/MICRO56248.2022.00079

Other link: https://dblp.uni-trier.de/db/conf/micro/micro2022.html#SuzukiSALIT22

XQsim: modeling cross-technology control processors for 10+K qubit quantum computers.

Ilkwon Byun, Junpyo Kim, Dongmoon Min, Ikki Nagaoka, Kosuke Fukumitsu, Iori Ishikawa, Teruo Tanimoto, Masamitsu Tanaka, Koji Inoue, Jangwoo Kim

Proceedings of the 49th Annual International Symposium on Computer Architecture 366 - 382 June 2022 ( ISSN:10636897 ISBN:9781450386104 )

Language: Other; Publication type: Research paper (other academic conference materials, etc.); Publisher: ACM

A proposal for exploring and improving quantum error correction architectures.

DOI： 10.1145/3470496.3527417

Other link: https://dblp.uni-trier.de/db/conf/isca/isca2022.html#ByunKMNFITTIK22

Superconductor Computing for Neural Networks.

Koki Ishida, Ilkwon Byun, Ikki Nagaoka, Kosuke Fukumitsu, Masamitsu Tanaka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Jangwoo Kim, Koji Inoue

IEEE Micro 41 ( 3 ) 19 - 26 May 2021

Language: Other; Publication type: Research paper (academic journal)

Proposal of an AI accelerator architecture using superconducting single-flux-quantum circuits.

DOI： 10.1109/MM.2021.3070488

SuperNPU: An Extremely Fast Neural Processing Unit Using Superconducting Logic Devices.

Koki Ishida, Ilkwon Byun, Ikki Nagaoka, Kosuke Fukumitsu, Masamitsu Tanaka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Jangwoo Kim, Koji Inoue

53rd Annual IEEE/ACM International Symposium on Microarchitecture(MICRO) 58 - 72 October 2020

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

Proposal of an AI accelerator architecture using superconducting single-flux-quantum circuits.

DOI： 10.1109/MICRO50266.2020.00018

Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing.

Yuichi Inadomi, Tapasya Patki, Koji Inoue, Mutsumi Aoyagi, Barry Rountree, Martin Schulz 0001, David K. Lowenthal, Yasutaka Wada, Keiichiro Fukazawa, Masatsugu Ueda, Masaaki Kondo, Ikuo Miyoshi

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis(SC) 78 - 12 November 2015

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1145/2807591.2807638

Performance prediction of large-scale parallell system and application using macro-level simulation.

Ryutaro Susukita, Hisashige Ando, Mutsumi Aoyagi, Hiroaki Honda, Yuichi Inadomi, Koji Inoue, Shigeru Ishizuki, Yasunori Kimura, Hidemi Komatsu, Motoyoshi Kurokawa, Kazuaki J. Murakami, Hidetomo Shibamura, Shuji Yamamura, Yunqing Yu

Proceedings of the ACM/IEEE Conference on High Performance Computing(SC) 20 - 20 November 2008

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/SC.2008.5220091

Design and Implementation of Opto-Electrical Hybrid Floating-Point Multipliers

Inaba T., Ono T., Inoue K., Kawakami S.

IEICE Transactions on Information and Systems E108.D ( 1 ) 2 - 11 January 2025 ( ISSN:09168532 )

Publisher: IEICE Transactions on Information and Systems

The performance improvement by CMOS circuit technology is reaching its limits. Many researchers have been studying computing technologies that use emerging devices to challenge such critical issues. Nanophotonic technology is a promising candidate for tackling the issue due to its ultra-low latency, high bandwidth, and low power characteristics. Although previous research has developed hardware accelerators that exploit nanophotonic circuits for AI inference applications, the acceleration of training, which requires complex floating-point (FP) operations, has not been considered. In particular, the design balance between optical and electrical circuits has a critical impact on the latency, energy, and accuracy of the arithmetic system, and thus requires careful consideration of the optimal design. In this study, we design three types of Opto-Electrical Floating-point Multipliers (OEFMs): accuracy-oriented (Ao-OEFM), latency-oriented (Lo-OEFM), and energy-oriented (Eo-OEFM). Based on our evaluation, we confirm that Ao-OEFM has high noise resistance, and Lo-OEFM and Eo-OEFM still have sufficient calculation accuracy. Compared to conventional electrical circuits, Lo-OEFM achieves an 87% reduction in latency, and Eo-OEFM reduces energy consumption by 42%.

DOI： 10.1587/transinf.2024PAP0003

Approximate SFQ-based Computing Architecture Modeling with Device-level Guidelines

Mundhe P., Hano Y., Kawakami S., Tanimoto T., Tanaka M., Inoue K., Byun I.

IEEE Computer Architecture Letters 24 ( 2 ) 253 - 256 2025 ( ISSN:15566056 )

Publisher: IEEE Computer Architecture Letters

Single-flux-quantum (SFQ) logic has emerged as a promising post-Moore technology thanks to its ultra-fast and low-energy operation. However, despite progress in various fields, its feasibility is questionable due to the prohibitive cooling cost. Proven conventional ideas, such as approximate computing, may help to resolve this challenge. However, introducing such ideas has been impossible due to the complex performance, power, and error trade-offs originating from the unique SFQ device characteristics. This work introduces approximate SFQ-based computing (AxSFQ) with an architecture modeling framework and essential design guidelines. Our optimized device-level AxSFQ showcases a 30 to 100 times energy efficiency improvement, which motivates further circuit and architecture-level exploration.

DOI： 10.1109/LCA.2025.3573740

C3-VQA: Cryogenic Counter-Based Coprocessor for Variational Quantum Algorithms

Ueno Y., Imamura S., Tomida Y., Tanimoto T., Tanaka M., Tabuchi Y., Inoue K., Nakamura H.

IEEE Transactions on Quantum Engineering 6 1 - 17 2025 ( eISSN:2689-1808 )

Publication type: Research paper (academic journal); Publisher: IEEE Transactions on Quantum Engineering

Cryogenic quantum computers play a leading role in demonstrating quantum advantage. Given the severe constraints on the cooling capacity in cryogenic environments, thermal design is crucial for the scalability of these computers. The sources of heat dissipation include passive inflow via intertemperature wires and the power consumption of components located in the cryostat, such as wire amplifiers and quantum-classical interfaces. Thus, a critical challenge is to reduce the number of wires by reducing the required intertemperature bandwidth while maintaining minimal additional power consumption in the cryostat. One solution to address this challenge is near-data processing using ultralow-power computational logic within the cryostat. Based on the workload analysis and domain-specific system design focused on variational quantum algorithms (VQAs), we propose the cryogenic counter-based coprocessor for VQAs (C3-VQA) to enhance the design scalability of cryogenic quantum computers under the thermal constraint. The C3-VQA utilizes single-flux-quantum logic, which is an ultralow-power superconducting digital circuit that operates at the 4 K environment. The C3-VQA precomputes a part of the expectation value calculations for VQAs and buffers intermediate values using simple bit operation units and counters in the cryostat, thereby reducing the required intertemperature bandwidth with small additional power consumption. Consequently, the C3-VQA reduces the number of wires, leading to a reduction in the total heat dissipation in the cryostat. Our evaluation shows that the C3-VQA reduces the total heat dissipation at the 4 K stage by 30% and 81% under sequential-shot and parallel-shot execution scenarios, respectively. Furthermore, a case study in quantum chemistry shows that the C3-VQA reduces total heat dissipation by 87% with a 10 000-qubit system.

DOI： 10.1109/tqe.2024.3521442

Data-Pattern-Driven LUT for Efficient In-Cache Computing in CNNs Acceleration

Fei Z., Lyu M., Kawakami S., Inoue K.

IEEE Computer Architecture Letters 24 ( 1 ) 81 - 84 2025 ( ISSN:15566056 )

Publisher: IEEE Computer Architecture Letters

The lookup table (LUT)-based Processing-in-Memory (PIM) solutions perform computations by looking up precomputed results stored in LUTs, providing exceptional efficiency for complex operations such as multiplication, making them highly suitable for energy- and latency-efficient Convolutional Neural Network (CNN) inference tasks. However, including all possible results in the LUT naively demands exponential hardware resources, significantly limiting parallelism and increasing hardware area, latency, and power overhead. While decomposition and compression techniques can reduce the LUT size, they also introduce considerable memory access overhead and additional operations. To address these challenges, we conduct an extensive analysis to identify which data portions significantly impact accuracy in CNNs. Based on the insight that key data is concentrated in a small range, we propose a data-pattern-driven (DPD) optimization strategy, which approximates less critical data to drastically reduce LUT size while preserving computational efficiency with acceptable accuracy loss.

DOI： 10.1109/LCA.2025.3548080

Exploring Volatile FPGAs Potential for Accelerating Energy-Harvesting IoT Applications

Babai A.M.A., Inoue K.

IEEE Computer Architecture Letters 24 ( 1 ) 137 - 140 2025 ( ISSN:15566056 )

Publisher: IEEE Computer Architecture Letters

Low-power volatile FPGAs (VFPGAs) naturally meet the intertwined processing and flexibility demands of IoT devices. However, as IoT devices shift toward Energy Harvesting (EH) for self-sustained operation, VFPGAs are overlooked because they struggle under harvested power. Their volatile SRAM configuration memory cells frequently lose their data, causing high reconfiguration penalties. These penalties grow with FPGAs’ resource usage, limiting it under EH. Still, advances in low-power FPGAs and energy-buffering systems’ efficiency motivate us to explore EH-powered FPGAs. Thus, we analyze the interplay of their resources, performance, and reconfiguration; simulate their operation under different EH conditions; and show how they can be utilized up to an application- and EH-dependent threshold.

DOI： 10.1109/LCA.2025.3563105

SPDID: A Secure and Privacy-Preserving Decentralized Identity utilizing Blockchain and PUF (peer-reviewed, international journal)

He Y., Fan W., Inoue K.

Proceedings of the IEEE International Conference on Trust Security and Privacy in Computing and Communications Trustcom ( 2024 ) 1622 - 1623 December 2024 ( ISSN:2324898X ISBN:9798331506209 )

Language: English; Publication type: Research paper (other academic conference materials, etc.); Publisher: Proceedings of the IEEE International Conference on Trust Security and Privacy in Computing and Communications Trustcom

For internet users, using digital identities has become increasingly essential. Traditional systems often rely on centralized control, which no longer meets privacy demands. Decentralized identity systems offer a more transparent and resilient solution by giving users control over their identity data through distributed ledgers, decentralized identifiers, and verifiable credentials. However, existing schemes are often based on assumptions such as the presence of a reliable anonymous credential issuer, making them challenging to implement in real-world scenarios. Furthermore, they overlook handling user credentials securely and struggle with ensuring user authentication while providing privacy. These systems are unable to provide public audits of issued credentials without jeopardizing sensitive data. This paper introduces SPDID, a novel decentralized identity system based on blockchain technology. First, it transforms legacy documents into anonymous credentials without interaction or alteration using zero-knowledge proofs and Pedersen commitments. SPDID also employs the physical unclonable function (PUF) to design a secure key management system resilient to physical attacks. Additionally, it suggests a user authentication method that is unlinkable and resistant to Sybil attacks without the need for new trusted third parties. Furthermore, SPDID uses Merkle trees to construct a credential issuance list that is publicly auditable on the blockchain. The system's security and practical performance are demonstrated through security analysis and a prototype implementation on Hyperledger Fabric.

DOI： 10.1109/TrustCom63139.2024.00223

Multithreaded Edge-Assisted Visual SLAM with Keyframe Backup Mechanism (peer-reviewed, international journal)

Xia C., Wang Y., Inoue K.

Proceedings 2024 12th International Symposium on Computing and Networking Candar 2024 259 - 265 November 2024 ( ISBN:9798331528362 )

Language: English; Publication type: Research paper (other academic conference materials, etc.); Publisher: Proceedings 2024 12th International Symposium on Computing and Networking Candar 2024

Simultaneous Localization and Mapping (SLAM) has rapidly advanced on mobile devices, particularly camera-based Visual SLAM, which enables spatial perception and autonomous positioning by processing continuous image data. Due to its high memory and processing demands, it is challenging to deploy and execute continuously for a long time on mobile devices. The edge-assisted architecture that mitigates resource constraints by offloading heavy tasks to an edge server becomes optimal for settling the problem. However, existing studies suffer from high data synchronization delay, which is handled in the tracking module, resulting in prolonged tracking interruption and poor system robustness and accuracy. Based on a typical edge-assisted Visual SLAM system, we analyze the impact of the data synchronization process and propose a new multithreaded tracking solution with a keyframe backup mechanism. Verifying through two standard datasets, we evaluate our system's robustness and localization accuracy. The results show that the proposed system reduces tracking interruption by up to 88.1% and significantly improves the coverage, a critical robustness metric of the SLAM system, by up to 25.5%. Additionally, our proposed solution significantly improves localization accuracy, especially in rotation scenarios, by up to 26.7%.

DOI： 10.1109/CANDAR64496.2024.00041

Performance evaluation of all intra Kvazaar and x265 HEVC encoders on embedded system Nvidia Jetson platform

James R., Abo-Zahhad M., Inoue K., Sayed M.S.

Journal of Real-Time Image Processing 21 ( 3 ) May 2024 ( ISSN:18618200 )

Publisher: Journal of Real-Time Image Processing

The growing demand for high-quality video requires complex coding techniques that consume more resources and increase encoding time, which is a challenge for real-time processing on embedded systems. Kvazaar and x265 are two efficient implementations of the High Efficiency Video Coding (HEVC) standard. In this paper, the performance of the All Intra Kvazaar and x265 encoders on the Nvidia Jetson platform was evaluated using two coding configurations: a high-speed preset and a high-quality preset. We used two scenarios. In the first, the two encoders were run on the CPU, and based on the average encoding time Kvazaar proved to be 65.44% and 69.4% faster than x265 with 1.88% and 0.6% BD-rate improvement over x265 at the high-speed and high-quality presets, respectively. In the second scenario, the two encoders were run on the GPU of the Nvidia Jetson, and the results show that the average encoding time under each preset is reduced by half relative to the CPU-based scenario. In addition, Kvazaar is 54.5% and 56.70% faster with 1.93% and 0.45% BD-rate improvement over x265 at the high-speed and high-quality presets, respectively. Regarding scalability, the two encoders on the CPU scale linearly up to four threads, and speed remains constant afterward. On the GPU, the two encoders scale linearly with the number of threads. The obtained results confirmed that Kvazaar is more efficient and that it can be used on embedded systems for real-time video applications due to its high speed and performance over the x265 HEVC encoder.

DOI： 10.1007/s11554-024-01429-5

TinyEmergencyNet: a hardware-friendly ultra-lightweight deep learning model for aerial scene image classification

Mogaka O.M., Zewail R., Inoue K., Sayed M.S.

Journal of Real-Time Image Processing 21 ( 2 ) April 2024 ( ISSN:18618200 )

Publisher: Journal of Real-Time Image Processing

In the context of emergency response applications, real-time situational awareness is vital. Unmanned aerial vehicles (UAVs) with imagers have emerged as crucial tools for providing timely information in such scenarios. Convolutional neural networks (CNN) are effective in image processing. However, the deployment of CNN models in UAVs faces significant challenges. The CNN models involve large number of parameters and energy-costly floating-point computations beyond the memory and power available on-board the UAVs. To address these challenges, we propose a co-design optimization approach for deploying the EmergencyNet CNN model on resource-constrained UAVs. Our strategy includes channel-wise pruning to reduce the size and optimize the network architecture. Additionally, we apply additive powers-of-two (APoT) quantization to further compress the model and enhance computational efficiency. Using channel-wise network pruning we derive TinyEmergencyNet that is only 155KB in memory size and 50% smaller than EmergencyNet. This proposed approach is evaluated on Aerial Image Disaster Event Recognition (AIDER) dataset. We have achieved an F1-score of 93.6% with 4-bit APoT quantization that closely approaches the full precision (32-bit) accuracy of 94%. Furthermore, hardware-friendly bit-shifting operations as a result of APoT quantization present an added advantage in hardware accelerator implementations. This work pioneers the joint application of channel-wise pruning and non-uniform APoT quantization on EmergencyNet, presenting a suitable solution tailored for UAV-based emergency response applications.

DOI： 10.1007/s11554-024-01430-y

CFChain: A Crowdfunding Platform that Supports Identity Authentication, Privacy Protection, and Efficient Audit

He Y., Chen J., Inoue K.

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 14493 LNCS 146 - 167 March 2024 ( ISSN:0302-9743 ISBN:9789819708611 eISSN:1611-3349 )

Language: Other; Publication type: Research paper (other academic conference materials, etc.); Publisher: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Charity crowdfunding is a technique for raising funds that involves collecting modest contributions from a vast number of individuals or groups via established crowdfunding platforms or other digital avenues. The objective is to provide support for charitable organizations, social welfare initiatives, or personal requirements. The widespread adoption of the Internet and the rapid advancement of digital technology have facilitated the global dissemination and promotion of charity crowdfunding. However, crowdfunding platforms have recently experienced a decline in credibility due to various factors such as fraudulent donations, inadequate fund management, and other forms of disorder. The blockchain’s decentralization and anti-tampering features exhibit a high degree of compatibility with the requirements of a crowdfunding platform. Most current state-of-the-art techniques do not ensure the non-linkability of user identities in the face of sybil attacks, nor do they offer a streamlined auditing mechanism for crowdsourcing modest donations that simultaneously preserves transactional privacy. This paper presents a novel crowdfunding system called CFChain based on blockchain technology. Initially, the distributed identity and BLS signature are employed to establish a user authentication mechanism, enabling CFChain to withstand sybil attacks while preserving the non-linkability of user identities. Subsequently, a crowdfunding mechanism is constructed utilizing zero-knowledge proofs to facilitate streamlined auditing procedures while safeguarding donations’ confidentiality. Additionally, a security analysis of CFChain is presented. The system prototype is subsequently implemented on the Hyperledger Fabric. Empirical evidence indicates that the efficiency of CFChain is viable.

DOI： 10.1007/978-981-97-0862-8_10

Inter-Temperature Bandwidth Reduction in Cryogenic QAOA Machines

Ueno Y., Tomida Y., Tanimoto T., Tanaka M., Tabuchi Y., Inoue K., Nakamura H.

IEEE Computer Architecture Letters 23 ( 1 ) 1 - 4 January 2024 ( ISSN:1556-6056 eISSN:1556-6064 )

Language: Other; Publication type: Research paper (academic journal); Publisher: IEEE Computer Architecture Letters

The bandwidth limit between cryogenic and room-temperature environments is a critical bottleneck in superconducting noisy intermediate-scale quantum computers. This paper presents the first trial of algorithm-aware system-level optimization to solve this issue by targeting the quantum approximate optimization algorithm. Our counter-based cryogenic architecture using single-flux quantum logic shows exponential bandwidth reduction and decreases heat inflow and peripheral power consumption of inter-temperature cables, which contributes to the scalability of superconducting quantum computers.

DOI： 10.1109/lca.2023.3322700

Late Breaking Results: Single Flux Quantum Based Brownian Circuits for Ultra-Low-Power Computing

Kawakami S., Ohtusbo Y., Inoue K., Tanaka M.

Proceedings -Design, Automation and Test in Europe, DATE 2024 ( ISSN:15301591 ISBN:9798350348590 )

Publisher: Proceedings -Design, Automation and Test in Europe, DATE

This paper proposes a random walk circuit implementation with single flux quantum devices, essential for Brownian circuits, to reduce processing energy consumption dramatically. SPICE-based simulation demonstrates its functional operation, and the Shapiro-Wilk test is used to check that random walks are achieved. Furthermore, we developed a Monte Carlo simulator for Brownian circuits, enabling functionality verification and computation step distribution analysis. Latency/energy evaluation using a half-adder as a case study revealed that proposed circuits could reduce energy consumption by 1/1260 and offer an opportunity for low-power computing systems.

CrowdChain: A privacy-preserving crowdfunding system based on blockchain and PUF

He Y., Inoue K.

Peer-to-Peer Networking and Applications 17 ( 6 ) 3669 - 3687 2024 ( ISSN:19366442 )

Publisher: Peer-to-Peer Networking and Applications

Crowdfunding refers to the online collection of certain capital from a vast number of individuals or groups that each contribute a relatively small amount. Recently, the credibility of crowdfunding platforms has been undermined by fraudulent projects, inadequate fund management, and other forms of disorder. The decentralization and anti-tampering features of blockchain provide the possibility to solve the above problems, and many studies have proposed blockchain-based crowdfunding schemes. However, the existing state-of-the-art methods do not provide user authentication, transaction auditing, and identity management in a privacy-preserving way. Accordingly, this paper presents a novel blockchain-based crowdfunding system called CrowdChain. Initially, the distributed identity and BLS signature are employed to establish a user authentication mechanism, enabling CrowdChain to withstand Sybil attacks while preserving the non-linkability of user identities. Secondly, the physically unclonable function (PUF) is used to generate keys associated with digital identities that are not stored in external devices to resist physical attacks. Subsequently, a crowdfunding mechanism is constructed utilizing zero-knowledge proofs to facilitate streamlined auditing procedures while safeguarding the confidentiality of transactions. Additionally, the formal security analysis proves the security of the CrowdChain scheme. The system prototype is implemented on the Hyperledger Fabric. Empirical evidence indicates the viable efficiency of CrowdChain.

DOI： 10.1007/s12083-024-01785-w

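2A20 Semiconductor Human Resource Development Education for Junior High and High School Students

井上弘士, 山田順治, 木本香苗, 駒澤聡亮

Proceedings of the Annual Conference of the Japanese Society for Engineering Education 2024 ( 0 ) 126 - 127 2024 ( ISSN:21898928 eISSN:24241458 )

Language: Japanese; Publisher: Japanese Society for Engineering Education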

DOI： 10.20549/jseeja.2024.0_126

Empirical Power-performance Analysis of Layer-wise CNN Inference on Single Board Computers

Ng K.Y., Babai A.M.A., Tanimoto T., Kawakami S., Inoue K.

Journal of Information Processing 31 478 - 494 July 2023 ( eISSN:1882-6652 )

Language: Other; Publication type: Research paper (academic journal); Publisher: Journal of Information Processing

This paper analyzes the impact of input sparsity and DFS/DVFS configurations for single-board computers on the execution time, power, and energy of each VGG16 layer as the first step towards efficient CNN inference on single-board computers. For this purpose, we first develop a power and execution time measurement environment and perform experiments using Raspberry Pi 4 and NVIDIA Jetson Nano. Our results show that clock frequency strongly correlates with execution time and power. Inversely, input sparsity has a weak correlation with execution time and power. Then, we show that a coarse-grained DVFS model can explain over 96% of the variations in the power of each VGG16 layer even when sets of clock frequency and voltage on the single-board computer are unavailable.

DOI： 10.2197/ipsjjip.31.478

50-GFLOPS Floating-Point Adder and Multiplier Using Gate-Level-Pipelined Single-Flux-Quantum Logic with Frequency-Increased Clock Distribution (peer-reviewed)

Nagaoka I., Kashima R., Tanaka M., Kawakami S., Tanimoto T., Yamashita T., Inoue K., Fujimaki A.

IEEE Transactions on Applied Superconductivity 33 ( 4 ) 1 - 11 June 2023 ( ISSN:1051-8223 eISSN:1558-2515 )

Language: Other; Publication type: Research paper (academic journal); Publisher: IEEE Transactions on Applied Superconductivity

We demonstrate the functioning of a high-throughput, gate-level-pipelined floating-point adder and multiplier over 50 GHz. The gate-level-pipelined floating-point adder and multiplier requires dedicated circuit blocks to wait until other circuit blocks complete calculations because of the dependence between their sign, exponent, and significand parts. We revealed that the resultant delay difference of the waiting circuit blocks hinders high-frequency operation if the predesigned circuit blocks with the fixed clock distribution are connected in a simple manner. We showed that clock distribution needs to synchronize with every pipeline stage regardless of the circuit blocks to minimize the delay difference between the circuit blocks for circuits containing the waiting circuit blocks (e.g., the floating-point adder and multiplier). We designed a 5-bit floating-point adder and multiplier to demonstrate the effectiveness of the clock distribution experimentally. The test chips were fabricated using AIST 10-kA/cm² Advanced Process 2. We verified the high-speed operation at over 50 GHz in the floating-point adder and multiplier. The maximum clock frequency and throughput of the floating-point adder were 56 GHz and 56 GFLOPS, respectively. The corresponding values for the floating-point multiplier were 63 GHz and 63 GFLOPS, respectively.

DOI： 10.1109/tasc.2023.3250614

A High-Throughput Multiply-Accumulate Unit With Long Feedback Loop Using Low-Voltage Rapid Single-Flux Quantum Circuits (peer-reviewed)

Nagaoka I., Kashima R., Ishida K., Tanaka M., Yamashita T., Ono T., Inoue K., Fujimaki A.

IEEE Transactions on Applied Superconductivity 33 ( 3 ) 1 - 8 April 2023 ( ISSN:1051-8223 eISSN:1558-2515 )

Language: Other; Publication type: Research paper (academic journal); Publisher: IEEE Transactions on Applied Superconductivity

In this article, we demonstrated a high-throughput gate-level-pipelined 8-bit multiply-accumulate (MAC) unit with a long feedback loop using low-voltage rapid single-flux quantum (LV-RSFQ) logic. The long feedback loop in the MAC unit is an obstacle for high-throughput operation because the logic gates must wait for the delayed inputs from the feedback loop. The LV-RSFQ logic makes high-frequency operation even more difficult by larger and more variable feedback delay. We design the feedback loop by using counter-flow clocking and adding many D flip-flops to divide the long feedback loop into shorter paths. The target clock frequency of the MAC unit with a feedback loop was set to 30 GHz by the experimental results of the MAC unit without a feedback loop. We model the clock frequency and its circuit overhead in a feedback loop to design the feedback loop in the MAC unit achieving 30 GHz with a minimum overhead. The test chips are fabricated using the National Institute of Advanced Industrial Science and Technology (AIST) 10-kA/cm² Advanced Process 2. We have successfully obtained high-throughput 30-GHz operations in the LV-RSFQ MAC unit with a long feedback loop by using the model-based design. The maximum operating frequency of the MAC unit reaches 40 GHz.

DOI： 10.1109/tasc.2023.3239329

Next Generation Cryogenic Superconductor Computing: From Classical to Quantum

井上弘士

April 2023

Language: English; Publisher: IEEE

Moore’s Law, doubling the number of transistors in a chip every two years, has so far contributed to the evolution of computer systems. Unfortunately, we cannot expect sustainable transistor shrinking anymore, marking the beginning of the so-called post-Moore era. Therefore, it has become essential to explore emerging devices, and superconductor single-flux-quantum (SFQ) logic that operates in a 4.2-kelvin environment is a promising candidate. Josephson junctions (JJs) are used as switching elements in SFQ logic to compose a superconductor ring (SFQ ring) that can store (or trap) and transfer a single magnetic flux quantum. It fundamentally operates with the voltage pulse-driven nature that makes it possible to achieve extremely low-latency and low-energy JJ switching. This talk shares the history of our SFQ research, e.g., revisiting microarchitecture and demonstrating over 30 GHz microprocessors, AI accelerator designs, and recently targeting quantum computers. Then, the role of computer architecture for such emerging device computing is discussed.

A Hybrid Opto-Electrical Floating-point Multiplier [Peer-reviewed]

Inaba T., Ono T., Inoue K., Kawakami S.

Proceedings - 2022 IEEE 15th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2022, 313-320, December 2022 (ISBN: 9781665464994)

Language: Other; Publication type: Research paper (other academic conference materials); Publisher: Proceedings - 2022 IEEE 15th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2022

The performance improvement offered by CMOS circuit technology is reaching its limits. Many researchers have been studying computing technologies that use emerging devices to address this critical issue. Nanophotonic technology is a promising candidate due to its ultra-low latency, high bandwidth, and low power consumption. Current research on nanophotonic computing focuses on designing hardware accelerators for AI inference applications, whereas nanophotonic accelerators for AI training applications have received little attention. The main reason is that state-of-the-art nanophotonic AI accelerators involve integer operations, whereas floating-point (FP) sum-of-products operations dominate the training process. To the best of the authors' knowledge, there are no optical circuits that target floating-point arithmetic units. This study proposes a novel Opto-Electrical Floating-point Multiplier (OEFM) toward an ultra-low-latency, power-efficient nanophotonic accelerator for AI training applications. We design a microarchitecture of OEFM, including a novel optical integer multiplier and other electrical components. Based on our evaluation framework, we analyze the calculation accuracy of the proposed multiplier and OEFM. Experimental results show that OEFM achieves a 56% reduction in latency and a 41% reduction in energy consumption compared with a conventional electrical circuit.

DOI： 10.1109/mcsoc57363.2022.00057

Implementation of Edge-cloud Cooperative CNN Inference on an IoT Platform [Peer-reviewed]

Wang Y., Shibamura H., Ng K.Y., Inoue K.

Proceedings - 2022 IEEE 15th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2022, 337-344, December 2022 (ISBN: 9781665464994)

Language: Other; Publication type: Research paper (other academic conference materials); Publisher: Proceedings - 2022 IEEE 15th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2022

Since the Internet of Things (IoT) has become more widely used in various industrial situations, Artificial Intelligence (AI) programs, particularly Convolutional Neural Network (CNN) applications, are projected to be implemented on edge devices to meet high-accuracy and huge industry computing needs. Offloading computing-intensive workloads to the cloud is a promising solution for compact energy-constrained edge devices, but it tends to incur significant costs in total execution latency. For flexible and fine-grained offloading, this paper aims to design and implement an edge-cloud cooperative CNN inference framework on an IoT platform by targeting TensorFlow Lite. We have confirmed the implementation's feasibility and accuracy through the verification of implementing LeNet, AlexNet, and VGGNet. Intending to perform high-performance edge-cloud AI executions on the presented IoT platform, we evaluate the performance overhead (total execution latency) of the provided implementation and identify the current bottlenecks of the target platform for enhancing it.

DOI： 10.1109/mcsoc57363.2022.00060

Design and Analysis of a Nano-photonic Processing Unit for Low-Latency Recurrent Neural Network Applications [Peer-reviewed]

Sato E., Inoue K., Kawakami S.

Proceedings - 2022 IEEE 15th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2022, 321-329, December 2022 (ISBN: 9781665464994)

Language: Other; Publication type: Research paper (other academic conference materials); Publisher: Proceedings - 2022 IEEE 15th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2022

Recurrent neural networks (RNNs) have achieved high performance in inference processing that handles time-series data. Hardware acceleration of RNN processing is helpful for tasks where real-time performance is essential, such as speech recognition and stock market prediction. The nano-photonic neural network accelerator is an approach that takes advantage of the high speed, high parallelism, and low power consumption of light to achieve high performance in neural network processing. However, existing methods are inefficient for RNNs due to significant overhead caused by the absence of recursive paths and the immaturity of the model to be designed. Therefore, architectural considerations that take advantage of RNN characteristics are essential for low latency. This paper proposes a fast and low-power processing unit for RNNs that introduces activation functions and recursion processing using optical devices. We clarified the impact of noise on the proposed circuit's calculation accuracy and inference accuracy. The calculation accuracy deteriorated significantly in proportion to the increase in the number of recursions, but the effect on inference accuracy was negligible. We also compared the performance of the proposed circuit to an all-electrical design and to a hybrid design that processes the vector-matrix product optically and the recursion electrically. Compared with the all-electrical design, the proposed circuit improves latency by 467x and reduces power consumption by 93.0%; compared with the hybrid design, it improves latency by 7.3x and reduces power consumption by 58.6%.

DOI： 10.1109/mcsoc57363.2022.00058

A 57.2GHz 11.2mW 8-bit General Purpose Superconductor Microprocessor with Dual-Clocking Scheme [Peer-reviewed]

Nagaoka I., Kashima R., Nakano T., Tanaka M., Yamashita T., Inoue K., Fujimaki A.

2022 IEEE Asian Solid-State Circuits Conference, A-SSCC 2022 - Proceedings, 1-3, November 2022 (ISBN: 9781665471435)

Language: Other; Publication type: Research paper (other academic conference materials); Publisher: 2022 IEEE Asian Solid-State Circuits Conference, A-SSCC 2022 - Proceedings

A superconductor single-flux-quantum (SFQ) logic 8-bit microprocessor is demonstrated up to 57.2 GHz with a measured power consumption of 11.2 mW. The microprocessor has an ultradeep, gate-level pipelining containing many feedback paths and communications between components. The arrival clock timings at all the logic gates are ultra-precisely tuned using two different clocking schemes, called 'concurrent-flow' and 'counter-flow,' to achieve extremely high clock frequency operation over 50 GHz. Low-temperature circumstances enable us to conduct super delay-intensive layout design by controlling delays of all waveguide interconnects in the order of sub-picosecond precision.

DOI： 10.1109/a-sscc56115.2022.9980802

An Edge Autonomous Lamp Control with Camera Feedback [Peer-reviewed]

Matsushita S., Tanimoto T., Kawakami S., Ono T., Inoue K.

2022 IEEE 8th World Forum on Internet of Things, WF-IoT 2022, 1-7, October 2022 (ISBN: 9781665491532)

Language: Other; Publication type: Research paper (other academic conference materials); Publisher: 2022 IEEE 8th World Forum on Internet of Things, WF-IoT 2022

Recently, IoT edge devices have become more diverse and lower in cost. In addition, the computing performance of small, low-power single-board computers has increased significantly. These conditions make it possible to process data locally without communicating with the cloud. Since the advantages of in-edge processing are security and privacy, we applied in-edge IoT to smart homes, which hold rich private information that must be secured. In in-edge processing, conventional cloud-managed abnormality monitoring and system maintenance cannot be used. We developed a lamp control system with in-edge processing. It detects failures using camera image processing and recovers from them. Abnormalities in the image processing itself are detected by monitoring the cyclic change in outdoor brightness observed through windows captured with the same camera. We have developed a prototype system in Python with OpenCV, FastAPI, etc., on top of a PHP-based lamp timer control, while keeping the source code small and considering ease of validation. The camera detectors run at 10 FPS in Python with as few as 1607 total source code lines (three times the line count of the original lamp control timer).

DOI： 10.1109/wf-iot54382.2022.10152281

Q3DE: A fault-tolerant quantum computer architecture for multi-bit burst errors by cosmic rays

鈴木泰成, 杉山太香典, アライトモチカ, 井上弘士, 谷本輝夫

IEEE/ACM International Symposium on Microarchitecture (MICRO) 2022, 1110-1125, October 2022

Language: English; Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Demonstrating small error rates by integrating quantum error correction (QEC) into an architecture of quantum computing is the next milestone towards scalable fault-tolerant quantum computing (FTQC). Encoding logical qubits with superconducting qubits and surface codes is considered a promising candidate for FTQC architectures. In this paper, we propose an FTQC architecture, which we call Q3DE, that enhances the tolerance to multi-bit burst errors (MBBEs) by cosmic rays with moderate changes and overhead. There are three core components in Q3DE: in-situ anomaly DEtection, dynamic code DEformation, and optimized error DEcoding. In this architecture, MBBEs are detected only from syndrome values for error correction. The effect of MBBEs is immediately mitigated by dynamically increasing the encoding level of logical qubits and re-estimating probable recovery operation with the rollback of the decoding process. We investigate the performance and overhead of the Q3DE architecture with quantum-error simulators and demonstrate that Q3DE effectively reduces the period of MBBEs by 1000 times and halves the size of their region. Therefore, Q3DE significantly relaxes the requirement of qubit density and qubit chip size to realize FTQC. Our scheme is versatile for mitigating MBBEs, i.e., temporal variations of error properties, on a wide range of physical devices and FTQC architectures since it relies only on the standard features of topological stabilizer codes.

Design of Variable Bit-Width Arithmetic Unit Using Single Flux Quantum Device

Iori Ishikawa, Ikki Nagaoka, Ryota Kashima, Koki Ishida, Kosuke Fukumitsu, Keitaro Oka, Masamitsu Tanaka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Akira Fujimaki, Koji Inoue

2022 IEEE International Symposium on Circuits and Systems (ISCAS) 2022-May, 3547-3551, May 2022 (ISSN: 02714310, ISBN: 9781665484855)

Publisher: IEEE

This paper presents the design of an ultra-high-speed, low-power arithmetic unit that supports variable bit-width operations with single flux quantum (SFQ) technology. Because of the high-speed nature of superconductor devices, we can achieve extremely high power-performance efficiency that cannot be achieved by state-of-the-art CMOS devices. To implement the complex function to support the variable bit-width feature, we introduce a novel circuit architecture to maintain the high-speed operation over 50GHz. Our prototype chip design successfully demonstrated 53.5GHz 1.59mW operations.

DOI： 10.1109/iscas48785.2022.9937317

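九州大学と名古屋大学の連携が世界最先端のコンピュータ技術をリードする！次世代超伝導コンピュータ・アーキテクチャ [Collaboration between Kyushu University and Nagoya University Leads the World's Most Advanced Computer Technology! Next-Generation Superconducting Computer Architecture]

井上弘士

低温工学 57(6), 382-383, 2022 (ISSN: 03892441, eISSN: 18800408)

Language: Japanese; Publisher: 公益社団法人低温工学・超電導学会 (Cryogenics and Superconductivity Society of Japan; formerly 社団法人低温工学協会)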

DOI： 10.2221/jcsj.57.382

Fast Screen Content Coding in HEVC Using Machine Learning.

Emad Badry, Koji Inoue, Mohammed Sharaf Sayed

IEEE Access 9, 154659-154666, November 2021

Language: Other; Publication type: Research paper (scientific journal)

DOI： 10.1109/ACCESS.2021.3125697
Demonstration of a 52-GHz Bit-Parallel Multiplier Using Low-Voltage Rapid Single-Flux-Quantum Logic [Peer-reviewed, International journal]

Ikki Nagaoka, Koki Ishida, Masamitsu Tanaka, Kyosuke Sano, Taro Yamashita, Takatsugu Ono, Koji Inoue, Akira Fujimaki

IEEE Transactions on Applied Superconductivity 31(5), 1-5, August 2021

Language: English; Publication type: Research paper (scientific journal)

DOI： 10.1109/tasc.2021.3071996
Decision Tree Models and Early Splitting Termination in Screen Content Extension of High Efficiency Video Coding.

Emad Badry, Koji Inoue, Mohammed Sharaf Sayed

IEEE Access 8, 143437-143452, August 2020

Language: Other; Publication type: Research paper (scientific journal)

DOI： 10.1109/ACCESS.2020.3014163
How Many Trials Do We Need for Reliable NISQ Computing?

Teruo Tanimoto, Shuhei Matsuo, Satoshi Kawakami, Yutaka Tabuchi, Masao Hirokawa, Koji Inoue

2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 288-290, July 2020

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ISVLSI49217.2020.00059
Practical Error Modeling Toward Realistic NISQ Simulation.

Teruo Tanimoto, Shuhei Matsuo, Satoshi Kawakami, Yutaka Tabuchi, Masao Hirokawa, Koji Inoue

2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 291-293, July 2020

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ISVLSI49217.2020.00060
32 GHz 6.5 mW Gate-Level-Pipelined 4-Bit Processor using Superconductor Single-Flux-Quantum Logic

Koki Ishida, Masamitsu Tanaka, Ikki Nagaoka, Takatsugu Ono, Satoshi Kawakami, Teruo Tanimoto, Akira Fujimaki, Koji Inoue

2020 IEEE Symposium on VLSI Circuits, VLSI Circuits 2020 - Proceedings, June 2020

Language: English; Publication type: Research paper (other academic conference materials)

A Single-Flux-Quantum (SFQ) 4-bit throughput-oriented processor has successfully been demonstrated at up to 32 GHz with the measured power consumption of 6.5 mW. This is the first implementation of the gate-level-pipelined processor, and it achieves 2.5 Tera-Operations Per Watt (TOPS/W) by circuit and architectural optimizations.

DOI： 10.1109/VLSICircuits18222.2020.9162826
32 GHz 6.5 mW Gate-Level-Pipelined 4-Bit Processor using Superconductor Single-Flux-Quantum Logic.

Koki Ishida, Masamitsu Tanaka, Ikki Nagaoka, Takatsugu Ono, Satoshi Kawakami, Teruo Tanimoto, Akira Fujimaki, Koji Inoue

IEEE Symposium on VLSI Circuits, 1-2, June 2020

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/VLSICircuits18222.2020.9162826
Enhancing a manycore-oriented compressed cache for GPGPU

Keitaro Oka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Inoue Koji

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 22-31, January 2020

Language: English; Publication type: Research paper (other academic conference materials)

GPUs can achieve high performance by exploiting massive-thread parallelism. However, some factors limit performance on GPUs, one of which is the negative effects of L1 cache misses. In some applications, GPUs are likely to suffer from L1 cache conflicts because a large number of cores share a small L1 cache capacity. A cache architecture that is based on data compression is a strong candidate for solving this problem as it can reduce the number of cache misses. Unlike previous studies, our data compression scheme attempts to exploit the value locality existing within not only intra cache lines but also inter cache lines. We enhance the structure of a last-level compression cache proposed for general purpose manycore processors to optimize against shared L1 caches on GPUs. The experimental results reveal that our proposal outperforms the other compression cache for GPUs by 11 points on average.
Enhancing a manycore-oriented compressed cache for GPGPU.

Keitaro Oka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Koji Inoue

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 22-31, January 2020

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1145/3368474.3368491
Energy Efficient Runahead Execution on a Tightly Coupled Heterogeneous Core.

Susumu Mashimo, Ryota Shioya, Koji Inoue

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 207-216, January 2020

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1145/3368474.3368496
An open source FPGA-optimized out-of-order RISC-V soft processor

Susumu Mashimo, Koji Inoue, Ryota Shioya, Akifumi Fujita, Reoma Matsuo, Seiya Akaki, Akifumi Fukuda, Toru Koizumi, Junichiro Kadomoto, Hidetsugu Irie, Masahiro Goshima

18th International Conference on Field-Programmable Technology, ICFPT 2019; Proceedings - 2019 International Conference on Field-Programmable Technology, ICFPT 2019, 63-71, December 2019

Language: English; Publication type: Research paper (other academic conference materials)

High-performance soft processors in field-programmable gate arrays (FPGAs) have become increasingly important as recent large FPGA systems have relied on soft processors to run many complex workloads, like a network software stack. An out-of-order (OoO) superscalar approach is a good candidate to improve performance in such cases, as evidenced from OoO hard processor studies. Recent studies have revealed, however, that conventional OoO processor components do not fit well in an FPGA, and it is thus important to carefully design such components for FPGA characteristics. Hence, we propose the RSD processor: a new, open-source OoO RISC-V soft processor optimized for an FPGA. The RSD supports many aggressive OoO execution features, like speculative scheduling, OoO memory instruction execution and disambiguation, a memory dependence predictor, and a non-blocking cache. While the RSD supports such aggressive features, it also leverages FPGA characteristics. Therefore, it consumes fewer FPGA resources than are consumed by existing OoO soft processors, which do not support such aggressive features well. We first introduce the end result of the RSD microarchitecture design and then describe several novel optimization techniques. The RSD achieves up to 2.5-times higher Dhrystone MIPS while using 60% fewer registers and 64% fewer lookup tables (LUTs) as compared to state-of-the-art, open-source OoO processors.

DOI： 10.1109/ICFPT47387.2019.00016
Evaluating the Impact of Energy Efficient Networks on HPC Workloads.

Giorgis Georgakoudis, Nikhil Jain, Takatsugu Ono, Koji Inoue, Shinobu Miwa, Abhinav Bhatele

26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 301-310, December 2019

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/HiPC.2019.00044
An Open Source FPGA-Optimized Out-of-Order RISC-V Soft Processor.

Susumu Mashimo, Koji Inoue, Ryota Shioya, Akifumi Fujita, Reoma Matsuo, Seiya Akaki, Akifumi Fukuda, Toru Koizumi, Junichiro Kadomoto, Hidetsugu Irie, Masahiro Goshima

International Conference on Field-Programmable Technology (FPT), 63-71, December 2019

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ICFPT47387.2019.00016
Evaluating the Impact of Energy Efficient Networks on HPC Workloads

Giorgis Georgakoudis, Nikhil Jain, Takatsugu Ono, Koji Inoue, Shinobu Miwa, Abhinav Bhatele

26th Annual IEEE International Conference on High Performance Computing, HiPC 2019; Proceedings - 26th IEEE International Conference on High Performance Computing, HiPC 2019, 301-310, December 2019

Language: English; Publication type: Research paper (other academic conference materials)

Interconnection networks grow larger as supercomputers include more nodes and require higher bandwidth for performance. This scaling significantly increases the fraction of power consumed by the network, by increasing the number of network components (links and switches). Typically, network links consume power continuously once they are turned on. However, recent proposals for energy efficient interconnects have introduced low-power operation modes for periods when network links are idle. Low-power operation can increase messaging time when switching a link from low-power to active operation. We extend the TraceR-CODES network simulator for power modeling to evaluate the impact of energy efficient networking on power and performance. Our evaluation presents the first study on both single-job and multi-job execution to realistically simulate power consumption and performance under congestion for a large-scale HPC network. Results on several workloads consisting of HPC proxy applications show that single-job and multi-job execution favor different modes of low power operation to have significant power savings at the cost of minimal performance degradation.

DOI： 10.1109/HiPC.2019.00044
Novel frontier of photonics for data processing-Photonic accelerator [Peer-reviewed, International journal]

Ken Ichi Kitayama, Masaya Notomi, Makoto Naruse, Koji Inoue, Satoshi Kawakami, Atsushi Uchida

APL Photonics 4 (090901), September 2019

Language: English

DOI： 10.1063/1.5108912
Novel frontier of photonics for data processing-Photonic accelerator [Peer-reviewed, International journal]

Kitayama, Ken-ichi; Notomi, Masaya; Naruse, Makoto; Inoue, Koji; Kawakami, Satoshi; Uchida, Atsushi

APL Photonics 4(9), September 2019

Language: English

DOI： 10.1063/1.5108912
Novel frontier of photonics for data processing-Photonic accelerator [Peer-reviewed]

Ken Ichi Kitayama, Masaya Notomi, Makoto Naruse, Koji Inoue, Satoshi Kawakami, Atsushi Uchida

APL Photonics 4(9), September 2019

Language: English

In the emerging Internet of things cyber-physical system-embedded society, big data analytics needs huge computing capability with better energy efficiency. Coming to the end of Moore's law of the electronic integrated circuit and facing the throughput limitation in parallel processing governed by Amdahl's law, there is a strong motivation behind exploring a novel frontier of data processing in post-Moore era. Optical fiber transmissions have been making a remarkable advance over the last three decades. A record aggregated transmission capacity of the wavelength division multiplexing system per a single-mode fiber has reached 115 Tbit/s over 240 km. It is time to turn our attention to data processing by photons from the data transport by photons. A photonic accelerator (PAXEL) is a special class of processor placed at the front end of a digital computer, which is optimized to perform a specific function but does so faster with less power consumption than an electronic general-purpose processor. It can process images or time-serial data either in an analog or digital fashion on a real-time basis. Having had maturing manufacturing technology of optoelectronic devices and a diverse array of computing architectures at hand, prototyping PAXEL becomes feasible by leveraging on, e.g., cutting-edge miniature and power-efficient nanostructured silicon photonic devices. In this article, first the bottleneck and the paradigm shift of digital computing are reviewed. Next, we review an array of PAXEL architectures and applications, including artificial neural networks, reservoir computing, pass-gate logic, decision making, and compressed sensing. We assess the potential advantages and challenges for each of these PAXEL approaches to highlight the scope for future work toward practical implementation.

DOI： 10.1063/1.5108912
Efficient Autoencoder-Based Human Body Communication Transceiver for WBAN.

Abdelhay Ali, Koji Inoue, Ahmed Shalaby, Mohammed Sharaf Sayed, Sabah Mohamed Ahmed

IEEE Access 7, 117196-117205, August 2019

Language: Other; Publication type: Research paper (scientific journal)

DOI： 10.1109/ACCESS.2019.2936796
Demonstration of an Energy-Efficient, Gate-Level-Pipelined 100 TOPS/W Arithmetic Logic Unit Based on Low-Voltage Rapid Single-Flux-Quantum Logic

Ikki Nagaoka, Masamitsu Tanaka, Kyosuke Sano, Taro Yamashita, Akira Fujimaki, Koji Inoue

17th IEEE International Superconductive Electronics Conference, ISEC 2019; ISEC 2019 - International Superconductive Electronics Conference, July 2019

Language: English; Publication type: Research paper (other academic conference materials)

We report the successful operation of an energy-efficient 8-bit arithmetic logic unit (ALU) based on bit-parallel, gate-level-pipelining, and low-voltage rapid single-flux-quantum (LV-RSFQ) approaches. We implemented the ALU using a 10-kA/cm² Nb process. The bias voltage was optimized to obtain high energy efficiency. Although a lowered bias voltage makes timing design difficult, we solved the problem by precise timing control. The operating frequency reached 30 GHz. Thanks to these high-throughput and low-energy technologies, we realized highly energy-efficient operation over 100 tera-operations per second per watt (TOPS/W).

DOI： 10.1109/ISEC46533.2019.8990905
Critical Path Based Microarchitectural Bottleneck Analysis for Out-of-Order Execution.

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 102-A(6), 758-766, June 2019

Language: Other; Publication type: Research paper (scientific journal)

DOI： 10.1587/transfun.E102.A.758
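ナノフォトニック・ニューラルネットワークアクセラレータ向け統合評価環境 [Integrated Evaluation Environment for Nanophotonic Neural Network Accelerators] [Peer-reviewed, International journal]

川上哲志, 小野貴継, 井上弘士, 納富雅也

電子情報通信学会論文誌 (IEICE Transactions, Japanese edition) J102-A(6), June 2019

Language: Japanese; Publication type: Research paper (scientific journal)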
Critical Path based Microarchitectural Bottleneck Analysis for Out-of-Order Execution [Peer-reviewed, International journal]

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

IEICE Transactions, June 2019

Language: English; Publication type: Research paper (scientific journal)
Hardware friendly algorithm for earthquakes discrimination based on wavelet filter bank and support vector machine

Omar M. Saad, Ahmed Shalaby, Inoue Koji, Mohammed S. Sayed

2018 Japan-Africa Conference on Electronics, Communications, and Computations, JAC-ECC 2018; 2018 Proceedings of the Japan-Africa Conference on Electronics, Communications, and Computations, JAC-ECC 2018, 115-118, April 2019

Language: English; Publication type: Research paper (other academic conference materials)

Discrimination between earthquakes and explosions is one of the main challenges in the field of seismology. In some cases, explosions are recorded as earthquakes or vice versa, which can contaminate the seismic catalog. Rapid discrimination is required to support real-time seismic applications. The discrimination algorithm is based on a wavelet filter bank to extract the discriminative features and a support vector machine (SVM) as the classifier. We therefore propose to optimize the hardware implementation of the discrimination algorithm on a Field Programmable Gate Array (FPGA). First, we implement the wavelet filter bank using an optimized lifting scheme. Then, we use a linear classifier to implement the SVM classifier. Finally, we optimize the hardware resources of the discrimination algorithm so that it fits on a low-cost FPGA board, the TE0711 (Xilinx Artix7). The implemented design utilizes 1.2% and 39.8% of the FPGA's Look Up Table (LUT) and register resources, respectively.
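
The pipeline described in this abstract (lifting-based wavelet filter bank feeding a linear SVM) can be sketched in software as follows. This is only an illustrative sketch: the abstract does not name the wavelet, the features, or the trained parameters, so the Haar wavelet, the per-level energy feature, and the `weights` and `bias` values below are all assumptions.

```python
def haar_lifting_step(signal):
    """One Haar decomposition level via lifting: split, predict, update."""
    even, odd = signal[0::2], signal[1::2]
    detail = [o - e for e, o in zip(even, odd)]         # predict step
    approx = [e + d / 2 for e, d in zip(even, detail)]  # update step
    return approx, detail


def band_energies(signal, levels):
    """Assumed feature: mean squared detail coefficient at each level."""
    feats = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_lifting_step(approx)
        feats.append(sum(d * d for d in detail) / len(detail))
    return feats


def linear_svm_predict(features, weights, bias):
    """Linear SVM decision rule: the sign of w . x + b, as +1 or -1."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1


# Toy input; its length must be divisible by 2**levels.
samples = [4, 6, 10, 12, 8, 6, 5, 5]
features = band_energies(samples, levels=2)

# Hypothetical trained parameters, chosen only to exercise the code.
weights, bias = [1.0, -0.1], -0.5
label = linear_svm_predict(features, weights, bias)
print(features, label)
```

A linear decision rule needs only one multiply-accumulate per feature, which is one plausible reason a linear classifier suits a small FPGA.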

DOI： 10.1109/JEC-ECC.2018.8679531
Message from the Prof. Koji Inoue

Koji Inoue

2018 Japan-Africa Conference on Electronics, Communications, and Computations, JAC-ECC 2018; 2018 Proceedings of the Japan-Africa Conference on Electronics, Communications, and Computations, JAC-ECC 2018, IV, April 2019

Language: English

DOI： 10.1109/JEC-ECC.2018.8679541
Improving lifetime in MLC phase change memory using slow writes

Takatsugu Ono, Zhe Chen, Inoue Koji

2018 Japan-Africa Conference on Electronics, Communications, and Computations, JAC-ECC 2018; 2018 Proceedings of the Japan-Africa Conference on Electronics, Communications, and Computations, JAC-ECC 2018, 65-68, April 2019

Language: English; Publication type: Research paper (other academic conference materials)

This paper reports the performance and endurance impacts of a slow-write approach for multi-level cell (MLC) phase change memory (PCM). An MLC improves the density of PCM, but endurance is a critical issue. A slow-write approach is one of the techniques used to extend the lifetime of the cell. However, it increases the program execution time because each write takes longer. In this paper, we discuss three types of slow-write approach for MLC and evaluate the endurance and performance quantitatively to understand the effectiveness of our approach. Our evaluation results show that one of the approaches enhances the endurance of MLC PCM by 1.57 times with a 1.41% performance degradation on average compared with the conventional write operation.

DOI： 10.1109/JEC-ECC.2018.8679540
29.3 A 48GHz 5.6mW Gate-Level-Pipelined Multiplier Using Single-Flux Quantum Logic

Ikki Nagaoka, Masamitsu Tanaka, Koji Inoue, Akira Fujimaki

2019 IEEE International Solid-State Circuits Conference, ISSCC 2019, 460-462, March 2019

Language: English; Publication type: Research paper (other academic conference materials)

A multiplier based on superconductor single-flux-quantum (SFQ) logic is demonstrated up to 48 GHz with the measured power consumption of 5.6 mW. The multiplier performs 8 × 8-bit signed multiplication every clock cycle. The design is based on a bit-parallel, gate-level-pipelined structure that exploits ultimately high-throughput performance of SFQ logic. The test chip fabricated using a 1.0-μm, 9-layer process consists of 20,251 Nb/AlOx/Nb Josephson junctions (JJs). The correctness of operation is verified by on-chip high-speed testing.
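
As a back-of-the-envelope check of what these figures imply, the short Python calculation below derives throughput and energy per multiplication. It is a sketch under two assumptions not stated in the abstract: that the 5.6 mW figure applies at the 48 GHz operating point, and that cooling power is excluded.

```python
clock_hz = 48e9     # maximum demonstrated clock frequency (from the abstract)
power_w = 5.6e-3    # measured power consumption (from the abstract)
ops_per_cycle = 1   # one 8 x 8-bit multiplication per clock cycle

throughput = clock_hz * ops_per_cycle    # multiplications per second
energy_per_op_j = power_w / throughput   # joules per multiplication
ops_per_watt = throughput / power_w      # multiplications per second per watt

print(f"{throughput / 1e9:.0f} G multiplications/s")
print(f"{energy_per_op_j * 1e12:.3f} pJ per multiplication")
print(f"{ops_per_watt / 1e12:.2f} tera-multiplications/s per watt")
```

Because the design is gate-level pipelined, throughput equals the clock frequency, so efficiency is simply frequency divided by power.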

DOI： 10.1109/ISSCC.2019.8662351
A 48GHz 5.6mW Gate-Level-Pipelined Multiplier Using Single-Flux Quantum Logic.

Ikki Nagaoka, Masamitsu Tanaka, Koji Inoue, Akira Fujimaki

IEEE International Solid-State Circuits Conference (ISSCC), 460-462, February 2019

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ISSCC.2019.8662351
Radio propagation characteristics-based spoofing attack prevention on wireless connected devices

Mihiro Sonoyama, Takatsugu Ono, Haruichi Kanaya, Osamu Muta, Smruti R. Sarangi, Koji Inoue

Journal of Information Processing 27, 322-334, January 2019

Language: Other; Publication type: Research paper (scientific journal)

A spoofing attack is a critical issue in wireless communication in which a malicious transmitter outside a system attempts to pass as genuine. As a countermeasure, we propose a device-authentication method based on position identification using radio-propagation characteristics (RPCs). Because it does not depend on information processing such as encryption, this method can be applied to sensing devices and similar equipment that commonly have many resource restrictions. We call the space from which attacks can succeed the “attack space.” In order to confine the attack space to the inside of the target system and thus prevent spoofing attacks from the outside, the relationship between combinations of transceivers and the attack space must be formulated. In this research, we consider two RPCs, the received signal strength ratio (RSSR) and the time difference of arrival (TDoA), and construct an attack-space model that uses these RPCs simultaneously. We take a tire pressure monitoring system (TPMS) as a case study of this method and perform a security evaluation based on radio-wave-propagation simulation. The simulation results, which assume multiple noise environments, all indicate that it is possible to eliminate the possibility of attack from a distant location.
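
To make the idea of combining the two RPCs concrete, the following Python sketch accepts a transmitter only if both its TDoA and its RSSR at two receivers match reference values within a tolerance. It is not the paper's attack-space model: the log-distance path-loss model, the two-receiver 2-D geometry, the coordinates, and the tolerances are all illustrative assumptions.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def tdoa(tx, rx_a, rx_b):
    """Time difference of arrival (seconds) between receivers A and B."""
    return (math.dist(tx, rx_a) - math.dist(tx, rx_b)) / SPEED_OF_LIGHT


def rssr_db(tx, rx_a, rx_b, path_loss_exponent=2.0):
    """Received signal strength ratio A/B in dB (assumed log-distance model)."""
    ratio = math.dist(tx, rx_b) / math.dist(tx, rx_a)
    return 10.0 * path_loss_exponent * math.log10(ratio)


def is_accepted(tx, rx_a, rx_b, ref_tdoa, ref_rssr, tol_tdoa, tol_rssr):
    """Accept only if both characteristics fall inside their tolerance windows."""
    return (abs(tdoa(tx, rx_a, rx_b) - ref_tdoa) <= tol_tdoa
            and abs(rssr_db(tx, rx_a, rx_b) - ref_rssr) <= tol_rssr)


# Hypothetical layout in metres: two receivers and one genuine sensor.
rx_a, rx_b = (0.0, 0.0), (2.0, 0.0)
genuine = (0.5, 0.5)
ref_tdoa = tdoa(genuine, rx_a, rx_b)
ref_rssr = rssr_db(genuine, rx_a, rx_b)

attacker = (10.0, 10.0)  # a distant transmitter outside the system
for name, pos in (("genuine", genuine), ("attacker", attacker)):
    ok = is_accepted(pos, rx_a, rx_b, ref_tdoa, ref_rssr,
                     tol_tdoa=0.5e-9, tol_rssr=1.0)
    print(name, ok)
```

In this picture, the attack space is the set of positions for which `is_accepted` returns true, and requiring both checks at once shrinks that set.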

DOI： 10.2197/ipsjjip.27.322
Radio Propagation Characteristics-Based Spoofing Attack Prevention on Wireless Connected Devices [Peer-reviewed, International journal]

Mihiro Sonoyama, Takatsugu Ono, Haruichi Kanaya, Osamu Muta, Smruti Sarangi, Koji Inoue

IPSJ ACS, January 2019

Language: English; Publication type: Research paper (scientific journal)
Performance Analysis of CPU and DRAM Power Constrained Systems with Magnetohydrodynamic Simulation Code

Keiichiro Fukazawa, Masatsugu Ueda, Yuichi Inadomi, Mutsumi Aoyagi, Takayuki Umeda, Koji Inoue

Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018 626 - 631 January 2019

Language: English; Publication type: Research paper (conference proceedings and other materials)

The power consumption of supercomputer systems has become a critical issue in developing exascale supercomputer systems. On the other hand, application developers pay little attention to the power consumption characteristics of their applications because their main interest is how fast the applications run. In this study, we examine and evaluate the power consumption behavior of our magnetohydrodynamic simulation code, which solves the planetary magnetosphere, under constrained CPU and DRAM power on an x86 computer system. As a result, we found that under power capping some regions of the simulation code lose calculation performance while others are unaffected. This indicates that power can be optimized without performance degradation by applying dynamic power capping while the application runs. In addition, we obtained the specific combinations of CPU and DRAM power consumption that greatly affect the calculation performance.

DOI: 10.1109/HPCC/SmartCity/DSS.2018.00113
Critical path based microarchitectural bottleneck analysis for out-of-order execution (peer-reviewed)

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E102A ( 6 ) 758 - 766 January 2019

Language: English; Publication type: Research paper (scientific journal)

Correctly understanding microarchitectural bottlenecks is important to optimize performance and energy of OoO (Out-of-Order) processors. Although CPI (Cycles Per Instruction) stack has been utilized for this purpose, it stacks architectural events heuristically by counting how many times the events occur, and the order of stacking affects the result, which may be misleading. It is because CPI stack does not consider the execution path of dynamic instructions. Critical path analysis (CPA) is a well-known method to identify the critical execution path of dynamic instruction execution on OoO processors. The critical path consists of the sequence of events that determines the execution time of a program on a certain processor. We develop a novel representation of CPCI stack (Cycles Per Critical Instruction stack), which is CPI stack based on CPA. The main challenge in constructing CPCI stack is how to analyze a large number of paths because CPA often results in numerous critical paths. In this paper, we show that there are more than ten to the tenth power critical paths in the execution of only one thousand instructions in 35 benchmarks out of 48 from SPEC CPU2006. Then, we propose a statistical method to analyze all the critical paths and show a case study using the benchmarks.

DOI: 10.1587/transfun.E102.A.758
Radio propagation characteristics-based spoofing attack prevention on wireless connected devices (peer-reviewed)

Mihiro Sonoyama, Takatsugu Ono, Haruichi Kanaya, Osamu Muta, Smruti R. Sarangi, Koji Inoue

Journal of Information Processing 27 322 - 334 2019

Language: English; Publication type: Research paper (scientific journal)

A spoofing attack is a critical issue in wireless communication in which a malicious transmitter outside a system pretends to be genuine. As a countermeasure, we propose a device-authentication method based on position identification using radio-propagation characteristics (RPCs). Because it does not depend on information processing such as encryption, this method can be applied to sensing devices and similar hardware, which commonly have many resource restrictions. We call the space from which attacks can succeed the “attack space.” To confine the attack space to the inside of the target system and so prevent spoofing attacks from the outside, the relationship between combinations of transceivers and the attack space must be formulated. In this research, we consider two RPCs, the received signal strength ratio (RSSR) and the time difference of arrival (TDoA), and construct an attack-space model that uses these RPCs simultaneously. We take a tire pressure monitoring system (TPMS) as a case study of this method and carry out a security evaluation based on radio-wave-propagation simulation. The simulation results, which assume multiple noise environments, all indicate that it is possible to eliminate the possibility of attack from a distant location.

DOI: 10.2197/ipsjjip.27.322
Parallel Precomputation with Input Value Prediction for Model Predictive Control Systems (peer-reviewed, international journal)

Satoshi Kawakami, Takatsugu Ono, Toshiyuki Ohtsuka, Koji Inoue

December 2018

Language: English; Publication type: Research paper (scientific journal)
Situation-Based Dynamic Frame-Rate Control for On-Line Object Tracking (peer-reviewed)

Yusuke Inoue, Takatsugu Ono, Koji Inoue

International Japan-Africa Conference on Electronics, Communications and Computations 129 - 132 December 2018

Language: English; Publication type: Research paper (conference proceedings and other materials)

DOI: 10.1109/jec-ecc.2018.8679545
Improving Lifetime in MLC Phase Change Memory Using Slow Writes (peer-reviewed)

Takatsugu Ono, Zhe Chen, Koji Inoue

2018 International Japan-Africa Conference on Electronics, Communications and Computations (JAC-ECC) 65 - 68 December 2018

Language: Other; Publication type: Research paper (conference proceedings and other materials)

DOI: 10.1109/jec-ecc.2018.8679540
Parallel Precomputation with Input Value Prediction for Model Predictive Control Systems.

Satoshi Kawakami, Takatsugu Ono, Toshiyuki Ohtsuka, Koji Inoue

IEICE Transactions on Information & Systems 101-D ( 12 ) 2864 - 2877 December 2018

Language: Other; Publication type: Research paper (scientific journal)

DOI: 10.1587/transinf.2018PAP0003
Real-Time Frame-Rate Control for Energy-Efficient On-Line Object Tracking

Yusuke Inoue, Takatsugu Ono, Koji Inoue

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E101.A ( 12 ) 2297 - 2307 December 2018

Language: Other; Publication type: Research paper (scientific journal)

DOI: 10.1587/transfun.e101.a.2297
Real-Time frame-rate control for energy-efficient on-line object tracking (peer-reviewed)

Yusuke Inoue, Takatsugu Ono, Koji Inoue

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E101A ( 12 ) 2297 - 2307 December 2018

Language: English; Publication type: Research paper (scientific journal)

On-line object tracking (OLOT) has been a core technology in computer vision, and its importance has been increasing rapidly. Because this technology is utilized for battery-operated products, energy consumption must be minimized. This paper describes a method of adaptive frame-rate optimization to satisfy that requirement. An energy trade-off occurs between image capturing and object tracking. Therefore, the method optimizes the frame rate based on the continually changing object speed to minimize the total energy while taking the trade-off into account. Simulation results show a maximum energy reduction of 50.0% and an average reduction of 35.9% without serious degradation of tracking accuracy.

DOI: 10.1587/transfun.E101.A.2297
Power management framework for post-petascale supercomputers

Masaaki Kondo, Ikuo Miyoshi, Koji Inoue, Shinobu Miwa

Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project 249 - 269 December 2018

Language: English

Power consumption is a first class design constraint for developing future exascale computing systems. To achieve exascale system performance with realistic power provisioning of 20-30MW, we need to improve power-performance efficiency significantly compared to today's supercomputer systems. In order to maximize effective performance within a power constraint, investigating how to optimize power resource allocation to each hardware component or each job submitted to the system is necessary. We have been conducting research and development on a software framework for code optimization and system power management for the power-constraint adaptive systems. We briefly introduce the research efforts for maximizing application performance under a given power constraint, power-aware resource manager, and power-performance simulation and analysis framework for future supercomputer systems.

DOI: 10.1007/978-981-13-1924-2_13
Parallel precomputation with input value prediction for model predictive control systems (peer-reviewed)

Satoshi Kawakami, Takatsugu Ono, Toshiyuki Ohtsuka, Koji Inoue

IEICE Transactions on Information and Systems E101D ( 12 ) 2864 - 2877 December 2018

Language: English; Publication type: Research paper (scientific journal)

We propose a parallel precomputation method for real-time model predictive control. The key idea is to use predicted input values produced by model predictive control to solve an optimal control problem in advance. It is well known that control systems are not suitable for multi- or many-core processors because feedback-loop control systems are inherently based on sequential operations. However, since the proposed method does not rely on conventional thread-/data-level parallelism, it can be easily applied to such control systems without changing the algorithm in applications. A practical evaluation using three real-world model predictive control system simulation programs demonstrates drastic performance improvement without degrading control quality offered by the proposed method.

DOI: 10.1587/transinf.2018PAP0003
Real-time Frame-Rate Control for Energy-Efficient On-Line Object Tracking (invited, peer-reviewed, international journal)

Yusuke Inoue, Takatsugu Ono, Koji Inoue

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences December 2018

Language: English; Publication type: Research paper (scientific journal)
Automatic Arrival Time Detection for Earthquakes Based on Stacked Denoising Autoencoder (peer-reviewed)

Omar M. Saad, Koji Inoue, Ahmed Shalaby, Lotfy Samy, Mohammed S. Sayed

IEEE Geoscience and Remote Sensing Letters 15 ( 11 ) 1687 - 1691 November 2018

Language: English; Publication type: Research paper (scientific journal)

The accurate detection of P-wave arrival time is imperative for determining the hypocenter location of an earthquake. However, precise detection of onset time becomes more difficult when the signal-to-noise ratio (SNR) of the seismic data is low, such as during microearthquakes. In this letter, a stacked denoising autoencoder (SDAE) is proposed to smooth the background noise. The SDAE acts as a denoising filter for the seismic data. In the proposed algorithm, the SDAE is utilized to reduce background noise such that the onset time becomes clearer and sharper. Afterward, a hard decision with one threshold is used to detect the onset time of the event. The proposed algorithm is evaluated on both synthetic and field seismic data. As a result, the proposed algorithm outperforms the short-time average/long-time average and the Akaike information criterion algorithms. The proposed algorithm accurately picks the onset time for 94.1% of 407 field seismic waveforms with a standard deviation error of 0.10 s. In addition, the results indicate that the proposed algorithm can pick arrival times accurately for weak seismic data with SNR higher than -14 dB.

DOI: 10.1109/LGRS.2018.2861218
Evaluating Energy-Efficiency of DRAM Channel Interleaving Schemes for Multithreaded Programs (invited, peer-reviewed, international journal)

Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

IEICE Transactions on Information and Systems September 2018

Language: English; Publication type: Research paper (scientific journal)
Optical WDM parallel adder based on optical pass gate logic (2): Experimental study using thermo-optic switch

新家昭彦, 石原亨, 野崎謙悟, 北翔太, 井上弘士, Cong Guangwei, 山田浩治, 納富雅也

Extended Abstracts of the JSAP (Japan Society of Applied Physics) Meeting 2018.2 934 - 934 September 2018

Language: Japanese

DOI: 10.11470/jsapmeeting.2018.2.0_934
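Design of component circuits toward a 50-GHz bit-parallel microprocessor based on superconducting single-flux-quantum circuits (invited, peer-reviewed)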

田中雅光, 佐藤諒, 石田浩貴, 畑中湧貴, 松井祐一, 小野貴継, 井上弘士, 藤巻朗

September 2018

Language: Japanese
Evaluating Energy-Efficiency of DRAM Channel Interleaving Schemes for Multithreaded Programs

Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

IEICE Transactions on Information and Systems E101.D ( 9 ) 2247 - 2257 September 2018

Language: Other; Publication type: Research paper (scientific journal)

DOI: 10.1587/transinf.2017edp7296
Evaluating energy-efficiency of DRAM channel interleaving schemes for multithreaded programs (peer-reviewed)

Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

IEICE Transactions on Information and Systems E101D ( 9 ) 2247 - 2257 September 2018

Language: English; Publication type: Research paper (scientific journal)

The power consumption of server platforms has been increasing as the amount of hardware resources equipped on them is increased. In particular, the capacity of DRAM continues to grow, and it is not rare that DRAM consumes higher power than processors on modern servers. Therefore, a reduction in the DRAM energy consumption is a critical challenge to reduce the system-level energy consumption. Although it is well known that improving row buffer locality (RBL) and bank-level parallelism (BLP) is effective to reduce the DRAM energy consumption, our preliminary evaluation on a real server demonstrates that RBL is generally low across 15 multithreaded benchmarks. In this paper, we investigate the memory access patterns of these benchmarks using a simulator and observe that cache line-grained channel interleaving schemes, which are widely applied to modern servers including multiple memory channels, hurt the RBL each of the benchmarks potentially possesses. In order to address this problem, we focus on a row-grained channel interleaving scheme and compare it with three cache line-grained schemes. Our evaluation shows that it reduces the DRAM energy consumption by 16.7%, 12.3%, and 5.5% on average (up to 34.7%, 28.2%, and 12.0%) compared to the other schemes, respectively.

DOI: 10.1587/transinf.2017EDP7296
Autoencoder based Features Extraction for Automatic Classification of Earthquakes and Explosions

Omar M. Saad, Koji Inoue, Ahmed Shalaby, Lotfy Samy, Mohammed S. Sayed

Proceedings - 17th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2018 445 - 450 September 2018

Language: English; Publication type: Research paper (conference proceedings and other materials)

Monitoring illegal explosions is mandatory for the safety of human life and the environment, and for protecting important structures such as the High Dam in Egypt. This kind of monitoring can be accomplished by detecting and identifying the explosions. If an illegal explosion such as a quarry blast happens, an alarm should be reported to the government so that it can take immediate action. However, the main problem is that many measured signals from explosions are similar in shape to those from earthquakes, and the two cannot easily be differentiated from each other. Also, incorrect classification may distort the real seismicity of the region. This problem motivates us to search for unique discriminating features that distinguish between earthquakes and explosions with high accuracy. Therefore, in this paper, we propose to extract the discriminative features with an autoencoder from the first few seconds after the P-wave arrival time of the event. The discriminative features are found to be in the first 60 samples after the arrival time of the P-wave. Thus the first stage of the proposed algorithm is extracting the discriminative features via the autoencoder. Then, a softmax layer classifies the event based on these extracted features. The proposed algorithm achieves a classification accuracy of 98.55% when applied to 900 earthquake and quarry blast waveforms recorded by the Egyptian National Seismic Network (ENSN).

DOI: 10.1109/ICIS.2018.8466464
Analyzing resource trade-offs in hardware overprovisioned supercomputers

Ryuichi Sakamoto, Tapasya Patki, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Daniel Ellsworth, Barry Rountree, Martin Schulz

Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018 526 - 535 August 2018

Language: English; Publication type: Research paper (conference proceedings and other materials)

Hardware overprovisioned systems have recently been proposed as a viable alternative for a power-efficient design of next-generation supercomputers. A key challenge for such systems is to determine the degree of overprovisioning, which refers to the number of extra nodes that need to be installed under a given power constraint. In this paper, we first show that the degree of overprovisioning depends on dynamic parameters, such as the job mix as well as the global power constraint, and that static decisions can result in limited system throughput. We then study an exhaustive combination of adaptive resource management strategies that span three job scheduling algorithms, four power capping techniques, and three node boot-up mechanisms to understand the trade-off space involved. We then draw conclusions about how these strategies can adaptively control the degree of overprovisioning and analyze their impact on job throughput and power utilization.

DOI: 10.1109/IPDPS.2018.00062
Automatic Arrival Time Detection for Earthquakes Based on Stacked Denoising Autoencoder.

Omar M. Saad, Koji Inoue, Ahmed Shalaby, Lotfy Samy, Mohammed Sharaf Sayed

IEEE Geoscience and Remote Sensing Letters 15 ( 11 ) 1687 - 1691 August 2018

Language: Other; Publication type: Research paper (scientific journal)

DOI: 10.1109/LGRS.2018.2861218
VMOR: Microarchitectural Support for Operand Access in an Interpreter.

Susumu Mashimo, Ryota Shioya, Koji Inoue

IEEE Computer Architecture Letters 17 ( 2 ) 217 - 220 August 2018

Language: Other; Publication type: Research paper (scientific journal)

DOI: 10.1109/LCA.2018.2866243
An Integrated Nanophotonic Parallel Adder (peer-reviewed, international journal)

Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, and Masaya Notomi

ACM Journal on Emerging Technologies in Computing Systems (JETC) July 2018

Language: English; Publication type: Research paper (scientific journal)
VMOR: Microarchitectural Support for Operand Access in an Interpreter (peer-reviewed, international journal)

Susumu Mashimo, Ryota Shioya, Koji Inoue

IEEE Computer Architecture Letters 17 ( 2 ) 217 - 220 July 2018

Language: English; Publication type: Research paper (scientific journal)

DOI: 10.1109/LCA.2018.2866243
Ultralow-latency optical circuit based on optical pass gate logic (peer-reviewed)

Akihiko Shinya, Kengo Nozaki, Masaya Notomi, Tohru Ishihara, Koji Inoue

NTT Technical Review 16 ( 7 ) 33 - 38 July 2018

Language: English

A novel light speed computing technology has been developed by NTT, Kyoto University, and Kyushu University that employs nanophotonic technology in critical paths and thus overcomes the problem of operational latency that is the chief limiting factor in conventional electronic circuits. The ultimate objective of this work is to develop an ultrahigh-speed optoelectronic arithmetic processor. This article provides an overview of our recent work and describes the successful implementation of this novel optical computing technology.
An Integrated Nanophotonic Parallel Adder.

Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, Masaya Notomi

ACM Journal on Emerging Technologies in Computing Systems 14 ( 2 ) 26:1 - 26:20 July 2018

Language: Other; Publication type: Research paper (scientific journal)

DOI: 10.1145/3178452
An Integrated Nanophotonic Parallel Adder (peer-reviewed, international journal)

Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, and Masaya Notomi

ACM Journal on Emerging Technologies in Computing Systems (JETC) 14 ( 2 ) 26:1 - 26:20 June 2018

Language: English; Publication type: Research paper (scientific journal)
Performance Analysis of CPU and DRAM Power Constrained Systems with Magnetohydrodynamic Simulation Code.

Keiichiro Fukazawa, Masatsugu Ueda, Yuichi Inadomi, Mutsumi Aoyagi, Takayuki Umeda, Koji Inoue

20th IEEE International Conference on High Performance Computing and Communications; 16th IEEE International Conference on Smart City; 4th IEEE International Conference on Data Science and Systems (HPCC/SmartCity/DSS) 626 - 631 June 2018

Language: Other; Publication type: Research paper (conference proceedings and other materials)

DOI: 10.1109/HPCC/SmartCity/DSS.2018.00113
Autoencoder based Features Extraction for Automatic Classification of Earthquakes and Explosions.

Omar M. Saad, Koji Inoue, Ahmed Shalaby, Lotfy Samy, Mohammed Sharaf Sayed

17th IEEE/ACIS International Conference on Computer and Information Science (ICIS) 445 - 450 June 2018

Language: Other; Publication type: Research paper (conference proceedings and other materials)

DOI: 10.1109/ICIS.2018.8466464
Towards Ultra-High-Speed Cryogenic Single-Flux-Quantum Computing (invited, peer-reviewed, international journal)

Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Koji Inoue

IEICE Transactions on Electronics May 2018

Language: English; Publication type: Research paper (scientific journal)
Analyzing Resource Trade-offs in Hardware Overprovisioned Supercomputers.

Ryuichi Sakamoto, Tapasya Patki, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Daniel A. Ellsworth, Barry Rountree, Martin Schulz

2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 526 - 535 May 2018

Language: Other; Publication type: Research paper (conference proceedings and other materials)

DOI: 10.1109/IPDPS.2018.00062
Towards Ultra-High-Speed Cryogenic Single-Flux-Quantum Computing (invited, peer-reviewed, international journal)

Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Koji Inoue

IEICE Transactions on Electronics E101C ( 5 ) 359 - 369 May 2018

Language: English; Publication type: Research paper (scientific journal)

DOI: 10.1587/transele.E101.C.359
Towards Ultra-High-Speed Cryogenic Single-Flux-Quantum Computing.

Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Koji Inoue

IEICE Transactions on Electronics 101-C ( 5 ) 359 - 369 May 2018

Language: Other; Publication type: Research paper (scientific journal)

DOI: 10.1587/transele.E101.C.359
CPCI Stack: Metric for Accurate Bottleneck Analysis on OoO Microprocessors

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

Proceedings - 2017 5th International Symposium on Computing and Networking, CANDAR 2017 166 - 172 April 2018

Language: English; Publication type: Research paper (conference proceedings and other materials)

Correctly understanding microarchitectural bottlenecks is important to optimize performance and energy of OoO (Out-of-Order) processors. Although CPI (Cycles Per Instruction) stack has been utilized for this purpose, it stacks architectural events heuristically by counting how many times the events occur, and the order of stacking affects the result, which may be misleading. It is because CPI stack does not consider the execution path of dynamic instructions. Critical path analysis (CPA) is a well-known method to identify the critical execution path of dynamic instruction execution on OoO processors. The critical path consists of the sequence of events that determines the execution time of a program on a certain processor. We develop a novel representation of CPCI stack (Cycles Per Critical Instruction stack), which is CPI stack based on CPA. The main challenge in constructing CPCI stack is how to analyze a large number of paths because CPA often results in numerous critical paths. In this paper, we show that there are more than ten to the tenth power critical paths in the execution of only one thousand instructions in 35 benchmarks out of 48 from SPEC CPU2006. Then, we propose a statistical method to analyze all the critical paths and show a case study using the benchmarks.

DOI: 10.1109/CANDAR.2017.60
Wireless Spoofing-Attack Prevention Using Radio-Propagation Characteristics

Mihiro Sonoyama, Takatsugu Ono, Osamu Muta, Haruichi Kanaya, Koji Inoue

Proceedings - 2017 IEEE 15th International Conference on Dependable, Autonomic and Secure Computing, 2017 IEEE 15th International Conference on Pervasive Intelligence and Computing, 2017 IEEE 3rd International Conference on Big Data Intelligence and Computing and 2017 IEEE Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec 2017 502 - 510 March 2018

Language: English; Publication type: Research paper (conference proceedings and other materials)

A spoofing attack is a critical issue in wireless communication in embedded systems in which a malicious transmitter outside a system pretends to be genuine. As a countermeasure against this, we propose a device-authentication method based on position identification using radio-propagation characteristics (RPCs). Since RPCs are natural phenomena, this method does not depend on information processing such as encryption technology. We call the space from which attacks can succeed the "attack space". By formulating the relationship between combinations of transceivers and the attack space, this method can be used in embedded systems. In this research, we consider two RPCs, the received signal strength ratio (RSSR) and the time difference of arrival (TDoA), and construct an attack-space model which uses these RPCs simultaneously to prevent wireless spoofing attacks. We explain the results of a validity evaluation of the proposed model based on radio-wave-propagation simulation assuming free space and a noisy environment.

DOI: 10.1109/DASC-PICom-DataCom-CyberSciTec.2017.94
Wireless Spoofing-Attack Prevention Using Radio-Propagation Characteristics

Mihiro Sonoyama, Takatsugu Ono, Osamu Muta, Haruichi Kanaya, Koji Inoue

Proceedings - 2017 IEEE 15th International Conference on Dependable, Autonomic and Secure Computing, 2017 IEEE 15th International Conference on Pervasive Intelligence and Computing, 2017 IEEE 3rd International Conference on Big Data Intelligence and Computing and 2017 IEEE Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec 2017 2018-January 502 - 510 March 2018

Language: Other; Publication type: Research paper (conference proceedings and other materials)

© 2017 IEEE. A spoofing attack is a critical issue in wireless communication in embedded systems in which a malicious transmitter outside a system pretends to be genuine. As a countermeasure against this, we propose a device-authentication method based on position identification using radio-propagation characteristics (RPCs). Since RPCs are natural phenomena, this method does not depend on information processing such as encryption technology. We call the space from which attacks can succeed the "attack space". By formulating the relationship between combinations of transceivers and the attack space, this method can be used in embedded systems. In this research, we consider two RPCs, the received signal strength ratio (RSSR) and the time difference of arrival (TDoA), and construct an attack-space model which uses these RPCs simultaneously to prevent wireless spoofing attacks. We explain the results of a validity evaluation of the proposed model based on radio-wave-propagation simulation assuming free space and a noisy environment.

DOI: 10.1109/DASC-PICom-DataCom-CyberSciTec.2017.94
Low-latency optical parallel adder based on a binary decision diagram with wavelength division multiplexing scheme

A. Shinya, T. Ishihara, K. Inoue, K. Nozaki, S. Kita, M. Notomi

Optical Data Science: Trends Shaping the Future of Photonics 2018 January 2018

Language: English; Publication type: Research paper (conference proceedings and other materials)

We propose an optical parallel adder based on a binary decision diagram that can calculate simply by propagating light through electrically controlled optical pass gates. The CARRY and inverted-CARRY operations are multiplexed in one circuit by a wavelength division multiplexing scheme to reduce the number of optical elements, and only a single gate constitutes the critical path for one digit calculation. The processing time reaches picoseconds per digit when we use 100-μm-long optical pass gates, which is ten times faster than a CMOS circuit.

DOI: 10.1117/12.2296842
Ultralow latency computation based on integrated nanophotonics

Masaya Notomi, Kengo Nozaki, Shota Kita, Akihiko Shinya, Tohru Ishihara, Koji Inoue

JSAP-OSA Joint Symposia, JSAP 2018 January 2018

Language: English; Publication type: Research paper (conference proceedings and other materials)

Moore's law for CMOS computers is still continuing, but its near-future saturation is now being discussed. One serious form of saturation concerns latency. The computation delay for a CMOS transistor is already saturated above 10 ps, which will be problematic when ultralow-latency response is required for broad-band data streams, even with parallelization or pipeline processing. We believe that optical circuits may serve as ultralow-latency computation circuits if they are small enough and tightly combined with electronic circuits. The former requires nanophotonic devices/circuits and the latter requires OE/EO conversion with ultrasmall capacitance.
Dependence Graph Model for Accurate Critical Path Analysis on Out-of-Order Processors

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

Journal of Information Processing December 2017

Language: English
Dependence Graph Model for Accurate Critical Path Analysis on Out-of-Order Processors.

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

Journal of Information Processing 25 983 - 992 December 2017

Language: Other; Publication type: Research paper (scientific journal)

DOI: 10.2197/ipsjjip.25.983
CPCI Stack: Metric for Accurate Bottleneck Analysis on OoO Microprocessors

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

2017 Fifth International Symposium on Computing and Networking (CANDAR) 166 - 172 November 2017

Language: Other; Publication type: Research paper (conference proceedings and other materials)

DOI: 10.1109/candar.2017.60
Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework

Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Tapasya Patki, Daniel Ellsworth, Barry Rountree, Martin Schulz

Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017 957 - 966 June 2017

Language: English; Publication type: Research paper (conference proceedings and other materials)

Limited power budgets will be one of the biggest challenges for deploying future exascale supercomputers. One of the promising ways to deal with this challenge is hardware overprovisioning, that is, installing more hardware resources than can be fully powered under a given power limit, coupled with software mechanisms to steer the limited power to where it is needed most. Prior research has demonstrated the viability of this approach, but could only rely on small-scale simulations of the software stack. While such research is useful to understand the boundaries of performance benefits that can be achieved, it does not cover any deployment or operational concerns of using overprovisioning on production systems. This paper is the first to present an extensible power-aware resource management framework for production-sized overprovisioned systems based on the widely established SLURM resource manager. Our framework provides flexible plugin interfaces and APIs for power management that can be easily extended to implement site-specific strategies and for comparison of different power management techniques. We demonstrate our framework on a 965-node HA8000 production system at Kyushu University. Our results indicate that it is indeed possible to safely overprovision hardware in production. We also find that the power consumption of idle nodes, which depends on the degree of overprovisioning, can become a bottleneck. Using real-world data, we then draw conclusions about the impact of the total number of nodes provided in an overprovisioned environment.

DOI: 10.1109/IPDPS.2017.107
Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework.

Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Tapasya Patki, Daniel A. Ellsworth, Barry Rountree, Martin Schulz

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 957 - 966 May 2017

Language: Other; Publication type: Research paper (conference proceedings and other materials)

DOI: 10.1109/IPDPS.2017.107
Enhanced Dependence Graph Model for Critical Path Analysis on Modern Out-of-Order Processors.

Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Hiroshi Sasaki

IEEE Computer Architecture Letters 16(2), pp. 111-114, March 2017

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1109/LCA.2017.2684813
Enhanced Dependence Graph Model for Critical Path Analysis on Modern Out-of-Order Processors

Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Hiroshi Sasaki

IEEE Computer Architecture Letters, March 2017

Language: English
Architecture Exploration of Microprocessors for Single-Flux-Quantum Circuits

Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Koji Inoue

IPSJ Journal, March 2017

Language: Japanese
Preface (peer-reviewed)

Jens Knoop, Wolfgang Karl, Martin Schulz, Koji Inoue

30th International Conference on Architecture of Computing Systems, ARCS 2017. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10172 LNCS, January 2017

Language: English
Power-Efficient Breadth-First Search with DRAM Row Buffer Locality-Aware Address Mapping

Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

2016 High Performance Graph Data Management and Processing, HPGDMP 2016. Proceedings of HPGDMP 2016: High Performance Graph Data Management and Processing - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 17-24, January 2017

Language: English; Publication type: Research paper (other academic conference materials)

Graph analysis applications have been widely used in real services such as road-traffic analysis and social network services. Breadth-first search (BFS) is one of the most representative algorithms for such applications; therefore, many researchers have tuned it to maximize performance. On the other hand, owing to the strict power constraints of modern HPC systems, it is necessary to improve power efficiency (i.e., performance per watt) when executing BFS. In this work, we focus on the power efficiency of DRAM and investigate the memory access pattern of a state-of-the-art BFS implementation using a cycle-accurate processor simulator. The results reveal that the conventional address mapping schemes of modern memory controllers do not efficiently exploit row buffers in DRAM. Thus, we propose a new scheme called per-row channel interleaving and improve the DRAM power efficiency by 30.3% compared to a conventional scheme for a certain simulator setting. Moreover, we demonstrate that this proposed scheme is effective for various configurations of memory controllers.

DOI： 10.1109/HPGDMP.2016.010
Evaluating the impacts of code-level performance tunings on power efficiency

Satoshi Imamura, Keitaro Oka, Yuichiro Yasui, Yuichi Inadomi, Katsuki Fujisawa, Toshio Endo, Koji Ueno, Keiichiro Fukazawa, Nozomi Hata, Yuta Kakibuka, Koji Inoue, Takatsugu Ono

2016 IEEE International Conference on Big Data (Big Data), pp. 362-369, December 2016

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/bigdata.2016.7840624
Single-flux-quantum cache memory architecture

Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Koji Inoue

13th International SoC Design Conference, ISOCC 2016. ISOCC 2016 - International SoC Design Conference: Smart SoC for Intelligent Things, pp. 105-106, December 2016

Language: English; Publication type: Research paper (other academic conference materials)

Single-flux-quantum (SFQ) logic is a promising technology for realizing microprocessors that operate at over 100 GHz, owing to its ultra-high-speed and ultra-low-power nature. Although previous work has demonstrated a prototype SFQ microprocessor, the SFQ-based L1 cache memory has not been well optimized: it suffers from a large access latency and strictly limited scalability. This paper proposes a novel SFQ cache architecture to support fast accesses. The sub-arrayed structure applied to the cache provides better scalability in terms of capacity. Evaluation results show that the proposed cache achieves 1.8X faster access speed.

DOI： 10.1109/ISOCC.2016.7799755
An integrated optical parallel adder as a first step towards light speed data processing

Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, Masaya Notomi

13th International SoC Design Conference, ISOCC 2016. ISOCC 2016 - International SoC Design Conference: Smart SoC for Intelligent Things, pp. 123-124, December 2016

Language: English; Publication type: Research paper (other academic conference materials)

Integrated optical circuits with nanophotonic devices have attracted significant attention due to their low power dissipation and light-speed operation. With light interference and resonance phenomena, the nanophotonic device works as a voltage-controlled optical pass-gate, like a pass-transistor. This paper first introduces the concept of optical pass-gate logic and then proposes a parallel adder circuit based on it. Experimental results obtained with an optoelectronic circuit simulator show the advantages of our optical parallel adder circuit over a traditional CMOS-based parallel adder circuit.

DOI： 10.1109/ISOCC.2016.7799721
Power-Efficient Breadth-First Search with DRAM Row Buffer Locality-Aware Address Mapping

Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

2016 High Performance Graph Data Management and Processing Workshop (HPGDMP), pp. 17-24, November 2016

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/hpgdmp.2016.010
Accuracy analysis of machine learning-based performance modeling for microprocessors

Yoshihiro Tanaka, Keitaro Oka, Takatsugu Ono, Koji Inoue

4th International Japan-Egypt Conference on Electronic, Communication and Computers, JEC-ECC 2016. Proceedings of the 2016 4th International Japan-Egypt Conference on Electronic, Communication and Computers, JEC-ECC 2016, pp. 83-86, July 2016

Language: English; Publication type: Research paper (other academic conference materials)

This paper analyzes the accuracy of performance models generated by a machine learning-based empirical modeling methodology. Although the accuracy strongly depends on the quality of the learning procedure, it is not clear what kind of learning algorithms and training data sets (or features) should be used. This paper comprehensively explores the learning space of processor performance modeling as a case study. We focus on static architectural parameters, such as cache size and clock frequency, as the training data set. Experimental results show that tree-based non-linear regression modeling is superior to stepwise linear regression modeling. Another observation is that clock frequency is the most important feature for improving prediction accuracy.

DOI： 10.1109/JEC-ECC.2016.7518973
An integrated optical parallel adder as a first step towards light speed data processing.

Tohru Ishihara, Akihiko Shinya, Koji Inoue, Kengo Nozaki, Masaya Notomi

International SoC Design Conference (ISOCC), pp. 123-124, July 2016

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ISOCC.2016.7799721
From FLOPS to BYTES: disruptive change in high-performance computing towards the post-Moore era.

Satoshi Matsuoka, Hideharu Amano, Kengo Nakajima, Koji Inoue, Tomohiro Kudoh, Naoya Maruyama, Kenjiro Taura, Takeshi Iwashita, Takahiro Katagiri, Toshihiro Hanawa, Toshio Endo

Proceedings of the ACM International Conference on Computing Frontiers, pp. 274-281, May 2016

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1145/2903150.2906830
Accuracy analysis of machine learning-based performance modeling for microprocessors

Yoshihiro Tanaka, Keitaro Oka, Takatsugu Ono, Koji Inoue

2016 Fourth International Japan-Egypt Conference on Electronics, Communications and Computers (JEC-ECC), May 2016

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/jec-ecc.2016.7518973
From FLOPS to BYTES: Disruptive change in high-performance computing towards the post-Moore era

Satoshi Matsuoka, Hideharu Amano, Kengo Nakajima, Koji Inoue, Tomohiro Kudoh, Naoya Maruyama, Kenjiro Taura, Takeshi Iwashita, Takahiro Katagiri, Toshihiro Hanawa, Toshio Endo

ACM International Conference on Computing Frontiers, CF 2016. 2016 ACM International Conference on Computing Frontiers - Proceedings, pp. 274-281, May 2016

Language: English; Publication type: Research paper (other academic conference materials)

The slowdown and inevitable end of exponential scaling of processor performance, the end of the so-called "Moore's Law", is predicted to occur around the 2025-2030 timeframe. Because CMOS semiconductor voltage is also approaching its limits, this means that logic transistor power will become constant, and as a result, the system FLOPS will cease to improve, resulting in serious consequences for IT in general, especially supercomputing. Existing attempts to overcome the end of Moore's law are rather limited in their future outlook or applicability. We claim that data-oriented parameters, such as bandwidth and capacity, or BYTES, are the new parameters that will allow continued performance gains for periods even after computing performance or FLOPS ceases to improve, due to continued advances in storage device technologies and optics, and manufacturing technologies including 3-D packaging. Such a transition from FLOPS to BYTES will lead to disruptive changes in the overall systems, from applications, algorithms, and software to architecture, as to what parameter to optimize for in order to achieve continued performance growth over time. We are launching a new set of research efforts to investigate and devise new technologies to enable such disruptive changes from FLOPS to BYTES in the Post-Moore era, focusing on HPC, where there is extreme sensitivity to performance, and expect the results to disseminate to the rest of IT.

DOI： 10.1145/2903150.2906830
Evaluating the impacts of code-level performance tunings on power efficiency

Satoshi Imamura, Keitaro Oka, Yuichiro Yasui, Yuichi Inadomi, Katsuki Fujisawa, Toshio Endo, Koji Ueno, Keiichiro Fukazawa, Nozomi Hata, Yuta Kakibuka, Koji Inoue, Takatsugu Ono

4th IEEE International Conference on Big Data, Big Data 2016. Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016, pp. 362-369, January 2016

Language: English; Publication type: Research paper (other academic conference materials)

As the power consumption of HPC systems will be a primary constraint for exascale computing, a main objective in HPC communities has recently become maximizing power efficiency (i.e., performance per watt) rather than performance. Although programmers have spent considerable effort to improve performance by tuning HPC programs at the code level, tunings for improving power efficiency are now required. In this work, we select two representative HPC programs (Graph500 and SDPARA) and evaluate how traditional code-level performance tunings applied to these programs affect power efficiency. We also investigate the impacts of the tunings on power efficiency at various operating frequencies of CPUs and/or GPUs. The results show that the tunings significantly improve power efficiency, and different types of tunings exhibit different trends in power efficiency as CPU frequency varies. Finally, the scalability and power efficiency of state-of-the-art Graph500 implementations are explored on both a single-node platform and a 960-node supercomputer. With their high scalability, they achieve 27.43 MTEPS/Watt with 129.76 GTEPS on the single-node system and 4.39 MTEPS/Watt with 1,085.24 GTEPS on the supercomputer.
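
As a rough worked example of the performance-per-watt metric quoted in the abstract above, the following sketch back-computes the average power draw implied by the reported Graph500 figures. It assumes that MTEPS/Watt is simply throughput divided by average power; the function name `implied_power_watts` is hypothetical and does not come from the paper.

```python
def implied_power_watts(gteps: float, mteps_per_watt: float) -> float:
    """Return the average power (W) implied by a throughput and an efficiency figure.

    Assumption: efficiency [MTEPS/W] = throughput [MTEPS] / average power [W].
    """
    mteps = gteps * 1000.0  # 1 GTEPS = 1000 MTEPS
    return mteps / mteps_per_watt


# Figures quoted in the abstract.
single_node = implied_power_watts(129.76, 27.43)
supercomputer = implied_power_watts(1085.24, 4.39)

print(f"single-node system: about {single_node:.0f} W")
print(f"960-node supercomputer: about {supercomputer / 1000:.1f} kW")
```

Under that assumption, the larger system delivers roughly eight times the throughput while drawing far more than eight times the power, which is why its MTEPS/Watt figure is lower.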

DOI： 10.1109/BigData.2016.7840624
Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing

Yuichi Inadomi, Tapasya Patki, Koji Inoue, Mutsumi Aoyagi, Barry Rountree, Martin Schulz, David Lowenthal, Yasutaka Wada, Keiichiro Fukazawa, Masatsugu Ueda, Masaaki Kondo, Ikuo Miyoshi

International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015. Proceedings of SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis, November 2015

Language: English; Publication type: Research paper (other academic conference materials)

A key challenge in next-generation supercomputing is to effectively schedule limited power resources. Modern processors suffer from increasingly large power variations due to the chip manufacturing process. These variations lead to power inhomogeneity in current systems and manifest as performance inhomogeneity in power-constrained environments, drastically limiting supercomputing performance. We present a first-of-its-kind study on manufacturing variability on four production HPC systems spanning four microarchitectures, analyze its impact on HPC applications, and propose a novel variation-aware power budgeting scheme to maximize effective application performance. Our low-cost and scalable budgeting algorithm strives to achieve performance homogeneity under a power constraint by deriving application-specific, module-level power allocations. Experimental results using a 1,920-socket system show up to 5.4X speedup, with an average speedup of 1.8X across all benchmarks when compared to a variation-unaware power allocation scheme.

DOI： 10.1145/2807591.2807638
Message from the IEEE MCSoC-15 Program Co-Chairs (peer-reviewed)

José Ayala, Fumio Arakawa, Koji Inoue

9th IEEE International Symposium on Embedded Multicore/Manycore SoCs, MCSoC 2015. Proceedings - IEEE 9th International Symposium on Embedded Multicore/Manycore SoCs, MCSoC 2015, p. xi, November 2015

Language: English

DOI： 10.1109/MCSoC.2015.5
Characterization and cross-platform analysis of high-throughput accelerators

Keitaro Oka, Wenhao Jia, Margaret Martonosi, Koji Inoue

2015 15th IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2015. ISPASS 2015 - IEEE International Symposium on Performance Analysis of Systems and Software, pp. 161-162, April 2015

Language: English; Publication type: Research paper (other academic conference materials)

Today's computer systems often employ high-throughput accelerators (such as Intel Xeon Phi coprocessors and NVIDIA Tesla GPUs) to improve the performance of some applications or portions of applications. While such accelerators are useful for suitable applications, it remains challenging to predict which workloads will run well on these platforms and to predict the resulting performance trends for varying input.

DOI： 10.1109/ISPASS.2015.7095797
Characterization and cross-platform analysis of high-throughput accelerators.

Keitaro Oka, Wenhao Jia, Margaret Martonosi, Koji Inoue

2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 161-162, April 2015

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ISPASS.2015.7095797
A flexible hardware barrier mechanism for many-core processors

Takeshi Soga, Hiroshi Sasaki, Tomoya Hirao, Masaaki Kondo, Koji Inoue

2015 20th Asia and South Pacific Design Automation Conference, ASP-DAC 2015. 20th Asia and South Pacific Design Automation Conference, ASP-DAC 2015, pp. 61-68, March 2015

Language: English; Publication type: Research paper (other academic conference materials)

This paper proposes a new hardware barrier mechanism which offers the flexibility to select which cores should join the synchronization, allowing for executing multiple multi-threaded applications by dividing a many-core processor into several groups. Experimental results based on an RTL simulation show that our hardware barrier achieves a 66-fold reduction in latency over typical software based implementations, with a hardware overhead of the processor of only 1.8%. Additionally, we demonstrate that the proposed mechanism is sufficiently flexible to cover a variety of core groups with minimal hardware overhead.

DOI： 10.1109/ASPDAC.2015.7058982
A flexible hardware barrier mechanism for many-core processors.

Takeshi Soga, Hiroshi Sasaki, Tomoya Hirao, Masaaki Kondo, Koji Inoue

The 20th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 61-68, January 2015

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ASPDAC.2015.7058982
Power-capped DVFS and thread allocation with ANN models on modern NUMA systems

Satoshi Imamura, Hiroshi Sasaki, Koji Inoue, Dimitrios S. Nikolopoulos

32nd IEEE International Conference on Computer Design, ICCD 2014. 2014 32nd IEEE International Conference on Computer Design, ICCD 2014, pp. 324-331, December 2014

Language: English; Publication type: Research paper (other academic conference materials)

Power capping is an essential function for efficient power budgeting and cost management on modern server systems. Contemporary server processors operate under power caps by using dynamic voltage and frequency scaling (DVFS). However, these processors are often deployed in non-uniform memory access (NUMA) architectures, where thread allocation between cores may significantly affect performance and power consumption. This paper proposes a method which maximizes performance under power caps on NUMA systems by dynamically optimizing two knobs: DVFS and thread allocation. The method selects the optimal combination of the two knobs with models based on artificial neural network (ANN) that captures the nonlinear effect of thread allocation on performance. We implement the proposed method as a runtime system and evaluate it with twelve multithreaded benchmarks on a real AMD Opteron based NUMA system. The evaluation results show that our method outperforms a naive technique optimizing only DVFS by up to 67.1%, under a power cap.

DOI： 10.1109/ICCD.2014.6974701
Power-capped DVFS and thread allocation with ANN models on modern NUMA systems.

Satoshi Imamura, Hiroshi Sasaki, Koji Inoue, Dimitrios S. Nikolopoulos

32nd IEEE International Conference on Computer Design (ICCD), pp. 324-331, December 2014

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ICCD.2014.6974701
Power Consumption Evaluation of an MHD Simulation with CPU Power Capping.

Keiichiro Fukazawa, Masatsugu Ueda, Mutsumi Aoyagi, Tomonori Tsuhata, Kyohei Yoshida, Aruta Uehara, Masakazu Kuze, Yuichi Inadomi, Koji Inoue

14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 612-617, July 2014

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/CCGrid.2014.47
Power and Performance Characterization and Modeling of GPU-Accelerated Systems.

Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji Inoue, Masato Edahiro, Martin Peres

2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS), pp. 113-122, May 2014

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/IPDPS.2014.23
Power and performance characterization and modeling of GPU-accelerated systems

Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji Inoue, Masato Edahiro, Martin Peres

28th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2014. Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS 2014, pp. 113-122, January 2014

Language: English; Publication type: Research paper (other academic conference materials)

Graphics processing units (GPUs) provide an order-of-magnitude improvement on peak performance and performance-per-watt as compared to traditional multicore CPUs. However, GPU-accelerated systems currently lack a generalized method of power and performance prediction, which prevents system designers from an ultimate goal of dynamic power and performance optimization. This is due to the fact that their power and performance characteristics are not well captured across architectures, and as a result, existing power and performance modeling approaches are only available for a limited range of particular GPUs. In this paper, we present power and performance characterization and modeling of GPU-accelerated systems across multiple generations of architectures. Characterization and modeling both play a vital role in optimization and prediction of GPU-accelerated systems. We quantify the impact of voltage and frequency scaling on each architecture with a particularly intriguing result that a cutting-edge Kepler-based GPU achieves energy saving of 75% by lowering GPU clocks in the best scenario, while Fermi- and Tesla-based GPUs achieve no greater than 40% and 13%, respectively. Considering these characteristics, we provide statistical power and performance modeling of GPU-accelerated systems simplified enough to be applicable for multiple generations of architectures. One of our findings is that even simplified statistical models are able to predict power and performance of cutting-edge GPUs within errors of 20% to 30% for any set of voltage and frequency pair.

DOI： 10.1109/IPDPS.2014.23
Power consumption evaluation of an MHD simulation with CPU power capping

Keiichiro Fukazawa, Masatsugu Ueda, Mutsumi Aoyagi, Tomonori Tsuhata, Kyohei Yoshida, Aruta Uehara, Masakazu Kuze, Yuichi Inadomi, Koji Inoue

14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2014. Proceedings - 14th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2014, pp. 612-617, January 2014

Language: English; Publication type: Research paper (other academic conference materials)

Recently, power consumption has become an important issue in achieving exaflops-class next-generation computer systems. On the other hand, the power consumption characteristics of application programs have not received much consideration. In this study, we examine the power characteristics of our magnetohydrodynamic (MHD) simulation code for the global magnetosphere to evaluate its power consumption behavior under CPU power capping on a parallel computer system. As a result, we confirm that the MHD simulation code contains parts with different power consumption, for which execution performance either decreases or does not change under CPU power capping. This indicates the potential for performance optimization with power capping.

DOI： 10.1109/CCGrid.2014.47
Coordinated power-performance optimization in manycores

Hiroshi Sasaki, Satoshi Imamura, Koji Inoue

22nd International Conference on Parallel Architectures and Compilation Techniques, PACT 2013. PACT 2013 - Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pp. 51-61, November 2013

Language: English; Publication type: Research paper (other academic conference materials)

Optimizing performance in multiprogrammed environments, especially for workloads composed of multi-threaded programs, is a desired feature of runtime management systems in future manycore processors. At the same time, power capping capability is required in order to improve the reliability of microprocessor chips while reducing the costs of power supply and thermal budgeting. This paper presents a sophisticated runtime coordinated power-performance management system called C-3PO, which optimizes the performance of manycore processors under a power constraint by controlling two software knobs: thread packing, and dynamic voltage and frequency scaling (DVFS). The proposed solution distributes the power budget to each program by controlling the workload threads to be executed with an appropriate number of cores and operating frequency. The power budget is distributed carefully in different forms (number of allocated cores or operating frequency) depending on the power-performance characteristics of the workload so that each program can effectively convert the power into performance. The proposed system is based on a heuristic algorithm which relies on runtime prediction of power and performance via hardware performance monitoring units. Empirical results on a 64-core platform show that C-3PO well outperforms traditional counterparts across various PARSEC workload mixes.

DOI： 10.1109/PACT.2013.6618803
Hybrid compile and run-time memory management for a 3D-stacked reconfigurable accelerator.

Lovic Gauthier, Shinya Ueno, Koji Inoue

International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), p. 10, November 2013

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/CASES.2013.6662514
Static Mapping of Multiple Data-Parallel Applications on Embedded Many-Core SoCs.

Junya Kaida, Yuko Hara-Azumi, Takuji Hieda, Ittetsu Taniguchi, Hiroyuki Tomiyama, Koji Inoue

IEICE Transactions on Information & Systems 96-D(10), pp. 2268-2271, October 2013

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1587/transinf.E96.D.2268
Coordinated power-performance optimization in manycores.

Hiroshi Sasaki, Satoshi Imamura, Koji Inoue

Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 51-61, October 2013

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/PACT.2013.6618803
A Prototype System for Many-core Architecture SMYLEref with FPGA Evaluation Boards

Son-Truong Nguyen, Masaaki Kondo, Tomoya Hirao, Koji Inoue

IEICE Transactions on Information and Systems, August 2013

Language: English
A Prototype System for Many-Core Architecture SMYLEref with FPGA Evaluation Boards.

Son-Truong Nguyen, Masaaki Kondo, Tomoya Hirao, Koji Inoue

IEICE Transactions on Information & Systems 96-D(8), pp. 1645-1653, August 2013

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1587/transinf.E96.D.1645
Many-core acceleration for model predictive control systems.

Satoshi Kawakami, Akihito Iwanaga, Koji Inoue

Proceedings of the 1st International Workshop on Many-core Embedded Systems 2013 (MES), pp. 17-24, June 2013

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1145/2489068.2489071
Line sharing cache: Exploring cache capacity with frequent line value locality

Keitarou Oka, Hiroshi Sasaki, Koji Inoue

2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013, pp. 669-674, May 2013

Language: English; Publication type: Research paper (other academic conference materials)

This paper proposes a new last-level cache architecture called line sharing cache (LSC), which can reduce the number of cache misses without increasing the size of the cache memory. It stores lines that contain identical values in a single line entry, which makes it possible to store a greater number of lines. Evaluation results show performance improvements of up to 35% across a set of SPEC CPU2000 benchmarks.
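
The capacity effect described in the abstract above can be illustrated with a toy count of data entries. This is only a sketch of the idea that lines holding identical values can share one entry: the abstract does not describe the actual LSC tag and data organization, and the function name `entries_needed` and the sample addresses and values are hypothetical.

```python
from typing import Dict, Tuple


def entries_needed(lines: Dict[int, bytes]) -> Tuple[int, int]:
    """Return (conventional, shared) data-entry counts for the given cache lines.

    `lines` maps a line address to the line's contents.
    """
    conventional = len(lines)          # one data entry per cached address
    shared = len(set(lines.values()))  # one data entry per distinct line value
    return conventional, shared


# Four 64-byte lines, three of which hold the same all-zero value.
lines = {
    0x000: bytes(64),
    0x040: bytes(64),
    0x080: bytes([1] * 64),
    0x0C0: bytes(64),
}
conventional, shared = entries_needed(lines)
print(f"conventional: {conventional} entries, with sharing: {shared} entries")
```

In this toy case the entries freed by sharing could hold additional lines, which is the mechanism by which the abstract says misses are reduced without enlarging the cache.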

DOI： 10.1109/ASPDAC.2013.6509677
SMYLEref: A reference architecture for manycore-processor SoCs

M. Kondo, S. T. Nguyen, T. Hirao, T. Soga, H. Sasaki, K. Inoue

2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013, pp. 561-564, May 2013

Language: English; Publication type: Research paper (other academic conference materials)

Nowadays, the trend of developing microprocessors with tens of cores brings a promising prospect for embedded systems. Realizing a high-performance and low-power many-core processor is becoming a primary technical challenge. We are currently developing a many-core processor architecture for embedded systems as part of a NEDO project. This paper introduces the many-core architecture called SMYLEref along with the concept of Virtual Accelerator on Many-core, in which many cores on a chip are utilized as a hardware platform for realizing multiple virtual accelerators. We are developing its prototype system with off-the-shelf FPGA evaluation boards. In this paper, we introduce the architecture of SMYLEref and the details of the prototype system. In addition, several initial experiments with the prototype system are also presented.

DOI： 10.1109/ASPDAC.2013.6509656
SMYLE project: Toward high-performance, low-power computing on manycore-processor SoCs

Koji Inoue

2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013, pp. 558-560, May 2013

Language: English; Publication type: Research paper (other academic conference materials)

This paper introduces a manycore research project called SMYLE (Scalable ManYcore for Low Energy computing). The aims of this project are: 1) proposing a manycore SoC architecture and developing a suitable programming and execution environment, 2) designing a domain specific manycore system for emerging video mining applications, and 3) releasing developed software tools and FPGA emulation environments to accelerate manycore research and development in the community. The project started in December 2010 with full support from the New Energy and Industrial Technology Development Organization (NEDO).

DOI： 10.1109/ASPDAC.2013.6509655
Hybrid compile and run-time memory management for a 3D-stacked reconfigurable accelerator

Lovic Gauthier, Shinya Ueno, Koji Inoue

2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES 2013, January 2013

Language: English; Publication type: Research paper (other academic conference materials)

This paper presents a hybrid compile and run-time memory management technique for a 3D-stacked reconfigurable accelerator including a memory layer composed of multiple memory units whose parallel access allows a very high bandwidth. The technique inserts allocation, free and data transfers into the code for using the memory layer and avoids memory overflows by adding a limited number of additional copies to and from the host memory. When compile-time information is lacking, the technique relies on run-time decisions for controlling these memory operations. Experiments show that, compared to a pessimistic approach, the overhead for avoiding overflows can be cut on average by 27%, 45% and 63% when the size of each memory unit is respectively 1kB, 128kB and 1MB.

DOI： 10.1109/CASES.2013.6662514
SMYLEref: A reference architecture for manycore-processor SoCs.

Masaaki Kondo, Son Truong Nguyen, Tomoya Hirao, Takeshi Soga, Hiroshi Sasaki, Koji Inoue

18th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 561-564, January 2013

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ASPDAC.2013.6509656
SMYLE Project: Toward high-performance, low-power computing on manycore-processor SoCs.

Koji Inoue

18th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 558-560, January 2013

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ASPDAC.2013.6509655
Power and performance of GPU-accelerated systems: A closer look.

Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji Inoue, Masato Edahiro, Martin Peres

Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), pp. 109-110, January 2013

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/IISWC.2013.6704675
Line sharing cache: Exploring cache capacity with frequent line value locality.

Keitarou Oka, Hiroshi Sasaki, Koji Inoue

18th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 669-674, January 2013

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ASPDAC.2013.6509677
Power and performance of GPU-accelerated systems: A closer look

Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji Inoue, Masato Edahiro, Martin Peres

2013 IEEE International Symposium on Workload Characterization, IISWC 2013. Proceedings - 2013 IEEE International Symposium on Workload Characterization, IISWC 2013, pp. 109-110, January 2013

Language: English; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/IISWC.2013.6704675
Many-core acceleration for model predictive control systems

Satoshi Kawakami, Akihito Iwanaga, Koji Inoue

1st International Workshop on Many-Core Embedded Systems, MES 2013, in Conjunction with the 40th Annual IEEE/ACM International Symposium on Computer Architecture, ISCA 2013. 1st International Workshop on Many-Core Embedded Systems, MES 2013 - In Conjunction with the 40th Annual IEEE/ACM International Symposium on Computer Architecture, ISCA 2013, pp. 17-24, 2013

Language: English; Publication type: Research paper (other academic conference materials)

This paper proposes a novel many-core execution strategy for real-time model predictive controls. The key idea is to exploit predicted input values, which are produced by the model predictive control itself, to speculatively solve an optimal control problem. It is well known that control applications are not suitable for multi- or many-core processors, because feedback-loop systems inherently rely on sequential operations. Since the proposed scheme does not rely on conventional thread-/data-level parallelism, it can be easily applied to such control systems. An analytical evaluation using a real application demonstrates the potential of performance improvement achieved by the proposed speculative executions.

DOI： 10.1145/2489068.2489071
A three-dimensional integrated accelerator

Farhad Mehdipour, Krishna C. Nunna, Koji Inoue, Kazuaki J. Murakami

15th Euromicro Conference on Digital System Design, DSD 2012 Proceedings - 15th Euromicro Conference on Digital System Design, DSD 2012 148 - 151 December 2012

Language: English; Publication type: Research paper (other academic conference materials, etc.)

We propose a three-dimensional (3D) reconfigurable data-path accelerator that is capable of running partitioned large data flow graphs (DFGs) on the layers of a 3D stack, with inter-layer connections implemented by means of through-silicon vias (TSVs). A tool for mapping data flow graphs has been developed, and a key 3D-specific problem, namely routing nets on the 3D architecture, is discussed in detail as well. The experiments conducted demonstrate a smaller footprint area and higher performance for the 3D accelerator compared with its 2D counterpart.

DOI： 10.1109/DSD.2012.15
A Three-Dimensional Integrated Accelerator.

Farhad Mehdipour, Krishna Chaitanya Nunna, Koji Inoue, Kazuaki J. Murakami

15th Euromicro Conference on Digital System Design (DSD) 148 - 151 December 2012

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/DSD.2012.15
Improving performance and energy efficiency of embedded processors via post-fabrication instruction set customization.

Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki J. Murakami

The Journal of Supercomputing 60 ( 2 ) 196 - 222 November 2012

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1007/s11227-010-0505-0
Scalability-based manycore partitioning (peer-reviewed)

Hiroshi Sasaki, Koji Inoue, Teruo Tanimoto, Hiroshi Nakamura

21st International Conference on Parallel Architectures and Compilation Techniques, PACT 2012 Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT 107 - 116 October 2012

Language: English; Publication type: Research paper (academic journal)

Multicore processors have been popular for years, and the industry is gradually shifting towards the era of manycore processors. Single-thread performance of microprocessors is not growing at a historical rate, but the existence of a number of active processes in the computer system and the continuing development of multi-threaded applications benefit from the growing core counts to sustain system throughput. This trend brings us a situation where a number of parallel applications are simultaneously executed on a single system. Since each multi-threaded application tries to maximize its throughput by utilizing the whole system, it usually creates an equal or larger number of threads compared to the underlying logical core count. This introduces a much greater number of threads to be co-scheduled in the entire system. However, each program has different characteristics (or scalability) and contends with the others for shared resources, which are the CPU cores and memory hierarchies. Therefore, it is clear that OS thread scheduling will play a major role in achieving high system performance under such conditions. We develop a sophisticated scheduler that (1) dynamically predicts the scalability of programs via the use of hardware performance monitoring units, (2) decides the optimal number of cores to be allocated for each program, and (3) allocates the cores to programs while maximizing the system utilization to achieve fair and maximum performance. The evaluation results on a 48-core AMD Opteron system show improvements over the Linux scheduler for a variety of multiprogramming workloads.

DOI： 10.1145/2370816.2370833
Power and Performance Analysis of GPU-Accelerated Systems.

Yuki Abe, Hiroshi Sasaki, Martin Peres, Koji Inoue, Kazuaki J. Murakami, Shinpei Kato

2012 Workshop on Power-Aware Computing Systems (HotPower) October 2012

Language: Other; Publication type: Research paper (other academic conference materials, etc.)
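Proposal of a Technique for Improving Many-Core Processor Performance by Dynamically Changing the Core Count and Operating Frequency [コア数と動作周波数の動的変更によるメニーコア・プロセッサ性能向上手法の提案] (peer-reviewed)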

今村智史, 佐々木広, 福本尚人, 井上弘士, 村上和彰

IPSJ Transactions on Advanced Computing Systems (ACS) 5 ( 4 ) 24 - 35 August 2012

Language: Japanese
A Line-Sharing Cache Exploiting Data Value Locality [データ値の局所性を利用したライン共有キャッシュ] (peer-reviewed)

岡慶太郎, 佐々木広, 阿部祐希, 井上弘士, 村上和彰

IPSJ Transactions on Advanced Computing Systems (ACS) 5 ( 4 ) 36 - 47 August 2012

Language: Japanese
Scalability-based manycore partitioning.

Hiroshi Sasaki, Teruo Tanimoto, Koji Inoue, Hiroshi Nakamura

International Conference on Parallel Architectures and Compilation Techniques (PACT) 107 - 116 February 2012

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1145/2370816.2370833
SRAM/DRAM Hybrid Cache Architecture for 3D-Stacked LSIs [3次元積層LSI向けSRAM/DRAMハイブリッドキャッシュ・アーキテクチャ] (peer-reviewed)

上野伸也, 橋口慎哉, 福本尚人, 井上弘士, 村上和彰

IPSJ Transactions on Advanced Computing Systems (ACS) 5 ( 1 ) 41 - 52 January 2012

Language: English

Assuming the use of 3D-stacked DRAM, this paper proposes a new cache architecture that achieves high memory performance without a large increase in chip area. Using the 3D-stacked DRAM as a large-capacity cache is expected to dramatically reduce the number of off-chip memory accesses. However, enlarging the cache increases its access time, so performance degrades in some cases. To solve this problem, the proposed scheme uses the 3D-stacked DRAM cache selectively, according to the working-set size of the program being executed. A quantitative evaluation with benchmark programs shows that the proposed scheme, with dynamic control, reduces the average memory access time by 15%.

This paper proposes a novel cache architecture for 3D-implemented microprocessors. 3D-IC is one of the most interesting techniques to achieve high-performance, low-power VLSI systems. Stacking multiple dies makes it possible to implement microprocessor cores and large caches (or DRAM) into the same chip. Unfortunately, applying the 3D DRAM cache causes performance degradation for some programs, because increasing cache size makes access time longer. To tackle this issue, the proposed cache supports two operation modes: a fast but small SRAM cache mode and a slow but large DRAM cache mode. An appropriate operation mode is selected at run time based on the behavior of application programs. The evaluation results show that the proposed approach achieves a 15% improvement in memory performance.
Task mapping techniques for embedded many-core SoCs.

Junya Kaida, Takuji Hieda, Ittetsu Taniguchi, Hiroyuki Tomiyama, Yuko Hara-Azumi, Koji Inoue

International SoC Design Conference (ISOCC) 204 - 207 January 2012

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/ISOCC.2012.6407075
Task mapping techniques for embedded many-core SoCs

Junya Kaida, Takuji Hieda, Ittetsu Taniguchi, Hiroyuki Tomiyama, Yuko Hara-Azumi, Koji Inoue

2012 International SoC Design Conference, ISOCC 2012 ISOCC 2012 - 2012 International SoC Design Conference 204 - 207 2012

Language: English; Publication type: Research paper (other academic conference materials, etc.)

This paper proposes static task mapping techniques for embedded many-core SoCs. The proposed techniques take into account both task and data parallelisms of the tasks in order to efficiently utilize the potential parallelism of the many-core architecture. Two approaches are proposed for static mapping: one approach is based on integer linear programming and the other is based on a greedy algorithm. In addition, a static mapping technique considering dynamic task switching is proposed. Experimental results show the effectiveness of the proposed techniques.

DOI： 10.1109/ISOCC.2012.6407075
NSIM: An Interconnection Network Simulator for Extreme-Scale Parallel Computers (peer-reviewed)

Hideki Miwa, Ryutaro Susukita, Hidetomo Shibamura, Tomoya Hirao, Jun Maki, Makoto Yoshida, Takayuki Kando, Yuichiro Ajima, Ikuo Miyoshi, Toshiyuki Shimizu, Yuji Oinaga, Hisashige Ando, Yuichi Inadomi, Koji Inoue, Mutsumi Aoyagi, Kazuaki Murakami

IEICE Transactions on Information and Systems December 2011

Language: English; Publication type: Research paper (academic journal)
Performance evaluation of 3D stacked multi-core processors with temperature consideration

Takaaki Hanada, Hiroshi Sasaki, Koji Inoue, Kazuaki Murakami

2011 IEEE International 3D Systems Integration Conference, 3DIC 2011 December 2011

Language: English; Publication type: Research paper (other academic conference materials, etc.)

The 3D stacked multi-core processor is one of the applications of 3D integration technology. It achieves high-bandwidth access to the last-level cache and allows the number of cores to be increased while maintaining the package area. However, the temperature of 3D multi-cores increases with the number of stacked dies because of the escalating power density and thermal resistivity. Therefore, 3D multi-cores require lower clock frequencies to keep the temperature under a safe constraint, so performance is not always improved. In this paper, we evaluate the performance of 3D stacked multi-cores running under temperature constraints, and we show that there is a trade-off between clock frequency and parallel capability.

DOI： 10.1109/3DIC.2012.6263025
NSIM: An Interconnection Network Simulator for Extreme-Scale Parallel Computers.

Hideki Miwa, Ryutaro Susukita, Hidetomo Shibamura, Tomoya Hirao, Jun Maki, Makoto Yoshida, Takayuki Kando, Yuichiro Ajima, Ikuo Miyoshi, Toshiyuki Shimizu, Yuji Oinaga, Hisashige Ando, Yuichi Inadomi, Koji Inoue, Mutsumi Aoyagi, Kazuaki J. Murakami

IEICE Transactions on Information & Systems 94-D ( 12 ) 2298 - 2308 December 2011

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1587/transinf.E94.D.2298
3D implemented SRAM/DRAM hybrid cache architecture for high-performance and low power consumption

Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, Kazuaki Murakami

54th IEEE International Midwest Symposium on Circuits and Systems, MWSCAS 2011 October 2011

Language: English; Publication type: Research paper (other academic conference materials, etc.)

This paper introduces our research status focusing on 3D-implemented microprocessors. 3D-IC is one of the most interesting techniques to achieve high-performance, low-power VLSI systems. Stacking multiple dies makes it possible to implement microprocessor cores and large caches (or DRAM) into the same chip. Although this kind of integration has a great potential to bring a breakthrough in computer systems, its efficiency strongly depends on the characteristics of target application programs. Unfortunately, applying die stacking implementation causes performance degradation for some programs. To tackle this issue, we introduce a novel cache architecture consisting of a small but fast SRAM and a stacked large DRAM. The cache attempts to adapt to varying behavior of application programs in order to compensate for the negative impact of the die stacking approach.

DOI： 10.1109/MWSCAS.2011.6026484
Message from the chairs (peer-reviewed)

Naehyuck Chang, Hiroshi Nakamura, Kenichi Osada, Massimo Poncino, Koji Inoue

17th IEEE/ACM International Symposium on Low Power Electronics and Design, ISLPED 2011 Proceedings of the International Symposium on Low Power Electronics and Design iii - iv September 2011

Language: English

DOI： 10.1109/ISLPED.2011.5993616
Routing architecture and algorithms for a superconductivity circuits-based computing hardware.

Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki J. Murakami

Proceedings of the 24th Canadian Conference on Electrical and Computer Engineering (CCECE) 977 - 980 September 2011

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/CCECE.2011.6030605
A thermal-aware mapping algorithm for reducing peak temperature of an accelerator deployed in a 3D stack.

Farhad Mehdipour, Krishna Chaitanya Nunna, Lovic Gauthier, Koji Inoue, Kazuaki J. Murakami

2011 IEEE International 3D Systems Integration Conference (3DIC) 1 - 4 August 2011

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/3DIC.2012.6263034
Performance evaluation of 3D stacked multi-core processors with temperature consideration.

Takaaki Hanada, Hiroshi Sasaki, Koji Inoue, Kazuaki J. Murakami

2011 IEEE International 3D Systems Integration Conference (3DIC) 1 - 5 August 2011

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/3DIC.2012.6263025
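An On-Chip Memory Lending Method for Multicores Considering the Balance Between Computation and Memory Performance [演算/メモリ性能バランスを考慮したマルチコア向けオンチップメモリ貸与法] (peer-reviewed)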

福本尚人, 井上弘士, 村上和彰

IPSJ Transactions on Advanced Computing Systems (ACS) May 2011

Language: English
A design scheme for a reconfigurable accelerator implemented by single-flux quantum circuits.

Farhad Mehdipour, Hiroaki Honda, Koji Inoue, Hiroshi Kataoka, Kazuaki J. Murakami

Journal of Systems Architecture - Embedded Systems Design 57 ( 1 ) 169 - 179 January 2011

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1016/j.sysarc.2010.07.009
A thermal-aware mapping algorithm for reducing peak temperature of an accelerator deployed in a 3D stack

Farhad Mehdipour, Krishna Chaitanya Nunna, Lovic Gauthier, Koji Inoue, Kazuaki Murakami

2011 IEEE International 3D Systems Integration Conference, 3DIC 2011 2011

Language: English; Publication type: Research paper (other academic conference materials, etc.)

Thermal management is one of the main concerns in three-dimensional integration due to the difficulty of dissipating heat through the stack of the integrated circuit. This paper targets peak temperature reduction in a 3D stack involving a data-path accelerator, a base processor, and memory components. A mapping algorithm has been devised to distribute the operations of data flow graphs evenly over the processing elements of the target accelerator in two steps: thermal-aware partitioning of the input data flow graphs, and thermal-aware mapping of the partitions onto the processing elements. The efficiency of the proposed technique in reducing peak temperature is demonstrated through experiments.

DOI： 10.1109/3DIC.2012.6263034
Routing architecture and algorithms for a superconductivity circuits-based computing hardware

Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki Murakami

2011 Canadian Conference on Electrical and Computer Engineering, CCECE 2011 977 - 980 2011

Language: English; Publication type: Research paper (other academic conference materials, etc.)

Dedicated tools for placing and routing data flow graphs extracted from computation-intensive applications are basic requirements for developing applications on a large-scale reconfigurable data-path processor (LSRDP) implemented by superconductivity circuits. Using an alternative technology instead of CMOS circuits for implementing such hardware entails considering particular constraints and conditions from the architecture and tools development perspectives. The main contribution of this work is to introduce an operand routing network (ORN) architecture as well as algorithms for routing the nets corresponding to the edges of the data flow graphs. Further, a micro-routing algorithm is proposed for routing and configuring the ORNs internally. These algorithms have been applied on a number of data flow graphs from target applications and the results demonstrate their efficacy.

DOI： 10.1109/CCECE.2011.6030605
Hardware and software requirements for implementing a high-performance superconductivity circuits-based accelerator

Farhad Mehdipour, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

3rd Asia Symposium on Quality Electronic Design, ASQED 2011 Proceedings of the 3rd Asia Symposium on Quality Electronic Design, ASQED 2011 229 - 235 2011

Language: English; Publication type: Research paper (other academic conference materials, etc.)

The Single-Flux Quantum based large-scale data-path processor (SFQ-LSRDP) is a reconfigurable computing system implemented by means of superconductivity circuits. The SFQ-LSRDP is capable of accelerating data flow graphs (DFGs) extracted from scientific applications. Using an alternative technology instead of CMOS circuits for implementing such hardware entails considering particular constraints and conditions from the architecture and tools development perspectives. In this paper, we introduce the hardware specifications of the LSRDP and the tool chain developed for implementing applications. Placing and routing data flow graphs is a fundamental part of developing applications on the SFQ-LSRDP. Algorithms for placing DFG operations and routing the nets corresponding to the edges of data flow graphs are discussed in more detail. These algorithms have been applied to a number of data flow graphs, and the results demonstrate their efficiency. Further, simulation results demonstrate remarkable performance numbers in the range of hundreds of Gflops for the proposed architecture.

DOI： 10.1109/ASQED.2011.6111751
Mapping scientific applications on a large-scale data-path accelerator implemented by single-flux quantum (SFQ) circuits.

Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Irina Kataeva, Kazuaki J. Murakami, Hiroyuki Akaike, Akira Fujimaki

Design, Automation and Test in Europe (DATE) 993 - 996 April 2010

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/DATE.2010.5456902
Mapping scientific applications on a large-scale data-path accelerator implemented by Single-Flux Quantum (SFQ) circuits

Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Irina Kataeva, Kazuaki Murakami, Hiroyuki Akaike, Akira Fujimaki

Design, Automation and Test in Europe Conference and Exhibition, DATE 2010 DATE 10 - Design, Automation and Test in Europe 993 - 996 2010

Language: English; Publication type: Research paper (other academic conference materials, etc.)

To overcome issues originating from CMOS technology, a large-scale reconfigurable data-path (LSRDP) processor based on single-flux quantum circuits is introduced. The LSRDP augments a general-purpose processor to accelerate the execution of data flow graphs (DFGs) extracted from scientific applications. The procedure for mapping large DFGs onto the LSRDP is discussed, and our proposed techniques for reducing the area of the accelerator within the design procedure are introduced as well.

DOI： 10.1109/date.2010.5456902
Rapid Design Space Exploration of a Reconfigurable Instruction-Set Processor.

Farhad Mehdipour, Hamid Noori, Koji Inoue, Kazuaki J. Murakami

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 92-A ( 12 ) 3182 - 3192 December 2009

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1587/transfun.E92.A.3182
Rapid design space exploration of a reconfigurable instruction-set processor (peer-reviewed)

Farhad Mehdipour, Hamid Noori, Koji Inoue, Kazuaki Murakami

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E92-A ( 12 ) 3182 - 3192 December 2009

Language: English; Publication type: Research paper (academic journal)

The multitude of parameters in the design process of a reconfigurable instruction-set processor (RISP) may lead to a large design space and remarkable complexity. The quantitative design approach uses data collected from applications to satisfy design constraints and optimize the design goals while considering the applications' characteristics; however, it depends heavily on the designer's observations and analyses. Exploring the design space can be considered an effective technique for finding a proper balance among various design parameters. However, this approach is computationally expensive when the performance evaluation of the design points is based on the synthesis-and-simulation technique. A combined analytical and simulation-based model (CAnSO) is proposed and validated for performance evaluation of a typical RISP. The proposed model consists of an analytical core that incorporates statistics collected from cycle-accurate simulation to make a reasonable evaluation and provide valuable insight. CAnSO has clear speed advantages, and therefore it can be used to ease the cumbersome design space exploration of a RISP and to quickly evaluate the performance of slightly modified architectures.

DOI： 10.1587/transfun.E92.A.3182
Performance balancing: Software-based on-chip memory management for effective CMP executions

Naoto Fukumoto, Kenichi Imazato, Koji Inoue, Kazuaki Murakami

10th MEDEA Workshop on MEmory Performance: DEaling with Applications, Systems and Architecture, MEDEA '09, held in conjunction with the Int. Conf. on Parallel Architectures and Compilation Techniques, PACT 2009 Proceedings of the 10th MEDEA Workshop on MEmory Performance DEaling with Applications, Systems and Architecture, MEDEA '09, held in conjunction with the PACT 2009 Conference 28 - 34 December 2009

Language: English; Publication type: Research paper (other academic conference materials, etc.)

This paper proposes the concept of performance balancing and reports its performance impact on a chip multiprocessor (CMP). CMPs, which integrate multiple processor cores into a single chip, can achieve higher peak performance by exploiting thread-level parallelism. However, the off-chip memory bandwidth, which does not scale with the number of cores, tends to limit the potential of CMPs. To solve this issue, the technique proposed in this paper attempts to strike a good balance between computation and memorization. Unlike conventional parallel executions, this approach exploits some cores to improve the memory performance. These cores devote their on-chip memory hardware resources to the remaining cores, which execute the parallelized threads. In our evaluation, our approach achieves a 31% performance improvement over a conventional parallel execution model for the specified program.

DOI： 10.1145/1621960.1621966
Adaptive cache-line size management on 3D integrated microprocessors

Takatsugu Ono, Koji Inoue, Kazuaki Murakami

2009 International SoC Design Conference, ISOCC 2009 472 - 475 December 2009

Language: English; Publication type: Research paper (other academic conference materials, etc.)

The memory bandwidth can be dramatically improved by stacking the main memory (DRAM) on processor cores and connecting them with wide on-chip buses composed of through-silicon vias (TSVs). 3D stacking makes it possible to reduce the cache miss penalty because a large amount of data can be transferred from the main memory to the cache at a time. If a large cache line size is employed, we can expect a prefetching effect. However, it might worsen the system performance if programs do not have enough spatial locality in their memory references. To solve this problem, we introduce a software-controllable variable line-size cache scheme. In this paper, we apply it to an L1 data cache with a 3D stacked DRAM organization. In our evaluation, our approach reduces the energy consumption of the L1 data cache and stacked DRAM by up to 75% compared to a conventional cache.

DOI： 10.1109/SOCDC.2009.5423920
ALU-array based reconfigurable accelerator for energy efficient executions

Koji Inoue, Hamid Noori, Farhad Mehdipour, Takaaki Hanada, Kazuaki Murakami

2009 International SoC Design Conference, ISOCC 2009 157 - 160 December 2009

Language: English; Publication type: Research paper (other academic conference materials, etc.)

This paper introduces an energy-efficient acceleration technique for embedded microprocessors. By supporting an ALU-array based coarse-grain reconfigurable functional unit, well-customized special instructions are identified and executed for each application program. Since the reconfigurable functional unit can execute several dependent instructions (a sequence of instructions) simultaneously, the performance of the base microprocessor can be dramatically improved. In addition, this kind of direct execution is very energy efficient because it reduces the activation counts of hardware components such as the instruction cache, branch predictor, and register file.

DOI： 10.1109/SOCDC.2009.5423898
Adaptive cache-line size management on 3D integrated microprocessors

Takatsugu Ono, Koji Inoue, Kazuaki Murakami

2009 International SoC Design Conference (ISOCC) November 2009

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/socdc.2009.5423920
Performance balancing: software-based on-chip memory management for effective CMP executions.

Naoto Fukumoto, Kenichi Imazato, Koji Inoue, Kazuaki J. Murakami

MEDEA@PACT 28 - 34 September 2009

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1145/1621960.1621966
An Operand Routing Network for an SFQ Reconfigurable Data-Paths Processor (peer-reviewed)

I. Kataeva, H. Akaike, A. Fujimaki, N. Yoshikawa, N. Takagi, K. Inoue, H. Honda, and K. Murakami

IEEE Transactions on Applied Superconductivity June 2009

Language: English
An operand routing network for an SFQ reconfigurable Data-Paths processor (peer-reviewed)

Irina Kataeva, Hiroyuki Akaike, Akira Fujimaki, Nobuyuki Yoshikawa, Naofumi Takagi, Koji Inoue, Hiroaki Honda, Kazuaki Murakami

IEEE Transactions on Applied Superconductivity 19 ( 3 ) 665 - 669 June 2009

Language: English

We report the progress in the development of an Operand Routing Network (ORN) for an SFQ Reconfigurable Data-Paths processor (SFQ-RDP). The SFQ-RDP is implemented as a two-dimensional array of Floating-Point Units (FPUs), the outputs of which can be connected to the inputs of one or more FPUs in the next row via the ORN. We have considered two architectures of the ORN: one is based on NDRO switches and the other on crossbar switches. The comparison shows that the crossbar-based ORN has better performance due to its regular pipelined structure. We have designed a crossbar switch with a multicasting function and a 1-to-2 ORN prototype for a 2.5 kA/cm² process. The circuits have been experimentally tested at frequencies up to 36 GHz.

DOI： 10.1109/TASC.2009.2018534
Reducing On-Chip DRAM Energy via Data Transfer Size Optimization

Takatsugu ONO, Koji INOUE, Kazuaki MURAKAMI, Kenji YOSHIDA

IEICE Transactions on Electronics E92-C ( 4 ) 433 - 443 April 2009

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1587/transele.e92.c.433
Reducing On-Chip DRAM energy via data transfer size optimization (peer-reviewed)

Takatsugu Ono, Koji Inoue, Kazuaki Murakami, Kenji Yoshida

IEICE Transactions on Electronics E92-C ( 4 ) 433 - 443 January 2009

Language: English; Publication type: Research paper (academic journal)

This paper proposes a software-controllable variable line-size (SC-VLS) cache architecture for low-power embedded systems. High bandwidth between logic and a DRAM is realized by means of advanced integration technology. System-in-Silicon is one of the architectural frameworks that realize this high bandwidth. An ASIC and a specific SRAM are mounted onto a silicon interposer. Each chip is connected to the silicon interposer by eutectic solder bumps. In this framework, it is important to reduce the DRAM energy consumption. The specific DRAM needs a small cache memory to improve the performance. We exploit the cache to reduce the DRAM energy consumption. During program execution, the cache line size that yields the lowest cache miss ratio varies because the amount of spatial locality in memory references changes. If we employ a large cache line size, we can expect a prefetching effect. However, the DRAM energy consumption is then larger than with a small line size because a huge number of banks are accessed. The SC-VLS cache is able to change the line size to an adequate one at runtime with small area and power overheads. We analyze the adequate line size and insert line-size change instructions at the beginning of each function of a target program before executing the program. In our evaluation, the SC-VLS cache reduces the DRAM energy consumption by up to 88% compared to a conventional cache with fixed 256 B lines.

DOI： 10.1587/transele.E92.C.433
A combined analytical and simulation-based model for performance evaluation of a reconfigurable instruction set processor.

Farhad Mehdipour, Hamid Noori, Bahman Javadi, Hiroaki Honda, Koji Inoue, Kazuaki J. Murakami

Proceedings of the 14th Asia South Pacific Design Automation Conference (ASP-DAC) 564 - 569 January 2009

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/ASPDAC.2009.4796540
A combined analytical and simulation-based model for performance evaluation of a reconfigurable instruction set processor

Farhad Mehdipour, Hamid Noori, Bahman Javadi, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

Asia and South Pacific Design Automation Conference 2009, ASP-DAC 2009 Proceedings of the ASP-DAC 2009 Asia and South Pacific Design Automation Conference 2009 564 - 569 2009

Language: English; Publication type: Research paper (other academic conference materials, etc.)

Performance evaluation is a serious challenge in designing or optimizing reconfigurable instruction set processors. The conventional approaches based on synthesis and simulation are very time consuming and need considerable design effort. A combined analytical and simulation-based model (CAnSO) is proposed and validated for performance evaluation of a typical reconfigurable instruction set processor. The proposed model consists of an analytical core that incorporates statistics gathered from cycle-accurate simulation to make a reasonable evaluation and provide valuable insight. Compared to cycle-accurate simulation results, CAnSO shows a variation of almost 2% in the speedup measurement.

DOI： 10.1109/ASPDAC.2009.4796540
Foreword: Special section on hardware and software technologies on advanced microprocessors (peer-reviewed)

Koji Inoue, Koji Kai, Fumio Arakawa, Akihiko Inoue, Yoshio Hirose, Shorin Kyo, Keiji Kimura, Morihiro Kuga, Masaaki Kondo, Toshinori Sato, Makoto Satoh, Hiroyuki Tomiyama, Hiroshi Nakamura, Hiroo Hayashi, Masanori Hariyama, Hiroki Matsutani, Kunio Uchiyama

IEICE Transactions on Electronics E92-C ( 10 ) 2009

Language: English; Publication type: Research paper (academic journal)

DOI： 10.1587/transele.E92.C.1231
Analyzing the impact of data prefetching on chip multiprocessors

Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami

13th IEEE Asia-Pacific Computer Systems Architecture Conference, ACSAC 2008 November 2008

Language: English; Publication type: Research paper (other academic conference materials, etc.)

Data prefetching is a well-known approach to compensating for poor memory performance and has been employed in commercial processor chips. Although a number of prefetching techniques have been proposed so far, in many cases they assume single-core architectures. In Chip Multiprocessor (CMP) chips, there are shared resources such as L2 caches and buses. Therefore, the effect of prefetching on CMPs should differ from that on traditional single-core processors. In this paper, we analyze the effect of prefetching on CMP performance. We first classify the impact of prefetches issued during program execution. Then, we quantitatively discuss the effect of prefetching on memory performance. The experimental results show that the negative effect of invalidating prefetched cache blocks is very small. In addition, we observe that current prefetch algorithms do not effectively exploit a key feature of CMPs, namely cache-to-cache on-chip data transfer.

DOI： 10.1109/APCSAC.2008.4625454
Improved policies for Drowsy caches in embedded processors

Junpei Zushi, Gang Zeng, Hiroyuki Tomiyama, Hiroaki Takada, Koji Inoue

4th IEEE International Symposium on Electronic Design, Test and Applications, DELTA 2008 Proceedings - 4th IEEE International Symposium on Electronic Design, Test and Applications, DELTA 2008 362 - 367 September 2008

Language: English; Publication type: Research paper (other academic conference materials, etc.)

In the design of embedded systems, especially battery-powered systems, it is important to reduce energy consumption. Caches are now used not only in general-purpose processors but also in embedded processors. As feature sizes shrink, leakage energy has come to account for a significant portion of total energy consumption. To reduce the leakage energy of caches, the Drowsy cache was proposed, in which the cache lines are periodically moved to the low-leakage mode without loss of their content. However, when a cache line in the low-leakage mode is accessed, one or more clock cycles are required to transition the cache line back to the normal mode before its content can be accessed. As a result, these penalty cycles may significantly degrade the cache performance, especially in embedded processors without out-of-order execution. In this paper, we propose four mode transition policies which aim at high energy reduction with minimum performance degradation. We also compare our policies with existing policies in the context of embedded processors. Experimental results demonstrate the effectiveness of the proposed policies.

DOI： 10.1109/DELTA.2008.70
Analyzing the impact of data prefetching on Chip MultiProcessors.

Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki J. Murakami

13th Asia-Pacific Computer Systems Architecture Conference (ACSAC) September 2008

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/APCSAC.2008.4625454
Improving energy efficiency of configurable caches via temperature-aware configuration selection

Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki Murakami

IEEE Computer Society Annual Symposium on VLSI: Trends in VLSI Technology and Design, ISVLSI 2008 Proceedings - IEEE Computer Society Annual Symposium on VLSI Trends in VLSI Technology and Design, ISVLSI 2008 363 - 368 September 2008

Language: English; Publication type: Research paper (other academic conference materials, etc.)

Active power used to be the primary contributor to total power dissipation of CMOS designs, but with the technology scaling, the share of leakage in total power consumption of digital systems continues to grow. Temperature is another factor that exponentially increases the leakage current. In this paper, we show the effect of temperature on the optimal (minimum-energy-consuming) cache configuration for low energy embedded systems. Our results show that for a given application and technology, the optimal cache size moves toward smaller caches at higher temperatures, due to the larger leakage. Our results show that using a Temperature-Aware Configurable Cache (TACC), up to 61% energy can be saved for instruction cache and 77% for data cache compared to a configurable cache that has been configured for only the corner case temperature (100°C). The TACC also enhances the performance by up to 28% and 17% for the instruction and data cache, respectively.

DOI： 10.1109/ISVLSI.2008.24
Design space exploration for a coarse grain accelerator

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Koji Inoue, Kazuaki Murakami

2008 Asia and South Pacific Design Automation Conference, ASP-DAC 685 - 690 August 2008

Language: English; Publication type: Research paper (other academic conference materials, etc.)

In the design process of a reconfigurable accelerator employed in an embedded system, a multitude of parameters may result in remarkable complexity and a large design space. Design space exploration, as an alternative to the quantitative approach, can be employed to find the right balance between the different design parameters. In this paper, a hybrid approach is introduced that analytically explores the design space for a coarse grain accelerator and determines a suitable design point by quantitatively exploiting data extracted from applications. It also provides flexibility for taking into account new design constraints as well as new characteristics of applications. Furthermore, the approach is methodological, reducing the design time and yielding a design point that satisfies the design goals.

DOI： 10.1109/ASPDAC.2008.4484039
Enhancing energy efficiency of processor-based embedded systems through post-fabrication ISA extension.

Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki J. Murakami

Proceedings of the 2008 International Symposium on Low Power Electronics and Design (ISLPED) 241 - 246 August 2008

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1145/1393921.1393987
Proposal of a Desk-Side Supercomputer with Reconfigurable Data-Paths Using Rapid Single-Flux-Quantum Circuits

N. Takagi, K. Murakami, A. Fujimaki, N. Yoshikawa, K. Inoue, and H. Honda

IEICE Transactions on Electronics July 2008

Language: English
A Gravity-Directed Temporal Partitioning Approach (peer-reviewed)

F. Mehdipour, H. Noori, H. Honda, K. Inoue, and K. Murakami

IEICE Electronics Express, Vol. 5, No. 10, pp. 366-373 May 2008

Language: English; Publication type: Research paper (academic journal)
A gravity-directed temporal partitioning approach (peer-reviewed)

Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

IEICE Electronics Express 5 ( 10 ) 366 - 373 May 2008

Language: English; Publication type: Research paper (academic journal)

Reconfiguration latency has a significant impact on system performance in reconfigurable systems. A temporal partitioning approach is introduced for partitioning data flow graphs for a reconfigurable system comprising partially programmable fine-grained hardware. Residing eligibility, inspired by the law of universal gravitation, is introduced to express the eligibility of a node to stay in succeeding configurations (partitions) and to prevent it from being swapped in and out. Partitioning based on residing eligibility causes fewer nodes with different functionalities to be assigned to subsequent partitions. Thus, reconfiguration overhead time and unused hardware space both decrease because consecutive configurations share common parts.

DOI： 10.1587/elex.5.366
A gravity-directed temporal partitioning approach.

Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki J. Murakami

IEICE Electronics Express 5 ( 10 ) 366 - 373 May 2008

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1587/elex.5.366
Temperature-Aware Configurable Cache to Reduce Energy in Embedded Systems (international journal)

H. Noori, M. Goudarzi, K. Inoue, and K. Murakami

IEICE Transactions on Electronics April 2008

Language: English; Publication type: Research paper (academic journal)
Improving Energy Efficiency of Configurable Caches via Temperature-Aware Configuration Selection.

Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki J. Murakami

IEEE Computer Society Annual Symposium on VLSI (ISVLSI) 363 - 368 April 2008

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/ISVLSI.2008.24
Temperature-Aware Configurable Cache to Reduce Energy in Embedded Systems.

Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki J. Murakami

IEICE Transactions on Electronics 91-C ( 4 ) 418 - 431 April 2008

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1093/ietele/e91-c.4.418
A reconfigurable functional unit with conditional execution for multi-exit custom instructions (peer-reviewed)

Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki Murakami

IEICE Transactions on Electronics E91-C ( 4 ) 497 - 508 April 2008

Language: English; Publication type: Research paper (academic journal)

Encapsulating critical computation subgraphs as application-specific instruction set extensions is an effective technique to enhance the performance of embedded processors. However, the addition of custom functional units to the base processor is required to support the execution of these custom instructions. Although automated tools have been developed to reduce the long design time needed to produce a new extensible processor for each application, short time-to-market and significant non-recurring engineering and design costs remain issues. To address these concerns, we introduce an adaptive extensible processor in which custom instructions are generated and added after chip fabrication. To support this feature, custom functional units (CFUs) are replaced by a reconfigurable functional unit (RFU). The proposed RFU is based on a multi-cycle matrix of functional units with the capability of conditional execution. A quantitative approach is used to propose an efficient architecture for the RFU and fix its constraints. To generate more effective custom instructions, they are extended over basic blocks, and hence multi-exit custom instructions are proposed. Conditional execution has been added to the RFU to support the multi-exit feature of custom instructions. Experimental results show that multi-exit custom instructions enhance performance by an average of 67% compared to custom instructions limited to one basic block. A maximum speedup of 4.7 compared to a general embedded processor, and an average speedup of 1.85, were achieved on the MiBench benchmark suite.

DOI： 10.1093/ietele/e91-c.4.497
Proposal of a Desk-Side Supercomputer with Reconfigurable Data-Paths Using Rapid Single-Flux-Quantum Circuits.

Naofumi Takagi, Kazuaki J. Murakami, Akira Fujimaki, Nobuyuki Yoshikawa, Koji Inoue, Hiroaki Honda

IEICE Transactions on Electronics 91-C ( 3 ) 350 - 355 March 2008

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1093/ietele/e91-c.3.350
Improved Policies for Drowsy Caches in Embedded Processors.

Junpei Zushi, Gang Zeng, Hiroyuki Tomiyama, Hiroaki Takada, Koji Inoue

4th IEEE International Symposium on Electronic Design, Test and Applications (DELTA) 362 - 367 March 2008

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/DELTA.2008.70
Design space exploration for a coarse grain accelerator.

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Koji Inoue, Kazuaki J. Murakami

Proceedings of the 13th Asia South Pacific Design Automation Conference (ASP-DAC) 685 - 690 March 2008

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/ASPDAC.2008.4484039
An architecture framework for an adaptive extensible processor.

Hamid Noori, Farhad Mehdipour, Kazuaki J. Murakami, Koji Inoue, Morteza Saheb Zamani

The Journal of Supercomputing 45 ( 3 ) 313 - 340 February 2008

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1007/s11227-008-0174-4
A Reconfigurable Functional Unit with Conditional Execution for Multi-Exit Custom Instructions.

Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki J. Murakami

IEICE Transactions on Electronics 91-C ( 4 ) 497 - 508 January 2008

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1093/ietele/e91-c.4.497
Enhancing energy efficiency of processor-based embedded systems through post-fabrication ISA extension

Hamid Noori, Farhad Mehdipour, Koji Inoue, Kazuaki Murakami

ISLPED'08: 13th ACM/IEEE International Symposium on Low Power Electronics and Design ISLPED'08 Proceedings of the 2008 International Symposium on Low Power Electronics and Design 241 - 246 2008

Language: English; Publication type: Research paper (other academic conference materials, etc.)

Application-specific instruction set extension is an effective technique for reducing accesses to components such as on- and off-chip memories and the register file, and for enhancing energy efficiency. However, supporting custom instructions requires adding custom functional units to the base processor, which is becoming an issue due to rising manufacturing and design costs in new nanometer-scale technologies and shorter time-to-market. To address these issues, our proposed approach uses an optimized reconfigurable functional unit instead, and instruction set customization is done after chip fabrication. Therefore, the low-energy benefit of customization is obtained while maintaining the flexibility of a conventional microprocessor. Experimental results show that the maximum and average energy savings are 67% and 22%, respectively, for our proposed architecture framework.

DOI： 10.1145/1393921.1393987
Performance prediction of large-scale parallell system and application using macro-level simulation

Ryutaro Susukita, Yasunori Kimura, Hisashige Ando, Hidemi Komatsu, Mutsumi Aoyagi, Motoyoshi Kurokawa, Hiroaki Honda, Kazuaki J. Murakami, Yuichi Inadomi, Hidetomo Shibamura, Koji Inoue, Shuji Yamamura, Shigeru Ishizuki, Yunqing Yu

2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008 2008

Language: English; Publication type: Research paper (other academic conference materials, etc.)

Predicting application performance on an HPC system is an important technology for designing the computing system and developing applications. However, accurate prediction is a challenge, particularly for a future system with higher performance. In this paper, we present a new method for predicting application performance on HPC systems. This method combines modeling of sequential performance on a single processor and macro-level simulations of applications for parallel performance on the entire system. In the simulation, the execution flow is traced but kernel computations are omitted to reduce the execution time. Validation on a real terascale system showed that the predicted and measured performance agreed within 10% to 20%. We employed the method in designing a hypothetical petascale system of 32768 SIMD-extended processor cores. For predicting application performance on the petascale system, the macro-level simulation required several hours.

DOI： 10.1109/SC.2008.5220091
Performance evaluation of a reconfigurable set processor

Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

2008 International SoC Design Conference, ISOCC 2008 I184 - I187 2008

Language: English; Publication type: Research paper (other academic conference materials, etc.)

Performance evaluation is a serious challenge in designing and optimizing reconfigurable instruction set processors. A combined analytical and simulation-based model (CAnSO) is proposed for performance evaluation of a typical reconfigurable set processor. The proposed model consists of an analytical core that incorporates statistics gathered from cycle-accurate simulation to make a reasonable evaluation. CAnSO has speed advantages and, compared to cycle-accurate simulation, shows almost 2% variation in the speedup measurement.

DOI： 10.1109/SOCDC.2008.4815603
Improving Performance and Energy Saving in a Reconfigurable Processor via Accelerating Control Data Flow Graphs

F. Mehdipour, H. Noori, M. S. Zamani, K. Inoue, and K. Murakami

IEICE Transactions on Electronics December 2007

Language: English
Improving Performance and Energy Saving in a Reconfigurable Processor via Accelerating Control Data Flow Graphs.

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Koji Inoue, Kazuaki J. Murakami

IEICE Transactions on Information & Systems 90-D ( 12 ) 1956 - 1966 December 2007

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1093/ietisy/e90-d.12.1956
At the cutting edge of a petascale computing world: An overview of Petascale System Interconnect project

Kazuaki J. Murakami, Feng Long Gu, Mutsumi Aoyagi, Takeshi Nanri, Koji Inoue

5th International Conference on Computational Methods in Science and Engineering, ICCMSE 2007 Computational Methods in Science and Engineering - Theory and Computation Old Problems and New Challenges, Lectures Presented at the Int. Conf. Computational Methods in Sci. Eng. 2007 ICCMSE 2007 23 - 38 December 2007

Language: English; Publication type: Research paper (other academic conference materials, etc.)

This talk presents an overview of the Petascale System Interconnect (PSI) project. The PSI project is one of the national projects on "Fundamental Technologies for the Next Generation Supercomputing" of MEXT (Ministry of Education, Culture, Sports, Science and Technology), Japan. The goal of the PSI project is to develop technologies enabling petascale supercomputing systems with hundreds of thousands of computing nodes. The PSI project consists of three subprojects tackling three fundamental technologies: subproject 1 covers small and efficient optical packet switches; subproject 2 covers low-cost, high-performance MPI communications; and subproject 3 covers methodologies for evaluating and estimating the performance of petascale systems. With the successful completion of the PSI project, the Japan Next-Generation Supercomputer R&D Center (NSC) will use these technologies to build Japan's next-generation supercomputer, which is expected to be over 70 times faster than the current fastest supercomputers.

DOI： 10.1063/1.2827008
Design of a reconfigurable data-path prototype in the single-flux-quantum circuit (peer-reviewed)

S. Iwasaki, M. Tanaka, Y. Yamanashi, H. Park, H. Akaike, A. Fujimaki, N. Yoshikawa, N. Takagi, K. Murakami, H. Honda, K. Inoue

Superconductor Science and Technology 20 ( 11 ) S328 - S331 November 2007

Language: English

We have designed a reconfigurable data-path (RDP) prototype based on the single-flux-quantum (SFQ) circuit. The RDP serves as an accelerator for a high-performance computer and is composed of many stages of arrays of floating-point processing units (FPUs) connected by reconfigurable operand routing networks (ORNs). The FPU array usually includes shift registers (SRs) so that data can be forwarded to the next stage without calculation. The data-path is reconfigured so as to reflect a long repeat instruction appearing in large-scale calculations. We can implement parallel and pipelined processing without memory access in such calculations, reducing the required bandwidth between memory and the microprocessor. When the RDP is built using the SFQ circuit, the SFQ high-speed network switches and bit-serial/slice FPUs reduce circuit area and power consumption compared to semiconductor devices. As a first step in the development of the SFQ-RDP, we design a 2 × 2 RDP prototype composed of double arrays of dual arithmetic logic units (ALUs). The prototype also has dual SRs in each array and four ORNs. We use bit-serial ALUs designed to operate at 25 GHz. Each ORN behaves like a 4 × 2 crossbar switch. We have demonstrated reconfiguration in the RDP prototype, made up of 15 050 Josephson junctions, though only some of the ALU functions are available.

DOI： 10.1088/0953-2048/20/11/S06
A Next-Generation Enterprise Server System with Advanced Cache Coherence Chips

M. Sakamoto, A. Katsuno, G. Sugizaki, T. Yoshida, A. Inoue, K. Inoue, and K. Murakami

IEICE Transactions on Electronics October 2007

Language: English
A Next-Generation Enterprise Server System with Advanced Cache Coherence Chips.

Mariko Sakamoto, Akira Katsuno, Go Sugizaki, Toshio Yoshida, Aiichiro Inoue, Koji Inoue, Kazuaki J. Murakami

IEICE Transactions on Electronics 90-C ( 10 ) 1972 - 1982 October 2007

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1093/ietele/e90-c.10.1972
Multi-physics Extension of OpenFMO Framework

Toshiya Takami, Jun Maki, Jun-ichi Ooba, Yuichi Inadomi, Hiroaki Honda, Ryutaro Susukita, Koji Inoue, Taizo Kobayashi, Rie Nogita, Mutsumi Aoyagi

CoRR abs/0707.2630 September 2007

Language: Other; Publication type: Research paper (academic journal)
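A Fast and Accurate Memory Architecture Simulation Technique Exploiting Memory Access Characteristics (original title: メモリアクセスの特徴を活用した高速かつ正確なメモリアーキテクチャ・シミュレーション法)

小野貴継, 井上弘士, 村上和彰

IPSJ Transactions on Advanced Computing Systems August 2007

Language: Japanese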
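A Fast and Accurate Memory Architecture Simulation Technique Exploiting Memory Access Characteristics (original title: メモリアクセスの特徴を活用した高速かつ正確なメモリアーキテクチャ・シミュレーション法)

小野貴継, 井上弘士, 村上和彰

IPSJ Transactions on Advanced Computing Systems (ACS) 48 ( 13 ) 203 - 213 August 2007

Language: Japanese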

This paper proposes a fast and accurate memory architecture simulation technique. Memory architectures are commonly evaluated by trace-driven simulation based on address traces of memory references. However, as the design space expands, the evaluation time tends to increase. The time per simulation run can be shortened by reducing the address trace, but doing so lowers the accuracy. Our approach exploits memory access characteristics to reduce the trace size while maintaining high accuracy, thereby shortening the simulation time. In an evaluation based on cache performance measurements, the proposed technique reduced the trace size by 98.8% on average, with an average cache miss ratio prediction error of 0.067 percentage points.

DOI： 10.15017/8308
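An MPI Rank Placement Optimization Technique for Reducing Contention in Consideration of Communication Timing (original title: 通信タイミングを考慮した衝突削減のためのMPIランク配置最適化技術)

森江善之, 末安直樹, 松本透, 南里豪志, 石畑宏明, 井上弘士, 村上和彰

IPSJ Transactions on Advanced Computing Systems August 2007

Language: Japanese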
Handling Control Data Flow Graphs for a Tightly Coupled Reconfigurable Accelerator.

Hamid Noori, Farhad Mehdipour, Morteza Saheb Zamani, Koji Inoue, Kazuaki J. Murakami

Embedded Software and Systems (ICESS) 249 - 260 May 2007

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1007/978-3-540-72685-2_24
Interactive presentation: Generating and executing multi-exit custom instructions for an adaptive extensible processor.

Hamid Noori, Farhad Mehdipour, Kazuaki J. Murakami, Koji Inoue, Maziar Goudarzi

2007 Design, Automation and Test in Europe Conference and Exposition (DATE) 325 - 330 April 2007

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/DATE.2007.364612
The effect of temperature on cache size tuning for low energy embedded systems.

Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki J. Murakami

Proceedings of the 17th ACM Great Lakes Symposium on VLSI 2007 453 - 456 March 2007

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1145/1228784.1228891
The Effect of Nanometer-Scale Technologies on the Cache Size Selection for Low Energy Embedded Systems.

Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki J. Murakami

Proceedings of the 2007 International Conference on Embedded Systems & Applications (ESA) 169 - 176 January 2007

Language: Other; Publication type: Research paper (other academic conference materials, etc.)
Generating and executing multi-exit custom instructions for an adaptive extensible processor

Hamid Noori, Farhad Mehdipour, Kazuaki Murakami, Koji Inoue, Maziar Goudarzi

2007 Design, Automation and Test in Europe Conference and Exhibition Proceedings - 2007 Design, Automation and Test in Europe Conference and Exhibition, DATE 2007 325 - 330 2007

Language: English; Publication type: Research paper (other academic conference materials, etc.)

To improve the performance of embedded processors, an effective technique is collapsing critical computation subgraphs as application-specific instruction set extensions and executing them on custom functional units. The problems of this approach are the immense cost and long design time. To address these issues, we propose an adaptive extensible processor in which custom instructions (CIs) are generated and added after chip fabrication. To support this feature, custom functional units are replaced by a reconfigurable matrix of functional units with the capability of conditional execution. Unlike previously proposed CIs, ours can include multiple exits. Experimental results show that multi-exit CIs enhance performance by 46% on average compared to CIs limited to one basic block. A maximum speedup of 2.89 compared to a 4-issue in-order RISC processor, and an average speedup of 1.66, were achieved on the MiBench benchmark suite.

DOI： 10.1109/DATE.2007.364612
The effect of temperature on cache size tuning for low energy embedded systems

Hamid Noori, Maziar Goudarzi, Koji Inoue, Kazuaki Murakami

17th Great Lakes Symposium on VLSI, GLSVLSI'07 GLSVLSI'07 Proceedings of the 2007 ACM Great Lakes Symposium on VLSI 453 - 456 2007

Language: English; Publication type: Research paper (other academic conference materials, etc.)

Energy consumption is a major concern in embedded computing systems. Several studies have shown that cache memories account for about 40% or more of the total energy consumed in these systems. In older technology nodes, active power was the primary contributor to total power dissipation of a CMOS design. However, with the scaling of feature sizes, the share of leakage in total power consumption of digital systems continues to grow. Temperature is a factor which exponentially increases the leakage current. In this paper, we show the effects of temperature on the selection of optimal cache size for low energy embedded systems. Our results show that for a given application, the optimal cache size selection is affected by the temperature. Our experiments were done for 100 nm technology. Our study reveals that the cache size selection for different temperatures depends on the rate at which cache misses increase as the cache size is reduced. When the miss rate increases sharply, the optimal point is the same for all examined temperatures; when it increases more gradually, the optimal points for different temperatures begin to diverge.

DOI： 10.1145/1228784.1228891
Multi-physics extension of OpenFMO framework

Toshiya Takami, Jun Maki, Jun'ichi Ooba, Yuuichi Inadomi, Hiroaki Honda, Ryutaro Susukita, Koji Inoue, Taizo Kobayashi, Rie Nogita, Mutsumi Aoyagi

International Conference on Computational Methods in Science and Engineering 2007, ICCMSE 2007 Computation in Modern Science and Engineering - Proceedings of the International Conference on Computational Methods in Science and Engineering 2007 (ICCMSE 2007) 122 - 125 2007

Language: English; Publication type: Research paper (other academic conference materials, etc.)

The OpenFMO framework, an open-source software (OSS) platform for the Fragment Molecular Orbital (FMO) method, is extended to multi-physics simulations (MPS). After reviewing several FMO implementations on distributed computing environments, we present the subsequent development plan for MPS. We discuss which form is preferable for scientific software: a lightweight, reconfigurable form or a large, self-contained form.

DOI： 10.1063/1.2835969
Implementation and evaluation of Fock matrix calculation program on the Cell processor

Hiroaki Honda, Tetsuo Hayashi, Yuichi Inadomi, Koji Inoue, Kazuaki J. Murakami

International Conference on Computational Methods in Science and Engineering 2007, ICCMSE 2007 Computation in Modern Science and Engineering - Proceedings of the International Conference on Computational Methods in Science and Engineering 2007 (ICCMSE 2007) 64 - 67 2007

Language: English; Publication type: Research paper (other academic conference materials, etc.)

Various processor architectures have been proposed to date, and performance has improved remarkably. Recently, chip multiprocessors (CMPs), which integrate many processor cores on a single chip, have been proposed for further performance improvement. The Cell processor is one such CMP and shows high computational performance. Although this processor is designed for multimedia, its high performance can be exploited for molecular orbital calculation. In this study, we implemented a Fock matrix construction program on the Cell processor and evaluated its computational performance. We found two main kinds of stalls, caused by branch prediction and by data alignment, both of which are handled by software mechanisms to simplify the Cell processor hardware. Performance could be improved by about 30% if the branch prediction hit ratio were improved to 99%. As for data alignment stalls, the portion originating in the data shuffle pipeline could be reduced by providing a hardware data alignment mechanism.

DOI： 10.1063/1.2836167
Handling control data flow graphs for a tightly coupled reconfigurable accelerator

Hamid Noori, Farhad Mehdipour, Morteza Saheb Zamani, Koji Inoue, Kazuaki Murakami

3rd International Conference on Embedded Software and Systems, ICESS 2007 Embedded Software and Systems - Third International Conference, ICESS 2007, Proceedings 249 - 260 2007

Language: English; Publication type: Research paper (other academic conference materials, etc.)

In an embedded system that integrates a base processor with a tightly coupled accelerator, extracting frequently executed portions of the code (hot portions) and executing their corresponding data flow graphs (DFGs) on the accelerator yields greater speedup. In this paper, we present our motivations for handling control instructions in DFGs and extending them to Control DFGs (CDFGs). In addition, basic requirements for an accelerator with conditional execution support are proposed. Moreover, some algorithms are presented for temporal partitioning of CDFGs considering the architectural specifications of the target accelerator. To show the effectiveness of the proposed ideas, we applied them to the accelerator of an extensible processor called AMBER. Experimental results demonstrate the effectiveness of covering control instructions and using CDFGs rather than DFGs.

DOI： 10.1007/978-3-540-72685-2_24
Return Address Protection on Cache Memories.

Koji Inoue

IEICE Transactions on Electronics 89-C ( 12 ) 1937 - 1947 December 2006

Language: Other; Publication type: Research paper (academic journal)

DOI： 10.1093/ietele/e89-c.12.1937
Supporting A Dynamic Program Signature: An Intrusion Detection Framework for Microprocessors.

Koji Inoue

13th IEEE International Conference on Electronics, Circuits, and Systems (ICECS) 160 - 163 December 2006

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/ICECS.2006.379744
Lock and Unlock: A Data Management Algorithm for A Security-Aware Cache.

Koji Inoue

13th IEEE International Conference on Electronics, Circuits, and Systems (ICECS) 1093 - 1096 December 2006

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/ICECS.2006.379629
Special section on VLSI Design and CAD Algorithms (peer-reviewed)

Hidetoshi Onodera, Makoto Ikeda, Tohru Ishihara, Tsuyoshi Isshiki, Koji Inoue, Kenichi Okada, Seiji Kajihara, Mineo Kaneko, Hiroshi Kawaguchi, Shinji Kimura, Morihiro Kuga, Atsushi Kurokawa, Takashi Sato, Toshiyuki Shibuya, Yoichi Shiraishi, Kazuyoshi Takagi, Atsushi Takahashi, Yoshinori Takeuchi, Nozomu Togawa, Hiroyuki Tomiyama, Yuichi Nakamura, Kiyoharu Hamaguchi, Yukiya Miura, Shin Ichi Minato, Ryuichi Yamaguchi, Masaaki Yamada, Yasushi Yuminaka, Takayuki Watanabe, Masanori Hashimoto, Masayuki Miyazaki

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E89-A ( 12 ) 3377 December 2006

Language: English

DOI： 10.1093/ietfec/e89-a.12.3377
An Integrated Temporal Partitioning and Mapping Framework for Handling Custom Instructions on a Reconfigurable Functional Unit.

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki J. Murakami, Mehdi Sedighi, Koji Inoue

Advances in Computer Systems Architecture 219 - 230 September 2006

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1007/11859802_18
Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit.

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki J. Murakami, Koji Inoue, Mehdi Sedighi

Embedded and Ubiquitous Computing (EUC) 722 - 731 August 2006

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1007/11802167_73
A Reconfigurable Functional Unit for an Adaptive Dynamic Extensible Processor.

Hamid Noori, Farhad Mehdipour, Kazuaki J. Murakami, Koji Inoue, Morteza Saheb Zamani

Proceedings of the 2006 International Conference on Field Programmable Logic and Applications (FPL) 1 - 4 August 2006

Language: Other; Publication type: Research paper (other academic conference materials, etc.)

DOI： 10.1109/FPL.2006.311313
A reconfigurable functional unit for an adaptive dynamic extensible processor

Hamid Noori, Farhad Mehdipour, Kazuaki Murakami, Koji Inoue, Morteza Sahebzamani

2006 International Conference on Field Programmable Logic and Applications, FPL Proceedings - 2006 International Conference on Field Programmable Logic and Applications, FPL 781 - 784 2006

Language: English; Publication type: Research paper (other academic conference materials, etc.)

This paper presents a reconfigurable functional unit (RFU) for an adaptive dynamic extensible processor. The processor can tune its extended instructions to the target applications after chip fabrication. The custom instructions (CIs) are generated from the hot basic blocks during the training mode. In the normal mode, CIs are executed on the RFU. A quantitative approach was used for designing the RFU. The RFU is a matrix of functional units with 8 inputs and 6 outputs. Using the proposed RFU, performance is enhanced by a factor of up to 1.25 for 22 MiBench applications. This processor needs no extra opcodes for CIs, no new compiler, and no source code modification or recompilation.

DOI： 10.1109/FPL.2006.311313
Supporting a dynamic program signature: An intrusion detection framework for microprocessors

Koji Inoue

ICECS 2006 - 13th IEEE International Conference on Electronics, Circuits and Systems 160 - 163 2006

Language: English; Publication type: Research paper (other academic conference materials, etc.)

To address computer security issues, a hardware-based intrusion detection technique is proposed. This uses the dynamic program execution behavior for authentication. Based on secret key information, an execution behavior is determined. Next, a secure compiler constructs object code which generates the predetermined execution behavior at runtime. During program execution, a secure profiler monitors the execution behavior. If the profiler cannot detect the expected behavior, it sends an alarm signal to the microprocessor for terminating program execution. Since attack code cannot anticipate the execution behavior required, malicious attacks can be detected and prohibited at the start of program execution.

DOI： 10.1109/ICECS.2006.379744
Lock and unlock: A data management algorithm for a security-aware cache

Koji Inoue

ICECS 2006 - 13th IEEE International Conference on Electronics, Circuits and Systems 1093 - 1096 2006

Language: English; Publication type: Research paper (other academic conference materials, etc.)

This paper proposes an efficient cache line management algorithm for a security-aware cache architecture (SCache). SCache attempts to detect the corruption of return address values at runtime. When a return address store is executed, the cache generates a replica of the return address. This copied data is treated as read only. Subsequently, when the corresponding return address load is performed, the cache verifies the return address value loaded from the memory stack by means of comparing it with the replica data. Unfortunately, since the replica data is also a candidate for cache line replacements, SCache does not work well for application programs that cause higher cache miss rates. To resolve this issue, a lock and unlock data management algorithm is proposed in order to improve the security of SCache. The experimental results show that a proposed SCache model can protect about 99% of return address loads from the threat of buffer overflow attacks, while it worsens the processor performance by only 1%, compared with a non-secure conventional cache.

DOI： 10.1109/ICECS.2006.379629
Custom instruction generation using temporal partitioning techniques for a reconfigurable functional unit

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Koji Inoue, Mehdi Sedighi

International Conference on Embedded and Ubiquitous Computing, EUC 2006 Embedded and Ubiquitous Computing - International Conference, EUC 2006, Proceedings 722 - 731 2006

Language: English; Publication type: Research paper (other academic conference materials, etc.)

Extracting appropriate custom instructions is an important phase for implementing an application on an extensible processor with a reconfigurable functional unit (RFU). Custom instructions (CIs) are usually extracted from critical portions of applications. It may not be possible to meet all of the RFU constraints when CIs are generated. This paper addresses the generation of mappable CIs on an RFU. In this paper, our proposed RFU architecture for an adaptive dynamic extensible processor is described. Then, an integrated framework for temporal partitioning and mapping is presented to partition and map the CIs on RFU. In this framework, two mapping aware temporal partitioning algorithms are used to generate CIs. Temporal partitioning iterates and modifies partitions incrementally to generate CIs. Using this framework brings about more speedup for the extensible processor.

DOI： 10.1007/11802167_73
An integrated temporal partitioning and mapping framework for handling custom instructions on a reconfigurable functional unit

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Mehdi Sedighi, Koji Inoue

11th Asia-Pacific Conference on Advances in Computer Systems Architecture, ACSAC 2006 Advances in Computer Systems Architecture - 11th Asia-Pacific Conference, ACSAC 2006, Proceedings 219 - 230 2006

Language: English; Publication type: Research paper (other academic conference materials)

Extensible processors allow customization for an application by extending the core instruction set architecture. Extracting appropriate custom instructions is an important phase for implementing an application on an extensible processor with a reconfigurable functional unit. Custom instructions (CIs) usually are extracted from critical portions of applications. This paper presents approaches for CI generation with respect to the RFU constraints to improve speedup of the extensible processor. First, our proposed RFU architecture for an adaptive dynamic extensible processor called AMBER is described. Then, an integrated temporal partitioning and mapping framework is presented to partition and map the CIs on the RFU. In this framework, a mapping aware temporal partitioning algorithm is used to generate CIs which are mappable on the RFU. Temporal partitioning iterates and modifies partitions incrementally to generate CIs. In addition, a mapping algorithm is presented which supports CIs with critical path length more than the RFU depth.

DOI： 10.1007/11859802_18
Adaptive Mode Control for Low-Power Caches Based on Way-Prediction Accuracy.

Hidekazu Tanaka, Koji Inoue

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 88-A ( 12 ) 3274 - 3281 December 2005

Language: Other; Publication type: Research paper (scientific journal)

DOI： 10.1093/ietfec/e88-a.12.3274
A Cost Effective Spacial Redundancy with Data-Path Partitioning.

Shigeharu Matsusaka, Koji Inoue

Third International Conference on Information Technology and Applications (ICITA 2005) 51 - 56 July 2005

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ICITA.2005.7
Quantitative Evaluation of State-Preserving Leakage Reduction Algorithm for L1 Data Caches.

Reiko Komiya, Koji Inoue, Vasily G. Moshnyaga, Kazuaki J. Murakami

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 88-A ( 4 ) 862 - 868 April 2005

Language: Other; Publication type: Research paper (scientific journal)

DOI： 10.1093/ietfec/e88-a.4.862
Energy-security tradeoff in a secure cache architecture against buffer overflow attacks.

Koji Inoue

SIGARCH Computer Architecture News 33 ( 1 ) 81 - 89 March 2005

Language: Other; Publication type: Research paper (scientific journal)

DOI： 10.1145/1055626.1055638
Low-power cache design

Vasily G. Moshnyaga, Koji Inoue

Low-Power Processors and Systems on Chips 8-1 - 8-21 January 2005

Language: English

Cache memories are the most area- and energy-consuming units in today’s microprocessors. As the speed disparity between processor and external memory increases, designers try to put large multilevel caches on a chip to reduce the number of external memory accesses and thus boost the system performance. (See Table 8.1 for a survey of the on-die caches for several recent high-end microprocessors.) On-chip data and instruction caches are implemented using arrays of densely packed static RAM cells. The device count for the caches often exceeds the number of transistors devoted to the processor’s datapath and controller. For example, the Alpha21364 [3] and PA-RISC Maco [5] microprocessors have over 90% of their transistors in RAM, with most of them dedicated to caches; the Itanium2 [1] has 80% in caches, the IBM G5 [7] has 72%, the PowerPC [8] has 71%, and Strong-ARM110 [9] has 70%. Due to the large load capacitance and high access rate, these caches account for a significant portion of the overall power dissipation (e.g., 35% in Itanium2 [1]; 43% in Strong-ARM [9]). Therefore, optimizing caches for power is increasingly important. Although much work on energy reduction has taken place in the circuit and technology domains [10,11], interest in cache design for power efficiency at the architectural level continues to increase. Architecture is the entry point in the cache design hierarchy, and decisions taken at this level can drastically affect the efficiency of the design.
A cost effective spatial redundancy with data-path partitioning

Shigeharu Matsusaka, Koji Inoue

3rd International Conference on Information Technology and Applications, ICITA 2005 Proceedings - 3rd International Conference on Information Technology and Applications, ICITA 2005 51 - 56 2005

Language: English; Publication type: Research paper (other academic conference materials)

In order to maintain the high reliability of a computer system, it is necessary to detect the failure leading to a fault. In general, faults can be detected by exploiting time redundancy or spatial redundancy. However, this negatively affects either hardware cost or processor performance. To solve this cost-performance issue, we propose a cost-effective approach to achieving spatial redundancy for dependable processors. In addition, we perform a preliminary evaluation of the impact of our method on processor performance.

DOI： 10.1109/ICITA.2005.7
A low-power I-cache design with tag-comparison reuse.

Koji Inoue, Hidekazu Tanaka, Vasily G. Moshnyaga, Kazuaki J. Murakami

Proceedings of the 2004 International Symposium on System-on-Chip(SoC) 61 - 67 November 2004

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ISSOC.2004.1411147
Low-power cache design

Vasily G. Moshnyaga, Inoue Koji

Low-Power Electronics Design 25-1 - 25-21 January 2004

Language: English

Cache memories are the most area- and energy-consuming units in today’s microprocessors. As the speed disparity between processor and external memory increases, designers try to put large multilevel caches on a chip to reduce the number of external memory accesses and thus boost the system performance. (See Table 25.1 for a survey of the on-die caches for several recent high-end microprocessors.) On-chip data and instruction caches are implemented using arrays of densely packed static RAM cells. The device count for the caches often exceeds the number of transistors devoted to the processor’s datapath and controller. For example, the Alpha21364 [3] and PA-RISC Maco [5] microprocessors have over 90% of their transistors in RAM, with most of them dedicated to caches; the Itanium2 [1] has 80% in caches, the IBM G5 [7] has 72%, the PowerPC [8] has 71%, and Strong-ARM110 [9] has 70%. Due to the large load capacitance and high access rate, these caches account for a significant portion of the overall power dissipation (e.g., 35% in Itanium2 [1]; 43% in Strong-ARM [9]). Therefore, optimizing caches for power is increasingly important. Although much work on energy reduction has taken place in the circuit and technology domains [10,11], interest in cache design for power efficiency at the architectural level continues to increase. Architecture is the entry point in the cache design hierarchy, and decisions taken at this level can drastically affect the efficiency of the design.
A low-power I-cache design with tag-comparison reuse

Koji Inoue, Hidekazu Tanaka, Vasily G. Moshnyaga, Kazuaki Murakami

2004 International Symposium on System-on-Chip 2004 International Symposium on System-on-Chip Proceedings 61 - 67 2004

Language: English; Publication type: Research paper (other academic conference materials)

This paper reports design and evaluation results of a low-energy I-cache architecture, called history-based tag-comparison (HBTC) cache. The HBTC cache attempts to re-use tag-comparison results to detect and eliminate unnecessary memory-array activations. We have performed cycle-accurate simulations, and have designed an SRAM core based on a 0.18 μm CMOS technology. As a result, it has been observed that the HBTC approach can achieve a 60% energy reduction, with only 0.3% performance degradation, compared to a conventional cache. Furthermore, we have also evaluated the potential of the HBTC cache by combining it with other low-energy techniques.
Designing a TCP/IP core for power consumption analysis

Kenichi Tanamachi, Inoue Koji, Vasily G. Moshnyaga

Proceedings of 2004 IEEE Asia-Pacific Conference on Advanced System Integrated Circuits 412 - 413 2004

Language: English; Publication type: Research paper (other academic conference materials)

The design of a low-power TCP/IP hard core for pervasive computing was discussed. In order to implement the TCP/IP operations in hardware, the TCP and IP functions were partitioned into four modules: port_ctr, data_ctr, window_ctr, and checksum. It was found that the data_ctr consumed about 22-30% of total power. The power consumption of the TCP core and the IP core were compared, and it was found that the power consumed by the TCP core was almost double that of the IP core.
Reducing Access Count to Register-Files through Operand Reuse.

Hiroshi Takamura, Koji Inoue, Vasily G. Moshnyaga

Advances in Computer Systems Architecture 112 - 121 September 2003

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1007/978-3-540-39864-6_10
Instruction Encoding for Reducing Power Consumption of I-ROMs Based on Execution Locality.

Koji Inoue, Vasily G. Moshnyaga, Kazuaki J. Murakami

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 86-A ( 4 ) 799 - 805 April 2003

Language: Other; Publication type: Research paper (scientific journal)
A zero-value prediction technique for fast DCT computation

Y. Nishida, Inoue Koji, V. G. Moshnyaga

2003 IEEE Workshop on Signal Processing Systems, SIPS 2003 2003 IEEE Workshop on Signal Processing Systems Design and Implementation, SIPS 2003 2003-January 165 - 170 January 2003

Language: English; Publication type: Research paper (other academic conference materials)

The paper proposes a new computationally efficient technique for DCT operation. Unlike related research, the technique reduces the number of computations by predicting the effect of quantization on DCT and avoiding calculations of those DCT values which lead to zero elements in the block after quantization. Experimental evaluation on a number of video benchmarks shows that our method is able to reduce the total number of computations by 29% for DCT and by 59% for quantization while maintaining high image quality.

DOI： 10.1109/SIPS.2003.1235663
Dynamic tag-check omission: A low power instruction cache architecture exploiting execution footprints

Koji Inoue, Vasily Moshnyaga, Kazuaki Murakami

2nd International Workshop on Power-Aware Computer Systems, PACS 2002 Power-Aware Computer Systems - 2nd International Workshop, PACS 2002, Revised Papers 18 - 32 2003

Language: English; Publication type: Research paper (other academic conference materials)

This paper proposes an architecture for low-power direct-mapped instruction caches, called “history-based tag-comparison (HBTC) cache”. The HBTC cache attempts to detect and omit unnecessary tag checks at run time. Execution footprints are recorded in an extended BTB (Branch Target Buffer), and are used to determine the cache residence of target instructions before starting a cache access. In our simulation, it is observed that our approach can reduce the total count of tag checks by 90%, resulting in a 15% cache-energy reduction, with less than 0.5% performance degradation.

DOI： 10.1007/3-540-36612-1_2
Multiplier energy reduction through bypassing of partial products.

Jun-ni Ohban, Vasily G. Moshnyaga, Koji Inoue

IEEE Asia Pacific Conference on Circuits and Systems 2002 13 - 17 October 2002

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/APCCAS.2002.1115097
Reducing power consumption of instruction ROMs by exploiting instruction frequency.

Koji Inoue, Vasily G. Moshnyaga, Kazuaki J. Murakami

IEEE Asia Pacific Conference on Circuits and Systems 2002 1 - 6 October 2002

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/APCCAS.2002.1115094
A history-based I-cache for low-energy multimedia applications.

Koji Inoue, Vasily G. Moshnyaga, Kazuaki J. Murakami

Proceedings of the 2002 International Symposium on Low Power Electronics and Design(ISLPED) 148 - 153 August 2002

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1145/566408.566447
Reducing energy consumption of video memory by bit-width compression.

Vasily G. Moshnyaga, Koji Inoue, Mizuka Fukagawa

Proceedings of the 2002 International Symposium on Low Power Electronics and Design(ISLPED) 142 - 147 August 2002

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1145/566408.566446
Omitting cache look-up for high-performance, low-power microprocessors (Peer-reviewed)

K Inoue, VG Moshnyaga, K Murakami

IEICE TRANSACTIONS ON ELECTRONICS E85C ( 2 ) 279 - 287 February 2002

Language: English; Publication type: Research paper (scientific journal)

In this paper, we propose a novel architecture for low-power direct-mapped instruction caches, called "history-based tag-comparison (HBTC) cache." The cache attempts to reuse tag-comparison results for avoiding unnecessary tag checks. Execution footprints are recorded into an extended BTB (Branch Target Buffer). In our evaluation, it is observed that the energy for tag comparison can be reduced by more than 90% in many applications.
Dynamic Tag-Check Omission: A Low Power Instruction Cache Architecture Exploiting Execution Footprints.

Koji Inoue, Vasily G. Moshnyaga, Kazuaki J. Murakami

Power-Aware Computer Systems(PACS) 18 - 32 February 2002

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1007/3-540-36612-1_2
Trends in high-performance, low-power cache memory architectures (Peer-reviewed)

K Inoue, VG Moshnyaga, K Murakami

IEICE TRANSACTIONS ON ELECTRONICS E85C ( 2 ) 304 - 314 February 2002

Language: English; Publication type: Research paper (scientific journal)

One of the uncompromising requirements of portable computing is energy efficiency, because it directly affects battery life. On the other hand, portable computing will target more demanding applications, for example moving pictures, so higher performance is still required. Cache memories have been employed as one of the most important components of computer systems. In this paper, we briefly survey architectural techniques for high-performance, low-power cache memories.
A Low Energy Set-Associative I-Cache with Extended BTB.

Koji Inoue, Vasily G. Moshnyaga, Kazuaki J. Murakami

20th International Conference on Computer Design (ICCD 2002), VLSI in Computers and Processors(ICCD) 187 January 2002

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1109/ICCD.2002.1106768
Register File Energy Reduction by Operand Data Reuse.

Hiroshi Takamura, Koji Inoue, Vasily G. Moshnyaga

Integrated Circuit Design. Power and Timing Modeling, Optimization and Simulation(PATMOS) 278 - 288 January 2002

Language: Other; Publication type: Research paper (other academic conference materials)

DOI： 10.1007/3-540-45716-X_28
Register file energy reduction by operand data reuse

Hiroshi Takamura, Koji Inoue, Vasily G. Moshnyaga

12th International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS 2002 Integrated Circuit Design Power and Timing Modeling, Optimization and Simulation - 12th International Workshop, PATMOS 2002, Proceedings 278 - 288 January 2002

Language: English; Publication type: Research paper (other academic conference materials)

This paper presents an experimental study of register file utilization in a conventional RISC-type data path architecture to determine the benefits that we can expect to achieve by eliminating unnecessary register file reads and writes. Our analysis shows that operand bypassing, enhanced for operand reuse, can eliminate up to 65% of register file accesses at peak and 39% on average for the tested benchmark programs.

DOI： 10.1007/3-540-45716-x_28
Multiplier energy reduction through bypassing of partial products

Jun Ni Ohban, V. G. Moshnyaga, K. Inoue

Asia-Pacific Conference on Circuits and Systems, APCCAS 2002 Proceedings - APCCAS 2002 Asia-Pacific Conference on Circuits and Systems 13 - 17 January 2002

Language: English; Publication type: Research paper (other academic conference materials)

The design of portable battery-operated multimedia devices requires energy-efficient multiplication circuits. This paper presents a novel approach to reducing the power consumption of a digital multiplier based on dynamic bypassing of partial products. The bypassing elements incorporated into the multiplier hardware eliminate redundant signal transitions, which appear within the carry-save adders when the partial product is zero. Simulations on real-life DCT data show that the proposed approach can improve the power saving of related methods by 12%, while jointly with them, it reduces the power consumption of a 16x16 digital CMOS multiplier by 31%, with 25% area overhead and less than 4% performance degradation in the worst case. The circuit implementation is outlined.

DOI： 10.1109/APCCAS.2002.1115097
Reducing power consumption of instruction ROMs by exploiting instruction frequency

K. Inoue, V. G. Moshnyaga, K. Murakami

Asia-Pacific Conference on Circuits and Systems, APCCAS 2002 Proceedings - APCCAS 2002 Asia-Pacific Conference on Circuits and Systems 1 - 6 2002

Language: English; Publication type: Research paper (other academic conference materials)

This paper proposes a new approach to reducing the power consumption of instruction ROMs for embedded systems. The power consumption of instruction ROMs strongly depends on the switching activity of bit-lines. If a read bit-value is '0', the precharged bit-line is discharged. In this scenario, bit-line switching takes place and consumes power. Otherwise, the precharged bit-line level is maintained until the next access, thus no bit-line switching occurs. In our approach, the binary patterns to be assigned to op-codes are determined based on the frequency of instructions in order to reduce the bit-line switching activity. Application programs are analyzed in advance, and then binary patterns including many '1's are assigned to the most frequently referenced instructions. In our evaluation, it is observed that the proposed approach can reduce bit-line switching by 40%.

DOI： 10.1109/APCCAS.2002.1115094
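
The frequency-based encoding idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 4-bit op-code field, the instruction names and counts, and the cost model (one bit-line discharge per '0' bit read) are all assumptions made for illustration.

```python
def assign_opcodes(freq, width):
    """Give the most frequently fetched instructions the patterns with the most 1s."""
    # All bit patterns of the given width, those with the most '1' bits first.
    patterns = sorted(range(2 ** width),
                      key=lambda p: bin(p).count("1"), reverse=True)
    # Instruction names, most frequent first.
    ranked = sorted(freq, key=freq.get, reverse=True)
    if len(ranked) > len(patterns):
        raise ValueError("op-code field too narrow for this instruction set")
    return {name: patterns[i] for i, name in enumerate(ranked)}


def bitline_discharges(encoding, freq, width):
    """Count '0' bits read, weighted by how often each instruction is fetched."""
    return sum(freq[name] * (width - bin(code).count("1"))
               for name, code in encoding.items())


# Hypothetical execution profile (fetch counts per instruction).
freq = {"load": 500, "add": 300, "branch": 150, "mul": 50}
width = 4

naive = {name: i for i, name in enumerate(freq)}   # codes 0, 1, 2, 3 in listing order
tuned = assign_opcodes(freq, width)

print(bitline_discharges(naive, freq, width))
print(bitline_discharges(tuned, freq, width))
```

Under this simplified model, the tuned encoding reads far fewer '0' bits than the naive sequential numbering, which is the effect the abstract attributes to frequency-aware op-code assignment.
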
Performance/energy efficiency of variable line-size caches for intelligent memory systems

Koji Inoue, Koji Kai, Kazuaki Murakami

2nd International Workshop on Intelligent Memory Systems, IMS 2000 Intelligent Memory Systems - 2nd International Workshop, IMS 2000, Revised Papers 169 - 178 2001

Language: English; Publication type: Research paper (other academic conference materials)

DOI： 10.1007/3-540-44570-6_13
A high-performance/low-power on-chip memory-path architecture with variable cache-line size

K Inoue, K Kai, K Murakami

IEICE TRANSACTIONS ON ELECTRONICS E83C ( 11 ) 1716 - 1723 November 2000

Language: English

This paper proposes an on-chip memory-path architecture employing the dynamically variable line-size (D-VLS) cache for high performance and low energy consumption. The D-VLS cache exploits the high on-chip memory bandwidth attainable on merged DRAM/logic LSIs by replacing a whole large cache line in one cycle. At the same time, it attempts to avoid frequent evictions by decreasing the cache-line size when programs have poor spatial locality. Activating only on-chip DRAM subarrays corresponding to a replaced cache-line size produces a significant energy reduction. In our simulation, it is observed that our proposed on-chip memory-path architecture, which employs a direct-mapped D-VLS cache, improves the ED (Energy Delay) product by more than 75% over a conventional memory-path model.
Dynamically variable line-size cache architecture for merged DRAM/Logic LSIs

K Inoue, K Kai, K Murakami

IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E83D ( 5 ) 1048 - 1057 May 2000

Language: English

This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, which is called "dynamically variable line-size cache (D-VLS cache)." The D-VLS cache can optimize its line size according to the characteristics of programs, and attempts to improve performance by appropriately exploiting the high on-chip memory bandwidth on merged DRAM/logic LSIs. In our evaluation, it is observed that the average memory access time improvement achieved by a direct-mapped D-VLS cache is about 20% compared to a conventional direct-mapped cache with fixed 32-byte lines. This performance improvement is better than that of a doubled-size conventional direct-mapped cache.
A high-performance and low-power cache architecture with speculative way-selection

K Inoue, T Ishihara, K Murakami

IEICE TRANSACTIONS ON ELECTRONICS E83C ( 2 ) 186 - 194 February 2000

Language: English

This paper proposes a new approach to achieving high performance and low energy consumption for set-associative caches. The cache, called way-predicting set-associative cache, speculatively selects a single way, which is likely to contain the data desired by the processor, from the set designated by a memory address, before it starts a normal cache access. By accessing only the single way predicted, instead of accessing all the ways in a set, energy consumption can be reduced. In order for the way-predicting cache to perform well, accuracy of way prediction is important. This paper shows that the accuracy of an MRU (most recently used)-based way prediction is higher than 90% for most of the benchmark programs. The proposed way-predicting cache improves the ED (energy-delay) product by 60-70% compared to the conventional set-associative cache.
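
As a rough illustration of the MRU-based way prediction described above, the following Python sketch models a tag-only set-associative cache that probes the most recently used way first and falls back to the remaining ways on a misprediction. It is a toy model under stated assumptions, not the paper's design: the class name, the 32-byte line size, LRU replacement, and the use of "ways probed" as a stand-in for energy are all hypothetical.

```python
class WayPredictingCache:
    """Toy set-associative cache with MRU-based way prediction (tags only)."""

    def __init__(self, num_sets, num_ways, line_size=32):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.line_size = line_size
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        # Per-set recency order of ways: front = most recently used.
        self.order = [list(range(num_ways)) for _ in range(num_sets)]
        self.accesses = 0
        self.first_hits = 0   # accesses served by the predicted way alone
        self.way_probes = 0   # ways activated; a crude proxy for energy

    def access(self, addr):
        block = addr // self.line_size
        index, tag = block % self.num_sets, block // self.num_sets
        order = self.order[index]
        predicted = order[0]                   # predict the MRU way
        self.accesses += 1
        self.way_probes += 1                   # probe only the predicted way
        if self.tags[index][predicted] == tag:
            self.first_hits += 1
            return True
        self.way_probes += self.num_ways - 1   # misprediction: probe the rest
        if tag in self.tags[index]:
            way, hit = self.tags[index].index(tag), True
        else:
            way, hit = order[-1], False        # evict the least recently used way
            self.tags[index][way] = tag
        order.remove(way)
        order.insert(0, way)                   # this way becomes the MRU way
        return hit


cache = WayPredictingCache(num_sets=4, num_ways=4)
for addr in [0, 4, 8, 0, 128, 132, 0]:         # hypothetical address trace
    cache.access(addr)

print(cache.first_hits, cache.accesses, cache.way_probes)
```

In this model a conventional cache would activate `num_ways` ways on every access, so comparing `way_probes` with `accesses * num_ways` gives a feel for why prediction accuracy matters.
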
MOE: A special-purpose parallel computer for high-speed, large-scale molecular orbital calculation

Koji Hashimoto, Hiroto Tomita, Inoue Koji, Katsuhiko Metsugi, Kazuaki Murakami, Shinjiro Inabata, So Yamada, Nobuaki Miyakawa, Hajime Takashima, Kunihiro Kitamura, Shigeru Obara, Takashi Amisaki, Kazutoshi Tanabe, Umpei Nagashima

1999 ACM/IEEE Conference on Supercomputing, SC 1999 ACM/IEEE SC 1999 Conference, SC 1999 January 1999

Language: English; Publication type: Research paper (other academic conference materials)

We are constructing a high-performance, special-purpose parallel machine for ab initio Molecular Orbital calculations, called MOE (Molecular Orbital calculation Engine). The sequential execution time is O(N⁴), where N is the number of basis functions, and most of the time is spent on the calculation of electron repulsion integrals (ERIs). The calculation of ERIs has a large amount of parallelism, of O(N⁴), and MOE therefore tries to exploit this parallelism. This paper discusses the MOE architecture and examines important aspects of the architecture design required to calculate ERIs according to the "Obara method". We conclude that n-way parallelization is the most cost-effective, hence we designed the MOE prototype system with a host computer and many processing nodes. Each processing node includes a 76-bit floating-point MULTIPLY-and-ADD unit, internal memory, etc., and performs ERI computations efficiently. We estimate that the prototype system with 100 processing nodes can calculate the energy of proteins in a few days.

DOI： 10.1109/SC.1999.10000
High bandwidth, variable line-size cache architecture for merged DRAM/logic LSIs (Peer-reviewed)

K Inoue, K Kai, K Murakami

IEICE TRANSACTIONS ON ELECTRONICS E81C ( 9 ) 1438 - 1447 September 1998

Language: English

Merged DRAM/logic LSIs could provide high on-chip memory bandwidth by interconnecting logic portions and DRAM with wider on-chip buses. For merged DRAM/logic LSIs with the memory hierarchy including cache memory, we can exploit such high on-chip memory bandwidth by means of replacing a whole cache line (or cache block) at a time on cache misses. This approach tends to increase the cache-line size if we attempt to improve the attainable memory bandwidth. Larger cache lines, however, might worsen the system performance if programs running on the LSIs do not have enough spatial locality of references and cache misses frequently take place. This paper describes a novel cache architecture suitable for merged DRAM/logic LSIs, called variable line-size cache or VLS cache, for resolving the above-mentioned dilemma. The VLS cache can make good use of the high on-chip memory bandwidth by means of larger cache lines and, at the same time, alleviate the negative effects of larger cache-line size by partitioning each large cache line into multiple sub-lines and allowing every sub-line to work as an independent cache line. The number of sub-lines involved when a cache replacement occurs can be determined depending on the characteristics of programs. This paper also evaluates the cost/performance improvements attainable by the VLS cache and compares them with those of conventional cache architectures. As a result, it is observed that a VLS cache reduces the average memory-access time by 16.4% while it increases the hardware cost by only 13%, compared to a conventional direct-mapped cache with fixed 32-byte lines.
Efficient Autoencoder-Based Human Body Communication Transceiver for WBAN (Peer-reviewed, international journal)

Ali, Abdelhay; Inoue, Koji; Shalaby, Ahmed; Sayed, Mohammed Sharaf; Ahmed, Sabah Mohamed

IEEE ACCESS 7 117196 - 117205 2019

Language: English; Publication type: Research paper (scientific journal)

DOI： 10.1109/ACCESS.2019.2936796
Decision Tree Models and Early Splitting Termination in Screen Content Extension of High Efficiency Video Coding (Peer-reviewed, international journal)

Badry, Emad; Inoue, Koji; Sayed, Mohammed Sharaf

IEEE ACCESS 8 143437 - 143452 2020

Language: English; Publication type: Research paper (scientific journal)

DOI： 10.1109/ACCESS.2020.3014163

Books and Other Publications

Low-Power Electronics Design (Low-Power Cache Design: Chap. 25)

V. Moshnyaga and K. Inoue (Role: Co-author)

CRC PRESS January 2004

Language: English; Book type: Scholarly book

Lectures and Oral Presentations

SuperNPU: An Extremely Fast Neural Processing Unit Using Superconducting Logic Devices (International conference)

Koki Ishida, Ilkwon Byun, Ikki Nagaoka, Kousuke Fukumitsu, Masamitsu Tanaka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Jangwoo Kim, and Koji Inoue

IEEE/ACM International Symposium on Microarchitecture (MICRO) October 2020

Event date: October 2020

Language: English; Presentation type: Oral presentation (general)

Venue: online; Country: Japan

Superconductor single-flux-quantum (SFQ) logic family has been recognized as a highly promising solution for the post-Moore's era, thanks to its ultra-fast and low-power switching characteristics. Therefore, researchers have made a tremendous amount of effort in various aspects to promote the technology and automate its circuit design process (e.g., low-cost fabrication, design tool development). However, there has been no progress in designing a convincing SFQ-based architectural unit due to the architects' lack of understanding of the technology's potentials and limitations at the architecture level. In this paper, we present how to architect an SFQ-based architectural unit by providing design principles with an extreme-performance neural processing unit (NPU). To achieve the goal, we first implement an architecture-level simulator to model an SFQ-based NPU accurately. We validate this model using our die-level prototypes, design tools, and logic cell library. This simulator accurately measures the NPU's performance, power consumption, area, and cooling overheads. Next, driven by the modeling, we identify key architectural challenges for designing a performance-effective SFQ-based NPU (e.g., expensive on-chip data movements and buffering). Lastly, we present SuperNPU, our example SFQ-based NPU architecture, which effectively resolves the challenges. Our evaluation shows that the proposed design outperforms a conventional state-of-the-art NPU by 23 times. With free cooling provided as done in quantum computing, the performance per chip power increases up to 490 times. Our methodology can also be applied to other architecture designs with SFQ-friendly characteristics.
Performance Prediction of Large-scale Parallel System and Application using Macro-level Simulation (International conference)

R. Susukita, H. Ando, M. Aoyagi, H. Honda, Y. Inadomi, K. Inoue, S. Ishizuki, Y. Kimura, H. Komatsu, M. Kurokawa, K. Murakami, H. Shibamura, S. Yamamura, Y. Yu

The International Conference for High Performance Computing, Networking, Storage and Analysis (SC08) November 2008

Event date: November 2008

Language: Other; Presentation type: Oral presentation (general)

Venue: Austin; Country: Other
Analyzing and Mitigating the Impact of Manufacturing Variability in Power-Constrained Supercomputing (International conference)

Yuichi Inadomi, Tapasya Patki, Inoue Koji, Mutsumi Aoyagi, Barry Rountree, Martin Schulz, David Lowenthal, Yasutaka Wada, Keiichiro Fukazawa, Masatsugu Ueda, Masaaki Kondo, Ikuo Miyoshi

The International Conference for High Performance Computing, Networking, Storage and Analysis November 2015

Language: English; Presentation type: Oral presentation (general)

Country: United States
Generating and Executing Multi-Exit Custom Instructions for an Adaptive Extensible Processor (International conference)

H. Noori, F. Mehdipour, K. Murakami, K. Inoue, and M. Goudarzi

The European Event for Electronic System Design & Test (DATE'07) April 2007

Language: Other; Presentation type: Oral presentation (general)

Country: France
How many trials do we need for reliable NISQ computing? (International conference)

Teruo Tanimoto, Shuhei Matsuo, Satoshi Kawakami, Yutaka Tabuchi, Masao Hirokawa, and Koji Inoue

The First International Workshop on Quantum Computing: Circuits Systems Automation and Applications July 2020

Event date: June 2021

Language: English; Presentation type: Oral presentation (general)

Venue: online; Country: Japan
Energy Efficient Runahead Execution on a Tightly Coupled Heterogeneous Core (International conference)

Susumu Mashimo, Ryota Shioya, Koji Inoue

International Conference on High Performance Computing in Asia-Pacific Region January 2020

Event date: June 2021

Language: English; Presentation type: Oral presentation (general)

Venue: Fukuoka; Country: Japan
Enhancing a manycore-oriented compressed cache for GPGPU (International conference)

Keitaro Oka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Koji Inoue

International Conference on High Performance Computing in Asia-Pacific Region January 2020

Event date: June 2021

Language: English; Presentation type: Oral presentation (general)

Venue: Fukuoka; Country: Japan
32 GHz 6.5 mW Gate-Level-Pipelined 4-bit Processor using Superconductor Single-Flux-Quantum Logic (International conference)

Koki Ishida, Masamitsu Tanaka, Ikki Nagaoka, Takatsugu Ono, Satoshi Kawakami, Teruo Tanimoto, Akira Fujimaki, Koji Inoue

2020 Symposia on VLSI Technology and Circuits June 2020

Event date: June 2021

Language: English; Presentation type: Oral presentation (general)

Venue: online; Country: Japan
Practical error modeling toward realistic NISQ simulation (International conference)

Teruo Tanimoto, Shuhei Matsuo, Satoshi Kawakami, Yutaka Tabuchi, Masao Hirokawa, and Koji Inoue

The First International Workshop on Quantum Computing: Circuits Systems Automation and Applications July 2020

Event date: June 2021

Language: English; Presentation type: Oral presentation (general)

Venue: online; Country: Japan
Enhancing a manycore-oriented compressed cache for GPGPU (International conference)

Keitaro Oka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Koji Inoue

International Conference on High Performance Computing in Asia-Pacific Region January 2020

Event date: January 2020

Language: English; Presentation type: Oral presentation (general)

Venue: Fukuoka, Japan; Country: Japan
Energy Efficient Runahead Execution on a Tightly Coupled Heterogeneous Core (International conference)

Susumu Mashimo, Ryota Shioya, Koji Inoue

International Conference on High Performance Computing in Asia-Pacific Region January 2020

Event date: January 2020

Language: English; Presentation type: Oral presentation (general)

Venue: Fukuoka, Japan; Country: Japan
Evaluating the Impact of Energy Efficient Networks on HPC Workloads (International conference)

G Georgakoudis, N Jain, T Ono, K Inoue, S Miwa, A Bhatele

26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC) January 2020

Event date: December 2019

Language: English; Presentation type: Oral presentation (general)

Venue: Hyderabad, India; Country: India
An Open Source FPGA-Optimized Out-of-Order RISC-V Soft Processor (International conference)

Susumu Mashimo, Akifumi Fujita, Reoma Matsuo, Seiya Akaki, Akifumi Fukuda, Toru Koizumi, Junichiro Kadomoto, Hidetsugu Irie, Masahiro Goshima, Koji Inoue, Ryota Shioya

IEEE International Conference on Field Programmable Technology December 2019

Event date: December 2019

Language: English; Presentation type: Oral presentation (general)

Venue: Tianjin, China; Country: China
A 48GHz 5.6mW gate-level-pipelined multiplier using single-flux quantum logic (International conference)

Ikki Nagaoka, Masamitsu Tanaka, Koji Inoue, Akira Fujimaki

IEEE International Solid-State Circuits Conference (ISSCC 2019) February 2019

Event date: February 2019

Language: English

Venue: San Francisco; Country: United States
Improving Lifetime in MLC Phase Change Memory using Slow Writes (International conference)

Takatsugu Ono, Zhe Chen and Koji Inoue

International Japan-Africa Conference on Electronics, Communication and Computations December 2018

Event date: December 2018

Language: English

Country: Egypt
Situation-Based Dynamic Frame-Rate Control for On-Line Object Tracking (International conference)

Yusuke Inoue, Takatsugu Ono and Koji Inoue

International Japan-Africa Conference on Electronics, Communication and Computations December 2018

Event date: December 2018

Language: English

Country: Japan
30-GHz Operation of Datapath for Bit-Parallel, Gate-Level-Pipelined Rapid Single-Flux-Quantum Microprocessors (Invited, International conference)

Masamitsu Tanaka, Yuki Hatanaka, Yuichi Matsui, Ikki Nagaoka, Koki Ishida, Kyosuke Sano, Taro Yamashita, Takatsugu Ono, Koji Inoue, Akira Fujimaki

Applied Superconductivity Conference, October 2018

Event date: October 2018

Language: English

Country: Japan
Autoencoder based Features Extraction for Automatic Classification of Earthquakes and Explosions (International conference)

Omar M. Saad, K. Inoue, Ahmed Shalaby, Lotf Samy, and Mohammed S. Sayed

the 17th IEEE/ACIS International Conference on Computer and Information Science, June 2018

Event date: June 2018

Language: English

Country: Japan
Analyzing Resource Trade-offs in Hardware-overprovisioned Supercomputers (International conference)

Ryuichi Sakamoto, Tapasya Patki, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Daniel Ellsworth, Barry Rountree, Martin Schulz

the 32nd International Parallel and Distributed Processing Symposium, May 2018

Event date: May 2018

Language: English

Country: Japan
Power-capped DVFS and thread allocation with ANN models on modern NUMA systems (International conference)

Satoshi Imamura, Hiroshi Sasaki, Inoue Koji, Dimitrios S. Nikolopoulos

IEEE International Conference on Computer Design, October 2014

Event date: October 2014

Language: English

Country: Republic of Korea
Performance evaluations of finite difference applications realized on a single flux quantum circuits-based reconfigurable accelerator

Hiroaki Honda, Farhad Mehdipour, Hiroshi Kataoka, Inoue Koji, Kazuaki J. Murakami

Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2011, APSIPA ASC 2011, December 2011

Event date: October 2011

Language: English

Location: Xi'an; Country: China

Hardware accelerators integrated with general-purpose processors are increasingly employed to achieve lower power consumption and higher processing speed; however, the energy consumption of high-performance accelerators has become a major issue in large-scale parallel computer systems. We have investigated the applicability of Single-Flux-Quantum (SFQ) circuits, a superconducting technology, to high-performance computing systems. Although SFQ devices make it possible to develop extraordinarily low-power processors, conditional branches and loop-back control are difficult to implement with current SFQ technology. We have therefore proposed a Reconfigurable Data-Path (RDP) accelerator that avoids these limitations of SFQ technology while retaining the benefits of these circuits. In this research, we implemented two-dimensional heat (2D-Heat) and finite-difference time-domain (2D-FDTD) applications to investigate the efficiency of the SFQ-RDP accelerator. According to the performance evaluation results for these applications, execution times are 50.6 and 79.0 times shorter than those of a general-purpose processor and comparable with those reported for GPUs (Graphics Processing Units).
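パケットペーシングによる全対全通信の最適化とシミュレーション評価 [Optimization of All-to-All Communication by Packet Pacing and Its Simulation-Based Evaluation]

柴村英智, 三輪英樹, 薄田竜太郎, 平尾智也, 安島雄一郎, 三吉郁夫, 清水俊幸, 石畑宏明, 井上弘士

ハイパフォーマンスコンピューティングと計算科学シンポジウム, January 2011

Event date: January 2011

Language: Other; Presentation type: Oral presentation (general)

Location: Tsukuba; Country: Japan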
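演算/メモリ性能バランスを考慮したマルチコア向けオンチップメモリ貸与法 [An On-Chip Memory Lending Method for Multicores Considering the Balance Between Compute and Memory Performance]

福本尚人, 井上弘士, 村上和彰

ハイパフォーマンスコンピューティングと計算科学シンポジウム, January 2011

Event date: January 2011

Language: Other; Presentation type: Oral presentation (general)

Location: Tsukuba; Country: Japan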
Reducing Preprocessing Overhead Times in a Reconfigurable Accelerator of Finite Difference Applications (International conference)

H. Kataoka, H. Honda, F. Mehdipour, K. Inoue, and K. Murakami

Symposium on Application Accelerators in High Performance Computing (SAAHPC'10), July 2010

Event date: July 2010

Language: Other

Location: Tennessee; Country: Other
A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor (International conference)

Farhad Mehdipour, Hamid Noori, Bahman Javadi, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

The 14th Asia and South-Pacific Design Automation Conference (ASP-DAC 2009), January 2009

Event date: January 2009

Language: Other; Presentation type: Oral presentation (general)

Location: Yokohama; Country: Japan
Analyzing the Impact of Data Prefetching on Chip MultiProcessors (International conference)

N. Fukumoto, T. Mihara, K. Inoue, and K. Murakami

IEEE Asia-Pacific Computer Systems Architecture Conference (ACSAC'08), August 2008

Event date: August 2008

Language: Other

Country: Japan
Energy Efficiency of Configurable Caches via Temperature-Aware Configuration Selection (International conference)

H. Noori, M. Goudarzi, K. Inoue, and K. Murakami

International Symposium on VLSI (ISVLSI'08), August 2008

Event date: August 2008

Language: Other

Country: France
Enhancing Energy Efficiency of Processor-Based Embedded Systems through Post-Fabrication ISA Extension (International conference)

H. Noori, F. Mehdipour, K. Inoue, and K. Murakami

International Symposium on Low Power Electronics and Design (ISLPED'08), August 2008

Event date: August 2008

Language: Other

Country: India
Design Space Exploration for a Coarse Grain Accelerator (International conference)

F. Mehdipour, H. Noori, M. S. Zamani, K. Inoue, and K. Murakami

Asia and South Pacific Design Automation Conference (ASPDAC'08), January 2008

Event date: January 2008

Language: Other

Country: Republic of Korea
Improved Policies for Drowsy Caches in Embedded Processors (International conference)

J. Zushi, G. Zeng, H. Tomiyama, H. Takada, and K. Inoue

International Symposium on Electronics Design, Test & Applications, January 2008

Event date: January 2008

Language: Other

Country: Taiwan
Energy Consumption Evaluation of an Adaptive Extensible Processor (International conference)

H. Noori, F. Mehdipour, M. Goudarzi, S. Yamaguchi, K. Inoue, and K. Murakami

Reconfigurable and Adaptive Architecture Workshop, December 2007

Event date: December 2007

Language: Other

Country: United States
Adaptive Management of Cache Block Replication for High-Performance CMP (International conference)

T. Mihara, K. Inoue, and K. Murakami

Workshop on Chip MultiProcessor: Processor Architecture and Memory Hierarchy related Issues, September 2007

Event date: September 2007

Language: Other

Country: Romania
One-sided Communication Implementation in FMO Method (International conference)

J. Maki, Y. Inadomi, T. Takami, R. Susukita, H. Honda, J. Ooba, T. Kobayashi, R. Nogita, K. Inoue and M. Aoyagi

International Conference on High Performance Computing, Grid and e-Science in Asia Pacific Region, September 2007

Event date: September 2007

Language: Other

Country: Greece
Multi-physics Extension of OpenFMO Framework (International conference)

T. Takami, J. Maki, J. Ooba, Y. Inadomi, H. Honda, R. Susukita, K. Inoue, T. Kobayashi, R. Nogita, and M. Aoyagi

International Conference of Computational Method in Sciences and Engineering, September 2007

Event date: September 2007

Language: Other

Country: Greece
Implementation and Evaluation of Fock Matrix Calculation Program on the Cell Processor (International conference)

H. Honda, T. Hayashi, Y. Inadomi, K. Inoue, and K. Murakami

International Conference of Computational Method in Sciences and Engineering, September 2007

Event date: September 2007

Language: Other

Country: Greece
The Effect of Nanometer-Scale Technologies on the Cache Size Selection for Low Energy Embedded Systems (International conference)

H. Noori, M. Goudarzi, K. Inoue, and K. Murakami

International Conference on Embedded Systems and Applications, June 2007

Event date: June 2007

Language: Other

Country: United States
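メモリアクセスの特徴を活用した高速かつ正確なメモリアーキテクチャ・シミュレーション法 [A Fast and Accurate Memory-Architecture Simulation Method Exploiting Memory-Access Characteristics]

小野貴継, 井上弘士, 村上和彰

先進的計算基盤システムシンポジウム, May 2007

Event date: May 2007

Language: Other; Presentation type: Oral presentation (general)

Country: Japan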
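通信タイミングを考慮した衝突削減のためのMPIランク配置最適化技術 [An MPI Rank Placement Optimization Technique for Reducing Contention with Communication Timing Taken into Account]

森江善之, 末安直樹, 松本透, 南里豪志, 石畑宏明, 井上弘士, 村上和彰

先進的計算基盤システムシンポジウム, May 2007

Event date: May 2007

Language: Other; Presentation type: Oral presentation (general)

Country: Japan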
Dynamic Management Technique to Mitigate Performance Degradation for Low-Leakage Caches (International conference)

R. Komiya, K. Inoue, and K. Murakami

The 10th IEEE Symposium on Low-Power and High-Speed Chips, April 2007

Event date: April 2007

Language: Other; Presentation type: Oral presentation (general)

Location: Yokohama; Country: Japan
Reducing energy consumption of video memory by bit-width compression

Vasily G. Moshnyaga, Koji Inoue, Mizuka Fukagawa

Proceedings of the 2002 International Symposium on Low Power Electronics and Design, January 2002

Event date: August 2002

Language: English

Location: Monterey, CA; Country: United States

A new architectural technique to reduce the energy dissipation of video memory is proposed. Unlike existing approaches, the technique exploits the pixel correlation in video sequences, dynamically adjusting the memory bit-width to the number of bits changed per pixel. Instead of treating the data bits independently, we group the most significant bits together, activating the corresponding group of bit-lines adaptively according to data variation. The method is neither restricted to specific bit patterns nor dependent on the storage phase. It works equally well on read and write accesses, as well as during precharging. Simulation results show that this method can reduce the total energy consumption of video memory by 20% without affecting the picture quality.
A history-based i-cache for low-energy multimedia applications

Koji Inoue, V. G. Moshnyaga, K. Murakami

Proceedings of the 2002 International Symposium on Low Power Electronics and Design

Event date: August 2002

Language: English

Location: Monterey, CA; Country: United States

This paper proposes a history-based tag-comparison scheme for reducing the energy consumption of direct-mapped instruction caches. The proposed cache efficiently exploits program-execution footprints recorded in the Branch Target Buffer (BTB), and attempts to detect and eliminate unnecessary tag checks at run time. Simulation results show that our approach can eliminate up to 95% of tag checks, saving 17% of the cache energy, while affecting the processor performance by only 0.2%.
Way-predicting set-associative cache for high performance and low energy consumption

Koji Inoue, Tohru Ishihara, Kazuaki Murakami

Proceedings of the 1999 International Symposium on Low Power Electronics and Design (ISLPED)

Event date: August 1999

Language: English

Location: San Diego, CA, USA; Country: Other

This paper proposes a new approach using way prediction for achieving high performance and low energy consumption of set-associative caches. By accessing only a single predicted cache way, instead of accessing all the ways in a set, the energy consumption can be reduced. This paper shows that the way-predicting set-associative cache improves the ED (energy-delay) product by 60-70% compared to a conventional set-associative cache.
Dynamically variable line-size cache exploiting high on-chip memory bandwidth of merged DRAM/logic LSIs

Inoue Koji, Koji Kai, Kazuaki Murakami

Proceedings of the 1999 5th International Symposium on High-Performance Computer Architecture, HPCA, January 1999

Event date: January 1999

Language: English

Location: Orlando, FL, USA; Country: Other

This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, which is called "dynamically variable line-size cache (D-VLS cache)". The D-VLS cache can optimize its line size according to the characteristics of programs, and attempts to improve the performance by exploiting the high on-chip memory bandwidth. In our evaluation, it is observed that the performance improvement achieved by a direct-mapped D-VLS cache is about 27%, compared to a conventional direct-mapped cache with fixed 32-byte lines.
3D memory architecture (Invited, International conference)

Koji Inoue

D43D: 3rd Design for 3D Silicon Integration Workshop, June 2011

Language: Other; Presentation type: Oral presentation (general)

Country: France
Adaptive Execution on 3D Microprocessors (Invited, International conference)

Koji Inoue

11th International Forum on Embedded MPSoC and Multicore, July 2011

Language: Other; Presentation type: Oral presentation (general)

Country: France
Performance Evaluation of 3D Stacked Multi-Core Processors with Temperature Consideration (International conference)

T. Hanada, H. Sasaki, K. Inoue and K. Murakami

International 3D System Integration Conference, January 2012

Language: Other

Country: Japan
A Thermal-Aware Mapping Algorithm for Reducing Peak Temperature of an Accelerator Deployed in a 3D Stack (International conference)

F. Mehdipour, K. C. Nunna, L. Gauthier, K. Inoue and K. Murakami

International 3D System Integration Conference, January 2012

Language: Other

Country: Japan
Efficient Barrier Synchronization for 2D Meshed NoC-based Many-core Processors (International conference)

Lovic Gauthier, Farhad Mehdipour, Koji Inoue, Shinya Ueno, Hiroshi Sasaki

The 17th Workshop on Synthesis And System Integration of Mixed Information technologies, March 2012

Language: Other

Country: Japan
Optimizing Power-Performance Trade-off for Parallel Applications through Dynamic Core-count and Frequency Scaling (International conference)

Satoshi Imamura, Hiroshi Sasaki, Naoto Fukumoto, Koji Inoue, and Kazuaki Murakami

2nd Workshop on Runtime Environments/Systems, Layering, and Virtualized Environments (RESoLVE '12), March 2012

Language: Other; Presentation type: Oral presentation (general)

Country: United Kingdom
On the Power and Performance Analysis of GPU-Accelerated Systems (International conference)

Yuki Abe, Hiroshi Sasaki, Inoue Koji, Kazuaki Murakami, Shinpei Kato

Poster session, 2012 USENIX Annual Technical Conference, June 2012

Language: English; Presentation type: Symposium/workshop panel (open call)

Country: United States
SMYLE: Scalable Many-core for Low-Energy computing (Invited) (Invited, International conference)

Koji Inoue and Masaaki Kondo

12th International Forum on Embedded MPSoC and Multicore, July 2012

Language: Other

Country: Japan
A Three-Dimensional Integrated Accelerator (International conference)

Farhad Mehdipour, Krishna Chaitanya Nunna, Inoue Koji, Kazuaki Murakami

Euromicro Conference on Digital System Design, September 2012

Language: English; Presentation type: Symposium/workshop panel (open call)

Country: Turkey
Scalability-Based Manycore Partitioning (International conference)

Hiroshi Sasaki, Teruo Tanimoto, Koji Inoue, and Hiroshi Nakamura

International Conference on Parallel Architectures and Compilation Techniques, September 2012

Language: Other; Presentation type: Symposium/workshop panel (open call)

Country: United States
Power and Performance Analysis of GPU-Accelerated Systems (International conference)

Yuki Abe, Hiroshi Sasaki, Martin Peres, Inoue Koji, Kazuaki Murakami, Shinpei Kato

Workshop on Power-Aware Computing and Systems, October 2012

Language: English; Presentation type: Symposium/workshop panel (open call)

Country: United States
Task Mapping Techniques for Embedded Many-core SoCs (International conference)

Junya Kaida, Takuji Hieda, Ittetsu Taniguchi, Hiroyuki Tomiyama, Yuko Hara-Azumi, Inoue Koji

International SoC Design Conference, November 2012

Language: English; Presentation type: Symposium/workshop panel (open call)

Country: Republic of Korea
SMYLEref: A Reference Architecture for Manycore-Processor SoCs (International conference)

Masaaki Kondo, Son Truong Nguyen, Takeshi Soga, Tomoya Hirao, Hiroshi Sasaki, Inoue Koji

Asia and South Pacific Design Automation Conference (ASP-DAC), January 2013

Language: English

Country: Japan
SMYLE Project: Toward High-Performance, Low-Power Computing on Manycore-Processor SoCs

Inoue Koji

Asia and South Pacific Design Automation Conference (ASP-DAC), January 2013

Language: English

Country: Japan
Line Sharing Cache: Exploring Cache Capacity with Frequent Line Value Locality (International conference)

Keitaro Oka, Hiroshi Sasaki, Inoue Koji

Asia and South Pacific Design Automation Conference, January 2013

Language: English; Presentation type: Symposium/workshop panel (open call)

Country: Japan
メニーコアプロセッサにおける実時間モデル予測制御のための投機実行法

川上哲志, 岩永明人, 井上弘士

先進的計算基盤システムシンポジウム論文集, May 2013

Language: Japanese

Country: Japan

Speculative Execution for Real-time Model Predictive Control on Manycore Processor
Many-core Acceleration for Model Predictive Control Systems (International conference)

Satoshi Kawakami, Akihito Iwanaga, Inoue Koji

International Workshop on Manycore Embedded Systems, June 2013

Language: English

Country: Japan
Coordinated Power-Performance Optimization in Manycores (International conference)

Hiroshi Sasaki, Satoshi Imamura, Inoue Koji

the 22nd International Conference on Parallel Architectures and Compilation Techniques, September 2013

Language: English

Country: Japan
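フレームレートの動的最適化に基づく低消費エネルギー物体追跡システムの提案 (集積回路デザインガイア2013 : VLSI設計の新しい大地) [Proposal of a Low-Energy Object Tracking System Based on Dynamic Frame-Rate Optimization]

江川瀬里奈, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, November 2013

Language: Japanese

Country: Japan

Online object tracking, which estimates the coordinates of a specified target object in each frame of a video, is widely applied to automotive safety technologies such as obstacle tracking and drowsiness detection, and has become an important technology. Applications on battery-powered mobile platforms have recently expanded, so low energy consumption must be achieved along with improved tracking accuracy. This paper therefore proposes a dynamic frame-rate optimization scheme aimed at reducing the energy consumption of object tracking systems. The scheme dynamically switches to the optimal frame rate based on the energy consumption of the whole tracking system, which reduces the energy spent acquiring and processing more frames than necessary. Implementation and evaluation using an energy consumption model show that the scheme reduces energy consumption by more than 70% at a tracking accuracy comparable to that of the conventional scheme.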
Performance and Power Consumption Evaluation of MHD Simulation for Magnetosphere on Parallel Computer System with CPU Power Capping (International conference)

FUKAZAWA Keiichiro, Tomonori Tsuhata, Kyohei Yoshida, Masakazu Kuze, Masatsugu Ueda, Yuichi Inadomi, Inoue Koji

Extreme Green & Energy Efficiency in Large Scale Distributed Systems, May 2014

Language: English

Country: Netherlands
Power and Performance Characterization and Modeling of GPU-accelerated Systems (International conference)

Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Inoue Koji, Masato Edahiro, Martin Peres

the 28th IEEE International Parallel & Distributed Processing Symposium, May 2014

Language: English

Country: Japan
A flexible hardware barrier mechanism for many-core processors (International conference)

Takeshi Soga, Hiroshi Sasaki, Tomoya Hirao, Masaaki Kondo, Inoue Koji

Asia and South Pacific Design Automation Conference, January 2015

Language: English; Presentation type: Oral presentation (general)

Country: Japan
物体追跡システムの低消費エネルギー化を目的とした動的フレームレート制御法 (集積回路)

井上優良, 小野貴継, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, December 2015

Language: Japanese

Country: Japan

Dynamic Frame-rate Optimization for Low Energy Object Tracking
物体追跡システムの低消費エネルギー化を目的とした動的フレームレート制御法 (電子部品・材料)

井上優良, 小野貴継, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, December 2015

Language: Japanese

Country: Japan

Dynamic Frame-rate Optimization for Low Energy Object Tracking
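モデル予測制御を対象としたメニーコアプロセッサ向け投機実行法の制御性能評価 (VLSI設計技術) [Control-Performance Evaluation of a Speculative Execution Method for Manycore Processors Targeting Model Predictive Control]

藤井卓, 小野貴継, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, January 2016

Language: Japanese

Country: Japan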
光パスゲート論理に基づく並列加算回路の提案と光電混載回路シミュレータによる動作検証 (回路とシステム)

石原亨, 新家昭彦, 井上弘士, 野崎謙悟, 納富雅也

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, June 2016

Language: Japanese

Country: Japan

A Parallel Adder Circuit based on Optical Pass-gate Logic and Its Evaluation with Optoelectronic Circuit Simulator
受信信号強度を用いたデバイス認証方式における攻撃可能条件の定式化 (コンピュータシステム)

藤井達也, 小野貴継, 金谷晴一, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, August 2016

Language: Japanese

Country: Japan

Formulating Attack Condition on Received Signal Strength Indicator based Device Authentication
Single-Flux-Quantum Cache Memory Architecture (International conference)

Koki Ishida, Masamitsu Tanaka, Takatsugu Ono, Inoue Koji

International SoC Design Conference, October 2016

Language: English; Presentation type: Oral presentation (general)

Country: Republic of Korea
単一磁束量子回路を用いたシフトレジスタ型キャッシュメモリ・アーキテクチャの提案 (電子部品・材料) -- (デザインガイア2016 : VLSI設計の新しい大地)

石田浩貴, 田中雅光, 小野貴継, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, November 2016

Language: Japanese

Country: Japan

Shift-Register-Based Single-Flux-Quantum Cache Memory Architecture
Power-Efficient Breadth-First Search with DRAM Row Buffer Locality-Aware Address Mapping (International conference)

Satoshi Imamura, Yuichiro Yasui, Inoue Koji, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

the 1st High Performance Graph Data Management and Processing workshop, November 2016

Language: English; Presentation type: Oral presentation (general)

Country: United States
Evaluating the Impacts of Code-Level Performance Tunings on Power Efficiency (International conference)

Satoshi Imamura, Keitaro Oka, Yuichiro Yasui, Yuichi Inadomi, Katsuki Fujisawa, Toshio Endo, Koji Ueno, Keiichiro Fukazawa, Nozomi Hata, Yuta Kakibuka, Inoue Koji, Takatsugu Ono

IEEE International Conference on Big Data, December 2016

Language: English; Presentation type: Oral presentation (general)

Country: United States
Production Hardware Overprovisioning: Real-world Performance Optimization using an Extensible Power-aware Resource Management Framework (International conference)

Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Tapasya Patki, Daniel Ellsworth, Barry Rountree, and Martin Schulz

IEEE International Parallel & Distributed Processing Symposium (IPDPS 2017), May 2017

Language: English; Presentation type: Oral presentation (general)

Country: United States
High-Throughput Bit-Parallel Arithmetic Logic Unit Using Rapid Single-Flux-Quantum Logic (International conference)

Masamitsu Tanaka, Ryo Sato, Yuki Hatanaka, Yuichi Matsui, Hiroyuki Akaike, Akira Fujimaki, Koki Ishida, Takatsugu Ono, Koji Inoue

International Superconductive Electronics Conference, June 2017

Language: English; Presentation type: Oral presentation (general)

Country: Italy
単一磁束量子ゲートレベルパイプラインマイクロプロセッサに向けた要素回路設計 (超伝導エレクトロニクス)

畑中湧貴, 松井裕一, 田中雅光, 佐野京佑, 藤巻朗, 石田浩貴, 小野貴継, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, August 2017

Language: Japanese

Country: Japan

Design of Component Circuits for Rapid Single-Flux-Quantum Gate-Level-Pipelined Microprocessors
CPCI Stack: Metric for Accurate Bottleneck Analysis on OoO Microprocessors (International conference)

Teruo Tanimoto, Takatsugu Ono, Koji Inoue

International Symposium on Computing and Networking, November 2017

Language: English; Presentation type: Oral presentation (general)

Country: Japan
Wireless Spoofing-Attack Prevention Using Radio-Propagation Characteristics (International conference)

Mihiro Sonoyama, Takatsugu Ono, Osamu Muta, Haruichi Kanaya, Koji Inoue

IEEE International Conference on Dependable, Autonomic and Secure Computing, November 2017

Language: English; Presentation type: Oral presentation (general)

Country: United States
A Low Power I-Cache Design with Tag-Comparison Reuse (International conference)

K. Inoue, H. Tanaka, V. Moshnyaga, K. Murakami

The International Symposium on System-On-Chip, November 2004

Language: Other; Presentation type: Oral presentation (general)

Location: Tampere; Country: Finland
Quantitative Evaluation of Leakage Reduction Algorithm for L1 Data Caches (International conference)

R. Komiya, K. Inoue, V. Moshnyaga, K. Murakami

The International SoC Design Conference (ISOCC), October 2004

Language: Other

Location: the Convention and Exhibition Center (COEX); Country: Republic of Korea
Energy-Security Tradeoff in a Secure Cache Architecture Against Buffer Overflow Attacks (International conference)

Koji Inoue

Workshop on Architectural Support for Security and Anti-Virus (WASSA), October 2004

Language: Other; Presentation type: Oral presentation (general)

Location: Park Plaza Hotel; Country: United States

MISC

Way-Predicting Set-Associative Cache for High Performance and Low Energy Consumption (Peer-reviewed)

Koji Inoue, Tohru Ishihara, Kazuaki J. Murakami

Proceedings of International Symposium on Low Power Electronics and Design (ISLPED'99), August 1999

Language: Other

DOI: 10.1145/313817.313948
Dynamically variable line-size cache exploiting high on-chip memory bandwidth of merged DRAM/Logic LSIs (Peer-reviewed)

K Inoue, K Kai, K Murakami

FIFTH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, PROCEEDINGS, January 1999

Language: English

This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, which is called "dynamically variable line-size cache (D-VLS cache)". The D-VLS cache can optimize its line size according to the characteristics of programs, and attempts to improve the performance by exploiting the high on-chip memory bandwidth. In our evaluation, it is observed that the performance improvement achieved by a direct-mapped D-VLS cache is about 27%, compared to a conventional direct-mapped cache with fixed 32-byte lines.

DOI: 10.1109/HPCA.1999.744366
極低温不揮発FPGAを対象とした誤り耐性量子コンピュータ向け表面符号復号器のRTL設計 [RTL Design of a Surface Code Decoder for Fault-Tolerant Quantum Computers Targeting a Cryogenic Non-Volatile FPGA]

中村徹舟, 宮村信, 井上弘士, 川上哲志, 阪本利司, 多田宗弘, 谷本輝夫

情報処理学会論文誌コンピューティングシステム（ACS）, 17(1), 13-25, March 2024 (ISSN: 1882-7829)

Language: Japanese

Since the error rates of existing quantum devices are high, it is essential to realize quantum error correction (QEC) techniques. In particular, surface code (SC) has attracted attention as one of the most promising error-correcting codes. In this study, we designed an RTL surface code decoder using the iterative greedy algorithm, targeting implementation on NanoBridge-FPGA, which can operate in cryogenic environments. The designed decoder was verified using the same error simulator as in the previous study, and its latency and resource usage were also evaluated. In addition, we performed logic synthesis, placement, and routing targeting NanoBridge-FPGA and confirmed the resource usage.
極低温不揮発FPGAを対象とした誤り耐性量子コンピュータ向け表面符号復号器のRTL設計 [RTL Design of a Surface Code Decoder for Fault-Tolerant Quantum Computers Targeting a Cryogenic Non-Volatile FPGA]

中村徹舟, 宮村信, 井上弘士, 川上哲志, 阪本利司, 多田宗弘, 谷本輝夫

情報処理学会研究報告(Web), 2023(ARC-252), 2023
通信量に着目したQAOA向け極低温NISQコンピューティングのアーキテクチャ検討 [An Architectural Study of Cryogenic NISQ Computing for QAOA Focusing on Communication Volume]

富田祐永, 上野洋典, 谷本輝夫, 田中雅光, 井上弘士, 中村宏

情報処理学会研究報告(Web), 2022(ARC-250), 2022
単一磁束量子回路に基づくゲートレベルパイプライン浮動小数点演算器の動作実証 [Demonstration of a Gate-Level-Pipelined Floating-Point Unit Based on Single-Flux-Quantum Circuits]

長岡一起, 加島亮太, 田中雅光, 川上哲志, 谷本輝夫, 山下太郎, 井上弘士, 藤巻朗

電子情報通信学会大会講演論文集(CD-ROM), 2022, 2022 (ISSN: 1349-144X)
単一磁束量子プロセッサ向けキャッシュメモリ構成法の検討と定量的評価 [Study and Quantitative Evaluation of Cache Memory Organizations for Single-Flux-Quantum Processors]

鴨志田圭吾, 石川伊織, 羽野祐太, 川上哲志, 谷本輝夫, 小野貴継, 田中雅光, 藤巻朗, 井上弘士

情報処理学会研究報告(Web), 2022(ARC-249), 2022
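光パスゲート論理に基づく超低遅延光回路—特集集積ナノフォトニクス研究の最前線 [Ultra-Low-Latency Optical Circuits Based on Optical Pass-Gate Logic (Feature: Frontiers of Integrated Nanophotonics Research)]

新家昭彦, 石原亨, 井上弘士, 野崎謙悟, 納富雅也

NTT技術ジャーナル / 日本電信電話株式会社編, May 2018

Language: Japanese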
アウトオブオーダ命令実行の依存グラフ表現に関する考察 [A Study on Dependence-Graph Representations of Out-of-Order Instruction Execution]

谷本輝夫, 佐々木広, 小野貴継, 井上弘士

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, August 2016

Language: Japanese
マルチスケールフィルタ向けアクセラレータ・アーキテクチャの提案 [Proposal of an Accelerator Architecture for Multi-Scale Filters]

上野伸也, Gauthier Lovic Eric, 井上弘士, 村上和彰

研究報告システムLSI設計技術（SLDM）, October 2012

Language: Japanese

Image recognition is used in many fields, creating demand for processors that run image recognition applications with high performance and low energy. Filter processing accounts for most of the execution time of these applications, so architectures that arrange arithmetic units in an array, such as GRAPE-DR, are well suited. However, filter sizes differ from one operation to the next, so when the units that exchange data with memory are fixed to the top and bottom rows, as in conventional designs, few units can operate at once. This paper therefore proposes a DSP (Data Stream Processing) tile accelerator architecture with flexible data input and output to memory. The accelerator integrates a large number of DSPTiles, each able to execute a small filter operation and to communicate with memory, and each DSPTile is connected so that it can pass its results to other DSPTiles. This makes it possible to run several small filter operations in parallel or to run one large filter operation. The paper determines the detailed architecture while taking area overhead into account.

Image recognition processing includes a number of filter operations that dominate the total execution time. Exploiting an ALU array to accelerate the filter operations is one of the most promising approaches to energy-efficient execution. However, it is difficult for a conventional ALU array accelerator to achieve high performance and low energy for multi-scale filter operations. To tackle this issue, this paper proposes a DSP (Data Stream Processing) tile accelerator for multi-scale filter operations. The tile accelerator has many DSP tiles, each of which can execute a small filter efficiently. Each DSP tile is connected to three-dimensionally implemented scratch-pad memories via TSVs.
キャッシュウェイ割り当てとコード配置の同時最適化によるメモリアクセスエネルギーの削減 [Reducing Memory Access Energy by Simultaneous Optimization of Cache Way Allocation and Code Placement]

高田純司, 石原亨, 井上弘士

研究報告システムLSI設計技術（SLDM）, October 2011

Language: Japanese

This paper proposes a technique that simultaneously finds the optimal cache way allocation and code placement for multiple tasks running on a single-core processor. It reduces the energy consumption of a set-associative cache by activating only a single allocated cache way at a time and deactivating the remaining ways. The technique also reduces the number of cache misses by changing the code placement in main memory, which reduces main-memory energy consumption as well as total execution time. Experiments using a commercial embedded processor demonstrate that the technique reduces the total energy consumption of the target processor system by up to 17% compared with the same system without the technique.
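キャッシュウェイ割り当てとコード配置の同時最適化によるメモリアクセスエネルギーの削減 [Reducing Memory Access Energy by Simultaneous Optimization of Cache Way Allocation and Code Placement]

高田純司, 石原亨, 井上弘士

電子情報通信学会技術研究報告. ICD, 集積回路, October 2011

Language: Japanese

This paper proposes a technique that simultaneously optimizes cache way allocation and code placement for multiple tasks running on a single-core processor. Activating only the one allocated cache way at a time, and leaving the remaining ways inactive, reduces the access energy of a set-associative cache. Changing where program code is placed in main memory reduces the number of cache misses, which in turn reduces main-memory access energy and shortens total execution time. Experiments on a commercial processor show that a processor system using the technique consumes up to 17% less energy than the same system without it.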
マルチコア向けオンチップメモリ貸与法における実行コード生成法の改善 (集積回路)

福本尚人, 今里賢一, 井上弘士

電子情報通信学会技術研究報告, January 2010

Language: Japanese

Improving execution code generation for on-chip memory lending on multicores
3次元DRAM-プロセッサ積層実装を対象としたオンチップ・メモリ・アーキテクチャの提案と評価 [Proposal and Evaluation of an On-Chip Memory Architecture for 3D DRAM-Processor Stacking]

橋口慎哉, 小野貴継, 井上弘士, 村上和彰

研究報告システムソフトウェアとオペレーティング・システム（OS）, April 2009

Language: Japanese

In this paper, we propose a new cache architecture for DRAM-stacked processors that achieves high memory performance without a large footprint overhead. 3D-stacked DRAM caches can dramatically reduce off-chip memory accesses. However, this approach degrades performance in some cases because a larger cache has a longer access time. To solve this problem, our approach selectively uses the stacked DRAM cache according to the working-set size of the running program. Our quantitative evaluation with benchmark programs shows that the proposed approach achieves an average memory performance gain of 35% with the static control method and 43% with the dynamic control method.
演算／メモリ性能バランスを考慮した Cell／B.E. 向けオンチップ・メモリ活用法とその評価 [An On-Chip Memory Utilization Method for Cell/B.E. Considering Compute/Memory Performance Balance, and Its Evaluation]

林徹生, 福本尚人, 今里賢一, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC）, May 2008

Language: Japanese

We have proposed the concept of performance balancing to improve CMP performance. This approach uses the on-chip cores not only to execute parallelized threads but also to improve memory performance, assigning each core to computation or to memory-performance improvement as needed. Deciding how many cores to dedicate to memory-performance improvement is therefore very important. In this paper, we propose a core allocation algorithm based on performance modeling and implement it on a Cell/B.E. processor. In our evaluation with benchmark programs, the approach achieves up to 14.5% performance improvement over simple parallel execution.
演算／メモリ性能バランスを考慮した CMP 向けヘルパースレッド実行方式の提案と評価 [Proposal and Evaluation of a Helper-Thread Execution Scheme for CMPs Considering Compute/Memory Performance Balance]

今里賢一, 福本尚人, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC）, May 2008

Language: Japanese

Conventional CMPs attempt to exploit thread-level parallelism (TLP) by using all of the cores integrated on a chip. However, this straightforward approach does not always achieve the best performance, because the memory-wall problem becomes more critical in CMPs, resulting in poor performance in spite of high TLP. To solve this issue, we propose an efficient thread management technique called performance balancing: we deliberately throttle the TLP and use some cores to execute software prefetchers as helper threads. Assuming the optimal number of helper threads is known, our experimental results show up to 47% speedup compared with conventional parallel execution.
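Proposal and Evaluation of a Helper-Thread Execution Scheme for CMPs That Considers the Balance between Computation and Memory Performance

今里賢一, 福本尚人, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2008年5月

Language: Japanese

Chip multiprocessors (CMPs), which integrate multiple processor cores on a single chip, are attracting attention because on-chip thread-level parallelism yields high computing performance. However, limited memory bandwidth and the higher memory access frequency caused by multiple cores make the memory-wall problem more serious, so effective performance drops for parallel programs that make many memory references. To improve CMP performance, this paper proposes a helper-thread execution scheme that considers the balance between computation and memory performance. Conventional schemes run the parallel program on all available cores to maximize thread-level parallelism. The proposed scheme instead assigns some cores to helper threads that perform prefetching. Assuming that the optimal number of helper threads is known, the evaluation shows a performance improvement of up to 47% over the conventional scheme.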
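Quantitative Analysis of Memory Load Variation on Chip Multiprocessors

山口光章, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2008年5月

Language: Japanese

Chip multiprocessors (CMPs), which integrate multiple cores on a single chip, are attracting attention because on-chip parallel processing achieves high computing performance. In a CMP with a shared cache, each core can use up to the full cache capacity. On the other hand, when multiple applications run simultaneously, accesses from multiple cores increase conflict misses, so the performance of a program on one core is strongly affected by the characteristics of the programs running on other cores. Optimizing the shared cache is therefore essential to get the most out of a CMP. This paper quantitatively analyzes the memory load variation caused by running multiple applications simultaneously on a shared-cache CMP. The analysis shows that, during a program's execution, there are periods in which the memory load increase due to conflict misses with other cores is large and periods in which it is small.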
Quantitative Analysis of Memory Load Variation on Chip Multiprocessors

山口光章, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2008年5月

Language: Japanese

Chip multiprocessors (CMPs), which integrate multiple cores on a single chip, are one of the most promising approaches to achieving high performance and low power consumption at the same time. In a CMP with a shared L2 cache, each core can use up to the full cache capacity. On the other hand, when multiple applications run simultaneously, all cores share the limited cache resource and conflict misses increase, so the performance of a program on one core is strongly affected by the characteristics of the programs running on other cores. Optimizing the shared cache is therefore essential to get the most out of a CMP. This paper quantitatively analyzes the memory load variation caused by running multiple applications simultaneously on a shared-cache CMP. Observing the transition of the CPI stack makes it possible to examine memory behavior in detail. The analysis shows that, during a program's execution, there are periods in which the memory load increase due to conflict misses with other cores is large and periods in which it is small.
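Proposal and Evaluation of a Dynamic Control Method for the Number of Concurrently Executed Transactions in Transactional Memory

武田進, 島崎慶太, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2008年5月

Language: Japanese

This paper proposes a method that dynamically controls the number of transactions executed in parallel in order to improve the performance of transactional memory. Parallel programs generally need mutual exclusion for accesses to shared variables. Transactional memory achieves high performance by allowing multiple threads to access shared variables concurrently. However, when a conflict occurs among accesses to shared variables, the affected execution must be aborted and the transaction re-executed. As a result, the expected parallel speedup is not obtained, and performance may even degrade. To solve this problem, the proposed method detects in advance the likelihood of re-execution and, when necessary, throttles the number of transactions executed in parallel. An evaluation assuming a 32-core chip multiprocessor shows a speedup of up to about 1.6x.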
Proposal and Evaluation of a Dynamic Control Method for the Number of Concurrently Executed Transactions in Transactional Memory

武田進, 島崎慶太, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2008年5月

Language: Japanese

This paper proposes a method that dynamically controls the number of transactions executed in parallel in order to improve the performance of transactional memory on CMPs. Parallel programs generally need mutual exclusion for accesses to shared variables. Transactional memory achieves high performance by allowing multiple threads to access shared data concurrently, which lets thread-level parallelism be exploited aggressively. However, when a conflict occurs, the affected thread's execution must be aborted and the transaction re-executed to guarantee correct results. As a result, the expected parallel speedup is not obtained, and performance may even degrade. To solve this problem, the proposed adaptive mechanism detects in advance the likelihood of re-execution and throttles or un-throttles the number of transactions executed in parallel as needed. An evaluation assuming a 32-core chip multiprocessor shows a speedup of up to about 1.6x.
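An On-Chip Memory Utilization Method for Cell/B.E. That Considers the Balance between Computation and Memory Performance, and Its Evaluation

林徹生, 福本尚人, 今里賢一, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2008年5月

Language: Japanese

We have been proposing a compute/memory performance balancing technique to improve the performance of chip multiprocessors. The technique uses the cores on a chip either for computation or for improving memory performance, as needed, so deciding an appropriate core allocation is critical. This paper proposes a core allocation method based on performance modeling, implements it on a Cell/B.E. processor, and evaluates its effectiveness. A quantitative evaluation with benchmark programs shows a performance improvement of up to 14.5% over simple parallel execution.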
Evaluation of Task Allocation Optimization for Reducing Communication Contention

森江善之, 南里豪志, 石畑宏明, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2008年3月

Language: Japanese

This paper evaluates task allocation optimization for avoiding communication contention, which is a major cause of communication performance degradation. The authors have been studying task allocation optimization that avoids contention while controlling message timing, and have shown that the method is effective when applied to a tree topology. For network topologies such as mesh and torus, on the other hand, task allocation optimizations that use the number of hops as the evaluation function have been studied and are considered effective. We therefore experimentally compare, on a 3D mesh network, task allocation optimization that uses the number of contentions as the evaluation function with optimization that uses the number of hops, and discuss the differences.
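The two evaluation functions compared in this abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: messages are routed with dimension-order (x, then y, then z) routing on the 3D mesh, all listed messages are in flight at the same time, and a contention is counted each time an extra message shares a directed link. The names `route` and `evaluate` are hypothetical.

```python
from collections import Counter

def route(src, dst):
    """Dimension-order route on a 3D mesh (assumed routing); returns directed links."""
    links, cur = [], list(src)
    for d in range(3):
        while cur[d] != dst[d]:
            nxt = cur.copy()
            nxt[d] += 1 if dst[d] > cur[d] else -1
            links.append((tuple(cur), tuple(nxt)))
            cur = nxt
    return links

def evaluate(placement, messages):
    """Return (total hops, contention count) for one task placement.

    placement: task id -> (x, y, z) node coordinate
    messages:  list of (source task, destination task) sent simultaneously
    """
    usage = Counter()
    hops = 0
    for src_task, dst_task in messages:
        path = route(placement[src_task], placement[dst_task])
        hops += len(path)
        usage.update(path)
    # Each message beyond the first on a link counts as one contention.
    contentions = sum(n - 1 for n in usage.values() if n > 1)
    return hops, contentions
```

For example, with tasks 0 to 3 placed in order along the x axis and messages `[(0, 2), (1, 3)]`, `evaluate` reports 4 hops and 1 contention, while swapping the nodes of tasks 1 and 2 gives 2 hops and no contention. An optimizer could minimize either component of the returned pair, which is the kind of choice the abstract contrasts.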
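Highly Reliable Microprocessor Architectures

井上弘士

日本信頼性学会誌 : 信頼性 = The journal of Reliability Engineering Association of Japan 2008年1月

Language: Japanese

In recent years, the declining reliability of microprocessors, the brains of computers, has drawn attention as an extremely serious problem. While advances in fabrication technology have brought dramatic performance gains, reduced fault tolerance has made processors more susceptible to external and internal noise. As a result, a computer may fail to execute a program correctly even though the system itself has no defect. Against this background, various architectural techniques for improving microprocessor reliability have been proposed in recent years. This article organizes the strategies for improving reliability and explains architecture-level reliability techniques, taking trends in commercial microprocessors into account.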
Proposal of an On-Chip Memory Lending Method for CMPs That Considers the Balance between Computation and Memory Performance

林徹生, 今里賢一, 井上弘士, 村上和彰

情報処理学会研究報告組込みシステム（EMB） 2008年1月

Language: Japanese

Chip multiprocessors (CMPs) can improve performance through parallel processing. However, main memory access is much slower than the processor cores, and resources must be shared among cores, so main memory access is a major factor limiting processor performance. Improving overall system performance therefore requires improving both the parallelization efficiency of computation on the cores and memory performance. This paper proposes a cooperative core execution scheme for SPM-based CMPs, which have software-controllable on-chip memories, based on memory lending: some cores lend their local memory to other cores. By distributing core resources between computation and memory performance in a balanced way, the scheme aims to improve total performance. Implementing the Himeno benchmark on the Cell Broadband Engine shows a performance improvement of up to 13% over simple parallel processing.
A hybrid design space exploration approach for a coarse-grained reconfigurable accelerator (System LSI Design Methodology)

Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki Murakami

情報処理学会研究報告システムLSI設計技術（SLDM） 2008年1月

Language: English

The many parameters involved in designing a reconfigurable accelerator for embedded systems lead to considerable complexity and a large design space. One effective technique is design space exploration, which can find the right balance among the design parameters. A quantitative design approach, which uses data collected from applications, is an alternative; however, it is time-consuming, depends heavily on the designer's observations and analyses, and may not lead to an optimal design. This paper introduces a hybrid approach that uses an analytical method to explore the design space of a reconfigurable accelerator and determines a good design point based on quantitative data collected from the target applications. It also provides flexibility for applying new design constraints and new application characteristics. Furthermore, the approach is methodical, which reduces design time and yields a design that satisfies the design goals. Experimental results show the efficacy of the hybrid approach.
Dependable Processors Supporting the Information Society

井上弘士

情報処理学会研究報告システムLSI設計技術（SLDM） 2007年10月

Language: Japanese

This paper introduces research on secure processors that aim to improve security at the architecture level. In recent years, computer viruses and information leaks have become extremely serious social problems. Many security techniques have been developed at the network level and the system software level, and they are being put into practical use. Nevertheless, threats to computer systems continue to increase. Since microprocessors were developed in the early 1970s, their performance has improved steadily as transistor integration has increased. With the spread of portable devices, power and energy consumption have also been reduced. However, there has been little discussion of security, and the need to consider security at the architecture level has been seriously recognized only since 2000. In the future, security will be an extremely important design constraint, alongside high performance and low power consumption.
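PSI-NSIM: A Parallel Interconnection Network Simulator for Performance Analysis of Large-Scale Parallel Systems

柴村英智, 薄田竜太郎, 本田宏明, 稲富雄一, 于雲青, 井上弘士, 青柳睦

電子情報通信学会技術研究報告. CPSY, コンピュータシステム 2007年10月

Language: Japanese

This paper describes PSI-NSIM, an interconnection network simulator for the design, development, and performance analysis of large-scale parallel systems. PSI-NSIM runs simulations based on a specification file that describes the interconnection network under evaluation and on communication profiles generated from application executions. In addition to the information needed to evaluate the target interconnection network, it predicts overall system performance quickly and accurately and outputs information for application performance analysis and visualization. This paper reports on the implementation of the simulator and on a performance evaluation of an existing cluster system.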
MPI Rank Allocation Optimization for Reducing Contention by Considering Communication Timing

森江善之, 末安直樹, 松本透, 南里豪志, 石畑宏明, 井上弘士, 村上和彰

情報処理学会論文誌コンピューティングシステム（ACS） 2007年8月

Language: Japanese

This paper proposes a rank allocation optimization technique for avoiding communication contention, which is a major cause of communication performance degradation. We propose an objective function that takes the timing of each communication into account and thereby yields a highly accurate MPI rank allocation that avoids contention. We also conduct experiments to examine how much the proposed objective function reduces communication time. The target programs are the recursive doubling communication pattern and the communication patterns of applications such as the CG method and umt2000. In the experiments, communication time is reduced by up to 45% relative to in-order allocation and by up to 24% relative to the rank allocation of previous work, which shows that the proposed method is effective.
PSI-SIM: A System Performance Evaluation Environment for the Design and Development of Next-Generation Supercomputers

柴村英智, 薄田竜太郎, 本田宏明, 稲富雄一, 于雲青, 井上弘士, 青柳睦

情報処理学会研究報告ハイパフォーマンスコンピューティング（HPC） 2007年8月

Language: Japanese

This paper describes PSI-SIM, an integrated system performance evaluation environment for the design and development of next-generation supercomputers in the petaflops era. Based on communication profiles generated from practical parallel applications, the environment predicts the performance of a target interconnect and of the whole system quickly and accurately, and supports application performance analysis and visualization. We propose a program code abstraction method for generating communication profiles quickly. We also use PSI-SIM to evaluate applications and an existing cluster system, and discuss simulation time and estimation error.
Dynamic Optimization of MPI Broadcast Communication Considering Load Imbalance

栗原康志, 曽我武史, Hyacinthe Nzigou Mamadou, 南里豪志, 末安直樹, 松本透, 井上弘士, 村上和彰

情報処理学会研究報告ハイパフォーマンスコンピューティング（HPC） 2007年8月

Language: Japanese

This work focuses on the problem that per-process arrival delays caused by load imbalance degrade the performance of broadcast communication, and proposes a dynamic optimization technique that adjusts the communication order according to the load. The technique first computes, as load information for each process, how late it arrives relative to the root process, and then changes the assignment of processes to the virtual ranks used in the broadcast algorithm accordingly. A prototype implemented on a PC cluster improves broadcast performance by up to about 40%, depending on the load. Applying the proposed broadcast to a sparse matrix computation reduces communication time by up to about 25%.
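The idea of remapping processes onto virtual ranks according to their arrival delay can be sketched as follows. This is a minimal illustration under assumptions that the abstract does not state: the broadcast uses a binomial tree over virtual ranks, and processes with smaller delays are simply given lower virtual ranks so that late processes receive in the last steps. The function names are hypothetical, and no real MPI calls are made.

```python
def assign_virtual_ranks(delays, root):
    """Return the process id for each virtual rank: root first, then by increasing delay.

    delays[p] is the measured arrival delay of process p relative to the root.
    """
    others = sorted((p for p in range(len(delays)) if p != root),
                    key=lambda p: (delays[p], p))
    return [root] + others

def binomial_schedule(order):
    """Return (step, sender, receiver) triples for a binomial-tree broadcast.

    order[v] is the process mapped to virtual rank v; virtual rank 0 is the root.
    """
    n = len(order)
    sends = []
    step, have = 0, 1          # 'have' = number of virtual ranks holding the data
    while have < n:
        for v in range(have):
            dst = v + have
            if dst < n:
                sends.append((step, order[v], order[dst]))
        have *= 2
        step += 1
    return sends
```

With `delays = [0.0, 3.0, 0.5, 1.0]` and `root = 0`, the order becomes `[0, 2, 3, 1]`, so process 1, which arrives last, only appears as a receiver in the final step and never delays a forwarding step.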
A Fast and Accurate Cache Simulation Method and Its Evaluation

小野貴継, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2007年6月

Language: Japanese

This paper describes a fast and accurate cache simulation method for efficient design space exploration and evaluates its effectiveness through a quantitative comparison with previous work. Cache memories are generally simulated with trace-driven simulation. As the design space grows, the number of configurations to evaluate increases and evaluation time tends to become longer. Reducing the trace size shortens simulation time but lowers accuracy. The proposed method exploits the characteristics of memory accesses to shorten simulation time while maintaining accuracy: it first characterizes the memory access patterns and then generates a small but well-constructed trace as input to the cache simulator. Compared with the SimPoint approach, the trace size is reduced by 81.7% on average and the prediction accuracy of the cache miss rate improves by 34.6% on average.
A Study of On-Chip Network Architecture for a Large-Scale Reconfigurable Data Path

島崎慶太, 長野孝昭, 本田宏明, ファラハドメディプー, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2007年6月

Language: Japanese

The Large Scale Reconfigurable Data Path (LSRDP) is a processor accelerator with a data path in which a large number of floating-point units (FPUs) are arranged in a two-dimensional array and both the type of each unit and the network between units are reconfigurable. In the LSRDP there is an area trade-off between the number of FPUs and the configuration of the network between them. In this paper, the initial integral part of the two-electron integral calculation in quantum chemistry is implemented on the LSRDP, and the case where the network between rows of FPUs is implemented with crossbar switches is examined. The results show that the total LSRDP area is minimized when each FPU is connected to nine FPUs in the next row.
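A Fast and Accurate Cache Simulation Method and Its Evaluation

小野貴継, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2007年5月

Language: Japanese

This paper describes a fast and accurate cache simulation method and evaluates its effectiveness through a quantitative comparison with previous work. Cache memories are generally simulated with trace-driven simulation. As the design space grows, the number of configurations to evaluate increases and evaluation time tends to become longer. Reducing the trace size shortens simulation time but lowers accuracy. The proposed method exploits the characteristics of memory accesses to shorten simulation time while maintaining accuracy. Compared with previous work, the trace size is reduced by 81.7% on average and the prediction accuracy of the cache miss rate improves by 34.6% on average.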
The potential of temperature-aware configurable cache on energy reduction (Computer Architecture)

Hamid Noori, Maziar Goudarzi, Koji INOUE, Kazuaki MURAKAMI

情報処理学会研究報告計算機アーキテクチャ（ARC） 2007年5月

Language: English

Active power used to be the primary contributor to the total power dissipation of CMOS designs, but with technology scaling, the share of leakage in the total power consumption of digital systems continues to grow. Moreover, temperature is another factor that exponentially increases the leakage current. In this paper, we show the effects of temperature and technology node on the optimal (minimum-energy) cache configuration for low-energy embedded systems. We show that a temperature-aware configurable cache is an effective way to save energy in finer technologies when the embedded system may be used at different temperatures. Our results show that a temperature-aware configurable cache saves up to 66% of energy with only a 1% performance penalty for the instruction cache, and 74% with a 4.7% performance loss for the data cache.
Analysis of the Effect of Data Prefetching on Chip Multiprocessors

福本尚人, 三原智伸, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2007年5月

Language: Japanese

Chip multiprocessors (CMPs), which integrate multiple cores on a single chip, are attracting attention because they achieve high computing performance through parallel processing across cores. However, because memory bandwidth does not scale with the number of cores and memory access frequency rises, the memory-wall problem becomes more serious. Data prefetching is a well-known way to hide main memory access time and is used in commercial processors, but many prefetching techniques assume a single-core processor. In a CMP, shared resources such as L2 caches and buses cause interactions among cores, so the effect of prefetching differs from that on single-core processors. This paper analyzes the effect of data prefetching on CMP performance. We first classify the impact of prefetch operations issued during program execution, and then discuss their effect on memory performance qualitatively and quantitatively. The results show that the fraction of prefetched data that is invalidated is very small, and that about 5% of prefetches hide memory access time for cores other than the one that issued them.
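Analysis of the Effect of Data Prefetching on Chip Multiprocessors

福本尚人, 三原智伸, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2007年5月

Language: Japanese

Chip multiprocessors (CMPs), which integrate multiple cores on a single chip, are attracting attention because they achieve high computing performance through parallel processing across cores. However, limited memory bandwidth and the higher memory access frequency caused by multiple cores make the memory-wall problem more serious. Data prefetching is one way to hide main memory access time. When data prefetching is used in a CMP, interactions among cores produce effects that differ from those on single-core processors. This paper analyzes the effect of data prefetching on CMP performance. The results show that the fraction of prefetched data that is invalidated is very small, and that about 5% of prefetches hide memory access time for cores other than the one that issued them.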
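A Study of the Dynamically Reconfigurable Processor Vulcan2 and Its Software Development Environment ISAcc

平木哲夫, 門内伸吾, 山崎陽介, 神戸隆行, GAUTHIER Lovic, MAURO GOULART FERREIRA Victor, TROUVE Antoine, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. RECONF, リコンフィギャラブルシステム : IEICE technical report 2007年5月

Language: Japanese

An application-specific processor achieves higher performance than a general-purpose processor by executing instructions specialized for the application. As a way of realizing application-specific processors, this paper proposes Vulcan2, a processor that uses dynamically reconfigurable hardware in its data path, and ISAcc, its software development environment. We also implement applications on the Vulcan2 simulator using ISAcc, analyze the results, and evaluate Vulcan2 and ISAcc.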
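A Study of On-Chip Network Architecture for a Large-Scale Reconfigurable Data Path

島崎慶太, 長野孝昭, 本田宏明, メディプーファラハド, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2007年5月

Language: Japanese

The Large Scale Reconfigurable Data Path (LSRDP) is a processor accelerator with a data path in which a large number of functional units are arranged in a two-dimensional array and both the type of each unit and the network between units are reconfigurable. In the LSRDP there is an area trade-off between the number of functional units and the configuration of the network between them. In this paper, the initial integral part of the two-electron integral calculation in quantum chemistry is implemented on the LSRDP, and the case where the network between rows of functional units is implemented with crossbar switches is examined. The results show that the total LSRDP area is minimized when each functional unit is connected to nine other functional units.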
Rank Allocation Optimization Considering Communication Timing

森江善之, 末安直樹, 松本透, 南里豪志, 石畑宏明, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2007年3月

Language: Japanese

This paper proposes a rank allocation optimization technique for avoiding communication contention, which is a major cause of communication performance degradation. By considering the communication timing of each message, the method produces a highly accurate MPI rank allocation that avoids contention. The method incurs an overhead because synchronization functions must be added to control contention. The evaluation experiments examine how much the method reduces communication time, including that overhead. The communication pattern of recursive doubling and the communication patterns of real applications such as CG and umt2000 are used. In the experiments, communication time is reduced by up to 45% relative to in-order rank allocation and by up to 24% relative to the rank allocation of previous work.
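A Processor with a Large-Scale Reconfigurable Data Path Using Single-Flux-Quantum Circuits

高木直史, 村上和彰, 藤巻朗, 吉川信行, 井上弘士, 本田宏明

電子情報通信学会技術研究報告. SCE, 超伝導エレクトロニクス 2007年1月

Language: Japanese

We propose a processor with a large-scale reconfigurable data path built from single-flux-quantum circuits, as a 10-teraflops-class superconducting computer that can be installed at the deskside.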
Evaluation of Mode-Switching Algorithms for Drowsy Caches

図子純平, 冨山宏之, 高田広章, 井上弘士

情報処理学会研究報告計算機アーキテクチャ（ARC） 2006年11月

Language: Japanese

In embedded devices, and especially in battery-powered systems, reducing energy consumption is important. In recent years cache memories have been used not only in general-purpose processors but also in embedded processors. In addition, the leakage energy of cache memories has been increasing year by year as feature sizes shrink, so reducing leakage energy is required. The Drowsy cache is one technique for reducing cache leakage energy: it switches cache lines into a low-leakage mode. However, when a line in the low-leakage mode is accessed, it must be switched back to the normal mode, which incurs a switching penalty of one to several cycles and an energy overhead. As algorithms that reduce leakage energy while minimizing this performance degradation, this paper proposes three kinds of Way-Prediction Drowsy Cache, which exploit temporal locality and use way prediction for mode switching. We evaluate the proposed caches in terms of performance and leakage energy reduction, and the experimental results demonstrate their effectiveness.
Proposal of a Memory Architecture Benchmarking Method

小野貴継, 井上弘士, 村上和彰

情報処理学会研究報告システム評価（EVA） 2006年8月

Language: Japanese

This paper proposes a memory architecture benchmarking method that enables short simulations while maintaining high accuracy. Memory architectures are generally evaluated by simulation based on address traces. As application programs become more sophisticated, address traces grow and simulation time tends to increase, so shortening simulation time is essential. Reducing the address trace size shortens simulation time but lowers accuracy. The proposed method first splits the trace into small traces and then selects representative traces based on their similarity. This makes the simulated trace smaller and shortens simulation time. In an evaluation based on cache performance measurement, the method shortens simulation time by 77.6% on average, with an average prediction error of 4.2% in the cache hit rate.
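The split-then-select step described in this abstract can be sketched roughly as below. The abstract does not say how similarity is measured, so the signature (a normalized histogram of cache-block addresses folded into a few buckets), the L1 distance, the greedy grouping, and all function names and parameters are assumptions made only for illustration.

```python
from collections import Counter

def split_trace(trace, window):
    """Split an address trace into consecutive chunks of at most `window` accesses."""
    return [trace[i:i + window] for i in range(0, len(trace), window)]

def signature(chunk, block_bits=6, buckets=16):
    """Normalized histogram of block addresses folded into `buckets` bins (assumed metric)."""
    counts = Counter((addr >> block_bits) % buckets for addr in chunk)
    total = len(chunk)
    return [counts.get(b, 0) / total for b in range(buckets)]

def distance(a, b):
    """L1 distance between two signatures."""
    return sum(abs(x - y) for x, y in zip(a, b))

def select_representatives(trace, window, threshold):
    """Greedily group similar chunks; return (representative chunk index, group size) pairs."""
    reps = []  # each entry: [chunk index, signature, number of chunks represented]
    for idx, chunk in enumerate(split_trace(trace, window)):
        sig = signature(chunk)
        for rep in reps:
            if distance(rep[1], sig) <= threshold:
                rep[2] += 1
                break
        else:
            reps.append([idx, sig, 1])
    return [(idx, weight) for idx, _, weight in reps]
```

Only the representative chunks would then be fed to the cache simulator, with each result weighted by its group size when estimating the hit rate of the full trace. How well that estimate tracks the full simulation depends entirely on the chosen signature and threshold.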
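Characterization of Approximate String Matching Program Execution and a Study of Its Acceleration

柴田圭, 馬場謙介, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. DC, ディペンダブルコンピューティング 2006年7月

Language: Japanese

To realize fast virus scanning, this paper studies the fast execution of the bit-parallel (BP) algorithm, one of the algorithms for approximate string matching. Current virus scanners prepare files that define the signatures of malicious programs and basically look for exact matches in the files under inspection. They therefore cannot find variants created by modifying existing viruses. Approximate string matching is a possible solution. Aiming at fast and highly functional virus scanning, we analyzed the execution characteristics of an approximate string matching program. We first analyzed the memory capacity required at run time and the frequency of executed instructions. The results show that, as far as memory performance is concerned, the capacity of the L1 cache in current processors is sufficient. The instruction frequency analysis also revealed that the execution frequency of instruction sequences with data dependences is skewed. Furthermore, using a reconfigurable function unit (RFU) for such instruction sequences is expected to improve performance by about 14%.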
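Accelerating Memory Integrity Verification Using Decayed Lines in Cache Memories

坂口高宏, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. DC, ディペンダブルコンピューティング 2006年7月

Language: Japanese

This paper proposes a technique for hiding the processor performance overhead of memory integrity verification. Memory integrity verification detects tampering with loaded data by keeping the state of the protected memory space in a secure storage area. However, it wastes on-chip cache capacity and memory bandwidth, and thus has a large negative impact on processor performance. One cause is that the data the processor needs for program execution and the data for memory integrity verification conflict in the cache; that is, cache misses cause memory accesses, which degrade processor performance. To solve this, verification data is placed in decayed lines, which are cache lines with a low access frequency. This prevents data needed for program execution from being evicted from the cache. A quantitative evaluation with benchmark programs shows an average reduction of 23.8% in performance overhead compared with the conventional method.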
Characterization of Cache Memories in Chip Multiprocessors

三原智伸, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2006年7月

Language: Japanese

In recent years, the chip multiprocessor (CMP) architecture, which integrates multiple processor cores on one chip, has attracted attention as a way to achieve higher performance. Given limited memory bandwidth and the worsening memory-wall problem, building a memory system suited to CMPs is essential for future high-performance processor systems. Within the memory system, the on-chip cache organization has a large impact on performance, and a major design choice is whether the cache is shared among processor cores or private to each core. This paper analyzes, qualitatively and quantitatively, how sharing versus not sharing the cache affects memory performance in a CMP, and shows that which organization performs better depends on the difference in L2 cache miss rates.
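A Study of a Highly Reliable, Low-Power Processor Based on Reuse of Operation Results

橋口陽祐, 井上弘士, 村上和彰

電子情報通信学会技術研究報告. ICD, 集積回路 2006年6月

Language: Japanese

Declining soft-error tolerance in processors has become a problem. A soft error is a phenomenon in which a circuit temporarily malfunctions because of noise. To improve reliability, memories use error detection/correction codes such as parity and ECC. However, it is difficult to add such codes to combinational circuits, and in many cases errors are detected by making program execution redundant (executing multiple times). This study examines a highly reliable, low-energy processor architecture based on the reuse of operation results. The method keeps the results of identical instructions in an operation-result reuse table and reuses them. Because the reuse table is protected by ECC, high reliability is achieved without executing each instruction redundantly, which reduces the energy overhead of improving reliability. A quantitative evaluation shows that the energy overhead, which is 100% for the conventional redundancy-based scheme with a redundancy degree of 2, is reduced to 6.3%.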
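The behavior of an operation-result reuse table can be sketched in software as a small lookup table keyed by the operation and its operands. This is only a functional illustration under assumptions: the table organization, the first-in-first-out replacement, and the names `ReuseTable` and `simple_alu` are hypothetical, and neither ECC protection nor the handling of table misses in the actual design is modeled.

```python
class ReuseTable:
    """Bounded table mapping (operation, operand a, operand b) to a stored result."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}   # dicts keep insertion order, used here for FIFO eviction
        self.hits = 0
        self.misses = 0

    def execute(self, op, a, b, alu):
        key = (op, a, b)
        if key in self.entries:
            # Hit: the stored result is reused and the ALU is not exercised.
            self.hits += 1
            return self.entries[key]
        # Miss: compute with the ALU and remember the result.
        self.misses += 1
        result = alu(op, a, b)
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))
            del self.entries[oldest]
        self.entries[key] = result
        return result

def simple_alu(op, a, b):
    """Tiny stand-in for an ALU, supporting three integer operations."""
    if op == "add":
        return a + b
    if op == "sub":
        return a - b
    if op == "mul":
        return a * b
    raise ValueError(f"unsupported operation: {op}")
```

The hit rate, `hits / (hits + misses)`, is the quantity that determines how often redundant execution could be skipped in a scheme of this kind, which is why the table composition matters for the energy overhead.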
A Reconfigurable Functional Unit for Adaptable Custom Instructions (Processors, Parallel Processing, System LSI Architectures, and General Topics toward Cooperation and Fusion of Integrated Circuit and Architecture Technologies)

Noori Hamid, Mehdipour Farhad, Murakami Kazuaki, INOUE Koji, SAHEBZAMANI Morteza

電子情報通信学会技術研究報告. ICD, 集積回路 2006年6月

Language: English

This paper presents a reconfigurable functional unit (RFU) for an adaptive dynamic extensible processor. The processor can tune its extended instructions to the target applications after chip fabrication, which provides more flexibility. The custom instructions (CIs) are generated from the hot basic blocks during the training mode. In the normal mode, CIs are executed on the RFU. A quantitative approach was used to design the RFU, which is a matrix of functional units with 8 inputs and 6 outputs. For 22 applications from MiBench, the proposed RFU improves performance by a factor of up to 1.5. The size of the configuration memory has been reduced by 40% by making the RFU partially reconfigurable, finding subsets of CIs, and merging small CIs into one configuration. The processor needs no extra opcodes for CIs, no new compiler, and no source code modification or recompilation.
A Branch Prediction Method Focusing on Bias in Program Execution Paths

築地孝典, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC） 2006年6月

Language: Japanese

Many recent high-performance processors include a branch predictor. Because a wrong instruction sequence is executed when a branch is mispredicted, prediction accuracy has a large effect on processor performance and energy consumption. Large and complex branch predictors have been proposed to achieve higher accuracy, but as a result the energy consumed by branch prediction has grown and now adversely affects the total energy consumption of the processor. Since a misprediction also increases processor energy through the execution of instructions that are later invalidated, it is very important to reduce the energy of the branch predictor while maintaining high prediction accuracy. This study proposes a new branch prediction method that focuses on the bias in execution paths, with the aim of improving prediction accuracy and reducing energy consumption. Programs contain frequently executed instruction sequences (hot paths), and branch instructions in a hot path go in a fixed direction with high probability. In addition, a small number of hot paths account for most of the total execution time. The proposed method keeps the branch instructions in hot paths and their targets in a small memory and predicts branches by referring to that memory while a hot path is executing. Compared with a conventional Gshare predictor, the proposed method increases the misprediction rate by about 2.2 points but reduces the energy consumption of the branch predictor by about 40%.
プログラムの実行経路の偏りに着目した分岐予測法

築地孝典, 井上弘士, 村上和彰

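電子情報通信学会技術研究報告. ICD, 集積回路, June 2006

Language: Japanese

Many recent high-performance processors employ branch predictors. Because the processor executes wrong-path instructions whenever a branch is mispredicted, prediction accuracy strongly affects both performance and energy consumption. Large, complex branch predictors have been proposed to raise accuracy, but they increase the energy consumed by prediction itself, which in turn hurts total processor energy. Instructions executed after a misprediction are later invalidated, so they also waste energy. It is therefore very important to reduce the energy of the branch predictor while keeping prediction accuracy high. This work proposes a new branch prediction method that exploits the bias in execution paths, aiming at both higher accuracy and lower energy. Programs contain frequently executed instruction sequences (hot paths); branches on a hot path go in a fixed direction with high probability, and a small number of hot paths account for most of the execution time. The proposed method holds the branch instructions on hot paths and their targets in a small memory and predicts by looking up that memory while a hot path is executing. Compared with a conventional Gshare predictor, the proposed method raised the misprediction rate by about 2.2 percentage points but cut the energy consumed by the branch predictor by about 40%.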
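The table lookup described in this abstract can be pictured with a minimal Python sketch. It assumes a hypothetical `HotPathTable` that maps each branch address on a hot path to its recorded target; the class name, the fallback behaviour on a table miss, and the addresses are illustrative assumptions, not details taken from the report.

```python
from typing import Dict, List, Optional, Tuple


class HotPathTable:
    """Small table of (branch address -> recorded target) for one hot path (hypothetical)."""

    def __init__(self, entries: Dict[int, int]) -> None:
        self.entries = dict(entries)

    def predict(self, branch_pc: int) -> Optional[int]:
        # Hit: reuse the recorded target. Miss: return None so that the caller
        # can fall back to a conventional predictor.
        return self.entries.get(branch_pc)


def count_mispredictions(table: HotPathTable, trace: List[Tuple[int, int]]) -> int:
    """Count table hits whose recorded target differs from the actual target."""
    wrong = 0
    for branch_pc, actual_target in trace:
        predicted = table.predict(branch_pc)
        if predicted is not None and predicted != actual_target:
            wrong += 1
    return wrong


# Illustrative addresses only.
table = HotPathTable({0x400: 0x480, 0x4A0: 0x400})
trace = [(0x400, 0x480), (0x4A0, 0x400), (0x400, 0x404), (0x500, 0x504)]
mispredictions = count_mispredictions(table, trace)
```

In this toy trace the third branch leaves the hot path, so the small table mispredicts once; the trade-off reported in the abstract is between such extra mispredictions and the energy saved by consulting a small memory.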
演算結果再利用による高信頼かつ低消費電力なプロセッサに関する検討

橋口陽祐, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC）, June 2006

Language: Japanese

Declining soft-error tolerance in processors has become a problem. A soft error is a transient malfunction of a circuit caused by noise. Memories improve reliability with error detection/correction codes such as parity and ECC. Such codes are hard to add to combinational logic, so errors there are usually detected by executing the program redundantly (multiple times), which increases energy consumption. This work examines a highly reliable, low-energy processor architecture based on reusing computation results. The results of identical instructions in a program are kept in a reuse table and reused without going through the ALU. Because the reuse table is protected by ECC, high reliability is obtained without executing every instruction redundantly, which reduces the energy overhead that comes with the reliability improvement. The energy consumption depends on the table organization. In a quantitative evaluation, the energy overhead, which is 100% for a conventional redundancy-based scheme with a redundancy degree of 2, was reduced to 63% (printed as 6.3% in the English-language abstract). Keywords: soft error, reliability, energy consumption
新世代マイクロプロセッサアーキテクチャ（後編）:2.新しいデザインバランス 2.信頼性・安全性とプロセッサ

井上弘士

情報処理, November 2005

Language: Japanese

New Generation Microprocessor Architecture (2): Security and Reliability of Advanced Microprocessors
待機ラインへの参照密度に基づく低リーク・キャッシュの動的制御

小宮礼子, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC）, August 2005

Language: Japanese

Many low-leakage cache techniques have been proposed. However, because they discard the data in lines that are put into standby mode, the number of misses grows and performance inevitably drops. This paper proposes an always-active line scheme to limit this performance loss in low-leakage caches. It takes into account the locality of accesses to standby-mode lines, which are the cause of the loss, and keeps lines that attract concentrated accesses always active, that is, in the high-speed but high-leakage mode. The previously proposed Cache decay scheme achieves a 92.7% leakage reduction at the cost of about 15.1% performance degradation. The proposed scheme keeps a comparable leakage reduction of 90.6% while limiting the performance degradation to 5.0%.
実行振舞いを鍵情報とする不正プログラムの動的検出方式

井上弘士, 岩佐崇史

情報処理学会研究報告計算機アーキテクチャ（ARC）, August 2005

Language: Japanese

This paper proposes a hardware-based dynamic program authentication scheme for improving computer system security, and evaluates its security qualitatively and its cost and performance overhead quantitatively. By using execution behavior as shared secret-key information, the scheme provides (1) low performance overhead and (2) continuous program authentication. On the application publisher's side, a secure compiler generates object code that produces, at run time, the program execution behavior determined by the shared secret key. On the user's side, a dedicated secure profiler dynamically monitors execution for the behavior that serves as the key. If that behavior is not detected, an execution-stop interrupt is issued to the processor. Because malicious code does not know the behavior required to continue executing on the microprocessor, attacks can be detected and blocked at the start of their execution.
待機状態ラインに対する参照局所性を考慮した低リーク・キャッシュの性能低下抑制方式

小宮礼子, 井上弘士, モシニャガ・ワシリー, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC）, December 2004

Language: Japanese

Many low-leakage cache techniques have been proposed. However, they introduce slow accesses to lines in standby mode, so performance inevitably drops. This paper proposes an always-awake line scheme to limit this performance loss in low-leakage caches. It takes into account the locality of accesses to standby-mode lines, which are the cause of the loss, and forces lines that attract concentrated accesses to stay always awake, that is, in the high-speed but high-leakage mode. The previously proposed Drowsy scheme achieves an 84% leakage reduction at the cost of about 15% performance degradation. With always-awake lines, the performance degradation is limited to 8-11% while a comparable leakage reduction is maintained.
キャッシュ・ミス頻発命令とその特徴解析

堂後靖博, 三輪英樹, ヴィクトル・マウロ・グラール・フェヘイラ, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC）, December 2004

Language: Japanese

Advances in VLSI technology have dramatically increased processor speed and DRAM capacity, but they have also opened a large and growing performance gap between processor and main memory, known as the memory wall problem, which limits total system performance. Exploiting delinquent instructions (memory-access instructions that frequently miss in the cache) is one effective way to tackle this problem. For example, prefetch accuracy can be improved by generating the address-computation code of a delinquent instruction as a separate thread and executing it speculatively. However, the processor-memory performance gap keeps widening, and more effective performance techniques that focus on delinquent instructions are needed. This paper therefore analyzes the characteristics of delinquent instructions: their frequency of occurrence, input dependence, live ranges, access patterns, and so on. The results can serve as baseline data for developing memory performance techniques based on delinquent instructions.
キャッシュ・ミス頻発命令を考慮したメモリ・システムの高性能化

三輪英樹, 堂後靖博, ヴィクトルM グラールフェヘイラ, 井上弘士, 村上和彰

情報処理学会研究報告計算機アーキテクチャ（ARC）, December 2004

Language: Japanese

The operating-frequency gap between microprocessors and main memory widens every year. This gap limits microprocessor performance and is generally called the memory wall problem (MWP). This paper proposes CCC (Computing Centric Computation), a recomputation-based technique for improving memory system performance. Instead of executing loads that frequently miss in the cache (delinquent loads), CCC executes a piece of recomputation code that regenerates the missed value, which reduces the number of main-memory accesses and so alleviates the MWP. The paper reports a preliminary evaluation of whether performance gains are possible: for the SPEC2000 benchmark programs evaluated, execution cycles were reduced by up to 45.3%.
データパス分割に基づく高信頼プロセッサの提案とその予備評価

松坂茂治, 井上弘士

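情報処理学会研究報告. SLDM, [システムLSI設計技術], December 2004

Language: Japanese

To keep a computer system highly reliable, the faults that lead to failures must be detected. A common approach is to use temporal or spatial redundancy, but realizing that redundancy brings problems such as additional hardware or longer execution time. This paper proposes a datapath partitioning scheme that achieves spatial redundancy without major hardware changes, together with a scheme that reduces the execution-time overhead by considering the minimum bit width each operation needs. Measured under best-case assumptions, execution time increased by a factor of 1.62 on average for a redundancy degree of 2 and by a factor of 3.09 on average for a degree of 4, from which the authors conclude that the approach is very effective.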
不正プログラムの実行防止を目的とするオンチップ・キャッシュ・アーキテクチャ

井上弘士

情報処理学会研究報告計算機アーキテクチャ（ARC）, July 2004

Language: Japanese

This paper proposes Secure Cache (SCache), an architectural approach to improving computer system security, and evaluates its security, performance, and energy consumption. Many recent computer viruses trigger a buffer overflow and tamper with a function return address stored on the stack in order to hijack program control flow. To avoid this, SCache creates a replica of each return-address value that is written and holds it in the cache area. In a quantitative evaluation with benchmark programs, more than 99.7% of return-address loads were protected for many of the programs.
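As a rough software illustration of the replica idea, the Python sketch below keeps a copy of every stored return address and compares it with memory on load. The class `ReturnAddressGuard`, the check-on-load policy, and the addresses are assumptions made for this example; the abstract itself only states that SCache generates a replica in the cache area.

```python
class ReturnAddressGuard:
    """Toy model: keep a replica of each stored return address and check it on load."""

    def __init__(self):
        self.memory = {}   # address -> value; ordinary stores can overwrite this
        self.replica = {}  # address -> copy made when a return address is stored

    def store_return_address(self, address, value):
        self.memory[address] = value
        self.replica[address] = value  # replica is written alongside the original

    def store(self, address, value):
        # An ordinary store (for example an overflowing buffer copy) changes
        # memory but never touches the replica.
        self.memory[address] = value

    def load_return_address(self, address):
        value = self.memory[address]
        if self.replica.get(address) != value:
            raise RuntimeError("return address does not match its replica")
        return value


guard = ReturnAddressGuard()
guard.store_return_address(0x7FFC, 0x400123)
safe_value = guard.load_return_address(0x7FFC)

guard.store(0x7FFC, 0xDEADBEEF)  # simulated stack smashing
try:
    guard.load_return_address(0x7FFC)
    tamper_detected = False
except RuntimeError:
    tamper_detected = True
```

In hardware the replica occupies cache space, so a replica can be evicted; that is presumably why the abstract reports the fraction of return-address loads that could be protected rather than claiming full coverage.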
オペランド再利用によるレジスタ・ファイルの低消費電力化

高村拓志, 井上弘士, Vasily G. Moshnyaga

情報処理学会研究報告計算機アーキテクチャ（ARC）, August 2002

Language: Japanese

This paper proposes a technique that reduces the number of register-file accesses, and with it energy consumption, by reusing operands. During program execution, source operands are read from the register file. In a conventional processor, when two consecutive instructions need the same source operand, the register file is read for each of them. In the proposed technique, the source operand value read by the preceding instruction i is kept in a pipeline register and reused when the following instruction j fetches its operands, so that fetch needs no register-file activation. Register-file write accesses are also reduced by exploiting the forwarding unit that is implemented to resolve RAW pipeline hazards. In experiments with benchmark programs, the total number of register-file accesses was reduced by up to 62% relative to a conventional model.
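The read-side saving can be estimated with a small counting model. The Python sketch below is an illustrative simplification, not the evaluation method of the report: it looks only at source operands, treats a read as avoidable when the immediately preceding instruction read the same register, and assumes that register was not overwritten in between. The function name and the sample program are hypothetical.

```python
def count_register_reads(sources_per_instruction):
    """Return (baseline_reads, reads_with_reuse) for a sequence of instructions.

    sources_per_instruction: tuples of source register names in program order.
    """
    baseline = 0
    with_reuse = 0
    previous_sources = set()
    for sources in sources_per_instruction:
        for register in sources:
            baseline += 1
            # The value is still held in the pipeline register if the
            # preceding instruction read the same register.
            if register not in previous_sources:
                with_reuse += 1
        previous_sources = set(sources)
    return baseline, with_reuse


program = [("r1", "r2"), ("r1", "r3"), ("r3", "r4"), ("r5",)]
baseline, with_reuse = count_register_reads(program)
```

For this four-instruction example, two of the seven source reads are served from the previous instruction's operands.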
低消費電力メディア・アプリケーション向けヒストリ・ベース・タグ比較キャッシュの評価

井上弘士, Moshnyaga Vasily G., 村上和彰

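電子情報通信学会技術研究報告. DC, ディペンダブルコンピューティング, April 2002

Language: Japanese

We previously proposed the history-based tag-comparison (HBTC) scheme to reduce the energy consumed by direct-mapped instruction caches. A conventional cache performs a tag comparison on every access to decide between hit and miss. An HBTC cache instead performs tag comparisons only when needed, based on the program's execution history; it dynamically detects and eliminates unnecessary tag comparisons and so lowers instruction-cache energy. This paper improves the previously proposed HBTC cache and presents a new implementation with small overhead. It also gives a more detailed evaluation of performance and energy consumption using benchmark programs centered on signal-processing applications.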
低消費電力メディア・アプリケーション向けヒストリ・ベース・タグ比較キャッシュの評価

井上弘士, Moshnyaga Vasily G., 村上和彰

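電子情報通信学会技術研究報告. CPSY, コンピュータシステム, April 2002

Language: Japanese

We previously proposed the history-based tag-comparison (HBTC) scheme to reduce the energy consumed by direct-mapped instruction caches. A conventional cache performs a tag comparison on every access to decide between hit and miss. An HBTC cache instead performs tag comparisons only when needed, based on the program's execution history; it dynamically detects and eliminates unnecessary tag comparisons and so lowers instruction-cache energy. This paper improves the previously proposed HBTC cache and presents a new implementation with small overhead. It also gives a more detailed evaluation of performance and energy consumption using benchmark programs centered on signal-processing applications.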
二電源電圧を用いた命令発行メモリの低消費電力化手法

辻寛司, 井上弘士, モシニャガワシリー

情報処理学会研究報告システムLSI設計技術（SLDM）, November 2001

Language: Japanese

Adaptive issue queues (instruction windows) have been proposed to reduce issue-queue power. They dynamically adjust the number of usable entries (the issue-queue size) to the instruction-level parallelism of the program, which reduces the load capacitance and hence power. This paper proposes an adaptive issue queue with dual supply voltages for further power reduction. Conventional adaptive issue queues use a single supply voltage, whereas the proposed technique uses a low supply voltage when the queue is shrunk; in other words, the supply voltage changes together with the queue size. Because the power consumption of CMOS circuits is proportional to the square of the supply voltage, lowering the voltage promises a large power reduction. And because the low voltage is used only when the queue is shrunk, the delay overhead of low-voltage operation can be hidden: the cycle time left unused by the smaller load capacitance is traded for a lower supply voltage. In the evaluation, the technique reduced issue-queue power by up to 36% without a large performance or area overhead.
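The square-law argument can be checked with the textbook CMOS dynamic-power estimate P = a * C * Vdd^2 * f. The Python sketch below uses that standard formula; the activity factor, capacitance, frequency, and the two supply voltages (1.8 V and 1.2 V) are arbitrary illustrative values, not figures from the report.

```python
def dynamic_power(activity, capacitance_farads, vdd_volts, frequency_hz):
    """Textbook CMOS dynamic power estimate: P = a * C * Vdd^2 * f."""
    return activity * capacitance_farads * vdd_volts ** 2 * frequency_hz


# Illustrative values only: same activity, load, and clock; two supply voltages.
high_voltage_power = dynamic_power(0.5, 1e-12, 1.8, 1e9)
low_voltage_power = dynamic_power(0.5, 1e-12, 1.2, 1e9)
power_ratio = low_voltage_power / high_voltage_power
```

With these assumed voltages the ratio is (1.2 / 1.8)^2, about 0.44, before counting the additional saving from the smaller load capacitance of a shrunk queue.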
タグ比較結果の再利用によるキャッシュメモリの低消費電力化

井上弘士, Vasily G. Moshnyaga, 村上和彰

情報処理学会研究報告システムLSI設計技術（SLDM）, November 2001

Language: Japanese

This paper proposes the history-based look-up cache (HBL cache), a new instruction cache architecture for low energy consumption, and demonstrates its effectiveness through a quantitative evaluation with benchmark programs. In an n-way set-associative cache there are n locations where a given cache line can be placed, yet on a hit the referenced data resides in exactly one way. A conventional cache nevertheless searches all ways in parallel to shorten the access time. The HBL cache instead reuses past tag-comparison results, which are recorded in an extended BTB (branch target buffer) used for branch prediction, and avoids unnecessary way activations during look-up. Compared with a conventional set-associative cache, it reduced cache-access energy by up to 72% at the cost of only about 0.2% performance degradation.
データ圧縮による画像処理用メモリの低消費電力化手法とその評価

深川瑞香, 井上弘士, Vasily G. Moshnyaga

情報処理学会研究報告システムLSI設計技術（SLDM）, November 2001

Language: Japanese

This paper proposes a technique for reducing the power consumption of video memories through data compression. Image-processing systems use memories whose basic operation is sequential access, such as FIFOs and frame memories. In a conventional memory system, every bit of a data word is read or written. The proposed technique instead reads and writes only the difference between successive data, which reduces the number of bitlines that must be activated and so lowers power consumption. Because neighbouring pixels are generally correlated, taking the difference between successively accessed data compresses it efficiently. In an evaluation with six video sequences, frame-memory power consumption was reduced by 11-16%.
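To see why differences between neighbouring pixels need fewer bits, the Python sketch below counts how many bit positions are touched per access with and without differencing. The sign-plus-magnitude encoding, the fallback to a full word for large differences, and the sample pixel row are assumptions made for illustration; the report's actual encoding is not given in the abstract.

```python
def bits_needed(difference):
    """Bits for a sign-plus-magnitude difference (one sign bit included)."""
    return abs(difference).bit_length() + 1


def active_bitlines(pixels, word_bits=8):
    """Return (full_word_bits, difference_bits) summed over all accesses."""
    full = word_bits * len(pixels)
    delta = word_bits  # the first pixel has no predecessor and is stored whole
    for previous, current in zip(pixels, pixels[1:]):
        # A difference too wide to pay off is modelled as a full-word access.
        delta += min(word_bits, bits_needed(current - previous))
    return full, delta


row = [120, 121, 123, 122, 122, 130]  # illustrative 8-bit pixel values
full_bits, delta_bits = active_bitlines(row)
```

For this smoothly varying row, differencing touches 21 bit positions instead of 48; real video data and a real encoding would of course give different numbers.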
A low-power instruction cache architecture exploiting program execution footprints

INOUE K.

Work-in-Progress Session in the 7th International Symposium on High-Performance Computer Architecture, Included in CD Proc., January 2001

Language: Other

A Low-Power Instruction Cache Architecture Exploiting Program Execution Footprints
Performance/Energy Efficiency of Variable Line-Size Caches on Intelligent Memory Systems

Koji Inoue, Koji Kai, Kazuaki J. Murakami

Proc. of the 2nd Workshop on Intelligent Memory Systems, November 2000

Language: Other

DOI: 10.1007/3-540-44570-6_13
A High-Performance and Low-Power Cache Architecture with Speculative Way-Selection

INOUE Koji, ISHIHARA Tohru, MURAKAMI Kazuaki

IEICE transactions on electronics, February 2000

Language: English

This paper proposes a new approach to achieving high performance and low energy consumption for set-associative caches. The cache, called a way-predicting set-associative cache, speculatively selects a single way that is likely to contain the data desired by the processor from the set designated by a memory address, before it starts a normal cache access. By accessing only the single predicted way, instead of accessing all the ways in a set, energy consumption can be reduced. For the way-predicting cache to perform well, the accuracy of way prediction is important. This paper shows that the accuracy of an MRU (most recently used)-based way prediction is higher than 90% for most of the benchmark programs. The proposed way-predicting cache improves the ED (energy-delay) product by 60-70% compared to the conventional set-associative cache.
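A small behavioural model makes the MRU-based prediction concrete. The Python sketch below is an illustrative assumption rather than the authors' simulator: it probes the most recently used way first, probes the remaining ways only if that fails, and counts way activations. The replacement choice on a miss (the way after the MRU way) and the address trace are arbitrary.

```python
class WayPredictingCache:
    """Toy set-associative cache that probes the MRU way before the others."""

    def __init__(self, num_sets, num_ways):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.mru = [0] * num_sets      # predicted (most recently used) way per set
        self.way_activations = 0
        self.first_probe_hits = 0

    def access(self, address):
        index = address % self.num_sets
        tag = address // self.num_sets
        ways = self.tags[index]
        predicted = self.mru[index]

        self.way_activations += 1                  # probe only the predicted way
        if ways[predicted] == tag:
            self.first_probe_hits += 1
            return True

        self.way_activations += self.num_ways - 1  # then probe the remaining ways
        if tag in ways:
            self.mru[index] = ways.index(tag)
            return True

        victim = (predicted + 1) % self.num_ways   # arbitrary replacement choice
        ways[victim] = tag
        self.mru[index] = victim
        return False


cache = WayPredictingCache(num_sets=2, num_ways=2)
for address in [0, 0, 2, 0, 2, 2]:
    cache.access(address)
```

On this six-access trace the model activates 10 ways, where a conventional 2-way cache that always probes both ways would activate 12; the saving grows with prediction accuracy and associativity.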
MOE: A special-purpose parallel computer for high-speed, large-scale molecular orbital calculation (peer-reviewed)

Koji Hashimoto, Hiroto Tomita, Koji Inoue, Katsuhiko Metsugi, Kazuaki Murakami, Shinjiro Inabata, So Yamada, Nobuaki Miyakawa, Hajime Takashima, Kunihiro Kitamura, Shigeru Obara, Takashi Amisaki, Kazutoshi Tanabe, Umpei Nagashima

ACM/IEEE SC 1999 Conference, SC 1999, November 1999

Language: English

We are constructing a high-performance, special-purpose parallel machine for ab initio molecular orbital calculations, called MOE (Molecular Orbital calculation Engine). The sequential execution time is O(N^4), where N is the number of basis functions, and most of the time is spent calculating electron repulsion integrals (ERIs). The calculation of ERIs has O(N^4) parallelism, and MOE tries to exploit it. This paper discusses the MOE architecture and examines important aspects of the architecture design required to calculate ERIs according to the "Obara method". We conclude that n-way parallelization is the most cost-effective, hence we designed the MOE prototype system with a host computer and many processing nodes. Each processing node includes a 76-bit floating-point multiply-and-add unit, internal memory, and other components, and performs ERI computations efficiently. We estimate that the prototype system with 100 processing nodes can calculate the energy of proteins in a few days.

DOI: 10.1109/SC.1999.10000

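Industrial Property Rights

Patents	Applications: 1	Registrations: 0
Utility models	Applications: 0	Registrations: 0
Designs	Applications: 0	Registrations: 0
Trademarks	Applications: 0	Registrations: 0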

Academic Society Memberships

情報処理学会 (Information Processing Society of Japan)
電子情報通信学会 (The Institute of Electronics, Information and Communication Engineers)
IEEE
ACM

Committee Memberships

ACM SIGMICRO Executive Committee Member (international)

July 2023 - June 2026
Chair (domestic)

March 2018 - March 2022
Chair, 情報処理学会システムアーキテクチャ研究会

March 2018 - March 2022
Secretary (international)

January 2015 - December 2016
Secretary (domestic)

April 2012 - March 2013

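Academic Contribution Activities

TPC (international academic contribution)

International Symposium on Microarchitecture (MICRO) (Other), November 2024

Type: Conference, symposium, etc.
TPC (international academic contribution)

International Symposium on Computer Architecture (ISCA) (Other), June 2024 - July 2023

Type: Conference, symposium, etc.
Researcher, 学術システム研究センター

Role: Review and evaluation

April 2024 - March 2027

Type: Review and academic advice
TPC (international academic contribution)

International Symposium on High-Performance Computer Architecture (HPCA) (Other), March 2024

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Microarchitecture (MICRO) (Other), October 2023

Type: Conference, symposium, etc.
Associate Member, 日本学術会議 (Science Council of Japan)

Role: Review and evaluation

October 2023 - present

Type: Review and academic advice
TPC (international academic contribution)

IEEE Micro Top Picks (Other), June 2023

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Computer Architecture (ISCA) (Other), June 2023

Type: Conference, symposium, etc.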
Area Advisor, JST PRESTO/CREST 「量子・古典の異分野融合による共創型フロンティアの開拓」

Role: Review and evaluation

June 2023 - March 2032

Type: Review and academic advice
Other (international academic contribution)

International Symposium on High-Performance Computer Architecture (HPCA) (Other), February 2023

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Microarchitecture (MICRO) (Other), October 2022

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Computer Architecture (ISCA) (Other), June 2022

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on High-Performance Computer Architecture (Other), February 2022

Type: Conference, symposium, etc.
International Symposium on High-Performance Computer Architecture (international academic contribution)

(Others), February 2022

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Microarchitecture (Other), October 2021

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Computer Architecture (ISCA) (Other), May 2021 - June 2021

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on High-Performance Computer Architecture (Other), February 2021

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Microarchitecture (Other), October 2020

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Computer Architecture (Spain), May 2020 - June 2020

Type: Conference, symposium, etc.
Committee Member, 次世代計算基盤検討部会

Role: Review and evaluation

Ministry of Education, Culture, Sports, Science and Technology (MEXT), April 2020 - March 2021

Type: Review and academic advice
Other (international academic contribution)

International Symposium on High-Performance Computer Architecture (Other), February 2020

Type: Conference, symposium, etc.
Area Advisor, JST PRESTO (さきがけ) 「革新的な量子情報処理技術基盤の創出」

Role: Review and evaluation

May 2019 - March 2025

Type: Review and academic advice
Other (international academic contribution)

International Symposium on Microarchitecture (Japan), October 2018

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Computer Architecture (United States of America), June 2018

Type: Conference, symposium, etc.
Research Supervisor, JST PRESTO (さきがけ) 「革新的コンピューティング技術の開拓」

Role: Review and evaluation

JST, April 2018 - March 2023

Type: Review and academic advice
Other (international academic contribution)

International Symposium on High-Performance Computer Architecture (United States of America), February 2018

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Computer Architecture (United States of America), June 2017

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Microarchitecture (Taiwan), October 2016

Type: Conference, symposium, etc.
Other (international academic contribution)

18th Asia and South Pacific Design Automation Conference (Japan), January 2013

Type: Conference, symposium, etc.
Other (international academic contribution)

The 41st International Conference on Parallel Processing (United States of America), September 2012

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Low Power Electronics and Design 2012 (United States of America), July 2012 - August 2012

Type: Conference, symposium, etc.
Other (international academic contribution)

International Conference for High Performance Computing, Networking, Storage and Analysis (United States of America), December 2011

Type: Conference, symposium, etc.
Other (international academic contribution)

The 19th Annual IFIP/IEEE Conference on Very Large Scale Integration 2011 (Hong Kong), October 2011

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Low Power Electronics and Design 2011 (Japan), August 2011

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Low Power Electronics and Design 2011 (Japan), August 2011

Type: Conference, symposium, etc.
Other (international academic contribution)

The 6th IEEE International Conference on Networking, Architecture, and Storage (China), July 2011

Type: Conference, symposium, etc.
Other (international academic contribution)

11th International Forum on Embedded MPSoC and Multicore 2011 (France), July 2011

Type: Conference, symposium, etc.
Other (international academic contribution)

The IEEE International Symposium on VLSI 2011 (India), July 2011

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Low Power Electronics and Design (Austin, United States of America), August 2010

Type: Conference, symposium, etc.
Other (international academic contribution)

IEEE Computer Society Annual Symposium on VLSI (Lixouri, Kefalonia, Greece), July 2010

Type: Conference, symposium, etc.
Other (international academic contribution)

International Forum on Embedded MPSoC and Multicore (Gifu, Japan), June 2010 - July 2010

Type: Conference, symposium, etc.
Other (international academic contribution)

The IEEE Symposium on Low-Power and High-Speed Chips (Yokohama, Japan), April 2010

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Embedded Multicore Systems-on-Chip (Vienna, Austria), September 2009

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Low Power Electronics and Design 2009 (San Francisco, United States of America), August 2009

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Low Power Electronics and Design (San Francisco, California, United States of America), August 2009

Type: Conference, symposium, etc.
Other (international academic contribution)

IEEE Computer Society Annual Symposium on VLSI (Tampa, United States of America), May 2009

Type: Conference, symposium, etc.
Other (international academic contribution)

The IEEE Symposium on Low-Power and High-Speed Chips 2009 (Yokohama, Japan), April 2009

Type: Conference, symposium, etc.
Other (international academic contribution)

The Workshop on Synthesis And System Integration of Mixed Information technologies 2009 (Okinawa, Japan), March 2009

Type: Conference, symposium, etc.
Other (international academic contribution)

13th Asia and South Pacific Design Automation Conference 2009 (Yokohama, Japan), January 2009

Type: Conference, symposium, etc.
Other (international academic contribution)

International Conference on Field-Programmable Technology 2008 (Taipei, Taiwan), December 2008

Type: Conference, symposium, etc.
Other (international academic contribution)

MEDEA Workshop MEmory performance: DEaling with Applications, systems and architecture (Toronto, Canada), October 2008

Type: Conference, symposium, etc.
Other (international academic contribution)

International Conference on Field Programmable Logic and Applications (Heidelberg, Germany), September 2008

Type: Conference, symposium, etc.
Other (international academic contribution)

International Symposium on Low Power Electronics and Design 2008 (Bangalore, India), August 2008

Type: Conference, symposium, etc.
Other (international academic contribution)

The IEEE Symposium on Low-Power and High-Speed Chips 2008 (Yokohama, Japan), April 2008

Type: Conference, symposium, etc.
Other (international academic contribution)

12th Asia and South Pacific Design Automation Conference 2008 (Seoul, Korea), January 2008

Type: Conference, symposium, etc.
Other

第57回電気関係学会九州支部連合大会 (Kagoshima University, Japan), September 2004

Type: Conference, symposium, etc.
Other

第17回回路とシステム軽井沢ワークショップ (Karuizawa, Japan), April 2004

Type: Conference, symposium, etc.
英文論文誌A, April 2005 special issue "Special Section on Selected Papers from the 17th Workshop on Circuits and Systems in Karuizawa" (international academic contribution)

January 2004

Type: Academic society, study group, etc.

Joint Research and Competitive Funding Projects

縦型半導体ナノワイヤアレイ量子集積回路基盤技術の創成

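October 2023 - March 2029

JST

Role: Co-investigator

This project aims to realize ultra-low-power electronics that drastically cut the power consumption of today's Si-MOSFET integrated circuits, by establishing the foundational technology and basic science of nanowire-array quantum integrated circuits. In particular, it assumes a three-dimensional structure in which devices with a new structure are integrated in a 3D mesh, and explores new computer architectures for that structure.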
ポストムーア時代を見据えた超伝導コンピューティング技術の創成と展開

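June 2022 - March 2027

KAKENHI Grant-in-Aid for Scientific Research (S)

Role: Principal investigator

About 30 years ago, device research toward superconducting computers flourished worldwide and then entered a winter period. That situation is now changing significantly. Besides progress in materials and circuit technology, research in computer engineering has advanced rapidly over the last few years and produced a series of innovative architectures. The semiconductor scaling that has sustained computer performance growth is expected to end around 2030. Against this background, superconducting computing is again attracting attention as a leading candidate for next-generation computing infrastructure, and the winter period is about to end. The goal of this project is to raise our leading-edge basic research to the system level and to establish it, ahead of the rest of the world, as a cryogenic superconducting general-purpose computing technology. To that end, we carry out research that cuts across the system stack from devices to architecture and create computer architectures that exploit novel devices. This will open a new phase of computer engineering for the post-Moore era, founded on device diversity.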
ポストムーア時代を見据えた超伝導コンピューティング技術の創成と展開

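Project/Area number: 22H00518, April 2022 - March 2026

Grants-in-Aid for Scientific Research (KAKENHI), Grant-in-Aid for Scientific Research (A)

井上弘士, 田中雅光, 川上哲志, 谷本輝夫, 廣川真男, 小野貴継

Funding type: KAKENHI

The goal of this project is to raise our leading-edge basic research, which continues to lead architecture for single-flux-quantum circuits, to the system level and to establish it, ahead of the rest of the world, as a cryogenic superconducting general-purpose computing technology. In the first two years, we develop the component technologies: building the underlying theories, prototyping chips for proof of principle, conceptual architecture design, and device modeling. In the third year we integrate these in a microarchitecture exploration, and in the final year we carry out detailed design and an overall evaluation.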
JST ムーンショット：2050年までに、経済・産業・安全保障を飛躍的に発展させる誤り耐性型汎用量子コンピュータを実現

February 2022 - March 2026

Role: Co-investigator
超伝導量子回路の集積化技術の開発

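February 2022 - March 2026

JST

Role: Co-investigator

The aim is to explore a multi-stage, heterogeneous quantum control architecture inside the refrigerator for superconducting quantum computers. Specifically, we (1) define and design an error-correction-code circuit architecture, (2) build, evaluate, and analyze a system-level environment for exploring quantum computer architectures, and (3) formulate guidelines, based on quantitative evaluation, for cooperative operation between refrigerator stages (in particular the mK and 4 K stages).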
ポストムーア時代を見据えた超伝導コンピューティング技術の創成と展開

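Project/Area number: 22H05000, 2022 - 2026

Japan Society for the Promotion of Science (JSPS), Grants-in-Aid for Scientific Research (KAKENHI), Grant-in-Aid for Scientific Research (S)

井上弘士, 田中雅光, 中村宏, 川上哲志, 板垣奈穂, 谷本輝夫, 浜屋宏平

Role: Principal investigator; Funding type: KAKENHI

This project aims at creating new computing principles that presuppose the use of superconducting devices and at pioneering innovative computing technology. Starting from our world-leading basic research to date, we aim to (1) derive the information representation best suited to SFQ circuits and cryogenic computation mechanisms based on it, (2) explore new cryogenic memory and communication schemes that fuse heterogeneous novel devices, and (3) create a cryogenic superconducting general-purpose computer architecture built on these.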
超伝導量子回路の集積化技術の開発

2022 - 2025

Promotion of Strategic Research and Development, Moonshot Research and Development Program

Role: Co-investigator; Funding type: Commissioned research
脳の仕組みに倣った省エネ型の人工知能関連技術の開発・実証事業

October 2021 - March 2024

Ministry of Internal Affairs and Communications

Role: Co-investigator
近似計算手法を制御する進化型コンピュータのアーキテクチャの検討

April 2019 - March 2020

Joint research

Role: Principal investigator; Funding type: Other industry-academia collaboration funding
My-IoT開発プラットフォームの研究開発

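January 2019 - March 2022

Cabinet Office

Role: Principal investigator

This research proposes the My-IoT platform concept, an edge-centric IoT system architecture that lets users easily build their own IoT systems and use them as simply as the PCs they use every day on site. Under this concept, we not only make use of existing IoT assets but also develop innovative technology that lets an IoT system be used like a local PC. By enabling users themselves, without relying on IoT developers, to design, develop, and operate IoT systems that are easy to learn and simple to deploy, we greatly reduce development cost and remove barriers to IoT adoption. We also set up an "IoT store" where not only the platform provider but also platform users can register design assets they have created, building an ecosystem with a sharing element in which developers and users can share know-how on developing and using IoT systems, free or for a fee. To realize this concept, we conduct research and development on virtualized system architecture, next-generation edge computing, environment-adaptive edge actuation, and an environment for automatically building and developing edge platforms. We also run demonstration experiments based on assumed use cases, form a community centered on companies in the Kyushu region, and work to disseminate the results.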
ポストムーア時代を支える100ギガヘルツ級時空間超伝導コンピューティング

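Project/Area number: 19H01105, 2019 - 2021

Japan Society for the Promotion of Science (JSPS), Grants-in-Aid for Scientific Research (KAKENHI), Grant-in-Aid for Scientific Research (A)

Role: Principal investigator; Funding type: KAKENHI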
低炭素AI処理基盤のための革新的超伝導コンピューティング

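October 2018 - March 2023

JST

Role: Principal investigator

The goal of this research is to develop SFNuro, an AI processing engine that will be a main component of a cryogenic computing infrastructure for the coming AI society, with practical deployment in mind, and to demonstrate its feasibility and its effect on reducing CO2 emissions as information-processing infrastructure. SFNuro is a neural-network processing engine for deep learning built from single-flux-quantum (SFQ) circuits, and is positioned as a foundation for computing in cryogenic environments. Superconducting circuits that use single flux quanta, such as RSFQ and its derivatives (energy-efficient RSFQ, RQL, AQFP, HSTP), are called "SFQ circuits"; they can operate at very high speed and low power that conventional MOSFETs cannot achieve, and are one of the promising computing technologies for the post-Moore era. Research results on SFQ have been reported before, but (1) architecture-level exploration and (2) optimization with applications in view were insufficient. In addition, (3) the pursuit of fully correct operation forced designers to secure operating margins, which limited power efficiency. Issues (1) to (3) stem from earlier work adopting architectures that imitate existing CMOS general-purpose processors. Solving them requires a fundamental rebuild of the system architecture so that it makes the most of the advantages of SFQ devices and circuits and hides their drawbacks. This research therefore derives such a system organization through cross-layer optimization spanning circuits, architecture, and algorithms.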
近似計算手法を制御する進化型コンピュータのアーキテクチャの検討

April 2018 - March 2019

Joint research

Role: Principal investigator; Funding type: Other industry-academia collaboration funding
低炭素AI処理基盤のための革新的超伝導コンピューティング

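2018 - 2022

Strategic Basic Research Programs (Ministry of Education, Culture, Sports, Science and Technology)

Role: Principal investigator; Funding type: Commissioned research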
My-IoT開発プラットフォームの研究開発

2018 - 2022

Cross-ministerial Strategic Innovation Promotion Program (SIP), Phase 2 / フィジカル空間デジタルデータ処理基盤

Role: Principal investigator; Funding type: Commissioned research
物理事象空間に基づくサイバーセキュリティ技術

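Project/Area number: 17K19984, 2017 - 2018

Japan Society for the Promotion of Science (JSPS), Grants-in-Aid for Scientific Research (KAKENHI), Challenging Research (Exploratory)

Role: Principal investigator; Funding type: KAKENHI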
シリコン限界を凌駕する100ギガヘルツ級超伝導プロセッサ・アーキテクチャの研究

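April 2016 - March 2019

Japan Society for the Promotion of Science (JSPS)

Role: Principal investigator

This research develops, ahead of the rest of the world, a superconducting processor architecture with very high performance and low power, consuming about 5 W and operating at around 100 GHz, as a computing component technology for the post-silicon era. Its effectiveness and feasibility are demonstrated through chip prototypes of the main components and system-level simulation. The work is cross-disciplinary, spanning computer engineering and superconducting engineering, and carries out architecture-circuit co-design that assumes the use of superconducting devices. It thereby shows how to build processors from a new device that replaces silicon and establishes the superconducting circuit design technology needed to realize them.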
シリコン限界を凌駕する100ギガヘルツ級超伝導プロセッサ・アーキテクチャの研究

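Project/Area number: 16H02796, 2016 - 2018

Japan Society for the Promotion of Science (JSPS), Grants-in-Aid for Scientific Research (KAKENHI), Grant-in-Aid for Scientific Research (B)

Role: Principal investigator; Funding type: KAKENHI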
集積ナノフォトニクスによる超低レイテンシ光演算技術の研究

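December 2015 - March 2021

JST

Role: Co-investigator

To solve this problem fundamentally, this research proposes a new optical computing technology that makes full use of precise nanophotonic control, aiming to trigger disruptive innovation in information processing. Optical computers were actively studied in the 1980s and 1990s but are regarded as a technology that declined after failing to show an advantage over CMOS. Building on an analysis of the optical computing research of that period, this research proposes new computing technology with the goal of removing the latency bottleneck expected 10 to 20 years from now.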
集積ナノフォトニクスによる超低レイテンシ光演算技術の研究

2015 - 2020

JST CREST

Role: Co-investigator; Funding type: Commissioned research
宇宙空間コンピューティングの実現に向けた超伝導プロセッサアーキテクチャの研究

Project/Area number: 26540022, 2014 - 2015

Grants-in-Aid for Scientific Research (KAKENHI), Exploratory Research

Role: Principal investigator; Funding type: KAKENHI
ポストペタスケールシステムのための電力マネージメントフレームワークの開発

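October 2012 - March 2018

JST

Role: Co-investigator

In post-petascale high-performance computing systems, the conventional design philosophy, which provisions hardware within the supplied-power or thermal-design-power constraint and guarantees that peak power during operation never exceeds it, makes it difficult to scale applications to future large systems. This project therefore takes the view that post-petascale systems should deliberately allow peak power to exceed the constraint and keep effective power under the constraint by optimizing the hardware's power-performance knobs, and adopts this as the premise of its architectural concept. In such a power-constraint-adaptive system, the idea is not to use up all available hardware resources as before, but to distribute the limited power adaptively among applications, and within each application among computation, storage, and communication, so as to optimize performance and system power efficiency. With such adaptive power control, a single system can meet diverse demands on hardware resources by adjusting its power-performance knobs and can serve many applications.

To achieve high performance and high power efficiency on a power-constraint-adaptive system, power control and power management matched to application characteristics and operating conditions become one of the most important roles of system software. At present, however, not only is there an insufficient body of software, but the requirements on the system architecture and on each software layer are also unclear. This research therefore aims to improve application performance and whole-system power efficiency on power-constraint-adaptive systems by optimizing control of the hardware power-performance knobs according to application characteristics and operating conditions.

Three component technologies are developed: (1) power-performance knob optimization matched to application characteristics and operating conditions, (2) power-performance behavior prediction for large-scale applications, and (3) a system architecture in which system software can control the power-performance knobs effectively. Item (1) develops system software, including libraries and middleware, and performance optimization tools; item (2) develops a set of power prediction tools; and item (3) develops a knob abstraction method that frees software from hardware-dependent optimization. The final goal is a power management framework for the post-petascale era that creates a computing environment able to use power resources effectively.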
Development of a Power Management Framework for Post-Petascale Systems

2012 - 2017

JST CREST

Role: Co-investigator; Funding type: Commissioned research
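SMYLE Project

December 2010 - March 2012

New Energy and Industrial Technology Development Organization, an incorporated administrative agency (Japan)

Role: Principal investigator

Realizing a low-power many-core processor depends above all on thoroughly raising the utilization of a large number of small cores and on sharply cutting the power they consume while running. Developing techniques specific to many-core, namely performance that scales with the core count (more cores give higher performance) and power reduction that scales with the core count (more cores give lower power consumption), is an urgent task. This project therefore proposes, as the form that low-power many-core should take in embedded systems, "virtual accelerators, with the many-core processor as their execution platform." It develops the architecture that makes this possible, defines the necessary APIs, and builds an application development environment that includes a compiler. Effectiveness will be demonstrated through simulation and prototyping, and a survey of application areas for the proposed many-core processor will indicate directions for practical use. The proposed approach gives the hardware flexibility so that the compiler can determine the architecture. This widens the choice of automatic parallelization strategies, so that performance scaling with the core count can be achieved even in embedded systems, where applications are highly diverse. In addition, the problems that arise in ultra-low-voltage operation at around 0.5 to 0.6 V are solved by making full use of the abundant hardware resources of a many-core processor, which enables power reduction that scales with the core count.
The project is carried out under a novel and effective organization free of conventional preconceptions. Specifically, young researchers at Kyushu University (overall coordination, architecture), Ritsumeikan University (compiler), and the University of Electro-Communications (low-power techniques) collaborate closely with two fast-growing venture companies, Fixstars (programming and compiler) and TOPS Systems (processor development and applications), for five organizations in all. The work is carried out at the Inoue Laboratory, Faculty of Information Science and Electrical Engineering, Kyushu University; the Tomiyama Laboratory (冨山研究室), 電子情報デザイン学科, College of Science and Engineering, Ritsumeikan University; the Kondo Laboratory (近藤研究室), Graduate School of Information Systems, University of Electro-Communications; the Fixstars Corporation head office (Osaki); and the TOPS Systems Corporation head office (Tsukuba).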
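SMYLE Many-Core

December 2010 - March 2012

New Energy and Industrial Technology Development Organization, an incorporated administrative agency (Japan)

Role: Principal investigator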

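"Development of Ultra-Low-Power Circuit and System Technologies (Green IT Project)," R&D Item 7: "Architecture and Compiler Technology for Low-Power Many-Core Processors"

2010 - 2012

New Energy and Industrial Technology Development Organization (NEDO)

Role: Principal investigator; Funding type: Commissioned research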
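Research on Many-Core Processors That Enable On-Chip Supercomputing

April 2009 - March 2013

Japan Society for the Promotion of Science (Japan)

Role: Principal investigator

As one of the basic elemental technologies supporting the next-generation information society, this research develops a "new-era 3D many-core processor" that enables on-chip supercomputing, and demonstrates its effectiveness and feasibility through prototyping and simulation. Specifically, several hundred processor cores (hereafter, cores) integrated in three dimensions on a single LSI chip are made to cooperate adaptively, achieving performance comparable to a mid-scale supercomputer as shown in Figure 1, while also reducing power consumption as an environmental measure and improving reliability and safety for stable, secure operation. The goal is practical use in the high-performance backbone servers that will support the near-future information society illustrated in Figure 2.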
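Development of Core Orchestration Technology for Maximizing the Effective Performance of Multicore Processors

April 2009 - March 2012

Semiconductor Technology Academic Research Center: STARC (Japan)

Role: Principal investigator

The aim of this research is to establish core orchestration, in which multiple cores execute cooperatively and adaptively (that is, help one another as needed), in order to draw out the full latent capability of multicore processors. The target is a performance improvement of at least 60% over conventional parallel execution with almost no increase in hardware cost or power consumption (this target is based on preliminary experimental results). The feasibility of the proposed method will be demonstrated through test chip fabrication and prototyping.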
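Research on an Adaptive 3D Microprocessor Architecture for Maximizing Energy Efficiency

January 2009 - December 2012

New Energy and Industrial Technology Development Organization, an incorporated administrative agency: NEDO Grant for Young Researchers (Japan)

Role: Principal investigator

This research combines 3D integration of semiconductor devices with architecture techniques to develop a new microprocessor that maximizes energy efficiency. Specifically, it proposes an adaptive next-generation microprocessor architecture in which multiple processor cores, a dynamically reconfigurable accelerator, and large-capacity memory are stacked in three dimensions. It also establishes the cooperative execution scheme and compilation techniques needed to draw out the full potential of this architecture, shows the effectiveness of the proposed approach, and demonstrates its feasibility through prototyping with practical use in view.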
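Research on Many-Core Processors That Enable On-Chip Supercomputing

Project/area number: 21680005  2009 - 2012

Grants-in-Aid for Scientific Research, Grant-in-Aid for Young Scientists (A)

Role: Principal investigator; Funding type: KAKENHI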
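Research on an Adaptive 3D Microprocessor Architecture for Maximizing Energy Efficiency

2008 - 2012

New Energy and Industrial Technology Development Organization, an incorporated administrative agency (NEDO Grant for Young Researchers)

Role: Principal investigator; Funding type: Commissioned research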
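Reconfigurable Low-Power, High-Performance Processor Based on Single-Flux-Quantum Circuits

September 2006

Role: Co-investigator

The goal is a desk-side computer with a computing capability of about 10 TFLOPS, realized by a processor with a large-scale reconfigurable data path (RDP) built from superconducting single-flux-quantum (SFQ) circuits. The research spans architecture, arithmetic circuits, and devices. Compared with a parallel-processor implementation in current CMOS semiconductor integrated circuit technology, power consumption is expected to fall to 1/10,000 or less for the processor, about 1/400 for the computer as a whole, and about 1/100 when air conditioning and refrigeration are included. Researchers with track records in computer architecture, arithmetic circuits, and SFQ circuits collaborate to establish RDP architecture technology, to develop methods for building reconfigurable circuits from SFQ circuits, and to develop floating-point unit designs suited to SFQ-RDP, thereby establishing the foundation for a 10 TFLOPS computer with a large-scale SFQ-RDP.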
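Development of Petascale System Interconnect Technology

April 2005 - March 2008

Ministry of Education, Culture, Sports, Science and Technology

Role: Co-investigator

The PSI project targets the system interconnect that links thousands to hundreds of thousands of high-speed compute nodes in supercomputer systems beyond the petaflops class. Aiming for an order-of-magnitude improvement in cost-performance over current systems, it develops three elemental technologies for achieving higher performance, richer functionality, and lower cost at the same time: (1) optical packet switching and ultra-compact optical links, (2) MPI acceleration through dynamic communication optimization, and (3) comprehensive performance evaluation of system interconnects.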
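Research on Environment-Adaptive Processors That Combine High Reliability with Low Power Consumption

April 2005 - March 2007

Japan Society for the Promotion of Science (Japan)

Role: Principal investigator

As a basic technology supporting the next-generation information society, this research develops an environment-adaptive processor system that aims to combine improved fault tolerance with lower energy consumption. Assuming use in personal portable electronic devices, it develops a dependable processor that takes into account not only fault tolerance but also security. It also analyzes the trade-off between reliability and energy consumption.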
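Research on Environment-Adaptive Processors That Combine High Reliability with Low Power Consumption

Project/area number: 17680005  2005 - 2007

Grants-in-Aid for Scientific Research, Grant-in-Aid for Young Scientists (A)

Role: Principal investigator; Funding type: KAKENHI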
Research on Secure, Low-Energy Processors

September 2004 - March 2005

Commissioned research

Role: Principal investigator; Funding type: Other industry-academia collaboration funds
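Research on Secure, Low-Energy Processors

September 2003 - March 2007

Japan Science and Technology Agency

Role: Principal investigator

Realizing a safe and stable information society requires both better security and further reductions in the energy consumed by computer systems. This research focuses on the computer virus problem in particular and proposes, as a solution, a dynamic program authentication technique that uses the behavior of program execution as key information. It also builds a processor system of this kind and analyzes the trade-off between security and energy consumption.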
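Research on Secure, Low-Energy Processors

2003 - 2006

Japan Science and Technology Agency, PRESTO (さきがけ) individual research program

Role: Principal investigator; Funding type: Commissioned research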
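Development of High-Performance, Low-Power Memory Systems Based on Prediction Techniques

April 2002 - March 2005

Japan Society for the Promotion of Science (Japan)

Role: Principal investigator

We are developing memory systems that are both high-performance and low-power by exploiting prediction techniques. Program execution and memory access patterns are observed and dynamic optimization is applied, which makes it possible to meet the otherwise conflicting demands of high performance and low power consumption at the same time.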
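Development of High-Performance, Low-Power Memory Systems Using Prediction Techniques

Project/area number: 14702064  2002 - 2004

Grants-in-Aid for Scientific Research, Grant-in-Aid for Young Scientists (A)

Role: Principal investigator; Funding type: KAKENHI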

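Overview of educational activities

In master's and undergraduate education, the emphasis is on acquiring problem-solving ability, and students are taught the whole process consistently, from conceiving a new idea to demonstrating its effectiveness. For doctoral students, guidance additionally centers on acquiring the ability to identify problems. International education is also a priority, pursued through joint research with overseas universities and research institutions, and doctoral students are encouraged to study abroad. Through world-leading research, the aim is to develop people who will sustain next-generation computer architecture technology.

Awards for educational activities

九州大学工学講義賞 (Kyushu University engineering lecture award)

October 2021, Kyushu University

Recipient: Koji Inoue

Courses taught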

Seminar in Information Science and Technology

2023年4月 - 2024年3月通年
Research in Information Science and Technology I

2023年4月 - 2024年3月通年
【通年】情報理工学講究

2023年4月 - 2024年3月通年
【通年】情報理工学演習

2023年4月 - 2024年3月通年
【通年】情報理工学研究Ⅰ

2023年4月 - 2024年3月通年
情報理工学読解

2023年4月 - 2023年9月前期
Presentation Methods in Information Science and Technology

2023年4月 - 2023年9月前期
情報理工学論議Ⅰ

2023年4月 - 2023年9月前期
情報理工学論述Ⅰ

2023年4月 - 2023年9月前期
Advanced Computer System Architecture

2023年4月 - 2023年6月春学期
コンピュータアーキテクチャⅡ

2022年10月 - 2023年3月後期
コンピュータアーキテクチャⅠ（EC）

2022年6月 - 2022年8月夏学期
コンピュータアーキテクチャⅠ（B)

2022年6月 - 2022年8月夏学期
集積回路工学通論B

2022年6月 - 2022年8月夏学期
情報理工学研究Ⅰ

2022年4月 - 2023年3月通年
情報理工学演習

2022年4月 - 2023年3月通年
情報理工学講究

2022年4月 - 2023年3月通年
情報理工学論議Ⅰ

2022年4月 - 2022年9月前期
集積回路工学通論

2022年4月 - 2022年9月前期
情報知能工学演習第二

2022年4月 - 2022年9月前期
情報知能工学講究第二

2022年4月 - 2022年9月前期
情報理工学読解

2022年4月 - 2022年9月前期
情報理工学論述Ⅰ

2022年4月 - 2022年9月前期
Advanced Computer System Architecture

2022年4月 - 2022年6月春学期
集積回路工学通論A

2022年4月 - 2022年6月春学期
コンピュータシステム・アーキテクチャ特論

2022年4月 - 2022年6月春学期
(IUPE)Computer Architecture I

2021年12月 - 2022年2月冬学期
情報理工学演示

2021年10月 - 2022年3月後期
コンピュータアーキテクチャⅡ

2021年10月 - 2022年3月後期
情報知能工学演習第三

2021年10月 - 2022年3月後期
情報知能工学講究第三

2021年10月 - 2022年3月後期
情報知能工学演習第一

2021年10月 - 2022年3月後期
情報知能工学講究第一

2021年10月 - 2022年3月後期
コンピュータアーキテクチャⅡ

2021年10月 - 2022年3月後期
組込みソフトウェア特論

2021年6月 - 2021年8月夏学期
[M2][計算機分野]組込みシステム特論

2021年6月 - 2021年8月夏学期
コンピュータアーキテクチャⅠ

2021年6月 - 2021年8月夏学期
コンピュータアーキテクチャⅠ（A前半，B）

2021年6月 - 2021年8月夏学期
集積回路工学通論B

2021年6月 - 2021年8月夏学期
組込みシステム特論

2021年6月 - 2021年8月夏学期
[M2][通信/社会分野]組込みシステム特論

2021年6月 - 2021年8月夏学期
情報理工学演習

2021年4月 - 2022年3月通年
国際演示技法

2021年4月 - 2022年3月通年
知的財産技法

2021年4月 - 2022年3月通年
ティーチング演習

2021年4月 - 2022年3月通年
先端プロジェクト管理技法

2021年4月 - 2022年3月通年
Scientific English Presentation

2021年4月 - 2022年3月通年
Intellectual Property Management

2021年4月 - 2022年3月通年
Exercise in Teaching

2021年4月 - 2022年3月通年
Advanced Project Management Technique

2021年4月 - 2022年3月通年
計算機構特別講究

2021年4月 - 2022年3月通年
Advanced Research in Computer Systems and Applications

2021年4月 - 2022年3月通年
情報知能工学特別講究第一

2021年4月 - 2022年3月通年
情報知能工学特別講究第二

2021年4月 - 2022年3月通年
知的情報システム工学特別演習

2021年4月 - 2022年3月通年
社会情報システム工学特別演習

2021年4月 - 2022年3月通年
Advanced Research in Advanced Information Technology I

2021年4月 - 2022年3月通年
Advanced Research in Advanced Information Technology II

2021年4月 - 2022年3月通年
Adv Semi in Intelligent Information Systems Engineering

2021年4月 - 2022年3月通年
Advanced Seminar in Social Information Systems Engineering

2021年4月 - 2022年3月通年
情報理工学研究Ⅰ

2021年4月 - 2022年3月通年
情報理工学読解

2021年4月 - 2021年9月前期
[M2]情報知能工学演習第二

2021年4月 - 2021年9月前期
[M2]情報知能工学講究第二

2021年4月 - 2021年9月前期
Exercise in Embedded System

2021年4月 - 2021年9月前期
[M2]Exercise in Embedded System

2021年4月 - 2021年9月前期
[M2]組込みシステム演習

2021年4月 - 2021年9月前期
集積回路工学通論

2021年4月 - 2021年9月前期
組込みシステム演習

2021年4月 - 2021年9月前期
[M2]Advanced Computer System Architecture

2021年4月 - 2021年6月春学期
Advanced Computer System Architecture

2021年4月 - 2021年6月春学期
[M2]コンピュータシステム・アーキテクチャ特論

2021年4月 - 2021年6月春学期
集積回路工学通論A

2021年4月 - 2021年6月春学期
コンピュータシステム・アーキテクチャ特論

2021年4月 - 2021年6月春学期
(IUPE)Computer Architecture I

2020年12月 - 2021年2月冬学期
コンピュータアーキテクチャⅡ

2020年10月 - 2021年3月後期
電気情報工学入門Ⅱ

2020年10月 - 2021年3月後期
コンピュータアーキテクチャⅡ

2020年10月 - 2021年3月後期
情報知能工学演習第一

2020年10月 - 2021年3月後期
情報知能工学演習第三

2020年10月 - 2021年3月後期
情報知能工学講究第一

2020年10月 - 2021年3月後期
情報知能工学講究第三

2020年10月 - 2021年3月後期
集積回路工学通論B

2020年6月 - 2020年8月夏学期
コンピュータアーキテクチャⅠ

2020年6月 - 2020年8月夏学期
コンピュータアーキテクチャⅠ（B）

2020年6月 - 2020年8月夏学期
Advanced Seminar in Social Information Systems Engineering

2020年4月 - 2021年3月通年
国際演示技法

2020年4月 - 2021年3月通年
知的財産技法

2020年4月 - 2021年3月通年
ティーチング演習

2020年4月 - 2021年3月通年
先端プロジェクト管理技法

2020年4月 - 2021年3月通年
Scientific English Presentation

2020年4月 - 2021年3月通年
Intellectual Property Management

2020年4月 - 2021年3月通年
Exercise in Teaching

2020年4月 - 2021年3月通年
Advanced Project Management Technique

2020年4月 - 2021年3月通年
計算機構特別講究

2020年4月 - 2021年3月通年
Advanced Research in Computer Systems and Applications

2020年4月 - 2021年3月通年
情報知能工学特別講究第一

2020年4月 - 2021年3月通年
情報知能工学特別講究第二

2020年4月 - 2021年3月通年
知的情報システム工学特別演習

2020年4月 - 2021年3月通年
社会情報システム工学特別演習

2020年4月 - 2021年3月通年
Advanced Research in Advanced Information Technology I

2020年4月 - 2021年3月通年
Advanced Research in Advanced Information Technology II

2020年4月 - 2021年3月通年
Adv Semi in Intelligent Information Systems Engineering

2020年4月 - 2021年3月通年
情報知能工学講究第二

2020年4月 - 2020年9月前期
電気情報工学入門Ⅰ

2020年4月 - 2020年9月前期
コンピュータシステム・アーキテクチャ特論

2020年4月 - 2020年9月前期
情報知能工学演習第二

2020年4月 - 2020年9月前期
集積回路工学通論

2020年4月 - 2020年6月春学期
集積回路工学通論A

2020年4月 - 2020年6月春学期
(IUPE)Computer Architecture I

2019年12月 - 2020年2月冬学期
情報知能工学講究第三

2019年10月 - 2020年3月後期
コンピュータアーキテクチャⅡ

2019年10月 - 2020年3月後期
情報知能工学演習第一

2019年10月 - 2020年3月後期
情報知能工学演習第三

2019年10月 - 2020年3月後期
情報知能工学講究第一

2019年10月 - 2020年3月後期
コンピュータ・アーキテクチャⅠ

2019年6月 - 2019年8月夏学期
コンピュータアーキテクチャⅠ（B)

2019年6月 - 2019年8月夏学期
集積回路工学通論B

2019年6月 - 2019年8月夏学期
集積回路工学通論A/B

2019年4月 - 2019年9月前期
集積回路工学通論

2019年4月 - 2019年9月前期
コンピュータアーキテクチャ特論

2019年4月 - 2019年9月前期
コンピュータシステム・アーキテクチャ特論

2019年4月 - 2019年9月前期
情報知能工学演習第二

2019年4月 - 2019年9月前期
情報知能工学講究第二

2019年4月 - 2019年9月前期
集積回路工学通論A

2019年4月 - 2019年6月春学期
コンピュータ・アーキテクチャⅡ

2018年10月 - 2019年3月後期
コンピュータアーキテクチャⅡ

2018年10月 - 2019年3月後期
情報知能工学演習第一

2018年10月 - 2019年3月後期
情報知能工学演習第三

2018年10月 - 2019年3月後期
情報知能工学講究第一

2018年10月 - 2019年3月後期
情報知能工学講究第三

2018年10月 - 2019年3月後期
コンピュータ・アーキテクチャⅠ

2018年6月 - 2018年8月夏学期
コンピュータアーキテクチャⅠ

2018年6月 - 2018年8月夏学期
コンピュータシステム・アーキテクチャ特論

2018年4月 - 2018年9月前期
コンピュータアーキテクチャ特論

2018年4月 - 2018年9月前期
コンピュータシステム・アーキテクチャ特論

2018年4月 - 2018年9月前期
情報知能工学演習第二

2018年4月 - 2018年9月前期
情報知能工学講究第二

2018年4月 - 2018年9月前期
情報知能工学講究第三

2017年10月 - 2018年3月後期
コンピュータアーキテクチャⅡ

2017年10月 - 2018年3月後期
情報知能工学演習第一

2017年10月 - 2018年3月後期
情報知能工学演習第三

2017年10月 - 2018年3月後期
情報知能工学講究第一

2017年10月 - 2018年3月後期
コンピュータアーキテクチャⅠ

2017年6月 - 2017年8月夏学期
Advanced Research in Computer Systems and Applications

2017年4月 - 2018年3月通年
国際演示技法

2017年4月 - 2018年3月通年
知的財産技法

2017年4月 - 2018年3月通年
ティーチング演習

2017年4月 - 2018年3月通年
先端プロジェクト管理技法

2017年4月 - 2018年3月通年
Overseas Internship

2017年4月 - 2018年3月通年
Scientific English Presentation

2017年4月 - 2018年3月通年
Intellectual Property Management

2017年4月 - 2018年3月通年
Exercise in Teaching

2017年4月 - 2018年3月通年
Advanced Project Management Technique

2017年4月 - 2018年3月通年
情報知能工学特別講究第一

2017年4月 - 2018年3月通年
情報知能工学特別講究第二

2017年4月 - 2018年3月通年
Advanced Research in Advanced Information Technology I

2017年4月 - 2018年3月通年
Advanced Research in Advanced Information Technology II

2017年4月 - 2018年3月通年
知的情報システム工学特別演習

2017年4月 - 2018年3月通年
社会情報システム工学特別演習

2017年4月 - 2018年3月通年
Adv Semi in Intelligent Information Systems Engineering

2017年4月 - 2018年3月通年
Advanced Seminar in Social Information Systems Engineering

2017年4月 - 2018年3月通年
計算機構特別講究

2017年4月 - 2018年3月通年
コンピュータ・アーキテクチャⅠ

2017年4月 - 2017年9月前期
プログラミング演習

2017年4月 - 2017年9月前期
コンピュータアーキテクチャ特論

2017年4月 - 2017年9月前期
コンピュータシステム・アーキテクチャ特論

2017年4月 - 2017年9月前期
情報知能工学演習第二

2017年4月 - 2017年9月前期
情報知能工学講究第二

2017年4月 - 2017年9月前期
コンピュータシステム・アーキテクチャ特論

2017年4月 - 2017年9月前期
コンピュータ・アーキテクチャⅡ

2017年4月 - 2017年9月前期
コンピュータ・アーキテクチャⅠ

2016年4月 - 2016年9月前期
コンピュータ・アーキテクチャⅡ

2016年4月 - 2016年9月前期
コンピュータシステム・アーキテクチャ特論

2016年4月 - 2016年9月前期
ハードウェア設計論特論

2015年10月 - 2016年3月後期
コンピュータ・アーキテクチャⅠ

2015年4月 - 2015年9月前期
コンピュータシステム・アーキテクチャ特論

2015年4月 - 2015年9月前期
回路理論Ⅰ

2015年4月 - 2015年9月前期
ハードウェア設計論特論

2014年10月 - 2015年3月後期
コンピュータアーキテクチャ特論

2014年4月 - 2014年9月前期
コンピュータ・アーキテクチャⅠ

2014年4月 - 2014年9月前期
情報処理演習I

2013年10月 - 2014年3月後期
コンピュータアーキテクチャ特論

2013年4月 - 2013年9月前期
コンピュータ・アーキテクチャⅠ

2013年4月 - 2013年9月前期
コンピュータ・アーキテクチャⅠ

2012年4月 - 2012年9月前期
コンピュータアーキテクチャ特論

2012年4月 - 2012年9月前期
コンピュータアーキテクチャ特論

2011年4月 - 2011年9月前期
コンピュータ・アーキテクチャⅠ

2011年4月 - 2011年9月前期
コンピュータアーキテクチャ特論

2010年4月 - 2010年9月前期
コンピュータ・アーキテクチャⅠ

2010年4月 - 2010年9月前期
コンピュータ・アーキテクチャⅠ

2009年4月 - 2009年9月前期
コンピュータアーキテクチャ特論

2009年4月 - 2009年9月前期
計算機構成論Ⅰ

2008年10月 - 2009年3月後期
システム・アーキテクチャ特論

2008年10月 - 2009年3月後期
情報論理学

2008年10月 - 2009年3月後期
コンピュータ・アーキテクチャⅠ

2008年4月 - 2008年9月前期
システム・アーキテクチャ特論

2007年10月 - 2008年3月後期
情報論理学

2007年10月 - 2008年3月後期
コンピュータ・アーキテクチャⅠ

2007年4月 - 2007年9月前期
システムアーキテクチャ特論

2006年10月 - 2007年3月後期
システム・アーキテクチャ特論

2005年10月 - 2006年3月後期
情報科学講究

2005年10月 - 2006年3月後期
情報論理学

2005年10月 - 2006年3月後期
情報理学演習第一

2005年4月 - 2006年3月通年
情報科学特別研究

2005年4月 - 2006年3月通年
基礎情報学特別演習

2005年4月 - 2006年3月通年
基礎情報学特別講究

2005年4月 - 2006年3月通年
情報理学特別演習第一

2005年4月 - 2006年3月通年
情報理学特別講究第一

2005年4月 - 2006年3月通年
情報理学特別研究

2005年4月 - 2006年3月通年
情報理学講究第二

2005年4月 - 2006年3月通年
情報理学講究第一

2005年4月 - 2006年3月通年

Faculty development (FD) participation

May 2024  Title: Recent trends in KAKENHI (Grants-in-Aid for Scientific Research)

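Visiting, concurrent, and part-time teaching positions at other universities and institutions

2023  National Institute of Informatics  Category: Visiting faculty  Domestic/overseas: Domestic
2022  National Institute of Informatics  Category: Visiting faculty  Domestic/overseas: Domestic
2021  National Institute of Informatics  Category: Visiting faculty  Domestic/overseas: Domestic
2020  National Institute of Informatics  Category: Visiting faculty  Domestic/overseas: Domestic
2013  University of Kitakyushu  Category: Part-time lecturer  Domestic/overseas: Domestic

Term, day and period, or duration: First semester
2012  University of Kitakyushu  Category: Part-time lecturer  Domestic/overseas: Domestic

Term, day and period, or duration: First semester
2011  University of Kitakyushu  Category: Part-time lecturer  Domestic/overseas: Domestic

Term, day and period, or duration: First semester
2010  University of Kitakyushu  Category: Part-time lecturer  Domestic/overseas: Domestic

Term, day and period, or duration: First semester, biweekly
2009  University of Kitakyushu  Category: Part-time lecturer  Domestic/overseas: Domestic

Term, day and period, or duration: First semester, intensive course
2007  University of Kitakyushu  Category: Part-time lecturer  Domestic/overseas: Domestic

Term, day and period, or duration: First semester, intensive course
2007  Fukuoka University  Category: Part-time lecturer  Domestic/overseas: Domestic

Term, day and period, or duration: Second semester, 4th period
2006  University of Kitakyushu  Category: Part-time lecturer  Domestic/overseas: Domestic

Term, day and period, or duration: First semester, intensive course
2006  Fukuoka University  Category: Part-time lecturer  Domestic/overseas: Domestic

Term, day and period, or duration: Second semester
2005  Fukuoka University  Category: Part-time lecturer  Domestic/overseas: Domestic

Term, day and period, or duration: Second semester, Tuesday 4th period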

Participation in international education events

UPWARDS

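Other educational activities and notes

2023  Class advisor, undergraduate school
2022  Class advisor, undergraduate school
2021  Class advisor, undergraduate school
2020  Class advisor, undergraduate school
2013  Class advisor, undergraduate school
2012  Class advisor, undergraduate school
2011  Class advisor, undergraduate school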

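University-wide committee memberships and posts

October 2024 - present  Vice President

Committee memberships and posts in faculties and other units

March 2023 - March 2024  Center: Director, Quantum Computing System Center
April 2019 - March 2024  Faculty: Director, System LSI Research Center
April 2019 - March 2021  Faculty: Chair, Information and Communications Committee
April 2017 - March 2024  Center: Director, EJUST連携センター (EJUST collaboration center)
April 2015 - March 2017  Center: Deputy Director, EJUST連携センター (EJUST collaboration center)
January 2014 - December 2014  Undergraduate school: Secretary, 友誼会
April 2013 - March 2014  Undergraduate school: Member, educational computer specification committee (coordinating the undergraduate system)
April 2010 - March 2012  Faculty: Secretary, 乙酉会
January 2004 - March 2012  Faculty: Executive committee member, 夏の理科教室 (Summer Science School)
Other: Professor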

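Overview of social contribution and international collaboration

Activities include holding research briefings for companies and serving as an officer of international conferences. Joint research with Seoul National University is also under way.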

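Social contribution activities

At the 夏の理科教室 (Summer Science School) for junior high school students, we run a hands-on course on understanding how a computer works, using robots as the subject.

Faculty of Information Science and Electrical Engineering, Kyushu University; Ito Campus, Kyushu University  August 2009

Audience: Working adults and general public, academic societies, companies, civic groups, government agencies

Type: Seminar/workshop
At the 夏の理科教室 (Summer Science School) for junior high school students, we run a hands-on course on understanding how a computer works, using robots as the subject.

August 2009

Audience: Preschool children and younger, elementary school students, junior high school students, high school students

Type: Seminar/workshop
At the 夏の理科教室 (Summer Science School) for junior high school students, we run a hands-on course on understanding how a computer works, using robots as the subject.

Faculty of Information Science and Electrical Engineering, Kyushu University; Ito Campus, Kyushu University  August 2009

Type: Seminar/workshop
At the 夏の理科教室 (Summer Science School) for junior high school students, we run a hands-on course on understanding how a computer works, using robots as the subject.

August 2009

Type: Science café
At the 夏の理科教室 (Summer Science School) for junior high school students, we run a hands-on course on understanding how a computer works, using robots as the subject.

Faculty of Information Science and Electrical Engineering, Kyushu University; Ito Campus, Kyushu University  August 2008

Audience: Working adults and general public, academic societies, companies, civic groups, government agencies

Type: Seminar/workshop
At the 夏の理科教室 (Summer Science School) for junior high school students, we run a hands-on course on understanding how a computer works, using robots as the subject.

August 2008

Audience: Preschool children and younger, elementary school students, junior high school students, high school students

Type: Seminar/workshop
At the 夏の理科教室 (Summer Science School) for junior high school students, we run a hands-on course on understanding how a computer works, using robots as the subject.

Faculty of Information Science and Electrical Engineering, Kyushu University; Ito Campus, Kyushu University  August 2008

Type: Seminar/workshop
At the 夏の理科教室 (Summer Science School) for junior high school students, we run a hands-on course on understanding how a computer works, using robots as the subject.

August 2008

Type: Science café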