Kyushu University Academic Staff Educational and Research Activities Database
List of Papers
Satoshi Ohshima Last modified date: 2024.04.10

Associate Professor / Section of Advanced Computational Science / Research Institute for Information Technology


Papers
1. Kenji Sugisaki, Srinivasa Prasannaa, Satoshi Ohshima, Takahiro Katagiri, Yuji Mochizuki, Bijaya Kumar Sahoo, Bhanu Pratap Das, Bayesian phase difference estimation algorithm for direct calculation of fine structure splitting: accelerated simulation of relativistic and quantum many-body effects, Electronic Structure, 10.1088/2516-1075/acf909, 2023.09, Despite rapid progress in the development of quantum algorithms in quantum computing as well as numerical simulation methods in classical computing for atomic and molecular applications, no systematic and comprehensive electronic structure study of atomic systems that covers almost all of the elements in the periodic table using a single quantum algorithm has been reported. In this work, we address this gap by implementing the recently proposed quantum algorithm, the Bayesian Phase Difference Estimation (BPDE) approach, to determine fine structure splittings of a wide range of boron-like atomic systems. Since accurate estimates of fine structure splittings depend strongly on relativistic as well as quantum many-body effects, our study can test the potential of the BPDE approach to produce results close to the experimental values. Our numerical simulations reveal that the BPDE algorithm, in the Dirac–Coulomb–Breit framework, can predict fine structure splittings of the ground states of the considered systems quite precisely. We performed our simulations of relativistic and electron correlation effects on a Graphics Processing Unit by utilizing NVIDIA’s cuQuantum, and observed a 42.7× speedup compared to CPU-only simulations in an 18-qubit active space.
2. Satoshi Ohshima, Akihiro Ida, Rio Yokota, Ichitaro Yamazaki, QR Factorization of Block Low-Rank Matrices on Multi-Instance GPU, Parallel and Distributed Computing, Applications and Technologies, https://doi.org/10.1007/978-3-031-29927-8_28, 2023.04.
3. Development Status of ABINIT-MP in 2022
We have been developing the ABINIT-MP program for fragment molecular orbital (FMO) calculations for over 20 years. Several improvements for accelerated processing were made after the release of Open Version 2 Revision 4 in September 2021. Functionalities were enhanced as well. In this short report, we summarize these developments toward the next release, Revision 8.
4. Naruya Kitai, Daisuke Takahashi, Franz Franchetti, Takahiro Katagiri, Satoshi Ohshima, Toru Nagai, Adaptation of A64 Scalable Vector Extension for Spiral, IPSJ SIG Technical Report (HPC-178), 1-6, 2021.03.
5. Fumiya Ishiguro, Takahiro Katagiri, Satoshi Ohshima, Toru Nagai, Performance Evaluation of Accurate Matrix-Matrix Multiplication on GPU Using Sparse Matrix Multiplications, 2020 Eighth International Symposium on Computing and Networking Workshops (CANDARW), 10.1109/CANDARW51189.2020.00044, 178-184, 2020.11.
6. Utilization of Low-precision and Mixed-precision Calculation in Parareal Method.
7. Kenji Ono, Toshihiro Kato, Satoshi Ohshima, Takeshi Nanri, Scalable Direct-Iterative Hybrid Solver for Sparse Matrices on Multi-Core and Vector Architectures, HPCAsia2020: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 10.1145/3368474.3368484, 11-21, 2020.01, In the present paper, we propose an efficient direct-iterative hybrid solver for sparse matrices that can exploit the scalability of the latest multi-core, many-core, and vector architectures, and we examine the execution performance of the proposed SLOR-PCR method. We also present an efficient implementation of the PCR algorithm for SIMD and vector architectures so that the compiler can easily generate optimized instructions. The proposed hybrid method has high cache reusability, which is favorable for modern low-B/F architectures because efficient use of the cache can mitigate the memory bandwidth limitation. The measured performance revealed that the SLOR-PCR solver showed excellent scalability up to 352 cores in the cc-NUMA environment, and the achieved performance was higher than that of the conventional Jacobi and Red-Black ordering methods by a factor of 3.6 to 8.3 on the SIMD architecture. In addition, the maximum speedup in computation time was a factor of 6.3 on the cc-NUMA architecture with 352 cores.
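As an aside for readers unfamiliar with the PCR building block named in the abstract above, the following is a minimal C/OpenMP sketch of parallel cyclic reduction for a tridiagonal system. It is a generic textbook formulation under assumed conventions (a[0] = c[n-1] = 0, out-of-range neighbors treated as zero), not the authors' SLOR-PCR implementation; all names are invented for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Solve a tridiagonal system by parallel cyclic reduction (PCR).
 * Row i encodes a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
 * with a[0] = c[n-1] = 0. The coefficient arrays are overwritten. */
void pcr_solve(int n, double *a, double *b, double *c, double *d, double *x)
{
    double *a2 = malloc(n * sizeof *a2), *b2 = malloc(n * sizeof *b2);
    double *c2 = malloc(n * sizeof *c2), *d2 = malloc(n * sizeof *d2);

    for (int s = 1; s < n; s *= 2) {         /* stride doubles each sweep */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            /* eliminate x[i-s] and x[i+s] from equation i */
            double alpha = (i - s >= 0) ? -a[i] / b[i - s] : 0.0;
            double gamma = (i + s <  n) ? -c[i] / b[i + s] : 0.0;
            b2[i] = b[i] + (i - s >= 0 ? alpha * c[i - s] : 0.0)
                         + (i + s <  n ? gamma * a[i + s] : 0.0);
            d2[i] = d[i] + (i - s >= 0 ? alpha * d[i - s] : 0.0)
                         + (i + s <  n ? gamma * d[i + s] : 0.0);
            a2[i] = (i - s >= 0) ? alpha * a[i - s] : 0.0;
            c2[i] = (i + s <  n) ? gamma * c[i + s] : 0.0;
        }
        memcpy(a, a2, n * sizeof *a); memcpy(b, b2, n * sizeof *b);
        memcpy(c, c2, n * sizeof *c); memcpy(d, d2, n * sizeof *d);
    }
    #pragma omp parallel for
    for (int i = 0; i < n; i++)              /* equations are now decoupled */
        x[i] = d[i] / b[i];

    free(a2); free(b2); free(c2); free(d2);
}
```

Every update within a sweep is independent, which is why PCR vectorizes well on SIMD and vector architectures, at the cost of more arithmetic than the sequential Thomas algorithm.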
8. Satoshi Ohshima, Ichitaro Yamazaki, Akihiro Ida, Rio Yokota, Optimization of Numerous Small Dense-Matrix-Vector Multiplications in H-matrix Arithmetic on GPU, 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 10.1109/MCSoC.2019.00009, 9-16, 2019.10.
9. Optimization of numerous small dense matrix-vector product computations for accelerating hierarchical matrix methods on GPUs.
10. Satoshi Ohshima, Soichiro Suzuki, Tatsuya Sakashita, Masao Ogino, Takahiro Katagiri, Yoshimichi Andoh, Performance evaluation of the MODYLAS application on modern multi-core and many-core environments, In Proceedings of IPDPSW2019, pp.xx--xx, 2019.08.
11. Satoshi Ohshima, Soichiro Suzuki, Tatsuya Sakashita, Masao Ogino, Takahiro Katagiri, Yoshimichi Andoh, Performance evaluation of the MODYLAS application on modern multi-core and many-core environments, IPDPSW2019, 2019.05.
12. Performance evaluation of the molecular dynamics application MODYLAS on 512-bit SIMD environments.
13. Performance evaluation of the Chebyshev basis communication-avoiding CG method on multi-core and many-core computing environments.
14. Performance evaluation on GPU environments of implementation schemes for high-precision matrix-matrix multiplication using Batched BLAS and sparse matrix operations.
15. Ichitaro Yamazaki, Ahmad Abdelfattah, Akihiro Ida, Satoshi Ohshima, Stanimire Tomov, Rio Yokota, Jack Dongarra, Performance of Hierarchical-matrix BiCGStab Solver on GPU Clusters, 32nd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018 Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018, 10.1109/IPDPS.2018.00102, 930-939, 2018.08, HACApK is a software package for solving dense linear systems of equations and is used in other software packages, like ppohBEM for solving boundary integral equations. To enable the solution of large-scale boundary value problems, HACApK hierarchically compresses the coefficient matrix and uses the BiConjugate Gradient Stabilized (BiCGStab) method for solving the linear system. To extend HACApK's capability, this paper outlines how we ported the HACApK linear solver onto GPU clusters. Though the potential of GPUs has been widely accepted in high-performance computing, it is still a challenge to utilize the GPUs for a solver, like HACApK, that requires fine-grained irregular computation and global communication. To utilize the GPUs, we integrated the variable-size batched GPU kernel that was recently released in the MAGMA software package. This is the first time the variable-size batched kernels were used in a solver or application code. We discuss several techniques to improve the performance of the batched kernel and demonstrate the effects of these techniques on two state-of-the-art GPU clusters. For instance, with two 14-core Intel Xeon CPUs and four NVIDIA P100 GPUs per node, the GPU kernel obtained a solver speedup of 8× on one node and 4× on eight nodes. We also show that when the inter-GPU communication becomes significant, the solution time can be further reduced by a factor of 2× by carefully designing the communication layer with the underlying node architecture in mind.
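To picture the variable-size batched kernel mentioned above, here is a minimal CPU-side sketch in C/OpenMP of the batching idea: one parallel loop over many small GEMVs of differing sizes instead of one launch per GEMV. It is an illustration under assumed data layouts, not MAGMA's actual GPU kernel or API.

```c
/* y_k = A_k * x_k for each batch entry k, where A_k is m[k] x n[k],
 * stored row-major. Batching amortizes the per-kernel launch and
 * scheduling overhead that would dominate for tiny matrices. */
void batched_gemv(int batch, const int *m, const int *n,
                  const double *const *A, const double *const *x,
                  double *const *y)
{
    #pragma omp parallel for schedule(dynamic)
    for (int k = 0; k < batch; k++)
        for (int i = 0; i < m[k]; i++) {
            double sum = 0.0;
            for (int j = 0; j < n[k]; j++)
                sum += A[k][i * n[k] + j] * x[k][j];
            y[k][i] = sum;
        }
}
```

The dynamic schedule matters because the low-rank blocks of an H-matrix vary widely in size, so a static partition of the batch would load-balance poorly.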
16. Performance evaluation on multi-core and many-core computing environments of the Chebyshev basis communication-avoiding CG method.
17. A proposal of an implementation scheme using Batched BLAS for high-precision matrix-matrix multiplication algorithms on GPGPUs.
18. An investigation of the collective-communication hiding effect of the collective-communication offload function of Mellanox switches.
19. Yoshimichi Andoh, Soichiro Suzuki, Satoshi Ohshima, Tatsuya Sakashita, Masao Ogino, Takahiro Katagiri, Noriyuki Yoshii, Susumu Okazaki, A thread-level parallelization of pairwise additive potential and force calculations suitable for current many-core architectures, Journal of Supercomputing, 10.1007/s11227-018-2272-2, 74, 6, 2449-2469, 2018.06, In molecular dynamics (MD) simulations, calculations of potentials and their derivatives with respect to coordinates, i.e., forces, in a pairwise additive manner, such as the Lennard–Jones interactions and the short-range part of the Coulombic interactions, form the main part of the arithmetic operations. It is essential to achieve high thread-level parallelization efficiency in these pairwise additive calculations of potentials and forces to use current supercomputers with many-core architectures effectively. In this paper, we propose four new thread-level parallelization algorithms for the pairwise additive potential and force calculations. We implement the four codes in an MD calculation code based on the fast multipole method. Performance benchmarks were taken on the FX100 supercomputer and the Intel Xeon Phi coprocessor. The code succeeds in achieving high thread-level parallelization efficiency with 32 threads on the FX100 and up to 60 threads on the Xeon Phi.
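As a sketch of what thread-level parallelization of pairwise additive force calculations involves, the following C/OpenMP fragment uses one common strategy: private per-thread force buffers so that Newton's-third-law updates do not race. The paper proposes four specific algorithms; this sketch reproduces none of them, and all names in it are invented.

```c
#include <omp.h>
#include <stdlib.h>

/* Accumulate Lennard-Jones forces into f (caller zeroes f beforehand).
 * Each thread writes to its own private buffer; buffers are reduced
 * into f afterwards, avoiding races on f[j] when exploiting symmetry. */
void lj_forces(int n, const double (*r)[3], double (*f)[3],
               double eps, double sigma)
{
    int nthreads = omp_get_max_threads();
    double *fp_all = calloc((size_t)nthreads * n * 3, sizeof *fp_all);

    #pragma omp parallel
    {
        double *fp = fp_all + (size_t)omp_get_thread_num() * n * 3;
        #pragma omp for schedule(dynamic, 16)   /* rows shrink with i */
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {   /* each pair once */
                double dx = r[i][0] - r[j][0];
                double dy = r[i][1] - r[j][1];
                double dz = r[i][2] - r[j][2];
                double r2 = dx*dx + dy*dy + dz*dz;
                double s2 = sigma * sigma / r2;
                double s6 = s2 * s2 * s2;
                double fr = 24.0 * eps * (2.0 * s6 * s6 - s6) / r2;
                fp[3*i+0] += fr * dx;  fp[3*j+0] -= fr * dx;
                fp[3*i+1] += fr * dy;  fp[3*j+1] -= fr * dy;
                fp[3*i+2] += fr * dz;  fp[3*j+2] -= fr * dz;
            }
        #pragma omp for                          /* reduce over threads */
        for (int i = 0; i < n; i++)
            for (int t = 0; t < nthreads; t++) {
                f[i][0] += fp_all[(size_t)t*n*3 + 3*i+0];
                f[i][1] += fp_all[(size_t)t*n*3 + 3*i+1];
                f[i][2] += fp_all[(size_t)t*n*3 + 3*i+2];
            }
    }
    free(fp_all);
}
```

The trade-off such papers study is exactly this kind of choice: private buffers cost memory proportional to the thread count, which becomes significant at 60+ threads on many-core processors.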
20. Software Auto-Tuning for Hierarchical Matrix Computation.
21. Performance of Hierarchical-matrix BiCGStab Solver on GPU Clusters.
22. Satoshi Ohshima, Ichitaro Yamazaki, Akihiro Ida, Rio Yokota, Optimization of Hierarchical Matrix Computation on GPU, In Proceedings of Supercomputing Frontiers (SCFA 2018), Lecture Notes in Computer Science, vol. 10776, Springer, Cham, pp. 274-292, Print ISBN 978-3-319-69952-3, Online ISBN 978-3-319-69953-0, 10.1007/978-3-319-69953-0_16, 2018.03, [URL].
23. Satoshi Ohshima, Ichitaro Yamazaki, Akihiro Ida, Rio Yokota, Optimization of Hierarchical matrix computation on GPU, Lecture Notes in Computer Science, 10.1007/978-3-319-69953-0_16, 10776, 2018.03.
24. Satoshi Ohshima, Ichitaro Yamazaki, Akihiro Ida, Rio Yokota, Optimization of hierarchical matrix computation on GPU, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10.1007/978-3-319-69953-0_16, 10776, 274-292, 2018.03, The demand for dense matrix computation in large-scale and complex simulations is increasing; however, the memory capacity of current computer systems is insufficient for such simulations. The hierarchical matrix method (H-matrices) is attracting attention as a computational method that can reduce the memory requirements of dense matrix computations. However, the computation of H-matrices is more complex than that of dense and sparse matrices; thus, acceleration of H-matrices is required. We focus on H-matrix-vector multiplication (HMVM) on a single NVIDIA Tesla P100 GPU. We implement five GPU kernels and compare execution times among various processors (Broadwell-EP, Skylake-SP, and Knights Landing) using OpenMP. The results show that, although HMVM consists of many small GEMV computations, merging them into a single GPU kernel was the most effective implementation. Moreover, the performance of the BATCHED BLAS in the MAGMA library was comparable to that of the manually tuned GPU kernel.
25. Application of Batched BLAS to high-precision matrix-matrix multiplication algorithms.
26. Yoshimichi Andoh, Soichiro Suzuki, Satoshi Ohshima, Tatsuya Sakashita, Masao Ogino, Takahiro Katagiri, Noriyuki Yoshii, Susumu Okazaki, A thread-level parallelization of pairwise additive potential and force calculations suitable for current many-core architectures, The Journal of Supercomputing, 10.1007/s11227-018-2272-2, 1573-0484, 2018.02, [URL].
27. Performance evaluation of the supercomputer system ITO.
28. An investigation of the communication-hiding effect of non-blocking collective communication.
29. Initial construction of a deep learning training environment on a supercomputer.
30. Optimization of hierarchical matrix computation for GPUs.
31. Optimization of hierarchical matrix computation on GPU clusters.
32. Takahiro Katagiri, Satoshi Ohshima, Masaharu Matsumoto, Auto-tuning on NUMA and Many-core Environments with an FDM code, 31st IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017 The Twelfth International Workshop on Automatic Performance Tuning (iWAPT2017) (In Conjunction with the IEEE IPDPS2017), 2017.06.
33. Takahiro Katagiri, Satoshi Ohshima, Masaharu Matsumoto, Auto-Tuning on NUMA and many-core environments with an FDM code, Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017, 10.1109/IPDPSW.2017.27, 1399-1407, 2017.06, In this paper, we focus on auto-tuning (AT) performance on nonuniform memory access (NUMA) and many-core architectures. Code from the finite difference method (FDM) is selected to evaluate AT performance, and results on the Xeon Phi (Knights Landing, KNL) for the four combinations of memory modes (FLAT and CACHE) and cluster modes (QUADRANT and SNC4) yielded the following findings: (1) The KNL memory mode did not affect overall performance, except for FLAT-SNC4; the difference in execution time between the CACHE mode and the FLAT mode was only 0.99%. (2) Hyper-threading (HT) technology worked well, yielding 1.86x (baseline) and 1.50x (with AT). (3) Varying the hybrid MPI/OpenMP execution was very effective for KNL; the maximum speedup factors were 2.16x in the baseline and 2.91x with AT. (4) AT with code selection remained a powerful tool, even on KNL. We obtained speedups from AT of up to 1.64x. Moreover, there was room for a further 1.31x speedup by adapting AT for the fastest execution.
34. Tetsuya Hoshino, Satoshi Ohshima, Toshihiro Hanawa, Kengo Nakajima, Akihiro Ida, Pascal vs KNL: Performance Evaluation with ICCG Solve, HPC in Asia Workshop Poster Session, ISC High Performance 2017, 2017.06.
35. A Consideration of Auto-tuning Technology for the Finite Difference Method in the Post-Moore Era.
36. Performance evaluation of the GPU-equipped supercomputer Reedbush-H.
37. Performance evaluation of an ICCG solver using OpenACC on a Pascal GPU.
38. OpenMP and MPI performance optimization on Xeon Phi + Omni-Path environments.
39. Optimization of ICCG Solver for Intel Xeon Phi.
40. Performance Evaluation of Pipelined CG Method.
41. Performance evaluation of Reedbush-U, a supercomputer system for integrated data analysis and simulation.
42. Transformation of numerical algorithms and auto-tuning technology for high-memory-bandwidth environments: an FDM code as an example.
43. Auto-tuning in the era of high memory bandwidth enabled by 3D stacking technology: an FDM code as an example.
44. Hierarchical matrix-vector multiplication using FPGAs.
45. Optimization of hierarchical matrix-vector multiplication for many-core processors.
46. Kengo Nakajima, Masaki Satoh, Takashi Furumura, Hiroshi Okuda, Takeshi Iwashita, Hide Sakaguchi, Takahiro Katagiri, Masaharu Matsumoto, Satoshi Ohshima, Hideyuki Jitsumoto, Takashi Arakawa, Futoshi Mori, Takeshi Kitayama, Akihiro Ida, Miki Y. Matsuo, ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with Automatic Tuning (AT), OPTIMIZATION IN THE REAL WORLD: TOWARD SOLVING REAL-WORLD OPTIMIZATION PROBLEMS, 10.1007/978-4-431-55420-2_2, 13, 15-35, 2016.08, ppOpen-HPC is an open source infrastructure for development and execution of large-scale scientific applications on post-peta-scale (pp) supercomputers with automatic tuning (AT). ppOpen-HPC focuses on parallel computers based on many-core architectures and consists of various types of libraries covering general procedures for scientific computations. The source code, developed on a PC with a single processor, is linked with these libraries, and the parallel code generated is optimized for post-peta-scale systems. In this article, recent achievements and progress of the ppOpen-HPC project are summarized.
47. Takahiro Katagiri, Masaharu Matsumoto, Satoshi Ohshima, Auto-tuning of hybrid MPI/OpenMP execution with code selection by ppOpen-AT, 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016 Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, 10.1109/IPDPSW.2016.49, 1488-1495, 2016.07, In this paper, we propose an effective kernel implementation for an application of the finite difference method (FDM) by merging computations of central-difference and explicit time expansion schemes without IF statements inside the loops. The effectiveness of the implementation depends on the CPU architecture and execution situation, such as the problem size and the number of MPI processes and OpenMP threads. We adopt auto-tuning (AT) technology to select the best implementation. The AT function for the selection, referred to as "code selection", is implemented in an AT language, namely, ppOpen-AT. The results of experiments conducted using current advanced CPUs (Xeon Phi, Ivy Bridge, and FX10) indicated that crucial speedups over conventional AT are achieved by code selection. In particular, the heaviest kernels achieved speedups of 4.21x (Xeon Phi), 2.52x (Ivy Bridge), and 2.03x (FX10).
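The "code selection" idea in the entry above can be made concrete with a small sketch: two functionally equivalent kernel variants (separate central-difference and time-update loops versus one merged loop without IF statements) are timed once, and the faster one is used thereafter. ppOpen-AT generates this machinery from directives; the explicit C below is only an illustration of the concept, with all names invented.

```c
#include <omp.h>

typedef void (*kernel_fn)(int n, const double *in, double *out);

/* variant A: central difference, then explicit time update, as two loops */
static void variant_separate(int n, const double *in, double *out)
{
    for (int i = 1; i < n - 1; i++) out[i] = in[i-1] - 2.0*in[i] + in[i+1];
    for (int i = 1; i < n - 1; i++) out[i] = in[i] + 0.5 * out[i];
}

/* variant B: the same computation merged into a single loop */
static void variant_merged(int n, const double *in, double *out)
{
    for (int i = 1; i < n - 1; i++)
        out[i] = in[i] + 0.5 * (in[i-1] - 2.0*in[i] + in[i+1]);
}

/* time each variant once on real data and return the winner */
kernel_fn select_kernel(int n, const double *in, double *out)
{
    kernel_fn candidates[] = { variant_separate, variant_merged };
    kernel_fn best = candidates[0];
    double best_t = 1e300;
    for (int k = 0; k < 2; k++) {
        double t0 = omp_get_wtime();
        candidates[k](n, in, out);
        double t = omp_get_wtime() - t0;
        if (t < best_t) { best_t = t; best = candidates[k]; }
    }
    return best;   /* call this variant for the remaining time steps */
}
```

Which variant wins depends on the processor, problem size, and thread count, which is why these papers report different best choices on the Xeon Phi, Ivy Bridge, and FX10.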
48. Satoshi Ohshima, Takahiro Katagiri, Masaharu Matsumoto, Utilization and expansion of ppOpen-AT for OpenACC, 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016 Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, 10.1109/IPDPSW.2016.123, 1496-1505, 2016.07, For application programmers, reducing the effort of optimizing programs is an important issue. Our solution to this issue is an auto-tuning (AT) technique. We are developing an AT language named ppOpen-AT. We have shown that this language is useful for multi- and many-core parallel programming. Today, OpenACC attracts attention as an easy and useful graphics processing unit (GPU) programming environment. While OpenACC is one possible parallel programming environment, users have to spend time and energy to optimize OpenACC programs. In this study, we investigate the usability of ppOpen-AT for OpenACC programs and propose extensions of ppOpen-AT for further optimization of OpenACC.
49. Performance Evaluation of a Hierarchical Auto-tuning Function toward the Post-Moore Era.
50. Takahiro Katagiri, Masaharu Matsumoto, Satoshi Ohshima, Auto-tuning of Hybrid MPI/OpenMP Execution with Code Selection by ppOpen-AT, 2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 10.1109/IPDPSW.2016.49, 1488-1495, 2016.05, In this paper, we propose an effective kernel implementation for an application of the finite difference method (FDM) by merging computations of central-difference and explicit time expansion schemes without IF statements inside the loops. The effectiveness of the implementation depends on the CPU architecture and execution situation, such as the problem size and the number of MPI processes and OpenMP threads. We adopt auto-tuning (AT) technology to select the best implementation. The AT function for the selection, referred to as "code selection", is implemented in an AT language, namely, ppOpen-AT. The results of experiments conducted using current advanced CPUs (Xeon Phi, Ivy Bridge, and FX10) indicated that crucial speedups over conventional AT are achieved by code selection. In particular, the heaviest kernels achieved speedups of 4.21x (Xeon Phi), 2.52x (Ivy Bridge), and 2.03x (FX10).
51. Satoshi Ohshima, Takahiro Katagiri, Masaharu Matsumoto, Utilization and Expansion of ppOpen-AT for OpenACC, 2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 10.1109/IPDPSW.2016.123, 1496-1505, 2016.05, For application programmers, reducing the effort of optimizing programs is an important issue. Our solution to this issue is an auto-tuning (AT) technique. We are developing an AT language named ppOpen-AT. We have shown that this language is useful for multi- and many-core parallel programming. Today, OpenACC attracts attention as an easy and useful graphics processing unit (GPU) programming environment. While OpenACC is one possible parallel programming environment, users have to spend time and energy to optimize OpenACC programs. In this study, we investigate the usability of ppOpen-AT for OpenACC programs and propose extensions of ppOpen-AT for further optimization of OpenACC.
52. Performance evaluation of sparse matrix computations using FPGAs.
53. Optimization of matrix assembly process in FEM applications on manycore architectures.
54. Takahiro Katagiri, Satoshi Ohshima, Masaharu Matsumoto, Directive-Based Auto-Tuning for the Finite Difference Method on the Xeon Phi, 29th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2015 Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2015, 10.1109/IPDPSW.2015.11, 1221-1230, 2015.09, In this paper, we present a directive-based auto-tuning (AT) framework, called ppOpen-AT, and demonstrate its effect using simulation code based on the Finite Difference Method (FDM). The framework utilizes well-known loop transformation techniques. However, the codes used are carefully designed to minimize the software stack in order to meet the requirements of a many-core architecture currently in operation. The results of evaluations conducted using ppOpen-AT indicate that maximum speedup factors greater than 550% are obtained on eight nodes of the Intel Xeon Phi. Further, the AT for data packing and unpacking achieves a 49% speedup factor for the whole application. With strong scaling on 32 nodes of a Xeon Phi cluster, we also obtain 24% speedups for the overall execution.
55. Parallel FEM application using ppOpen‐APPL/FVM.
56. Evaluation of an auto-tuning method realized through static code generation with ppOpen-AT.
57. Auto-tuning of OpenACC programs with ppOpen-AT.
58. SCG-AT: An auto-tuning realization method using only static code generation.
59. Auto-tuning of OpenACC programs using ppOpen-AT.
60. Towards Auto-tuning for the Era of 200+ Thread Parallelism on One Node: An Adaptation of an FDM Code.
61. An Auto-Tuning Methodology in the Era of Over 200 Threads per Node: Adaptation of Code Optimization to an FDM Program.
62. Takahiro Katagiri, Satoshi Ohshima, Masaharu Matsumoto, Directive-based Auto-tuning for the Finite Difference Method on the Xeon Phi, 2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, 10.1109/IPDPSW.2015.11, 1221-1230, 2015.05, In this paper, we present a directive-based auto-tuning (AT) framework, called ppOpen-AT, and demonstrate its effect using simulation code based on the Finite Difference Method (FDM). The framework utilizes well-known loop transformation techniques. However, the codes used are carefully designed to minimize the software stack in order to meet the requirements of a many-core architecture currently in operation. The results of evaluations conducted using ppOpen-AT indicate that maximum speedup factors greater than 550% are obtained on eight nodes of the Intel Xeon Phi. Further, the AT for data packing and unpacking achieves a 49% speedup factor for the whole application. With strong scaling on 32 nodes of a Xeon Phi cluster, we also obtain 24% speedups for the overall execution.
63. Auto-tuning methods for the era of over 200 threads per node: focusing on FDM code optimization.
64. Spoken term detection in speech using multiple search results for out-of-vocabulary spoken queries.
65. Performance evaluation of SpMV implementations using a dynamic parallel execution mechanism.
66. Takahiro Katagiri, Satoshi Ohshima, Masaharu Matsumoto, Auto-tuning of computation kernels from an FDM code with ppOpen-AT, Proceedings - 2014 IEEE 8th International Symposium on Embedded Multicore/Manycore SoCs, MCSoC 2014, 10.1109/MCSoC.2014.22, 91-98, 2014.11, In this paper, we propose an auto-tuning (AT) function with an AT language for a dedicated numerical library with respect to supercomputers in operation. The AT function is based on well-known loop transformation techniques, such as loop splitting, loop fusion, and re-ordering of statements. However, loop splitting with copies or increased computation, and loop fusion applied to the split loops, are also taken into account by utilizing user knowledge.
67. Satoshi Ohshima, Takahiro Katagiri, Masaharu Matsumoto, Performance optimization of SpMV using CRS format by considering OpenMP scheduling on CPUs and MIC, Proceedings - 2014 IEEE 8th International Symposium on Embedded Multicore/Manycore SoCs, MCSoC 2014, 10.1109/MCSoC.2014.43, 253-260, 2014.11, In this study, we evaluate the performance of sparse matrix-vector multiplication (SpMV) using the compressed row storage (CRS) format on CPUs and MIC. We focus on the relationship between OpenMP scheduling and performance. The performance of SpMV is measured using various OpenMP scheduling settings and the results are analyzed, which shows that OpenMP scheduling has a considerable effect on the performance of SpMV. We confirm that some scheduling settings resulted in performance improvements over default scheduling for particular matrices. The results of the evaluation show that the performance of SpMV is improved by up to 1.57 times on SPARC64 IXfx, 2.47 times on Xeon Ivy Bridge-EP, and 2.26 times on Knights Corner. Next, we modify the SpMV function of OpenATLib, an auto-tuned numerical library, to include scheduling optimization as an additional SpMV implementation. We measure the performance of the GMRES solver and obtain performance improvements of up to 11.4%. These results will help to improve the performance of various numerical calculation applications.
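The kernel and tuning knob studied in this paper are compact enough to show directly. Below is a standard CRS-format SpMV in C with an OpenMP scheduling clause; `schedule(runtime)` defers the choice to the OMP_SCHEDULE environment variable, so static, dynamic, or guided scheduling with various chunk sizes can be compared without recompiling. This is a generic sketch, not the OpenATLib implementation.

```c
/* y = A*x with A in compressed row storage (CRS):
 * row_ptr[i]..row_ptr[i+1]-1 index the nonzeros of row i. */
void spmv_crs(int nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```

Scheduling matters because row lengths in sparse matrices are irregular: a static partition of rows can leave some threads with far more nonzeros to process than others.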
68. Optimization of matrix assembling process in FEM applications on multicore/manycore architectures
The finite element method (FEM) is one of the most well-known numerical methods for solving partial differential equations (PDE) and is applied to various kinds of scientific simulations. Matrix assembly and the sparse matrix solver are the most expensive processes in finite element procedures. In the present work, the matrix assembly process is parallelized using OpenMP, and three types of implementations are evaluated on various types of multicore/manycore architectures. Results and analyses of the computations, as well as strategies towards automatic tuning, will be described in the presentation.
69. Auto-tuning methods for the era of over 200 threads per node: an FDM code as an example.
70. Optimization of OpenMP directives in sparse matrix solvers using auto-tuning.
71. Performance evaluation of the ICCG method using OpenMP/OpenACC on various computing environments.
72. Auto-tuning for a Code from the Finite Difference Method with ppOpen-AT on the Xeon Phi.
73. Masaharu Matsumoto, Futoshi Mori, Satoshi Ohshima, Hideyuki Jitsumoto, Takahiro Katagiri, Kengo Nakajima, Implementation and evaluation of an AMR framework for FDM applications, Procedia Computer Science, 10.1016/j.procs.2014.05.084, 29, 936-946, 2014.06, In order to execute various finite-difference method applications on large-scale parallel computers with a reasonable cost of computer resources, a framework using an adaptive mesh refinement (AMR) technique has been developed. AMR can realize high-resolution simulations while saving computer resources by generating and removing hierarchical grids dynamically. In the AMR framework, a dynamic domain decomposition (DDD) technique, as a dynamic load balancing method, is also implemented to correct the computational load imbalance between each process associated with parallelization. By performing a 3D AMR test simulation, it is confirmed that dynamic load balancing can be achieved and execution time can be reduced by introducing the DDD technique.
74. Application and evaluation of the communication-avoiding CAQR algorithm for the orthogonalization process of RSDFT.
75. Evaluation of applications on heterogeneous environments for the feasibility study on future HPCI systems based on advanced and highly efficient latency cores, focusing on many-core environments.
76. Takahiro Katagiri, Satoshi Ohshima, Masaharu Matsumoto, Auto-tuning of computation kernels from an FDM code with ppOpen-AT, 2014 8th IEEE International Symposium on Embedded Multicore/Manycore SoCs, MCSoC 2014 Proceedings - 2014 IEEE 8th International Symposium on Embedded Multicore/Manycore SoCs, MCSoC 2014, 10.1109/MCSoC.2014.22, 91-98, 2014.01, In this paper, we propose an auto-tuning (AT) function with an AT language for a dedicated numerical library with respect to supercomputers in operation. The AT function is based on well-known loop transformation techniques, such as loop splitting, loop fusion, and re-ordering of statements. However, loop splitting with copies or increased computation, and loop fusion applied to the split loops, are also taken into account by utilizing user knowledge.
77. Masaharu Matsumoto, Futoshi Mori, Satoshi Ohshima, Hideyuki Jitsumoto, Takahiro Katagiri, Kengo Nakajima, Implementation and evaluation of an AMR framework for FDM applications, 14th Annual International Conference on Computational Science, ICCS 2014 Procedia Computer Science, 10.1016/j.procs.2014.05.084, 29, 936-946, 2014.01, In order to execute various finite-difference method applications on large-scale parallel computers with a reasonable cost of computer resources, a framework using an adaptive mesh refinement (AMR) technique has been developed. AMR can realize high-resolution simulations while saving computer resources by generating and removing hierarchical grids dynamically. In the AMR framework, a dynamic domain decomposition (DDD) technique, as a dynamic load balancing method, is also implemented to correct the computational load imbalance between each process associated with parallelization. By performing a 3D AMR test simulation, it is confirmed that dynamic load balancing can be achieved and execution time can be reduced by introducing the DDD technique.
78. Performance Evaluation of Computer Systems Consisting of Various Architectures with Scientific Applications.
79. Takahiro Katagiri, Cheng Luo, Reiji Suda, Shoichi Hirasawa, Satoshi Ohshima, Energy optimization for scientific programs using auto-tuning language ppOpen-AT, Proceedings - IEEE 7th International Symposium on Embedded Multicore/Manycore System-on-Chip, MCSoC 2013, 10.1109/MCSoC.2013.14, 123-128, 2013.11, In this paper, we demonstrate a new approach for power-consumption optimization using a dedicated Auto-tuning (AT) language. Our approach is based on recently developed technologies: (1) a power measurement application programming interface, (2) an AT mathematical core library. Preliminary performance evaluation enables us to select the best kernel for a real-world scientific program using either the CPU or Graphics Processing Unit, with respect to energy consumption. From the results of the evaluation, we found the performance-changing point in the experimental environment.
80. Takahiro Katagiri, Satoshi Ito, Satoshi Ohshima, Early experiences for adaptation of auto-tuning by ppOpen-AT to an explicit method, Proceedings - IEEE 7th International Symposium on Embedded Multicore/Manycore System-on-Chip, MCSoC 2013, 10.1109/MCSoC.2013.15, 153-158, 2013.09, We present a code optimization technique by adapting an auto-tuning (AT) function to an explicit method with the static code generator FIBER. The AT function is evaluated with current multicore processors to match situations with high-thread parallelism (HTP). The results of performance evaluations indicate that the AT function is crucial for HTP, as the speedups of the explicit method with a static code generator are as much as 7.4x compared to that of original implementations based on compiler optimization only.
81. Performance evaluation of auto-tuning functions in ppOpen-HPC codes automatically generated by ppOpen-AT.
82. SpMV optimization and auto-tuning for many-core architectures.
83. Performance evaluation of SpMV on the Xeon Phi.
84. On new functions of the auto-tuning description language ppOpen-AT for explicit method kernels.
85. A New Function of an Auto-tuning Description Language ppOpen-AT for Kernels of Explicit Method.
86. Takao Sakurai, Takahiro Katagiri, Hisayasu Kuroda, Ken Naono, Mitsuyoshi Igai, Satoshi Ohshima, A Sparse Matrix Library with Automatic Selection of Iterative Solvers and Preconditioners, 2013 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE, 10.1016/j.procs.2013.05.300, 18, 1332-1341, 2013.06, Many iterative solvers and preconditioners have recently been proposed for linear iterative matrix libraries. Currently, library users have to manually select the solvers and preconditioners to solve their target matrix. However, if they select the wrong combination of the two, they have to spend a lot of time on calculations or they cannot obtain the solution. Therefore, an approach for the automatic selection of solvers and preconditioners is needed. We have developed a function that automatically selects an effective solver/preconditioner combination by referencing the history of relative residuals at run-time to predict whether the solver will converge or stagnate. Numerical evaluation with 50 Florida matrices showed that the proposed function can select effective combinations in all matrices. This suggests that our function can play a significant role in sparse iterative matrix computations.
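A minimal sketch of the run-time mechanism this library is built on, predicting convergence or stagnation from the relative-residual history, might look as follows. The window size and threshold are invented for illustration; the paper's actual criterion is not reproduced here.

```c
/* Return 1 if the relative residual has improved by less than a factor
 * of `min_drop` over the last `window` iterations, suggesting that the
 * current solver/preconditioner pair should be abandoned. */
int is_stagnating(const double *relres, int niter, int window, double min_drop)
{
    if (niter < window) return 0;        /* not enough history yet */
    double first = relres[niter - window];
    double last  = relres[niter - 1];
    return last > first * min_drop;      /* e.g. min_drop = 0.5 */
}
```

A selection loop would then try the next solver/preconditioner combination whenever `is_stagnating` fires, instead of letting a doomed combination run to the iteration limit.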
87. Performance evaluation of the many-core processor Xeon Phi.
88. Application optimization and performance evaluation on heterogeneous computing environments for the feasibility study on future HPCI systems based on advanced and highly efficient latency cores.
89. Applications and performance evaluation for the feasibility study on future HPCI systems based on advanced and highly efficient latency cores.
90. Research activity for Ultra Low Power HPC.
91. Takahiro Katagiri, Takao Sakurai, Mitsuyoshi Igai, Satoshi Ohshima, Hisayasu Kuroda, Ken Naono, Kengo Nakajima, Control formats for unsymmetric and symmetric sparse matrix-vector multiplications on OpenMP implementations, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10.1007/978-3-642-38718-0_24, 7851, 236-248, 2013.01, In this paper, we propose "control formats" to obtain better thread performance of sparse matrix-vector multiplication (SpMV) for unsymmetric and symmetric matrices. By using the control formats, we established the following maximum speedups of SpMV in 16-thread execution on one node of the T2K Open Supercomputer: (1) 7.14x for an unsymmetric matrix by using the proposed Branchless Segmented Scan compared to the original Segmented Scan method; (2) 12.7x for a symmetric matrix by using the proposed Zero-element Computation-free method compared to a simple SpMV implementation.
92. Satoshi Ohshima, Masae Hayashi, Takahiro Katagiri, Kengo Nakajima, Implementation and Evaluation of 3D Finite Element Method Application for CUDA, HIGH PERFORMANCE COMPUTING FOR COMPUTATIONAL SCIENCE - VECPAR 2012, 10.1007/978-3-642-38718-0_16, 7851, 140-148, 2013.01, This paper describes a fast implementation of an FEM application on a GPU. We implemented our own FEM application and succeeded in obtaining a performance improvement in two of our application components: Matrix Assembly and Sparse Matrix Solver. Moreover, we found that accelerating our Boundary Condition Setting component on the GPU and omitting CPU-GPU data transfer between Matrix Assembly and Sparse Matrix Solver further reduces execution time slightly. As a result, the execution time of the entire FEM application was shortened from 44.65 sec using only a CPU (Nehalem architecture, 4 cores, OpenMP) to 17.52 sec using a CPU with a GPU (Tesla C2050).
93. An Improvement in Preconditioned Algorithm of BiCGStab Method.
94. Performance Evaluation of Oakleaf‐FX (Fujitsu PRIMEHPC FX10) Supercomputer System.
95. Satoshi Ito, Satoshi Ohshima, Takahiro Katagiri, SSG-AT: An auto-tuning method of sparse matrix-vector multiplication for semi-structured grids - an adaptation to OpenFOAM, Proceedings - IEEE 6th International Symposium on Embedded Multicore SoCs, MCSoC 2012, 10.1109/MCSoC.2012.26, 191-197, 2012.09, We are developing ppOpen-AT, an auto-tuning (AT) infrastructure for ppOpen-HPC. ppOpen-HPC is numerical middleware for the post-petascale era. In this study, we propose a new AT facility for semi-structured grids in OpenFOAM. We focus on sparse matrix-vector multiplication and matrix storage formats. Using the features of the input data and mesh connectivity, we propose a hybrid storage format that is suitable for semi-structured grids. We evaluate the proposed AT facility on the T2K supercomputer and an Intel Xeon cluster. For a typical computational fluid dynamics scenario, we obtain speedup factors of 1.3 on the T2K and 1.84 on the Xeon cluster. These results indicate that the proposed AT method has the potential to select the optimal data format according to features of the input sparse matrix.
96. Fault Convergence: A New Concept of Safety for Numerical Computation Software.
97. Optimization of sparse matrix-vector multiplication on GPUs.
98. Xabclib: A sparse matrix library with automatic selection of solvers and preconditioners.
99. On new functions of the auto-tuning infrastructure ppOpen-AT for post-petascale environments.
100. Ken Naono, Takao Sakurai, Takahiro Katagiri, Satoshi Ohshima, Shoji Itoh, Kengo Nakajima, Mitsuyoshi Igai, Hisayasu Kuroda, A Fully Run-time Auto-tuned Sparse Iterative Solver with OpenATLib, 2012 4TH INTERNATIONAL CONFERENCE ON INTELLIGENT AND ADVANCED SYSTEMS (ICIAS), VOLS 1-2, 10.1109/ICIAS.2012.6306176, 143-148, 2012.06, We propose a general application programming interface called OpenATLib for auto-tuning (AT). OpenATLib is carefully designed to establish the reusability of AT functions for sparse iterative solvers. Using the APIs of OpenATLib, we develop a fully auto-tuned sparse iterative solver called Xabclib. Xabclib has several novel runtime AT functions. We also develop a numerical computation policy that can optimize memory space and computational accuracy. Using the above functions and policies, we obtain the following important findings: (1) average memory space is reduced to 1/45 under lower-memory policies, and (2) fault convergence, in which a conventional solver judges the iteration to have converged although it has not actually converged in the sense of the before-preconditioned matrix, is avoided under higher-accuracy policies. The results imply that policy-based runtime AT plays a significant role in sparse iterative matrix computations.
101. An Improvement in Preconditioned Algorithm of BiCGStab Method
An improved preconditioned BiCGStab (PBiCGStab) algorithm is proposed. A rational preconditioned algorithm of CGS can be constructed by applying the derivation procedure of CGS to the preconditioned BiCG. In order to extend this approach to BiCGStab, the minimum-residual (MR) part of BiCGStab must be considered from a logical standpoint, and we show that the approach can be applied. The proposed algorithm is mathematically more rational than the conventional PBiCGStab, and numerical results show its advantages.
102. Study of OpenFOAM tuning in ppOpen-AT.
103. Development of Auto-tuning Infrastructure ppOpen-AT for ppOpen-HPC.
104. Fault Convergence: A New Concept of Safety for Numerical Computation Software
In this paper, we propose the concept of "Fault Convergence", which can occur in the numerical iterative methods widely used in numerical software. We discuss the difference from the concept of "false convergence" used in the numerical computation field. Using the fault-error-failure model, the three kinds of threats to dependable computing defined by Laprie, we discuss an example of a fault convergence situation in the convergence behavior of numerical iterative methods.
105. Performance Evaluation of HITACHI SR16000 model M1 Supercomputer System.
106. A proposal of an OpenMP compiler that generates CUDA code for multiple GPUs.
107. Performance Evaluation of HITACHI SR16000 model M1 Supercomputer System
We report the performance of the HITACHI SR16000 model M1 supercomputer system (named Yayoi), which started operation in October 2011 at the Information Technology Center, The University of Tokyo. This is a recent supercomputer system with Power7 CPUs on its computation nodes. We executed several benchmarks on the system and revealed its performance characteristics and important runtime environment-variable settings.
108. An improvement in preconditioned algorithm of BiCGStab method.
109. Parallel ILU Preconditioner Based on Extended Hierarchical Interface Decomposition for Heterogeneous environments.
110. Implementation of Matrix Assembly in 3D Finite Element Method for CUDA.
111. Evaluation of Auto‐Tuning Function on OpenATLib.
112. Parallel ILU Preconditioner Based on Extended Hierarchical Interface Decomposition for Heterogeneous environments
An extended version of Hierarchical Interface Decomposition (HID) is developed as an effective parallelization method for the Finite Element Method (FEM) on multi-/many-core environments because of the high locality of its distributed mesh data. The thicker separators introduced in Extended HID allow higher-level fill-ins to be taken into account in parallel ILU preconditioners. We implemented Extended HID in an OpenMP-parallel FEM simulation of a linear elasticity problem with heterogeneous material properties. The developed code has been tested on the T2K Open Supercomputer (T2K/Tokyo) using one node with 16 cores to evaluate the convergence and the parallel performance of the ILU preconditioning with higher-level fill-ins, based on a comparison with multi-color (MC) ordering.
113. Implementation of Matrix Assembly in 3D Finite Element Method for CUDA
We describe the implementation of the matrix assembly process of the 3D Finite Element Method (FEM) using CUDA. Because GPUs have high computational and memory transfer performance, they are now used in several scientific applications, including FEM. In particular, many studies aim at speeding up the sparse matrix solver because it accounts for the largest share of FEM execution time. In this paper, we focus on the matrix assembly process, which has the second-largest share, and report implementations and performance using one GPU, two GPUs, and two GPUs together with the CPU.
114. Evaluation of Auto-Tuning Function on OpenATLib
Matrix computation libraries used in scientific computing take many parameters as user inputs, including problem-dependent parameters whose values are difficult to set, so an approach that sets them automatically is needed. We have therefore been developing the auto-tuning interface OpenATLib. In this paper, we describe one of the functions OpenATLib provides: runtime auto-tuning of the restart period, i.e., the size of the projection matrix in Krylov subspace methods. This function automatically selects the best restart period using the history of residual values at runtime. Evaluation with three solvers on the T2K Open Supercomputer shows a performance difference of up to 38.5x compared with fixed values, confirming the effectiveness of the function.
115. 前処理付きBiCGStab法の問題点に対する改良.
116. HxABCLibScript: An Extension of an Auto-tuning Language for Heterogeneous Computing Environment
In this paper, we propose HxABCLibScript, a dedicated language for describing auto-tuning on heterogeneous computing environments that include CPUs and GPUs (Graphics Processing Units), applicable to arbitrary parts of a program. Performance evaluation results show that the code automatically generated from HxABCLibScript descriptions is optimized by selecting the best computing resource between the CPU and GPU according to the problem size or the number of iterations in the program.
117. Implementation and Evaluation of 3D Finite Element Method for CUDA.
118. Implementation and Evaluation of 3D Finite Element Method for CUDA
In this paper, we describe the implementation and evaluation of the Finite Element Method (FEM) for 3D linear elastostatics on a GPU (CUDA). Because GPUs have high computational and memory transfer performance, they are used for various scientific applications, including FEM. We report implementation and performance evaluation results, focusing on the sparse matrix solver using the preconditioned Conjugate Gradient (CG) method and on coefficient matrix generation.
119. Optimized Implementation of Segmented Scan Method for CUDA.
120. Extending auto-tuning (AT) to GPUs.
121. Research on Intermediate Language for GPU Computing
In this presentation, we discuss an intermediate language suitable for GPU computing. GPUs as data-parallel processors have very high peak performance for general-purpose computation and attract attention as accelerators in HPC (High Performance Computing). Accelerators usually have parallel execution models such as SIMD and SPMD, independent memory, and high-speed on-chip scratchpad memory. Intermediate languages used in CPU compilers cannot fully describe these features, so users adopt a different programming environment for each accelerator and tune source code toward peak performance by hand. We evaluate the execution performance of a native compiling environment based on Java bytecode and discuss an intermediate language suitable for describing accelerator features.
122. Optimized Implementation of Segmented Scan Method for CUDA
We discuss an optimized CUDA implementation of sparse matrix-vector multiplication using the Segmented Scan method. We previously proposed the auto-tuning interface OpenATLib, which focuses on the reusability of implementations, and, as an important feature of its sparse matrix-vector multiplication, the Branchless Segmented Scan method, an improvement of Segmented Scan for scalar computers. In this paper, we design and implement a new Segmented Scan method for CUDA based on these two methods. After algorithmic improvements and various optimizations for fast execution on the GPU, we achieved up to 3.26 GFLOPS on an NVIDIA GeForce GTX 285 for strongly skewed matrices.
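For readers unfamiliar with the method, SpMV by segmented scan can be shown in a short sequential form: form all nonzero products, scan them with the scan restarting at each row head, and read off the last element of each segment. The sketch below is this sequential skeleton only; the point of the paper is performing step 2 as a data-parallel scan on the GPU, which is not shown here.

```c
#include <stdlib.h>

/* y = A*x in CRS format via the segmented-scan formulation. */
void spmv_segscan(int nrows, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
{
    int nnz = row_ptr[nrows];
    double *prod = malloc(nnz * sizeof *prod);

    /* 1) elementwise products: fully data-parallel */
    for (int k = 0; k < nnz; k++)
        prod[k] = val[k] * x[col_idx[k]];

    /* 2) inclusive scan restarted at each row head (the segmented scan);
     *    a GPU runs this step in O(log nnz) parallel sweeps */
    for (int i = 0; i < nrows; i++)
        for (int k = row_ptr[i] + 1; k < row_ptr[i + 1]; k++)
            prod[k] += prod[k - 1];

    /* 3) the last element of each segment is that row's result */
    for (int i = 0; i < nrows; i++)
        y[i] = (row_ptr[i + 1] > row_ptr[i]) ? prod[row_ptr[i + 1] - 1] : 0.0;

    free(prod);
}
```

The attraction of this formulation for skewed matrices is that work is distributed over nonzeros rather than rows, so a few very long rows cannot serialize the computation.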
123. A Software Cache Implementation for GPU.
124. A Software Cache Implementation for GPU
GPUs are attracting attention in high-performance computing. With NVIDIA GPUs, programming techniques that make effective use of the fast shared memory in CUDA achieve very high peak performance in various applications, while ease of programming and generality remain problems. In this research, we propose a software cache mechanism that uses part of the explicitly managed shared memory of CUDA as a cache for device memory. With this mechanism, data is transferred implicitly from device memory to shared memory, and speedups of general-purpose computation benchmark programs are achieved.
125. A Software Cache Implementation for GPU
GPUs are attracting attention in high-performance computing. With NVIDIA GPUs, programming techniques that make effective use of the fast shared memory in CUDA achieve very high peak performance in various applications, while ease of programming and generality remain problems. In this research, we propose a software cache mechanism that uses part of the explicitly managed shared memory of CUDA as a cache for device memory. With this mechanism, data is transferred implicitly from device memory to shared memory, and speedups of general-purpose computation benchmark programs are achieved.
126. OMPCUDA: An OpenMP compiler for GPUs.
127. OMPCUDA: An implementation of OpenMP for GPUs.
128. OMPCUDA: Implementation of OpenMP for GPU
General-purpose computation using GPUs (GPGPU) has been a focus of attention because of its performance, but the difficulty of GPGPU programming is a problem. We have therefore proposed GPGPU programming styles that use existing parallel programming styles. In this paper, we implemented OMPCUDA, an OpenMP implementation for CUDA-capable GPUs, to explore the possibilities of GPGPU using OpenMP. We then evaluated our implementation using test programs. As a result, we confirmed that OMPCUDA makes GPGPU parallel programming easy and can obtain speedups.
129. Message Passing GPGPU Programming
As GPU performance increases, general-purpose computation on GPUs (GPGPU) is attracting more and more attention. GPGPU is expected to overtake CPU performance thanks to its parallel processing nature, but GPGPU programming is not easy because of its special programming style. In this paper, we propose GPGPU programming styles that build on existing parallel programming styles. We take up several existing parallel programming styles, such as message passing, and examine how they can be applied to GPGPU programming.
130. Message Passing GPGPU Programming
As GPU performance increases, general-purpose computation on GPUs (GPGPU) is attracting more and more attention. GPGPU is expected to overtake CPU performance thanks to its parallel processing nature, but GPGPU programming is not easy because of its special programming style. In this paper, we propose GPGPU programming styles that build on existing parallel programming styles. We take up several existing parallel programming styles, such as message passing, and examine how they can be applied to GPGPU programming.
131. GPGPU Programming Using Existing Parallelizing Method.
132. Proposal of GPGPU Programming Using Existing Parallelizing Method
GPGPU, which utilizes GPU performance for general-purpose computation, is attracting much attention. GPGPU is expected to deliver higher performance than CPUs. However, GPGPU programming is not easy because it requires methods peculiar to GPGPU. In this paper, we propose using existing parallelization methods as one new way to make GPGPU programming easier. We also consider writing programs that run on GPUs with OpenMP and MPI, based on the new CUDA GPGPU programming language.
133. A Performance Comparison of Software-DSM Mocha and MPI Using Parallel Benchmarks
A software distributed shared memory (S-DSM) system is friendlier and easier to program than a message passing interface. In this paper, we compared the performance of Mocha, one of the S-DSM systems, and MPICH, a widely used parallel programming library with a message passing interface. Four applications (MM, SOR, IS, LU) were used as parallel benchmarks. To measure S-DSM system overhead, the execution time of interrupt handlers was measured. The results show that the following are needed to achieve high performance in the S-DSM system compared with MPI: 1. introduction of prefetching of shared data before a page fault; 2. improvement of lock-acquisition performance; 3. improvement of barrier synchronization performance.
134. A Performance Comparison of Software-DSM Mocha and MPI Using Parallel Benchmarks
A software distributed shared memory (S-DSM) system is friendlier and easier to program than a message passing interface. In this paper, we compared the performance of Mocha, one of the S-DSM systems, and MPICH, a widely used parallel programming library with a message passing interface. Four applications (MM, SOR, IS, LU) were used as parallel benchmarks. To measure S-DSM system overhead, the execution time of interrupt handlers was measured. The results show that the following are needed to achieve high performance in the S-DSM system compared with MPI: 1. introduction of prefetching of shared data before a page fault; 2. improvement of lock-acquisition performance; 3. improvement of barrier synchronization performance.
135. A basic matrix computation library using CPUs and GPUs.
136. Satoshi Ohshima, Kenji Kise, Takahiro Katagiri, Toshitsugu Yuba, Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment, HIGH PERFORMANCE COMPUTING FOR COMPUTATIONAL SCIENCE - VECPAR 2006, 4395, 305-318, 2007.01, GPUs for numerical computations are becoming an attractive alternative in research. In this paper, we propose a new parallel processing environment for matrix multiplications by using both CPUs and GPUs. The execution time of matrix multiplications can be decreased to 40.1% by our method, compared with using the fastest of either the CPU-only case or the GPU-only case. Our method performs well when matrix sizes are large.
137. Proposal and Implementation of Parallel GEMM Routine Using CPU and GPU
GPUs for numerical computations are becoming an attractive research topic. We previously proposed a new GPU computation method that utilizes parallel processing on both the CPU and GPU. In this paper, we apply this method to an existing numerical computation library. We examine a performance tuning method and run performance experiments using a benchmark program. We also apply the method to the GEMM routine of BLAS and execute the HPL benchmark. As a result, performance is improved by up to 1.45 times by our method, compared with a CPU-only environment using a Pentium 4 3.0 GHz. A precision issue remains, depending on the GPU's arithmetic precision, but we show the potential of our method to be applied to various applications.
138. A Performance Comparison of Parallel Applications between Software-DSM and MPI
A software distributed shared memory (S-DSM) system realizes virtual shared memory on distributed memory environments such as PC clusters without special hardware support. An S-DSM system is friendlier and easier to program than a message passing interface. In this paper, we compare the performance of Mocha, one of the S-DSM systems, and MPICH, a popular parallel programming library with a message passing interface. Three applications (MM, SOR, LU) are used as our benchmarks. The results show that an MPI application whose communication can be tuned outperforms its S-DSM counterpart, whereas an MPI application that cannot be tuned performs about the same as the S-DSM version.
139. Ainori Communication: A Method to Reduce the Page Transfer of S-DSM Systems
We discuss methods to speed up software distributed shared memory (S-DSM) systems. By predicting pages that will need to be transferred in the future and prefetching them, we can speed up S-DSM systems. In this paper, we propose a method that transfers the predicted pages together with a message already used by the S-DSM system. We call this method Ainori communication. We evaluate our implementation using four S-DSM benchmarks and show that it decreases the number of communications and improves performance.
140. A Performance Comparison of Parallel Applications between Software‐DSM and MPI.
141. Ainori Communication: A Method to Reduce the Page Transfer of S‐DSM Systems
We discuss methods to speed up software distributed shared memory (S-DSM) systems. By predicting pages that will need to be transferred in the future and prefetching them, we can speed up S-DSM systems. In this paper, we propose a method that transfers the predicted pages together with a message already used by the S-DSM system. We call this method Ainori communication. We evaluate our implementation using four S-DSM benchmarks and show that it decreases the number of communications and improves performance.
142. Proposal and Implementation of Parallel GEMM Routine Using CPU and GPU.
143. A study of parallel numerical computation environments using multiple CPUs and GPUs.
144. Proposal of Matrix Multiply and Add Method by Parallel Processing Using CPU and GPU
Research on using GPUs for numerical calculation is becoming active. In this paper, we not only solve a numerical problem using the GPU but also propose a method that divides the problem and computes it using both the CPU and GPU. We measured the execution time of matrix multiply-and-add: thanks to parallel processing, performance was improved by 38.1% over the CPU-only case. Moreover, we examine a method for finding the best problem distribution using the FLOPS of the CPU and GPU obtained in a preliminary experiment. As a result, predicted values were close to experimental values.
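The distribution rule described above reduces to a one-line proportionality: give each processor a share of the work matching its measured speed. A hypothetical sketch, assuming row-wise splitting of the output matrix:

```c
/* Split n_rows between CPU and GPU in proportion to the FLOPS measured
 * in a preliminary experiment, so both sides finish at about the same
 * time. Illustrative only; not the paper's actual code. */
int rows_for_gpu(int n_rows, double cpu_flops, double gpu_flops)
{
    double frac = gpu_flops / (cpu_flops + gpu_flops);
    return (int)(n_rows * frac + 0.5);   /* round to the nearest row */
}
```

For example, with a CPU measured at 6 GFLOPS and a GPU at 18 GFLOPS, the GPU receives 75% of the rows and the predicted finish times coincide, which matches the prediction-versus-experiment comparison the paper describes.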
145. Performance evaluation of BLAS operations on GPUs.
146. Implementation of fast matrix multiplication on GPUs.
147. Parallel execution of a processor simulator using OpenMP by exploiting instruction-level parallelism.