Investigation of the Effectiveness of Programs Optimization Methods for Parallel Computing Systems with GPU
https://doi.org/10.21822/2073-6185-2023-50-4-59-74
Abstract
Objective. The paper addresses the problem of increasing software efficiency, understood here as reducing the running time of software that solves computationally complex problems. Method. The singular value decomposition (SVD) computed by the Jacobi method serves as an example of such a problem. SVD is applied in fields ranging from signal and image processing to artificial intelligence systems. Parallel computing systems equipped with GPUs are chosen as the target architecture. The paper discusses methods for improving software efficiency on this architecture using CUDA. Result. Existing analytical models for evaluating program performance are described. The influence of several optimizations on the efficiency of the resulting software is considered, including optimization of data transfers, use of unified memory, the number of threads, and memory access patterns, among others. The process of optimizing the SVD implementation is described, and the results of computational experiments are presented. Conclusion. As the number of threads increases, performance may grow by a larger factor than the thread count. When the memory access pattern is optimal, performance improves noticeably. Adjusting the split of on-chip memory between L1 cache and shared memory does not have a significant impact on performance.
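For readers unfamiliar with the Jacobi method for SVD mentioned in the abstract, the following is a minimal sequential sketch in Python. It assumes the one-sided (Hestenes) variant, which the abstract does not specify, and it is not the authors' CUDA implementation. The function name `jacobi_singular_values` and its parameters `tol` and `max_sweeps` are hypothetical and chosen for illustration only.

```python
import math


def jacobi_singular_values(a, tol=1e-12, max_sweeps=50):
    """Singular values of `a` (a list of rows), in descending order.

    Illustrative one-sided Jacobi sketch: pairs of columns are rotated
    until all columns are mutually orthogonal; the column norms are then
    the singular values.
    """
    m, n = len(a), len(a[0])
    u = [row[:] for row in a]  # work on a copy, leave the input untouched
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Squared norms of columns p and q, and their dot product.
                alpha = sum(u[i][p] * u[i][p] for i in range(m))
                beta = sum(u[i][q] * u[i][q] for i in range(m))
                gamma = sum(u[i][p] * u[i][q] for i in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns are already orthogonal enough
                rotated = True
                # Rotation angle that zeroes the dot product of the pair.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (
                    abs(zeta) + math.sqrt(1.0 + zeta * zeta)
                )
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    up, uq = u[i][p], u[i][q]
                    u[i][p] = c * up - s * uq
                    u[i][q] = s * up + c * uq
        if not rotated:
            break  # a full sweep without rotations means convergence
    norms = (
        math.sqrt(sum(u[i][j] ** 2 for i in range(m))) for j in range(n)
    )
    return sorted(norms, reverse=True)


if __name__ == "__main__":
    print(jacobi_singular_values([[1.0, 2.0], [3.0, 4.0]]))
```

In this formulation, rotations applied to disjoint column pairs within a sweep do not depend on one another, which is one reason the method is commonly considered for parallel architectures such as GPUs.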
About the Authors
A. Yu. Bezruchenko
Russian Federation
Aleksei Yu. Bezruchenko - Postgraduate Student.
28 Lenin Ave., Volgograd 400005
V. A. Egunov
Russian Federation
Vitaly A. Egunov - Cand. Sci. (Eng.), Assoc. Prof., Computers and Systems Department.
28 Lenin Ave., Volgograd 400005
For citations:
Bezruchenko A.Yu., Egunov V.A. Investigation of the Effectiveness of Programs Optimization Methods for Parallel Computing Systems with GPU. Herald of Dagestan State Technical University. Technical Sciences. 2023;50(4):59-74. (In Russ.) https://doi.org/10.21822/2073-6185-2023-50-4-59-74