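Optimization of parallel method of moments based on KNL many-core processors

GU Zongjing; ZHAO Xunwang; LIU Yingyu; LIN Zhongchao; ZHANG Yu; ZHAO Yuping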

doi:10.3969/j.issn.1000-5641.2019.01.012

1. School of Electronic Engineering, Xidian University, Xi'an 710071, China
2. Intel (China) Limited, Beijing 100013, China

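Funding:

National Key Research and Development Program of China (2017YFB0202102)

National Key Research and Development Program of China (2016YFE0121600)

China Postdoctoral Science Foundation (2017M613068)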

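About the author:
GU Zongjing, male, PhD candidate; research interest: electromagnetic fields and microwave technology. E-mail: xdguzongjing@163.com

Corresponding author:
ZHANG Yu, male, professor and doctoral supervisor; research interest: electromagnetic fields and microwave technology. E-mail: yuzhang@mail.xidan.edu.cn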

CLC number: TN820
Publication history
- Received: 2017-09-29
- Published: 2019-01-25

Abstract: The parallel method of moments (MoM) is optimized on the second-generation Intel Xeon Phi many-core processor, codenamed Knights Landing (KNL), using an MPI+OpenMP hybrid programming strategy. OpenMP threading makes fuller use of KNL's computing resources and raises CPU (Central Processing Unit) utilization. Introducing threads substantially reduces the redundant integrals computed across processes during matrix filling. To exploit KNL's 512-bit vector width, vectorization further improves the execution efficiency of loop structures. For the matrix solution stage, which is computationally intensive and has high CPU utilization, the OpenMP strategy reduces MPI (Message Passing Interface) communication time and accelerates the solution. Numerical results show that these optimizations on the KNL many-core platform greatly improve the efficiency of the MoM for complex electromagnetic problems, with total speedups between 1.98 and 3.62 in the reported tests.

Keywords:
- many-core processor
- MPI+OpenMP
- parallel method of moments (MoM)
- vectorization

Fig. 1 Correspondence between triangle numbers and common-edge numbers

Fig. 2 2D block-cyclic distribution
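The 2D block-cyclic distribution named in Fig. 2 can be illustrated with a minimal sketch. It assumes the usual ScaLAPACK-style convention with zero-based indices and a square block size `nb`; the function names `owner` and `local_index`, and the 8-by-8 matrix on a 2-by-3 grid, are hypothetical and chosen only for illustration, not taken from the paper.

```python
def owner(i, j, nb, p_rows, p_cols):
    """Process-grid coordinates (pr, pc) that own global entry (i, j)."""
    return (i // nb) % p_rows, (j // nb) % p_cols


def local_index(g, nb, p):
    """Local index of global index g along one grid dimension with p processes."""
    return (g // (nb * p)) * nb + g % nb


if __name__ == "__main__":
    # Illustrative sizes only: 8x8 matrix, 2x2 blocks, 2x3 process grid.
    n, nb, p_rows, p_cols = 8, 2, 2, 3
    for i in range(n):
        # Each cell prints the owning process as "<row><col>".
        print(" ".join("%d%d" % owner(i, j, nb, p_rows, p_cols) for j in range(n)))
```

Blocks are dealt out to process rows and columns in round-robin order, so every process holds pieces from all over the matrix and the work of the later factorization stays balanced.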

Fig. 3 Pseudo-code of the vectorization optimization for matrix filling

Fig. 4 Pseudo-code of the MPI+OpenMP optimization for matrix filling
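The abstract attributes part of the gain to fewer redundant integrals between processes. The following toy model is one plausible way to see why, and it is not the authors' pseudo-code. It assumes that an integral over a pair of triangles contributes to every matrix entry whose row and column unknowns (common edges) belong to those triangles, and that each process owning at least one such entry evaluates the integral itself. The triangle strip, the block size `nb`, and the helper names are all hypothetical.

```python
from itertools import product


def owner(m, n, nb, p_rows, p_cols):
    """Process that owns matrix entry (m, n) under a 2D block-cyclic layout."""
    return (m // nb) % p_rows, (n // nb) % p_cols


def count_integrals(tri_edges, nb, p_rows, p_cols):
    """Count triangle-pair integrals when each process fills only its own entries.

    tri_edges[t] lists the unknown (common-edge) numbers attached to triangle t.
    A (field, source) triangle pair is counted once per distinct owning process.
    """
    total = 0
    for field, source in product(tri_edges, repeat=2):
        owners = {owner(m, n, nb, p_rows, p_cols) for m in field for n in source}
        total += len(owners)
    return total


def strip_mesh(num_triangles):
    """Toy open triangle strip: interior edge t is shared by triangles t and t+1."""
    last = num_triangles - 1
    return [[e for e in (t - 1, t) if 0 <= e < last] for t in range(num_triangles)]


if __name__ == "__main__":
    mesh = strip_mesh(10)
    serial = count_integrals(mesh, nb=1, p_rows=1, p_cols=1)
    for grid in [(1, 1), (1, 2), (2, 2)]:
        total = count_integrals(mesh, 1, *grid)
        print(grid, total, total - serial)
```

In this model a single process never repeats an integral, and the count grows as the same entries are spread over more processes. Replacing many MPI processes with a few processes that each run many OpenMP threads reduces the number of owners per triangle pair, which is consistent with the trend in the integral counts reported in Tab. 2.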

Fig. 5 Simulation model of aircraft Ⅰ and its 3D bistatic RCS

Fig. 6 2D bistatic RCS of aircraft Ⅰ

Fig. 7 Simulation model of aircraft Ⅱ and its 3D bistatic RCS

Fig. 8 2D bistatic RCS of aircraft Ⅱ

Tab. 1 Results of the vectorization optimization for matrix filling

	Process grid	Matrix filling/s	Matrix filling efficiency improvement/%
Before optimization	$8\times 8$	766.94	-
Vectorized	$8\times 8$	681.87	11.09

Tab. 2 Number of integrals evaluated during matrix filling

	Processes	Process grid	Threads	Total integrals	Redundant integrals	Redundancy ratio/%
Serial	1	1	1	1529435664	0	0
Before optimization	64	$8\times 8$	1	5129137924	3599702260	70.18
After optimization	2	$1\times 2$	128/32	2293527768	764092104	33.32
	4	$2\times 2$	64/16	3439353316	1909917652	55.53
	8	$2\times 4$	32/8	4069270002	2539834338	62.41
	16	$4\times 4$	16/4	4814555769	3285120105	68.23
Note: In the Threads column, $m/n$ means that matrix filling and matrix solving start $m$ and $n$ OpenMP threads per node, respectively.
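The redundancy columns of Tab. 2 follow from the totals: the redundant count is the total minus the serial count, and the ratio is the redundant count divided by the total. A short sketch that recomputes them is given here; the numbers are copied from Tab. 2, while the names `SERIAL`, `RUNS`, and `redundancy` are hypothetical.

```python
SERIAL = 1_529_435_664  # integrals in the serial run (Tab. 2)

# (process grid, total integrals), copied from Tab. 2
RUNS = [
    ("8x8", 5_129_137_924),  # before optimization, 1 thread per process
    ("1x2", 2_293_527_768),
    ("2x2", 3_439_353_316),
    ("2x4", 4_069_270_002),
    ("4x4", 4_814_555_769),
]


def redundancy(total, serial=SERIAL):
    """Return (redundant integrals, redundancy ratio in percent)."""
    redundant = total - serial
    return redundant, 100.0 * redundant / total


if __name__ == "__main__":
    for grid, total in RUNS:
        redundant, ratio = redundancy(total)
        print(f"{grid:>4}  {redundant:>13d}  {ratio:6.2f}%")
```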

Tab. 3 Comparison of running time

	Process grid	Threads	Matrix filling/s	Matrix solving/s	Total time/s	Total speedup
Before optimization	$8\times 8$	1	766.94	255.27	1022.21	-
After optimization	$1\times 2$	128/32	230.20	285.76	515.96	1.98
	$2\times 2$	64/16	131.91	258.49	390.40	2.62
	$2\times 4$	32/8	153.16	261.16	414.32	2.47
	$4\times 4$	16/4	177.24	259.23	436.47	2.34
Note: In the Threads column, $m/n$ means that matrix filling and matrix solving start $m$ and $n$ OpenMP threads per node, respectively.
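The total speedup in Tab. 3 is the total time before optimization divided by the total time after it, where the total is the sum of the filling and solving times. A minimal sketch that reproduces the column from the timings in Tab. 3 follows; the names `BASELINE`, `OPTIMIZED`, and `speedup` are hypothetical.

```python
BASELINE = (766.94, 255.27)  # (filling, solving) in seconds, 8x8 grid, 1 thread

# process grid -> (filling, solving) in seconds, copied from Tab. 3
OPTIMIZED = {
    "1x2": (230.20, 285.76),
    "2x2": (131.91, 258.49),
    "2x4": (153.16, 261.16),
    "4x4": (177.24, 259.23),
}


def speedup(fill, solve, baseline=BASELINE):
    """Total speedup relative to the run before optimization."""
    return sum(baseline) / (fill + solve)


if __name__ == "__main__":
    for grid, (fill, solve) in OPTIMIZED.items():
        print(f"{grid}: total {fill + solve:7.2f} s, speedup {speedup(fill, solve):.2f}")
```

Most of the gain comes from matrix filling; the solving time stays close to its value before optimization for all four grids.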

Tab. 4 Number of integrals evaluated during matrix filling

	Processes	Process grid	Threads	Total integrals	Redundant integrals	Redundancy ratio/%
Serial	1	1	1	22937708304	0	0
Before optimization	512	$16\times 32$	1	87453385158	64515676854	73.77
After optimization	8	$2\times 4$	256/64	61178300578	38240592274	62.51
	16	$4\times 4$	128/32	71538061156	48600352852	67.94
	32	$4\times 8$	64/16	76226472670	53288764355	69.91
	64	$8\times 8$	32/8	81222150025	58284441721	71.76
	128	$8\times 16$	16/4	84157313530	61219605226	72.74
Note: In the Threads column, $m/n$ means that matrix filling and matrix solving start $m$ and $n$ OpenMP threads per node, respectively.

Tab. 5 Comparison of running time

	Process grid	Threads	Matrix filling/s	Matrix solving/s	Total time/s	Total speedup
Before optimization	$16\times 32$	1	7977.94	3079.69	11057.63	-
After optimization	$2\times 4$	256/64	1124.01	3261.46	4385.46	2.52
	$4\times 4$	128/32	715.20	3231.76	3946.96	2.80
	$4\times 8$	64/16	601.36	2449.08	3050.44	3.62
	$8\times 8$	32/8	810.63	2947.81	3758.44	2.94
	$8\times 16$	16/4	1243.13	3081.21	4324.35	2.56
Note: In the Threads column, $m/n$ means that matrix filling and matrix solving start $m$ and $n$ OpenMP threads per node, respectively.
