Q&A

When we run MASNUM_WAVE on Sunway via the exp1_run.csh script, the job is killed without any stated reason. What should we do if even the official example doesn't work on Sunway?

Every team has its own queue; run `qload -w` to see yours. Each team can request at most 16 nodes per job. Then change your submit command from `bsub -I -n 1 -q q_sw_expr -share_size 6000 -host_stack 1024 -b -m 1 -o out.qrunout ./masnum.wam.mpi` to `bsub -I -n 1 -q q_sw_asc_3 -share_size 6000 -host_stack 1024 -b -m 1 -o out.qrunout ./masnum.wam.mpi`.

 

We ran the program with the following makefile and bsub command line:

bsub -n 2 -q q_sw_expr -share_size 6000 -host_stack 1024 -b -m 1 -o out.qrunout ./masnum.wam.mpi

We also validated with the newest validation pack, which uses the Intel compiler, but the result was the same, with the error at point (138, 62).

Your job did not complete. Please check whether a file named pac_ncep_wav_20090228.nc was generated.

q_sw_expr is a test queue; jobs in it are killed after running for an hour. Please run the command:

qload -w

to check the queue you can use, and resubmit your job.

 

We found that the output module in masnum_wave takes a lot of time, and that the output frequency can be set in the script 'masnum_wave/exp/exp1/exp1_run.csh'.

set outflag = 3 # output wave variables into file multi-records,
# 1 : one file every year,
# 2 : one file every month,
# 3 : one file every day,
# else : one file every run.
set wiofreq = 24 # --- The output frequence for wave results (hour).
set ciofreq = 24 # --- The output frequence for current coef.s (hour).
set rstfreq = 24 # --- The output frequence for model restart (hour).

However, we are not sure whether those parameters are allowed to be modified in this competition, although the result we got could pass the validation. Could you please tell us about that?

We provide a small-scale example at /usr/sw-mpp/apps/src/masnum_wave_mini for convenient debugging. It runs for about an hour.

The parameters may not be modified.

 

While running masnum_wave on Sunway, how many cores per node will be available to the contestants?

A total of 64 cores (4 cores per node) will be available to the contestants.

 

I tried to compile MASNUM with the sw5f90 compiler instead of the default mpif90, but got the error “not recognizing mpi_* statements”. Could you tell me how to solve this problem?

'sw5f90' is a serial compiler, so it can't find mpif.h. We recommend using 'swafort'.

 

Could we use OpenMP on the Sunway platform?

OpenMP is supported on the platform. We suggest using the swafort compiler; refer to the manual for more information.

 

Where could I get the training PPT?

You can download the training PPT from the KNL platform: /home/training.

 

While running masnum_wave on Sunway, how many cores per node will be available to the contestants?

Each node has 4 core groups, and each group has 1 MPE core and 64 CPE cores, for a total of 260 cores.
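The per-node total quoted above follows from 4 groups of (1 MPE + 64 CPE) cores. As a quick check of that arithmetic, here is a minimal shell sketch; the variable names are ours and purely illustrative.

```shell
# Figures taken from the answer above; variable names are hypothetical.
core_groups=4        # groups per node
mpe_per_group=1      # MPE cores in each group
cpe_per_group=64     # CPE cores in each group

# Total cores per node = groups * (MPE + CPE)
cores_per_node=$(( core_groups * (mpe_per_group + cpe_per_group) ))
echo "$cores_per_node"   # prints 260
```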

 

How do we download the MASNUM_WAVE program from the Sunway platform?

cd /usr/sw-mpp/apps/src/

cp -r masnum_wave/ ~/online1/ #copy program to online1 folder

Please refer to the manual.

 

While working in the online1 directory, we encountered the error “Disk quota exceeded”. How do we solve this problem?

The home directory quota is 100M and the online1 quota is 500G. Please check your disk usage.
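One way to check usage against these quotas is with the standard `du` utility. The sketch below is only an illustration: the helper names `usage_kb` and `over_limit` are hypothetical, and reading "100M" as 100 * 1024 KB is an assumption about how the platform counts its quota.

```shell
# Hypothetical helper: print the disk usage of a directory in kilobytes.
usage_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# Hypothetical helper: succeed if the directory uses more than limit_kb kilobytes.
# Usage: over_limit <directory> <limit_kb>
over_limit() {
    [ "$(usage_kb "$1")" -gt "$2" ]
}
```

For example, `over_limit "$HOME" 102400 && echo "home is above 100M"` would flag a home directory that has grown past the assumed limit, in which case the data should be moved to online1.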

 

I have a question about PaddlePaddle!

If there’s a problem installing PaddlePaddle or about PaddlePaddle itself, submit your question at https://github.com/PaddlePaddle/Paddle to get replies from the Baidu team. If it’s related to the traffic prediction application, ask [email protected]. The lecturer at the training program will not answer questions sent to his email address.

 

How do we set the parameter ietime in the exp2 example?

Please check the parameter 'set ietime = 20090701' in /masnum_wave/exp/exp2/exp2_run.csh.

 

While working on the masnum problem, we get no data such as running time once the program finishes. How can we measure our performance?

When you submit the job with the command "bsub", you can add the following options:

    main core: -sw3runarg "-P master"

    main-standby core: -sw3runarg "-p -f"

When the job completes, it generates files such as gmon.out.0.6401. Then run the command gprof masnum.wam.mpi gmon.out.0.6401 to print the profiling information.

 

It's required that the proposal should include the performance estimation of MASNUM app. Can you please clarify some details?

You should describe your optimization strategies, such as performance characteristic analysis, hotspot analysis, scalability, and algorithms.

 

In masnum wave, what is the difference between the "mini" and the "original" versions?

They share the same source code. The mini version only reduces the problem size for easier testing. We have added a new folder netcdf_0816 and modified the validation folder under /usr/sw-mpp/apps/src/masnum_wave_mini, so you should download it again before testing.

 

On the platform, the user’s queue is batch, right?

The platform only has one queue: batch.

 

Where can we get the log file in MASNUM_WAVE?

Please use the log printed to the screen.

 

We failed to compile masnum with sw5f90, as we did not find “mpif.h” while linking.

Please try to use mpif90 instead of sw5f90.

 

In masnum, how do we calculate the time spent by masnum_wave?

You can add timing code yourself. See the manual page for further details.
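As a coarse alternative that leaves the source untouched, the launch command can be timed from the shell. This is a minimal sketch, not the method described in the manual: the helper name `time_cmd` is hypothetical, and the figure is only meaningful when the wrapped command blocks until the job finishes (for instance an interactive submission).

```shell
# Hypothetical helper: run a command and report its wall-clock time in seconds.
time_cmd() {
    start=$(date +%s)          # seconds since the epoch before the run
    "$@"                       # run the wrapped command with its arguments
    status=$?                  # remember the command's exit status
    end=$(date +%s)            # seconds since the epoch after the run
    echo "elapsed_seconds=$(( end - start ))"
    return "$status"
}
```

For example, `time_cmd sleep 2` prints an `elapsed_seconds=` line once the command returns. The resolution is one second, so this suits long runs rather than fine-grained hotspot timing.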

 

How do we monitor job status and resource use on the KNL platform?

ClusterEngine is deployed on the platform. You can access it at https://202.196.76.107:9999/module/login/login.html with your username and password.

 

How many resources can an account request on the KNL platform?

An account is allowed to submit 4 jobs.

 

How do we set the compiler and MPI on the KNL platform?

You can reset your environment variables with the following commands:
source /opt/intel/impi/2017.1.132/bin64/mpivars.sh
source /opt/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64

 

Will tasks be killed once they have been running for more than 1 hour on the Sunway platform? Is the running time for each job limited?

Jobs in the test queue are limited to 1 hour, but each team has its own queue. Use the command 'qload -w' to view your own queue. Tasks on the Sunway system may run for at most 12 hours, and each team may request at most 16 nodes.

 

The exp2 didn’t include 20090228_standard.nc, but you asked us to do validation for this data on Sunway platform.

This is a typo in the document; the code shall prevail.

 

masnum_wave takes too long to produce results on the Sunway platform. Could you provide a small example?

A small example for user testing has been added to the /usr/sw-mpp/apps/src/masnum_wave_mini directory. It produces results after about one hour of running. Please download it for your own use, and remember to keep backups.

 

How do we run the updated masnum_wave_mini program on the Sunway platform?

With the new masnum_wave_mini update, once you have compiled the validation program you can run ./compare_exp1 directly instead of submitting it through bsub.

 

May I install VTune performance analysis tools on Sunway platform?

We haven’t used VTune on this platform. You can try it.

 

Should we write the RMSE test program ourselves, or is it provided by PaddlePaddle? About the training log files, which chair do we need to prompt ours?

For now you need to submit the output results to us. If you want to validate them yourself, you need to write the RMSE program yourself.
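For teams writing their own check, RMSE is the square root of the mean of the squared differences between predicted and reference values. The sketch below shows that formula with awk. It assumes a hypothetical input file with one "predicted reference" pair per line, which is not the contest's data format, and the helper name `rmse` is ours. The organizers' scoring may treat details such as missing values differently.

```shell
# Hypothetical helper: print the RMSE of a two-column file
# (column 1 = predicted value, column 2 = reference value).
rmse() {
    awk '{ d = $1 - $2; sum += d * d; n++ }
         END {
             if (n == 0) exit 1               # no rows: nothing to average
             printf "%.4f\n", sqrt(sum / n)   # root of the mean squared error
         }' "$1"
}
```

Calling `rmse pairs.txt` prints a single number with four decimal places, or returns a non-zero status when the file is empty.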

 

Can I get more time for KNL computing job?

The good news is that Zhengzhou University has provided more KNL servers to the KNL remote platform, so yes, the time limit for a KNL computing job will be extended to 1 hour.

 

Regarding KNL, are we allowed to use Intel's optimized HPL "Intel Optimized MP LINPACK Benchmark for Clusters"?

Yes.

 

Which mode does the KNL platform use?

Flat/Quadrant.

 

Can we change the hardware settings of KNL CPU? 

No.

 

About KNL, is the “HPCG version” mentioned in the question about HPL test incorrect?

It should be “HPL version” instead.

 

How can I submit my jobs on Sunway Platform?

Every team has its own queue; run `qload -w` to see yours. Each team can request at most 16 nodes per job. Then change your submit command from `bsub -I -n 1 -q q_sw_expr -share_size 6000 -host_stack 1024 -b -m 1 -o out.qrunout ./masnum.wam.mpi` to `bsub -I -n 1 -q q_sw_asc_3 -share_size 6000 -host_stack 1024 -b -m 1 -o out.qrunout ./masnum.wam.mpi`.
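The queue substitution described above can be sketched as a small helper that builds the submit line for a given queue. The function name `build_submit_cmd` is hypothetical. The bsub options are copied unchanged from the command above, and the helper only prints the command rather than running it.

```shell
# Hypothetical helper: print the submit command for the given queue name.
# Only the -q value changes; every other option is copied from the FAQ answer.
build_submit_cmd() {
    queue=$1
    echo "bsub -I -n 1 -q $queue -share_size 6000 -host_stack 1024 -b -m 1 -o out.qrunout ./masnum.wam.mpi"
}
```

For example, `build_submit_cmd q_sw_asc_3` prints the second command shown above. Replace the argument with whichever queue `qload -w` reports for your team.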

 

About Sunway, how should I deal with the error indicating that “Disk quota exceeded”?

Work in the "online" directory instead of the home directory.

 

Which environment is suggested to access Sunway?

It is recommended to access Sunway from a macOS or Windows environment.

 

Is accuracy of output data the only scoring criterion for the PaddlePaddle application, regardless of processing speed?

Yes

 

Is PaddlePaddle the only deep learning framework that is allowed? (tensorflow, MXnet, caffe are not allowed?)

Yes

 

About PaddlePaddle, is our main task to optimize the benchmark demo so that it runs faster and gives more accurate results?

Accuracy matters more than speed.

 

In the speeds.csv file for PaddlePaddle task 4, what does data 0 mean?

0 means no data; the data is missing.

 

In the "Deep learning contest: traffic prediction" problem, the URL of getting data "https://github.com/PaddlePaddle/Paddle/blob/develop/demo/traffic_prediction/data/get_ddat.sh" is not available.

Sorry, it’s a mistake. The correct address is https://github.com/PaddlePaddle/Paddle/blob/develop/demo/traffic_prediction/data/get_data.sh

 

How can I submit jobs on KNL remote platform?

There is a correction in the KNL remote platform guide:
PBS script introduction
Brief description:
### Define the job name as myjob; it can be changed as required.
#PBS -N myjob
### Specify both the number of nodes and the number of CPUs that are needed for the workload. This parameter can be changed later. The following example requests 1 node.
#PBS -l nodes=1
### Set the environment variables needed by the computation job
source /opt/intel/impi/2017.1.132/bin64/mpivars.sh
source /opt/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64

### Compute the number of CPUs used for the job
NP=`cat $PBS_NODEFILE | wc -l`
### The command to run the job; cpi-mpich is the name of the compiled executable file and needs to be changed to your own.
/opt/intel/compilers_and_libraries_2017.1.132/linux/mpi/intel64/bin/mpirun -np $NP -machinefile $PBS_O_WORKDIR/hosts cpi-mpich
Note: lines starting with #PBS are directives, not comments. All comment lines start with ###.
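The NP line in the script counts the entries in the node file, which PBS fills with one line per allocated processor slot. A minimal sketch of that counting step, run here against a sample file instead of the real $PBS_NODEFILE, looks like this; the helper name `count_slots` and the sample file are hypothetical.

```shell
# Hypothetical helper: count the lines (processor slots) in a node file.
count_slots() {
    # Redirecting the file into wc prints only the number; tr strips any padding.
    wc -l < "$1" | tr -d ' '
}

# Sample node file standing in for $PBS_NODEFILE: four slots on node KNL001.
sample_nodefile=$(mktemp)
printf 'KNL001\nKNL001\nKNL001\nKNL001\n' > "$sample_nodefile"

NP=$(count_slots "$sample_nodefile")
echo "$NP"   # prints 4
```

Inside a real job the same helper would be called as `count_slots "$PBS_NODEFILE"`, which gives the value passed to `mpirun -np`.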

 

We still don’t know how to log in to the Sunway/KNL remote platform.

Every team was emailed its own login information. For example, for the KNL platform:
KNL IP: 202.196.76.107:7777
KNL Compiling Node: KNL001
KNL Account: asc0026
KNL Password: x2o1YBB4

You can ssh [email protected] (the port is 7777) and enter your password x2o1YBB4. You are then on the login node. Please do not compile or run anything else on the login node; instead, go to your compiling node KNL001 with: ssh KNL001. Again, you must do your work on your compiling node, not the login node.
For example, for Sunway:

Sunway IP: 42.0.0.70
Sunway VPN Account: ascvpn001
Sunway VPN Password: [email protected]
Sunway Account: ascusr026
Sunway Password: 669328ae

You must download the VPN software following the Sunway quick start guide and log in to the VPN using your VPN account ascvpn001 and password [email protected]. Then you can ssh [email protected] and enter your password 669328ae.

 

How many computing nodes can we use on Sunway/KNL remote platform?

Every team can submit 4-node jobs on Sunway and single-node jobs on the KNL platform. We will add more KNL servers to the KNL platform in early February.

 

MASNUM sounds familiar. Will any team have any advantage in the competition?

MASNUM is an ultra-high resolution global surface wave simulation. We adopt this Gordon Bell Prize finalist as a competition application in the ASC Student Supercomputer Challenge 2017. The First Institute of Oceanography (FIO) provides its original code not only to offer students access to the world’s top technologies but also to challenge young talent worldwide to find ingenious approaches. To this end, FIO will by no means share the optimized code with anyone who participates in ASC17 before the Final. Some optimizations are published in this paper: www.computer.org/csdl/proceedings/sc/2016/8815/00/8815a046.pdf. However, we strongly encourage creative thinking and anticipate further breakthroughs on top of this already successful innovation.

 

How can I get more information about the resource utilization of the KNL remote platform?

You can access https://202.196.76.107:9999/module/login/login.html with your account and password to monitor cluster resource usage by Inspur ClusterEngine 4.0. 

 

How much is it to compete in the ASC Student Supercomputer Challenge?

ASC is free to get in. Plus, we provide supercomputing systems so you don't have to worry about finding a sponsor. You'll also have access to HPC classes, training materials, and a tour in a supercomputing center -- all complimentary. If your team makes the Final, we can cover accommodations including dining, local transport, and housing. We will provide supporting paperwork if you need to look for funding for travel.

 

 

Contact Us
Technical Support Yu Liu, Weiwei Wang [email protected]
Media Jie He [email protected]
Collaboration Yan Yan [email protected]
General Information Dongzhi Kong [email protected]

 

Copyright 2017 Asia Supercomputer Community. All Rights Reserved