g++ - OpenMP 5 offloading C++ struct with member functions and data pointers - Stack Overflow



I have used OpenMP in a number of my past projects. Separately, I have also written CUDA and OpenCL GPU code for a number of open source projects.

I've heard that OpenMP 5 added support for offloading to NVIDIA/AMD GPUs, and I started experimenting with porting my CUDA/OpenCL code to OpenMP 5, but got stuck when offloading the code to an NVIDIA GPU. I think I might have missed some basics, so I am seeking some pointers here.

I decided to use C++ for this project. My full source code (~400 lines of C++) can be accessed at https://github.com/fangq/umcx

To compile/test the code, one should run

git clone https://github.com/fangq/umcx.git
cd umcx/src
make           # build multi-threading without gpu offloading
./umcx cube60  # running the benchmark

The above multi-threaded version builds and runs properly with g++ 11 or newer. The code also works on an NVIDIA GPU when built with nvc using

make clean
make nvc
./umcx cube60

However, I ran into many problems when trying to build it with g++-12/13 for nvptx-none. For example, if I run

make clean
make nvidia CXX=g++-12

[Update Dec 31, 2024] The missing sinf math function error was fixed by adding -foffload="-lm", as suggested by Mat Colgrove in the comment below.

Now g++-12 is able to produce a valid binary; however, running it produces the memory errors below:

$ make nvidia CXX=g++-12
g++-12  -g -Wall -Wextra -std=c++14 -O3  -foffload=nvptx-none -foffload="-lm" -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -c umcx.cpp -o umcx.o
g++-12  -g -Wall -Wextra -std=c++14 -O3  -foffload=nvptx-none -foffload="-lm" -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -o umcx umcx.o
/usr/bin/ld: /tmp/cc2w2usX.crtoffloadtable.o: warning: relocation against `__offload_funcs_end' in read-only section `.rodata'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
$ ./umcx cube60
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed

Running compute-sanitizer ./umcx cube60 reports many memory read errors at line 185 of the code:

========= Invalid __global__ read of size 4 bytes
=========     at main$_omp_fn$1+0x14f0 in umcx.cpp:185
=========     by thread (0,4,0) in block (25,0,0)
=========     Address 0x7f7f651712d0 is out of bounds
=========     and is 522,050 bytes after the nearest allocation at 0x7f7f64c00000 of size 5,184,399 bytes

Interestingly, g++-12 was also able to build an amdgcn-offloaded binary, with the same linker warnings as above, but it fails at runtime:

$ make amd CXX=g++-12
g++-12  -g -Wall -Wextra -std=c++14 -O3  -foffload=amdgcn-amdhsa="-march=gfx906" -foffload="-lm" -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -c umcx.cpp -o umcx.o
g++-12  -g -Wall -Wextra -std=c++14 -O3  -foffload=amdgcn-amdhsa="-march=gfx906" -foffload="-lm" -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -o umcx umcx.o
/usr/bin/ld: /tmp/ccep7zpg.crtoffloadtable.o: warning: relocation against `__offload_funcs_end' in read-only section `.rodata'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
$ ./umcx cube60
libgomp: Offload data incompatible with GCN plugin (expected 3, received 2)
libgomp: Cannot map target functions or variables (expected 1, have 4294967295)

I am not able to get g++-13 to work at all.

If I build with make nvidia CXX=g++-13, I get a different error:

g++-13  -g -Wall -Wextra -std=c++14 -O3  -foffload=nvptx-none -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -c umcx.cpp -o umcx.o
g++-13  -g -Wall -Wextra -std=c++14 -O3  -foffload=nvptx-none -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -o umcx umcx.o
x86_64-linux-gnu-accel-nvptx-none-gcc-13: fatal error: cannot read spec file ‘libgomp.spec’: No such file or directory
compilation terminated.
nvptx mkoffload: fatal error: x86_64-linux-gnu-accel-nvptx-none-gcc-13 returned 1 exit status

On a different machine with the same Xubuntu 22.04 distro, the same g++-12/13 installed from the PPA, and CUDA 12.6, both compilers gave me a Value 'sm_30' is not defined for option 'gpu-name' error.

g++-13  -g -Wall -Wextra -std=c++14 -O3  -foffload=nvptx-none -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -c umcx.cpp -o umcx.o
g++-13  -g -Wall -Wextra -std=c++14 -O3  -foffload=nvptx-none -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -o umcx umcx.o
ptxas fatal   : Value 'sm_30' is not defined for option 'gpu-name'
nvptx-as: ptxas returned 255 exit status
nvptx mkoffload: fatal error: x86_64-linux-gnu-accel-nvptx-none-gcc-13 returned 1 exit status

So my questions are:

  1. What changes do I need to make so that g++ builds my code for nvptx?
  2. What is the supported mechanism in g++-12/13 to deep-copy a dynamic array member of a struct/class to the GPU? The OpenMP 5.1 examples document (pages 181/183) shows samples using declare mapper()/map(mapper(id)) for this, but gcc does not yet support mapper.
  3. nvc++ accepts the following directives for mapping a dynamic array inside a nested class
map(alloc: inputvol.vol)  map(to: inputvol.vol[0:inputvol.dimxyzt]) map(alloc: outputvol.vol) map(from: outputvol.vol[0:outputvol.dimxyzt]) \

and produces correct results on the GPU. Is this supported by gcc?

asked Dec 29, 2024 at 4:08 by FangQ; edited Dec 31, 2024 at 23:01
  • 2 Please provide a stackoverflow.com/help/minimal-reproducible-example. Other things to try: use an up-to-date compiler, especially openmp support has significantly improved in the last 3 years. c++filt helps you to make sense of the c++ symbol names. You might be missing some function prototypes where the offloading cannot apply implicit type conversion. – Joachim Commented Dec 29, 2024 at 5:28
  • The latest gcc I could install on ubuntu 22.04 via ppa is gcc-13, which I tried, now the error is ptxas fatal : Value 'sm_30' is not defined for option 'gpu-name'. I added -foffload=-misa=sm_35 as suggested in an online post, but the error remains. My cuda version is 11.5. – FangQ Commented Dec 29, 2024 at 16:58
  • Found several threads on this error, but no clear workaround, ref 1, 2, 3, 4. – FangQ Commented Dec 29, 2024 at 16:58
  • @Joachim, I updated my post and included the github repo github.com/fangq/umcx for my source code (~400 lines) and the commands to build/reproduce the compiler issues. let me know if you are able to see these errors and offer some suggestions how to get around it. – FangQ Commented Dec 30, 2024 at 4:29
  • 1 For the "sinf" error, I think you just need to add "-foffload-options=-lm" to the link so the device versions of libm are included. See: gcc.gnu.org/wiki/Offloading. – Mat Colgrove Commented Dec 31, 2024 at 17:33

2 Answers


Compiling the code with clang++-18 succeeds, using either mappers or the two manual map clauses:

$ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -DGPU_OFFLOAD umcx.cpp --cuda-gpu-arch=sm_90
$ CUDA_VISIBLE_DEVICES=0 ./a.out cube60
simulation completed, speed 25326 photon/ms, duration 394.844000 ms, absorption fraction 17.688389%

To successfully compile with clang, I had to add more constructors for dim4:

    dim4(uint32_t x, uint32_t y, uint32_t z, uint32_t w) : x(x), y(y), z(z), w(w) {}
    dim4(uint32_t x, uint32_t y, uint32_t z) : x(x), y(y), z(z) {}
    dim4(const dim4& o) = default;
    dim4() = default;

Clang still warns that the MCX_volume struct is not trivially copyable:

umcx.cpp:447:28: warning: Type 'MCX_inputvol' (aka 'MCX_volume<int>') is not trivially copyable and not guaranteed to be mapped correctly [-Wopenmp-mapping]
  447 |     map(mapper(input), to: inputvol)  \
      |                            ^~~~~~~~

This warning might provide a hint about what goes wrong when compiling with gcc: probably the compiler cannot create a valid instance of the object on the device. It might be worth filing a bug against gcc.

I want to thank the commenters for their helpful suggestions.

I believe I now have answers to most of the questions I had previously, and want to write a short summary here.

To map data stored in embedded pointers inside a struct/class such as

struct Dataset {
    int len = 0;
    float *data = nullptr;
    Dataset() {}
    Dataset(int len0, float *data0) {...}
} readonlydata, readwritedata;

the following OpenMP map clauses work on most compilers (gcc 11 to 14, clang 16+):

map(to: readonlydata) map(to: readonlydata.data[0:readonlydata.len]) \
map(tofrom: readwritedata) map(tofrom: readwritedata.data[0:readwritedata.len])

The data pointer must be mapped separately in order to pass the pointed-to data to the device.

This was mostly inspired by the "Deep-Copy" OpenACC article shared by Mat Colgrove:

https://developer.download.nvidia.com/assets/pgi-legacy-support/Deep-Copy-Support-in-OpenACC_PGI.pdf

It appears that OpenMP also supports using a runtime variable as the array-section length.
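Putting the pieces above together, here is a minimal self-contained sketch of this pattern (the struct and function names here are hypothetical, not taken from umcx); on a compiler or machine without device support the loop simply falls back to the host:

```cpp
#include <cassert>
#include <vector>

// Hypothetical struct carrying a dynamic array behind a raw pointer.
struct Dataset {
    int len = 0;
    float* data = nullptr;
};

// Doubles every element of ds.data, offloading when a device is available.
// The struct itself is mapped first; the pointed-to range is mapped as a
// separate array section, which lets the runtime attach the device copy of
// `data` to the device copy of the struct (the "deep copy").
void scale_on_device(Dataset& ds) {
    #pragma omp target teams distribute parallel for \
        map(to: ds) map(tofrom: ds.data[0:ds.len])
    for (int i = 0; i < ds.len; i++) {
        ds.data[i] *= 2.0f;
    }
}
```

Built with g++ -fopenmp -foffload=nvptx-none (or plain -fopenmp for host execution), the explicit ds.data[0:ds.len] section is what copies the pointed-to buffer; mapping only ds would transfer just the host pointer value.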

Based on the OpenMP 5.1 examples, another way to map such nested dynamic data is declare mapper(), which applies not to an individual variable but to the struct type (typedef):

typedef struct Dataset dataset;
#pragma omp declare mapper(dataset ds) map(ds, ds.data[0:ds.len])

Unfortunately, it appears that the declare mapper() directive is currently not supported by either gcc or nvc.
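For reference, here is a sketch of how such a type-level mapper would be used on a compiler that implements it (e.g. clang 16+); the struct and kernel are hypothetical illustrations, and the directives are guarded by the _OPENMP version macro so that compilers without OpenMP 5.0 mapper support still build a plain host loop:

```cpp
#include <cassert>

// Hypothetical struct with a dynamic payload, as in the examples above.
struct Dataset {
    int len = 0;
    float* data = nullptr;
};

// declare mapper attaches the deep-copy rule to the *type*: any map of a
// Dataset also maps the array section ds.data[0:ds.len]. Guarded because
// mappers are OpenMP 5.0 (_OPENMP >= 201811) and gcc/nvc reject them.
#if defined(_OPENMP) && _OPENMP >= 201811
#pragma omp declare mapper(Dataset ds) map(ds, ds.data[0:ds.len])
#endif

void scale_with_mapper(Dataset& ds) {
#if defined(_OPENMP) && _OPENMP >= 201811
    // No explicit array-section clause needed here: the default mapper
    // for Dataset performs the deep copy implicitly.
    #pragma omp target teams distribute parallel for map(tofrom: ds)
#endif
    for (int i = 0; i < ds.len; i++) {
        ds.data[i] *= 2.0f;
    }
}
```

The appeal of the mapper is that every target construct touching a Dataset inherits the deep-copy rule, instead of repeating the array-section clauses at each construct.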

Now, regarding gcc, clang, and nvc: the completeness and robustness of their OpenMP GPU offloading support are quite uneven, and all are buggy to some degree.

Among these three compilers, nvc is the most robust and also delivers the highest GPU speed after offloading; however, it is only supported on Linux. gcc/clang can build on Mac/Windows, but both produced slow, unoptimized binaries. gcc-12 is relatively the most stable, but its binary is also quite slow. gcc-11 can build my code but does not run properly on some GPUs; gcc-13/14 can both build it, but the result won't run. I have found a number of gcc regressions related to those error messages.

Some commonly seen gcc error messages when building for nvptx with gcc-11 to 13:

  • If g++ reports a linking error such as undefined symbol sinf (or cosf, sincosf), it can be resolved by adding -foffload="-lm" to the link command.
  • If g++ (on Ubuntu 22.04, but not 24.04) reports the linking error x86_64-linux-gnu-accel-nvptx-none-gcc-13: fatal error: cannot read spec file ‘libgomp.spec’: No such file or directory, this is a known gcc packaging bug; one workaround is to symlink the /usr/lib/gcc/x86_64-linux-gnu/12/accel folder into the gcc/13 directory.
  • If g++ reports the error Value 'sm_30' is not defined for option 'gpu-name', this is caused by the CUDA version on your system; you need to update to 12.x. Adding -foffload="-march=sm_35" does not help.
  • gcc-13 and gcc-14 have a known regression in handling nested dynamic arrays inside a struct (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113867); it produces the errors libgomp: cuCtxSynchronize error: an illegal memory access was encountered, libgomp: cuMemFree_v2 error: an illegal memory access was encountered, and libgomp: device finalization failed.
  • gcc-12 can produce a valid nvptx binary, but the binary is really slow. There are two problems with gcc-12-produced nvptx binaries:
    • In gcc-12/13, the total number of launched blocks ignores the user's num_teams() or OMP_NUM_TEAMS setting and is capped at a small number, severely underutilizing the GPU hardware; I suspect this is related to this fix (maybe a regression?): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109875
    • The produced kernel is unoptimized and uses too many registers; in my case, gcc uses 150+ registers, compared to around 60 for the nvc-compiled binary.
  • So far, I haven't been able to get gcc amdgcn offloading to work: gcc-12/13 can compile the binary, but running it always gives the errors libgomp: Offload data incompatible with GCN plugin (expected 3, received 2) and libgomp: Cannot map target functions or variables (expected 1, have 4294967295).

As of now (January 2025), gcc's GPU offloading is still quite buggy and unoptimized; nvc is the quicker way to get the code to build and run.
