try to use AOT for kernels by shssf · Pull Request #4 · IntelPython/dpnp

shssf · 2020-09-08T23:26:45Z

It looks like AOT is usable for development only.

AOT reduces the module loading time from 21 s to 0.5 s.
With the CPU-only option, it compiles successfully and runs on the CPU, but a GPU run fails with:

terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  Native API failed. Native API returns: -42 (CL_INVALID_BINARY) -42 (CL_INVALID_BINARY)

If both options are used:

Platform name: Intel(R) OpenCL
Device name: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
Driver version: 2020.11.8.0.27
OpenCL program was successfully created from SPIR-V file /tmp/backend_iface_fptr-ec9979.spv
Using build options:  -I "/tmp"
Compilation started
Compilation done
Linking started
Linking done
Device build started
Options used by backend compiler:
Device build done
Kernel <_ZTS25custom_blas_gemm_c_kernelIlE> was successfully vectorized (8)
Kernel <_ZTS25custom_blas_gemm_c_kernelIiE> was successfully vectorized (8)
Kernel <_ZTS24custom_blas_dot_c_kernelIlE> was successfully vectorized (8)
Kernel <_ZTS24custom_blas_dot_c_kernelIiE> was successfully vectorized (8)
Kernel <_ZTSN6oneapi3dpl20__par_backend_hetero24__parallel_sort_kernel_1IJ23custom_argsort_c_kernelIdlEEEE> was successfully vectorized (8)
Kernel <_ZTSN6oneapi3dpl20__par_backend_hetero24__parallel_sort_kernel_2IJ23custom_argsort_c_kernelIdlEEEE> was successfully vectorized (8)
...
Kernel <_ZTSN6oneapi3dpl20__par_backend_hetero26__parallel_reduce_kernel_1IJ19custom_sum_c_kernelIiEEEE> was successfully vectorized (8)
Kernel <_ZTSN6oneapi3dpl20__par_backend_hetero26__parallel_reduce_kernel_2IJ19custom_sum_c_kernelIiEEEE> was successfully vectorized (8)
Done.
OpenCL program binary file was successfully created: /tmp/backend_iface_fptr-5230b0.out
Error: Device name missing.
clang++: error: gen compiler command failed with exit code 226 (use -v to see invocation)
error: command 'clang++' failed with exit status 226

… __del__ against shutdown races

Closes audit items IntelPython#5 and IntelPython#29 from the prior solver review. Item IntelPython#4 (_matmat default uses a per-column matvec loop) is closed as wontfix: SciPy's scipy.sparse.linalg.LinearOperator and cupyx's analogue both ship the same hstack-of-matvecs default, so dpnp matches the reference exactly and there is no portable improvement to make without subclass-level _matmat overrides (which _CustomLinearOperator already exposes via its matmat= constructor argument).

scipy/sparse/linalg/_interface.py
- Set __array_ufunc__ = None on the LinearOperator base class. This is the SciPy contract: a host numpy.ndarray on the left of np_array * linop or np_array @ linop previously triggered NumPy's ufunc dispatch first, which would attempt to broadcast the operator element-wise before falling back to its reflected operator method -- producing either an opaque error or a wrong-typed result. With __array_ufunc__ = None NumPy returns NotImplemented from the ufunc protocol and Python's operator dispatch falls through cleanly to LinearOperator.__rmul__ / __rmatmul__. dpnp.ndarray itself sets __array_ufunc__ = None (see dpnp/dpnp_array.py:222) for the same reason, so the two dispatch systems now agree.

scipy/sparse/_csr.py, scipy/sparse/linalg/_iterative.py
- Harden __del__ in csr_matrix and in _CachedSpMV against the interpreter-shutdown race where the compiled _sparse_impl extension is garbage-collected before the matrix instance whose oneMKL handle it owns. Previous code used a single except Exception: pass which silenced two qualitatively different failure modes:
  1. shutdown race -- extension gone, si._sparse_gemv_release evaluates to None or AttributeError; the handle is unrecoverable and leaving the OS to reclaim it at process exit is the only sane option;
  2. genuine backend error while the interpreter is healthy -- a real bug we want to surface eventually, but raising from __del__ produces only an 'Exception ignored in:' warning and the handle is gone either way.
  The new code probes getattr(si, '_sparse_gemv_release', None) explicitly so case (1) takes the fast non-call path, and then splits the except into (AttributeError, TypeError) for case (1)-style residuals (queue / handle attribute access racing the shutdown) versus a final broad except for case (2). Both still return silently from __del__ -- raising is never valid here -- but the intent is now documented and a real backend regression is no longer indistinguishable from the GC race in code review.

tests/third_party/cupyx/scipy_tests/sparse_tests/test_linalg.py
- test_array_ufunc_opt_out: asserts the __array_ufunc__ = None marker is present on LinearOperator. Mirrors SciPy's own test suite test_interface.py::test_array_ufunc_opt_out.
- test_numpy_scalar_times_linop_dispatches_to_rmul: the concrete runtime consequence -- numpy.float64(2.0) * linop must produce a scaled LinearOperator, not raise or yield an array.
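The hardened `__del__` pattern described in that commit message can be sketched roughly as follows. This is a minimal illustration, not the actual dpnp code: `HandleOwner`, the `log` list, and the `SimpleNamespace` objects standing in for the compiled extension are all hypothetical, and only the attribute name `_sparse_gemv_release` is taken from the text above.

```python
import types


class HandleOwner:
    """Hypothetical stand-in for an object that owns a native handle."""

    def __init__(self, backend, handle, log):
        self._backend = backend  # stand-in for the compiled extension module
        self._handle = handle
        self._log = log          # records which path __del__ took (demo only)

    def __del__(self):
        backend = getattr(self, "_backend", None)
        # Probe explicitly: during interpreter shutdown the extension (or its
        # release function) may already be gone.
        release = getattr(backend, "_sparse_gemv_release", None)
        if release is None:
            # Case 1, fast path: nothing to call; the OS reclaims the handle.
            self._log.append("skipped")
            return
        try:
            release(self._handle)
            self._log.append("released")
        except (AttributeError, TypeError):
            # Case 1 residuals: attribute access racing the shutdown.
            self._log.append("shutdown-race")
        except Exception:
            # Case 2: genuine backend error. Still return silently, because
            # raising from __del__ only yields an "Exception ignored" warning.
            self._log.append("backend-error")


def _failing_release(handle):
    raise RuntimeError("simulated backend failure")


log = []
healthy = types.SimpleNamespace(_sparse_gemv_release=lambda handle: None)
gone = types.SimpleNamespace()  # release function missing, as after teardown
broken = types.SimpleNamespace(_sparse_gemv_release=_failing_release)

for backend in (healthy, gone, broken):
    owner = HandleOwner(backend, handle=object(), log=log)
    del owner  # on CPython, reference counting runs __del__ immediately

print(log)
```

The sketch relies on CPython's reference counting to run `__del__` as soon as `del owner` executes; on other interpreters the finalizer may run later. Keeping the two `except` clauses separate changes nothing at runtime, since both return silently, but it documents which failure mode each branch is meant to absorb.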

try to use AOT for kernels

9a1c57d

shssf closed this Sep 14, 2020

shssf deleted the aot_prelink_kernels branch September 18, 2020 14:44

antonwolfy referenced this pull request in antonwolfy/dpnp Sep 14, 2022

att #4

011347b

shssf wants to merge 1 commit into master from aot_prelink_kernels
