While running PyTorch Serve (TorchServe), the following error occurred:
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
- Full Log
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 382, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 387, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/usr/local/lib/python3.8/dist-packages/ts/metrics/system_metrics.py", line 119, in collect_all
    value(num_of_gpu)
  File "/usr/local/lib/python3.8/dist-packages/ts/metrics/system_metrics.py", line 90, in gpu_utilization
    statuses = list_gpus.device_statuses()
  File "/usr/local/lib/python3.8/dist-packages/nvgpu/list_gpus.py", line 75, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python3.8/dist-packages/nvgpu/list_gpus.py", line 75, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python3.8/dist-packages/nvgpu/list_gpus.py", line 19, in device_status
    nv_procs = nv.nvmlDeviceGetComputeRunningProcesses(handle)
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v3(handle);
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
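What is NVML
- NVIDIA Management Library
- A C-based API for managing and monitoring various states of NVIDIA GPU devices
- NVML provides direct access to the queries and commands exposed via nvidia-smi
- The NVML runtime ships with the NVIDIA display driver, and the SDK provides the appropriate header, stub libraries, and sample applications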
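What is pynvml
- A Python interface to the GPU management and monitoring functions
- In other words, it is a wrapper around the NVML library.
- As of version 11.0.0, the NVML wrappers used in pynvml are identical to those published through nvidia-ml-py!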
Usage
>>> from pynvml import *
>>> nvmlInit()
>>> print(f"Driver Version: {nvmlSystemGetDriverVersion()}")
Driver Version: 11.515.48
>>> deviceCount = nvmlDeviceGetCount()
>>> for i in range(deviceCount):
...     handle = nvmlDeviceGetHandleByIndex(i)
...     print(f"Device {i} : {nvmlDeviceGetName(handle)}")
...
Device 0 : Tesla K40c
>>> nvmlShutdown()
- pynvml can be used as shown above.
- In this case, however, calling nvmlDeviceGetComputeRunningProcesses fails as follows:
>>> nvmlDeviceGetComputeRunningProcesses(handle)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 382, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 387, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v3(handle);
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
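The first exception in the traceback comes from ctypes failing to find the `_v3` symbol in the driver's `libnvidia-ml.so.1`. A minimal sketch for checking which variants of the call the installed library exports is shown here. The helper names `library_exports` and `check_nvml` are hypothetical (not part of pynvml), and the default library name is an assumption based on the path in the log.

```python
import ctypes


def library_exports(lib, symbol):
    """Return True if the loaded shared library exposes `symbol`."""
    # ctypes raises AttributeError for a missing symbol, which is the
    # same failure mode as the first exception in the traceback.
    try:
        getattr(lib, symbol)
    except AttributeError:
        return False
    return True


def check_nvml(path="libnvidia-ml.so.1"):
    """Report which versions of the process-listing call the library exports.

    Returns None when the library cannot be loaded at all.
    """
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None
    base = "nvmlDeviceGetComputeRunningProcesses"
    names = (base, base + "_v2", base + "_v3")
    return {name: library_exports(lib, name) for name in names}
```

Calling `check_nvml()` on the affected machine should show `False` for the `_v3` entry if the driver library is older than what the installed pynvml expects.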
Solution
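- Either upgrade the CUDA version (NVML ships with the NVIDIA driver, so in practice the driver must be new enough to export nvmlDeviceGetComputeRunningProcesses_v3) or downgrade pynvml!
- I downgraded pynvml to 11.4 and tested again.
- It works correctly!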
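After changing the package version, it can help to confirm which pynvml is actually installed and, in your own code, to guard the call that failed. The following is a rough sketch under stated assumptions: `installed_version` and `safe_compute_processes` are hypothetical helpers, and the guard relies on pynvml's `NVMLError` carrying the error code in its `value` attribute. It does not change how TorchServe or nvgpu call pynvml internally.

```python
from importlib import metadata


def installed_version(package="pynvml"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def safe_compute_processes(nv, handle):
    """List compute processes, or return [] if the driver lacks the symbol.

    `nv` is the imported pynvml module (passed in so this file can be
    imported on machines without pynvml installed).
    """
    try:
        return nv.nvmlDeviceGetComputeRunningProcesses(handle)
    except nv.NVMLError as err:
        # Only swallow the "Function Not Found" case seen in the log;
        # any other NVML error is re-raised unchanged.
        if getattr(err, "value", None) == nv.NVML_ERROR_FUNCTION_NOT_FOUND:
            return []
        raise
```

With pynvml importable, usage would look like `import pynvml as nv; nv.nvmlInit(); safe_compute_processes(nv, nv.nvmlDeviceGetHandleByIndex(0))`.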
Reference
https://forums.developer.nvidia.com/t/unable-to-access-pynvml-methods/226914
https://github.com/NVIDIA/k8s-device-plugin/issues/331
https://pypi.org/project/pynvml/