Problem description
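Winter is dry and I keep two cats at home, so my hands often carry a static charge. Lately I have been running heavy workloads on my Raspberry Pi, and I could not resist touching it to check its temperature (even though it has a built-in temperature sensor). That is where the trouble started: I would often hear the crackle of a static discharge, and the USB-attached disk would disconnect on the spot. Changing into the disk's mount point then shows this:
➜ data ls
ls: reading directory '.': Input/output error
The quickest fix is a reboot. I did not try whether umount followed by mount would also work.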
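At first I assumed the Raspberry Pi itself was at fault, and since I was busy I never got around to fixing it (just don't touch it, right?). Then one day I happened to use a friend's old MacBook Air, and a static discharge from my hand on the trackpad made the external hard drive drop its connection. That is when I realized static electricity is probably not kind to USB 3.0.
I found an article on this. The gist is that electrostatic discharge (ESD) causes a momentary short on the USB lines, so a USB circuit hit by static simply disconnects. My heatsink case is electrically connected to the Raspberry Pi's USB ports, so it is no wonder that static affects the USB connection.
Not touching the board is a rather passive way to solve this. I considered grounding the Raspberry Pi, for example by running a wire from one of its GND pins to the earth contact of a wall socket, but such grounding plugs do not seem to be sold in my country.
Touching the metal case to a nearby grounded metal surface would work too, but there really is no grounded metal surface nearby. There is an IKEA aluminium floor lamp, but it has a two-pin plug.
Still, this happens rarely and a reboot fixes it, so it is enough to have the system reboot automatically when it does happen. I decided to use watchdog for that.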
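Roughly, it works like this: the watchdog daemon keeps sending keepalive signals to the watchdog hardware on the board by writing to /dev/watchdog. If the configured time (15 seconds in my configuration file) passes after the last signal without a new one arriving, the hardware watchdog concludes that the device has hung and forces a hard reboot.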
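The timeout rule can be illustrated with a small sketch. It only does the arithmetic and never touches /dev/watchdog. The function name watchdog_expired and the sample timestamps are made up for illustration, and the sketch does not model whether real hardware fires exactly at the limit or just after it.

```shell
#!/bin/bash
# Hypothetical helper: would a hardware watchdog with the given timeout
# have fired by now? All values are in seconds.
watchdog_expired() {
    local last_keepalive=$1 now=$2 timeout=$3
    # True (status 0) once more than `timeout` seconds have passed
    # since the last keepalive.
    (( now - last_keepalive > timeout ))
}

# Last keepalive at t=100 s, timeout 15 s as in the configuration file.
for now in 110 116; do
    if watchdog_expired 100 "$now" 15; then
        echo "t=$now: hardware resets the board"
    else
        echo "t=$now: still alive"
    fi
done
```

With these sample values the first check reports that the board is still alive (10 seconds elapsed) and the second reports a reset (16 seconds elapsed).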
Installing watchdog
If watchdog is not installed yet, install it first.
# Check that the watchdog hardware device exists
ls -la /dev/watchdog*
# crw------- 1 root root 10, 130 Jan 22 21:47 /dev/watchdog
# crw------- 1 root root 250, 0 Jan 22 21:47 /dev/watchdog0
# Install the watchdog daemon
apt update && apt install watchdog
# Edit the configuration file
vim /etc/watchdog.conf
This is my configuration file:
#ping = 172.31.14.1
#ping = 172.26.1.255
#interface = eth0
#file = /var/log/messages
#change = 1407

# Uncomment to enable test. Setting one of these values to '0' disables it.
# These values will hopefully never reboot your machine during normal use
# (if your machine is really hung, the loadavg will go much higher than 25)
max-load-1 = 24
#max-load-5 = 18
#max-load-15 = 12

# Note that this is the number of pages!
# To get the real size, check how large the pagesize is on your machine.
#min-memory = 1
#allocatable-memory = 1

#repair-binary = /usr/sbin/repair
#repair-timeout = 60
#test-binary =
#test-timeout = 60

# The retry-timeout and repair limit are used to handle errors in a more robust
# manner. Errors must persist for longer than retry-timeout to action a repair
# or reboot, and if repair-maximum attempts are made without the test passing a
# reboot is initiated anyway.
#retry-timeout = 60
#repair-maximum = 1

# The following two lines are required
watchdog-device = /dev/watchdog
watchdog-timeout = 15

# Defaults compiled into the binary
#temperature-sensor =
#max-temperature = 90

# Defaults compiled into the binary
#admin = root
#interval = 1
#logtick = 1
#log-dir = /var/log/watchdog

# This greatly decreases the chance that watchdog won't be scheduled before
# your machine is really loaded
realtime = yes
priority = 1

# Check if rsyslogd is still running by enabling the following line
#pidfile = /var/run/rsyslogd.pid
Save and close the file, then start the watchdog service:
systemctl start watchdog
systemctl enable watchdog
Run systemctl status watchdog; if it reports no problems, watchdog has been enabled successfully.
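Because most of the configuration file is comments, it can help to list only the lines that are actually in effect. The following is a minimal sketch. The helper name active_settings is made up, and the demo runs on a throwaway sample file so that it works on any machine.

```shell
#!/bin/bash
# Hypothetical helper: print only the active (non-comment, non-blank)
# lines of a config file. Defaults to /etc/watchdog.conf.
active_settings() {
    local conf=${1:-/etc/watchdog.conf}
    grep -Ev '^[[:space:]]*(#|$)' "$conf"
}

# Demo on a small sample file instead of the real configuration.
sample=$(mktemp)
printf '%s\n' '#interval = 1' '' 'max-load-1 = 24' \
    '  # indented comment' 'watchdog-timeout = 15' > "$sample"
active_settings "$sample"
rm -f "$sample"
```

On the real system, calling active_settings with no argument reads /etc/watchdog.conf. For the configuration above it should list only max-load-1, watchdog-device, watchdog-timeout, realtime and priority.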
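Making watchdog reboot when the disk fails
By default, however, watchdog pays no attention to disk connectivity. To make it care whether the disk has failed, we need a relatively little-known feature: automatically running tests and attempting repairs.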
Debian ManPage watchdog/watchdog.8: click to read
Linux Watchdog Daemon – Test/Repair Scripts: click to read
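According to the man page, watchdog periodically runs every executable program or script placed in the /etc/watchdog.d/ directory and decides what to do based on the return value. These are the return values that mean something to watchdog: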
255 | (based on -1 as unsigned 8-bit number) Reboot the system. This is not exactly an error message but a command to watchdog. If the return code is this the watchdog will not try to run a shutdown script instead. |
254 | Reset the system. This is not exactly an error message but a command to watchdog. If the return code is this the watchdog will attempt to hard-reset the machine without attempting any sort of orderly stopping of process, unmounting of file systems, etc. |
253 | Maximum load average exceeded. |
252 | The temperature inside is too high. |
251 | /proc/loadavg contains no (or not enough) data. |
250 | The given file was not changed in the given interval. |
249 | /proc/meminfo contains invalid data. |
248 | Child process was killed by a signal. |
247 | Child process did not return in time. |
246 | Free for personal watchdog-specific use (was -10 as an unsigned 8-bit number). |
245 | Reserved for an unknown result, for example a slow background test that is still running so neither a success nor an error. |
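Suppose the script is called check_temp and returns 252 when the temperature is too high and 0 when it is normal. watchdog periodically runs check_temp test and reads the return value. If it is non-zero, say 252, watchdog then tries check_temp repair 252 and reads that return value. If the script includes a repair routine and returns 0 once the problem is fixed, nothing further happens. If the repair fails, or the script has no repair routine and keeps returning 252, watchdog reboots the system after several fruitless attempts.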
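The test/repair calling convention can be sketched as follows. This is an illustration only, not a script from this setup. The sensor path, the 80 °C limit, and the choice to pass the check when the sensor cannot be read are all assumptions.

```shell
#!/bin/bash
# Hypothetical /etc/watchdog.d/check_temp illustrating the test/repair protocol.
# Assumed, not from the post: the sensor path and the 80 degree limit.
TEMP_FILE="${TEMP_FILE:-/sys/class/thermal/thermal_zone0/temp}"  # millidegrees C
LIMIT_MILLI_C="${LIMIT_MILLI_C:-80000}"

check_temp() {
    case "$1" in
        test)
            local t
            # If the sensor cannot be read we cannot judge, so do not fail.
            t=$(cat "$TEMP_FILE" 2>/dev/null) || return 0
            if [ "$t" -le "$LIMIT_MILLI_C" ]; then
                return 0
            fi
            return 252   # "The temperature inside is too high."
            ;;
        repair)
            # $2 is the code the failed test returned. This sketch has no
            # repair step, so it hands the same code back to watchdog.
            return "${2:-252}"
            ;;
        *)
            return 246   # free for personal use, per the table above
            ;;
    esac
}

# Dispatch only when called with arguments, as watchdog does.
if [ "$#" -gt 0 ]; then
    check_temp "$@"
    exit $?
fi
```

A real repair branch would try to fix the problem and return 0 only once it has succeeded.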
So all I need is a script that checks whether a given path is writable.
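# Create the test directory if it does not exist yet
mkdir -p /etc/watchdog.d/
# Quote the delimiter ('EOF') so that $DIR is written literally, not expanded now
cat << 'EOF' > /etc/watchdog.d/test_io_error.sh
#!/bin/bash
DIR="/mnt/data/"
if [ -w "$DIR" ]; then
    exit 0
else
    exit 255
fi
EOF
chmod +x /etc/watchdog.d/test_io_error.sh
systemctl daemon-reload
systemctl stop watchdog
systemctl start watchdog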
Done!
Tips:
In bash, echo $? prints the return value of the last program that ran.
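Since the failure shows up as an I/O error from ls, a variant of the check could read the directory instead of only testing write permission, and give up after a few seconds rather than hang on a dead device. The following is a sketch under those assumptions. The function name check_dir and the 5-second limit are made up, and timeout is the GNU coreutils command.

```shell
#!/bin/bash
# Hypothetical variant of the disk check: list the directory with a time limit.
check_dir() {
    local dir=$1
    # ls has to read the directory; timeout(1) stops it after 5 seconds.
    if timeout 5 ls -A "$dir" > /dev/null 2>&1; then
        return 0
    fi
    return 255   # per the table above: tell watchdog to reboot
}

rc=0; check_dir / || rc=$?
echo "check_dir / returned $rc"
rc=0; check_dir /no/such/dir/for-this-demo || rc=$?
echo "check_dir on a missing path returned $rc"
```

The demo calls at the bottom only exercise the function. In a real test script the last line would instead pass the mount point to check_dir and exit with its status.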
To test the script, I made sure my data was safe and then pulled out the USB cable by hand. The watchdog log then showed the following:
watchdog[2851]: test binary /etc/watchdog.d/test_io_error.sh returned 5 = 'Input/output error'
watchdog[2851]: test binary /etc/watchdog.d/test_io_error.sh returned 5 = 'Input/output error'
watchdog[2851]: test binary /etc/watchdog.d/test_io_error.sh returned 5 = 'Input/output error'
watchdog[2851]: test binary /etc/watchdog.d/test_io_error.sh returned 5 = 'Input/output error'
watchdog[2851]: test binary /etc/watchdog.d/test_io_error.sh returned 5 = 'Input/output error'
watchdog[2851]: test binary /etc/watchdog.d/test_io_error.sh returned 5 = 'Input/output error'
...
watchdog[2851]: Retry timed-out at 61 seconds for /etc/watchdog.d/test_io_error.sh
watchdog[2851]: repair binary /etc/watchdog.d/test_io_error.sh returned 5 = 'Input/output error'
watchdog[2851]: shutting down the system because of error 5 = 'Input/output error'
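And then the system rebooted! Nice!
What I do not quite understand is why the log shows a return value of 5 when my script returns 255. My grasp of bash is still not very thorough, and I hope to answer this question through later practice.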
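Addendum, April 3:
Looking at it now, the question seems fairly simple. As I understand it, the line if [ -w "$DIR" ]; has several possible outcomes. The best case, of course, is that the directory exists and is writable, and the script returns 0. If the test evaluates to false but the test itself ran successfully, the script returns 255. When the disk is disconnected outright, the test itself fails, so the script never reaches the return statements: it exits at the test line and returns the reason for the failure, which is error code 5, and none of the exit statements below it run.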