文章目录
1. Watchdog是什么?2. Watchdog代码基本介绍3. 原因分析4. 实例分析例1:SystemServer Binder耗尽例2:IO阻塞例3:Thread.sleep阻塞
1. Watchdog是什么?
Android Watchdog主要是检测system server重要线程状态(main,android io,android display ,Activitymanager,android fg,android bg等线程 )的一种机制。发生Watchdog时会触发Android重启来使系统恢复。Watchdog是SystemServer的一个线程(class Watchdog extends Thread),检测system server重要线程的锁状态和消息是否阻塞,假如有线程block了60s那么就会触发watchdog timeout flow。
2. Watchdog代码基本介绍
可以在http://androidxref.com/搜索源码,看Watchdog.java文件
public class Watchdog extends Thread {
final ArrayList mHandlerCheckers = new ArrayList<>();
...
private Watchdog() {
super("watchdog");
//将前台线程加入队列
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
//将主线程加入队列
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
//将ui线程加入队列
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
//将i/o线程加入队列
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
//将display线程加入队列
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
addMonitor(new BinderThreadMonitor());
}
}
ActivityManagerService添加
@frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
3184 Watchdog.getInstance().addMonitor(this);
3185 Watchdog.getInstance().addThread(mHandler);
/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
synchronized (this) { }
}
Watchdog线程循环检测
@Override
public void run() {
boolean waitedHalf = false;
while (true) {
...
//java trace
final File stack = ActivityManagerService.dumpStackTraces(
!waitedHalf, pids, null, null, getInterestingNativePids());
//sysrq
// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w');
doSysRq('l');
//dropbox
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null,
subject, null, stack, null);
}
};
dropboxThread.start();
try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}
...
//打印LOG WATCHDOG KILLING SYSTEM PROCESS
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
// kill自己(SystemServer)重启
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
private static final class BinderThreadMonitor implements Watchdog.Monitor {
@Override
public void monitor() {
Binder.blockUntilThreadAvailable();
}
}
BinderThread耗尽,就会Blocked在这里
void IPCThreadState::blockUntilThreadAvailable()
{
pthread_mutex_lock(&mProcess->mThreadCountLock);
while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
static_cast(mProcess->mExecutingThreadsCount),
static_cast(mProcess->mMaxThreads));
pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
}
pthread_mutex_unlock(&mProcess->mThreadCountLock);
}
3. 原因分析
(1)手机性能影响主要一下:
cpu使用高 –查看cpu使用,看是否有问题
low memory –通过kernel log搜索lowmemorykiller看cache和free的内存是否够,是否有memory泄露
emmc问题(io文件)–block的trace应该在io操作时,可以kernellog搜索是否有emmc错误
dump log –在发生anr之前系统在dump log导致花费时间
object leakage –比如fd leak,window leakage,service leakage,receiver leakage,window leakage
(2)非性能影响,可能原因如下,看backtrace来分析
4. 实例分析
例1:SystemServer Binder耗尽
10-26 00:07:00.884 1000 17132 17312 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
10-26 00:07:00.884 1000 17132 17312 W Watchdog: foreground thread stack trace:
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:381)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:353)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Handler.handleCallback(Handler.java:873)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Looper.loop(Looper.java:193)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: *** GOODBYE!
例2:IO阻塞
W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.wm.WindowManagerService on foreground thread (android.fg)
Watchdog告诉我们Monitor Checker超时了,具体在哪呢? 名为android.fg的线程在WindowManagerService的monitor()方法被阻塞了。这里隐含了两层意思:
WindowManagerService实现了Watchdog.Monitor这个接口,并将自己作为Monitor Checker的对象加入到了Watchdog的监测集中
monitor()方法是运行在android.fg线程中的。Android将android.fg设计为一个全局共享的线程,意味着它的消息队列可以被其他线程共享, Watchdog的Monitor Checker就是使用的android.fg线程的消息队列。因此,出现Monitor Checker的超时,肯定是android.fg线程阻塞在monitor()方法上。
我们打开system_server进程的traces,检索 android.fg 可以快速定位到该线程的函数调用栈:
"android.fg" prio=5 tid=25 Blocked
| group="main" sCount=1 dsCount=0 obj=0x12eef900 self=0x7f7a8b1000
| sysTid=973 nice=0 cgrp=default sched=0/0 handle=0x7f644e9000
| state=S schedstat=( 3181688530 2206454929 8991 ) utm=251 stm=67 core=1 HZ=100
| stack=0x7f643e7000-0x7f643e9000 stackSize=1036KB
| held mutexes=
at com.android.server.wm.WindowManagerService.monitor(WindowManagerService.java:13125)
- waiting to lock <0x126dccb8> (a java.util.HashMap) held by thread 91
at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:204)
at android.os.Handler.handleCallback(Handler.java:815)
at android.os.Handler.dispatchMessage(Handler.java:104)
at android.os.Looper.loop(Looper.java:194)
at android.os.HandlerThread.run(HandlerThread.java:61)
at com.android.server.ServiceThread.run(ServiceThread.java:46)
android.fg线程调用栈告诉我们几个关键的信息:
-这个线程当前的状态是Blocked,阻塞 -由Watchdog发起调用monitor(),这是一个Watchdog检查,阻塞已经超时 -waiting to lock <0x126dccb8>: 阻塞的原因是monitor()方法中在等锁<0x126dccb8> -held by thread 91: 这个锁被编号为91的线程持有,需要进一步观察91号线程的状态。
可以在traces.txt文件中检索 tid=91 来快速找到91号线程的函数调用栈信息:
"Binder_C" prio=5 tid=91 Native
| group="main" sCount=1 dsCount=0 obj=0x12e540a0 self=0x7f63289000
| sysTid=1736 nice=0 cgrp=default sched=0/0 handle=0x7f6127c000
| state=S schedstat=( 96931835222 49673449591 260122 ) utm=7046 stm=2647 core=2 HZ=100
| stack=0x7f5ffbc000-0x7f5ffbe000 stackSize=1008KB
| held mutexes=
at libcore.io.Posix.writeBytes(Native method)
at libcore.io.Posix.write(Posix.java:258)
at libcore.io.BlockGuardOs.write(BlockGuardOs.java:313)
at libcore.io.IoBridge.write(IoBridge.java:537)
at java.io.FileOutputStream.write(FileOutputStream.java:186)
at com.android.internal.util.FastPrintWriter.flushBytesLocked(FastPrintWriter.java:334)
at com.android.internal.util.FastPrintWriter.flushLocked(FastPrintWriter.java:355)
at com.android.internal.util.FastPrintWriter.appendLocked(FastPrintWriter.java:303)
at com.android.internal.util.FastPrintWriter.print(FastPrintWriter.java:466)
- locked (a com.android.internal.util.FastPrintWriter$DummyWriter)
at com.android.server.wm.WindowState.dump(WindowState.java:1510)
at com.android.server.wm.WindowManagerService.dumpWindowsNoHeaderLocked(WindowManagerService.java:12279)
at com.android.server.wm.WindowManagerService.dumpWindowsLocked(WindowManagerService.java:12266)
at com.android.server.wm.WindowManagerService.dump(WindowManagerService.java:12654)
- locked <0x126dccb8> (a java.util.HashMap)
at android.os.Binder.dump(Binder.java:324)
at android.os.Binder.onTransact(Binder.java:290)
91号线程的名字是Binder_C,它的函数调用栈告诉我们几个关键信息:
Native,表示线程处于运行状态(RUNNING),并且正在执行JNI方法在WindowManagerService.dump()方法申请了锁<0x126dccb8>,这个锁正是android.fg线程所等待的FileOutputStream.write()表示Binder_C线程在执行IO写操作,正式因为这个写操作一直在阻塞,导致线程持有的锁不能释放
聪明的你看到这或许已经能够想到解决办法了,在这个IO写操作上加一个超时机制,并且这个超时小于Watchdog的超时,不就可以让线程释放它所占有的锁了吗? 是的,这确实可以作为一个临时解决方案(Workaround),或者说一个保护机制。但我们可以再往深处想一想,这个IO写操作为什么会阻塞:
是不是IO缓冲区满了,导致写阻塞呢?是不是写操作有什么锁,导致这个write方法在等锁呢?是不是当前系统的IO负载过于高,导致写操作效率很低呢?
这都需要我们再进一步从日志中去找原因。如果已有的日志不全,找不到论据,我们还需要设计场景来验证假设,解决问题的难度陡然上升。
例3:Thread.sleep阻塞
//watchdog half
08-17 02:59:46.824 906 1384 I Watchdog: WAITED_HALF
//dumpStackTraces half watchdog(data/anr)
08-17 02:59:46.949 906 1384 I ActivityManager: dumpStackTraces pids=null nativepids=[512, 658, 643, 819, 645, 646, 663, 828, 653, 510, 481, 508, 672, 797, 798, 806, 810, 811, 812, 815, 819, 828, 1504]
08-17 02:59:46.959 906 1384 I ActivityManager: Dumping to /data/anr/anr_2020-08-17-02-59-46-955
//发生watchdog
08-17 03:00:21.784 906 1384 I watchdog: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg)
dumpStackTraces watchdog(data/anr)
08-17 03:00:21.876 906 1384 I ActivityManager: dumpStackTraces pids=null nativepids=[512, 658, 643, 819, 645, 646, 663, 828, 653, 510, 481, 508, 672, 797, 798, 806, 810, 811, 812, 815, 819, 828, 1504]
08-17 03:00:21.880 906 1384 I ActivityManager: Dumping to /data/anr/anr_2020-08-17-03-00-21-878
//dump sysrq(/proc/sysrq-trigger)
08-17 03:00:31.834 1384 1384 I SysRq : Show Blocked State
08-17 03:00:31.843 1384 1384 I SysRq : Show backtrace of all active CPUs
//dropbox (/data/system/dropbox/system_server_watchdog@xxx.txt)
08-17 03:00:31.902 906 3367 I DropBoxManagerService: add tag=system_server_watchdog isTagEnabled=true flags=0x2
@main/system log
08-17 03:00:31.966 906 1384 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg)
08-17 03:00:31.968 906 1384 W Watchdog: android.fg annotated stack trace:
08-17 03:00:31.968 906 1384 W Watchdog: at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:17574)
08-17 03:00:31.968 906 1384 W Watchdog: - waiting to lock <0x05fe6da3> (a java.lang.Object)
08-17 03:00:31.968 906 1384 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:233)
08-17 03:00:31.968 906 1384 W Watchdog: at android.os.Handler.handleCallback(Handler.java:883)
08-17 03:00:31.969 906 1384 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:100)
08-17 03:00:31.969 906 1384 W Watchdog: at android.os.Looper.loop(Looper.java:214)
08-17 03:00:31.969 906 1384 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:67)
08-17 03:00:31.969 906 1384 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)
08-17 03:00:31.969 906 1384 W Watchdog: *** GOODBYE!
//kill self
08-17 03:00:31.969 906 1384 I Process : Sending signal. PID: 906 SIG: 9
//重启LOG
08-17 03:00:33.079 3407 3407 I boot_progress_start: 184709
08-17 03:00:33.635 3407 3407 I boot_progress_preload_start: 185266
08-17 03:00:34.376 3407 3407 I boot_progress_preload_end: 186006
08-17 03:00:34.682 3481 3481 I boot_progress_system_run: 186312
08-17 03:00:35.026 3481 3481 I boot_progress_pms_start: 186656
08-17 03:00:35.227 3481 3481 I boot_progress_pms_system_scan_start: 186857
08-17 03:00:36.049 3481 3481 I boot_progress_pms_data_scan_start: 187679
08-17 03:00:36.056 3481 3481 I boot_progress_pms_scan_end: 187687
08-17 03:00:36.165 3481 3481 I boot_progress_pms_ready: 187795
08-17 03:00:37.424 3481 3481 I boot_progress_ams_ready: 189054
08-17 03:00:38.361 3481 3503 I boot_progress_enable_screen: 189992
//watchdog half 和 watchdog trace表明是"Binder:906_F"线程sleep导致锁无法释放
//watchdog trace
----- pid 906 at 2020-08-17 03:00:21 -----
Cmd line: system_server
"android.fg" prio=5 tid=15 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x147812e8 self=0x7a0bb9cc00
| sysTid=1378 nice=0 cgrp=default sched=0/0 handle=0x791e0cbd50
| state=S schedstat=( 293867703 541217342 863 ) utm=23 stm=5 core=1 HZ=100
| stack=0x791dfc9000-0x791dfcb000 stackSize=1039KB
| held mutexes=
at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:17574)
- waiting to lock <0x05fe6da3> (a java.lang.Object) held by thread 117
at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:233)
//"Binder:906_F"线程(906:2913),sleep导致了Blocked
"Binder:906_F" prio=5 tid=117 Sleeping
| group="main" sCount=1 dsCount=0 flags=1 obj=0x13900728 self=0x797b178000
| sysTid=2913 nice=0 cgrp=default sched=0/0 handle=0x78c81e9d50
| state=S schedstat=( 227062849 166952101 637 ) utm=17 stm=4 core=2 HZ=100
| stack=0x78c80f3000-0x78c80f5000 stackSize=991KB
| held mutexes=
at java.lang.Thread.sleep(Native method)
- sleeping on <0x03e0cbb8> (a java.lang.Object)
at java.lang.Thread.sleep(Thread.java:440)
- locked <0x03e0cbb8> (a java.lang.Object)
at java.lang.Thread.sleep(Thread.java:356)
at com.android.server.am.ActivityManagerService.testWatchdog(ActivityManagerService.java:2916)
- locked <0x05fe6da3> (a java.lang.Object)
at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:4506)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2744)
at android.os.Binder.execTransactInternal(Binder.java:1021)
at android.os.Binder.execTransact(Binder.java:994)
//Binder Client端是TWatchdogTest(pid:3116 tid:3146)
08-17 03:00:32.208 3116 3146 W System.err: android.os.DeadObjectException
08-17 03:00:32.207 301 301 I binder : release 906:2913 transaction 53556 in, still active
08-17 03:00:32.207 301 301 I binder : send failed reply for transaction 53556 to 3116:3146
08-17 03:00:32.208 3116 3146 W System.err: at android.os.BinderProxy.transactNative(Native Method)
08-17 03:00:32.208 3116 3146 W System.err: at android.os.BinderProxy.transact(BinderProxy.java:510)
08-17 03:00:32.208 3116 3146 W System.err: at android.app.IActivityManager$Stub$Proxy.testWatchdog(IActivityManager.java:9331)
08-17 03:00:32.208 3116 3146 W System.err: at com.example.serviceclient.MainActivity$5.run(MainActivity.java:113)
08-17 03:00:32.208 3116 3146 W System.err: at java.lang.Thread.run(Thread.java:919)
08-17 02:58:58.155 3116 3146 D TWatchdogTest: startThread Client .
代码
明显是testWatchdog()sleep太久,导致mobj无法释放,超过60s,最后触发Watchdog @ActivityManagerService.java
private Object mobj = new Object();
@Override
public void testWatchdog() {
synchronized (mobj) {
Slog.d(TAG, "testWatchdog .");
try {
Thread.sleep(1000*1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
Slog.d(TAG, "startThread .");
}
}
//AMS monitor添加mobj锁检测
/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
synchronized (this) { }
synchronized (mobj) {
Slog.d(TAG, "testWatchdog monitor() .");
}
}
@core/java/android/app/IActivityManager.aidl
void testWatchdog();
//MainActivity
public void startThread(String name) {
Thread mThread = new Thread(new Runnable() {
@Override
public void run() {
Log.d(TAG, "startThread Client .");
try {
ActivityManager.getService().testWatchdog();
} catch (RemoteException e) {
e.printStackTrace();
}
}
});
mThread.setName(name);
mThread.start();
}
还没有内容