文章目录

1. Watchdog是什么?2. Watchdog代码基本介绍3. 原因分析4. 实例分析例1:SystemServer Binder耗尽例2:IO阻塞例3:Thread.sleep阻塞

1. Watchdog是什么?

Android Watchdog主要是检测system server重要线程状态(main,android io,android display ,Activitymanager,android fg,android bg等线程 )的一种机制。发生Watchdog时会触发Android重启来使系统恢复。Watchdog是SystemServer的一个线程(class Watchdog extends Thread),检测system server重要线程的锁状态和消息是否阻塞,假如有线程block了60s那么就会触发watchdog timeout flow。

2. Watchdog代码基本介绍

可以在http://androidxref.com/搜索源码,看Watchdog.java文件

public class Watchdog extends Thread {

final ArrayList mHandlerCheckers = new ArrayList<>();

...

private Watchdog() {

super("watchdog");

//将前台线程加入队列

mMonitorChecker = new HandlerChecker(FgThread.getHandler(),

"foreground thread", DEFAULT_TIMEOUT);

mHandlerCheckers.add(mMonitorChecker);

//将主线程加入队列

mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),

"main thread", DEFAULT_TIMEOUT));

//将ui线程加入队列

mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),

"ui thread", DEFAULT_TIMEOUT));

//将i/o线程加入队列

mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),

"i/o thread", DEFAULT_TIMEOUT));

//将display线程加入队列

mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),

"display thread", DEFAULT_TIMEOUT));

addMonitor(new BinderThreadMonitor());

}

}

ActivityManagerService添加

@frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java

3184        Watchdog.getInstance().addMonitor(this);

3185        Watchdog.getInstance().addThread(mHandler);

/** In this method we try to acquire our lock to make sure that we have not deadlocked */

public void monitor() {

synchronized (this) { }

}

Watchdog线程循环检测

@Override

public void run() {

boolean waitedHalf = false;

while (true) {

...

//java trace

final File stack = ActivityManagerService.dumpStackTraces(

!waitedHalf, pids, null, null, getInterestingNativePids());

//sysrq

// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log

doSysRq('w');

doSysRq('l');

//dropbox

Thread dropboxThread = new Thread("watchdogWriteToDropbox") {

public void run() {

mActivity.addErrorToDropBox(

"watchdog", null, "system_server", null, null,

subject, null, stack, null);

}

};

dropboxThread.start();

try {

dropboxThread.join(2000);  // wait up to 2 seconds for it to return.

} catch (InterruptedException ignored) {}

...

//打印LOG WATCHDOG KILLING SYSTEM PROCESS

} else {

Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);

WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);

Slog.w(TAG, "*** GOODBYE!");

// kill自己(SystemServer)重启

Process.killProcess(Process.myPid());

System.exit(10);

}

waitedHalf = false;

}

}

private static final class BinderThreadMonitor implements Watchdog.Monitor {

@Override

public void monitor() {

Binder.blockUntilThreadAvailable();

}

}

BinderThread耗尽,就会Blocked在这里

void IPCThreadState::blockUntilThreadAvailable()

{

pthread_mutex_lock(&mProcess->mThreadCountLock);

while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {

ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",

static_cast(mProcess->mExecutingThreadsCount),

static_cast(mProcess->mMaxThreads));

pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);

}

pthread_mutex_unlock(&mProcess->mThreadCountLock);

}

3. 原因分析

(1)手机性能影响主要一下:

cpu使用高 –查看cpu使用,看是否有问题

low memory –通过kernel log搜索lowmemorykiller看cache和free的内存是否够,是否有memory泄露

emmc问题(io文件)–block的trace应该在io操作时,可以kernellog搜索是否有emmc错误

dump log –在发生anr之前系统在dump log导致花费时间

object leakage –比如fd leak,window leakage,service leakage,receiver leakage,window leakage

(2)非性能影响,可能原因如下,看backtrace来分析

4. 实例分析

例1:SystemServer Binder耗尽

10-26 00:07:00.884 1000 17132 17312 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)

10-26 00:07:00.884 1000 17132 17312 W Watchdog: foreground thread stack trace:

10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)

10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:381)

10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:353)

10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Handler.handleCallback(Handler.java:873)

10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)

10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Looper.loop(Looper.java:193)

10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)

10-26 00:07:00.886 1000 17132 17312 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)

10-26 00:07:00.886 1000 17132 17312 W Watchdog: *** GOODBYE!

例2:IO阻塞

W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.wm.WindowManagerService on foreground thread (android.fg)

Watchdog告诉我们Monitor Checker超时了,具体在哪呢? 名为android.fg的线程在WindowManagerService的monitor()方法被阻塞了。这里隐含了两层意思:

WindowManagerService实现了Watchdog.Monitor这个接口,并将自己作为Monitor Checker的对象加入到了Watchdog的监测集中

monitor()方法是运行在android.fg线程中的。Android将android.fg设计为一个全局共享的线程,意味着它的消息队列可以被其他线程共享, Watchdog的Monitor Checker就是使用的android.fg线程的消息队列。因此,出现Monitor Checker的超时,肯定是android.fg线程阻塞在monitor()方法上。

我们打开system_server进程的traces,检索 android.fg 可以快速定位到该线程的函数调用栈:

"android.fg" prio=5 tid=25 Blocked

| group="main" sCount=1 dsCount=0 obj=0x12eef900 self=0x7f7a8b1000

| sysTid=973 nice=0 cgrp=default sched=0/0 handle=0x7f644e9000

| state=S schedstat=( 3181688530 2206454929 8991 ) utm=251 stm=67 core=1 HZ=100

| stack=0x7f643e7000-0x7f643e9000 stackSize=1036KB

| held mutexes=

at com.android.server.wm.WindowManagerService.monitor(WindowManagerService.java:13125)

- waiting to lock <0x126dccb8> (a java.util.HashMap) held by thread 91

at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:204)

at android.os.Handler.handleCallback(Handler.java:815)

at android.os.Handler.dispatchMessage(Handler.java:104)

at android.os.Looper.loop(Looper.java:194)

at android.os.HandlerThread.run(HandlerThread.java:61)

at com.android.server.ServiceThread.run(ServiceThread.java:46)

android.fg线程调用栈告诉我们几个关键的信息:

-这个线程当前的状态是Blocked,阻塞 -由Watchdog发起调用monitor(),这是一个Watchdog检查,阻塞已经超时 -waiting to lock <0x126dccb8>: 阻塞的原因是monitor()方法中在等锁<0x126dccb8> -held by thread 91: 这个锁被编号为91的线程持有,需要进一步观察91号线程的状态。

可以在traces.txt文件中检索 tid=91 来快速找到91号线程的函数调用栈信息:

"Binder_C" prio=5 tid=91 Native

| group="main" sCount=1 dsCount=0 obj=0x12e540a0 self=0x7f63289000

| sysTid=1736 nice=0 cgrp=default sched=0/0 handle=0x7f6127c000

| state=S schedstat=( 96931835222 49673449591 260122 ) utm=7046 stm=2647 core=2 HZ=100

| stack=0x7f5ffbc000-0x7f5ffbe000 stackSize=1008KB

| held mutexes=

at libcore.io.Posix.writeBytes(Native method)

at libcore.io.Posix.write(Posix.java:258)

at libcore.io.BlockGuardOs.write(BlockGuardOs.java:313)

at libcore.io.IoBridge.write(IoBridge.java:537)

at java.io.FileOutputStream.write(FileOutputStream.java:186)

at com.android.internal.util.FastPrintWriter.flushBytesLocked(FastPrintWriter.java:334)

at com.android.internal.util.FastPrintWriter.flushLocked(FastPrintWriter.java:355)

at com.android.internal.util.FastPrintWriter.appendLocked(FastPrintWriter.java:303)

at com.android.internal.util.FastPrintWriter.print(FastPrintWriter.java:466)

- locked (a com.android.internal.util.FastPrintWriter$DummyWriter)

at com.android.server.wm.WindowState.dump(WindowState.java:1510)

at com.android.server.wm.WindowManagerService.dumpWindowsNoHeaderLocked(WindowManagerService.java:12279)

at com.android.server.wm.WindowManagerService.dumpWindowsLocked(WindowManagerService.java:12266)

at com.android.server.wm.WindowManagerService.dump(WindowManagerService.java:12654)

- locked <0x126dccb8> (a java.util.HashMap)

at android.os.Binder.dump(Binder.java:324)

at android.os.Binder.onTransact(Binder.java:290)

91号线程的名字是Binder_C,它的函数调用栈告诉我们几个关键信息:

Native,表示线程处于运行状态(RUNNING),并且正在执行JNI方法在WindowManagerService.dump()方法申请了锁<0x126dccb8>,这个锁正是android.fg线程所等待的FileOutputStream.write()表示Binder_C线程在执行IO写操作,正式因为这个写操作一直在阻塞,导致线程持有的锁不能释放

聪明的你看到这或许已经能够想到解决办法了,在这个IO写操作上加一个超时机制,并且这个超时小于Watchdog的超时,不就可以让线程释放它所占有的锁了吗? 是的,这确实可以作为一个临时解决方案(Workaround),或者说一个保护机制。但我们可以再往深处想一想,这个IO写操作为什么会阻塞:

是不是IO缓冲区满了,导致写阻塞呢?是不是写操作有什么锁,导致这个write方法在等锁呢?是不是当前系统的IO负载过于高,导致写操作效率很低呢?

这都需要我们再进一步从日志中去找原因。如果已有的日志不全,找不到论据,我们还需要设计场景来验证假设,解决问题的难度陡然上升。

例3:Thread.sleep阻塞

//watchdog half

08-17 02:59:46.824   906  1384 I Watchdog: WAITED_HALF

//dumpStackTraces half watchdog(data/anr)

08-17 02:59:46.949   906  1384 I ActivityManager: dumpStackTraces pids=null nativepids=[512, 658, 643, 819, 645, 646, 663, 828, 653, 510, 481, 508, 672, 797, 798, 806, 810, 811, 812, 815, 819, 828, 1504]

08-17 02:59:46.959   906  1384 I ActivityManager: Dumping to /data/anr/anr_2020-08-17-02-59-46-955

//发生watchdog

08-17 03:00:21.784   906  1384 I watchdog: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg)

dumpStackTraces watchdog(data/anr)

08-17 03:00:21.876   906  1384 I ActivityManager: dumpStackTraces pids=null nativepids=[512, 658, 643, 819, 645, 646, 663, 828, 653, 510, 481, 508, 672, 797, 798, 806, 810, 811, 812, 815, 819, 828, 1504]

08-17 03:00:21.880   906  1384 I ActivityManager: Dumping to /data/anr/anr_2020-08-17-03-00-21-878

//dump sysrq(/proc/sysrq-trigger)

08-17 03:00:31.834  1384  1384 I SysRq   : Show Blocked State

08-17 03:00:31.843  1384  1384 I SysRq   : Show backtrace of all active CPUs

//dropbox (/data/system/dropbox/system_server_watchdog@xxx.txt)

08-17 03:00:31.902   906  3367 I DropBoxManagerService: add tag=system_server_watchdog isTagEnabled=true flags=0x2

@main/system log

08-17 03:00:31.966   906  1384 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg)

08-17 03:00:31.968   906  1384 W Watchdog: android.fg annotated stack trace:

08-17 03:00:31.968   906  1384 W Watchdog:     at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:17574)

08-17 03:00:31.968   906  1384 W Watchdog:     - waiting to lock <0x05fe6da3> (a java.lang.Object)

08-17 03:00:31.968   906  1384 W Watchdog:     at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:233)

08-17 03:00:31.968   906  1384 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:883)

08-17 03:00:31.969   906  1384 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:100)

08-17 03:00:31.969   906  1384 W Watchdog:     at android.os.Looper.loop(Looper.java:214)

08-17 03:00:31.969   906  1384 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:67)

08-17 03:00:31.969   906  1384 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:44)

08-17 03:00:31.969   906  1384 W Watchdog: *** GOODBYE!

//kill self

08-17 03:00:31.969   906  1384 I Process : Sending signal. PID: 906 SIG: 9

//重启LOG

08-17 03:00:33.079  3407  3407 I boot_progress_start: 184709

08-17 03:00:33.635  3407  3407 I boot_progress_preload_start: 185266

08-17 03:00:34.376  3407  3407 I boot_progress_preload_end: 186006

08-17 03:00:34.682  3481  3481 I boot_progress_system_run: 186312

08-17 03:00:35.026  3481  3481 I boot_progress_pms_start: 186656

08-17 03:00:35.227  3481  3481 I boot_progress_pms_system_scan_start: 186857

08-17 03:00:36.049  3481  3481 I boot_progress_pms_data_scan_start: 187679

08-17 03:00:36.056  3481  3481 I boot_progress_pms_scan_end: 187687

08-17 03:00:36.165  3481  3481 I boot_progress_pms_ready: 187795

08-17 03:00:37.424  3481  3481 I boot_progress_ams_ready: 189054

08-17 03:00:38.361  3481  3503 I boot_progress_enable_screen: 189992

//watchdog half 和 watchdog trace表明是"Binder:906_F"线程sleep导致锁无法释放

//watchdog trace

----- pid 906 at 2020-08-17 03:00:21 -----

Cmd line: system_server

"android.fg" prio=5 tid=15 Blocked

| group="main" sCount=1 dsCount=0 flags=1 obj=0x147812e8 self=0x7a0bb9cc00

| sysTid=1378 nice=0 cgrp=default sched=0/0 handle=0x791e0cbd50

| state=S schedstat=( 293867703 541217342 863 ) utm=23 stm=5 core=1 HZ=100

| stack=0x791dfc9000-0x791dfcb000 stackSize=1039KB

| held mutexes=

at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:17574)

- waiting to lock <0x05fe6da3> (a java.lang.Object) held by thread 117

at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:233)

//"Binder:906_F"线程(906:2913),sleep导致了Blocked

"Binder:906_F" prio=5 tid=117 Sleeping

| group="main" sCount=1 dsCount=0 flags=1 obj=0x13900728 self=0x797b178000

| sysTid=2913 nice=0 cgrp=default sched=0/0 handle=0x78c81e9d50

| state=S schedstat=( 227062849 166952101 637 ) utm=17 stm=4 core=2 HZ=100

| stack=0x78c80f3000-0x78c80f5000 stackSize=991KB

| held mutexes=

at java.lang.Thread.sleep(Native method)

- sleeping on <0x03e0cbb8> (a java.lang.Object)

at java.lang.Thread.sleep(Thread.java:440)

- locked <0x03e0cbb8> (a java.lang.Object)

at java.lang.Thread.sleep(Thread.java:356)

at com.android.server.am.ActivityManagerService.testWatchdog(ActivityManagerService.java:2916)

- locked <0x05fe6da3> (a java.lang.Object)

at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:4506)

at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2744)

at android.os.Binder.execTransactInternal(Binder.java:1021)

at android.os.Binder.execTransact(Binder.java:994)

//Binder Client端是TWatchdogTest(pid:3116 tid:3146)

08-17 03:00:32.208  3116  3146 W System.err: android.os.DeadObjectException

08-17 03:00:32.207   301   301 I binder  : release 906:2913 transaction 53556 in, still active

08-17 03:00:32.207   301   301 I binder  : send failed reply for transaction 53556 to 3116:3146

08-17 03:00:32.208  3116  3146 W System.err:     at android.os.BinderProxy.transactNative(Native Method)

08-17 03:00:32.208  3116  3146 W System.err:     at android.os.BinderProxy.transact(BinderProxy.java:510)

08-17 03:00:32.208  3116  3146 W System.err:     at android.app.IActivityManager$Stub$Proxy.testWatchdog(IActivityManager.java:9331)

08-17 03:00:32.208  3116  3146 W System.err:     at com.example.serviceclient.MainActivity$5.run(MainActivity.java:113)

08-17 03:00:32.208  3116  3146 W System.err:     at java.lang.Thread.run(Thread.java:919)

08-17 02:58:58.155  3116  3146 D TWatchdogTest: startThread Client .

代码

明显是testWatchdog()sleep太久,导致mobj无法释放,超过60s,最后触发Watchdog @ActivityManagerService.java

private Object mobj = new Object();

@Override

public void testWatchdog() {

synchronized (mobj) {

Slog.d(TAG, "testWatchdog .");

try {

Thread.sleep(1000*1000);

} catch (InterruptedException e) {

e.printStackTrace();

}

Slog.d(TAG, "startThread .");

}

}

//AMS monitor添加mobj锁检测

/** In this method we try to acquire our lock to make sure that we have not deadlocked */

public void monitor() {

synchronized (this) { }

synchronized (mobj) {

Slog.d(TAG, "testWatchdog monitor() .");

}

}

@core/java/android/app/IActivityManager.aidl

void testWatchdog();

//MainActivity

public void startThread(String name) {

Thread mThread = new Thread(new Runnable() {

@Override

public void run() {

Log.d(TAG, "startThread Client .");

try {

ActivityManager.getService().testWatchdog();

} catch (RemoteException e) {

e.printStackTrace();

}

}

});

mThread.setName(name);

mThread.start();

}