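Multithreading Hazards in CFRunLoop

If you are not yet familiar with run loops, see the detailed write-up 深入理解RunLoop (Understanding RunLoop in Depth).

Apple's official documentation states that CFRunLoop is thread-safe: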

Thread safety varies depending on which API you are using to manipulate your run loop. The functions in Core Foundation are generally thread-safe and can be called from any thread. If you are performing operations that alter the configuration of the run loop, however, it is still good practice to do so from the thread that owns the run loop whenever possible.

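Note, however, that Apple hedges with the vague word "generally".

In practice, some of the operations performed while a run loop is being stopped are not thread-safe.

The unsafe CFRunLoopSource

CFRunLoop itself is thread-safe, but that no longer necessarily holds once a CFRunLoopSource is involved. CFSocket is one example.

Sample code

Consider the following custom thread class: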

@interface MyThread()
@property (nonatomic, strong) NSThread *currentThread;
@property (nonatomic, assign) CFRunLoopSourceRef socketSource;
@property (nonatomic, assign) CFSocketRef socket;
@property (nonatomic, assign) CFRunLoopRef currentRunloop;
@end

@implementation MyThread

// Initialize the thread
- (instancetype)init {
    if (self = [super init]) {
        _currentThread = [[NSThread alloc] initWithTarget:self selector:@selector(runThread) object:nil];
    }
    return self;
}

// Start the thread; in practice this method is never called concurrently
- (void)startThread {
    [self.currentThread start];
}

// Thread entry point
- (void)runThread {
    @autoreleasepool {
        // Keep a reference to the run loop so that other threads can stop this thread
        self.currentRunloop = CFRunLoopGetCurrent();
        [self addSocketSource];
        CFRunLoopRun();
    }
    NSLog(@"Thread exited");
}

// In practice this method is never called concurrently
- (void)stopThread {
    [self removeSocketSource];
    @synchronized (_currentRunloop) {
        if (_currentRunloop) {
            CFRunLoopStop(_currentRunloop);
            self.currentRunloop = NULL;
        }
    }
}

// In practice this method is never called concurrently
- (void)addSocketSource {
    int sock;
    sock = socket(AF_INET6, SOCK_STREAM, 0);
    CFSocketContext context = {0, (__bridge void *)(self), NULL, NULL, NULL};
    self.socket = CFSocketCreateWithNative(NULL, sock, kCFSocketReadCallBack, socketCallBack, &context);
    self.socketSource = CFSocketCreateRunLoopSource(NULL, self.socket, 0);
    CFRunLoopAddSource(_currentRunloop, _socketSource, kCFRunLoopDefaultMode);
}

- (void)removeSocketSource {
    @synchronized (_socket) {
        if (_socket) {
            // CFSocketInvalidate may be dispatched to another thread, so CFSocketInvalidate
            // and CFRunLoopStop may end up being called concurrently on different threads
            CFSocketInvalidate(_socket);
            CFRelease(_socket);
            self.socket = NULL;
        }
    }
}

@end

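In practice, the CFSocket is managed by a separate socket class, so addSocketSource and removeSocketSource actually live in that other class. This makes it possible for CFSocketInvalidate and CFRunLoopStop to be called concurrently on different threads.

Analyzing a crash

At first glance nothing looks wrong: everything that needs a lock has one, and the CF-prefixed functions are all supposed to be thread-safe. But if CFSocketInvalidate and CFRunLoopStop happen to be called concurrently on different threads, a crash is possible. Here is one such crash we received in our project: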

Thread 0 name: Dispatch queue: com.apple.main-thread
Thread 0 Crashed:
0 CoreFoundation 0x000000018e6a9144 CFRunLoopWakeUp + 92
1 CoreFoundation 0x000000018e6a9140 CFRunLoopWakeUp + 88
2 CoreFoundation 0x000000018e6d71e8 CFSocketInvalidate + 712
3 MyApp 0x00000001000fe424 (-[MySocket stop] + 136)
4 MyApp 0x00000001000fcd50 (-[MySocket dealloc] + 56)
5 libsystem_blocks.dylib 0x000000018d6afa28 _Block_release + 144
6 libdispatch.dylib 0x000000018d65a1bc _dispatch_client_callout + 16
7 libdispatch.dylib 0x000000018d65ed68 _dispatch_main_queue_callback_4CF + 1000
8 CoreFoundation 0x000000018e77e810 __CFRUNLOOP_IS_SERVICING_THE_MAIN_DISPATCH_QUEUE__ + 12
9 CoreFoundation 0x000000018e77c3fc __CFRunLoopRun + 1660
10 CoreFoundation 0x000000018e6aa2b8 CFRunLoopRunSpecific + 444
11 GraphicsServices 0x000000019015e198 GSEventRunModal + 180
12 UIKit 0x00000001946f17fc -[UIApplication _run] + 684
13 UIKit 0x00000001946ec534 UIApplicationMain + 208
14 DuoYiIM 0x000000010003ca58 0x100024000 + 100952 (main + 132)
15 libdyld.dylib 0x000000018d68d5b8 start + 4
Thread 0 crashed with ARM-64 Thread State:
cpsr: 0x0000000020000000 fp: 0x000000016fddab30 lr: 0x000000018e6a9140 pc: 0x000000018e6a9144
sp: 0x000000016fddaa00 x0: 0x0000000000000000 x1: 0x0000000000000000 x10: 0x0000000000000000
x11: 0x0000000000000000 x12: 0x0000000000000000 x13: 0x0000000000000000 x14: 0x0000000000000000
x15: 0x0000000000001203 x16: 0x000000000000012d x17: 0x000000018f1eef74 x18: 0x0000000000000000
x19: 0x000000017056cb50 x2: 0x0000000000001000 x20: 0x000000017056cb40 x21: 0x96e73914144e0055
x22: 0x0000000174452990 x23: 0x000000017048bae0 x24: 0x0000000000000000 x25: 0x00000000ffffffff
x26: 0xffffffffffffffff x27: 0x000000017426f1c0 x28: 0x0000000002ffffff x29: 0x000000016fddab30
x3: 0x000000000017e4a6 x4: 0x0000000000012068 x5: 0x0000000000000000 x6: 0x0000000000000036
x7: 0xffffffffffffffec x8: 0x8c8c8c8c8c8c8c8c x9: 0x000000000000000c

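CFSocketInvalidate was called on the main thread. According to the stack trace, the crash happened when CFSocketInvalidate called CFRunLoopWakeUp internally.

The stack trace alone does not reveal the cause, so we need to find out where inside CFRunLoopWakeUp the crash occurred. Here is the disassembly of the matching CoreFoundation version: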

_CFRunLoopWakeUp:
0x0000000181521b9c FF0305D1 sub sp, sp, #0x140 ; CODE XREF=_CFRunLoopAddTimer+696, _CFRunLoopTimerSetNextFireDate+592, _CFSocketInvalidate+708, __wakeUpRunLoop+276, __CFXRegistrationPost+344, -[CFPrefsSearchListSource asynchronouslyNotifyOfChangesFromDictionary:toDictionary:]+172, ___CFSocketPerformV0+1408, ___CFSocketManager+2004, ___CFSocketManager+4248, _boundPairRead+604, _boundPairReadClose+124, …
0x0000000181521ba0 FC6F11A9 stp x28, x27, [sp, #0x110]
0x0000000181521ba4 F44F12A9 stp x20, x19, [sp, #0x120]
0x0000000181521ba8 FD7B13A9 stp x29, x30, [sp, #0x130]
0x0000000181521bac FDC30491 add x29, sp, #0x130
0x0000000181521bb0 F40300AA mov x20, x0
0x0000000181521bb4 C80C10F0 adrp x8, #0x1a16bc000
0x0000000181521bb8 084140F9 ldr x8, [x8, #0x80] ; -[_CFXPreferences init]_1a16bc080
0x0000000181521bbc 080140F9 ldr x8, [x8]
0x0000000181521bc0 292013F0 adrp x9, #0x1a7928000
0x0000000181521bc4 29E90791 add x9, x9, #0x1fa ; ___CF120290
0x0000000181521bc8 A8831DF8 stur x8, [x29, #-0x28]
0x0000000181521bcc E8030032 orr w8, wzr, #0x1
0x0000000181521bd0 28010039 strb w8, [x9] ; ___CF120290
0x0000000181521bd4 E8731290 adrp x8, #0x1a639d000
0x0000000181521bd8 08F13F91 add x8, x8, #0xffc ; ___CF120293
0x0000000181521bdc 08014039 ldrb w8, [x8] ; ___CF120293
0x0000000181521be0 48000034 cbz w8, loc_181521be8
0x0000000181521be4 E3560394 bl ___THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__
loc_181521be8:
0x0000000181521be8 93420091 add x19, x20, #0x10 ; CODE XREF=_CFRunLoopWakeUp+68
0x0000000181521bec E00313AA mov x0, x19
0x0000000181521bf0 70300694 bl imp___stubs_-[NSOrderedSet sortedArrayFromRange:options:usingComparator:] // symbols in the on-device system library are obfuscated; this is actually __CFRunLoopLock
0x0000000181521bf4 882E40F9 ldr x8, [x20, #0x58]
0x0000000181521bf8 080D40B9 ldr w8, [x8, #0xc]
0x0000000181521bfc A8010034 cbz w8, loc_181521c30

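The crash log places the crash at CFRunLoopWakeUp + 92, which corresponds to the address 0x0000000181521b9c + 92 = 0x0000000181521bf8 in the disassembly, that is, the instruction ldr w8, [x8, #0xc]. The register dump at the time of the crash shows x8: 0x8c8c8c8c8c8c8c8c, a garbage pointer, which strongly suggests that the memory it was loaded from had already been freed. x8 comes from ldr x8, [x20, #0x58] (the value stored at offset 0x58 from the address in x20), and x20 comes from mov x20, x0. x0 holds the first argument of CFRunLoopWakeUp, the CFRunLoopRef, so x8 is the value stored at offset 0x58 inside the CFRunLoop structure.

CoreFoundation is open source and can be downloaded here: CF-1153.18

The corresponding CFRunLoopWakeUp source: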
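The address arithmetic above is easy to double-check in a few lines of C. This is only a sketch: the helper names fault_address, looks_like_fill_pattern and check_crash_report are made up for illustration, and treating a value whose eight bytes are all identical as a sign of overwritten or freed memory is a heuristic rather than a rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: address of the instruction located `offset`
 * bytes past the first instruction of a symbol. */
uint64_t fault_address(uint64_t symbol_start, uint64_t offset) {
    return symbol_start + offset;
}

/* Hypothetical heuristic: true when all eight bytes of `value` are the
 * same non-trivial byte, as in 0x8c8c8c8c8c8c8c8c. All-zero and
 * all-ones values are excluded because 0 and -1 are ordinary values. */
bool looks_like_fill_pattern(uint64_t value) {
    uint64_t low = value & 0xff;
    if (low == 0x00 || low == 0xff) {
        return false;
    }
    return value == low * 0x0101010101010101ULL;
}

/* Usage with the numbers taken from the crash log and the disassembly. */
void check_crash_report(void) {
    /* CFRunLoopWakeUp starts at 0x181521b9c; the crash is at +92. */
    assert(fault_address(0x0000000181521b9cULL, 92) == 0x0000000181521bf8ULL);
    /* x8 at the time of the crash. */
    assert(looks_like_fill_pattern(0x8c8c8c8c8c8c8c8cULL));
}
```

An ordinary heap pointer such as the x19 value in the same dump (0x000000017056cb50) does not match the pattern, which is what makes x8 stand out.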

void CFRunLoopWakeUp(CFRunLoopRef rl) {
    CHECK_FOR_FORK();
    __CFRunLoopLock(rl);
    if (__CFRunLoopIsIgnoringWakeUps(rl)) {
        __CFRunLoopUnlock(rl);
        return;
    }
    kern_return_t ret;
    ret = __CFSendTrivialMachMessage(rl->_wakeUpPort, 0, MACH_SEND_TIMEOUT, 0);
    if (ret != MACH_MSG_SUCCESS && ret != MACH_SEND_TIMED_OUT) CRASH("*** Unable to send message to wake up port. (%d) ***", ret);
    __CFRunLoopUnlock(rl);
}

CF_INLINE Boolean __CFRunLoopIsIgnoringWakeUps(CFRunLoopRef rl) {
    return (rl->_perRunData->ignoreWakeUps) ? true : false;
}

The CFRunLoop structure:

struct __CFRunLoop {
    CFRuntimeBase _base;        // 16 bytes
    pthread_mutex_t _lock;      // 64 bytes
    __CFPort _wakeUpPort;       // mach_port_t (unsigned int), 4 bytes
    Boolean _unused;            // 1 byte, followed by 3 bytes of padding so that the next pointer is 8-byte aligned: 4 bytes in total
    volatile _per_run_data *_perRunData;
    pthread_t _pthread;
    uint32_t _winthread;
    CFMutableSetRef _commonModes;
    CFMutableSetRef _commonModeItems;
    CFRunLoopModeRef _currentMode;
    CFMutableSetRef _modes;
    struct _block_item *_blocks_head;
    struct _block_item *_blocks_tail;
    CFAbsoluteTime _runTime;
    CFAbsoluteTime _sleepTime;
    CFTypeRef _counterpart;
};

typedef struct __CFRuntimeBase {
    uintptr_t _cfisa;           // unsigned long, 8 bytes
    uint8_t _cfinfo[4];         // unsigned char, 4 bytes
#if __LP64__
    uint32_t _rc;               // unsigned int, 4 bytes
#endif
} CFRuntimeBase;

struct pthread_mutex_t {
    long __sig;                 // 8 bytes
    char __opaque[56];          // 56 bytes
};

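Adding up the field sizes (16 + 64 + 4 + 4 = 88 = 0x58) shows that ldr x8, [x20, #0x58] loads runloop->_perRunData. In other words, by the time __CFRunLoopIsIgnoringWakeUps was called, the CFRunLoopRef had already been released.

Analyzing the CFSocket source

Here is the source of CFSocketInvalidate: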
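The offset can also be checked mechanically with offsetof. The sketch below uses mock structures (MockRuntimeBase, MockMutex, MockRunLoop, all hypothetical names) that only mirror the field sizes annotated in the listing above; it does not include the real CoreFoundation headers and assumes a 64-bit (LP64) target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock of CFRuntimeBase on LP64: 8 + 4 + 4 = 16 bytes. */
typedef struct {
    uint64_t cfisa;
    uint8_t  cfinfo[4];
    uint32_t rc;
} MockRuntimeBase;

/* Mock of the Darwin pthread_mutex_t: 8 + 56 = 64 bytes. */
typedef struct {
    int64_t sig;
    char    opaque[56];
} MockMutex;

/* Mock of the leading fields of struct __CFRunLoop. */
typedef struct {
    MockRuntimeBase base;        /* offset 0x00 */
    MockMutex       lock;        /* offset 0x10 */
    uint32_t        wakeUpPort;  /* offset 0x50, stands in for mach_port_t */
    uint8_t         unused;      /* offset 0x54, stands in for Boolean */
    void           *perRunData;  /* padded up to the next 8-byte boundary */
} MockRunLoop;

/* Offset of the field that the crashing instruction dereferences. */
size_t per_run_data_offset(void) {
    return offsetof(MockRunLoop, perRunData);
}

/* Usage: on a 64-bit target the offset matches the #0x58 in the disassembly. */
void check_layout(void) {
    assert(sizeof(MockRuntimeBase) == 16);
    assert(sizeof(MockMutex) == 64);
    assert(per_run_data_offset() == 0x58);
}
```

Letting the compiler compute the offset avoids mistakes in the manual padding arithmetic, at the cost of trusting that the mock fields really match the layout of the CoreFoundation build that crashed.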

void CFSocketInvalidate(CFSocketRef s) {
    CHECK_FOR_FORK();
    CFRetain(s);
    __CFLock(&__CFAllSocketsLock);
    __CFSocketLock(s);
    if (__CFSocketIsValid(s)) {
        // some code omitted...
        // Take the run loop array out of the socket
        CFArrayRef runLoops = (CFArrayRef)CFRetain(s->_runLoops);
        // CFRunLoop release operation 1
        CFRelease(s->_runLoops);
        s->_runLoops = NULL;
        // some code omitted...
        __CFSocketUnlock(s);
        // Do this after the socket unlock to avoid deadlock (10462525)
        for (idx = CFArrayGetCount(runLoops); idx--;) {
            CFRunLoopWakeUp((CFRunLoopRef)CFArrayGetValueAtIndex(runLoops, idx));
        }
        // CFRunLoop release operation 3
        CFRelease(runLoops);
        // some code omitted...
    } else {
        __CFSocketUnlock(s);
    }
    __CFUnlock(&__CFAllSocketsLock);
    CFRelease(s);
}

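The only place where CFSocketInvalidate calls CFRunLoopWakeUp is the loop over runLoops at the end.
But at that point the CFRunLoopRef is still in the array and strongly referenced by it, so how can it already be released by the time CFRunLoopWakeUp runs?

Note that CFSocketInvalidate iterates over runLoops outside the lock. This suggests that CFSocket may not be managing its run loop array properly, so that the array's contents get released while it is being iterated. Judging from the comment "Do this after the socket unlock to avoid deadlock (10462525)", the iteration probably used to happen inside the lock, caused a deadlock, and was moved outside. Apple's bug reports are not public; the only possibly related discussion I could find is here: bug #10462525

The most likely culprit is __CFSocketCancel. When a run loop stops, its sources are removed as well, and CFRunLoopRemoveSource invokes the cancel callback of a version 0 source, which for a socket source is __CFSocketCancel: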

void CFRunLoopRemoveSource(CFRunLoopRef rl, CFRunLoopSourceRef rls, CFStringRef modeName) {
    CHECK_FOR_FORK();
    Boolean doVer0Callout = false, doRLSRelease = false;
    __CFRunLoopLock(rl);
    if (modeName == kCFRunLoopCommonModes) {
        // code omitted...
    } else {
        CFRunLoopModeRef rlm = __CFRunLoopFindMode(rl, modeName, false);
        if (NULL != rlm && ((NULL != rlm->_sources0 && CFSetContainsValue(rlm->_sources0, rls)) || (NULL != rlm->_sources1 && CFSetContainsValue(rlm->_sources1, rls)))) {
            CFRetain(rls);
            // code omitted...
            if (0 == rls->_context.version0.version) {
                if (NULL != rls->_context.version0.cancel) {
                    doVer0Callout = true;
                }
            }
            doRLSRelease = true;
        }
        // code omitted...
    }
    __CFRunLoopUnlock(rl);
    if (doVer0Callout) {
        // although it looses some protection for the source, we have no choice but
        // to do this after unlocking the run loop and mode locks, to avoid deadlocks
        // where the source wants to take a lock which is already held in another
        // thread which is itself waiting for a run loop/mode lock
        rls->_context.version0.cancel(rls->_context.version0.info, rl, modeName); /* CALLOUT */
    }
    if (doRLSRelease) CFRelease(rls);
}

The source of __CFSocketCancel:

static void __CFSocketCancel(void *info, CFRunLoopRef rl, CFStringRef mode) {
    CFSocketRef s = (CFSocketRef)info;
    __CFSocketLock(s);
    if (0 == s->_socketSetCount) {
        // code omitted...
    }
    if (NULL != s->_runLoops) {
        // Remove this run loop from the array: copy the original array, then release the original
        CFMutableArrayRef runLoopsOrig = s->_runLoops;
        CFMutableArrayRef runLoopsCopy = CFArrayCreateMutableCopy(kCFAllocatorSystemDefault, 0, s->_runLoops);
        idx = CFArrayGetFirstIndexOfValue(runLoopsCopy, CFRangeMake(0, CFArrayGetCount(runLoopsCopy)), rl);
        if (0 <= idx) CFArrayRemoveValueAtIndex(runLoopsCopy, idx);
        s->_runLoops = runLoopsCopy;
        // CFRunLoop release operation 2
        CFRelease(runLoopsOrig);
    }
    __CFSocketUnlock(s);
}

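__CFSocketCancel also contains one release that can affect the CFRunLoopRef. Together with the two in CFSocketInvalidate, that makes three release operations in total.

So if __CFSocketCancel and CFSocketInvalidate run concurrently on different threads, the run loop array inside the CFSocket can be over-released, and the CFRunLoopRef may already be gone while the array is being iterated. The crash is rare, but in our project it kept showing up reliably every so often.

In short, adding locks does not solve everything: to be safe, CFSocketInvalidate should take one more retain before iterating the array.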

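Solutions

  • Since this is a bug inside CFSocket, the only option is to make sure your code never runs CFSocketInvalidate and CFRunLoopStop concurrently on different threads.
  • If your socket only runs on this thread, simply call CFRunLoopStop; the run loop cleans up all of its sources automatically.
  • If the thread needs to be reused, do not stop it. Instead, stop the socket and create the new socket on the same thread.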

A run loop that stops by itself

So if we change the stop code as follows, everything should be fine, right?

- (void)runThread {
    @autoreleasepool {
        self.currentRunloop = CFRunLoopGetCurrent();
        [self addRunloopSource];
        [self addSocketSource];
        CFRunLoopRun();
    }
    NSLog(@"Thread exited");
}

- (void)stopThread {
    if (_currentRunloop) {
        // Ensure removeSocketSource is only ever called from here, never concurrently
        [self removeSocketSource];
        CFRunLoopStop(_currentRunloop);
        self.currentRunloop = NULL;
    }
}

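Unfortunately, this is still not safe.

The reason is that after removeSocketSource the run loop has no sources left. When a run loop detects that it has no sources, it exits its loop automatically, and the thread is then destroyed.

So if you call stopThread from another thread, the thread may exit at any moment after removeSocketSource, and the run loop may already have been released by the time CFRunLoopStop is called.

With the code above the crash is too rare to observe, but a small change makes it reproduce every time: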

- (void)stopThread {
    if (_currentRunloop) {
        [self removeSocketSource];
        // Insert a time-consuming operation
        sleep(2);
        // Crashes every time
        CFRunLoopStop(_currentRunloop);
        self.currentRunloop = NULL;
    }
}

In this case the crash is really a memory management problem. Retaining the run loop once is enough to fix it:

- (void)runThread {
    @autoreleasepool {
        // Retain the run loop once; CFRetain returns CFTypeRef, hence the cast
        self.currentRunloop = (CFRunLoopRef)CFRetain(CFRunLoopGetCurrent());
        [self addRunloopSource];
        [self addSocketSource];
        CFRunLoopRun();
    }
    NSLog(@"Thread exited");
}

- (void)stopThread {
    if (_currentRunloop) {
        [self removeSocketSource];
        CFRunLoopStop(_currentRunloop);
        // Balance the CFRetain from runThread
        CFRelease(_currentRunloop);
        self.currentRunloop = NULL;
    }
}

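Conclusion

Be careful when using run loop sources, especially while stopping. Other kinds of sources may have similar problems.

When a variable is accessed from multiple threads, touching it outside the lock is unsafe even for reads. It is best to retain it once more before reading, so that it cannot be released in the middle of the read.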
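The "retain before reading outside the lock" advice can be illustrated with a small, generic C model. This is not CoreFoundation code: Obj, Holder and every function below are hypothetical names, with a plain atomic counter standing in for CFRetain/CFRelease and a pthread mutex standing in for the socket lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* A minimal reference-counted object (stand-in for a CF object). */
typedef struct {
    atomic_int refcount;
    int value;
} Obj;

Obj *obj_create(int value) {
    Obj *o = malloc(sizeof *o);
    assert(o != NULL);
    atomic_init(&o->refcount, 1);
    o->value = value;
    return o;
}

Obj *obj_retain(Obj *o) {
    atomic_fetch_add(&o->refcount, 1);
    return o;
}

void obj_release(Obj *o) {
    int previous = atomic_fetch_sub(&o->refcount, 1);
    assert(previous > 0);          /* trips on over-release in debug builds */
    if (previous == 1) free(o);
}

/* A shared slot guarded by a lock (stand-in for the socket and its array). */
typedef struct {
    pthread_mutex_t lock;
    Obj *shared;                   /* only touched while holding lock */
} Holder;

/* Retain while still holding the lock, so the returned object stays alive
 * after the lock is dropped. Returns NULL if the slot is empty; otherwise
 * the caller owns one reference and must call obj_release. */
Obj *holder_copy_shared(Holder *h) {
    pthread_mutex_lock(&h->lock);
    Obj *o = h->shared ? obj_retain(h->shared) : NULL;
    pthread_mutex_unlock(&h->lock);
    return o;
}

/* What another thread might do at any time: drop the holder's reference. */
void holder_clear(Holder *h) {
    pthread_mutex_lock(&h->lock);
    Obj *old = h->shared;
    h->shared = NULL;
    pthread_mutex_unlock(&h->lock);
    if (old) obj_release(old);
}
```

A reader calls holder_copy_shared, works with the returned object outside the lock, and then calls obj_release. Even if holder_clear runs in between, the object stays valid because the reader holds its own reference.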