Debugging OOPS in Linux 2.6.15 as shipped in AirOS-5.1

I now have Quagga running and configuring multipath routes on AirOS 5.1, but shortly after the multipath routes are installed by Quagga the Linux kernel OOPSs, like this:

CPU 0 Unable to handle kernel paging request at virtual address 00000090, epc == 80150d00, ra == 8015101c
Oops[#1]:
Cpu 0
$ 0   : 00000000 7faa867c 00000028 00000020
$ 4   : 00000001 00000000 00000000 00000000
$ 8   : 815bb3c4 00000000 e959bc01 00000000
$12   : 00000002 00000000 801b1ce0 00000004
$16   : 81210180 81da6800 81da6958 81157380
$20   : 81157384 00000000 20000000 81aafc9c
$24   : 00000000 8017af8c
$28   : 81aae000 81aafbb8 81da6800 8015101c
Hi    : 00000000
Lo    : fffbb441
epc   : 80150d00 __ip_route_output_key+0x28c/0x105c     Tainted: P
ra    : 8015101c __ip_route_output_key+0x5a8/0x105c
Status: 1000fc03    KERNEL EXL IE
Cause : 00800008
BadVA : 00000090
PrId  : 00019374
Modules linked in: rssi_leds fuse usb_storage ar7240_gpio ath_pci ath_dev ath_dfs wlan_me wlan_xauth wlan_wep
wlan_tkip wlan_ccmp wlan_acl wlan_scan_sta wlan_scan_ap ath_rate_atheros ath_hal ubnt_poll wlan sd_mod pppoe
pppox ppp_mppe ppp_async ppp_generic slhc crc_ccitt vfat fat nls_iso8859_1 nls_cp437 scsi_mod nls_base ag7240_eth
michael_mic md5 des aes
Process infctld (pid: 310, threadinfo=81aae000, task=813477f0)
Stack : 81036da0 00000010 81aafc98 8136438c 00010500 00000000 00000000 00000000
        00000006 00000001 e959bc01 0a6600fa 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000001 00000001 00000000 00000002 81210180
        00000000 000002bc 81aafc9c 81aafc98 81aafd50 81aafd50 81989e40 81989e40
        ...
Call Trace:
 [<80151aec>] ip_route_output_flow+0x1c/0x78
 [<80095d34>] proc_alloc_inode+0x20/0x74
 [<80184d4c>] ip_mc_find_dev+0x50/0x16c
 [<800829e4>] alloc_inode+0x30/0x148
 [<80187188>] ip_mc_join_group+0xa8/0x1f0
 [<8015e2b8>] ip_setsockopt+0xa48/0xcf4
 [<8015d968>] ip_setsockopt+0xf8/0xcf4
 [<80077668>] __link_path_walk+0x930/0xe2c
 [<80077c50>] link_path_walk+0xec/0x1dc
 [<8004e0a4>] __alloc_pages+0x74/0x364
 [<80051840>] kmem_cache_alloc+0xa0/0xb8
 [<80061580>] get_unused_fd+0xbc/0x210
 [<80074ce8>] getname+0x34/0x148
 [<8004e0a4>] __alloc_pages+0x74/0x364
 [<80051840>] kmem_cache_alloc+0xa0/0xb8
 [<8011f38c>] lock_sock+0xc4/0xdc
 [<80080da0>] d_alloc+0x38/0x1b4
 [<800c417c>] sprintf+0x30/0x3c
 [<80065568>] get_empty_filp+0x60/0x110
 [<80120654>] sock_setsockopt+0x134/0x860
 [<801205f4>] sock_setsockopt+0xd4/0x860
 [<8011dbd0>] sock_map_fd+0x80/0x17c
 [<8011db88>] sock_map_fd+0x38/0x17c
 [<8011d2e0>] sys_setsockopt+0x80/0xc4
 [<8011d2e0>] sys_setsockopt+0x80/0xc4
 [<80061440>] filp_close+0x60/0xa4
 [<8000eb6c>] stack_done+0x20/0x40

Code: 00431021  00451021  a3a40011 <8c510068> 26320158  c2220158  24420001  e2220158  1040fffc
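
As a cross-check of the dump, the faulting instruction can be decoded by hand. The word in angle brackets on the Code: line is the instruction at epc. The sketch below splits it into MIPS I-type fields and recomputes the address it tried to read, using values copied from the register dump above. The helper name decode_itype is made up for illustration.

```python
# Decode the faulting MIPS instruction from the oops "Code:" line and
# recompute the address it tried to read. All constants are copied from
# the dump; decode_itype is a hypothetical helper name.

def decode_itype(word):
    """Split a 32-bit MIPS I-type instruction into its fields."""
    opcode = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F      # base register
    rt = (word >> 16) & 0x1F      # target register
    imm = word & 0xFFFF
    if imm & 0x8000:              # sign-extend the 16-bit offset
        imm -= 0x10000
    return opcode, rs, rt, imm

FAULTING_WORD = 0x8C510068        # the <...> word on the Code: line
REG_2 = 0x00000028                # $2 from the register dump
BAD_VA = 0x00000090               # BadVA from the dump
EPC = 0x80150D00                  # epc from the dump
EPC_OFFSET = 0x28C                # __ip_route_output_key+0x28c

opcode, rs, rt, imm = decode_itype(FAULTING_WORD)
effective_address = (REG_2 + imm) & 0xFFFFFFFF
function_start = EPC - EPC_OFFSET

print("opcode=0x%02x rs=$%d rt=$%d offset=0x%x" % (opcode, rs, rt, imm))
print("effective address: 0x%08x" % effective_address)
print("function start:    0x%08x" % function_start)
```

Opcode 0x23 is lw, so the instruction is a load of $17 from offset 0x68 off $2. With $2 holding 0x28, the load address is 0x90, which matches BadVA. Subtracting the symbol offset from epc also gives the start of __ip_route_output_key, which is the base needed to locate the same instruction in an objdump listing of the linked kernel.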

The OOPS above is typical. I verified this with a script that rebooted the router 100 times and saved the dmesg output each time it came back online.
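
The script itself is not reproduced here. A minimal sketch of the comparison step, assuming each saved dmesg capture is available as a string and that the epc line starts at the beginning of a line as in the dump above, could look like this. The function names are hypothetical.

```python
import re
from collections import Counter

# Matches e.g.
# "epc   : 80150d00 __ip_route_output_key+0x28c/0x105c     Tainted: P"
EPC_LINE = re.compile(r"^epc\s*:\s*([0-9a-f]{8})\s+(\S+)", re.MULTILINE)

def oops_signature(dmesg_text):
    """Return the symbol+offset from the epc line, or None if no oops is present."""
    match = EPC_LINE.search(dmesg_text)
    return match.group(2) if match else None

def tally(dmesg_texts):
    """Count how often each oops signature occurs across saved captures."""
    return Counter(oops_signature(text) for text in dmesg_texts)

# Example with two made-up captures: one containing the epc line from the
# dump above and one from a boot that did not oops.
crashed = "Oops[#1]:\nepc   : 80150d00 __ip_route_output_key+0x28c/0x105c     Tainted: P\n"
clean = "some unrelated boot messages\n"
for signature, count in tally([crashed, crashed, clean]).items():
    print(signature, count)
```

If nearly every capture yields the same symbol and offset, the crash is deterministic enough to chase with a disassembler.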

After much digging around in the Linux source and using objdump to disassemble route.o, I finally found the source line where the oops occurs. I tried all sorts of hacks to get around the problem until I noticed that the code was ifdeffed by CONFIG_IP_ROUTE_MULTIPATH_CACHED, an option that my kernel config patch turns on:

--- clean.SDK.UBNT.v5.1/openwrt/target/linux/ar71xx/config-2.6.15       2010-01-07 16:14:51.000000000 +0100
+++ SDK.UBNT.v5.1/openwrt/target/linux/ar71xx/config-2.6.15     2010-02-10 19:21:03.000000000 +0100
@@ -260,7 +260,14 @@
 # CONFIG_IP_FIB_TRIE is not set
 CONFIG_IP_FIB_HASH=y
 # CONFIG_IP_MULTIPLE_TABLES is not set
-# CONFIG_IP_ROUTE_MULTIPATH is not set
+
+CONFIG_IP_ROUTE_MULTIPATH=y
+CONFIG_IP_ROUTE_MULTIPATH_CACHED=y
+CONFIG_IP_ROUTE_MULTIPATH_RR=y
+CONFIG_IP_ROUTE_MULTIPATH_RANDOM=y
+CONFIG_IP_ROUTE_MULTIPATH_WRANDOM=y
+CONFIG_IP_ROUTE_MULTIPATH_DRR=y
+
 # CONFIG_IP_ROUTE_VERBOSE is not set
 # CONFIG_IP_PNP is not set
 # CONFIG_NET_IPIP is not set

So I tried a kernel without CONFIG_IP_ROUTE_MULTIPATH_CACHED but got the same result, which should be impossible, as the code is supposed to be compiled out. It turns out that the OpenWrt build system overwrites the kernel .config file on every build, even if it has been changed. To change the kernel configuration I had to edit SDK.UBNT.v5.1/openwrt/target/linux/ar71xx/config-2.6.15 instead.
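
A quick way to catch this kind of surprise is to compare the option's state in the target config with the .config the kernel was actually built from. The sketch below is one way to do that on the file contents. The function names are made up, and the two sample strings are illustrative fragments, not the real files.

```python
import re

def option_state(config_text, option):
    """Return the value if the option is set (e.g. 'y'), 'n' if it is
    explicitly marked 'is not set', or None if it is not mentioned."""
    set_re = re.compile(r"^%s=(.*)$" % re.escape(option))
    unset_line = "# %s is not set" % option
    state = None
    for line in config_text.splitlines():
        line = line.strip()
        match = set_re.match(line)
        if match:
            state = match.group(1)
        elif line == unset_line:
            state = "n"
    return state

def mismatches(target_config, built_config, options):
    """Map each option whose state differs to a (target, built) pair."""
    result = {}
    for option in options:
        wanted = option_state(target_config, option)
        actual = option_state(built_config, option)
        if wanted != actual:
            result[option] = (wanted, actual)
    return result

# Illustrative fragments: the target config disables the option, but the
# .config used for the build still has it enabled.
target = "CONFIG_IP_ROUTE_MULTIPATH=y\n# CONFIG_IP_ROUTE_MULTIPATH_CACHED is not set\n"
built = "CONFIG_IP_ROUTE_MULTIPATH=y\nCONFIG_IP_ROUTE_MULTIPATH_CACHED=y\n"
options = ["CONFIG_IP_ROUTE_MULTIPATH", "CONFIG_IP_ROUTE_MULTIPATH_CACHED"]
print(mismatches(target, built, options))
```

Matching on the option name followed directly by "=" keeps CONFIG_IP_ROUTE_MULTIPATH from being confused with CONFIG_IP_ROUTE_MULTIPATH_CACHED.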

After getting rid of CONFIG_IP_ROUTE_MULTIPATH_CACHED I was indeed unable to reproduce the problem, so I did what I should have done to begin with: Google "CONFIG_IP_ROUTE_MULTIPATH_CACHED oops".

As it turns out, CONFIG_IP_ROUTE_MULTIPATH_CACHED was known to be broken and it was eventually reworked completely. I'd like nothing better than to upgrade to a more modern kernel, but I can't, as Atheros has UBNT under a braindead NDA, so UBNT is forced to either fight Atheros or violate the GPL with their binary blobs for drivers.

I can certainly understand why UBNT doesn't want to fight with Atheros about getting rid of the NDA on the Linux drivers, as it can be hard for UBNT to see how having closed drivers can hurt them, but not being able to merge the drivers will mean that UBNT will have to maintain them in-house forever and we, the customers, are forced to use whatever ancient kernel UBNT ships.

I think this is an excellent example of why:

  1. Binary blobs are evil and a major hindrance to the use of Linux.
  2. The people who insist on foisting blobs on their customers will be the first against the wall when the revolution comes.

In the end my kernel config patch ended up being:

--- clean.SDK.UBNT.v5.1/openwrt/target/linux/ar71xx/config-2.6.15       2010-01-07 16:14:51.000000000 +0100
+++ SDK.UBNT.v5.1/openwrt/target/linux/ar71xx/config-2.6.15     2010-02-10 19:21:03.000000000 +0100
@@ -260,7 +260,10 @@
 # CONFIG_IP_FIB_TRIE is not set
 CONFIG_IP_FIB_HASH=y
 # CONFIG_IP_MULTIPLE_TABLES is not set
-# CONFIG_IP_ROUTE_MULTIPATH is not set
+
+CONFIG_IP_ROUTE_MULTIPATH=y
+# CONFIG_IP_ROUTE_MULTIPATH_CACHED is not set
+
 # CONFIG_IP_ROUTE_VERBOSE is not set
 # CONFIG_IP_PNP is not set
 # CONFIG_NET_IPIP is not set

© Flemming Frandsen