I now have Quagga running and configuring multipath routes on AirOS 5.1, but shortly after the multipath routes are installed by Quagga the Linux kernel OOPSs, like this:
CPU 0 Unable to handle kernel paging request at virtual address 00000090, epc == 80150d00, ra == 8015101c Oops[#1]: Cpu 0 $ 0 : 00000000 7faa867c 00000028 00000020 $ 4 : 00000001 00000000 00000000 00000000 $ 8 : 815bb3c4 00000000 e959bc01 00000000 $12 : 00000002 00000000 801b1ce0 00000004 $16 : 81210180 81da6800 81da6958 81157380 $20 : 81157384 00000000 20000000 81aafc9c $24 : 00000000 8017af8c $28 : 81aae000 81aafbb8 81da6800 8015101c Hi : 00000000 Lo : fffbb441 epc : 80150d00 __ip_route_output_key+0x28c/0x105c Tainted: P ra : 8015101c __ip_route_output_key+0x5a8/0x105c Status: 1000fc03 KERNEL EXL IE Cause : 00800008 BadVA : 00000090 PrId : 00019374 Modules linked in: rssi_leds fuse usb_storage ar7240_gpio ath_pci ath_dev ath_dfs wlan_me wlan_xauth wlan_wep wlan_tkip wlan_ccmp wlan_acl wlan_scan_sta wlan_scan_ap ath_rate_atheros ath_hal ubnt_poll wlan sd_mod pppoe pppox ppp_mppe ppp_async ppp_generic slhc crc_ccitt vfat fat nls_iso8859_1 nls_cp437 scsi_mod nls_base ag7240_eth michael_mic md5 des aes Process infctld (pid: 310, threadinfo=81aae000, task=813477f0) Stack : 81036da0 00000010 81aafc98 8136438c 00010500 00000000 00000000 00000000 00000006 00000001 e959bc01 0a6600fa 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000001 00000000 00000002 81210180 00000000 000002bc 81aafc9c 81aafc98 81aafd50 81aafd50 81989e40 81989e40 ... 
Call Trace:
[<80151aec>] ip_route_output_flow+0x1c/0x78
[<80095d34>] proc_alloc_inode+0x20/0x74
[<80184d4c>] ip_mc_find_dev+0x50/0x16c
[<800829e4>] alloc_inode+0x30/0x148
[<80187188>] ip_mc_join_group+0xa8/0x1f0
[<8015e2b8>] ip_setsockopt+0xa48/0xcf4
[<8015d968>] ip_setsockopt+0xf8/0xcf4
[<80077668>] __link_path_walk+0x930/0xe2c
[<80077c50>] link_path_walk+0xec/0x1dc
[<8004e0a4>] __alloc_pages+0x74/0x364
[<80051840>] kmem_cache_alloc+0xa0/0xb8
[<80061580>] get_unused_fd+0xbc/0x210
[<80074ce8>] getname+0x34/0x148
[<8004e0a4>] __alloc_pages+0x74/0x364
[<80051840>] kmem_cache_alloc+0xa0/0xb8
[<8011f38c>] lock_sock+0xc4/0xdc
[<80080da0>] d_alloc+0x38/0x1b4
[<800c417c>] sprintf+0x30/0x3c
[<80065568>] get_empty_filp+0x60/0x110
[<80120654>] sock_setsockopt+0x134/0x860
[<801205f4>] sock_setsockopt+0xd4/0x860
[<8011dbd0>] sock_map_fd+0x80/0x17c
[<8011db88>] sock_map_fd+0x38/0x17c
[<8011d2e0>] sys_setsockopt+0x80/0xc4
[<8011d2e0>] sys_setsockopt+0x80/0xc4
[<80061440>] filp_close+0x60/0xa4
[<8000eb6c>] stack_done+0x20/0x40

Code: 00431021 00451021 a3a40011 <8c510068> 26320158 c2220158 24420001 e2220158 1040fffc
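The numbers in the dump can be cross-checked by hand. What follows is only a sketch of one way to read it, not necessarily how the analysis was actually done. It assumes the usual convention that the word in angle brackets on the Code: line is the instruction at epc. The helper names `function_base` and `decode_lw` are made up for this illustration.

```python
def function_base(address, offset):
    """Start address of a function, given an address inside it and its offset."""
    return address - offset


def decode_lw(word):
    """Decode a 32-bit MIPS 'lw rt, imm(base)' instruction word.

    Returns (rt, base, imm) as register numbers and a signed offset.
    Raises ValueError if the word is not an lw (opcode 0x23).
    """
    if (word >> 26) != 0x23:
        raise ValueError("not an lw instruction")
    base = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:          # sign-extend the 16-bit immediate
        imm -= 0x10000
    return rt, base, imm


# Values copied from the oops above.
epc, epc_off = 0x80150D00, 0x28C      # __ip_route_output_key+0x28c
ra, ra_off = 0x8015101C, 0x5A8        # __ip_route_output_key+0x5a8
reg2 = 0x00000028                     # third value on the "$ 0" register row
bad_va = 0x00000090

# Both epc and ra point into the same function.
print(hex(function_base(epc, epc_off)), hex(function_base(ra, ra_off)))

rt, base, imm = decode_lw(0x8C510068)
print(f"lw ${rt}, {imm:#x}(${base})")

# The load address is the base register plus the immediate.
print(hex(reg2 + imm), hex(bad_va))
```

The faulting instruction loads from offset 0x68 of whatever $2 points to, and $2 holds 0x28 rather than a valid kernel pointer, which gives the BadVA of 0x90. The dump alone does not say which structure that offset belongs to.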
The oops above is typical: I verified this with a script that rebooted the router 100 times and saved the dmesg output each time as soon as the router was back online.
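The reboot script itself is not shown here. As a rough sketch, the saved dmesg captures could be tallied by faulting location like this. The regular expression assumes the "epc : address symbol+offset/size" layout seen in the dump above, and the names `faulting_site` and `tally` are hypothetical.

```python
import re
from collections import Counter

# Matches the "epc : <addr> <symbol>+0x<off>/0x<size>" field of the oops.
# The "epc == ..." text on the first line has no colon, so it is skipped.
EPC_RE = re.compile(r"epc\s*:\s*[0-9a-f]{8}\s+(\S+)\+0x([0-9a-f]+)/0x[0-9a-f]+")


def faulting_site(dmesg_text):
    """Return 'symbol+0xoffset' for the first oops found, or None if there is none."""
    match = EPC_RE.search(dmesg_text)
    if match is None:
        return None
    return f"{match.group(1)}+0x{match.group(2)}"


def tally(captures):
    """Count how often each faulting site (or None for a clean boot) occurs."""
    return Counter(faulting_site(text) for text in captures)
```

If every capture that contains an oops reports the same site, that supports the claim that the crash is one reproducible bug rather than random memory corruption.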
After much digging around in the Linux source and using objdump to disassemble the route.o file, I finally found the source line where the oops occurs. I tried all sorts of hacks to get around the problem until I noticed that the code was ifdeffed by CONFIG_IP_ROUTE_MULTIPATH_CACHED, an option that my kernel config patch turns on:
--- clean.SDK.UBNT.v5.1/openwrt/target/linux/ar71xx/config-2.6.15 2010-01-07 16:14:51.000000000 +0100
+++ SDK.UBNT.v5.1/openwrt/target/linux/ar71xx/config-2.6.15 2010-02-10 19:21:03.000000000 +0100
@@ -260,7 +260,14 @@
 # CONFIG_IP_FIB_TRIE is not set
 CONFIG_IP_FIB_HASH=y
 # CONFIG_IP_MULTIPLE_TABLES is not set
-# CONFIG_IP_ROUTE_MULTIPATH is not set
+
+CONFIG_IP_ROUTE_MULTIPATH=y
+CONFIG_IP_ROUTE_MULTIPATH_CACHED=y
+CONFIG_IP_ROUTE_MULTIPATH_RR=y
+CONFIG_IP_ROUTE_MULTIPATH_RANDOM=y
+CONFIG_IP_ROUTE_MULTIPATH_WRANDOM=y
+CONFIG_IP_ROUTE_MULTIPATH_DRR=y
+
 # CONFIG_IP_ROUTE_VERBOSE is not set
 # CONFIG_IP_PNP is not set
 # CONFIG_NET_IPIP is not set
So I tried a kernel without CONFIG_IP_ROUTE_MULTIPATH_CACHED, but got the same result, which should be impossible, as the code is supposed to be compiled out. It turns out that the OpenWrt build system overwrites the kernel .config file on every build, even if it has been changed. To change the kernel configuration I had to edit SDK.UBNT.v5.1/openwrt/target/linux/ar71xx/config-2.6.15 instead.
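A check like the following would have caught the overwritten .config sooner. It is a minimal sketch that compares option states between two kernel config texts, for example the template under target/linux/ar71xx and the .config the build actually used. The function names are hypothetical, and where the effective .config lives in the build tree is left to the caller.

```python
def option_state(config_text, name):
    """Return the value of a kernel config option ('y', 'm', ...) or None.

    None covers both '# NAME is not set' and the option being absent.
    If the option appears more than once, the last occurrence wins.
    """
    state = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == f"# {name} is not set":
            state = None
        elif line.startswith(f"{name}="):
            state = line.split("=", 1)[1]
    return state


def mismatches(wanted_text, built_text, names):
    """Map each option whose state differs to a (wanted, built) pair."""
    result = {}
    for name in names:
        wanted = option_state(wanted_text, name)
        built = option_state(built_text, name)
        if wanted != built:
            result[name] = (wanted, built)
    return result
```

Reading both files after a build and passing their contents to `mismatches` with the multipath option names shows at a glance whether a change actually reached the kernel that was compiled.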
After getting rid of CONFIG_IP_ROUTE_MULTIPATH_CACHED I was indeed unable to reproduce the problem, so I did what I should have done to begin with: Google "CONFIG_IP_ROUTE_MULTIPATH_CACHED oops".
As it turns out CONFIG_IP_ROUTE_MULTIPATH_CACHED was known to be broken and it was eventually reworked completely. I'd like nothing better than to upgrade to a more modern kernel, but I can't as Atheros has UBNT under a braindead NDA so UBNT is forced to either fight Atheros or violate the GPL with their binary blobs for drivers.
I can certainly understand why UBNT doesn't want to fight with Atheros about getting rid of the NDA on the Linux drivers, as it can be hard for UBNT to see how having closed drivers can hurt them, but not being able to merge the drivers will mean that UBNT will have to maintain them in-house forever and we, the customers, are forced to use whatever ancient kernel UBNT ships.
I think this is an excellent example of why:
In the end my kernel config patch ended up being:
--- clean.SDK.UBNT.v5.1/openwrt/target/linux/ar71xx/config-2.6.15 2010-01-07 16:14:51.000000000 +0100
+++ SDK.UBNT.v5.1/openwrt/target/linux/ar71xx/config-2.6.15 2010-02-10 19:21:03.000000000 +0100
@@ -260,7 +260,10 @@
 # CONFIG_IP_FIB_TRIE is not set
 CONFIG_IP_FIB_HASH=y
 # CONFIG_IP_MULTIPLE_TABLES is not set
-# CONFIG_IP_ROUTE_MULTIPATH is not set
+
+CONFIG_IP_ROUTE_MULTIPATH=y
+# CONFIG_IP_ROUTE_MULTIPATH_CACHED is not set
+
 # CONFIG_IP_ROUTE_VERBOSE is not set
 # CONFIG_IP_PNP is not set
 # CONFIG_NET_IPIP is not set