Ticket #1020 (closed defect: fixed)
aptitude sometimes spins forever when in --download-only mode
Reported by: | jdreed | Owned by: | |
---|---|---|---|
Priority: | high | Milestone: | Upstream Utopia |
Component: | -- | Keywords: | |
Cc: | | Fixed in version: | |
Upstream bug: | LP:975793 DebianBug:629266 | | |
Description (last modified by jdreed) (diff)
It may be relevant that when the problem does occur, it's always in the second invocation, when there aren't actually any files to download.
Change History
comment:2 Changed 13 years ago by jdreed
I forgot that granola is running sshd. Backtracing the wedged aptitude, which is
5661 ? Sl 64:06 aptitude --quiet --assume-yes --download-only dist-upgrade
#0 0x00007fbe3d2d981d in __libc_waitpid (pid=<value optimized out>, stat_loc=<value optimized out>, options=<value optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:41
#1 0x00007fbe3e7b9c63 in ExecWait(int, char const*, bool) () from /usr/lib/libapt-pkg.so.4.10
#2 0x00007fbe3e83c8be in pkgDPkgPM::RunScriptsWithPkgs(char const*) () from /usr/lib/libapt-pkg.so.4.10
#3 0x00007fbe3e844b05 in pkgDPkgPM::Go(int) () from /usr/lib/libapt-pkg.so.4.10
#4 0x00007fbe3e7d6f85 in pkgPackageManager::DoInstallPostFork(int) () from /usr/lib/libapt-pkg.so.4.10
comment:3 Changed 13 years ago by jdreed
Er, sorry, there's also:
31328 ? S 0:00 /bin/sh -c /usr/sbin/dpkg-preconfigure --apt || true
31329 ? R 0:00 /usr/bin/perl -w /usr/sbin/dpkg-preconfigure --apt
Looks like dpkg-preconfigure is being repeatedly called and failing. Over the past minute, I've seen at least 10 processes similar to the ones above. There's only ever one set in ps output, but they're appearing, terminating, and respawning, AFAICT. They do so fast enough that I can't even attach gdb in time. Anyone debugging should run "ps auxww" several times in a row, grepping for dpkg, and you'll see them.
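Since each dpkg-preconfigure process is so short-lived, a single ps snapshot can easily miss it. A minimal sketch of the repeated sampling described above; the function name `sample_procs`, the snapshot count, and the interval are illustrative assumptions, not anything shipped with auto-update.

```shell
#!/bin/bash
# Hypothetical helper: take COUNT snapshots of the process table, INTERVAL
# seconds apart, and print every line that matches PATTERN.
sample_procs() {
    local pattern=$1 count=${2:-20} interval=${3:-0.2} i
    for ((i = 0; i < count; i++)); do
        # Drop the grep process itself; "|| true" keeps an empty snapshot
        # from being treated as a failure.
        ps auxww | grep -- "$pattern" | grep -v grep || true
        sleep "$interval"
    done
}

# Example: sample_procs dpkg 50 0.1
```

Changing PIDs across snapshots would be consistent with the respawning behaviour described in this comment.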
comment:4 Changed 13 years ago by jdreed
- Summary changed from cron job to ensure auto-update doesn't get wedged to auto-update sits at "Writing extended state info"
comment:5 Changed 13 years ago by jdreed
- Priority changed from high to blocker
- Milestone changed from Fall 2011 to Natty Release
This is actually a release blocker, since the machines can't be fixed without intervention, and neither I nor hotline will be visiting every single cluster machine again. If we don't have a solution tomorrow, I propose we push out the release anyway, with the following addition to auto-update that gets dropped into cron.hourly:
#!/bin/bash
UPD_START=$(stat -c "%Y" /var/run/athena-nologin 2>/dev/null)
[ -z "$UPD_START" ] && exit 0
NOW=$(date +"%s")
ELAPSED=$(expr $NOW - $UPD_START)
if [ $ELAPSED -gt 3600 ]; then
    pkill -f athena-auto-update
    # (or maybe just reboot?)
fi
exit 0
comment:6 Changed 13 years ago by jdreed
Er, maybe add
[ "$(machtype -L)" = "debathena-cluster" ] || exit 0
at the top there, depending on whether we pkill or reboot. (Or maybe regardless?)
I tested killing the proc on w20-575-2 when it was wedged, and rebooting is fine, since the "aptitude install" stage of auto-update will get things going again on the next invocation.
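For reference, a sketch of what the cron.hourly watchdog might look like with the cluster-only guard folded in. The function names `stamp_age` and `watchdog` are illustrative; the stamp file, the one-hour limit, and the `machtype -L` check come from the comments above. This is not the script that was eventually shipped.

```shell
#!/bin/bash
# Sketch of the proposed watchdog with the machtype guard at the top.

# Print the age of FILE in seconds, or nothing if it does not exist.
stamp_age() {
    local start
    start=$(stat -c "%Y" "$1" 2>/dev/null) || return 0
    echo $(( $(date +%s) - start ))
}

watchdog() {
    local stamp=${1:-/var/run/athena-nologin} limit=${2:-3600} age
    # Only act on cluster machines; anything else is left alone.
    [ "$(machtype -L 2>/dev/null)" = "debathena-cluster" ] || return 0
    age=$(stamp_age "$stamp")
    [ -z "$age" ] && return 0          # no update in progress
    if [ "$age" -gt "$limit" ]; then
        pkill -f athena-auto-update    # or reboot, as discussed above
    fi
    return 0
}

watchdog "$@"
```

On a machine without machtype, or one that is not debathena-cluster, the script exits without touching anything.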
comment:7 Changed 13 years ago by jdreed
- Owner set to jdreed
- Status changed from new to accepted
Geoff identified the code that breaks, but we still don't know why it gets called.
A horrible hack was committed and pushed out in auto-update 1.31.
comment:8 Changed 13 years ago by jdreed
Fixed less stupidly and more functionally in auto-update 1.32, which just got pushed out. Keeping this open until we have a fix for the actual bug. Geoff notes that this is DebianBug:629266, and I concur.
comment:9 Changed 13 years ago by jdreed
- Priority changed from blocker to high
- Summary changed from auto-update sits at "Writing extended state info" to aptitude sometimes spins forever when in --download-only mode
- Description modified (diff)
- Milestone changed from Natty Release to Fall 2011
comment:10 Changed 13 years ago by jdreed
The upstream bug appears to be going nowhere fast, and every time I try to debug this problem, I can't reproduce it. We should probably focus our efforts on a non-crappy version of athena-auto-update or something. Or consider using apt-get to do the downloading, since we really only care about aptitude for its dependency resolver. Or will that be harder?
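A minimal sketch of the apt-get alternative floated above, assuming the same three flags as the aptitude invocation in this ticket (apt-get accepts all of them). The wrapper name `download_upgrades` and the `APT_GET` override are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical wrapper: let apt-get do the download-only pass instead of
# aptitude.
download_upgrades() {
    # APT_GET can be overridden (e.g. APT_GET="echo apt-get") for a dry run.
    ${APT_GET:-apt-get} --quiet --assume-yes --download-only dist-upgrade
}

# Example dry run: APT_GET="echo apt-get" download_upgrades
```

One caveat: apt-get and aptitude use different dependency resolvers, so the set of packages apt-get downloads may not be exactly the set aptitude later decides to install.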
comment:11 Changed 12 years ago by jdreed
This apparently got fixed on May 5, but we may not see it until Quantal?
comment:12 Changed 12 years ago by jdreed
I seem to be encountering this more on my Precise VM. Do we want to continue sucking it up, or try and get this SRU'd to Precise, or what?
comment:14 Changed 12 years ago by jdreed
AFAICT, I can eliminate the problem by commenting out the only line in /etc/apt/apt.conf.d/70debconf, which wants to run dpkg-preconfigure. Is it reasonable to do that during an auto-update? Certainly it's less klunky than our timeout(1) solution.
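For comparison, the timeout(1) approach mentioned above can be sketched as follows. This is not the actual auto-update code; `run_bounded` and the 3600-second figure in the example are illustrative assumptions. GNU timeout exits with status 124 when the limit is reached.

```shell
#!/bin/bash
# Sketch: bound a command's run time with timeout(1) from GNU coreutils.
run_bounded() {
    local limit=$1 status=0
    shift
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Example (not run here):
#   run_bounded 3600 aptitude --quiet --assume-yes --download-only dist-upgrade
```

If the wedged process ignores SIGTERM, timeout's `-k` option can follow up with SIGKILL after a grace period.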
comment:15 Changed 12 years ago by jdreed
- Status changed from accepted to committed
So, I encountered a borked auto-update, and ln -nsf'd dpkg-preconfigure to /bin/true, and auto-update picked up and continued on normally. This implies it's either more subtle than the original upstream bug, or there were two bugs. I've gone ahead and inhibited pre-configuring during auto-update -- since it's unattended, it's pointless anyway.
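The ticket does not show how auto-update inhibits pre-configuring. One possible approach, sketched here under the assumption that the hook is a single line mentioning dpkg-preconfigure (as in /etc/apt/apt.conf.d/70debconf), is to comment that line out; apt.conf treats `//` as a comment leader. The function name `disable_preconfigure` is hypothetical.

```shell
#!/bin/bash
# Sketch: comment out every dpkg-preconfigure hook line in an apt.conf
# fragment. Lines that are already commented are left untouched, so the
# function is safe to run more than once.
disable_preconfigure() {
    local conf=$1
    sed -i -e '\|^[[:space:]]*//|!s|^\(.*dpkg-preconfigure.*\)$|// \1|' "$conf"
}

# Example (on a copy, not the live file):
#   cp /etc/apt/apt.conf.d/70debconf /tmp/70debconf
#   disable_preconfigure /tmp/70debconf
```

Editing the packaged file in place would be reverted or flagged on the next debconf upgrade, so an unattended job would probably want a less invasive override.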
comment:16 Changed 12 years ago by jdreed
Nope, that doesn't fix it. Apparently aptitude is just broken. "yay"
comment:17 Changed 12 years ago by jdreed
Actually, that did make it continue far enough to get to the post-invoke scripts, where it also failed. So let's just disable everything in download mode and see what happens, because why not.
comment:19 Changed 12 years ago by jdreed
Nope, aptitude is still sitting in DoInstallPostFork, despite the fact that there's nothing to do.
comment:20 Changed 12 years ago by jdreed
And the borked version is still in Quantal. Someone should get upstream to take 0.6.7 into Quantal. Or we can wait until April 2013, whatever.
comment:22 Changed 12 years ago by jdreed
- Upstream bug changed from LP:975793 Debian:629266 to LP:975793 DebianBug:629266
comment:23 Changed 12 years ago by jdreed
- Status changed from new to closed
- Resolution set to fixed
According to the LP bug, Quantal took the new version. I see no reason to switch back to aptitude for auto-update/install, however.
auto-update is now wedged on granola in a similar state.
In each case, it fails inside "aptitude --quiet --assume-yes --download-only dist-upgrade" at
Writing extended state information....
In this case, it merely wants to upgrade gdm-config, which shouldn't be a hard transaction. This possibly points to an internal error in aptitude, especially since we're just asking it to download, which is not a hard operation.
See /mit/jdreed/Public/granola-update.log for what it looks like right now.