Ticket #1195 (new defect)
ssh to athena.dialup.mit.edu fails when keytab obtained doesn't match ssh machine
Reported by: | kchen | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | The Distant Future |
Component: | linerva | Keywords: | transition |
Cc: | | Fixed in version: | |
Upstream bug: | https://bugzilla.mindrot.org/show_bug.cgi?id=1008 | | |
Description
(geofft's writeup on debathena@… on April 17, 2012)
Comcast's residential DNS IPs (75.75.75.75 and 75.75.76.76) are anycast addresses for multiple machines. Because *.dialup.mit.edu is load-balanced at the DNS level, there's no guarantee that two successive client lookups (from a stub resolver) will hit the same server; they are more likely than not to give you two different results, because the first DNS server you hit is happily caching your result while you talk to the second DNS server.
This means that if you open an SSH connection to athena.dialup.mit.edu and ssh then asks the Kerberos libraries to get tickets for athena.dialup.mit.edu, the Kerberos libraries are likely to canonicalize the name to a different server than the one the existing SSH connection reached and give you a service ticket for the wrong dialup, leading to various random but probable failures setting up a Kerberos connection.
Combined with the underlying issue behind Trac #315 where a failed keyex aborts the connection, as opposed to falling back to another keyex method (like checking the server's RSA key), this manifests as athena.dialup giving "Connection closed" messages more often than not from a Debathena box on a Comcast residential connection if the user has active tickets.
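The mismatch described above can be modelled roughly as follows. This is only a sketch: the pool addresses and hostnames are made up (the real dialup members are not listed in this ticket), the resolver is a simple round-robin stand-in for the anycast behaviour, and `service_principal` only imitates the forward-then-reverse lookup that rdns = true implies rather than calling any Kerberos library.

```python
import itertools

# Hypothetical pool members; not the real dialup hosts.
POOL = {
    "192.0.2.1": "dialup-a.example.edu",
    "192.0.2.2": "dialup-b.example.edu",
}


def round_robin_resolver(addresses):
    """Return a forward-lookup function that hands out addresses in turn."""
    cycle = itertools.cycle(addresses)

    def resolve(_hostname):
        return next(cycle)

    return resolve


def service_principal(hostname, forward, reverse):
    """Imitate rdns = true: forward lookup, then reverse lookup of the result."""
    return "host/" + reverse(forward(hostname))


def simulate_login(alias, forward, reverse):
    """Return (host ssh reached, principal Kerberos chose, whether they match)."""
    # ssh resolves the alias and connects to whatever address it gets.
    connected_host = reverse(forward(alias))
    # The Kerberos library then resolves the same alias again, independently.
    principal = service_principal(alias, forward, reverse)
    return connected_host, principal, principal == "host/" + connected_host


if __name__ == "__main__":
    forward = round_robin_resolver(list(POOL))
    print(simulate_login("athena.dialup.mit.edu", forward, POOL.__getitem__))
```

With a resolver that returns the same address for both lookups (which is what a local caching resolver gives you), the two names agree and the ticket matches the machine ssh is talking to.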
I recommend we give some thought to one or more of
1) giving up on GSSAPIKeyExchange (#315 is almost excuse enough, but
combining it with #787 and this issue makes it something of a real
problem), at least until we can teach SSH to do keyex fallback
2) changing athena.dialup from DNS-level load balancing to IP-level, by
assigning a distinct IP to athena.dialup and having every athena.dialup
host include the host/athena.dialup.mit.edu key in its keytab in
addition to its own (and turning off the GSSAPIStrictAcceptorCheck);
this is basically the configuration of the scripts.mit.edu pool, except
that ssh connections aren't actually load-balanced.
This would also have some UI benefits for non-GSSAPI users, since they
would only get a host key prompt/warning for the single IP once.
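To illustrate why option 2 needs both the shared key and the relaxed acceptor check, here is a toy model of the server-side decision. It is an assumption-laden sketch, not sshd's code: `acceptor_accepts` and the hostname `dialup-a.example.edu` are invented for illustration, and the strict/relaxed behaviour follows the documented meaning of GSSAPIStrictAcceptorCheck (strict: only the host service for the current hostname; relaxed: any service key in the keytab).

```python
def acceptor_accepts(ticket_principal, local_hostname, keytab, strict):
    """Decide whether a server would accept a ticket for ticket_principal.

    keytab is the set of principals the server holds keys for; strict
    models GSSAPIStrictAcceptorCheck yes.
    """
    if ticket_principal not in keytab:
        # Without the matching key the ticket cannot be decrypted at all.
        return False
    if strict:
        # Strict mode: only the host service for this machine's own name.
        return ticket_principal == "host/" + local_hostname
    # Relaxed mode: any key present in the keytab is acceptable.
    return True


if __name__ == "__main__":
    keytab = {"host/dialup-a.example.edu", "host/athena.dialup.mit.edu"}
    shared = "host/athena.dialup.mit.edu"
    print(acceptor_accepts(shared, "dialup-a.example.edu", keytab, strict=True))
    print(acceptor_accepts(shared, "dialup-a.example.edu", keytab, strict=False))
```

In this model, a ticket for the shared host/athena.dialup.mit.edu principal is only accepted when the key is in the keytab and the strict check is off, which is the combination option 2 asks for.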
Change History
comment:2 Changed 12 years ago by kchen
- Summary changed from Kerberized ssh to athena.dialup.mit.edu fails when keytab obtained doesn't match ssh machine to ssh to athena.dialup.mit.edu fails when keytab obtained doesn't match ssh machine
comment:3 Changed 12 years ago by geofft
Here's what appears to be the upstream bug for this issue: https://bugzilla.mindrot.org/show_bug.cgi?id=1008
There are a couple of patches there, all with caveats.
comment:4 Changed 12 years ago by adehnert
- Upstream bug set to https://bugzilla.mindrot.org/show_bug.cgi?id=1008
comment:6 Changed 12 years ago by geofft
That upstream patch has already been included in distros for a while; it's just a matter of adding GSSAPITrustDNS yes to debathena-ssh-client-config. Given that we don't set rdns = false in our krb5.conf (the default is true), and given that GSSAPI on Debathena means Kerberos, it doesn't seem particularly harmful to make SSH itself do the canonicalization since the Kerberos library will do so, anyway.
Does making this change sound good to everyone? (For what it's worth, remctl also hard-codes the moral equivalent of GSSAPITrustDNS yes.)
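For concreteness, the client-side change would amount to something like the fragment below. This is a guess at the shape, not the actual contents of debathena-ssh-client-config: the Host pattern is an assumption, and the GSSAPIKeyExchange and GSSAPITrustDns options exist only in OpenSSH builds that carry the GSSAPI key exchange patch (as the distro packages do).

```
# Hypothetical ssh_config fragment; the Host pattern is an assumption.
Host *.mit.edu
    GSSAPIAuthentication yes
    GSSAPIKeyExchange yes
    # Let ssh canonicalize the hostname before asking for a service ticket.
    GSSAPITrustDns yes
```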
comment:7 Changed 12 years ago by andersk
I’m very skeptical about the idea of loosening a default security setting, no matter what arguments you have that other different commands may already have analogously loose default settings. Is this even still an issue now that NetworkManager does DNS caching in precise and higher?
comment:8 follow-up: ↓ 12 Changed 12 years ago by kchen
This is no longer an issue for me personally because of Precise and its default caching resolver, and because my new router also forces a caching resolver on me. There is an edge case that probably doesn't matter in practice, though, which is when the 30 second TTL is expiring, this issue could come up.
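The TTL edge case can be sketched with a toy caching resolver. Everything here is illustrative: `CachingResolver` is a made-up class, not NetworkManager's or any real resolver's implementation, and the upstream function and clock are injected so the behaviour can be checked without touching the network.

```python
class CachingResolver:
    """Toy caching resolver: reuse an answer until its TTL has elapsed."""

    def __init__(self, upstream, ttl, clock):
        self.upstream = upstream  # function: hostname -> address
        self.ttl = ttl            # seconds an answer stays cached
        self.clock = clock        # function returning the current time
        self.cache = {}           # hostname -> (address, expiry time)

    def resolve(self, hostname):
        now = self.clock()
        cached = self.cache.get(hostname)
        if cached is not None and now < cached[1]:
            # Still fresh: every caller sees the same address.
            return cached[0]
        # Expired or never seen: ask upstream, which may rotate the answer.
        address = self.upstream(hostname)
        self.cache[hostname] = (address, now + self.ttl)
        return address
```

Two lookups inside the same 30-second window return the same address, so ssh and the Kerberos library agree. If the first lookup lands just before expiry and the second just after, the second goes upstream again and may come back with a different pool member.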
comment:9 Changed 12 years ago by geofft
> I’m very skeptical about the idea of loosening a default security setting, no matter what arguments you have that other different commands may already have analogously loose default settings.
The argument is that the _same_ command has an analogously loose default setting -- hostnames already get canonicalized by the GSSAPI/Kerberos layer, and the only reason GSSAPITrustDNS doesn't default to yes is that you might want to turn off canonicalization at the GSSAPI layer and wouldn't expect SSH to then go and turn it back on for you.
I'm happy to revert this proposed change as soon as MIT Kerberos stops defaulting rdns to true, or as soon as debathena-kerberos-config overrides it.
comment:10 Changed 12 years ago by ghudson
Setting rdns=false does not substantially improve the security of Kerberos. With rdns=false, we still do forward resolution, which allows an attacker to spoof the result using CNAME records.
I don't really foresee MIT krb5 changing the rdns default. We hate the reverse resolution step because of the way it affects the usability of new deployments, but we think changing the default would cause significant problems for some existing deployments.
Our long-term plan for this involves two significant changes. First, we want to make the KDC able to perform canonicalization of host-based service principals using its own NSS configuration (which could involve a local resolver backed by a securely updated copy of the zone file). Second, we want the KDC to be able to tell the client at AS-REP time that it supports canonicalization; the client would then refrain from doing any NSS-based canonicalization of service principals when making TGS requests with those credentials. I don't think Debathena will be able to take advantage of this for quite a while, though.
comment:11 Changed 12 years ago by geofft
https://bugzilla.redhat.com/show_bug.cgi?id=863350
may or may not be interesting.
comment:12 in reply to: ↑ 8 Changed 12 years ago by adehnert
Replying to kchen:
> This is no longer an issue for me personally because of Precise and its default caching resolver, and because my new router also forces a caching resolver on me. There is an edge case that probably doesn't matter in practice, though, which is when the 30 second TTL is expiring, this issue could come up.
Presumably the failure mode in this case is "rare mysterious error message that ~always goes away with a second try", rather than anything more persistent than that? (I guess the failure mode for this bug is ~always "mysterious error that goes away with half a dozen tries", so maybe that's not a huge improvement.)
Has there been any discussion of Geoff's option (2) with Ops?