michael orlitzky

There was an attempt to save Linux filesystem ACLs

posted 2019-03-16

(Wherein we find a proposal for ACLs in POSIX that are better than POSIX ACLs.)

What's a Linux filesystem ACL?

Access Control Lists (ACLs) are a flexible way to grant permissions on files within an operating system. There are three main types you have probably encountered:

Windows file permissions: If you right-click on a file, choose Properties, and then the Security tab, the stuff you see there is the access control list. Some people can do some stuff, other people can't do other stuff. These are presumably described somewhere in the Windows Access Control documentation.
macOS file permissions: These borrow from another filesystem, version 4 of the Network File System (NFSv4). As a result, they are called “NFSv4 ACLs.” This is not the hard part of the article. They are specified in RFC 7530.
POSIX.1e ACLs: These are usually called just “POSIX ACLs,” because they are not a part of the POSIX standard. They appeared in a draft called POSIX.1e, but that draft was never ratified. Nevertheless, they're what's implemented for Linux. So if you've used filesystem ACLs on Linux, you used these guys.

The first two are essentially equivalent. The design of NFSv4 ACLs was based on Windows ACLs, and both of them are better than the POSIX.1e ACLs that we currently have on Linux. Moreover, the Windows and NFSv4 ACLs are interoperable: permissions can be accurately mapped back and forth between Windows and macOS systems. Not so on Linux, because POSIX.1e ACLs aren't expressive enough. To remedy that, the RichACLs project aims to bring NFSv4 ACLs to Linux.

The problem

Nobody uses POSIX ACLs on Linux, for two reasons:

they're too complicated
they don't work

And both of these problems are the result of one dumb design decision: abusing the group permission bits to store something other than group permissions. The specification for POSIX ACLs starts out great. If you want to grant some user permission to a file, then you add an ACL to that file that says what he can do. For backwards-compatibility with the standard UNIX permission bits, the owner and other permission bits get interpreted as special ACLs that do exactly what the permission bits did:

The permissions specified by the file owner class permission bits correspond to the permissions associated with the ACL_USER_OBJ entry… The permissions specified by the file other class permission bits correspond to the permissions associated with the ACL_OTHER entry.

So far so good. If I sat down right now to write an ACL specification, that might be what I would come up with. And then,

The permissions specified by the file group class permission bits correspond to the permissions associated with the ACL_GROUP_OBJ entry or the permissions associated with the ACL_MASK entry if the ACL contains an ACL_MASK entry.

Derp. That says that the group permission bits might not be group permission bits if an invisible ACL entry is present. In practice, an ACL_MASK entry is always present if the file has an ACL, so the group permission bits always represent a “permissions mask” rather than group permissions on files with ACLs. But, not all files have ACLs. Thus the meaning of the group permission bits changes when the file acquires an ACL.
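
To make the flip concrete, here is a minimal Python sketch of that one sentence of the draft. The dict layout and the function name group_class_bits are made up for illustration; only the entry names come from the draft.

```python
R, W, X = 4, 2, 1  # the usual rwx bit values

def group_class_bits(acl):
    """Return what the file's group permission bits stand for.

    `acl` is a hypothetical dict from entry tag to permission bits.
    Following the draft text quoted above: ACL_MASK if the ACL has one,
    ACL_GROUP_OBJ otherwise.
    """
    if "ACL_MASK" in acl:
        return acl["ACL_MASK"]
    return acl["ACL_GROUP_OBJ"]

# No extended ACL: the group bits are the owning group's permissions.
plain = {"ACL_USER_OBJ": R | W, "ACL_GROUP_OBJ": R, "ACL_OTHER": R}

# Same file after granting a named user rwx. A mask entry now exists, so
# the group bits report the mask, not what the owning group may do.
with_acl = {**plain, ("ACL_USER", "apache"): R | W | X, "ACL_MASK": R | W | X}

print(group_class_bits(plain))     # 4, i.e. r--
print(group_class_bits(with_acl))  # 7, i.e. rwx, though the group still has r--
```

Same file, same owning group, same group entry, and ls now shows different group bits.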

This mistake is so mistaken that it has a name. In database land, it's called a “polymorphic association,” and is the focus of Chapter 7 of the book SQL Antipatterns by Bill Karwin. If you're not familiar with the term, “antipattern” means that it's the opposite of something you should do. I can also cite my (Second Edition) copy of Code Complete by Steve McConnell, which says, in big kindergarten letters, in the section titled Using Each Variable for Exactly One Purpose,

Use each variable for one purpose only… avoid variables with hidden meanings… even if the double use is clear to you, it won't be to someone else.

The Common Weakness Enumeration project calls this innovation a Multiple Interpretation Error or an Interpretation Conflict.

They're too complicated

A thorough understanding of filesystem permissions is essential to your security, because on “everything is a file” UNIX, they are your security. However, under threat of the group-bits mask, your existing knowledge of UNIX permissions is no longer valid. This is terrifying for novices, who don't want to learn a new set of complicated rules. This is terrifying for experts, who know that a system has to be easily understood to be secure. And the complexity comes not only from the group bits: almost half of the access check algorithm in the acl(5) man page is special cases for the goddamned mask!
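
For a feel of how much of that algorithm is mask handling, here is a rough Python sketch of the access check as acl(5) describes it. The Acl class, the field names, and the uid/gid numbers are all hypothetical; this is a model for reading along, not kernel code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

R, W, X = 4, 2, 1

@dataclass
class Acl:
    """Hypothetical in-memory ACL; not a type from any real library."""
    owner_uid: int
    owner_gid: int
    user_obj: int                      # ACL_USER_OBJ
    group_obj: int                     # ACL_GROUP_OBJ
    other: int                         # ACL_OTHER
    mask: Optional[int] = None         # ACL_MASK, if present
    users: Dict[int, int] = field(default_factory=dict)   # ACL_USER: uid -> perms
    groups: Dict[int, int] = field(default_factory=dict)  # ACL_GROUP: gid -> perms

def access_ok(acl, uid, gids, want):
    """Sketch of the acl(5) access check for one process and one request."""
    def has(perms):
        return perms & want == want

    if uid == acl.owner_uid:
        return has(acl.user_obj)
    if uid in acl.users:
        # Mask special case: named users only get what the mask allows.
        # (A mask is mandatory whenever named entries exist.)
        return has(acl.users[uid] & acl.mask)
    matching = [acl.groups[g] for g in gids if g in acl.groups]
    if acl.owner_gid in gids:
        matching.append(acl.group_obj)
    if matching:
        if acl.mask is not None:
            # Mask special case: every group entry is filtered too.
            return any(has(p & acl.mask) for p in matching)
        # No mask means no named groups, so only the owning group matched.
        return has(acl.group_obj)
    return has(acl.other)

# uid 33 stands in for "apache"; the numbers are made up.
acl = Acl(owner_uid=1000, owner_gid=100, user_obj=R | W, group_obj=R | X,
          other=R, mask=R, users={33: R | W | X})
print(access_ok(acl, 33, {33}, R))  # True
print(access_ok(acl, 33, {33}, W))  # False: the entry says rwx, the mask says r--
```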

They don't work

Since POSIX ACLs redefine the meaning of the group permission bits, any tool that treats group permission bits like group permission bits is going to dick up your ACLs. For example, the cp program breaks default ACLs, because it tries to ensure that the target's group permission bits (which are no longer group permission bits, after the default ACL is applied) match the source. The end result is that all of your default ACLs on the target directory get reduced to whatever the group bits allowed on the source file:

user $ mkdir acl
user $ cd acl
user $ setfacl --default -m user:apache:rwx .
user $ cp /etc/profile ./
user $ getfacl --omit-header ./profile
user:apache:rwx #effective:r--
group::r-x #effective:r--
mask::r--
other::r--
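
The “#effective” comments are just each entry ANDed with the mask. Here is a minimal Python sketch of that arithmetic, assuming /etc/profile has mode 0644 (the transcript doesn't say, but that would match the r-- mask above). The helper names are made up.

```python
def to_bits(s):
    """'r-x' -> 5"""
    return sum(bit for ch, bit in zip(s, (4, 2, 1)) if ch != "-")

def to_str(bits):
    """5 -> 'r-x'"""
    return "".join(ch if bits & bit else "-" for ch, bit in zip("rwx", (4, 2, 1)))

def effective(entries, mask):
    """AND every masked entry with the mask, as getfacl's #effective does."""
    return {tag: to_str(to_bits(perms) & mask) for tag, perms in entries.items()}

source_mode = 0o644               # assumed mode of /etc/profile
mask = (source_mode >> 3) & 0o7   # the source's group bits end up as the mask

# The entries the default ACL wanted the new file to have.
wanted = {"user:apache": "rwx", "group:": "r-x"}
result = effective(wanted, mask)
for tag, perms in wanted.items():
    print(f"{tag}:{perms} #effective:{result[tag]}")
```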

This is usually merely annoying, but can also lead to security vulnerabilities. For example, a program might remove a mask thinking that it's only loosening the group permissions, when in reality all permissions are loosened. The CWE's description of an Interpretation Conflict sums this up nicely:

Product A handles inputs or steps differently than Product B, which causes A to perform incorrect actions based on its perception of B's state.

Yup.

Other fundamental utilities like mkdir and tar exhibit the same problem, and they can't be fixed. Each program would need to understand how to undo the ACL mask in a way that doesn't compromise security. This can be done—the apply-default-acl program implements it—but it's far too much security-sensitive code to copy & paste into every program that calls chmod(2). No one's going to do it so long as only a tiny fraction of users use POSIX ACLs. And only a tiny fraction of users will ever use POSIX ACLs, because POSIX ACLs are useless if I can't use cp to copy files into a directory with ACLs.

How RichACLs will fix everything

RichACLs are a brand-new implementation of the superior NFSv4-style ACLs on Linux that nobody is going to use, for two reasons:

they're even more complicated than POSIX ACLs
they don't work either

RichACLs incorporate all of the nice features of Windows and NFSv4 ACLs, but they also borrow you-know-what from POSIX ACLs. And they've gone full retard: with RichACLs, all of the traditional permission bits can act as masks, and everything is controlled by metadata. Watch the richacl(7) man page try to explain this shit:

RichACLs consist of a number of ACL entries, three file masks, and a set of flags specifying attributes of the ACL as a whole (by contrast with the per-ACL-entry flags described below)…

The owner, group, and other file masks further control which permissions the ACL grants, subject to the masked (m) and write_through (w) ACL flags: when the permissions of a file or directory are changed with chmod(2), the file masks are set based on the new file mode, and the masked and write_through ACL flags are set. Likewise, when a new file or directory inherits an ACL from its parent directory, the file masks are set to the intersection between the permissions granted by the inherited ACL and the mode parameter as given to open(2), mkdir(2), and similar, and the masked ACL flag is set. In both cases, the file masks limit the permissions that the ACL will grant…

masked (m)

When set, the file masks define upper limits on the permissions the ACL may grant. When not set, the file masks are ignored.

write_through (w)

When this flag and the masked flag are both set, the owner and other file masks define the actual permissions granted to the file owner and to others instead of defining an upper limit. When the masked flag is not set, the write_through flag has no effect.

If you have no idea what you just read: good, you are perhaps a sane and rational individual. I don't actually know what the fuck is going on, but I'm pretty sure it's more complicated than it used to be with the POSIX ACLs that were already too complicated. At the moment, the richacl git repository contains a separate 18KiB richaclex(7) man page that “…shows how they interact with the POSIX file permission bits.” Okay.
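
For what it's worth, the most literal reading of those two flag descriptions comes out something like the Python sketch below. It is a toy, not the algorithm from richacl(7): real RichACL permissions are finer-grained than rwx, and the function name, the who argument, and the dict of masks are all invented for illustration.

```python
R, W, X = 4, 2, 1  # stand-ins; real RichACL permissions are finer-grained

def richacl_effective(who, acl_grants, file_masks, masked, write_through):
    """Toy model of the masked and write_through flags.

    who: 'owner', 'group', or 'other' (the file class the caller falls in).
    acl_grants: what the ACL entries alone would grant the caller.
    file_masks: dict with one mask per file class.
    """
    if not masked:
        # "When not set, the file masks are ignored."
        return acl_grants
    if write_through and who in ("owner", "other"):
        # The owner and other masks *are* the permissions, not a limit.
        return file_masks[who]
    # Otherwise the mask is an upper limit on what the ACL may grant.
    return acl_grants & file_masks[who]

masks = {"owner": R | W, "group": R, "other": 0}
print(richacl_effective("group", R | W | X, masks, masked=False, write_through=False))  # 7
print(richacl_effective("group", R | W | X, masks, masked=True, write_through=True))    # 4
print(richacl_effective("owner", R, masks, masked=True, write_through=True))            # 6
```

Note the last call: with both flags set, the owner ends up with more than the ACL entries granted.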

The complexity would be fine if it needed to be implemented only a few times, by technical people. But with ACLs, either

every user needs to read and understand those man pages, or
the default ACLs (created by the system administrator) need to work.

If a lawyer and a paralegal want to share some documents, do you think they're going to be able to read and understand those man pages? Because the only thing I did take away from the word salad in the man page is that calling chmod will still dick up your default ACLs: “when the permissions of a file or directory are changed with chmod(2), the file masks are set based on the new file mode, and the masked and write_through ACL flags are set.” So RichACLs won't work either, and no one will use them.

Why mask in the first place?

Casey Schaufler, who was the technical editor on the POSIX.1e draft, gave a talk at the 2018 linux.conf.au conference titled The Twisting, Turning, Narrow Road That Is Security. In it, he describes the rationale behind the group-bits mask. It's worth reproducing in full.

The initial proposal was that, if you had an access control list, you used the access control list. Period. End of sentence. If you had the mode bits but no access control list, you used the mode bits. Everybody would have been happy there; access control lists would have been very simple. But, we were in an era of compatibility, and so… we didn't do that…

Backward-compatibility is a real nuisance on occasion. One of the members of the team said, here's what we have to be able to do:

Do a stat() to get the mode bits [of some file]

Set the mode to zero [on that file], so that nobody can access it

Set the mode bits back [to what they were originally]

If you have an access control list, that behavior still needs to be supported. So chmod 0 has to turn off access, and then chmod back to what it was before has to give you the exact same access you had before, even if you have an access control list. Because that's the way people write programs. It would really be nice if, on occasion, if we could change something…

So we ended up with this interesting thing called a mask… in support of this one little use scenario here.
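
In Python, that use scenario looks something like the sketch below. The function name is made up, and the temporary file has no ACL, so this only shows the mode-bit round trip that the mask was invented to preserve.

```python
import os
import stat
import tempfile

def lock_and_restore(path):
    """Run the three steps from the talk and report the mode after each."""
    original = stat.S_IMODE(os.stat(path).st_mode)  # 1. stat() to get the mode bits
    os.chmod(path, 0)                               # 2. mode zero: nobody gets in
    locked = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, original)                        # 3. put the mode bits back
    restored = stat.S_IMODE(os.stat(path).st_mode)
    return original, locked, restored

with tempfile.NamedTemporaryFile() as f:
    original, locked, restored = lock_and_restore(f.name)
    print(oct(original), oct(locked), oct(restored))
```

The requirement was that step 3 give back the exact same access as before, even when the file has an ACL.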

Wow, maybe that guy is dead?

Keep in mind that RichACLs aren't a proposed amendment to the POSIX standard. RichACLs will be used on systems where most of the other software expects POSIX semantics, and that leads to a conflict between ACL permissions and the traditional permission bits. So insofar as possible, RichACLs do have to respect the permission bits, because that's what POSIX currently says. Specifically, POSIX.1-2017 insists that any additional access control mechanism (such as RichACLs) must treat the permission bits as an upper bound:

An additional access control mechanism shall only further restrict the access permissions defined by the file permission bits.

To paraphrase, we are kind of fucked so long as RichACLs are a vague “additional access control mechanism.” If, for example, cp calls chmod(2) to clear the “other” bits, then we have to honor that, regardless of what any ACLs say.
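
As a sketch of what that sentence of POSIX forces on us, using plain rwx bits as a stand-in for whatever the ACL grants (both function names are hypothetical):

```python
R, W, X = 4, 2, 1

def violates_posix(mode_bits, granted):
    """True if `granted` contains a permission the mode bits do not."""
    return granted & ~mode_bits != 0

def clamp_to_mode(mode_bits, acl_bits):
    """The most an 'additional mechanism' may grant: the intersection."""
    return mode_bits & acl_bits

other_bits = 0      # cp has just cleared the "other" bits
acl_says = R | W    # hypothetical ACL entry for someone in the "other" class
print(violates_posix(other_bits, acl_says))                             # True
print(violates_posix(other_bits, clamp_to_mode(other_bits, acl_says)))  # False
```

The only compliant answer is the clamped one, which here is no access at all.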

What to do about it?

There are three issues that need to be addressed:

The whole concept of a “mask” is far too confusing.
Treating the permission bits as a mask creates even more confusion, and makes default ACLs useless in practice.
We need to remain compatible with POSIX.

So here's what we do:

Delete all of the mask crap from RichACLs.
Delete all of the special mode bits handling from RichACLs.
For the RichACL access control algorithm, do what we should have done all along: if there's a RichACL, use it; if not, use the mode bits. Everybody will be happy.
The previous items would violate POSIX, so this is crucial: standardize the behavior in a new POSIX draft. If it's in POSIX, it doesn't violate POSIX. Bam.
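
The access check from the third item fits in a few lines. A minimal Python sketch, again with rwx bits standing in for real RichACL permissions and a made-up function name:

```python
R, W, X = 4, 2, 1

def proposed_access(mode_bits, richacl_bits, want):
    """If there's a RichACL, use it; if not, use the mode bits.

    mode_bits: the mode bits that apply to the caller.
    richacl_bits: what the file's RichACL grants the caller, or None if
    the file has no RichACL at all.
    """
    granted = mode_bits if richacl_bits is None else richacl_bits
    return granted & want == want

print(proposed_access(R, None, R))   # True: no ACL, the mode bits decide
print(proposed_access(R, None, W))   # False
print(proposed_access(0, R | W, W))  # True: the ACL decides, mode bits ignored
```

That last case is exactly what POSIX.1-2017 forbids today, which is why the fourth item isn't optional.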

A simple, predictable, standard solution that actually works. The mode bits and masks are not mentioned anywhere in the NFSv4 ACL specification, so we still have a faithful implementation of that standard. Oh, and this is exactly how macOS implements them. Copy that shit and be done with it.

The RichACLs implementation is currently an unofficial patch to the Linux kernel, so there is still time to get it right.