Trees with Access Controls

Why?

I have a website I'm working on to help PCVs keep in touch. The project originates out of a XSLT project I developed to keep track of everyone's name, hometown and whatnot. The project was really useful and we are the only country in West Africa where volunteers can quickly look up the contact information of their compatriots.

I'm headed home and I have neither the time nor the intrest to keep updating the XML sources for the roster. I would also like to make some of the things that I have developed available to volunteers in other countries.

The Problems

The main issue at this point is privacy. There are a couple of issues:

Making your site assignment public is grounds for getting kicked out of the Peace Corps. (It is a security concern: the terrorists will surely come and end you if they but could find out where you were.) People have had website infractions cited in their administrative separations before^[1,2].
Many people do not like having their email publicly accessible on the web as it can increase unwanted mail.
The pieces of information that people consider private varies. One volunteer did not want people to know his birthday. Another maintains two websites: one for his family and another, less formal one, for his personal enjoyment.
This information will not be maintained entierly by individuals. Rule #1 of organizing a large group of people is that a certain number of them are going to be lazy. This is particularly true in an environment where some people access the internet once every three weeks. There will have to be some administrators who add and update information for their region or the whole country.

The Solution

I would like to give people control over the privacy of their information. Things that they would like to make public, let them make public. Things that they only want people in their country to know, let them do that as well.

The most logical way to do this would be a system of ACLs. Let them group pieces of information and specify how public it is. The issue with ACLs is I could make a huge list of ACLs for every piece of information, but that would be a pain to manage and would take too much time to check.

ACLs are seen most frequently in filesystems. NTFS has them as do certain versions of ext3. I attempted an earlier version of this same project attempting to use the ACLs already present in LDAP, but the implementation in OpenLDAP proved too difficult to work with as well as placing to great a limit on where I could deploy the program.

Hierarchies and Technology

Much of this information is hierarchical. Concepts go from the broad to the specific. Consider a theoretical structure for Peace Corps information: (which might not be optimal, but just for consideration)

People
- Countries
  - Country
    - Name (Mauritania)
    - Staff
      - Person
        
        First Name (Obidiah)
        
        Last Name (Shaw)
        
        Known As (Obie)
        
        Position (Country Director)
    - Volunteers
      - Person
        
        First Name (William)
        
        Last Name (Holcomb)
        
        Known As (W)
        
        Sector (ICT)
        
        User ID (wholcomb)
        
        Status (active)
      - Person
        
        First Name (Miriam)
        
        Last Name (Edwards)
        
        Known As (Pussycat)
        
        Sector (SED)
        
        User ID (mae)
        
        Status (active)

This allows the organization and grouping of pieces of data in a sensical manner.

Composite Trees

So, I want to stick things in a tree and I want to give people control. One thing that becomes slightly more complex however is I want for these trees to be extremely flexible. If volunteer one wants the whole world to know their name, one wants to their middle name be private and another wants it all to be secret; I want to give them that freedom.

Given that the ACLs are going to be a part of the tree, this means that, for the name for example, either the whole name will be in a single node and I will do special checking on the node, or, more simply, each component of the name has its own node and ACLs operate at the node level.

This is how the current XML based implementation works which uses the xNL schema. For example, in the XML sources my name is stored as:

<xnl:PersonName xmlns:xnl="rn:oasis:names:tc:ciq:xsdschema:xNL:2.0">
  <xnl:FirstName>William</xnl:FirstName>
  <xnl:MiddleName>James</xnl:MiddleName>
  <xnl:LastName>Holcomb</xnl:LastName>
  <xnl:KnownAs>Will</xnl:KnownAs>
</xnl:PersonName>

Another advantage to putting each data item it its own node it is solves problems with quantities. For instance if the name is in a single table with the columns first, middle and last. What about people with no middle name? Let the column be null. Fine. What about people with two middle names? Put the names in a space searated list? Ok, fair enough, now add a nickname column. Now try to put in someone with two different nicknames. I can't just put them in there separated by spaces. I wouldn't be able to tell which words where a part of which name. I could add a special character like a newline to separate names. Now let's say that Bob has two nicknames: "Flash" and "Camel Fucker." The first he is fine with going in his contact info, but the second he only wants available to his friends. The combined field now fails whereas if the tree simply had multiple MiddleName nodes none of the problems appear in the first place.

XML preserves ordering of the children of a node, so the middle names of a person can be put in correct order and that will be maintained.

Access Control

So I now want to specify access to the tree. What sorts of access should be controllable?

Data Nodes
- Read:
- Edit:
- Delete:
Grouping Nodes:
- Traversal: can a user access the children of this node?
- Browse: can a user list the accessible children of this node? (Children that a user cannot read are invisible. This is how Apache does its directory indexes and it helps protect privacy. If my node had a BoyfriendName child, you don't need to know the fellow's name to figure out I'm gay.)
- List: can a user list all the children of this node? (implies Browse)
- Edit: can a user change the properties of this node?
- Add: can a user add children to this node?
- Delete: can a user delete children from this node?

The other aspect is how to specify the groups of users that a permission applies to. What are the sorts of creiteria I might like?

No one but Jane can read my birthday
Anyone who served in my country or any currently active Peace Corps volunteer can see my site site assignemnt
Anyone in the Peace Corps, either volunteers or staff, can see my email adddress
The whole world can see my first name
Any authenticated users can see my last name
Only my group of friends can see the address of my LiveJournal

The most common metaphor for managing permissions is to apply them either to specific users or static groups of users. That means either I have to write the program to automatically maintain lots of groups or I have to create an interface for users to do it. If I make it so users are responsible for it this goes against rule #1 as almost certainly there will be things that don't get done in a timely manner.

A much better idea is dynamic groups. This is used in certain LDAP servers where instead of having a group specified by a list, it is specified by a query. This is significantly more computationally expensive, but it increases flexibility by orders of magnitude.

So far as queries, the users are already going to be in a tree. I'll just put their username in there along with the other info rather than having a separate table for that information. So I need a query for selecting nodes from a tree. This query language already exists with several implmentations: XPath.

First some conventions… The basic layout of the site tree would be:

I also need to be able to reference different nodes. $self is my user node, $guest is the person requesting access. Also, XPath is made to select XML nodes whose names may not contain spaces, so node names here will be converted by going to lowercase and putting underscores in the place of spaces.

The default permissions on the root which are inherited by all the children until they are overridden are: Read: *, Write: /groups/group[@name = "Administrators"]/people.

Read: $guest[@user_id = "jane"]: No one but Jane can read my birthday
Read: $guest/../local-name() = "volunteers" and ($self/../ or $guest[@status = "active"]): Anyone who served in my country or any currently active Peace Corps volunteer can see my site assignment
Read: $self/../../*/*: Anyone in the Peace Corps in my country, either volunteers or staff, can see my email adddress
Read: *: The whole world can see my first name
Read: $guest[@user_id]: Any authenticated users can see my last name
Read: $self/groups/group[@name = "Friends"]/people: Only my group of friends can see the address of my LiveJournal

There are a couple different types of results:

Node Sets: $guest[@user_id = "jane"] or $self/../: these are queries which produce node sets. (The first produces a node set containing a single node.) These are considered successful matches if they contain the $guest node.
Booleans: $guest/../local-name() = "volunteers": this produces a boolean value, true or false. The criteia for its success should be apparent.

The usage of attribute selectors is non-normative in these pathes. In normal XML a node has attributes associated with it. The nodes in this tree don't, so the normal attribute syntax is a selector for a child data node. For example person[@first_name = "Will"]/user_id is the same as person/first_name[value() = "Will"]/../user_id.

Cascading

I want for access control to be inherited by children. The problem is I don't want all things to be inherited equally. For example, my two root permissions: Read: * and Write: /groups/group[@name = "Administrators"]/people. The read permission for everyone I want to be overridden by users permissions. The write permission for administrators however I don't want for users to be able to revoke.

The simplest mechanism I can come up with is to give rules a priority:

Authoritative: Administrator assigned; these rules take precedence over any others
Important: User assignable rules that are should be considered a priority. For example if I have a group of close friends I want to give read access to all my information regardless of the premissions I set
Normal: The default priority level
Default: Not the default level, rather rules that are applied as a default if no other pertinent rules could be found

So my read rule would be a default and the write rule would be authoritative. My select would simply be to order the rules first by priority and second by their position on the way to the root. Then I just go through rules until I hit one whose selector matches.

One other property that rules will need is whether they are to allow or deny the matched permission.

Trees

So, I would like to do the same basic tree, but put it into a relational database. Take this org chart for example:

Basic Database

Say I have all the employees in a database: (These tests were run on MySQL.)

create table employees (id int auto_increment primary key, name text);
insert into employees values(DEFAULT, "Amy"),(DEFAULT, "Ben");
insert into employees values(DEFAULT, "Bob"),(DEFAULT, "Barb");
insert into employees values(DEFAULT, "Carl"),(DEFAULT, "Carol");
insert into employees values(DEFAULT, "Dan");
select * from employees;

Which produces:

id	name
1	Amy
2	Ben
3	Bob
4	Barb
5	Carl
6	Carol
7	Dan

Adjacency Matrix

The most basic way of doing a tree in a relational database is an adjacency matrix. For our example it would be:

create table adjacency (parent int not null references employees(id),
                        child int not null references employees(id));
insert into adjacency values(1, 2),(1, 3),(1, 4);
insert into adjacency values(3, 5),(3, 6),(6, 7);
select * from adjacency;

parent	child
1	2
1	3
1	4
3	5
3	6
6	7

The Root

I can now iterate my way through the table. To get the root of the table is is:

select employees.*
  from employees left join adjacency on employees.id = adjacency.child
  where adjacency.child is null;

Which produces:

id	name
1	Amy

Children

To get the children just take the employee id we want the children of:

select employees.*
  from employees left join adjacency on employees.id = adjacency.child
  where adjacency.parent = 1;

Which produces:

id	name
2	Ben
3	Bob
4	Barb

Parents

To go the other direction is pretty much the same thing:

select employees.*
  from employees left join adjacency on employees.id = adjacency.parent
  where adjacency.child = 2;

Which produces:

id	name
1	Amy

Limitations

This representation has the advantage of being easy to conceptualize and update. It has the disadvantage of requiring many repeated queries to walk the tree if you are not starting at the root. (Normally this works best for applications where you return the entire tree and construct it in memory before displaying it.)

Nested Sets

A representation that is a bit more powerful at the cost of being more complex is called "nested sets" and comes from Joe Celko. In this represenation two numbers are associated with a node. To get the numbers you do a depth first traversal with a counter that is incremented at every step. The two numbers are the count when the node was entered and exited:

So to create this representation it would be:

create table traversal (element_id int not null references employees(id),
                        enter int unique, exit int unique);
insert into traversal values(1, 1, 14),(2, 2, 3),(3, 4, 11),(4, 12, 13);
insert into traversal values(5, 5, 6),(6, 7, 10),(7, 8, 9);
select * from traversal;

element_id	enter	exit
1	1	14
2	2	3
3	3	11
4	4	13
5	5	6
6	7	10
7	8	9

Doing the queries that are possible with the adjacency matrix are possible with this representation:

The Root

The root is just the element with the minimum enter count:

select employees.* from employees join traversal on id = element_id
  order by enter limit 1;

Which produces:

id	name
1	Amy

Direct Children

The strength of this algorithm is in getting branches of a tree. Getting the direct children is a little complicated:

select employees.*
  from traversal as t1, traversal as t2,
       employees join traversal as t3 on id = t3.element_id,
  where t3.enter between t1.enter and t1.exit
  and t1.enter between t2.enter and t2.exit
  and t2.element_id = 3 group by id having count(*) = 2;

Which produces:

id	name
5	Carl
6	Carol

This query seems a bit complicated, well, this query is a bit complicated. It is actually a modified form of another interesting query:

Levels

This representation lets me easily see the tiers in the tree with this query:

select employees.*, count(*) as level
  from traversal as t1, employees join traversal as t2 on id = t2.element_id
  where t2.enter between t1.enter and t1.exit group by id;

Which produces:

id	name	level
1	Amy	1
2	Ben	2
3	Bob	2
4	Barb	2
5	Carl	3
6	Carol	3
7	Dan	4

Branches

The real point to this representation is getting branches. For example to get all the people who work under Bob:

select employees.*
  from traversal as t1, employees join traversal as t2 on id = t2.element_id
  where t2.enter between t1.enter and t1.exit and t1.element_id = 3;

Which produces:

id	name
3	Bob
5	Carl
6	Carol
7	Dan

If I have a tree where I will frequently be working from internal nodes and will be getting branches, this can significantly reduce the number of queries I do.

Direct Parent

Getting the direct parent is easier than getting the children:

select employees.*
  from traversal as t1, employees join traversal as t2 on id = t2.element_id
  where t2.enter < t1.enter and t2.exit > t1.exit and t1.element_id = 5
  order by t2.enter desc limit 1;

Which produces:

id	name
3	Bob

Path to Root

In the theme of branches, getting the path to the root simply a matter of taking the limit off the previous statement:

select employees.*
  from traversal as t1, employees join traversal as t2 on id = t2.element_id
  where t2.enter <= t1.enter and t2.exit >= t1.exit and t1.element_id = 5
  order by t2.enter;

Which produces:

id	name
1	Amy
3	Bob
5	Carl

Child Count

One final interesting query is one which tells you how many children a node has. This can be done with a reversal of the query for level.

select employees.*, count(*) - 1 as "child count"
  from traversal as t1, employees join traversal as t2 on id = t2.element_id
  where t1.enter between t2.enter and t2.exit group by id;

If the indexes have been properly maintained they may also be used to get the child count. This eliminates the join and should be about as fast as possible:

select employees.*, cast((exit - enter) / 2 as unsigned) as "child count"
  from employees join traversal on id = element_id;

Which produces:

id	name	child count
1	Amy	6
2	Ben	0
3	Bob	3
4	Barb	0
5	Carl	0
6	Carol	1
7	Dan	0

Database Representation

How does this look in the database however? I'm pretty sure it should simply be a table for each leaf node that contains data and then some nodes which are only there for grouping. For example, my roommate and myself as an example:

People
- Name
  - First (William)
  - Middle (James)
  - Last (Holcomb)
- Name
  - First (Miriam)
  - Middle (Annette)
  - Last (Edwards)

create table traversal (id int auto_increment primary key, name text,
                        enter int unique, exit int unique);
insert into traversal values(DEFAULT, "People", 1, 18);
insert into traversal values(DEFAULT, "Name", 2, 9),(DEFAULT, "Name", 10, 17);
create table first (id int primary key references traversal(id), name text);
insert into traversal values(DEFAULT, "First", 3, 4);
insert into first values (LAST_INSERT_ID(), "William");
insert into traversal values(DEFAULT, "First", 11, 12);
insert into first values (LAST_INSERT_ID(), "Miriam");
create table middle (id int primary key references traversal(id), name text);
insert into traversal values(DEFAULT, "Middle", 5, 6);
insert into middle values (LAST_INSERT_ID(), "James");
insert into traversal values(DEFAULT, "Middle", 13, 14);
insert into middle values (LAST_INSERT_ID(), "Annette");
create table last (id int primary key references traversal(id), name text);
insert into traversal values(DEFAULT, "Last", 7, 8);
insert into last values (LAST_INSERT_ID(), "Holcomb");
insert into traversal values(DEFAULT, "Last", 15, 16);
insert into last values (LAST_INSERT_ID(), "Edwards");
select first.name as first, middle.name as middle, last.name as last
  from traversal as t join (first, middle, last)
       on (first.id = t.id or middle.id = t.id or last.id = t.id)
  group by first.name, middle.name, last.name;

Which produces:

first	middle	last
Miriam	Annette	Edwards
Miriam	Annette	Holcomb
Miriam	James	Edwards
Miriam	James	Holcomb
William	Annette	Edwards
William	Annette	Holcomb
William	James	Edwards
William	James	Holcomb

Which is funny, but not especially useful… ☺ The actual select needs to keep the branches separate:

select first.name as first, middle.name as middle, last.name as last
  from traversal as t1 join first using(id),
       traversal as t2 join middle using(id),
       traversal as t3 join last on using(id), traversal as t4
  where t1.enter between t4.enter and t4.exit
    and t2.enter between t4.enter and t4.exit
    and t3.enter between t4.enter and t4.exit and t4.id = 3;

Which produces:

first	middle	last
William	James	Holcomb

That's really a long statement. The construction is very regular and doing it programatiacally is no problem, it cannot be scalable though. That is a four table join just for a simple representation of a name. I may add another couple fields for a name, then for an address there would be another five or six then other info… I'd easily have a fifty table join which is just absurd.

Actually, I'm not sure why I wanted to put each field in its own table. Why not do:

create table node_types (id int auto_increment primary key, name text);
create table traversal
  (id int auto_increment primary key, type int not null references node_types(id),
   enter int unique, exit int unique);
insert into node_types values(DEFAULT, "People");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 1, 18);
insert into node_types values(DEFAULT, "Name");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 2, 9),
                            (DEFAULT, LAST_INSERT_ID(), 10, 17);
create table strings (id int not null references traversal(id), value text);
insert into node_types values(DEFAULT, "First Name");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 3, 4);
insert into strings values (LAST_INSERT_ID(), "William");
insert into traversal values(DEFAULT,
   (select id from node_types where name = "First Name"), 11, 12);
insert into strings values (LAST_INSERT_ID(), "Miriam");
insert into node_types values(DEFAULT, "Middle Name");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 5, 6);
insert into strings values (LAST_INSERT_ID(), "James");
insert into traversal values(DEFAULT,
   (select id from node_types where name = "Middle Name"), 13, 14);
insert into strings values (LAST_INSERT_ID(), "Annette");
insert into node_types values(DEFAULT, "Last Name");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 7, 8);
insert into strings values (LAST_INSERT_ID(), "Holcomb");
insert into traversal values(DEFAULT,
   (select id from node_types where name = "Last Name"), 15, 16);
insert into strings values (LAST_INSERT_ID(), "Edwards");
select t1.enter, t1.exit, node_types.name, strings.value
  from traversal as t1, traversal as t2, node_types, strings
  where t1.id = strings.id and t1.type = node_types.id
    and t1.enter between t2.enter and t2.exit and t2.id = 2;

Which produces:

enter	exit	name	value
3	4	First Name	William
5	6	Middle Name	James
7	8	Last Name	Holcomb

That looks pretty good to me. I will need to add additional tables for things like dates and numbers and whatnot, but this limits the profligation of tables to the number of data types.

Implementation: Making It Happen

Now comes that fatal step that is so often the death of my projects; coding. Not that I can't, just I often get distracted and wander off before finishing. One thing I know I don't feel like doing is an implmentation of XPath; it is tedious string parsing bits for the most part and it is a pretty sophistocated language. Conveniently there is an existing PHP implementation. It doesn't use a standard DOM document (it is a very limited subset with some additions for performance), but it is constructed from a SAX parse and I can fake that pretty easily.

To start off with, I had to take the original class and reformat the comments. They were too long for my screen and the wrapping made them pretty much unreadable. I started to reformat the comments for Doxygen and found that other than the initial comment they were already done.

Ugh, while generating the documentation, I realized this:

6264  31980 276819 XPath.class.orig.php

The file is over 6kloc. That's huge for me. A big project for me is maybe 500 lines. Part of it is because the code is well commented and also because the three classes are all in the same file. The XPathEngine class however is 4300 lines. My brain is simply not big enough to work effectively with that much information at a time.

The class structure is also not especially condusive to work. As I mentioned there are three classes:

XPathBase: provides utility functions (string parsing, message display, etc.)
XPathEngine: handles parsing of the XML document, parsing of the XPath query and searching the tree
XPath: allows operations on the XML tree (inserts, deletes, replacements) based on XPath queries

In my opinion the functions of these three classes are fairly distinct, but they are a single inheritance hierarchy: XPathBase → XPathEngine → XPath.

I think that if it is not too hard, I want to shuffle the functionality into:

PHPXPathDocument: Abstract class used to access the DOM document intersection used by the XPathEngine
FileDocument: Implementation of PHPXPathDocument from a file, since I will need to test
DatabaseDocument: Implementation of PHPXPathDocument accessing my database tree
XPath: An object representing an XPath. I don't know yet the extent this can be compiled to speed repetitions of the same query
XPathEngine: Utility class taking an XPath and PHPXPathDocument and returning a String, Number, Boolean or NodeSet
NodeSet: The other return values of an XPath query are PHP literal types, but this will be an Object based on the DOM specification