Trees with Access Controls

Why?

I have a website I'm working on to help PCVs keep in touch. The project originates out of a XSLT project I developed to keep track of everyone's name, hometown and whatnot. The project was really useful and we are the only country in West Africa where volunteers can quickly look up the contact information of their compatriots.

I'm headed home and I have neither the time nor the intrest to keep updating the XML sources for the roster. I would also like to make some of the things that I have developed available to volunteers in other countries.

The Problems

The main issue at this point is privacy. There are a couple of issues:

The Solution

I would like to give people control over the privacy of their information. Things that they would like to make public, let them make public. Things that they only want people in their country to know, let them do that as well.

The most logical way to do this would be a system of ACLs. Let them group pieces of information and specify how public it is. The issue with ACLs is I could make a huge list of ACLs for every piece of information, but that would be a pain to manage and would take too much time to check.

ACLs are seen most frequently in filesystems. NTFS has them as do certain versions of ext3. I attempted an earlier version of this same project attempting to use the ACLs already present in LDAP, but the implementation in OpenLDAP proved too difficult to work with as well as placing to great a limit on where I could deploy the program.

Hierarchies and Technology

Much of this information is hierarchical. Concepts go from the broad to the specific. Consider a theoretical structure for Peace Corps information: (which might not be optimal, but just for consideration)

This allows the organization and grouping of pieces of data in a sensical manner.

Composite Trees

So, I want to stick things in a tree and I want to give people control. One thing that becomes slightly more complex however is I want for these trees to be extremely flexible. If volunteer one wants the whole world to know their name, one wants to their middle name be private and another wants it all to be secret; I want to give them that freedom.

Given that the ACLs are going to be a part of the tree, this means that, for the name for example, either the whole name will be in a single node and I will do special checking on the node, or, more simply, each component of the name has its own node and ACLs operate at the node level.

This is how the current XML based implementation works which uses the xNL schema. For example, in the XML sources my name is stored as:

<xnl:PersonName xmlns:xnl="rn:oasis:names:tc:ciq:xsdschema:xNL:2.0">
  <xnl:FirstName>William</xnl:FirstName>
  <xnl:MiddleName>James</xnl:MiddleName>
  <xnl:LastName>Holcomb</xnl:LastName>
  <xnl:KnownAs>Will</xnl:KnownAs>
</xnl:PersonName>

Another advantage to putting each data item it its own node it is solves problems with quantities. For instance if the name is in a single table with the columns first, middle and last. What about people with no middle name? Let the column be null. Fine. What about people with two middle names? Put the names in a space searated list? Ok, fair enough, now add a nickname column. Now try to put in someone with two different nicknames. I can't just put them in there separated by spaces. I wouldn't be able to tell which words where a part of which name. I could add a special character like a newline to separate names. Now let's say that Bob has two nicknames: "Flash" and "Camel Fucker." The first he is fine with going in his contact info, but the second he only wants available to his friends. The combined field now fails whereas if the tree simply had multiple MiddleName nodes none of the problems appear in the first place.

XML preserves ordering of the children of a node, so the middle names of a person can be put in correct order and that will be maintained.

Access Control

So I now want to specify access to the tree. What sorts of access should be controllable?

The other aspect is how to specify the groups of users that a permission applies to. What are the sorts of creiteria I might like?

The most common metaphor for managing permissions is to apply them either to specific users or static groups of users. That means either I have to write the program to automatically maintain lots of groups or I have to create an interface for users to do it. If I make it so users are responsible for it this goes against rule #1 as almost certainly there will be things that don't get done in a timely manner.

A much better idea is dynamic groups. This is used in certain LDAP servers where instead of having a group specified by a list, it is specified by a query. This is significantly more computationally expensive, but it increases flexibility by orders of magnitude.

So far as queries, the users are already going to be in a tree. I'll just put their username in there along with the other info rather than having a separate table for that information. So I need a query for selecting nodes from a tree. This query language already exists with several implmentations: XPath.

First some conventions… The basic layout of the site tree would be:

I also need to be able to reference different nodes. $self is my user node, $guest is the person requesting access. Also, XPath is made to select XML nodes whose names may not contain spaces, so node names here will be converted by going to lowercase and putting underscores in the place of spaces.

The default permissions on the root which are inherited by all the children until they are overridden are: Read: *, Write: /groups/group[@name = "Administrators"]/people.

There are a couple different types of results:

The usage of attribute selectors is non-normative in these pathes. In normal XML a node has attributes associated with it. The nodes in this tree don't, so the normal attribute syntax is a selector for a child data node. For example person[@first_name = "Will"]/user_id is the same as person/first_name[value() = "Will"]/../user_id.

Cascading

I want for access control to be inherited by children. The problem is I don't want all things to be inherited equally. For example, my two root permissions: Read: * and Write: /groups/group[@name = "Administrators"]/people. The read permission for everyone I want to be overridden by users permissions. The write permission for administrators however I don't want for users to be able to revoke.

The simplest mechanism I can come up with is to give rules a priority:

  1. Authoritative: Administrator assigned; these rules take precedence over any others
  2. Important: User assignable rules that are should be considered a priority. For example if I have a group of close friends I want to give read access to all my information regardless of the premissions I set
  3. Normal: The default priority level
  4. Default: Not the default level, rather rules that are applied as a default if no other pertinent rules could be found

So my read rule would be a default and the write rule would be authoritative. My select would simply be to order the rules first by priority and second by their position on the way to the root. Then I just go through rules until I hit one whose selector matches.

One other property that rules will need is whether they are to allow or deny the matched permission.

Trees

So, I would like to do the same basic tree, but put it into a relational database. Take this org chart for example:

Your browser doesn't seem to handle the <object> tag

Basic Database

Say I have all the employees in a database: (These tests were run on MySQL.)

create table employees (id int auto_increment primary key, name text);
insert into employees values(DEFAULT, "Amy"),(DEFAULT, "Ben");
insert into employees values(DEFAULT, "Bob"),(DEFAULT, "Barb");
insert into employees values(DEFAULT, "Carl"),(DEFAULT, "Carol");
insert into employees values(DEFAULT, "Dan");
select * from employees;

Which produces:

idname
1Amy
2Ben
3Bob
4Barb
5Carl
6Carol
7Dan

Adjacency Matrix

The most basic way of doing a tree in a relational database is an adjacency matrix. For our example it would be:

create table adjacency (parent int not null references employees(id),
                        child int not null references employees(id));
insert into adjacency values(1, 2),(1, 3),(1, 4);
insert into adjacency values(3, 5),(3, 6),(6, 7);
select * from adjacency;
parentchild
12
13
14
35
36
67

The Root

I can now iterate my way through the table. To get the root of the table is is:

select employees.*
  from employees left join adjacency on employees.id = adjacency.child
  where adjacency.child is null;

Which produces:

idname
1Amy

Children

To get the children just take the employee id we want the children of:

select employees.*
  from employees left join adjacency on employees.id = adjacency.child
  where adjacency.parent = 1;

Which produces:

idname
2Ben
3Bob
4Barb

Parents

To go the other direction is pretty much the same thing:

select employees.*
  from employees left join adjacency on employees.id = adjacency.parent
  where adjacency.child = 2;

Which produces:

idname
1Amy

Limitations

This representation has the advantage of being easy to conceptualize and update. It has the disadvantage of requiring many repeated queries to walk the tree if you are not starting at the root. (Normally this works best for applications where you return the entire tree and construct it in memory before displaying it.)

Nested Sets

A representation that is a bit more powerful at the cost of being more complex is called "nested sets" and comes from Joe Celko. In this represenation two numbers are associated with a node. To get the numbers you do a depth first traversal with a counter that is incremented at every step. The two numbers are the count when the node was entered and exited:

Your browser doesn't seem to handle the <object> tag

So to create this representation it would be:

create table traversal (element_id int not null references employees(id),
                        enter int unique, exit int unique);
insert into traversal values(1, 1, 14),(2, 2, 3),(3, 4, 11),(4, 12, 13);
insert into traversal values(5, 5, 6),(6, 7, 10),(7, 8, 9);
select * from traversal;
element_identerexit
1114
223
3311
4413
556
6710
789

Doing the queries that are possible with the adjacency matrix are possible with this representation:

The Root

The root is just the element with the minimum enter count:

select employees.* from employees join traversal on id = element_id
  order by enter limit 1;

Which produces:

idname
1Amy

Direct Children

The strength of this algorithm is in getting branches of a tree. Getting the direct children is a little complicated:

select employees.*
  from traversal as t1, traversal as t2,
       employees join traversal as t3 on id = t3.element_id,
  where t3.enter between t1.enter and t1.exit
  and t1.enter between t2.enter and t2.exit
  and t2.element_id = 3 group by id having count(*) = 2;

Which produces:

idname
5Carl
6Carol

This query seems a bit complicated, well, this query is a bit complicated. It is actually a modified form of another interesting query:

Levels

This representation lets me easily see the tiers in the tree with this query:

select employees.*, count(*) as level
  from traversal as t1, employees join traversal as t2 on id = t2.element_id
  where t2.enter between t1.enter and t1.exit group by id;

Which produces:

idnamelevel
1Amy1
2Ben2
3Bob2
4Barb2
5Carl3
6Carol3
7Dan4

Branches

The real point to this representation is getting branches. For example to get all the people who work under Bob:

select employees.*
  from traversal as t1, employees join traversal as t2 on id = t2.element_id
  where t2.enter between t1.enter and t1.exit and t1.element_id = 3;

Which produces:

idname
3Bob
5Carl
6Carol
7Dan

If I have a tree where I will frequently be working from internal nodes and will be getting branches, this can significantly reduce the number of queries I do.

Direct Parent

Getting the direct parent is easier than getting the children:

select employees.*
  from traversal as t1, employees join traversal as t2 on id = t2.element_id
  where t2.enter < t1.enter and t2.exit > t1.exit and t1.element_id = 5
  order by t2.enter desc limit 1;

Which produces:

idname
3Bob

Path to Root

In the theme of branches, getting the path to the root simply a matter of taking the limit off the previous statement:

select employees.*
  from traversal as t1, employees join traversal as t2 on id = t2.element_id
  where t2.enter <= t1.enter and t2.exit >= t1.exit and t1.element_id = 5
  order by t2.enter;

Which produces:

idname
1Amy
3Bob
5Carl

Child Count

One final interesting query is one which tells you how many children a node has. This can be done with a reversal of the query for level.

select employees.*, count(*) - 1 as "child count"
  from traversal as t1, employees join traversal as t2 on id = t2.element_id
  where t1.enter between t2.enter and t2.exit group by id;

If the indexes have been properly maintained they may also be used to get the child count. This eliminates the join and should be about as fast as possible:

select employees.*, cast((exit - enter) / 2 as unsigned) as "child count"
  from employees join traversal on id = element_id;

Which produces:

idnamechild count
1Amy6
2Ben0
3Bob3
4Barb0
5Carl0
6Carol1
7Dan0

Database Representation

How does this look in the database however? I'm pretty sure it should simply be a table for each leaf node that contains data and then some nodes which are only there for grouping. For example, my roommate and myself as an example:

create table traversal (id int auto_increment primary key, name text,
                        enter int unique, exit int unique);
insert into traversal values(DEFAULT, "People", 1, 18);
insert into traversal values(DEFAULT, "Name", 2, 9),(DEFAULT, "Name", 10, 17);
create table first (id int primary key references traversal(id), name text);
insert into traversal values(DEFAULT, "First", 3, 4);
insert into first values (LAST_INSERT_ID(), "William");
insert into traversal values(DEFAULT, "First", 11, 12);
insert into first values (LAST_INSERT_ID(), "Miriam");
create table middle (id int primary key references traversal(id), name text);
insert into traversal values(DEFAULT, "Middle", 5, 6);
insert into middle values (LAST_INSERT_ID(), "James");
insert into traversal values(DEFAULT, "Middle", 13, 14);
insert into middle values (LAST_INSERT_ID(), "Annette");
create table last (id int primary key references traversal(id), name text);
insert into traversal values(DEFAULT, "Last", 7, 8);
insert into last values (LAST_INSERT_ID(), "Holcomb");
insert into traversal values(DEFAULT, "Last", 15, 16);
insert into last values (LAST_INSERT_ID(), "Edwards");
select first.name as first, middle.name as middle, last.name as last
  from traversal as t join (first, middle, last)
       on (first.id = t.id or middle.id = t.id or last.id = t.id)
  group by first.name, middle.name, last.name;

Which produces:

firstmiddlelast
MiriamAnnetteEdwards
MiriamAnnetteHolcomb
MiriamJamesEdwards
MiriamJamesHolcomb
WilliamAnnetteEdwards
WilliamAnnetteHolcomb
WilliamJamesEdwards
WilliamJamesHolcomb

Which is funny, but not especially useful… ☺ The actual select needs to keep the branches separate:

select first.name as first, middle.name as middle, last.name as last
  from traversal as t1 join first using(id),
       traversal as t2 join middle using(id),
       traversal as t3 join last on using(id), traversal as t4
  where t1.enter between t4.enter and t4.exit
    and t2.enter between t4.enter and t4.exit
    and t3.enter between t4.enter and t4.exit and t4.id = 3;

Which produces:

firstmiddlelast
WilliamJamesHolcomb

That's really a long statement. The construction is very regular and doing it programatiacally is no problem, it cannot be scalable though. That is a four table join just for a simple representation of a name. I may add another couple fields for a name, then for an address there would be another five or six then other info… I'd easily have a fifty table join which is just absurd.

Actually, I'm not sure why I wanted to put each field in its own table. Why not do:

create table node_types (id int auto_increment primary key, name text);
create table traversal
  (id int auto_increment primary key, type int not null references node_types(id),
   enter int unique, exit int unique);
insert into node_types values(DEFAULT, "People");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 1, 18);
insert into node_types values(DEFAULT, "Name");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 2, 9),
                            (DEFAULT, LAST_INSERT_ID(), 10, 17);
create table strings (id int not null references traversal(id), value text);
insert into node_types values(DEFAULT, "First Name");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 3, 4);
insert into strings values (LAST_INSERT_ID(), "William");
insert into traversal values(DEFAULT,
   (select id from node_types where name = "First Name"), 11, 12);
insert into strings values (LAST_INSERT_ID(), "Miriam");
insert into node_types values(DEFAULT, "Middle Name");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 5, 6);
insert into strings values (LAST_INSERT_ID(), "James");
insert into traversal values(DEFAULT,
   (select id from node_types where name = "Middle Name"), 13, 14);
insert into strings values (LAST_INSERT_ID(), "Annette");
insert into node_types values(DEFAULT, "Last Name");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 7, 8);
insert into strings values (LAST_INSERT_ID(), "Holcomb");
insert into traversal values(DEFAULT,
   (select id from node_types where name = "Last Name"), 15, 16);
insert into strings values (LAST_INSERT_ID(), "Edwards");
select t1.enter, t1.exit, node_types.name, strings.value
  from traversal as t1, traversal as t2, node_types, strings
  where t1.id = strings.id and t1.type = node_types.id
    and t1.enter between t2.enter and t2.exit and t2.id = 2;

Which produces:

enterexitnamevalue
34First NameWilliam
56Middle NameJames
78Last NameHolcomb

That looks pretty good to me. I will need to add additional tables for things like dates and numbers and whatnot, but this limits the profligation of tables to the number of data types.

Implementation: Making It Happen

Now comes that fatal step that is so often the death of my projects; coding. Not that I can't, just I often get distracted and wander off before finishing. One thing I know I don't feel like doing is an implmentation of XPath; it is tedious string parsing bits for the most part and it is a pretty sophistocated language. Conveniently there is an existing PHP implementation. It doesn't use a standard DOM document (it is a very limited subset with some additions for performance), but it is constructed from a SAX parse and I can fake that pretty easily.

To start off with, I had to take the original class and reformat the comments. They were too long for my screen and the wrapping made them pretty much unreadable. I started to reformat the comments for Doxygen and found that other than the initial comment they were already done.

Ugh, while generating the documentation, I realized this:

6264  31980 276819 XPath.class.orig.php

The file is over 6kloc. That's huge for me. A big project for me is maybe 500 lines. Part of it is because the code is well commented and also because the three classes are all in the same file. The XPathEngine class however is 4300 lines. My brain is simply not big enough to work effectively with that much information at a time.

The class structure is also not especially condusive to work. As I mentioned there are three classes:

  1. XPathBase: provides utility functions (string parsing, message display, etc.)
  2. XPathEngine: handles parsing of the XML document, parsing of the XPath query and searching the tree
  3. XPath: allows operations on the XML tree (inserts, deletes, replacements) based on XPath queries

In my opinion the functions of these three classes are fairly distinct, but they are a single inheritance hierarchy: XPathBaseXPathEngineXPath.

I think that if it is not too hard, I want to shuffle the functionality into: