I have a website I'm working on to help PCVs keep in touch. The project originates out of a XSLT project I developed to keep track of everyone's name, hometown and whatnot. The project was really useful and we are the only country in West Africa where volunteers can quickly look up the contact information of their compatriots.
I'm headed home and I have neither the time nor the intrest to keep updating the XML sources for the roster. I would also like to make some of the things that I have developed available to volunteers in other countries.
The main issue at this point is privacy. There are a couple of issues:
I would like to give people control over the privacy of their information. Things that they would like to make public, let them make public. Things that they only want people in their country to know, let them do that as well.
The most logical way to do this would be a system of ACLs. Let them group pieces of information and specify how public it is. The issue with ACLs is I could make a huge list of ACLs for every piece of information, but that would be a pain to manage and would take too much time to check.
ACLs are seen most frequently in filesystems. NTFS has them as do certain versions of ext3. I attempted an earlier version of this same project attempting to use the ACLs already present in LDAP, but the implementation in OpenLDAP proved too difficult to work with as well as placing to great a limit on where I could deploy the program.
Much of this information is hierarchical. Concepts go from the broad to the specific. Consider a theoretical structure for Peace Corps information: (which might not be optimal, but just for consideration)
This allows the organization and grouping of pieces of data in a sensical manner.
So, I want to stick things in a tree and I want to give people control. One thing that becomes slightly more complex however is I want for these trees to be extremely flexible. If volunteer one wants the whole world to know their name, one wants to their middle name be private and another wants it all to be secret; I want to give them that freedom.
Given that the ACLs are going to be a part of the tree, this means that, for the name for example, either the whole name will be in a single node and I will do special checking on the node, or, more simply, each component of the name has its own node and ACLs operate at the node level.
This is how the current XML based implementation works which uses the xNL schema. For example, in the XML sources my name is stored as:
<xnl:PersonName xmlns:xnl="rn:oasis:names:tc:ciq:xsdschema:xNL:2.0">
<xnl:FirstName>William</xnl:FirstName>
<xnl:MiddleName>James</xnl:MiddleName>
<xnl:LastName>Holcomb</xnl:LastName>
<xnl:KnownAs>Will</xnl:KnownAs>
</xnl:PersonName>
Another advantage to putting each data item it its own node it is solves problems with quantities. For instance if the name is in a single table with the columns first
, middle
and last
. What about people with no middle name? Let the column be null. Fine. What about people with two middle names? Put the names in a space searated list? Ok, fair enough, now add a nickname column. Now try to put in someone with two different nicknames. I can't just put them in there separated by spaces. I wouldn't be able to tell which words where a part of which name. I could add a special character like a newline to separate names. Now let's say that Bob has two nicknames: "Flash" and "Camel Fucker." The first he is fine with going in his contact info, but the second he only wants available to his friends. The combined field now fails whereas if the tree simply had multiple MiddleName
nodes none of the problems appear in the first place.
XML preserves ordering of the children of a node, so the middle names of a person can be put in correct order and that will be maintained.
So I now want to specify access to the tree. What sorts of access should be controllable?
Read
:Edit
:Delete
:Traversal
: can a user access the children of this node?Browse
: can a user list the accessible children of this node? (Children that a user cannot read are invisible. This is how Apache does its directory indexes and it helps protect privacy. If my node had a BoyfriendName
child, you don't need to know the fellow's name to figure out I'm gay.)List
: can a user list all the children of this node? (implies Browse
)Edit
: can a user change the properties of this node?Add
: can a user add children to this node?Delete
: can a user delete children from this node?The other aspect is how to specify the groups of users that a permission applies to. What are the sorts of creiteria I might like?
The most common metaphor for managing permissions is to apply them either to specific users or static groups of users. That means either I have to write the program to automatically maintain lots of groups or I have to create an interface for users to do it. If I make it so users are responsible for it this goes against rule #1 as almost certainly there will be things that don't get done in a timely manner.
A much better idea is dynamic groups. This is used in certain LDAP servers where instead of having a group specified by a list, it is specified by a query. This is significantly more computationally expensive, but it increases flexibility by orders of magnitude.
So far as queries, the users are already going to be in a tree. I'll just put their username in there along with the other info rather than having a separate table for that information. So I need a query for selecting nodes from a tree. This query language already exists with several implmentations: XPath.
First some conventions… The basic layout of the site tree would be:
I also need to be able to reference different nodes. $self is my user node, $guest is the person requesting access. Also, XPath is made to select XML nodes whose names may not contain spaces, so node names here will be converted by going to lowercase and putting underscores in the place of spaces.
The default permissions on the root which are inherited by all the children until they are overridden are: Read: *, Write: /groups/group[@name = "Administrators"]/people
.
Read: $guest[@user_id = "jane"]
: No one but Jane can read my birthdayRead: $guest/../local-name() = "volunteers" and ($self/../ or $guest[@status = "active"])
: Anyone who served in my country or any currently active Peace Corps volunteer can see my site assignmentRead: $self/../../*/*
: Anyone in the Peace Corps in my country, either volunteers or staff, can see my email adddressRead: *
: The whole world can see my first nameRead: $guest[@user_id]
: Any authenticated users can see my last nameRead: $self/groups/group[@name = "Friends"]/people
: Only my group of friends can see the address of my LiveJournalThere are a couple different types of results:
Node Sets
: $guest[@user_id = "jane"]
or $self/../
: these are queries which produce node sets
. (The first produces a node set containing a single node.) These are considered successful matches if they contain the $guest
node.Booleans
: $guest/../local-name() = "volunteers"
: this produces a boolean value, true
or false
. The criteia for its success should be apparent.The usage of attribute selectors is non-normative in these pathes. In normal XML a node has attributes associated with it. The nodes in this tree don't, so the normal attribute syntax is a selector for a child data node. For example person[@first_name = "Will"]/user_id
is the same as person/first_name[value() = "Will"]/../user_id
.
I want for access control to be inherited by children. The problem is I don't want all things to be inherited equally. For example, my two root permissions: Read: *
and Write: /groups/group[@name = "Administrators"]/people
. The read permission for everyone I want to be overridden by users permissions. The write permission for administrators however I don't want for users to be able to revoke.
The simplest mechanism I can come up with is to give rules a priority:
Authoritative
: Administrator assigned; these rules take precedence over any othersImportant
: User assignable rules that are should be considered a priority. For example if I have a group of close friends I want to give read access to all my information regardless of the premissions I setNormal
: The default priority levelDefault
: Not the default level, rather rules that are applied as a default if no other pertinent rules could be foundSo my read rule would be a default
and the write rule would be authoritative
. My select would simply be to order the rules first by priority and second by their position on the way to the root. Then I just go through rules until I hit one whose selector matches.
One other property that rules will need is whether they are to allow
or deny
the matched permission.
So, I would like to do the same basic tree, but put it into a relational database. Take this org chart for example:
Say I have all the employees in a database: (These tests were run on MySQL.)
create table employees (id int auto_increment primary key, name text);
insert into employees values(DEFAULT, "Amy"),(DEFAULT, "Ben");
insert into employees values(DEFAULT, "Bob"),(DEFAULT, "Barb");
insert into employees values(DEFAULT, "Carl"),(DEFAULT, "Carol");
insert into employees values(DEFAULT, "Dan");
select * from employees;
Which produces:
id | name |
---|---|
1 | Amy |
2 | Ben |
3 | Bob |
4 | Barb |
5 | Carl |
6 | Carol |
7 | Dan |
The most basic way of doing a tree in a relational database is an adjacency matrix. For our example it would be:
create table adjacency (parent int not null references employees(id),
child int not null references employees(id));
insert into adjacency values(1, 2),(1, 3),(1, 4);
insert into adjacency values(3, 5),(3, 6),(6, 7);
select * from adjacency;
parent | child |
---|---|
1 | 2 |
1 | 3 |
1 | 4 |
3 | 5 |
3 | 6 |
6 | 7 |
I can now iterate my way through the table. To get the root of the table is is:
select employees.*
from employees left join adjacency on employees.id = adjacency.child
where adjacency.child is null;
Which produces:
id | name |
---|---|
1 | Amy |
To get the children just take the employee id we want the children of:
select employees.*
from employees left join adjacency on employees.id = adjacency.child
where adjacency.parent = 1;
Which produces:
id | name |
---|---|
2 | Ben |
3 | Bob |
4 | Barb |
To go the other direction is pretty much the same thing:
select employees.*
from employees left join adjacency on employees.id = adjacency.parent
where adjacency.child = 2;
Which produces:
id | name |
---|---|
1 | Amy |
This representation has the advantage of being easy to conceptualize and update. It has the disadvantage of requiring many repeated queries to walk the tree if you are not starting at the root. (Normally this works best for applications where you return the entire tree and construct it in memory before displaying it.)
A representation that is a bit more powerful at the cost of being more complex is called "nested sets" and comes from Joe Celko. In this represenation two numbers are associated with a node. To get the numbers you do a depth first traversal with a counter that is incremented at every step. The two numbers are the count when the node was entered and exited:
Your browser doesn't seem to handle the <object> tag
So to create this representation it would be:
create table traversal (element_id int not null references employees(id),
enter int unique, exit int unique);
insert into traversal values(1, 1, 14),(2, 2, 3),(3, 4, 11),(4, 12, 13);
insert into traversal values(5, 5, 6),(6, 7, 10),(7, 8, 9);
select * from traversal;
element_id | enter | exit |
---|---|---|
1 | 1 | 14 |
2 | 2 | 3 |
3 | 3 | 11 |
4 | 4 | 13 |
5 | 5 | 6 |
6 | 7 | 10 |
7 | 8 | 9 |
Doing the queries that are possible with the adjacency matrix are possible with this representation:
The root is just the element with the minimum enter count:
select employees.* from employees join traversal on id = element_id
order by enter limit 1;
Which produces:
id | name |
---|---|
1 | Amy |
The strength of this algorithm is in getting branches of a tree. Getting the direct children is a little complicated:
select employees.*
from traversal as t1, traversal as t2,
employees join traversal as t3 on id = t3.element_id,
where t3.enter between t1.enter and t1.exit
and t1.enter between t2.enter and t2.exit
and t2.element_id = 3 group by id having count(*) = 2;
Which produces:
id | name |
---|---|
5 | Carl |
6 | Carol |
This query seems a bit complicated, well, this query is a bit complicated. It is actually a modified form of another interesting query:
This representation lets me easily see the tiers in the tree with this query:
select employees.*, count(*) as level
from traversal as t1, employees join traversal as t2 on id = t2.element_id
where t2.enter between t1.enter and t1.exit group by id;
Which produces:
id | name | level |
---|---|---|
1 | Amy | 1 |
2 | Ben | 2 |
3 | Bob | 2 |
4 | Barb | 2 |
5 | Carl | 3 |
6 | Carol | 3 |
7 | Dan | 4 |
The real point to this representation is getting branches. For example to get all the people who work under Bob:
select employees.*
from traversal as t1, employees join traversal as t2 on id = t2.element_id
where t2.enter between t1.enter and t1.exit and t1.element_id = 3;
Which produces:
id | name |
---|---|
3 | Bob |
5 | Carl |
6 | Carol |
7 | Dan |
If I have a tree where I will frequently be working from internal nodes and will be getting branches, this can significantly reduce the number of queries I do.
Getting the direct parent is easier than getting the children:
select employees.*
from traversal as t1, employees join traversal as t2 on id = t2.element_id
where t2.enter < t1.enter and t2.exit > t1.exit and t1.element_id = 5
order by t2.enter desc limit 1;
Which produces:
id | name |
---|---|
3 | Bob |
In the theme of branches, getting the path to the root simply a matter of taking the limit off the previous statement:
select employees.*
from traversal as t1, employees join traversal as t2 on id = t2.element_id
where t2.enter <= t1.enter and t2.exit >= t1.exit and t1.element_id = 5
order by t2.enter;
Which produces:
id | name |
---|---|
1 | Amy |
3 | Bob |
5 | Carl |
One final interesting query is one which tells you how many children a node has. This can be done with a reversal of the query for level.
select employees.*, count(*) - 1 as "child count"
from traversal as t1, employees join traversal as t2 on id = t2.element_id
where t1.enter between t2.enter and t2.exit group by id;
If the indexes have been properly maintained they may also be used to get the child count. This eliminates the join and should be about as fast as possible:
select employees.*, cast((exit - enter) / 2 as unsigned) as "child count"
from employees join traversal on id = element_id;
Which produces:
id | name | child count |
---|---|---|
1 | Amy | 6 |
2 | Ben | 0 |
3 | Bob | 3 |
4 | Barb | 0 |
5 | Carl | 0 |
6 | Carol | 1 |
7 | Dan | 0 |
How does this look in the database however? I'm pretty sure it should simply be a table for each leaf node that contains data and then some nodes which are only there for grouping. For example, my roommate and myself as an example:
create table traversal (id int auto_increment primary key, name text,
enter int unique, exit int unique);
insert into traversal values(DEFAULT, "People", 1, 18);
insert into traversal values(DEFAULT, "Name", 2, 9),(DEFAULT, "Name", 10, 17);
create table first (id int primary key references traversal(id), name text);
insert into traversal values(DEFAULT, "First", 3, 4);
insert into first values (LAST_INSERT_ID(), "William");
insert into traversal values(DEFAULT, "First", 11, 12);
insert into first values (LAST_INSERT_ID(), "Miriam");
create table middle (id int primary key references traversal(id), name text);
insert into traversal values(DEFAULT, "Middle", 5, 6);
insert into middle values (LAST_INSERT_ID(), "James");
insert into traversal values(DEFAULT, "Middle", 13, 14);
insert into middle values (LAST_INSERT_ID(), "Annette");
create table last (id int primary key references traversal(id), name text);
insert into traversal values(DEFAULT, "Last", 7, 8);
insert into last values (LAST_INSERT_ID(), "Holcomb");
insert into traversal values(DEFAULT, "Last", 15, 16);
insert into last values (LAST_INSERT_ID(), "Edwards");
select first.name as first, middle.name as middle, last.name as last
from traversal as t join (first, middle, last)
on (first.id = t.id or middle.id = t.id or last.id = t.id)
group by first.name, middle.name, last.name;
Which produces:
first | middle | last |
---|---|---|
Miriam | Annette | Edwards |
Miriam | Annette | Holcomb |
Miriam | James | Edwards |
Miriam | James | Holcomb |
William | Annette | Edwards |
William | Annette | Holcomb |
William | James | Edwards |
William | James | Holcomb |
Which is funny, but not especially useful… ☺ The actual select needs to keep the branches separate:
select first.name as first, middle.name as middle, last.name as last
from traversal as t1 join first using(id),
traversal as t2 join middle using(id),
traversal as t3 join last on using(id), traversal as t4
where t1.enter between t4.enter and t4.exit
and t2.enter between t4.enter and t4.exit
and t3.enter between t4.enter and t4.exit and t4.id = 3;
Which produces:
first | middle | last |
---|---|---|
William | James | Holcomb |
That's really a long statement. The construction is very regular and doing it programatiacally is no problem, it cannot be scalable though. That is a four table join just for a simple representation of a name. I may add another couple fields for a name, then for an address there would be another five or six then other info… I'd easily have a fifty table join which is just absurd.
Actually, I'm not sure why I wanted to put each field in its own table. Why not do:
create table node_types (id int auto_increment primary key, name text);
create table traversal
(id int auto_increment primary key, type int not null references node_types(id),
enter int unique, exit int unique);
insert into node_types values(DEFAULT, "People");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 1, 18);
insert into node_types values(DEFAULT, "Name");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 2, 9),
(DEFAULT, LAST_INSERT_ID(), 10, 17);
create table strings (id int not null references traversal(id), value text);
insert into node_types values(DEFAULT, "First Name");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 3, 4);
insert into strings values (LAST_INSERT_ID(), "William");
insert into traversal values(DEFAULT,
(select id from node_types where name = "First Name"), 11, 12);
insert into strings values (LAST_INSERT_ID(), "Miriam");
insert into node_types values(DEFAULT, "Middle Name");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 5, 6);
insert into strings values (LAST_INSERT_ID(), "James");
insert into traversal values(DEFAULT,
(select id from node_types where name = "Middle Name"), 13, 14);
insert into strings values (LAST_INSERT_ID(), "Annette");
insert into node_types values(DEFAULT, "Last Name");
insert into traversal values(DEFAULT, LAST_INSERT_ID(), 7, 8);
insert into strings values (LAST_INSERT_ID(), "Holcomb");
insert into traversal values(DEFAULT,
(select id from node_types where name = "Last Name"), 15, 16);
insert into strings values (LAST_INSERT_ID(), "Edwards");
select t1.enter, t1.exit, node_types.name, strings.value
from traversal as t1, traversal as t2, node_types, strings
where t1.id = strings.id and t1.type = node_types.id
and t1.enter between t2.enter and t2.exit and t2.id = 2;
Which produces:
enter | exit | name | value |
---|---|---|---|
3 | 4 | First Name | William |
5 | 6 | Middle Name | James |
7 | 8 | Last Name | Holcomb |
That looks pretty good to me. I will need to add additional tables for things like dates and numbers and whatnot, but this limits the profligation of tables to the number of data types.
Now comes that fatal step that is so often the death of my projects; coding. Not that I can't, just I often get distracted and wander off before finishing. One thing I know I don't feel like doing is an implmentation of XPath; it is tedious string parsing bits for the most part and it is a pretty sophistocated language. Conveniently there is an existing PHP implementation. It doesn't use a standard DOM document (it is a very limited subset with some additions for performance), but it is constructed from a SAX parse and I can fake that pretty easily.
To start off with, I had to take the original class and reformat the comments. They were too long for my screen and the wrapping made them pretty much unreadable. I started to reformat the comments for Doxygen and found that other than the initial comment they were already done.
Ugh, while generating the documentation, I realized this:
6264 31980 276819 XPath.class.orig.php
The file is over 6kloc. That's huge for me. A big project for me is maybe 500 lines. Part of it is because the code is well commented and also because the three classes are all in the same file. The XPathEngine
class however is 4300 lines. My brain is simply not big enough to work effectively with that much information at a time.
The class structure is also not especially condusive to work. As I mentioned there are three classes:
XPathBase
: provides utility functions (string parsing, message display, etc.)XPathEngine
: handles parsing of the XML document, parsing of the XPath query and searching the treeXPath
: allows operations on the XML tree (inserts, deletes, replacements) based on XPath queriesIn my opinion the functions of these three classes are fairly distinct, but they are a single inheritance hierarchy: XPathBase
→ XPathEngine
→ XPath
.
I think that if it is not too hard, I want to shuffle the functionality into:
PHPXPathDocument
: Abstract class used to access the DOM document intersection used by the XPathEngine
FileDocument
: Implementation of PHPXPathDocument
from a file, since I will need to testDatabaseDocument
: Implementation of PHPXPathDocument
accessing my database treeXPath
: An object representing an XPath. I don't know yet the extent this can be compiled to speed repetitions of the same queryXPathEngine
: Utility class taking an XPath
and PHPXPathDocument
and returning a String
, Number
, Boolean
or NodeSet
NodeSet
: The other return values of an XPath
query are PHP literal types, but this will be an Object
based on the DOM specification