Showing posts with label data model. Show all posts
Showing posts with label data model. Show all posts

Wednesday, November 18, 2015

Data Modeling - The True Meaning of Entities

When modeling data an entity refers to a unit that has properties which we are modeling. Some of those properties of the entity are intrinsic (natural properties) and some are extrinsic (externally applied/relative properties). If we model our entities such that the intrinsic properties belong to a root model (Person) and the extrinsic properties belong to subclasses of the root model (Employee) we can establish a proper representation of actual relationships of the entities in our model.

To illustrate, let's take a common example of modeling a person who is an employee at a specific company.

The person has some natural properties: Name, DOB, Gender, etc. These properties describe the entity in its base state. As an employee, the person may have additional properties that relate specifically to their state as an employee: id, email address, extension, location, title, etc. This person will also have some personal properties that may be interesting to model: email(s), phone number(s), address(es), family, etc.

If we are modeling the data as a relational model, none of the columns in the data tables should be nullable. Nullable columns are a clear indicator that the data can be further normalized and that it does not likely represent the entities accurately.

To achieve the goal of creating our model, we need to be able to create the proper relationships:

Person - Email (1:*)
Person - Address (*:*)

These relationships bring up interesting questions (and leave ambiguity in the model).

If someone has more than one address and one primary address is a requirement (assuming we are not modeling homeless), then how does a many-to-many relationship enforce the many-to-one relationship that must exist to model the persons primary residence?  Address may be either of two types: primary xor non-primary. We could set a bit flag to identify primary. In order to enforce the *:1 (an address can be primary for more than one person) constraint with the primary, we would need an FK on the Person table and a reference to the address table. In the *:* relationship above, how can that FK be constrained to the PrimaryAddress type? The answer lay in table-per-type (TPT). We want to get to this:

 ...
Person - Address (*:*)
Pesrson - PrimaryAddress (*:1)

To implement the TPT for this model, we would need an Address table a PrimaryAddress table and a AdditionalAddress table. The Address would contain the details of the address and the PK. But each typed table would contain an FK to the Address table which is also its own PK. The Person record would contain a FK ref to the PrimaryAddress. An additional xref table would be needed to link the Address table to the Person table in general, this would be a *:* relationship. This PersonAddress table can better represent the fact that more than one Person can live at the same Address and a Person may have more than one Address. The FK to the PrimaryAddress will enforce and show that the Person must have a PrimaryAddress and the 1:1 relationship between PrimaryAddress and Address will establish which Address is that Person's primary Address.


It is important to recognize the boundaries of the entity since any entity may serve as the root. It is equally important to establish distinctions between different types/classifications of entities. Here are a few examples of these relationships:

Person (base type) -> [Employee, Parent, Volunteer, Student, Teacher] (sub-types)
Contact (base type) -> [Email, Phone -> [Internal], Skype, Signal -> [Tree, Smoke, Call]] (sub-types)
Relationship (base type) -> [Friend -> [Best, Acquaintance], Family -> [Sibling -> [Brother, Sister], Parent -> [Mother, Father], Cousin, ...] (sub-types)

In this way we can establish meaningful relationships in the data model, while accurately representing the entities we are modeling.

Wednesday, July 9, 2014

SOLID DB principles

In my last post, I rattled off a little bit about creating a db model that conforms to SOLID development principles. I presented the idea to some colleagues and it seemed to go over well. Today I will attempt to expound upon the principle and how a good database design can conform to each piece of the SOLID acronym.

S - single responsibility - each table should have a single responsibility. This means no god tables with nearly infinite nullable columns that can capture more than one type (sub-type actually) of thing.

O - open-closed - open to extension, closed to modification. Create linking tables or type tables to extend a base table, don't change the base table. Add views to create reusable components. Always create list/type tables so that new types of data can be linked to metadata. Reuse data when possible (no is no). If you really, really need to change the attributes frequently, consider EAV - but with caution.

L- Liskov Substitution - Comment is a base table, PostComment is a sub table. They map to types in object-land. Learn TPT, TPH etc.

I - interface segregation - there aren't really interfaces. Just make sure again that your tables, and perhaps views and other objects such as stored procedures/functions (especially these) aren't doing more than one thing.

D - dependency inversion (DI) - this applies more to the DAL than anything. Using DI allows for switching the database technology later, or even spread your data model across technologies. Either way all other code/modules/layers should be agnostic of the data store - abstract that baby away! One could go so far as to write adapters that adapt the db access technology to a custom interface should one want to switch from something like Entity Framework to NHibernate. EmberJS does something like that in the ember-data library. Consumers make all calls through ember-data objects, but can map the calls to whatever their storage technology requires. I truly can't think of any way to apply DI in SQL, perhaps noSQL implementations will have varying abilities to do so with map/reduce functions. I would say that the best way to practice good DI with SQL is to avoid any sort of business logic in stored procedures. I have seen some pretty tricky DI being simulated in stored procedure chains, perhaps some EXEC calls on dynamic SQL and things like that. I avoid those sorts of things like the plague these days unless it's generated SQL via EntityFramework.

Anyways, that's my attempt to SOLIDify database design. I feel kind of floppy about the whole DI rambling, if anyone has some good ideas about that one feel free to chip in!

Happy Coding!

Monday, July 7, 2014

SOLID data relationships

One approach I like to take when designing a database schema is to focus on the relationships before adding columns that are not involved in the relationships (in a relational database that is).

The structure and integrity are the most important component of a good database design. One approach is to model the data structures first. If you are working in an object oriented language you could build your objects first with minimal properties. By doing this and thinking about SOLID development principles, we can keep our database design in alignment with our code development practices.

in C#:
public class Person
{
public int PersonID {get;} {set;}
public virtual ICollection<Task> Tasks {get;} {set;}
}
or even using "class/table diagrams":

Person
-PersonID int PK
-Tasks Task[]

Task
-TaskID int PK
-Assignees Person[]

in this model we can already see that a person can have many tasks and a task can be assigned to many people. We can also see that the task class will not map directly to relational tables. There is only one task, this task can be assigned to multiple people. Therefore we will need a linking table to represent the many-to-many relationship between people and tasks. Likely we would want to collect more data about that task assignment, so probably the intermediary table should have it's own model.

TaskAssignment
-TaskAssignmentID int PK
-TaskID int FK
-PersonID int FK
-details...

Person
-PersonID int PK
-TaskAssignments TaskAssignment[]

Tasks
-TaskID int PK
-TaskAssignments TaskAssignment[]

now we can track each assignment and later add props like CompletedDate, hook up another link table to comments so you can save multiple comments with metadata related to each one.

While many-to-many relationships with a single comments table would work, there is a slight issue with data-integrity. A single comment should relate to a single TaskAssignment, and there should be zero-to-many Comments for a TaskAssignment. Or anything we'd like to tie comments to. Note that we might be tempted create a combo PK on TaskAssignments using TaskID and PersonID and do away with TaskAssignmentID, but what if it's a long running task and a person can be assigned to it again?

One way is to add a comments table each time we need comments to relate to another entity. Our query, should we want to view all comments related to some specific root would be really complicated. Another approach would be to have a single Comments entity with a CommentType attribute and whenever we need to add comments to some other table we can add an entity that relates to a specific comment and the target table.

Comments
-CommentID int PK
-CommentType CommentType FK
-details...

PersonComments
-CommentID int PK, FK
-PersonID int FK

TaskComments
-CommentID int PK, FK
-TaskID int FK

TaskAssignmentComments
-CommentID...etc

now we don't have to violate Open-Closed principle to add comments, our comments can be typed in code using TPT inheritance (Comments is the abstract base), and we maintain data-integrity.

The CommentType is so that we can view the Comments table in a query and know what type of entity each one relates to without joining all ___Comments tables in the query. We will have to if we want to know which entity each comment relates to. In .NET, EntityFramework helps significantly with this.

Hope anyone who gets this far will have some neurons firing with regards to structuring a database to support SOLID development practices.