Friday, August 29, 2014

Efficient SQL searching

If you don't want to use full-text indexing, or you need to order your search results by criteria (which columns take precedence), here is a way to accomplish that with decent efficiency.

Suppose we have a People table in a SQL Server database and we want to find people by last name, then first name, then email address.

We can write the query as:

WITH lastNameBegins AS
(
    -- Criteria 1: last name takes precedence
    SELECT id,
        1 AS criteriaNum,
        ROW_NUMBER() OVER (ORDER BY lastName) AS rowNum
    FROM People
    WHERE lastName LIKE @searchTerm + '%'
),
firstNameBegins AS
(
    -- Criteria 2: first name
    SELECT id,
        2 AS criteriaNum,
        ROW_NUMBER() OVER (ORDER BY firstName) AS rowNum
    FROM People
    WHERE firstName LIKE @searchTerm + '%'
),
emailBegins AS
(
    -- Criteria 3: email address
    SELECT id,
        3 AS criteriaNum,
        ROW_NUMBER() OVER (ORDER BY email) AS rowNum
    FROM People
    WHERE email LIKE @searchTerm + '%'
)
SELECT p.id, p.lastName, p.firstName, p.email
FROM People AS p
JOIN (
    -- A person can match more than one criterion; keep the minimum values per id
    SELECT id, MIN(criteriaNum) AS criteriaNum, MIN(rowNum) AS rowNum
    FROM (
        SELECT * FROM lastNameBegins
        UNION ALL  -- rows from different CTEs always differ in criteriaNum, so UNION's de-duplication buys nothing
        SELECT * FROM firstNameBegins
        UNION ALL
        SELECT * FROM emailBegins
    ) AS a
    GROUP BY id
) AS o
ON p.id = o.id
ORDER BY o.criteriaNum, o.rowNum;

And now you have a search that is ordered by criteria. The CTEs define the ranking of the criteria; we union them together, and since a person could match more than one criterion, we group the duplicates out and take the minimum values for our ordering. I suggest limiting the size of the inner set 'a' for efficiency's sake, as sketched below.
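One way to do that limiting, assuming a @maxPerCriteria parameter (my name, not part of the query above), is to cap each CTE with TOP. A minimal sketch of the first CTE, runnable on its own:

DECLARE @searchTerm varchar(50) = 'Smi';  -- hypothetical sample values
DECLARE @maxPerCriteria int = 100;        -- cap per criterion, tune to taste

WITH lastNameBegins AS
(
    SELECT TOP (@maxPerCriteria) id,
        1 AS criteriaNum,
        ROW_NUMBER() OVER (ORDER BY lastName) AS rowNum
    FROM People
    WHERE lastName LIKE @searchTerm + '%'
    ORDER BY lastName  -- makes TOP deterministic: keep the first N matches alphabetically
)
SELECT id, criteriaNum, rowNum FROM lastNameBegins;

Repeat the TOP clause in the other two CTEs, and the inner set 'a' can never exceed three times the cap.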

Monday, August 25, 2014

Break from the Switch-Case

If you find yourself reacting to states via switch-case statements, consider using the GoF pattern called the State Pattern, described in C# here:
http://www.codeproject.com/Articles/489136/UnderstandingplusandplusImplementingplusStateplusP

Basically, it uses state objects that know how to transition to other states. The main object holds a reference to the current state. A state manager could offer more extensibility by handling the transitions between states instead of the state objects themselves. Another option is a manager that wraps the main object and the state objects; a memento could then hold the previous states. Whatever the situation dictates, break from switch-case.
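Here is a minimal C# sketch of the basic pattern; the Order/IOrderState names are my own invention, not from the linked article:

using System;

// Each state knows its own name and which state comes next.
public interface IOrderState
{
    string Name { get; }
    IOrderState Next();
}

public class DraftState : IOrderState
{
    public string Name { get { return "Draft"; } }
    public IOrderState Next() { return new SubmittedState(); }
}

public class SubmittedState : IOrderState
{
    public string Name { get { return "Submitted"; } }
    public IOrderState Next() { return new CompletedState(); }
}

public class CompletedState : IOrderState
{
    public string Name { get { return "Completed"; } }
    public IOrderState Next() { return this; }  // terminal state: nowhere left to go
}

// The main object delegates to its current state instead of switching on an enum.
public class Order
{
    private IOrderState _state = new DraftState();

    public string Status { get { return _state.Name; } }
    public void Advance() { _state = _state.Next(); }
}

public static class Program
{
    public static void Main()
    {
        var order = new Order();
        Console.WriteLine(order.Status);  // Draft
        order.Advance();
        Console.WriteLine(order.Status);  // Submitted
        order.Advance();
        Console.WriteLine(order.Status);  // Completed
    }
}

The switch-case disappears: adding a new state means adding a class, not editing every branch that inspects an enum.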

Friday, August 22, 2014

Articles on Technical Debt

Here is a great article about the true cost of Technical Debt by Aaron Erickson. The author gives a good example of the cost over time and relates it to Financial Debt in a way that should make sense to management and appropriately supports the metaphor.

I read another article of his that talks about killing the project in favor of small releases. That was great too and can be found on the same site.

When I read the example of commissions, I immediately thought of a commission service that could be called from anywhere on the intranet to calculate commissions for the company. Then I thought of reporting, and of SSRS in particular, and realized how far behind a reporting technology is when it cannot consume services as data sources.

Wouldn't it be great to write a business rule in a first-class language and have all consumers call that centralized logic to do the processing? It would be contained in a service, of course, not in a DLL or some other form of shared code; if it were, each place that uses that code would have to update in order to pick up the new rule. So your change would still involve some Technical Debt, in the form of Architectural Debt (your architecture adds future work to changing some core business logic that is likely to change). IDesign refers to this likelihood of change as volatility.

This volatility factor is a natural occurrence in software, whether recognized or not. Martin Fowler points out that this type of ignorance is reckless; it is like going on a shopping spree and maxing out credit cards that you don't know you have to pay back. Then come the calls from the bill collectors: "The business needs to change xyz factor to yyz factor, and now!" So now you don't sleep for a week, and everyone is run ragged and angry, because that inexperience cost exactly the price you didn't plan for.

In the case of Aaron's article, huge deals can go awry, and what was once an exciting merger can turn into a costly technical nightmare. I would like to hear of an example where this was the case. Office Depot and OfficeMax merged recently and stated the projected costs and benefits of the synergies of the merger. I wonder how that is going from a technical standpoint...

If I find anything, I'll surely have to blog on it.

Wednesday, August 20, 2014

Data Modelling and DB design


In a SQL Server Pro article from 2005, written by Michelle A. Poolet, I encountered some great advice and a great example of how to normalize an existing table that is in first normal form (all the data in one table).

Not to pick on or criticize the author, but I will take the normalization further for this database: from second normal form (where the author ended up) closer to third normal form, by creating lookup tables for wood type, finish, and size, and a whole separate root for shipping. As the author states, we don't know enough about the business process to do this in the real world; I will do it anyway to illustrate the potential (after collecting proper usage details, of course).

The example is for a custom furniture shop, which means that they use wood and other supplies to build furniture. They probably have some standard wood types that they use and possibly keep stock of some common types. If the database were expanded to include wood type, a lookup table would be really useful to link supply to demand.

I know, I know... build for current use cases, not future use cases. I hear you loud and clear. But let's just assume, for the sake of this exercise, that they have some standard wood types, that we've done our due diligence, and that we are ready to take it to the next level.

If wood type alone becomes a lookup table, the root table loses a varchar(30) [2B + nB, n = len(x)] and gains a smallint [2B]. Assuming the type name is 15 chars on average, the varchar costs 17B per record, so swapping it for the 2B key saves 15B, shrinking a 72B row to 57B. A SQL Server page holds roughly 8,060B of row data, so we go from about 111 rows per page to about 141, over 25% more. If we offloaded ALL of the varchar cols to other tables, we would have even better numbers here.
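A sketch of that change, with hypothetical table and column names (the article's actual schema may differ):

-- Lookup table for the standard wood types
CREATE TABLE WoodType
(
    woodTypeId smallint IDENTITY(1,1) PRIMARY KEY,
    name       varchar(30) NOT NULL UNIQUE
);

-- The root table swaps its varchar(30) column for a 2-byte key
ALTER TABLE FurnitureOrder ADD woodTypeId smallint
    CONSTRAINT FK_FurnitureOrder_WoodType REFERENCES WoodType (woodTypeId);

-- After backfilling woodTypeId from the old column:
-- ALTER TABLE FurnitureOrder DROP COLUMN woodTypeName;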

When we pull those values off into other tables, we will have to join to them, but as long as the record counts in those tables stay small (most of the values are standard anyway, so these might be lookup tables that new values get added to), they should stay loaded in the cache and be reused well. They may not even be needed every time users access the data. Again, I don't have the use cases, but I'm making reasonable assumptions about them.

I look at any kind of notes or comments field as something that should always be its own table. Refer to the ProcessOrderNotes column in the ProcessOrderStatus table in Figure 2 of the article. Each comment can be related to other tables in a number of ways, either directly via an FK to the note or to multiple entities via linking tables. In this way, the comments or notes become easier to query when someone wants to search for something specific within a note. Eventually that will be a use case, so why not structure the data model to be more adaptable to begin with?

Think open-closed principle here, where the model is open to extension, closed to modification. If you have a new table that requires notes, you can just add an FK or a new linking table, depending on what is needed. Now the app code can go on calling its addNote function, get the PK, and write it to the table as an FK as needed. No reason to reinvent that. Now suppose that all notes need a timestamp and username. Guess what? Just add a table related to the Notes table and update the code to save to it each time a note is updated. Or store a new note record each time for auditing purposes and show them all (use cases?). A sketch of that shape follows.
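Hypothetical names throughout (including the ProcessOrder table the link points at); the point is the shape, not the specifics:

-- One Note table shared by anything that needs notes
CREATE TABLE Note
(
    noteId   int IDENTITY(1,1) PRIMARY KEY,
    noteText varchar(max) NOT NULL
);

-- A linking table ties notes to one kind of entity; new entities get new links
CREATE TABLE ProcessOrderNote
(
    processOrderId int NOT NULL REFERENCES ProcessOrder (processOrderId),
    noteId         int NOT NULL REFERENCES Note (noteId),
    PRIMARY KEY (processOrderId, noteId)
);

-- New requirement (timestamp and username)? Extend with a related table
-- instead of modifying Note itself
CREATE TABLE NoteAudit
(
    noteAuditId int IDENTITY(1,1) PRIMARY KEY,
    noteId      int NOT NULL REFERENCES Note (noteId),
    editedBy    varchar(50) NOT NULL,
    editedAt    datetime NOT NULL DEFAULT GETDATE()
);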

Now I'm getting carried away, so I best stop here.

Friday, August 15, 2014

RDBMS Surrogate v Natural Key

An integral part of transactional relational databases is the Primary Key (PK). A PK is a value that identifies a data row and that related tables reference via a Foreign Key (FK).

Four standard criteria are commonly used to measure a PK.

1. Is it unique?
2. Is it immutable?
3. Is it minimal?
4. Is it never null?

A key can consist of one or more columns in a table as long as the combination meets the criteria.

My recommendation is that you stick to a surrogate key if you're not sure. Here's why:

Natural keys chosen under the assumption that some business rule will never change can cause significant problems later when it does.

Most arguments for natural keys place emphasis on ease of use. But if you find yourself joining on multiple columns because your PK is a composite key, you lose that value. Create views to ease the suffering of joining a few more tables in your queries, as sketched below.
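A minimal sketch, assuming hypothetical Person and State tables that both use surrogate keys:

-- The view hides the extra join that surrogate keys introduce
CREATE VIEW PersonWithState AS
SELECT p.id, p.firstName, p.lastName, s.stateCode, s.stateName
FROM Person AS p
JOIN State AS s ON s.stateId = p.stateId;

-- Consumers query the view as if it were one table:
-- SELECT lastName, stateCode FROM PersonWithState WHERE stateName = 'California';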


The exception I would make is for SOME lookups, especially when a well-defined code is used as the natural key. But even then, things change all the time. Suppose you assumed that the US state codes were immutable, or telephone area codes. You would have to write scripts to update all those FKs that use 'Calif', or use some hacked-in solution to fix your data and update every app that consumes the data to support the hack. Contain volatility in your data! The cleanup looks something like the sketch below.
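A hedged sketch of that cleanup when the code itself is the PK (hypothetical tables; with a surrogate key, none of this is necessary):

-- Natural-key pain: every referencing row must change.
-- Without ON UPDATE CASCADE, the new parent code has to exist first:
INSERT INTO State (stateCode, stateName) VALUES ('CA', 'California');

UPDATE Person       SET stateCode = 'CA' WHERE stateCode = 'Calif';
UPDATE Supplier     SET stateCode = 'CA' WHERE stateCode = 'Calif';
UPDATE ShippingRate SET stateCode = 'CA' WHERE stateCode = 'Calif';
-- ...and every other table carrying that FK...

DELETE FROM State WHERE stateCode = 'Calif';

-- Surrogate-key version: one row changes and no FKs are touched.
-- UPDATE State SET stateCode = 'CA' WHERE stateId = 5;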

Process v Procedure

Today I learned the difference between a business process and a business procedure. Looking through a few sources on the web, I found that a process is made up of many procedures and tasks and can cross departments, while a procedure is a set of instructions to complete one unit of work. I would equate these to separation of concerns in software.

Think of an enterprise application environment where there are several standard concerns: application access security, sending emails, logging errors, reporting on data, saving transactional data, sending data to an external destination, and so on. These are reusable procedures. A process collects them into something meaningful.

Since most common concerns can be thought of as procedures, it is helpful to treat procedures as abstractions when describing a process.

Consider the task of explaining the process to a non-technical person or someone outside your organization. The details of the procedures are specific to your domain, and they could change without the overall process changing. Is it a process or a procedure? Could it be changed without affecting the path from point A to point B? Can parts of the process be standardized or abstracted into procedures?

For a more concrete example, a procedure could be "communicate with someone". Some criteria may change the type of communication, but the end result is communication. Perhaps the communication is an email; it could be a phone call, a text, a conversation in person, etc. The point is that a communication needs to take place, and the details are not a concern of the process.

Perhaps some protocol will define what steps to follow for the communication procedure. Perhaps the communication needs to be saved for auditing purposes; that would limit the possible forms of communication. In any case, those are procedural details. The process is about linking procedures.

Wednesday, August 13, 2014

Process Documentation Style

I find that I post more on planning and management than on technical topics; this stems from current efforts more than anything.
The main thing I want to cover is that a process is repeatable. If it's not understood and documented, then there ain't no process. Or, as I just found out, it's the components of the business process that do not exist.

http://m.businessdictionary.com/definition/process.html

However, I would state that a process that differs for each individual or each instance of input cannot be captured by a single meaningful definition. To understand a current process in that state, much effort is required to study it from each point of view.

Until a consistent process has been defined, the de facto process is to collect whatever procedures make sense at the time in order to achieve the outcome.

In some cases, perhaps in a results-oriented environment, ad hoc may be the best approach. This style of process management relies on the individual more than the structure of the process.

There are certain risks in taking such an approach, including unrepeatable results and inconsistent data for analysis. Another drawback is the increased risk of taking on technical debt, depending on the individual.

The benefit is that the end result can be achieved without following a possibly complex and lengthy chain of procedures; results depend on the individuals involved.

A well-defined process can be implemented in a process-oriented environment. This style is suitable for consistent, repeatable behavior, as long as people follow the process.

The major risks are that a process may be ill-defined, overbearing, or not well understood. It may stifle creativity, and it may lead people to go off-process if it is not understood, is too complex to follow, or is too slow for the situation.

The benefits can be great once data is tracked and proper analysis techniques are applied. Resources can be directed according to the analysis, and long-term planning can be better achieved.

Possibly one of the most important aspects of defining a process is keeping the long-term goals in mind. If the process becomes heavy for the sake of process, it may do more damage than good.

Sometimes tools can limit the performance of a process. Appropriate tooling should be applied to the appropriate process; the process should not be limited or hindered in effectiveness by its tools.