# Normalization

In relational database theory, normalization is the process of restructuring the logical data model of a database to eliminate redundancy, organize data efficiently, and reduce repeating data and to reduce the potential for anomalies during data operations. Data normalization also may improve data consistency and simplify future extension of the logical data model. The formal classifications used for describing a relational database's level of normalization are called normal forms (**NF**).

A non-normalized database can suffer from data anomalies: A non-normalized database may store data representing a particular referent in multiple locations. An update to such data in some but not all of those locations results in an update anomaly, yielding inconsistent data. A normalized database prevents such an anomaly by storing such data (i.e. data other than primary keys) in only one location.

A non-normalized database may have inappropriate dependencies, i.e. relationships between data with no functional dependencies. Adding data to such a database may require first adding the unrelated dependency. A normalized database prevents such insertion anomalies by ensuring that database relations mirror functional dependencies.

Similarly, such dependencies in non-normalized databases can hinder deletion. That is, deleting data from such databases may require deleting data from the inappropriate dependency. A normalized database prevents such deletion anomalies by ensuring that all records are uniquely identifiable and contain no extraneous information.

# Normal forms

**Normal Forms**

Edgar F. Codd originally defined the first three normal forms. The first normal form requires that tables be made up of a primary key and a number of atomic fields, and the second and third deal with the relationship of non-key fields to the primary key. These have been summarized as requiring that all non-key fields be dependent on "the key, the whole key and nothing but the key". In practice, most applications in 3NF are fully normalized. However, research has identified potential update anomalies in 3NF databases. BCNF is a further refinement of 3NF that attempts to eliminate such anomalies. The fourth and fifth normal forms (4NF and 5NF) deal specifically with the representation of many-many and one-many relationships. Sixth normal form (6NF) only applies to temporal databases.

# First normal form (1NF)

**1. First Normal form(1NF)**

First normal form (1NF) lays the groundwork for an organized database design: Ensure that each table has a primary key: minimal set of attributes which can uniquely identify a record. It states that the domain of an attribute must include only atomic values and the value of any attribute in a tuple must be single value from the domain of that attribute. It doesn’t allow nested relation. Data that is redundantly duplicated across multiple rows of a table is moved out to a separate table.

**Atomicity: **Each attribute must contain a single value, not a set of values.

E.g.: Consider a Relation Person. The person will have the attributes SSN, Name, Age, Address and College_Degree. Person

Now we can analyze this relation. Check what the possible values of each attributes are -

Here SSN and Age will have only one value for a person. But The college_Degree will have more than one value. And the address and Name of person can be divided into more than one attributes. Hence this relation is not in 1NF. So let us change this relation schema into 1NF by dividing this relation into two relations.

Name→ FName, MInit, LName

Address→ ApartmentNo, City

Person_Residence

College_Degree

SSN Name Address Age College_Degree

SSN FName LName MInit ApartmentNo City

SSN UG PG

First, the table must be in 1NF, plus, we want to make sure that every Non-Primary-Key attribute (field) is fully functionally dependent upon the ENTIRE Primary-Key for its existence. This rule ONLY applies when you have a multi-part (concatenated) Primary Key (PK).

Inventory: In this inventory table, Description combined with Supplier is our PK. This is because we have two of the same product that comes from different suppliers.

There are two non-key fields. So, we can ask the questions: If we know just Description, can we find out Cost? No, because we have more than one supplier for the same product. If we know just Supplier can we find out Cost? No, because we need to know the Item as well. Therefore, Cost is fully, functionally dependent upon the ENTIRE PK (Description-Supplier) for its existence. If we know just Description, can we find out Supplier Address? No, because we have more than one supplier for the same product. If we know just Supplier, Can we find out Supplier Address? Yes. The Address does not depend upon the Description of the item.

Therefore, Supplier Address is NOT functionally dependent upon the ENTIRE PK (Description-Supplier) for its existence. We must get rid of Supplier Address from this table.

Inventory

Supplier

Description Supplier Cost Supplier_Address

Description Supplier Cost

Name Supplier_Address

At this point, since it is the "Supplier" table, we can rename the "Supplier" filed to

"Name." Name is the PK for this new table.

# Second Normal Form

**Second Normal Form :**

A relation schema R is in second normal form (2NF) if every nonprime attribute A in R is not partially dependent on any key of R.

# Third normal form

**Third normal form (3NF)**

For 3NF, first, the table must be in 2NF, plus, we want to make sure that the non-key fields are dependent upon ONLY the PK, and not on any other field in the table. This is very similar to 2NF, except that now you are comparing the non-key fields to OTHER non-key fields.

For database to be in third normal form:

1. The database must meet all the requirements of the second normal form.

2. Any field which is dependent not only on the primary key but also on another field is moved out to a separate table.

Book

Again, just ask the questions:

By knowing #Pages we cannot find out Author's affiliation. But by knowing Author's Name, we can find out Author's affiliation Number. Therefore, Auth_Affil_No is functionally dependent upon Author's Name, not the PK for its existence.

Book

Author

Name Auth_Name #Pages Auth_Affil_No

Name Auth_Name Auth_Name ##PPaaggeess

Name Auth_Affil_No

**General Definition**

A relation schema R is in 3NF if, whenever a nontrivial functional dependency X→A

holds in R,

Either a) X is a Super key Or

b) Y is a prime attribute of R.

**i.e. A relation schema R is in 3NF if every nonprime attribute of R meets both of the following terms:**

1. It is fully functionally dependent on every key of R.

2. It is nontransitively dependent on every key of R.

# Boyce-Codd normal form (BCNF)

A row is in BCNF if and only if every determinant is a candidate key. The second and third normal forms assume that all attributes not part of the candidate keys depend on the candidate keys but does not deal with dependencies within the keys.

BCNF deals with such dependencies.

*A relation R is said to be in BCNF if whenever X -> A holds in R, and A is not in X, then X is a candidate key for R*.

BCNF covers very specific situations where 3NF misses interdependencies between non key attributes. It should be noted that most relations that are in 3NF are also in BCNF.

Infrequently, a 3NF relation is not in BCNF and this happens only if:

(a) the candidate keys in the relation are composite keys (that is, they are not single attributes),

(b) there is more than one candidate key in the relation, and

(c) the keys are not disjoint, that is, some attributes in the keys are common.

The BCNF differs from the 3NF only when there are more than one candidate keys and the keys are composite and overlapping. Consider for example, the relationship *enrol (sno, sname, cno, cname, date-enrolled)*

Let us assume that the relation has the following candidate keys:

*(sno, cno)*

*(sno, cname)*

*(sname, cno)*

*(sname, cname)*

(we have assumed *sname *and *cname *are unique identifiers). The relation is in 3NF but not in BCNF because there are dependencies

*sno -> sname*

*cno -> cname*

where attributes that are part of a candidate key are dependent on part of another candidate key. Such dependencies indicate that although the relation is about some entity or association that is identified by the candidate keys.

e.g. *(sno, cno)*, there are attributes that are not about the whole thing that the keys identify. For example, the above relation is about an association (enrolment) between students and subjects and therefore the relation needs to include only one identifier to identify students and one identifier to identify subjects. Providing two identifiers about students *(sno, sname) *and two keys about subjects *(cno, cname) *means that some information about students and subjects that is not needed is being provided. This provision of information will result in repetition of information and the anomalies. If we wish to include further information about students and courses in the database, it should not be done by includng the information in the present relation but by creating new relations that represent information about entities *student *and *subject*.

These difficulties may be overcome by decomposing the above relation in the following three relations:

*(sno, sname)*

*(cno, cname)*

*(sno, cno, date-of-enrolment)*

We now have a relation that only has information about students, another only about subjects and the third only about enrolments. All the anomalies and repetition of information have been removed.

**
**

# Multivalued Depedency

In a relational model, if all of the information about an entity is to be represented in one relation, it will be necessary to repeat all the information other than the multivalue attribute value to represent all the information that we wish to represent. This results in many tuples about the same instance of the entity in the relation and the relation having a composite key (the entity id and the mutlivalued attribute). Of course the other option suggested was to represent this multivalue information in a separate relation. The situation of course becomes much worse if an entity has more than one multivalued attributes and these values are represented in one relation by a number of tuples for each entity instance. The multivalued dependency relates to this problem when more than one multivalued attributes exist. Consider the following relation that represents an entity employee that has one mutlivalued attribute *proj*:

*emp (e#, dept, salary, proj)*

We have so far considered normalization based on functional dependencies;

dependencies that apply only to single-valued information. For example, *e# -> dept *implies only one *dept *value for each value of *e#*. Not all information in a database is single-valued, for example, *proj *in an employee relation may be the list of all projects that the employee is currently working on. Although *e# *determines the list of all projects that an employee is working on, *e# -> proj *is not a functional dependency.

We can more clearly analyze the multivalued dependency by the following example.

*programmer (emp_name, qualifications, languages)*

This relation includes two multivalued attributes of entity *programmer*; *qualifications *and *languages*. There are no functional dependencies.

The attributes qualifications and languages are assumed independent of each other. If we were to consider *qualifications *and *languages *as separate entities, we would have two relationships (one between *employees *and *qualifications *and the other between *employees *and programming *languages*). Both the above relationships are many-to-many i.e. one programmer could have several qualifications and may know several programming languages. Also one qualification may be obtained by several programmers and one programming language may be known to many programmers.

Functional dependency *A -> B *relates one value of *A *to one value of *B *while multivalued dependency *A ->> B *defines a relationship in which a set of values of attribute *B *are determined by a single value of *A*.

Now, more formally, *X ->> Y *is said to hold for *R(X, Y, Z) *if *t1 *and *t2 *are two tuples in *R *that have the same values for attributes *X *and therefore with *t1[x] = t2[x] *then *R *also contains tuples *t3 *and *t4 *(not necessarily distinct) such that:

*t1[x] = t2[x] = t3[x] = t4[x]*

*t3[Y] = t1[Y] *and *t3[Z] = t2[Z]*

*t4[Y] = t2[Y] *and *t4[Z] = t1[Z]*

In other words if *t1 *and *t2 *are given by

*t1 = [X, Y1, Z1]*, and

*t2 = [X, Y2, Z2]*

then there must be tuples *t3 *and *t4 *such that

*t3 = [X, Y1, Z2]*, and

*t4 = [X, Y2, Z1]*

We are therefore insisting that every value of *Y *appears with every value of *Z *to keep the relation instances consistent. In other words, the above conditions insist that *X alone determines Y and Z *and there is no relationship between *Y *and *Z *since *Y *and *Z *appear in every possible pair and hence these pairings present no information and are of no significance.

# Fourth Normal Form

Fourth normal form (or 4NF) requires that there must be no non-trivial multivalued dependencies of attribute sets on something other than a superset of a candidate key. A table is said to be in 4NF if and only if it is in the BCNF and multivalued dependencies are functional dependencies. The 4NF removes unwanted data structures: multivalued dependencies.

**Definition:** A relation schema R is in 4NF with respect to a set of dependencies F, if, for every non trivial multivalued dependency X ->>Y in F+, X is a superkey for R.

# Properties Of Relational Decompositions

**Properties Of Relational Decompositions**

Decomposition Property: A relation must satisfy the following two properties during decomposition.

**i. Lossless**:- A lossless-join dependency is a property of decomposition, which ensures that spurious rows are generated when relations are united through a natural join operation. i.e. The information in an instance r of R must be preserved in the instances r1, r2, r3, …..rk where ri = ΠRi (r)

Decomposition is lossless with respect to a set of functional dependencies F if, for every relation instance r on R satisfying F, R=ΠR1 (r) * ΠR2 (r) * . . . . . . . ΠRn (r)

**ii. Dependency Preserving Property: - **If a set of functional dependencies hold on R it should be possible to enforce F by enforcing appropriate dependencies on each r1 .

Decomposition D= (R1, R2, R3, ………, Rk) of schema R preserves a set of dependencies

F if,

(ΠR1 (F) U ΠR2 (F) U . . . . . . . . . . . ΠRn (F)) +=F+ΠRi(F) is the projection of F onto Ri.

i.e Any FD that logically follows from F must also logically follows from the union of projection of F onto Ri ‘S . Then D is called dependency preserving.

# Join Dependency and Fifth Normal Form

**Join Dependency and Fifth Normal Form**

Join dependency is the term used to indicate the property of a relation schema that cannot be decomposed losslesly into two relation schema, but can be decomposed losslesly into three or more simpler relation schema. It means that a table, after it has been decomposed into three or more smaller tables must be capable of being joined again on common keys to form the original table.

**Fifth normal form**

Fifth normal form (5NF and also PJ/NF) requires that there are no non-trivial join dependencies that do not follow from the key constraints. A table is said to be in the 5NF if and only if it is in 4NF and the candidate keys imply every join dependency in it.

# Goals of Normalization

**Goals of Normalization**

- Let
*R*be a relation scheme with a set*F*of functional dependencies. - Decide whether a relation scheme
*R*is in “good” form. - In the case that a relation scheme
*R*is not in “good” form, decompose it into a set of relation scheme {*R*1*, R*2*, ..., R n*} such that each relation scheme is in good form the decomposition is a lossless-join decomposition Preferably, the decomposition should be dependency preserving.

**
**

# Further Normal Forms

**Join dependencies **generalize multivalued dependencies lead to **project-join normal form (**PJNF) (also called **fifth normal form**)

- A class of even more general constraints, leads to a normal form called
**domainkey normal form**. - Problem with these generalized constraints: are hard to reason with, and no set of sound and complete set of inference rules exists.
- Hence rarely used.
- We have assumed schema
*R*is given *R*could have been generated when converting E-R diagram to a set of tables.*R*could have been a single relation containing*all*attributes that are of interest (called**universal relation**).- Normalization breaks
*R*into smaller relations. *R*could have been the result of some ad hoc design of relations, which we then test/convert to normal form.**.**

****

In short, Normalization of a Database is achieved by following a set of rules called 'forms' in creating the database.

These rules are 5 in number (with one extra one stuck in-between 3&4) and they are:

**1st Normal Form or 1NF:** Each Column Type is Unique.

**2nd Normal Form or 2NF: **The entity under consideration should already be in the 1NF and all attributes within the entity should depend solely on the entity's unique identifier.

**3rd Normal Form or 3NF: **The entity should already be in the 2NF and no column entry should be dependent on any other entry (value) other than the key for the table. If such an entity exists, move it outside into a new table.

Now if these 3NF are achieved, the database is considered normalized. But there are three more 'extended' NF for the elitist.

These are:

**BCNF (Boyce & Codd): **The database should be in 3NF and all tables can have only one primary key.

**4NF: **Tables cannot have multi-valued dependencies on a Primary Key.

**5NF: **There should be no cyclic dependencies in a composite key.

**
**