Question:

Can a database contain two identical records without a negative effect on the integrity of the database? Why or why not?

Response:

I think this can be a complex question and needs to be qualified a bit. Not all databases are relational, a database could be comprised of a single table and single field with multiple rows which contain binary responses, something like “agree or disagree” responses to a question for something like sentiment analysis. An example here would be a question posed on a website where users are asked to “agree or disagree”. The user’s response is stored in a database table, and a query is used to count the “agree and disagree” responses.

An example of this is represented by the following sql statements:

sqlite> — create table
sqlite> create table sentiment (answer text);
sqlite>
sqlite> — insert data into table
sqlite> insert into sentiment (answer) values (‘agree’);
sqlite> insert into sentiment (answer) values (‘agree’);
sqlite> insert into sentiment (answer) values (‘agree’);
sqlite> insert into sentiment (answer) values (‘agree’);
sqlite> insert into sentiment (answer) values (‘agree’);
sqlite> insert into sentiment (answer) values (‘agree’);
sqlite> insert into sentiment (answer) values (‘disagree’);
sqlite> insert into sentiment (answer) values (‘disagree’);
sqlite> insert into sentiment (answer) values (‘disagree’);
sqlite>
sqlite> — query all records
sqlite> select * from sentiment;
agree
agree
agree
agree
agree
agree
disagree
disagree
disagree
sqlite>
sqlite> — query total number of responses
sqlite> select count(answer) from sentiment;
9
sqlite> — query total number or agree responses
sqlite> select count(answer) from sentiment where answer = (‘agree’);
6
sqlite> — query total number or disagree responses
sqlite> select count(answer) from sentiment where answer = (‘disagree’);
3

Above a DB table called “sentiment” is created with one field “answer”. Inserts represent data being inserted into the table “sentiment” and field “answer” to create records. The data is then used to calculate the total number of respondents, the number of respondents that agree and the number or respondents that disagree.

This is a simple example of a DB which stores data which can be mined to gather sentiment regarding the question posed to the user. In the example above there were nine total respondents, six who agree and three who disagree. I this case duplicate records are acceptable and expected with the goal of recording all responses and tabulating a count of each response type, database integrity is not affected negatively.

When the question is asked in the context of a relational database (RDBMS) the assumption that we can make is that there are relationships which are created between tables and these relationships likely rely on a unique identifier (a primary key) to ensure that records can be uniquely identified. The confusion that can be created when duplicate records exist can be demonstrated by a SQL update.

sqlite> — example of poorly designed db table with duplicate records
sqlite>
sqlite> — create table foo
sqlite> create table foo (name text, age integer);
sqlite>
sqlite> — insert data into table
sqlite> insert into foo (name,age) values (‘John’,10);
sqlite> insert into foo (name,age) values (‘Joe’,20);
sqlite> insert into foo (name,age) values (‘Jane’,30);
sqlite>
sqlite> — query db table
sqlite> select * from foo;
John|10
Joe|20
Jane|30
sqlite> select * from foo where (name) = (‘John’);
John|10
sqlite>
sqlite> — create duplicate records
sqlite> insert into foo (name,age) values (‘John’,10);
sqlite> insert into foo (name,age) values (‘Joe’,20);
sqlite>
sqlite> — query db table
sqlite> select * from foo;
John|10
Joe|20
Jane|30
John|10
Joe|20
sqlite> select * from foo where (name) = (‘John’);
John|10
John|10
sqlite>
sqlite> — create new record;
sqlite> insert into foo (name,age) values (‘Bob’,30);
sqlite>
sqlite> — query db table
sqlite> select * from foo;
John|10
Joe|20
Jane|30
John|10
Joe|20
Bob|30
sqlite> select * from foo where (age) = (30);
Jane|30
Bob|30
sqlite>
sqlite> — update John’s age to 40
sqlite> update foo set (age) = (40) where (name) = (‘John’);
sqlite>
sqlite> — query db table
sqlite> select * from foo;
John|40
Joe|20
Jane|30
John|40
Joe|20
Bob|30
sqlite> select * from foo where (name) = (‘John’);
John|40
John|40
sqlite>

Above we can see a table called “foo” consisting of two fields “name” and “age” is created, this table has names and ages added to it, with the record “John | 10” and “Joe | 20” being duplicated. An update is made to the database to change John’s age from 10 to 40. This update impacts all John’s records because there is not a unique identifier in the record which can be used. While in the previous example where I stored information for sentiment analysis I showed that it is possible to have a database where integrity is not impacted by duplicate records, in general, this is a poor design choice and can be easily fixed with the addition of a primary key.

Below you will see the subtle but powerful difference that a primary key offers.

sqlite> — proprely designed db table avoids duplicate records
sqlite>
sqlite> — create table foo with autoincrementing primary key
sqlite> create table foo (id integer primary key autoincrement, name text, age integer);
sqlite>
sqlite> — insert data into table
sqlite> insert into foo (name,age) values (‘John’,10);
sqlite> insert into foo (name,age) values (‘Joe’,20);
sqlite> insert into foo (name,age) values (‘Jane’,30);
sqlite>
sqlite> — query db table
sqlite> select * from foo;
1|John|10
2|Joe|20
3|Jane|30
sqlite> select * from foo where (name) = (‘John’);
1|John|10
sqlite>
sqlite> — create duplicate records
sqlite> insert into foo (name,age) values (‘John’,10);
sqlite> insert into foo (name,age) values (‘Joe’,20);
sqlite>
sqlite> — query db table
sqlite> select * from foo;
1|John|10
2|Joe|20
3|Jane|30
4|John|10
5|Joe|20
sqlite> select * from foo where (name) = (‘John’);
1|John|10
4|John|10
sqlite>
sqlite> — create new record;
sqlite> insert into foo (name,age) values (‘Bob’,30);
sqlite>
sqlite> — query db table
sqlite> select * from foo;
1|John|10
2|Joe|20
3|Jane|30
4|John|10
5|Joe|20
6|Bob|30
sqlite> select * from foo where (age) = (30);
3|Jane|30
6|Bob|30
sqlite>
sqlite> — update John’s age to 40 where id = N
sqlite> update foo set (age) = (40) where (id) = (4);
sqlite>
sqlite> — query db table
sqlite> select * from foo;
1|John|10
2|Joe|20
3|Jane|30
4|John|40
5|Joe|20
6|Bob|30
sqlite> select * from foo where (name) = (‘John’);
1|John|10
4|John|40
sqlite>

I the above example additional field “id” is added as a primary key, this is not a user entered field but a field that is auto generated and guarantees that each record is unique and uniquely identifiable. This subtle design change allows the manipulation of a John with the ID = 4. While the sentiment example I gave works as is and there is no posed threat to data integrity, the addition of a unique id as a primary key would be a welcomed and desirable design change.

My apologies for all the SQL but I thought the best way to convey my thoughts would be to use examples. I think the easy answer here would have been just to say NO, that a database can NOT contain two identical records without a negative effect on the integrity of the database, but it think the answer is “it depends”. With this said I think it is a best practice to have a way to uniquely identify database records because the inability to manipulate data programmatically can create some serious issues. Additionally, schema extensions or changes are very difficult when a primary key does not exist. Finally the inability to uniquely identify records or elements greatly impacts the ability to apply security paradigms.

References

Pfleeger, C. P., Pfleeger, S. L., & Margulies, J. (2015). Security in computing (5th ed.). Upper Saddle River: Prentice Hall.

SQLite Home Page. (n.d.). Retrieved March 28, 2017, from https://www.sqlite.org/

SQL code used in above examples:

— example of poorly designed db table with duplicate records

— create table foo
create table foo (name text, age integer);

— insert data into table
insert into foo (name,age) values (‘John’,10);
insert into foo (name,age) values (‘Joe’,20);
insert into foo (name,age) values (‘Jane’,30);

— query db table
select * from foo;
select * from foo where (name) = (‘John’);

— create duplicate records
insert into foo (name,age) values (‘John’,10);
insert into foo (name,age) values (‘Joe’,20);

— query db table
select * from foo;
select * from foo where (name) = (‘John’);

— create new record;
insert into foo (name,age) values (‘Bob’,30);

— query db table
select * from foo;
select * from foo where (age) = (30);

— update John’s age to 40
update foo set (age) = (40) where (name) = (‘John’);

— query db table
select * from foo;
select * from foo where (name) = (‘John’);

— properly designed db table avoids duplicate records

— create table foo with autoincrementing primary key
create table foo (id integer primary key autoincrement, name text, age integer);

— insert data into table
insert into foo (name,age) values (‘John’,10);
insert into foo (name,age) values (‘Joe’,20);
insert into foo (name,age) values (‘Jane’,30);

— query db table
select * from foo;
select * from foo where (name) = (‘John’);

— create duplicate records
insert into foo (name,age) values (‘John’,10);
insert into foo (name,age) values (‘Joe’,20);

— query db table
select * from foo;
select * from foo where (name) = (‘John’);

— create new record;
insert into foo (name,age) values (‘Bob’,30);

— query db table
select * from foo;
select * from foo where (age) = (30);

— update John’s age to 40 where id = N
update foo set (age) = (40) where (id) = (4);

— query db table
select * from foo;
select * from foo where (name) = (‘John’);

— sentiment analysis

— create table
create table sentiment (answer text);

— insert data into table
insert into sentiment (answer) values (‘agree’);
insert into sentiment (answer) values (‘agree’);
insert into sentiment (answer) values (‘agree’);
insert into sentiment (answer) values (‘agree’);
insert into sentiment (answer) values (‘agree’);
insert into sentiment (answer) values (‘agree’);
insert into sentiment (answer) values (‘disagree’);
insert into sentiment (answer) values (‘disagree’);
insert into sentiment (answer) values (‘disagree’);

— query all records
select * from sentiment;

— query total number of responses
select count(answer) from sentiment;
— query total number or agree responses
select count(answer) from sentiment where (answer) = (‘agree’);
— query total number or disagree responses
select count(answer) from sentiment where (answer) = (‘disagree’);