Blog

CVS account change in Eclipse

We were setting up a new development environment for our team using Eclipse and had checked out the code from CVS. At the end we hit an issue when we tested how other developers would switch the CVS account to their own: Eclipse would not let us change it. I then found that the account has to be changed in the CVS meta files. The command I used to replace the account was:

find . -regex '.*CVS/Root' -print0 | xargs -0 perl -p -i.orig -e 's/olduser/newuser/;'

This replaced the CVS user name in every CVS/Root meta file in the whole project hierarchy, keeping a .orig backup of each file 🙂
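For readers who prefer a script over a one-liner, here is a minimal Python sketch of the same idea. The function name replace_cvs_user is my own, and unlike perl -i.orig it only writes a backup for files it actually changes. Treat it as an illustration under those assumptions, not as the command I ran.

```python
from pathlib import Path


def replace_cvs_user(project_root, old_user, new_user):
    """Rewrite every CVS/Root file under project_root and return the changed paths."""
    changed = []
    for root_file in Path(project_root).rglob("Root"):
        # Only touch regular files named Root that sit directly inside a CVS directory.
        if root_file.parent.name != "CVS" or not root_file.is_file():
            continue
        text = root_file.read_text()
        if old_user not in text:
            continue
        # Keep the original content next to the file as Root.orig.
        root_file.with_name(root_file.name + ".orig").write_text(text)
        # Replace the first match on each line, like s/olduser/newuser/ without /g.
        new_text = "".join(
            line.replace(old_user, new_user, 1)
            for line in text.splitlines(keepends=True)
        )
        root_file.write_text(new_text)
        changed.append(root_file)
    return changed
```

Calling replace_cvs_user(".", "olduser", "newuser") from the project directory would mirror the command above.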

Multiple MySQL on single host

Sometimes we need to run multiple MySQL servers on a single machine. This is mostly required in testing environments, to test different aspects with different configurations. This way one can test one server without affecting the others. If you want to run multiple MySQL servers, you can use MySQL Sandbox, which eases the whole process of installing and configuring a server. Here is how you do it.

First of all you need to install MySQL Sandbox. You can download it from https://launchpad.net/mysql-sandbox.

Then you need a tarball of the MySQL server, which you can download from the MySQL site.

After installing MySQL Sandbox you can run the following script to install MySQL.

make_sandbox /path/to/mysql-X.X.XX-osinfo.tar.gz

This script shows you some information, such as the port, user name, and password, which you can use to log in to MySQL after installation. After you confirm, it installs and starts MySQL. That’s it! You are up and running.

If you want to install another MySQL you can just run the following command.

make_sandbox /path/to/mysql-X.X.XX-osinfo.tar.gz --check_port

The --check_port option looks for the first available port so the new server can install and run on it. By default the MySQL version is used as the port. For example, MySQL version 4.1.20 will run on port 4120, and if that port is not available it will try 4121.
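The port rule described above can be modelled in a few lines of Python. This is not MySQL Sandbox’s own code; the function names and the set of taken ports are assumptions made only to illustrate the rule as this post describes it.

```python
def default_port(version):
    """Turn a version string such as '4.1.20' into the port 4120 by dropping the dots."""
    return int(version.replace(".", ""))


def first_free_port(start, taken_ports):
    """Return the first port at or above start that is not in taken_ports."""
    port = start
    while port in taken_ports:
        port += 1
    return port


# A second 4.1.20 sandbox while port 4120 is already in use.
second_sandbox_port = first_free_port(default_port("4.1.20"), {4120})
```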

MySQL Sandbox provides other useful scripts to manage the server. So installing and running multiple MySQL, even different versions, is that easy 🙂

You can find the complete documentation at http://forge.mysql.com/wiki/MySQL_Sandbox#Single_server_sandbox.

Why code bad?

I was reading the book Clean Code: A Handbook of Agile Software Craftsmanship, from the Robert C. Martin Series. A section in chapter one caught my attention and reminded me of the code I have seen and the reasons I was given for writing bad code and designing bad solutions. I think it is a good argument against those reasons, so I am quoting it here from the book.

“Have you ever waded through a mess so grave that it took weeks to do what should have taken hours? Have you seen what should have been a one-line change, made instead in hundreds of different modules? These symptoms are all too common. Why does this happen to code? Why does good code rot so quickly into bad code? We have lots of explanations for it. We complain that the requirements changed in ways that thwart the original design. We bemoan the schedules that were too tight to do things right. We blather about stupid managers and intolerant customers and useless marketing types and telephone sanitizers. But the fault, dear Dilbert, is not in our stars, but in ourselves. We are unprofessional.
This may be a bitter pill to swallow. How could this mess be our fault? What about the requirements? What about the schedule? What about the stupid managers and the useless marketing types? Don’t they bear some of the blame?
No. The managers and marketers look to us for the information they need to make promises and commitments; and even when they don’t look to us, we should not be shy about telling them what we think. The users look to us to validate the way the requirements will fit into the system. The project managers look to us to help work out the schedule. We are deeply complicit in the planning of the project and share a great deal of the responsibility for any failures; especially if those failures have to do with bad code! “But wait!” you say. “If I don’t do what my manager says, I’ll be fired.” Probably not. Most managers want the truth, even when they don’t act like it. Most managers want good code, even when they are obsessing about the schedule. They may defend the schedule and requirements with passion; but that’s their job. It’s your job to defend the code with equal passion.
To drive this point home, what if you were a doctor and had a patient who demanded that you stop all the silly hand-washing in preparation for surgery because it was taking too much time? Clearly the patient is the boss; and yet the doctor should absolutely refuse to comply. Why? Because the doctor knows more than the patient about the risks of disease and infection. It would be unprofessional (never mind criminal) for the doctor to comply with the patient.
So too it is unprofessional for programmers to bend to the will of managers who don’t understand the risks of making messes.”

Breaking problem into code

Let us consider an example. Suppose you have two boxes, BoxA and BoxB, and BoxA holds 5 balls of different colors (red, blue, green, yellow, and white). You want to move one ball, say the red one, from BoxA to BoxB. What will the steps be? If you can make a flow chart of it, good; if you find that difficult, you need to do some hard work to become a programmer.

I asked someone this, and guess how he solved it. Here is his solution: take all the balls out of BoxA, pick the red ball, put it in BoxB, and then put the remaining balls back in BoxA.

Fine! It worked. But why would you take out all the balls if you can take out just the red ball and move it to the other box?

Let’s move this to the programming side. In the scenario above we will have two database tables, boxes and balls. boxes has boxid as PK, name, and other details; we keep it simple so we don’t get lost in those details. balls has ballid as PK, name, boxid as FK, and other details. The boxes table has two records, (1, BoxA) and (2, BoxB). The balls table has 5 records with ids 1 to 5 and red, blue, green, yellow, and white as their names, and all balls have 1 in the boxid field, which is the FK to the boxes table.

Now if we adopt the first solution, it will run the following queries:

SELECT * FROM balls WHERE boxid = 1; /* so we have a list of all balls before deleting them from the database */
DELETE FROM balls WHERE boxid = 1;
INSERT INTO balls (ballid, name, boxid) VALUES (1, 'red', 2);
INSERT INTO balls (ballid, name, boxid) VALUES
(2, 'blue', 1),
(3, 'green', 1),
(4, 'yellow', 1),
(5, 'white', 1);

It gets even worse if we use a separate insert query for each ball. In total this takes between 4 and 7 queries.

Now let’s see what happens if we just perform operation on the red ball and don’t touch other balls.

UPDATE balls SET boxid = 2 WHERE ballid = 1; /* ball with id 1 is the red ball */

Wow, just one query and we’re done! We saved 3 to 6 queries. Imagine this scenario on a high-traffic website or any other application: we can improve performance considerably just by implementing the correct logic to solve a problem.
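To make the comparison concrete, here is a small sketch that runs both approaches against an in-memory SQLite database and counts the statements. SQLite stands in for the real database, and the helper names are my own; the tables and data follow the boxes and balls example above.

```python
import sqlite3


def make_db():
    """Create the boxes/balls example with all five balls in BoxA."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE boxes (boxid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE balls (ballid INTEGER PRIMARY KEY, name TEXT,
                            boxid INTEGER REFERENCES boxes (boxid));
        INSERT INTO boxes VALUES (1, 'BoxA'), (2, 'BoxB');
        INSERT INTO balls VALUES (1, 'red', 1), (2, 'blue', 1), (3, 'green', 1),
                                 (4, 'yellow', 1), (5, 'white', 1);
    """)
    return conn


def move_by_emptying_box(conn, ballid, source_box, target_box):
    """First solution: take every ball out, then put them back one by one."""
    rows = conn.execute(
        "SELECT ballid, name FROM balls WHERE boxid = ?", (source_box,)
    ).fetchall()
    conn.execute("DELETE FROM balls WHERE boxid = ?", (source_box,))
    statements = 2
    for bid, name in rows:
        box = target_box if bid == ballid else source_box
        conn.execute("INSERT INTO balls VALUES (?, ?, ?)", (bid, name, box))
        statements += 1
    return statements


def move_single_ball(conn, ballid, target_box):
    """Second solution: touch only the ball that moves."""
    conn.execute("UPDATE balls SET boxid = ? WHERE ballid = ?", (target_box, ballid))
    return 1


def layout(conn):
    """Return (ballid, boxid) pairs so both results can be compared."""
    return conn.execute("SELECT ballid, boxid FROM balls ORDER BY ballid").fetchall()
```

Both functions leave the tables in the same state; only the number of statements differs.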

Database indexes and locks

I was discussing with a friend the issues I covered in my posts Referential Integrity (http://mjawaid.wordpress.com/2009/04/01/referential-integrity/) and Mapping tables (http://mjawaid.wordpress.com/2009/04/02/mapping-tables/). My friend described a scenario in which creating multiple indexes on a table leads to deadlocks. The reason he gave was the bucket lock, or in other words the gap lock. As I understood him, because of the multiple indexes the server locks several records when you lock just one: it searches the indexes and locks every record it encounters during the search. I wasn’t convinced, so I decided to create a few tables with different indexes and test them. For the test I used MySQL 4 with the InnoDB engine, since that is where we were facing the issue.

So I created five tables indxtest1, indxtest2, indxtest3, indxtest4, and indxtest5.

CREATE TABLE `indxtest1` (
  `id` bigint(20) NOT NULL auto_increment,
  `name` varchar(255) default NULL,
  `fk` bigint(20) default NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;

CREATE TABLE `indxtest2` (
  `id` bigint(20) NOT NULL auto_increment,
  `name` varchar(255) default NULL,
  `fk` bigint(20) default NULL,
  PRIMARY KEY (`id`),
  KEY `NewIndex1` (`fk`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;

CREATE TABLE `indxtest3` (
  `id` bigint(20) NOT NULL auto_increment,
  `name` varchar(255) default NULL,
  `fk` bigint(20) default NULL,
  PRIMARY KEY (`id`),
  KEY `NewIndex1` (`id`,`fk`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;

CREATE TABLE `indxtest4` (
  `id` bigint(20) NOT NULL auto_increment,
  `name` varchar(255) default NULL,
  `fk` bigint(20) NOT NULL default '0',
  PRIMARY KEY (`id`,`fk`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;

CREATE TABLE `indxtest5` (
  `id` bigint(20) NOT NULL auto_increment,
  `name` varchar(255) default NULL,
  `fk` bigint(20) default NULL,
  PRIMARY KEY (`id`),
  KEY `FK_indxtest4` (`fk`),
  CONSTRAINT `FK_indxtest4` FOREIGN KEY (`fk`) REFERENCES `indxtest5parent` (`id`) ON DELETE SET NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;

And another table, indxtest5parent, as the parent of indxtest5. Create it before indxtest5, because the foreign key in indxtest5 references it.

CREATE TABLE `indxtest5parent` (
  `id` bigint(20) NOT NULL auto_increment,
  `test` varchar(255) default NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;

Let me explain the differences between the tables. All tables except indxtest5parent contain three fields: id, name, and fk. Don’t confuse fk with a foreign key; it is an actual foreign key only in indxtest5. id is the primary key in all tables except indxtest4, where it is the first part of a composite primary key. The main difference between the tables is the indexing on the fk field.

The indxtest1 has no index on fk.

The indxtest2 has an index on fk, so fk is indexed.

The indxtest3 has a multi-field-index on id and fk combined, in addition to primary key index on id.

The indxtest4 has a composite primary key (id, fk). Therefore its primary key index covers id and fk, i.e. it is a multi-field index.

The indxtest5 has a foreign key fk mapping to id field of indxtest5parent. So it has foreign key index on fk.

After that I inserted some data into these tables, just 10 records or so. I inserted so few because I wasn’t testing performance with huge data; I was only testing how records are searched in tables with one index, multiple indexes, and no index, which helps in understanding how records get locked, whether implicitly when updating or explicitly.

So all tables look almost like this after inserting data:

id    name     fk
1     One      10
2     Two      12
3     three    13
4     Four     14
5     Five     15
6     Six      15
7     seven    14
8     eight    13
9     Nine     12
10    Ten      10

Now run the following queries on all tables:

select * from indxtest1 where id = 1;
select * from indxtest2 where id = 1;
select * from indxtest3 where id = 1;
select * from indxtest4 where id = 1;
select * from indxtest5 where id = 1;

All these queries return the same output, i.e. the first record of the table. This is very simple: since the result was filtered on the primary key in the where clause, only one record was scanned during the search. We can see this by running the following queries:

explain select * from indxtest1 where id = 1;
explain select * from indxtest2 where id = 1;
explain select * from indxtest3 where id = 1;
explain select * from indxtest4 where id = 1;
explain select * from indxtest5 where id = 1;

You will notice that the rows column of the result shows 1, which means only one record had to be examined during the search.

Now let’s filter the result using the fk field in the where clause:

select * from indxtest1 where fk = 10;
select * from indxtest2 where fk = 10;
select * from indxtest3 where fk = 10;
select * from indxtest4 where fk = 10;
select * from indxtest5 where fk = 10;

All these queries return the same result: two records, with ids 1 and 10. But how many records were scanned during the search? To find out, run the following queries:

explain select * from indxtest1 where fk = 10;
explain select * from indxtest2 where fk = 10;
explain select * from indxtest3 where fk = 10;
explain select * from indxtest4 where fk = 10;
explain select * from indxtest5 where fk = 10;

In the rows column (MySQL’s estimate of how many records it has to examine) you will notice that for the indxtest1 table it scanned 10 records. That is reasonable since there was no index on fk. For the indxtest2 table, 2 records were scanned. This is also reasonable since fk was indexed. So far so good. For the indxtest3 table, 10 records were scanned. Hmm… we will discuss this in a moment; let’s check the other queries first. For the indxtest4 table 10 records were scanned, and for the indxtest5 table only 2 records were scanned. The result for indxtest5 is also reasonable since it has a foreign key index on fk.

Now what is going on with indxtest3 and indxtest4? Notice that both tables have a multi-column index whose first column is id, not fk. According to the MySQL documentation:

MySQL uses multiple-column indexes in such a way that queries are fast when you specify a known quantity for the first column of the index in a WHERE clause, even if you do not specify values for the other columns. (http://dev.mysql.com/doc/refman/4.1/en/multiple-column-indexes.html)

The documentation makes clear that the index helps only when the first column of the index is specified. When only the second column is specified, MySQL will not use the index, and the same holds when the second column is combined with the first through an OR condition. That’s why the queries on indxtest3 and indxtest4 scanned all 10 records during the search.
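The same first-column rule can be observed with Python’s built-in sqlite3 module. Note the swap: this sketch uses SQLite and its EXPLAIN QUERY PLAN, not MySQL 4 and InnoDB, and the table and index names are made up for the illustration. It only shows the general behaviour of multi-column indexes and says nothing about MySQL’s locking.

```python
import sqlite3


def plan(conn, sql):
    """Return the query plan text SQLite reports for the given statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Like indxtest2: a single-column index on fk.
    CREATE TABLE single_idx (id INTEGER PRIMARY KEY, name TEXT, fk INTEGER);
    CREATE INDEX single_fk ON single_idx (fk);
    -- Like indxtest3: a multi-column index whose first column is id.
    CREATE TABLE multi_idx (id INTEGER PRIMARY KEY, name TEXT, fk INTEGER);
    CREATE INDEX multi_id_fk ON multi_idx (id, fk);
""")

single_plan = plan(conn, "SELECT * FROM single_idx WHERE fk = 10")
multi_plan = plan(conn, "SELECT * FROM multi_idx WHERE fk = 10")
print(single_plan)
print(multi_plan)
```

The first plan should be reported as a SEARCH using the fk index, while the second falls back to a SCAN because fk is not the leading column of its index. The exact wording of the plan text varies between SQLite versions.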

Now what is the effect of indexing on locking? According to MySQL documentation:

A locking read, an UPDATE, or a DELETE generally set record locks on every index record that is scanned in the processing of the SQL statement. (http://dev.mysql.com/doc/refman/4.1/en/innodb-locks-set.html)

And what is a record lock?

Record lock: This is a lock on an index record. (http://dev.mysql.com/doc/refman/4.1/en/innodb-record-level-locks.html)

A few more points from the MySQL documentation (http://dev.mysql.com/doc/refman/4.1/en/innodb-locks-set.html):

1 – For SELECT ... FOR UPDATE or SELECT ... IN SHARE MODE, locks are acquired for scanned rows.

2 – SELECT ... FROM ... FOR UPDATE sets exclusive next-key locks on all index records the search encounters.

3 – UPDATE ... WHERE ... sets an exclusive next-key lock on every record the search encounters.

4 – DELETE FROM ... WHERE ... sets an exclusive next-key lock on every record the search encounters.

Conclusion

So, according to the MySQL documentation, whether we lock records explicitly (first two points) or the locks are implicit (last two points), locks are acquired on every record the search encounters. So if the indexing is proper, no locks will be acquired on rows we don’t intend to lock. The MySQL documentation itself says:

It is important to create good indexes so that your queries do not unnecessarily need to scan many rows. (http://dev.mysql.com/doc/refman/4.1/en/innodb-locks-set.html)

Scanning many records locks all of those records, which can lead to lock waits and deadlocks and can degrade performance.

If you test the above scenario yourself, you will also notice the performance differences between the queries once the tables hold a reasonable number of records.

Many-to-many relationship

“A junction table, sometimes also known as a “Bridge Table”, “Join Table”, “Map Table”, or “Link Table”, is a table that contains common fields from two tables. It is on the many side of a one-to-many relationship with the other two tables. Junction tables are employed when dealing with many-to-many relationships in a database.” (http://en.wikipedia.org/wiki/Junction_table)

As described in the definition, mapping tables are used when there is a many-to-many relationship between two tables. In this case, a third mapping table is used to map the relation between those two tables.

But in one of my employer’s products I have seen mapping tables used in a lot of places, even where the relationship was not many-to-many. The most heavily used one was in a financial application, where the data was crucial. I’ll explain this with an example from the education domain. Let’s take the classic example of students and classes. A student can be in only one class, and one class can have multiple students. Therefore, the relationship between student and class is many-to-one or, seen from the other side, the relationship between class and student is one-to-many.

So, ideally we will have two tables, say STUDENT and CLASS. CLASS will have a PK, say CLASSNUM, and fields for other class details. Similarly STUDENT will have a PK, say STUDENTNUM, and fields for other student details. STUDENT will have one more field: an FK to the CLASSNUM field of CLASS.

This is the most obvious design for the scenario. But in the application mentioned above this was not the case. We had a mapping table between STUDENT and CLASS. Yes! You have read it correctly. We had a STUDENTCLASSDETAILS table between STUDENT and CLASS, with a composite key of STUDENTNUM and CLASSNUM. But what is the reason for a mapping table here? From what I have heard (I was not the designer of that DB at the time), the mapping table was introduced to improve performance and resolve locking issues.

Well, let’s discuss how it can improve performance. Consider the scenario where no relationships are maintained in the database, i.e. there is no referential integrity (read my other post on not having referential integrity: http://mjawaid.wordpress.com/2009/04/01/referential-integrity). So there are no FKs in any table. We have STUDENT with STUDENTNUM (PK), other fields, and CLASSNUM (note that it is not an FK and is not indexed either). CLASS has CLASSNUM (PK) and other fields. STUDENTCLASSDETAILS has STUDENTNUM and CLASSNUM as its PK, i.e. a composite key, and these are not FKs.

Now if two users operate on the same record, the first transaction locks some records in the STUDENT table and the second transaction cannot read that student’s information. In this case STUDENTCLASSDETAILS still lets us read at least the student number and the class, which is the information accessed most often. This way the second transaction does not have to wait for the first one, which improves performance, and the first transaction acquires its lock successfully.

Now let’s see what the problems with this approach are.

The first is maintenance. When moving a student from one class to another (for example when promoting to the next level) two tables have to be updated, which costs performance. Second, the mapping table allows a student to be inserted into multiple classes, and we have witnessed this. The result is a loss of data integrity, which is crucial in financial applications, where one can lose a large amount of money through silly mistakes. I’ll quote a statement from another forum here:

The designer of an application has a fiduciary responsibility to his employer/client and needs to ensure that data is as acurate as possible. To not enforce referential integrity is to tempt fate. Employees get fired for building systems that contain bad data leading to bad business decisions. Consultants get sued.

Pasted from <http://www.access-programmers.co.uk/forums/archive/index.php/t-33531.html>


Although that person was commenting on not having referential integrity, the quoted statement captures the crux of the topic. So the conclusion is: when designing an application or database, consider other scenarios and the pros and cons of the approach being adopted, not just one scenario. In other words, if an approach solves your problem, also consider what other problems it can cause, and prepare for those as well.
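The integrity problem described above, a student ending up in several classes, is easy to reproduce. The sketch below uses Python’s sqlite3 module with the STUDENT, CLASS, and STUDENTCLASSDETAILS names from this post. The student and class numbers are invented, and SQLite merely stands in for the production database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CLASS (CLASSNUM INTEGER PRIMARY KEY);
    -- Obvious design: the class is a column of STUDENT, one value per student.
    CREATE TABLE STUDENT (STUDENTNUM INTEGER PRIMARY KEY, CLASSNUM INTEGER);
    -- Mapping-table design: composite key of STUDENTNUM and CLASSNUM.
    CREATE TABLE STUDENTCLASSDETAILS (
        STUDENTNUM INTEGER,
        CLASSNUM   INTEGER,
        PRIMARY KEY (STUDENTNUM, CLASSNUM)
    );
    INSERT INTO CLASS VALUES (1), (2);
    INSERT INTO STUDENT VALUES (100, 1);
""")

# The composite key only forbids exact duplicates, so both rows are accepted.
conn.execute("INSERT INTO STUDENTCLASSDETAILS VALUES (100, 1)")
conn.execute("INSERT INTO STUDENTCLASSDETAILS VALUES (100, 2)")
classes_via_mapping = [row[0] for row in conn.execute(
    "SELECT CLASSNUM FROM STUDENTCLASSDETAILS WHERE STUDENTNUM = 100 ORDER BY CLASSNUM")]

# With the column design, moving the student replaces the old class.
conn.execute("UPDATE STUDENT SET CLASSNUM = 2 WHERE STUDENTNUM = 100")
classes_via_column = [row[0] for row in conn.execute(
    "SELECT CLASSNUM FROM STUDENT WHERE STUDENTNUM = 100")]
```

In the mapping table the student now belongs to two classes at once, which the single CLASSNUM column cannot express at all.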

Referential Integrity

When I first looked at the database of an application at one of my employers, I was a bit surprised by the design made by the previous software engineers/analysts. The two things that surprised me most were the absence of integrity constraints and the mapping tables between tables having a one-to-many relationship. I’ll focus on the first one here, but the second will also come up because it is related to data integrity too. The reason I was given for not using integrity constraints was performance. Before commenting on it and going further into the details, let me quote a few resources about not having integrity constraints.

“Q) When not to use referential integrity?

Ans) The short answer is Never. The designer of an application has a fiduciary responsibility to his employer/client and needs to ensure that data is as acurate as possible. To not enforce referential integrity is to tempt fate. Employees get fired for building systems that contain bad data leading to bad business decisions. Consultants get sued.”

Taken from http://www.access-programmers.co.uk/forums/archive/index.php/t-33531.html

“So if these rules are being examined for each and every database transaction, what is that doing to my system performance and response time? The answer is that it depends. Several things such as the volume of transactions and the types of constraints defined will affect performance. If you define cascading deletes across nine related tables, you are going to see a lag in response time while the database determines how many rows must be deleted from each table. This will also multiply the performance hit of other database features, such as journaling. So keep in mind that, yes, there is a cost for referential integrity in terms of system performance.

On the other hand, you might experience an improvement in performance, since the rules and relationships are enforced at the system level in the database. Instructions executed at this level run more efficiently than similar logic placed in a high-level language. Just as you would weigh the pros and cons of creating additional indices over your database, you should also consider the factors associated with adding referential integrity.”

Taken from http://www.itjungle.com/mpo/mpo101002-story03.html

“Why Disable Constraints?

During day-to-day operations, constraints should always be enabled. In certain situations, temporarily disabling the integrity constraints of a table makes sense for performance reasons. For example:

  • When loading large amounts of data into a table using SQL*Loader
  • When performing batch operations that make massive changes to a table (such as changing each employee number by adding 1000 to the existing number)
  • When importing or exporting one table at a time

Temporarily turning off integrity constraints can speed up these operations.”

Taken from http://download.oracle.com/docs/cd/B14117_01/appdev.101/b10795/adfns_co.htm

And the most important one:

http://rapidapplicationdevelopment.blogspot.com/2007/07/referential-integrity-data-modeling.html

The whole article is worth reading. It explains data modeling mistakes and the number one is not having referential integrity. I’ll quote its conclusion here:

“Conclusion

Well, hopefully I’ve convinced you to avoid the urge to be a lazy data modeler, design for the future, use a data modeling tool, and drop constraints during bulk load operations. In short, always use referential integrity. But if not, hopefully you’ll at least understand when people curse your name several years from now. :)”

Well, these give you a fair idea of whether one should use referential integrity constraints (RIC) or not. Even when performance is the concern, one should use RIC, since data integrity ultimately has to be checked somewhere, either at the application level or at the system level, and, as quoted above, instructions executed at the system level run more efficiently. And if there is no data integrity check at all, your data is at risk, as in the case mentioned above, where on the production system a record in one table was mapped to multiple records in another because of the mapping table and the missing integrity constraints. We had several issues on the production system because of this, including deadlocks and the one just mentioned. Several bugs filed in that project’s Bugzilla are examples of this.
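As a small illustration of a constraint doing this check at the system level, here is a sketch with Python’s sqlite3 module. The STUDENT and CLASS tables are hypothetical, and SQLite stands in for the database we actually used. In SQLite, foreign keys are enforced only after PRAGMA foreign_keys = ON has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this is switched on for the connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE CLASS (CLASSNUM INTEGER PRIMARY KEY);
    CREATE TABLE STUDENT (
        STUDENTNUM INTEGER PRIMARY KEY,
        CLASSNUM   INTEGER NOT NULL REFERENCES CLASS (CLASSNUM)
    );
    INSERT INTO CLASS VALUES (1);
""")


def try_insert_student(conn, studentnum, classnum):
    """Insert a student and report whether the database accepted the row."""
    try:
        conn.execute("INSERT INTO STUDENT VALUES (?, ?)", (studentnum, classnum))
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


valid_result = try_insert_student(conn, 100, 1)    # class 1 exists
orphan_result = try_insert_student(conn, 101, 99)  # class 99 does not exist
```

The application code needs no check of its own here: the row pointing at a missing class never reaches the table.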

Some coding tips!

This article highlights some common pitfalls that developers fall into. I’ll elaborate with a simple example.

$obj2->FuncOne( $obj1->GetData() );
$obj2->FuncTwo( $obj1->GetData() );

The result of GetData is passed to FuncOne and FuncTwo in two consecutive calls. What is the benefit of this approach? The simple answer is that it saves memory by not using a variable to hold the result. This argument was strong when memory capacity was very small. But now even a home user has a system with gigabytes of memory, so this practice no longer makes sense.

So, what’s the drawback of this approach? Performance degradation. But how can it degrade performance? Let me explain. Consider a situation where GetData runs a database query that fetches data from multiple tables by joining them. Joins themselves are heavy by nature. So when GetData is called twice, it runs the query twice. Now suppose this code snippet is part of a heavy process that many users can trigger on the web or in an enterprise application, and imagine what happens to the database and to the application itself. The performance of the application degrades, its users get frustrated, and in the end you lose business.

Now let’s look at it from another perspective. This approach also increases the CPU workload. Each time GetData is called, the CPU has to save the return address on the stack, jump to the function, and restore that address when the function returns to its caller. This happens on every function call. And when the function performs heavy computation it needs more memory and processing power, increasing the footprint and execution time of your application. So you’re wasting your servant’s (the CPU’s) time and energy by assigning it the same task twice.

Other drawbacks are that the code is more error prone, harder to debug and troubleshoot, and more costly to maintain, especially when you have to modify it to meet new requirements.

You can make your application and code much better and more efficient by adhering to a few simple best practices. In this case the rule is:

“If the result of a function is needed more than once, don’t call that function multiple times. Save the result in a variable and use that variable instead.”

In view of this, the above code could be written like this:

my $result = $obj1->GetData();
$obj2->FuncOne( $result );
$obj2->FuncTwo( $result );

There is one extra line of code in the above example, but it is more efficient and much more readable than the previous one.

Let’s see another example.

my $result = $obj1->GetData();
$result = { %$result, %$someData };
$obj2->FuncOne( $result );
$obj2->FuncTwo( $obj1->GetData() );

In this example the developer fetches the result, merges some previously obtained data into it, and passes it to FuncOne. Then he needs the original result again, so he calls GetData one more time. Again, we can write this code in a much more efficient manner.

my $result = $obj1->GetData();
my $result2 = $result;

$result = { %$result, %$someData };

$obj2->FuncOne( $result );
$obj2->FuncTwo( $result2 );

Here we have saved another call to GetData, which might be performing some heavy computation or running a heavy query that joins multiple tables.
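To see the saving in numbers, here is a sketch of the same pattern in Python rather than Perl, with a counter standing in for the expensive work. The DataSource class, its get_data method, and the sample data are all invented for this illustration.

```python
class DataSource:
    """Stand-in for $obj1: counts how often the expensive call is made."""

    def __init__(self):
        self.calls = 0

    def get_data(self):
        self.calls += 1  # imagine a heavy multi-table join here
        return {"id": 1, "name": "example"}


def fetch_twice(source, extra):
    """Pitfall: call get_data again to get the unmerged result back."""
    merged = {**source.get_data(), **extra}
    original = source.get_data()
    return merged, original


def fetch_once(source, extra):
    """Rule applied: keep the result and build the merged copy from it."""
    original = source.get_data()
    merged = {**original, **extra}  # a new dict, so original stays untouched
    return merged, original


wasteful, careful = DataSource(), DataSource()
result_twice = fetch_twice(wasteful, {"extra": True})
result_once = fetch_once(careful, {"extra": True})
```

Both functions return the same pair of values, but the first one pays for the expensive call twice.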