Thursday, November 25, 2010

[mysql] Selecting Random Selection Of Rows From A Database Table

The begining of this story is a PHPNuke module. The module has a block that picks a random item from a database table and displays it. The content of the table never changes. Here is how the programmer of the block does it:

He knows that there are 6899 rows because the last id in the table has a value of 6899. So with the "rand" PHP function, he picks a random number from 1 to 6899 and then queries the database table with a WHERE clause that checks if the value of the id is equal to the picked number:

srand(time());
$random = (rand(1,6899));
$res = mysql_query ("select * from my_table where id=$random");


The problem here is that because of some deleted rows (he has forgotten he deleted them) in the table, the id is not a continuous sequence from 1 to 6899 and that's why some times the picked number (as a value of the id) not exists in the table and the PHPNuke block displays an empty box.

The RAND function in MySQL is precisely there to prevent this kind of problem. Instead of picking a random id with a PHP code, let the server do it for you:

SELECT * FROM my_table ORDER BY RAND() LIMIT 1;

This query returns a random row from the table my_table, no matter how many rows there are and no matter if the id is a continuous sequence of numbers or not. With this query you are sure you get a random row every single time.

Conclusion:
Never reinvent the wheel.

Wednesday, November 24, 2010

[mssql] Selecting Random Selection Of Rows From A Database Table

I found a nice tip for selecting random rows from within a SQL Server 2000 database. Well actually, pseudorandom. Since my undergraduate thesis was on the topic of pseudorandom number generation, I might as well be precise. For some reason, my non-geek friends find it awfully funny when I mention pseudorandom numbers.

I digress. In order to select 10 records from some table at random, try this:
SELECT TOP 10 * FROM someTable ORDER By NEWID()
Now for my homework, I should find out just how random this is. There's a whole slew of statistical tests I can run to gauge the randomness of pseudorandom number generator such as the Chi-square Test, Serial Correlation Coefficient, and 2-D Random Walk Test to name a few.

IMPORTANT: Please note that this will NOT work in SQL 7 on NT4 because the NEWID() function there generates sequential results

Tuesday, November 2, 2010

[mssql] Compare the records of two tables in SQL Server

When you have two tables (or resultsets from SELECT statements) that you wish to compare, and you want to see any changes in ANY columns, as well as to see which rows exist in 1 table but not the other (in either direction) I have found that the UNION operator works quite well.

UNION allows you to compare all columns very quickly, and also handles comparing NULL values to other NULLs successfully, which a join clause or a WHERE condition doesn't normally do.  It also allows you to very quickly see which rows are missing in either table, which only a FULL OUTER JOIN will do, but of course we all know to avoid those at all costs (right?) -- a full outer join is about as “unrelational” as you can get.  (every column returned is potentially Null and must be wrapped in a COALESCE function).  Best of all, the UNION is quick and easy and short.

The basic idea is: if we GROUP the union of two tables on all columns, then if the two tables are identical all groups will result in a COUNT(*) of 2.  But for any rows that are not completely matched on any column in the GROUP BY clause, the COUNT(*) will be 1 -- and those are the ones we want.  We also need to add a column to each part of the UNION to indicate which table each row comes from, otherwise there is no way to distinguish between which row comes from which table.

So, here's an example, assuming we are comparing tables A and B, and the primary key of both tables is ID:


SELECT MIN(TableName) as TableName, KeyColumn, Column2 [, ...]
  FROM
  (SELECT '<TableName1>' as TableName, KeyColumn, Column2 [, ...]
     FROM <TableName1>
     UNION ALL
   SELECT '<TableName2>' as TableName, KeyColumn, Column2 [, ...]
     FROM <TableName2>
  ) tmp
  GROUP BY KeyColumn, Column2 [, ...]
  HAVING COUNT(*) = 1   -- Alternative: COUNT(*) <> 2
  ORDER BY KeyColumn

The above returns all rows in either table that do not completely match all columns in the other.  In addition, it returns all rows in either table that do not exist in the other table.  It handles nulls as well, since GROUP BY normally consolidates NULL  values together in the same group.  If both tables match completely, no rows are returned at all.

The MIN() aggregate function used on the TableName column is just arbitrary -- it has no effect since we are only returning groups of rows in which there has been no consolidation with the GROUP BY (note the HAVING clause).