Monday, November 16, 2009

[mssql] REPLACE Multiple Spaces with One

Introduction

Replacing multiple spaces with a single space is an old problem. If you Google the problem, you find that most folks still resort to While Loops in functions or maybe even a Tally table or (ugh!) XML in a function to solve this seemingly complex problem. The truth is that you don't need the RBAR of a User Defined Function at all.

This article explains the "puzzle solving" behind a common set based method to replace unknown numbers of adjacent spaces with a single space by establishing and replacing simple patterns in a set based fashion.

The Problem

You have a column of data that looks something like the following...

--===== Create and populate a test table.
     -- This is NOT a part of the solution.
DECLARE @Demo TABLE(OriginalString VARCHAR(8000))
 INSERT INTO @Demo (OriginalString)
 SELECT '  This      has multiple   unknown                 spaces in        it.   ' UNION ALL
 SELECT 'So                     does                      this!' UNION ALL
 SELECT 'As                                does                        this' UNION ALL
 SELECT 'This, that, and the other  thing.' UNION ALL
 SELECT 'This needs no repair.'
The goal is to convert the sections of multiple spaces of unknown length to a single space each as quickly as possible. There are some additional requirements. You can't make the data any bigger during the process because it might get too big for VARCHAR(8000) (or whatever size the column is) or you can't use VARCHAR(MAX) because you're using SQL Server 2000.

The Method Explained

I'll use the letter "O" to act as a "visible" space so you can see what I'm talking about... consider the following sets of spaces, please....
O
OO
OOO
OOOO
OOOOO
OOOOOO
OOOOOOO
OOOOOOOO

Remember the goal is to convert all of those sets of spaces to just a single space without the use of RBAR even if the RBAR is through the simple use of a UDF. I'll also state that the goal is to remove the extra spaces without making the original string any larger in the process because it may already be as large as it can be for the given datatype.

So... STEP 1 is to identify pairs of spaces. This is done by modifying the second space in each pair of spaces to be an "unlikely" character. In this case, I'll use the visible character of "X" (which isn't unlikely but serves this visual example) to represent a space that has been changed in a pair of spaces.

When we replace all pairs of space "OO" with "OX", we get the following
O
OX
OXO
OXOX
OXOXO
OXOXOX
OXOXOXO
OXOXOXOX

STEP 2 is to replace all occurrences of "XO" with NOTHING...
O
OX
OXO
OXOX
OXOXO
OXOXOX
OXOXOXO
OXOXOXOX

... and that leaves us with ...
O
OX
O
OX
O
OX
O
OX

STEP 3 is to replace "X" with NOTHING...
O
OX
O
OX
O
OX
O
OX

... and that leaves us with just singles spaces everywhere...
O
O
O
O
O
O
O
O

Again... the "O"s represent unmodified spaces and the "X"s represent spaces changed to some "unlikely character" like a special non printable, almost non type-able character like ASCII 7 (the "Bell" character). It's just a matter of 3 nested REPLACE functions to handle ANY number of spaces to accomplish the puzzle we solved above. It can all be done in a single set-based query without loops or even UDF's.

Be careful which "unlikely character" you pick, though. We'll talk more about that when we get to the "Unlikely Characters and Collation" section of this article further below.

The Code

Ok... now that you know how it works, here's the code that accomplishes the 3 steps as 3 nested REPLACE's. I've included the test table I previously covered in "The Problem" section of this article just to make things easy to run. Notice that I've also added an LTRIM/RTRIM to take out any leading or trailing spaces, as well...

--===== Create and populate a test table.
     -- This is NOT a part of the solution.
DECLARE @Demo TABLE(OriginalString VARCHAR(8000))
 INSERT INTO @Demo (OriginalString)
 SELECT '  This      has multiple   unknown                 spaces in        it.   ' UNION ALL
 SELECT 'So                     does                      this!' UNION ALL
 SELECT 'As                                does                        this' UNION ALL
 SELECT 'This, that, and the other  thing.' UNION ALL
 SELECT 'This needs no repair.'

--===== Reduce each group of multiple spaces to a single space
     -- for a whole table without functions, loops, or other
     -- forms of slow RBAR.  In the following example, CHAR(7)
     -- is the "unlikely" character that "X" was used for in 
     -- the explanation.
 SELECT REPLACE(
            REPLACE(
                REPLACE(
                    LTRIM(RTRIM(OriginalString))
                ,'  ',' '+CHAR(7))  --Changes 2 spaces to the OX model
            ,CHAR(7)+' ','')        --Changes the XO model to nothing
        ,CHAR(7),'') AS CleanString --Changes the remaining X's to nothing
   FROM @Demo
  WHERE CHARINDEX('  ',OriginalString) > 0

Because of the equivalent treatment of the Thorn (รพ) character and the "th" digraph, you end up with a most undesirable result...

Results from Code Above
CleanString
is has multiple unknown spaces in it.
So does is!
As does is
is, at, and e oer ing.
... which is nothing like what we wanted.

There's also a danger in selecting the wrong "control character" (ASCII characters 0 through 31) as the "unlikely character". I chose CHAR(7) which is a very benign character in today's world of electronics. It was designed to literally ring the bell on old Tele-Type machines and is just passed through by today's electronics.

If you chose CHAR(0), you've just chosen the "NULL" character and anything that follows it's appearance will simply disappear. Using the current test table in this article, you'd get an output that looks like the following...

Results from Code Above
CleanString
This
So
As
This,that,andtheotherthing.

If you chose CHAR(1), you've just chosen the ASCII "Start of Header" character which is still in use today. When certain electronics see this character, it can cause some very strange behavior usually resulting in failure of your code.

Another good "unlikely character" is CHAR(8) which is the ASCII "Backspace" character. It's normally never included in any type of assembled text now adays. CHAR(127) also works well because it's a left over for "Delete" from the paper tape world. It actually punches all the holes in a paper tape (7 of them) to quite literally delete a character.

For more information on the ASCII "Control Characters", please see the following URL:
http://www.lammertbies.nl/comm/info/ascii-characters.html#cont

Conclusion

To borrow a phrase from R. Barry Young's series of articles on the subject, "There must be 15 ways to lose your cursors". Even the seemingly complex task of condensing multiple adjacent spaces to a single space can be done without a loop.

When you run into a problem where you're absolutely sure there's no way to do it without a loop, have another look. There's usually a high performance way to avoid the loop.
Thanks for listening, folks.

--Jeff Moden