<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head>
<meta http-equiv="content-type" content="text/html; charset=windows-1252">
<title>Natural Order String Comparison</title>
</head>
<body>

<h1>Natural Order String Comparison</h1>
<p>by <a href="http://sourcefrog.net/">Martin Pool</a>

</p><p>Computer string sorting algorithms generally don't order strings
containing numbers the way a human would. Consider:

</p><blockquote><pre>rfc1.txt
rfc2086.txt
rfc822.txt
</pre></blockquote>
<p>It would be friendlier if the program listed the files as

</p><blockquote><pre>rfc1.txt
rfc822.txt
rfc2086.txt
</pre></blockquote>

<p>Filenames sort properly if people insert leading zeros, but they
don't always do that.

</p><p>I've written a subroutine that compares strings according to this
natural ordering. You can use this routine in your own software, or
download a patch to add it to your favourite Unix program.


</p><h2>Sorting</h2>

<p>Strings are sorted as usual, except that decimal integer substrings
are compared on their numeric value. For example,

</p><blockquote>
a &lt; a0 &lt; a1 &lt; a1a &lt; a1b &lt; a2 &lt; a10 &lt; a20
</blockquote>
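<p>As an illustration only, here is a minimal sketch of this rule in C. It is
not the code from <code>strnatcmp.c</code>: the name <code>nat_cmp_sketch</code>
is hypothetical, and the sketch assumes that digit runs carry no leading zeros
and that spaces need no special treatment.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the run of decimal digits starting at s. */
static size_t digit_run(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/*
 * nat_cmp_sketch is a hypothetical name, not the author's strnatcmp.
 * Returns <0, 0 or >0. Assumes digit runs have no leading zeros and
 * does not skip spaces.
 */
int nat_cmp_sketch(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    while (*a != '\0' && *b != '\0') {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            size_t la = digit_run(a);
            size_t lb = digit_run(b);
            size_t i;

            /* Without leading zeros, the longer run is the larger number. */
            if (la != lb)
                return la < lb ? -1 : 1;

            /* Same length: the first differing digit decides. */
            for (i = 0; i < la; i++) {
                if (a[i] != b[i])
                    return a[i] < b[i] ? -1 : 1;
            }
            a += la;
            b += lb;
        } else {
            /* Ordinary characters compare by their byte value. */
            if (*a != *b)
                return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
            a++;
            b++;
        }
    }

    /* At least one string has ended; the shorter one sorts first. */
    if (*a == *b)
        return 0;
    return *a == '\0' ? -1 : 1;
}
```

<p>Comparing run lengths before digits avoids converting the runs to machine
integers, so arbitrarily long numbers cannot overflow.</p>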

<p>Strings can contain several number parts:

</p><blockquote>
x2-g8 &lt; x2-y08 &lt; x2-y7 &lt; x8-y8
</blockquote>

in which case numeric fields are separated by nonnumeric characters.
Leading spaces are ignored. This works very well for IP addresses
from log files, for example.
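<p>To illustrate the IP address case, the following sketch sorts an array of
strings with the standard <code>qsort</code>. The comparator
<code>nat_cmp_skipws</code> and the wrapper <code>nat_sort</code> are
hypothetical stand-ins written for this example, not the functions from
<code>strnatcmp.c</code>; the comparator skips leading whitespace and assumes
digit runs have no leading zeros.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/*
 * Stand-in comparator (an assumption, not the author's strnatcmp):
 * skips leading whitespace, compares digit runs by numeric value and
 * all other bytes by their code. Digit runs must not have leading zeros.
 */
static int nat_cmp_skipws(const char *a, const char *b)
{
    while (isspace((unsigned char)*a)) a++;
    while (isspace((unsigned char)*b)) b++;

    for (;;) {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            int bias = 0; /* first digit difference, used if lengths tie */

            while (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
                if (bias == 0 && *a != *b)
                    bias = *a < *b ? -1 : 1;
                a++;
                b++;
            }
            if (isdigit((unsigned char)*a))
                return 1;   /* a's run is longer, so a is larger */
            if (isdigit((unsigned char)*b))
                return -1;  /* b's run is longer, so b is larger */
            if (bias != 0)
                return bias;
            continue;       /* equal numbers: keep scanning */
        }
        if (*a != *b)
            return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
        if (*a == '\0')
            return 0;
        a++;
        b++;
    }
}

/* qsort adapter: each element is a pointer to a C string. */
static int nat_cmp_qsort(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return nat_cmp_skipws(a, b);
}

/* Sort an array of n strings in place in natural order. */
void nat_sort(const char **items, size_t n)
{
    assert(items != NULL);
    qsort(items, n, sizeof items[0], nat_cmp_qsort);
}
```

<p>With this ordering <tt>9.0.0.2</tt> sorts before <tt>9.0.0.10</tt>, which in
turn sorts before <tt>10.0.0.2</tt>, whereas a plain <code>strcmp</code> would
put <tt>10.0.0.2</tt> first.</p>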

<p>
Leading zeros are <u>not</u> ignored, which tends to give more
reasonable results on decimal fractions.
</p>

<blockquote>
1.001 &lt; 1.002 &lt; 1.010 &lt; 1.02 &lt; 1.1 &lt; 1.3
</blockquote>
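<p>One way to get this ordering, sketched below under assumptions, is to compare
two digit runs left-aligned (digit by digit from the left) whenever either run
starts with a zero, and by magnitude otherwise. The helper
<code>cmp_digit_runs</code> is a hypothetical name used only for this
illustration; it looks only at the digit runs at the start of its arguments,
such as the parts after <tt>1.</tt> in the examples above.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the run of decimal digits starting at s. */
static size_t digit_run(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/*
 * cmp_digit_runs is a hypothetical helper, not part of strnatcmp.c.
 * It compares the digit runs at the start of a and b and returns
 * <0, 0 or >0.
 */
int cmp_digit_runs(const char *a, const char *b)
{
    size_t la, lb, n, i;
    int fractional;

    assert(a != NULL && b != NULL);
    la = digit_run(a);
    lb = digit_run(b);

    /* A leading zero on either side selects the fraction-style rule. */
    fractional = (a[0] == '0' || b[0] == '0');

    /* Plain integers: the longer run is the larger number. */
    if (!fractional && la != lb)
        return la < lb ? -1 : 1;

    /* Left-aligned scan: the first differing digit decides. */
    n = la < lb ? la : lb;
    for (i = 0; i < n; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }

    /* One run is a prefix of the other: the shorter sorts first. */
    if (la == lb)
        return 0;
    return la < lb ? -1 : 1;
}
```

<p>Under this rule <tt>010</tt> sorts before <tt>02</tt> because the second
digits differ (1 before 2), while <tt>2</tt> still sorts before <tt>10</tt>
because neither run starts with a zero.</p>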

<p>Some applications may wish to change this by modifying the test
that calls <code>isspace</code>.


</p><p>
Performance is linear: each character of each string is scanned
at most once, and scanning stops as soon as the result is decided.
</p>

<p><a href="http://sourcefrog.net/projects/natsort/example-out.txt">Longer example of the results</a>


</p><h2>Licensing</h2>

<p>This software is copyright by Martin Pool, and made available under
the same licence as zlib:

</p><blockquote>
<p> This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.

</p><p> Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:

</p><p> 1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
</p><p> 2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
</p><p> 3. This notice may not be removed or altered from any source distribution.
</p></blockquote>

<p>This licence applies only to the C implementation. You are free to
reimplement the idea from scratch in any language.

</p><h2>Related Work</h2>


<p>
POSIX sort(1) has the -n option to sort numbers, but this doesn't
work if there is a non-numeric prefix.
</p>

<p>
GNU ls(1) has the <tt>--sort=version</tt> option, which works
the same way as this routine.
</p>

<p>
The PHP scripting language now has a
<a href="http://us3.php.net/manual/en/function.strnatcmp.php">strnatcmp</a>
function based on this code.
The PHP wrapper was done by Andrei Zmievski.
</p>

<p>
<a href="http://www.naturalordersort.org/">Stuart
Cheshire</a> has a Macintosh <q>system extension</q> to do natural ordering.
I independently reinvented the algorithm, but Stuart had it
first. I borrowed the term <q>natural sort</q> from him.

</p>

<p>
<a href="http://search.cpan.org/src/EDAVIS/Sort-Versions-1.4/README"><tt>Sort::Versions</tt></a>
is a similar module in Perl. From its README: "The code has some special magic to deal with common
conventions in program version numbers, like the difference between
'decimal' versions (eg perl 5.005) and the Unix kind (eg perl 5.6.1)."

</p><p><a href="http://www.cpan.org/modules/by-module/Sort/Sort-Naturally-1.01.readme"><tt>Sort::Naturally</tt></a>
is also in Perl, by Sean M. Burke. It uses locale-sensitive character classes to sort words and numeric substrings
in a way similar to natsort.

</p><p>
Ed Avis wrote <a href="http://membled.com/work/apps/todo/numsort">something similar in Haskell</a>.


</p><p>
Pierre-Luc Paour wrote a <a href="http://pierre-luc.paour.9online.fr/NaturalOrderComparator.java"><tt>NaturalOrderComparator</tt>
in Java</a>.

</p><p>Kristof Coomans wrote a <a href="http://sourcefrog.net/projects/natsort/natcompare.js">natural sort comparison in JavaScript</a>.</p>

<p>Alan Davies wrote
<a href="http://sourcefrog.net/projects/natsort/natcmp.rb"><tt>natcmp.rb</tt></a>,
an implementation in <a href="http://www.ruby-lang.org/">Ruby</a>.

</p><p><a href="http://sourceforge.net/projects/numacomp">Numacomp</a>
is a similar thing in Python.

</p><p><a href="http://code.google.com/p/as3natcompare/">as3natcompare</a>
is an implementation in Flash ActionScript 3.

</p><h2>Get It!</h2>

<ul>
<li><a href="http://sourcefrog.net/projects/natsort/strnatcmp.c">strnatcmp.c</a>,
<a href="http://sourcefrog.net/projects/natsort/strnatcmp.h">strnatcmp.h</a> - the algorithm itself

</li><li><a href="http://sourcefrog.net/projects/natsort/natsort.c">natsort.c</a> - example driver program.
(Try <tt>ls -F /proc | natsort</tt>)

</li><li><a href="http://sourcefrog.net/projects/natsort/textutils.diff">textutils.diff</a> - patch to add
natural sort to sort(1) from GNU textutils-2.0; use the new
<tt>-N</tt> option.</li>

<li>Natural ordering is now in PHP4rc2, through the <a href="http://php.net/manual/html/function.strnatcasecmp.html">strnatcasecmp</a>
and <a href="http://php.net/manual/html/function.strnatcmp.html">strnatcmp</a>
functions.</li>
</ul>


<h2>To Do</h2>

<p>
Comparison of characters is purely numeric, without taking
character set or locale into account, so it is only correct for
ASCII. A locale-aware comparison should probably be a separate
function, because it will probably introduce a dependency on the OS
mechanism for finding the locale and comparing characters.


</p><p>
It might be good to support multibyte character sets too.

</p><p>
If you fix either of these, please mail me. They should not be
very hard.



</p></body></html>