[H-GEN] Text Search and Replace @ shell prompt

Jason Parker-Burlingham jasonp at uq.net.au
Sun Aug 25 21:45:37 EDT 2002


Greg Black <gjb at gbch.net> writes:

> | All I want to do is that searching a string and replace it with a
> | given string in the content of the file.

> To give an example answer to the question, which is probably not
> quite adequate because the task is under-specified,

There is one clear way in which your solution (and any along the same
lines) is inadequate:  if the poster wants to correct (presumably)
HREF attributes of A tags and SRC attributes of IMG tags[1], and uses
your simple regular expression solution, she or he may find it has been
overzealous, modifying the text inside <p> elements and so on.  No
amount of regex hackery will fix that in the general case[2].
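
To make the failure mode concrete, here is a small sketch.  The file
name, the host names and the sed expression are all invented for
illustration; this is not Greg's script, only the same kind of blind
substitution applied to a two-line page.

```shell
# Hypothetical sample page: the old host name appears both in an
# attribute value and in the visible text of a paragraph.
workdir=$(mktemp -d)
cat > "$workdir/sample.html" <<'EOF'
<p>Our old site was at old.example.com until 2001.</p>
<a href="http://old.example.com/index.html">Home</a>
EOF

# A blind substitution rewrites every occurrence, not just the HREF.
result=$(sed 's/old\.example\.com/new.example.com/g' "$workdir/sample.html")
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -r "$workdir"
```

Both lines come back changed, so the sentence in the paragraph now
makes a claim about the new host that was never intended.  Tightening
the pattern to match only after href= helps with this page, but
unquoted values, odd spacing and attributes split across lines each
need another special case.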

What's required is a parser.  The code is more complex, but so is the
problem.  Here's a *very* simple Perl script which could be used with
find -exec or something similar.  It's not commented (comments cost
extra) but it fits the bill reasonably well.  It requires the
HTML::TreeBuilder module for Perl (see the libhtml-tree-perl Debian
package) and just for kicks even uses a dread C<goto>.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix-html.pl
Type: application/x-perl
Size: 1040 bytes
Desc: Small Perl script to effect changes to A-HREF and IMG-SRC attributes
URL: <http://lists.humbug.org.au/pipermail/general/attachments/20020825/5c31b4de/attachment.bin>
-------------- next part --------------

As a parting note to the original poster:  Try following Smith's
Heuristic:

        The first time, do it by hand.
        The second time, do it by hand, but take notes.
        If you do it a third time, use your notes to script it.

Doing this all by hand may not be as bad as it looks; something like

        $ find . -type f -name '*.html' -exec vim {} +

and some vi macros will most likely let you find all the strings that
need changing, and enable you to quickly change them yourself on a
case-by-case basis (see Greg's ed(1) script for a guide on how to do
this).  This was the course of action I took when I last had to make a
large number of changes to code that could not be parsed with a
regex[3], and it was faster and easier than trying to find and correct
every little niggle with a regular expression (I know because I was
overconfident and tried that first).

I would also recommend, no matter how you make your changes, that you
copy your HTML somewhere else, work on the copy, and create a patch
which you or someone else can sanity check before applying it to the
HTML in question.  At least that way you'll be able to easily back out
of any changes you make.
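
As a rough sketch of that workflow, assuming diff(1) and patch(1) are
installed:  the directory names (site, site.new), the sample file and
the sed call are placeholders made up for the example, and the sed
call merely stands in for whatever edit you actually make.

```shell
# Hypothetical layout: "site" stands in for the real HTML tree.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir site
printf '<a href="http://old.example.com/">Home</a>\n' > site/index.html

# 1. Work on a copy, never on the original.
cp -R site site.new

# 2. Make the changes in the copy (by hand or with the tool of your choice).
sed 's/old\.example\.com/new.example.com/' site/index.html > site.new/index.html

# 3. Record the difference as a patch for review.  diff exits with
#    status 1 when the trees differ, which is the expected case here.
diff -ru site site.new > changes.patch || true

# 4. Once the patch has been checked, apply it to the original tree.
(cd site && patch -p1 < ../changes.patch)

# 5. If something turns out to be wrong, back the change out again.
(cd site && patch -R -p1 < ../changes.patch)
```

The patch file is the thing to hand to a second pair of eyes:  it
shows exactly which lines will change and nothing else.  The scratch
directory is left in place so the result can be inspected.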

[1] : These are usually the big two attributes that require fixing.
      There could well be more.

[2] : After being bitten one too many times trying to fix bad edits
      made to HTML documents with regular expressions, I resolved
      never to use regexes to edit HTML *ever* again.

[3] : C-style comments.  I had to audit and sanitize quite a bit of
      JavaDoc.
-- 
||----|---|------------|--|-------|------|-----------|-#---|-|--|------||
| ``Ooooaah!                                                            |
|   I'm getting so excited about cheese-making I can't stand it!''      |
||--|--------|--------------|----|-------------|------|---------|-----|-|

