[H-GEN] Text Search and Replace @ shell prompt
Jason Parker-Burlingham
jasonp at uq.net.au
Sun Aug 25 21:45:37 EDT 2002
Greg Black <gjb at gbch.net> writes:
> | All I want to do is that searching a string and replace it with a
> | given string in the content of the file.
> To give an example answer to the question, which is probably not
> quite adequate because the task is under-specified,
There is one clear way in which your solution---and any along the same
lines---is inadequate: if the poster wants to correct (presumably) the
HREF attributes of A tags and the SRC attributes of IMG tags[1], and
uses your simple regular expression solution, she or he may find it has
been overzealous, also modifying matching text inside <p> elements and
so on. No amount of regex hackery will fix that in the general case[2].
What's required is a parser. The code is more complex, but so is the
problem. Here's a *very* simple Perl script which could be used with
find -exec or something similar. It's not commented (comments cost
extra) but it fits the bill reasonably well. It requires the
HTML::TreeBuilder module for Perl (see the libhtml-tree-perl Debian
package) and just for kicks even uses a dread C<goto>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix-html.pl
Type: application/x-perl
Size: 1040 bytes
Desc: Small Perl script to effect changes to A-HREF and IMG-SRC attributes
URL: <http://lists.humbug.org.au/pipermail/general/attachments/20020825/5c31b4de/attachment.bin>
-------------- next part --------------
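The attachment itself is not preserved in this archive. As a rough illustration of the parser-based approach (this is not the original fix-html.pl), a minimal Python sketch using the standard library's html.parser might look like the following. The names AttrRewriter and rewrite_links, and the /old/ and /new/ prefixes, are made up for the example.

```python
from html import escape
from html.parser import HTMLParser


class AttrRewriter(HTMLParser):
    """Copy HTML through, changing only A-HREF and IMG-SRC values."""

    TARGETS = {("a", "href"), ("img", "src")}

    def __init__(self, old, new):
        # convert_charrefs=False so entities in text are passed through as-is.
        super().__init__(convert_charrefs=False)
        self.old, self.new = old, new
        self.out = []

    def _rewrite(self, tag, attrs):
        raw = self.get_starttag_text()
        hit = any((tag, name) in self.TARGETS and value and self.old in value
                  for name, value in attrs)
        if not hit:
            return raw  # untouched tags are emitted exactly as written
        # Rebuild the tag; note the parser hands us a lower-cased tag name.
        parts = [tag]
        for name, value in attrs:
            if value is None:          # bare attribute such as "hidden"
                parts.append(name)
                continue
            if (tag, name) in self.TARGETS:
                value = value.replace(self.old, self.new)
            parts.append(f'{name}="{escape(value, quote=True)}"')
        end = " />" if raw.endswith("/>") else ">"
        return "<" + " ".join(parts) + end

    def handle_starttag(self, tag, attrs):
        self.out.append(self._rewrite(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._rewrite(tag, attrs))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def rewrite_links(html_text, old, new):
    """Return html_text with `old` replaced by `new` in link attributes only."""
    parser = AttrRewriter(old, new)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<p>See /old/ docs</p><a href="/old/a.html">link</a>'
    print(rewrite_links(page, "/old/", "/new/"))
```

The text inside the <p> element is left alone even though it contains the search string, which is exactly where a plain regex substitution over the file would go wrong. Tags that are not changed are emitted verbatim; changed tags are rebuilt, so their quoting and letter case may be normalized.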
As a parting note to the original poster: try following Smith's
Heuristic:
    The first time, do it by hand.
    The second time, do it by hand, but take notes.
    If you do it a third time, use your notes to script it.
Doing this all by hand may not be as bad as it looks; something like
$ vim $(find . -type f -name \*.html)
and some vi macros will most likely let you find all the strings that
need changing, and enable you to quickly change them yourself on a
case-by-case basis (see Greg's ed(1) script for a guide on how to do
this). This was the course of action I took when I last had to make a
large number of changes to code that could not be parsed with a
regex[3], and it was faster and easier than trying to find and correct
every little niggle with a regular expression (I know because I was
overconfident and tried that first).
I would also recommend, no matter how you make your changes, that you
copy your HTML somewhere else, work on the copy, and create a patch
which you or someone else can sanity check before applying it to the
HTML in question---at least that way you'll be able to easily back out
of any changes you make.
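One way to produce such a reviewable patch is sketched below in Python with the standard difflib module; from the shell, running diff -ru between the original tree and the working copy gives the same kind of output. The function name make_patch and the file name index.html are illustrative only.

```python
import difflib


def make_patch(original, modified, path):
    """Return a unified diff between two versions of one file's text."""
    return "".join(
        difflib.unified_diff(
            original.splitlines(keepends=True),
            modified.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )


if __name__ == "__main__":
    before = '<p>Docs</p>\n<a href="/old/a.html">link</a>\n'
    after = '<p>Docs</p>\n<a href="/new/a.html">link</a>\n'
    # An empty result means the edit changed nothing in this file.
    print(make_patch(before, after, "index.html"), end="")
```

Reading the diff before applying it shows every line that changed, so a stray edit outside an HREF or SRC attribute stands out, and the untouched original is still there to fall back on.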
[1] : These are usually the big two attributes that require fixing.
There could well be more.
[2] : After being bitten one too many times trying to fix bad edits
made to HTML documents with regular expressions, I resolved
never to use regexes to edit HTML *ever* again.
[3] : C-style comments. I had to audit and sanitize quite a bit of
JavaDoc.
--
||----|---|------------|--|-------|------|-----------|-#---|-|--|------||
| ``Ooooaah! |
| I'm getting so excited about cheese-making I can't stand it!'' |
||--|--------|--------------|----|-------------|------|---------|-----|-|