[H-GEN] Edit massive XML files

Mick bareman at tpg.com.au
Sun Sep 18 22:16:42 EDT 2011


I am developing a set of GIS layers to create a map of historic south
west UK, covering the periods up to the mid 19th century with
particular focus on 1600 to 1840.

Included in the source data I am using OpenStreetMap.org.

From OSM I have several XML files that average 400MB each.

I need to strip out particular fields; the number of fields ranges
between 100,000 and 400,000 per file.

I have tried using gedit to do a global find/replace for each tag I need
to remove. The tags are found with `grep 'tag k="source"'
southwest-110912.osm | sort -u > tags`, which produces a list to work
through manually.
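For reference, that listing step is roughly the following; the second
command is only a guess at pulling the bare k="..." attributes into a
shorter 'keys' file to work through, rather than whole lines:

  # list distinct lines containing a "source" tag (one <tag .../> per line)
  grep 'tag k="source"' southwest-110912.osm | sort -u > tags

  # a guess: list only the distinct k="..." attributes instead of whole lines
  grep -o 'k="[^"]*"' southwest-110912.osm | sort -u > keys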

This approach is at best tedious and time-consuming: roughly 3 minutes
for each search/replace, plus 10 minutes to save, close and reopen gedit
after every fifth replace. Without the close/reopen, gedit stops
responding and takes the desktop down with it.

I am sure a combination of grep & sed or similar could automate this
but I can't get my head around the sed part.

What I think I need to do is something like:
grep 'tag k="source"' southwest-110912.osm | [delete the line]
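i.e. something roughly like the following, though the sed side of it is
only a guess on my part; it assumes each <tag .../> element sits on its
own line, and the 'keys' file and output file names are just made up for
the example:

  # delete every line containing a "source" tag, writing a new file
  # rather than editing the 400MB original in place
  sed '/tag k="source"/d' southwest-110912.osm > southwest-stripped.osm

  # or turn the whole key list into one sed script and make a single pass:
  # each line like  k="source"  becomes the delete command  /tag k="source"/d
  # (assumes none of the key names contain a '/' character)
  sed 's|.*|/tag &/d|' keys > delete.sed
  sed -f delete.sed southwest-110912.osm > southwest-stripped.osm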

Could some kind soul please give me a shove in the right direction?

mick
