[H-GEN] Edit massive XML files

Peter Robinson pjr at itee.uq.edu.au
Sun Sep 18 23:11:11 EDT 2011


On 09/19/2011 12:16 PM, Mick wrote:
> [ Humbug *General* list - semi-serious discussions about Humbug and     ]
> [ Unix-related topics. Posts from non-subscribed addresses will vanish. ]
>
> I am developing a set of GIS layers to create a map of historic south
> west UK, covering the periods up to the mid 19th century with
> particular focus on 1600 to 1840.
>
> Included in the source data I am using OpenStreetMap.org.
>
> From OSM I have several XML files that average 400MB each.
>
> I need to strip out particular fields, the number of fields ranges
> between 100k and 400k per file.
>
> I have tried using gedit to do global find/replace for each tag I need
> to remove, found by using `grep 'tag k="source"' southwest-110912.osm |
> sort -u > tags` to produce a list to manually work through.
>
> This approach is at best tedious and time-consuming - 3 min for each
> search/replace, plus 10 min to save, close and reopen gedit after every
> 5th replace. Without the close/reopen, gedit stops responding and kills
> the desktop.
>
> I am sure a combination of grep & sed or similar could automate this,
> but I can't get my head around the sed part.
>
> What I think I need to do is:
> grep 'tag k="source"' southwest-110912.osm | [delete the line]
>
> Could some kind soul please give me a shove in the right direction?
>
> mick
> _______________________________________________
> General mailing list
> General at lists.humbug.org.au
> http://lists.humbug.org.au/mailman/listinfo/general

Python has some built-in parsers for XML - a quick Google search turns up:
http://diveintopython.org/xml_processing/
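As a concrete sketch of that approach: the `iterparse` interface in Python's standard-library `xml.etree.ElementTree` streams the file rather than loading all 400MB into memory, which should avoid the gedit-style lockups. The file names, the `filter_osm` function name, and the set of keys to drop are all assumptions for illustration, not something tested against real OSM extracts.

```python
# Sketch: strip unwanted <tag k="..."/> elements from a large OSM XML file.
# iterparse streams the document, so memory use stays flat regardless of
# file size. DROP_KEYS and the function name are illustrative assumptions.
import xml.etree.ElementTree as ET

DROP_KEYS = {"source"}  # tag keys to remove; extend as needed

def filter_osm(infile, outfile):
    with open(outfile, "wb") as out:
        context = ET.iterparse(infile, events=("start", "end"))
        _, root = next(context)  # first event: start of the <osm> root
        # Re-emit the root's attributes (version, generator, ...).
        # Assumes attribute values need no extra escaping, true of OSM exports.
        attrs = "".join(' %s="%s"' % kv for kv in sorted(root.attrib.items()))
        out.write(('<?xml version="1.0" encoding="UTF-8"?>\n'
                   '<osm%s>\n' % attrs).encode("utf-8"))
        for event, elem in context:
            # A direct child of <osm> (node/way/relation) is now complete.
            if event == "end" and elem in list(root):
                for tag in elem.findall("tag"):
                    if tag.get("k") in DROP_KEYS:
                        elem.remove(tag)
                out.write(ET.tostring(elem))
                root.remove(elem)  # discard the element to free memory
        out.write(b"</osm>\n")
```

Usage would be something like `filter_osm("southwest-110912.osm", "southwest-filtered.osm")`. Since OSM exports usually put each `<tag .../>` on its own line, a plain `grep -v 'k="source"'` would likely also work, but the parser-based version won't break if the formatting ever changes.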


