[H-GEN] Edit massive XML files

Mick bareman at tpg.com.au
Sun Oct 2 19:56:52 EDT 2011


On Mon, 3 Oct 2011 09:38:46 +1000
Michael Anthon <michael at anthon.net> wrote:

Your other alternative is some web based tool, only one that springs to
mind right now is Yahoo's "Pipes" http://pipes.yahoo.com/pipes/ .  I've
only ever really had a bit of a fiddle with this but it's pretty easy
to use.  I'm sure there will be other similar tools out there but I've
never needed to use them.

The Java-based open source BI suite from Pentaho also has a pretty
powerful ETL tool called Kettle that could be used for this kind of
work. This one I have used quite a lot and its performance is pretty
good: http://kettle.pentaho.com/

I've tried writing XML parsers in various types of languages using
various libraries before but have often run into difficulties with the
size of files I needed to work with.  The main issue I ran into is that
a lot of the libraries I was using attempted to build the whole DOM as
an in-memory object... which can be a bit of a problem at times :-)
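
For what it's worth, the usual way around the whole-DOM problem is a
streaming parser that hands you one element at a time and lets you throw
it away afterwards. Below is a minimal sketch in Python using the
standard library's xml.etree.ElementTree.iterparse. The function name
count_elements and the tag names "data" and "record" are made up for
illustration; they are not from the file discussed in this thread.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count <tag> elements while streaming, without keeping the full tree.

    `source` is a filename or a binary file object. `tag` is a
    hypothetical element name chosen for this example.
    """
    count = 0
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1       # real per-record work would go here
            root.clear()     # discard finished children so memory stays flat
    return count


if __name__ == "__main__":
    # Tiny in-memory stand-in for a large file on disk.
    sample = io.BytesIO(b"<data><record/><record/><other/></data>")
    print(count_elements(sample, "record"))
```

Clearing the root after each processed record is what keeps memory use
roughly constant; without it the parser still accumulates every element
it has seen, just more slowly than a DOM builder would.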

Cheers,
Michael

I've managed to find a script to parse/load the data into postgis
databases after using awk to strip some of the more obvious redundant tags.

This still leaves 2.3GB of XML to load into postgis, and either I have
a broken XML file or I'm running into a 2GB file size limit somewhere.
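
One way to separate the two possibilities would be to stream the file
through a plain well-formedness check first. The following is only a
sketch, assuming Python and its standard xml.parsers.expat module; the
function name first_xml_error is hypothetical. It reads the input in
chunks, so the file never has to fit in memory, and it reports the
position of the first error expat finds. If the file parses cleanly,
the problem is more likely a size limit in one of the other tools.

```python
import io
import xml.parsers.expat


def first_xml_error(fh):
    """Return (line, column, message) for the first well-formedness
    error in the binary file object `fh`, or None if it parses cleanly."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.ParseFile(fh)  # expat reads the file in chunks internally
    except xml.parsers.expat.ExpatError as err:
        return (err.lineno, err.offset, str(err))
    return None


if __name__ == "__main__":
    # Small in-memory examples; for a real file use open(path, "rb").
    print(first_xml_error(io.BytesIO(b"<a><b/></a>")))
    print(first_xml_error(io.BytesIO(b"<a><b></a>")))
```

An error reported at the very end of the file would suggest the file
was cut off partway through rather than being malformed in the middle.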

mick



