A couple of months ago in work we faced an issue where we got XML files with nested structs in structs and arrays (with also structs in them).

Normally we always face these issues in Hive. Our ETL guy ingests the XML in HDFS in a Yarn cluster in AVRO format. Then we do SQL using Hive no matters what…

The thing here is that our Data Engineer basically discovered that Spark would take about 20 minutes roughly on performing an XML parsing that took to Hive more than a day.  Basically she tested the same job in Hive (exploding multiple arrays) and PySpark dataframes using the spark-xml lib.

When I started doing this months ago, I wasn’t really fluent in scala and I didn’t have a fully understand about Spark RDDs, so I wanted a solution based on pyspark dataframes.

As I couldn’t find any way to do it with DFs already posted on GitHub, I decided to design a Tree structure to iterate through XML components (see function node_paths below). It only lists all the paths in XMLs struct, and also all the nested paths inside them. If it finds an array, it adds the whole array as a path to be exploded by the function explodePath.

For nested structs and arrays inside arrays, this code may need a bit of rework. As an extra iteration over the results (quickest way), or using+improving code that is currently commented out.

Thanks for reading!

Source: