Data Struggling

Flattening complex XML structures into Hive tables using Spark DFs

A couple of months ago at work we ran into an issue: we received XML files with structs nested inside structs and arrays (which themselves contained structs).

Normally we handle these issues in Hive. Our ETL engineer ingests the XML into HDFS on a YARN cluster in Avro format, and then we do everything in SQL using Hive, no matter what…

The thing is, our Data Engineer discovered that Spark took roughly 20 minutes to perform an XML parsing job that took Hive more than a day. She tested the same job in Hive (exploding multiple arrays) and in PySpark DataFrames using the spark-xml library.
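To make the "exploding arrays" part concrete, here is a plain-Python sketch of what Spark's `explode` (or Hive's `LATERAL VIEW explode`) does to a row with an array column: each row is duplicated once per array element. The function and column names are illustrative, not from the original job.

```python
# Plain-Python stand-in for Spark's explode / Hive's LATERAL VIEW explode:
# one output row per element of the array column.

def explode_rows(rows, array_col):
    """Yield one output row per element of row[array_col]."""
    for row in rows:
        for element in row[array_col]:
            out = dict(row)          # copy the scalar columns
            out[array_col] = element  # replace the array with one element
            yield out

rows = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": ["c"]}]
exploded = list(explode_rows(rows, "tags"))
# -> [{"id": 1, "tags": "a"}, {"id": 1, "tags": "b"}, {"id": 2, "tags": "c"}]
```

In Hive this duplication is what makes exploding several large arrays in one query so expensive; Spark does the same transformation but distributes it far more efficiently.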

When I started on this months ago, I wasn’t really fluent in Scala and didn’t fully understand Spark RDDs, so I wanted a solution based on PySpark DataFrames.

As I couldn’t find any DataFrame-based solution already posted on GitHub, I decided to design a tree structure to iterate through the XML components (see the function node_paths below). It lists all the paths in the XML struct, including all the nested paths inside them. If it finds an array, it adds the whole array as a path to be exploded by the function explodePath.
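The tree-walk idea behind node_paths can be sketched like this. This is a hedged reconstruction, not the original code: plain dicts and lists stand in for Spark's `StructType` and `ArrayType`, and all names are illustrative.

```python
# Sketch of the node_paths idea: recursively list every dotted path in a
# nested schema, and collect array paths separately so they can later be
# exploded. Dicts model structs, lists model arrays (assumption, not the
# post's actual Spark schema objects).

def node_paths(schema, prefix=""):
    paths, array_paths = [], []
    for field, ftype in schema.items():
        path = f"{prefix}.{field}" if prefix else field
        paths.append(path)
        if isinstance(ftype, dict):       # nested struct: recurse into it
            sub_paths, sub_arrays = node_paths(ftype, path)
            paths.extend(sub_paths)
            array_paths.extend(sub_arrays)
        elif isinstance(ftype, list):     # array: mark the whole path for explode
            array_paths.append(path)
    return paths, array_paths

schema = {"order": {"id": "string", "items": [{"sku": "string"}]}}
paths, arrays = node_paths(schema)
# paths  -> ['order', 'order.id', 'order.items']
# arrays -> ['order.items']
```

Note that, as in the description above, an array is recorded as a whole path for explosion rather than recursed into, which is exactly why arrays nested inside arrays need the extra handling discussed below.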

For nested structs and for arrays inside arrays, this code may need a bit of rework: either an extra iteration over the results (the quickest way), or using and improving code that is currently commented out.
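The "extra iteration over the results" can be sketched as exploding repeatedly until no array columns remain. Again this is a plain-Python stand-in for the DataFrame operations, with illustrative names, assuming all rows share the same columns.

```python
# Repeatedly explode any column that still holds a list, until the rows
# are fully flat. This mimics re-running explodePath over the results
# (assumption: uniform columns across rows).

def explode_until_flat(rows):
    rows = list(rows)
    while True:
        array_cols = {k for row in rows
                      for k, v in row.items() if isinstance(v, list)}
        if not array_cols:
            return rows
        for col in array_cols:
            # one output row per element, as in a single explode pass
            rows = [dict(r, **{col: el}) for r in rows for el in r[col]]

rows = [{"id": 1, "nested": [["a", "b"], ["c"]]}]
flat = explode_until_flat(rows)
# -> [{"id": 1, "nested": "a"}, {"id": 1, "nested": "b"}, {"id": 1, "nested": "c"}]
```

The first pass turns the outer array into one row per inner array; the second pass flattens the inner arrays, after which the loop stops.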

Thanks for reading!

Source:
