How to Search an XML File Using Python and the lxml Library
Python is well suited to quick textual analysis of XML documents. Because the language is simple and expressive, a short script is often enough to answer a question about an XML file. In this article I provide a number of Python scripts for processing and analysing content in XML documents. I will be using the lxml library to process XML in Python.
To demonstrate the scripts, I will be using the following sample XML document, which represents customer information obtained from a typical CRM system. Save it in a file named sample-customer-xml.xml in the same folder as the Python scripts.
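    <!-- The root element name and the id values are assumed; the scripts only need customer elements with an id attribute -->
    <customers>
        <customer id="1">
            <name>John</name>
            <age>40</age>
            <mobile>9100000000</mobile>
            <state>NY</state>
        </customer>
        <customer id="2">
            <name>Jack</name>
            <age>20</age>
            <mobile>9100000008</mobile>
            <state>NJ</state>
        </customer>
        <customer id="3">
            <name>Pete</name>
            <age>56</age>
            <mobile>9100000001</mobile>
            <state>MD</state>
        </customer>
        <customer id="4">
            <name>Mark</name>
            <age>11</age>
            <mobile>9100000003</mobile>
            <state>LA</state>
        </customer>
    </customers>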
The following Python scripts assume that you have Python 3 and the lxml library installed on your machine. If you don't have lxml, run the following command to install it. On Linux, you may need to prefix the command with sudo if you get permission errors:

    pip install lxml
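To confirm that the installation worked, a quick check along these lines can help. It is only a sketch: it imports the library and prints the version tuples that lxml.etree exposes as `LXML_VERSION` and `LIBXML_VERSION`.

```python
from lxml import etree

# Version of the lxml package itself, reported as a tuple of integers
print("lxml version:", etree.LXML_VERSION)

# Version of the underlying libxml2 C library that lxml is running against
print("libxml2 version:", etree.LIBXML_VERSION)
```

If the import fails with ModuleNotFoundError, lxml was most likely installed for a different Python interpreter than the one running the script.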
The following Python script prints all the customer ids present in the sample XML. We open the XML file in binary mode and read its entire contents, then pass the bytes to etree, which parses them into an XML tree. An XPath expression then extracts all the ids as a list. Finally, we iterate over the list and print each id on the console. Note that the XML attribute id needs to be prefixed with @ when used in XPath.
    from lxml import etree

    with open("sample-customer-xml.xml", 'rb') as f:
        file_content = f.read()

    tree = etree.fromstring(file_content)

    # @id selects the id attribute of each customer element
    customer_ids = tree.xpath('//customer/@id')
    for customer_id in customer_ids:
        print(customer_id)
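As a variation on the script above, the ids can also be collected without XPath by iterating over the customer elements and reading the attribute with `get()`. The sketch below is one possible approach, not the only one. The helper name `customer_ids` and the inline `SAMPLE` data (including its root element name and id values) are assumptions made so the example runs on its own.

```python
from io import BytesIO
from lxml import etree

# Inline stand-in for sample-customer-xml.xml; root name and ids are assumed
SAMPLE = b"""<customers>
  <customer id="1"><name>John</name><state>NY</state></customer>
  <customer id="2"><name>Jack</name><state>NJ</state></customer>
</customers>"""

def customer_ids(source):
    """Return the id attribute of every customer element in source."""
    # etree.parse accepts a filename or a file-like object
    tree = etree.parse(source)
    # get() returns None when the attribute is absent, so skip those elements
    return [c.get("id") for c in tree.iter("customer") if c.get("id") is not None]

print(customer_ids(BytesIO(SAMPLE)))  # prints ['1', '2']
```

With the real file on disk, the call would be `customer_ids("sample-customer-xml.xml")`, which lets lxml open and read the file itself.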
The following Python program prints the names of all customers who are in LA. The XPath expression uses a predicate on the state element to select only the matching customers and then returns the text of their name elements.
    from lxml import etree

    with open("sample-customer-xml.xml", 'rb') as f:
        file_content = f.read()

    tree = etree.fromstring(file_content)

    # get customer names for customers in LA
    customers_in_la = tree.xpath('//customer[state/text()="LA"]/name/text()')
    for name in customers_in_la:
        print(name)
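The state is hard-coded in the XPath expression above. If the state needs to come from user input, lxml lets you pass it as an XPath variable instead of building the expression by string concatenation. The following is a minimal sketch; the function name `names_in_state` and the inline `SAMPLE` data are assumptions for illustration.

```python
from lxml import etree

# Inline stand-in for sample-customer-xml.xml; root name and ids are assumed
SAMPLE = b"""<customers>
  <customer id="1"><name>John</name><state>NY</state></customer>
  <customer id="2"><name>Mark</name><state>LA</state></customer>
</customers>"""

def names_in_state(tree, state):
    """Return the names of all customers whose state matches."""
    # $state is an XPath variable; lxml fills it from the keyword argument
    return tree.xpath('//customer[state/text()=$state]/name/text()', state=state)

tree = etree.fromstring(SAMPLE)
print(names_in_state(tree, "LA"))  # prints ['Mark']
```

Passing the value as a variable avoids quoting problems when the search text itself contains quote characters.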
The following Python program converts the customer XML into a CSV file. We create one row for each customer, with name, age, mobile and state separated by commas. This program requires Python 3.6 or above since it uses formatted string literals (f-strings).
    from lxml import etree

    with open("sample-customer-csv.csv", 'wt') as output:
        with open("sample-customer-xml.xml", 'rb') as xml_file:
            file_content = xml_file.read()

        tree = etree.fromstring(file_content)

        # get all customer records
        customers = tree.xpath('//customer')
        for customer in customers:
            # note that xpath on text() returns a list
            name = customer.xpath('name/text()')[0]
            age = customer.xpath('age/text()')[0]
            mobile = customer.xpath('mobile/text()')[0]
            state = customer.xpath('state/text()')[0]
            # uses python 3.6 string interpolation
            # save the customer attributes in csv form
            output.write(f"{name},{age},{mobile},{state}\n")
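The script above works for the sample data, but it would produce a broken row if a value contained a comma, and it would raise an IndexError if a customer lacked one of the four elements. A possible variation, sketched below, uses the standard csv module for quoting and `findtext()` with a default for missing elements. The function name `write_customers_csv`, the `FIELDS` list and the "Doe, Jane" record are hypothetical and exist only to show the quoting behaviour.

```python
import csv
import io
from lxml import etree

# Inline stand-in for the sample file; the second record is hypothetical
SAMPLE = b"""<customers>
  <customer id="1"><name>John</name><age>40</age><mobile>9100000000</mobile><state>NY</state></customer>
  <customer id="2"><name>Doe, Jane</name><age>31</age><mobile>9100000009</mobile><state>LA</state></customer>
</customers>"""

FIELDS = ["name", "age", "mobile", "state"]

def write_customers_csv(tree, output):
    """Write one CSV row per customer element to the given text stream."""
    writer = csv.writer(output)
    for customer in tree.xpath('//customer'):
        # findtext returns the child's text, or the default if the child is missing
        writer.writerow([customer.findtext(field, default="") for field in FIELDS])

buffer = io.StringIO()
write_customers_csv(etree.fromstring(SAMPLE), buffer)
print(buffer.getvalue())
```

The name containing a comma is written inside double quotes, so spreadsheet tools still read it as a single column. When writing to a real file instead of a StringIO buffer, the csv documentation recommends opening it with `newline=""`.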