How to Search in XML File Using Python and Lxml Library

Python is a powerful language when it comes to quick textual analysis of XML documents. Due to its simplicity and expressive power, it is possible to write short python scripts to do analysis on XML documents. In this article I provide a number of python scripts for processing and analysing content in XML documents. I will be using lxml library for processing xml documents in python.

For demonstrating the python script processing of xml, I will be using the following sample xml document. This represents customer information obtained from a typical CRM system. Save the following in a file named sample-customer-xml.xml in the same folder where python scripts are stored.



  
    John
    40
    9100000000
    NY
  
  
    Jack
    20
    9100000008
    NJ
  
  
    Pete
    56
    9100000001
    MD
  
  
    Mark
    11
    9100000003
    LA
  

The following python scripts assume that you have python3 and lxml library installed on your machine. If you don’t have lxml, run the following command to install it. In linux, you may need to prefix the command with sudo (if you get permission errors),

pip3 install lxml

The following python script prints all the customer ids present in the sample XML. We open the xml file in binary mode and then read the entire contents. This is then passed to etree for parsing it into an xml tree. We then use an xpath expression to extract all the ids in a list data structure. Finally we iterate the list and then print all the ids on console. Note that the xml attribute id needs to be prefixed with @ when used in xpath.

from lxml import etree

with open("sample-customer-xml.xml",'rb') as f:
  file_content = f.read()
  tree = etree.fromstring(file_content)
  customer_ids = tree.xpath('//customer/@id')
  for id in customer_ids:
    print(id)

Following python program prints all the customer names who are in LA. We are using an xpath which returns all the customer names who are in LA.

from lxml import etree

with open("sample-customer-xml.xml",'rb') as f:
  file_content = f.read()
  tree = etree.fromstring(file_content)
  # get customer names for customers in LA
  customers_in_la = tree.xpath('//customer[state/text()="LA"]/name/text()')
  for name in customers_in_la:
    print(name)

Following python program converts the customer xml into a CSV file. We create one row for each customer with name, age, mobile and state separated by commas. This program requires python 3.6 or above since it uses literal string interpolation.

from lxml import etree

with open("sample-customer-csv.csv",'wt') as output:
  with open("sample-customer-xml.xml",'rb') as input:
    file_content = input.read()
    tree = etree.fromstring(file_content)
    # get all customer records
    customers = tree.xpath('//customer')
    for customer in customers:
      # note that xpath on text() returns a list
      name = customer.xpath('name/text()')[0]
      age = customer.xpath('age/text()')[0]
      mobile = customer.xpath('mobile/text()')[0]
      state = customer.xpath('state/text()')[0]
      # uses python 3.6 string interpolation
      # save the customer attributes in csv form
      output.write(f"{name},{age},{mobile},{state}\n")

How to Connect to MySQL Database and Run Queries from Python

MySQL now provides support for pure python based access to mysql database. Use the python mysql connector in your program for accessing the mysql database. Run the following command in your machine’s command prompt to install the driver (this assumes you already have python3 installed),

pip3 install mysql-connector-python

If you get access denied error for the above, you may need to add sudo prefix to the above command.

Once you have the mysql connector installed, obtain the mysql server url, username, password and database name. Replace the variables in the following script with the values of the above. You may also want to replace the query written for wordpress database. Note that the following script is written for python3 and above.

The following python script connects to a mysql database configured for wordpress and then prints all the published posts along with the category under which it is published. You can easily customize this script for your queries.

# pip3 install mysql-connector-python
import mysql.connector

# Need to give access to your machine if using dreamhost MySQL instance
mysql_server = "replace_this_server_name"
mysql_user = "replace_this_user"
mysql_password = "replace_this_password"
mysql_db_name = "replace_this_db_name"

connection = mysql.connector.connect(user=mysql_user,
                  password=mysql_password,
                  host=mysql_server,
                  database=mysql_db_name)
try:
  cursor = connection.cursor()
  cursor.execute("""
    SELECT wp_posts.post_title AS ‘Title’, wp_terms.name AS ‘Cateogry’
    FROM wp_posts
    INNER JOIN wp_term_relationships
    ON wp_posts.ID = wp_term_relationships.object_id
    INNER JOIN wp_terms
    ON wp_term_relationships.term_taxonomy_id = wp_terms.term_id
    INNER JOIN wp_term_taxonomy
    ON wp_term_relationships.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
    WHERE wp_posts.post_status = 'publish'
    AND wp_posts.post_type = 'post'
    AND wp_term_taxonomy.taxonomy = 'category'
    ORDER BY wp_terms.name;
  """)
  for row in cursor:
    print(row[0]+","+row[1])
finally:
  connection.close()

If you are using a mysql database provisioned by a cloud service provider such as Dreamhost, you may have to enable direct access to the database from your machine. This option is usually available on the cloud service provider portal. For example, if you are using Dreamhost, click on the mysql username from the Dreamhost panel and then add “%” to the list of “Allowable hosts”. This gives user permission to connect from any IP. Alternatively you can give the specific external IP of the machine as well. Without this option, you will get “_mysql_connector.MySQLInterfaceError: Access denied for user and mysql.connector.errors.ProgrammingError: 1045 (28000): Access denied for user” message.

How to Write a MySQL Query to Fetch All WordPress Posts with Category Name

I run this website on wordpress powered by a mysql database. Very often I want to take a look at all the published posts on the site. By quickly looking at the list of post titles I can avoid posting duplicate or similar articles. If you have a wordpress blog with plenty of posts, you may need a list of all your wordpress posts along with the category name. You can then check whether you already have a post on the topic you are planning to publish. In this article I will provide four MySQL queries which you can run on your wordpress mysql database to get a list of your published posts with category name.

The following mysql query prints all the published wordpress posts grouped by category name. Note that the query uses the default table names used in wordpress. If you are using a cloud provider such as Dreamhost, you may have a different name for your wordpress tables. In that case replace the table names in the following query.

SELECT wp_posts.post_title AS ‘Title’, wp_terms.name AS ‘Cateogry’
  FROM wp_posts
  INNER JOIN wp_term_relationships
  ON wp_posts.ID = wp_term_relationships.object_id
  INNER JOIN wp_terms
  ON wp_term_relationships.term_taxonomy_id = wp_terms.term_id
  INNER JOIN wp_term_taxonomy
  ON wp_term_relationships.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
  WHERE wp_posts.post_status = 'publish'
  AND wp_posts.post_type = 'post'
  AND wp_term_taxonomy.taxonomy = 'category'
  ORDER BY wp_terms.name;

The following mysql query prints all the published wordpress posts with most recent one printed first. This allows you to analyse how frequently you are posting articles and take a look at your recent or old posts.

SELECT wp_posts.post_title AS ‘Title’,
       wp_terms.name AS ‘Cateogry’,
       wp_posts.post_date AS ‘Date’
  FROM wp_posts
  INNER JOIN wp_term_relationships
  ON wp_posts.ID = wp_term_relationships.object_id
  INNER JOIN wp_terms
  ON wp_term_relationships.term_taxonomy_id = wp_terms.term_id
  INNER JOIN wp_term_taxonomy
  ON wp_term_relationships.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
  WHERE wp_posts.post_status = 'publish'
  AND wp_posts.post_type = 'post'
  AND wp_term_taxonomy.taxonomy = 'category'
  ORDER BY wp_posts.post_date DESC;

You can customise the above queries in different ways for your specific requirements. For example, the following mysql query prints posts only in a specific category,

SELECT wp_posts.post_title AS ‘Title’,
       wp_terms.name AS ‘Cateogry’
  FROM wp_posts
  INNER JOIN wp_term_relationships
  ON wp_posts.ID = wp_term_relationships.object_id
  INNER JOIN wp_terms
  ON wp_term_relationships.term_taxonomy_id = wp_terms.term_id
  INNER JOIN wp_term_taxonomy
  ON wp_term_relationships.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
  WHERE wp_posts.post_status = 'publish'
  AND wp_posts.post_type = 'post'
  AND wp_term_taxonomy.taxonomy = 'category'
  AND wp_terms.name='Android';

The following mysql query only prints posts with more than 10 comments,

SELECT wp_posts.post_title AS ‘Title’, wp_terms.name AS ‘Cateogry’
    FROM wp_posts
    INNER JOIN wp_term_relationships
    ON wp_posts.ID = wp_term_relationships.object_id
    INNER JOIN wp_terms
    ON wp_term_relationships.term_taxonomy_id = wp_terms.term_id
    INNER JOIN wp_term_taxonomy
    ON wp_term_relationships.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
    WHERE wp_posts.post_status = 'publish'
    AND wp_posts.post_type = 'post'
    AND wp_term_taxonomy.taxonomy = 'category'
    AND wp_posts.comment_count > 10
    ORDER BY wp_terms.name;

How to Run MongoDB Queries on Azure Cosmos DB Using Python

Azure Cosmos DB is a highly scalable and fully managed database service available on the Azure cloud platform. Using MongoDB API support, it is possible to store JSON documents in Azure Cosmos DB. If you are using Cosmos DB with MongoDB API support in your projects, sometimes you need quick Python scripts to analyse data. Python can be used to quickly run queries on Cosmos DB. If you are using python to connect to production database, use only the read-only access keys for the Cosmos DB. Read-only keys are available on a separate tab in the Azure portal page for Cosmos DB (Azure Cosmos DB => Connection String => Read-only Keys).

The following sample python program uses pymongo package to connect to Cosmos DB. If you don’t have pymongo already installed, run the following command,

sudo pip3 install pymongo

The following python program connects to the Azure Cosmos DB and then prints documents in a collection. Don’t forget to replace the variables with the details of your database. Also note that the following program requires python 3.6 or above.

from pymongo import MongoClient
import json

# Get these values from the Azure portal page for your cosmos db account
cosmos_user_name = "cosmostestdbaccountname"
cosmos_password = "C7qC0lae6C6gNqRbiNYvhQgchsWWCLJpgLNA0hwA4IsJSLKDK=="
cosmos_url = "cosmostestdbaccountname.documents.azure.com:10255/?ssl=true&replicaSet=globaldb"
cosmos_database_name = "cosmostestdb"
cosmos_collection_name = "cosmostestcollection"

# This requires python 3.6 or above
uri = f'mongodb://{cosmos_user_name}:{cosmos_password}@{cosmos_url}'
mongo_client = MongoClient(uri)

# This is the name of the Mongo compatible database
my_db = mongo_client[cosmos_database_name]

# The name of the collection you are querying
my_col = my_db[cosmos_collection_name]
my_docs = my_col.find({})

# Prints all documents in the collection
for doc in my_docs:
  print(doc)

How to Create a Cosmos DB Database with MongoDB API Using Azure CLI

Azure Cosmos DB supports creation of databases which are compatible with MongoDB API. You can use standard MongoDB query language to access and modify data in such a database. However, please note that there are limitations in MongoDB API support and for aggregations, you will need to explicitly turn on aggregation support when creating the Cosmos DB account.

The following shell script can be used to create a MongoDB API compatible Cosmos DB database using Azure CLI. Follow these instructions if you don’t have Azure CLI installed on your machine. Also note that the following script will run only on linux or mac systems.

The cosmos db shell script given below asks for the following parameters,

  • Subscription id: This is subscription id under which you want all resources to be created. This value is available from the Azure portal under the “Cost management + Billing” section.
  • Resource group name: In Azure, all resources are organized under a container called resource group. Give a name under which you want your cosmos db account to be organized.
  • Cosmos db account name: This is the name you want to give to your cosmos db account. Note that you can have multiple databases under a single account. Also note that the name must be a valid domain name prefix and should be globally unique in Azure. You will be accessing your account as [cosmos_account_name].documents.azure.com.
  • Cosmos db database name: This is the name you want to give your MongDB instance. This need not be globally unique.

Note that the location for the resource group is hard coded as “US West” region. However, you can easily change it to an input parameter. Resources created under the group will be automatically assigned to the same region.

In Cosmos DB, you can specify the scalability factor of request units (database pricing depends on this value) at the database level or at the collection level. The following script sets the request unit limit to 400 at the database level to minimize cost. This is ok for test systems, but in production you may need a higher limit or set the limits at the collection level.

Note the use of “–capabilities” flag to turn on the aggregations preview feature. I have also set “–default-consistency-level” set to “Strong” to enable consistent reads/writes at the cost of performance.

Once the database is created, you can view the database from the Azure portal. You can also access the DB credentials from the portal. Here is the Azure CLI script for creating MongoDB compatible database in Cosmos DB,

#!/bin/bash

declare resource_location="westus"
echo "Creating cosmos db in $resource_location"

if [[ -z "$subscription_id" ]]; then
  echo "Please enter subscription id:"
  read subscription_id
  [[ "${subscription_id:?}" ]]
fi

if [[ -z "$resource_group_name" ]]; then
  echo "Please enter resource group name for cosmos db:"
  read resource_group_name
  [[ "${resource_group_name:?}" ]]
fi
if [[ -z "$cosmos_account_name" ]]; then
  echo "Please enter cosmos account name:"
  read cosmos_account_name
  [[ "${cosmos_account_name:?}" ]]
fi
if [[ -z "$cosmos_db_name" ]]; then
  echo "Please enter cosmos database name:"
  read cosmos_db_name
  [[ "${cosmos_db_name:?}" ]]
fi

az group create \
  --subscription $subscription_id \
  --name $resource_group_name \
  --location $resource_location

az cosmosdb create \
  --subscription $subscription_id \
  --resource-group $resource_group_name \
  --name $cosmos_account_name \
  --kind MongoDB \
  --capabilities EnableAggregationPipeline \
  --default-consistency-level "Strong" \

az cosmosdb database create \
  --subscription $subscription_id \
  --resource-group $resource_group_name \
  --name $cosmos_account_name \
  --db-name $cosmos_db_name \
  --throughput 400