Sept 1st was long after my self-imposed deadline of having it online before the promulgation ceremony. The idea had started as a desire to put the constitution online in a form that people could access easily. Katiba.mobi was already online, but I wanted to use the App Engine platform for its scaling capabilities. On an earlier project (Kura Info), I had used Python and Django and had been impressed by how much that combination simplified web application development.
Having decided that the user interface part would be a walkover, I turned to the data and converted the PDF file to text using pdftotext. The task that faced me then was how to delimit the various sections of the document: chapters, parts, articles, clauses and so on. Article numbering is not restarted at the chapter level in the body of the document, so I extracted that portion for further analysis, leaving out the table of contents and the appendices. The complexity of breaking down the text file was dawning on me, so I turned to a compiler construction book for ideas on how to automate the process of building parsers. The book led me to use flex for my first stage, in which chapter headings, article headings, article numbers and clauses were surrounded by XML tags.
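To give a feel for what that first tagging stage involves, here is a minimal sketch in Python rather than flex. The heading patterns, the tag names and the sample lines are all assumptions made for illustration; the actual layout of the pdftotext output and the tags used in the project may well have been different.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical patterns: the real pdftotext output may be laid out differently.
CHAPTER_RE = re.compile(r"^CHAPTER\s+([A-Z]+)\s*$")   # e.g. "CHAPTER ONE"
ARTICLE_RE = re.compile(r"^(\d+)\.\s+(.+)$")          # e.g. "1. Some heading"
CLAUSE_RE = re.compile(r"^\((\d+)\)\s+(.+)$")         # e.g. "(1) Some clause"


def tag_line(line):
    """Wrap one line of plain text in an (assumed) XML tag based on its shape."""
    text = line.strip()

    match = CHAPTER_RE.match(text)
    if match:
        return "<chapter-heading>%s</chapter-heading>" % escape(text)

    match = ARTICLE_RE.match(text)
    if match:
        number, heading = match.groups()
        return ("<article-number>%s</article-number>"
                "<article-heading>%s</article-heading>" % (number, escape(heading)))

    match = CLAUSE_RE.match(text)
    if match:
        number, body = match.groups()
        return '<clause number="%s">%s</clause>' % (number, escape(body))

    # Anything unrecognised is passed through, escaped so the output stays valid XML.
    return escape(text)


if __name__ == "__main__":
    sample = [
        "CHAPTER ONE",
        "1. Sample article heading",
        "(1) Sample clause text.",
        "Continuation text with an & in it.",
    ]
    for line in sample:
        print(tag_line(line))
```

A line-by-line pass like this only marks up the individual pieces. Grouping articles under their chapters and parts still needs a second pass, which is where the real difficulty of the job lies.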
The next step according to compiler theory would have been to create a grammar, tokenize the file and feed it into bison. At that point, I decided to take a shortcut and use regular expressions in Python to finish the task of surrounding the chapter, part and article sections with new XML tags. With an XML version of the constitution available, my original plan had been to use it to create another XML file in a format suitable for uploading to the datastore. I abandoned this approach because of the additional complexity and the time pressure, and decided to use XPath to mine the desired information instead. I quickly found out that XPath support on Google App Engine (Python) was only available as a third-party library, so I switched to Java for its XML processing facilities.
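The kind of query involved can be illustrated with a small sketch. This is not the Java code used in the project: it is a Python example that relies on the limited XPath subset in the standard library's `xml.etree.ElementTree` as found in current Python versions, and the element names, attributes and sample document are all made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; the project's real XML structure is not shown in this post.
SAMPLE = """<constitution>
  <chapter number="1" title="Sample chapter">
    <article number="1" title="First sample article">
      <clause number="1">Sample clause text.</clause>
      <clause number="2">More sample text.</clause>
    </article>
    <article number="2" title="Second sample article">
      <clause number="1">Another clause.</clause>
    </article>
  </chapter>
</constitution>"""


def find_article(root, number):
    """Return the <article> element with the given number, or None if absent."""
    # ElementTree supports attribute predicates such as [@number='2'].
    return root.find(".//article[@number='%s']" % number)


def clause_texts(article):
    """Return the text of each direct <clause> child of an article."""
    return [clause.text for clause in article.findall("clause")]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    article = find_article(root, 1)
    print(article.get("title"))
    for text in clause_texts(article):
        print(" -", text)
```

Because article numbers are unique across the whole document, a single attribute lookup like this is enough to pull out any article without knowing which chapter it belongs to.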
I did not expect a repeat of the ease I had experienced with Django for the user interface, so I went for the user interaction medium that demanded the least in terms of UI development: email. Once email was ready and the application was hosted, adding chat was a straightforward process.