The Crime Tracker portion of this project will eventually generate a map of all UD crime data since 2003. The goal is to reveal trends in crime and where it occurs. We will use this data when we present our final project to recommend where our sensor network should be placed to most effectively alert Public Safety to possible crime situations.
— 2009/01/30 1:45 PM update: 1/28 - 1/29 afterburn and grungy built backbone for the User Interface. afterburn coded like mad and grungy made a friend.
— 2009/01/23 11:13 update: 1/22 early morning afterburn successfully generated a kml file. The map looks great!
— 2009/01/21 12:20 Afterburn recently updated his blog and demonstrated the database that he created using Grungy's .csv file from his Python scripts that ripped the raw crime data from Udel's Public Safety site. The sample queries are located here.
The database is very large, and surfingcat is currently working on converting those locations to coordinates for mapping.
There are two ways to use the code:
import OpenURLS
temp = OpenURLS.Access()
temp.GotoUrls()
This section deals with the nitty-gritty of how the internals of CrimeTracker work.
The Python Data Miner is a software program that crawls through police reports publicly available for the Newark, Delaware region and saves the information into a .csv file. The police reports that were used can be viewed at the Public Safety Crime Statistics page.
Two different options were initially considered for the programming language. The first was to use a combination of C++ and the wget program available for *nix operating systems. This combination would have used wget to grab the web pages and then C++ to dissect them. This approach was not developed much past initial conception because Python has modules to handle this exact problem.
The second option considered was the Python programming language. Python is well suited for this type of project because it includes several useful modules, among them the SGMLParser. The modules used in this project were the regular expressions module (re), the datetime module (datetime), the time module (time) and the csv module (csv). The SGMLParser also comes from a Python module, but to use it a class has to be created that inherits from SGMLParser and then overrides the parser's methods by defining methods with the same names in the child class.
Python was chosen because, well, it didn't require chiseling a new wheel. This meant the entire data mining process would be simpler.
This section endeavors to be an in-depth explanation of the code from top to bottom. It provides explanations and reasoning behind code blocks and complements the comments already within the code as a guide for the user. However, a basic understanding of Python is assumed.
The DateURLIterator Class was built as an easy way to access all of the different web pages on the Crime Statistics website. Each page containing a table of police reports follows a general naming format of: “lowercase month | numerical day of the month | last two digits of the year.htm”. An example of a website file name is: january103.htm. The datetime module in Python made this extremely easy because it automatically handles how many days there are in each month, when to change the month string, and the other date details required. The previous method used the SaveUrls function in the Access Class.
Going Through the Code:
def __init__(self,startDate):
    self.date = startDate
The init function in Python is a constructor for the class. When a DateURLIterator class is instantiated it must be passed a datetime object for startDate in order to be constructed. The syntax to create a datetime object is: datetime.datetime(year,month,day). The startDate variable is important because it is the value at which the class starts counting.
def getUrl(self):
    return "http://www.udel.edu/PublicSafety/" + self.date.strftime("%B").lower() + "%s" % self.date.day + self.date.strftime("%y") + ".htm"
getUrl is a simple get function that returns the entire url of the current page when called. To do this the strftime method of the datetime object is used. If you have experience with PHP, this method is similar to the PHP function of the same name, and the two work the same for the most part. Calling it with different formatting codes returns different parts of the datetime object. With %B it returns the full month name, and with %y it returns the last two digits of the year. The month name is then converted to lowercase to fit the naming convention used by Public Safety.
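The file naming convention can be checked in isolation. The following is a minimal sketch written for Python 3 (the project's own code is Python 2). The helper name report_url is hypothetical, and it assumes the default locale so that %B yields English month names.

```python
import datetime

# Base address taken from the getUrl method shown above.
BASE_URL = "http://www.udel.edu/PublicSafety/"

def report_url(date):
    """Build the report page url for a date (hypothetical helper)."""
    month = date.strftime("%B").lower()  # full month name, lowercased
    day = str(date.day)                  # day of the month, no zero padding
    year = date.strftime("%y")           # last two digits of the year
    return BASE_URL + month + day + year + ".htm"

print(report_url(datetime.datetime(2003, 1, 1)))
```

Because date.day is an integer without zero padding, the first of January 2003 produces january103.htm, which matches the example file name given earlier.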
def nextUrl(self):
    self.date = self.date + datetime.timedelta(days=1)
nextUrl uses the timedelta class of the datetime module to increment the current date by one day. A timedelta represents a “time duration.” What that really means in the context of this code is that when timedelta is given “days=1” it returns a timedelta object representing “1 day”. This can then be added to the current date to give the correct following date. It is not possible to add the integer 1 to the current date, because a datetime object and an integer are not compatible types for addition.
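A short sketch of this behavior, written for Python 3, shows both the month rollover and the type mismatch. The variable names are illustrative only.

```python
import datetime

start = datetime.datetime(2003, 1, 31)
one_day = datetime.timedelta(days=1)   # a duration, not a date

next_day = start + one_day             # datetime handles the month rollover
print(next_day)

try:
    start + 1                          # an integer is not a duration
except TypeError as err:
    print("TypeError:", err)
```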
def getDate(self):
    return self.date
getDate has a really simple job, which is to simply return the current date.
The Access Class is the class that handles the dangerous job of traversing the internet. It opens the correct webpages and feeds them to the SGMLParser until it reaches the “today” date.
Going Through the Code:
def __init__(self):
    pass
There isn't anything special the constructor should do for this class when it is instantiated, so the pass statement tells the Python interpreter to do nothing.
def SaveUrls(self,url):
    """Open the given url and save the links"""
    sock = urllib.urlopen(url) #open up the given url
    parser = urllister.URLLister() #grab a reference to URLLister() class
    parser.feed(sock.read()) #call feed method of SGMLib
    sock.close() #close the url
    parser.close() #close and flush the buffer of the parser
    urlFile = open('URLS.txt', 'w') #open up the URLS text file in write mode
    for index in parser.urls[:-1]:
        urlFile.write(index + "\n") #put every url but the last on into the text and append a new line to it
    urlFile.close() #close the file
The idea for this approach was sparked by the URL Lister example from Mark Pilgrim's book Dive Into Python. The example in the book has a class that prints all of the urls on a page to the console. This program modifies the idea slightly by writing the urls to a file, “URLS.txt”. This was ideal for the calendar pages on the Public Safety website because the program could crawl through the page and find out where it needed to go to get the real data. This method was abandoned in favor of using the DateURLIterator Class, which proved much easier to implement and more versatile.
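For readers on Python 3, where the sgmllib module is no longer available, the same link-collecting idea can be sketched with the standard html.parser module instead. This is a swapped-in technique, not the project's code: the class name LinkLister and the helper list_links are hypothetical, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser

class LinkLister(HTMLParser):
    """Collect the href of every <a> tag (hypothetical stand-in for URLLister)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)

def list_links(html):
    """Return every link target found in an HTML string."""
    parser = LinkLister()
    parser.feed(html)
    parser.close()
    return parser.urls

sample = '<a href="january103.htm">1</a> <a href="january203.htm">2</a>'
print("\n".join(list_links(sample)))
```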
Version 1: SaveUrls Method
def GotoUrls(self):
    """Open the URLS.txt file, access the listed urls and print the html"""
    try:
        urlFile = open('URLS.txt', 'r') #open the list of urls
    except IOError:
        print "The file does not exist"
        return #nothing to read, so stop here
    for line in urlFile: #grab a line one at a time from the file
        sock = urllib.urlopen( line ) #open the webpage given on the line
        parser = TableRip() #grab a reference to the class
        parser.feed( sock.read() ) #feed the parser what you want to decode
        sock.close() #close the web page
        parser.close() #close and flush buffer of the parser
        HTMLFile = open('./test/Report.csv', 'a') #open a file to save .csv in
        HTMLFile.write( parser.output()[:49] ) #write the HTML to file
        #49 is to remove table headings
        HTMLFile.close() #close the file
    urlFile.close()
This version of the function used the SaveUrls() function to get all of the urls of the needed websites. The general idea is that it opens the URLS.txt file, which was previously created by the SaveUrls() function and contains a listing of the urls on the calendar pages. A for loop is then used in conjunction with the ”in” operator to iterate through the file. Iterating over a file this way yields an entire line at a time, so the complete url is given every time rather than just one character. The urls are then passed to the urllib module to open the webpage, and the HTML is passed to the TableRip class.
Version 2: DateURLIterator Method
def GotoUrls(self):
    """Open the URLS.txt file, access the listed urls and print the html"""
    today = datetime.datetime.today()
    #today = datetime.datetime(2004,4,7) #must run one day after the one you want or figure
    #out how to do a do loop
    #print today
    currUrl = DateURLIterator( datetime.datetime(2003,1,1) )
    #print currUrl.getDate()
    while currUrl.getDate() != today:
        sock = urllib.urlopen( currUrl.getUrl() ) #open the webpage given on the line
        parser = TableRip() #grab a reference to the class
        parser.feed( sock.read() ) #feed the parser what you want to decode
        sock.close() #close the web page
        parser.close() #close and flush buffer of the parser
        HTMLFile = open('./test/Report.csv', 'a') #open a file to save .csv in
        HTMLFile.write( parser.output() ) #write the HTML to file
        #49 is to remove table headings
        HTMLFile.close() #close the file
        currUrl.nextUrl() #grab the next page
* currUrl: day you want to start collecting data on
* today: day you want to stop collecting data on.
This was the final solution used for the project. It uses the DateURLIterator class to handle building the correct Urls to the different web pages.
The first line in this function creates a datetime object for the current date. This is then used in the while loop to determine when to stop. The while loop, however, will not grab the contents of “today” because it performs the conditional check before the commands in the loop. This was originally considered a problem, but it actually is not, because a day's police reports are usually not posted on the same day. They haven't all happened yet.
The second (non-commented) line creates the variable currUrl, which is an instance of the DateURLIterator class. currUrl holds the date that you wish to start collecting data on, while the today variable is the date you want to stop collecting data on. These two variables are used in the while loop. By checking whether the start variable, currUrl, is equal to the end variable, today, it is possible to loop through all of the needed days. This might seem like an infinite loop because currUrl never appears to change, but it does. At the end of the while loop currUrl grabs the next day with the nextUrl() function from the DateURLIterator class.
The next few lines open the current web page with the urllib module, create an object of the Table Rip class, read the webpage into that object and save the results to a file. The Table Rip object is the parser variable. The feed method that is used to read the web page into the parser is not defined in the Table Rip class because it is defined in the SGMLParser. The TableRip class inherits that particular method directly from the SGMLParser.
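One detail of the loop condition is worth checking before reusing this code. datetime.datetime.today() includes the current time of day, while the dates produced by the iterator sit at midnight, so if today is left as datetime.datetime.today() a != comparison between the two may never become false. The sketch below, written for Python 3, shows one way around this under that assumption: it truncates the stop date to midnight and compares with < instead. The generator name days_until is hypothetical.

```python
import datetime

def days_until(start, stop):
    """Yield each day from start up to, but not including, the day of stop."""
    # Drop the time of day so the comparison is between whole days.
    stop_day = stop.replace(hour=0, minute=0, second=0, microsecond=0)
    current = start
    while current < stop_day:
        yield current
        current += datetime.timedelta(days=1)

start = datetime.datetime(2003, 1, 1)
stop = datetime.datetime(2003, 1, 4, 15, 30)   # a stop date with a time of day
for day in days_until(start, stop):
    print(day.date())
```

As in the original loop, the stop day itself is not visited, which matches the reasoning that the current day's reports are usually not posted yet.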
The Table Rip class is an extension of the SGML Parser. It takes input in the form of a string and handles different html tags with function calls. The way the SGML Parser works is it uses a Try-Except block to look for a method for every tag it comes upon.
It looks something like this:
try:
    method = getattr(self, 'start_' + tag)
except AttributeError:
    ...
Tag is the HTML tag stripped of the ”<”, and ”>” characters. It is passed into the method that contains this example code.
The getattr function attempts to get a reference to a function with the name of the string that is passed to it. If Python cannot get a reference it raises the AttributeError exception. In the except block SGMLParser tries several other acceptable function names. If none of those combinations work, the unknown_starttag and unknown_endtag methods are used instead.
This feature was used in the project to look for <td> tags and take the text out of them, because <td> is a table cell in HTML. The Table Rip class inherits from the BaseHTMLProcessor class, which is an example from Mark Pilgrim's book Dive Into Python. The example was used to learn about Python and the SGML Parser. However, all of the methods in the BaseHTMLProcessor class were overridden in the Table Rip class. Inheriting from the BaseHTMLProcessor class now gives the same result as inheriting from the SGMLParser directly, but it is left in there to give credit to Mark Pilgrim for helping grungy learn Python.
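The lookup-with-fallback pattern can be tried out without any parser at all. The following is a toy sketch in Python 3, not SGMLParser's actual code: the class TagDispatcher and its dispatch_start method are hypothetical, and only the single fallback to unknown_starttag is modeled.

```python
class TagDispatcher:
    """Toy dispatcher illustrating the getattr lookup (hypothetical class)."""

    def __init__(self):
        self.log = []

    def start_td(self, attrs):
        self.log.append("start_td")

    def unknown_starttag(self, tag, attrs):
        self.log.append("unknown:" + tag)

    def dispatch_start(self, tag, attrs):
        try:
            method = getattr(self, "start_" + tag)
        except AttributeError:
            # No start_<tag> method exists, so fall back to the catch-all.
            self.unknown_starttag(tag, attrs)
        else:
            method(attrs)

d = TagDispatcher()
d.dispatch_start("td", [])      # handled by start_td
d.dispatch_start("blink", [])   # no start_blink, so the fallback runs
print(d.log)
```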
Going Through the Code:
def reset(self):
    # extend (called from __init__ in ancestor)
    # Reset all data attributes
    self.rip = 0 #flag that controls when the program saves text between tags
    self.count = 0 #used for formatting the csv file
    self.centerCount = 0 #un-needed used previously to count center tags and save the date
    self.date = 0 #
    self.Groups = () #Used to save the findings of the date regex
    self.IncidentNumber = 0 #
    self.brFlag = 0 #flag used to initiate the code to handle br tags inside table cell elements
    self.tempHolding = [] #list used to save entries when multiple lines of data existed in a table cell
    self.IncidentPattern = "(\\d{2})-(\\d{5})\\Z" #regex used to find an incident number
    self.RegSearch = re.compile(r'(^January|February|March|April|May|June|July|August|September|October|November|December)(.*\d{1,2})') #regex used to grab the date
    BaseHTMLProcessor.reset(self)#
This block of code is called from the ancestor's init function and is used as an initialization block for the child class. The comments in the code accurately describe what each variable's purpose (or lack of purpose) is.
def start_td(self, attrs):
    #called for every <td> tag in HTML source
    #increment rip which will cause text to be saved
    self.rip += 1
    self.unknown_starttag("td", attrs)
The purpose of this function is to increment the “rip” variable every time a <td> tag is encountered. This acts as a switch, so the program only saves text inside of the table cells. How this is done is shown in the handle_data(self,text) method.
def end_td(self):
    #called for every </td> tag in HTML source
    #Decrement rip counter turning off save mode
    self.unknown_endtag("td")
    self.rip -= 1
Just as we incremented the “rip” variable when a beginning <td> tag was encountered we decrement rip when an ending </td> tag is encountered. This gives the program the ability to selectively turn on the ability to save text and turn it off. In this case it only saves text when it has encountered a beginning <td> tag and “rip” is true. This is done with an if statement in the handle_data method.
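The counter-as-switch idea can be sketched on its own. The example below is written for Python 3 and uses the standard html.parser module in place of SGMLParser, so it handles tags through handle_starttag and handle_endtag rather than start_td and end_td. The class name CellText, the helper cell_text and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser

class CellText(HTMLParser):
    """Save only the text found inside <td> cells (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.rip = 0      # greater than zero while inside a table cell
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.rip += 1

    def handle_endtag(self, tag):
        if tag == "td":
            self.rip -= 1

    def handle_data(self, text):
        if self.rip:      # text outside table cells is discarded
            self.cells.append(text)

def cell_text(html):
    """Return the text of every table cell in an HTML string."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    return parser.cells

page = "<p>skip</p><table><tr><td>Theft</td><td>Smith Hall</td></tr></table>"
print(cell_text(page))
```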
def start_center(self,attrs):
    #increment center count so the 2nd one can be grabbed
    self.centerCount += 1
    self.unknown_endtag("center")
This function was originally used to save the date on the page so it could be added to the data being collected. This was accomplished in the handle_data method. Whenever the centerCount was equal to 2 or rip was true the program would save the text. This worked great until the format of the webpages changed and the date was no longer the second <center> tag. This method was replaced by a regex to grab the date.
def end_center(self):
    #end center tag function
    self.unknown_endtag("center")
This method does not do anything in particular. It is here to complete the start_center(self,attrs) method. Unlike the end_td method, it does not decrement anything, because we don't want to decrement the count. We want to count how many <center> tags there are per page and then set the count back to 0 before the next page is loaded.
def handle_data(self, text):
    #override
    #called for every block of text in HTML source
    #If in rip mode save text
    #otherwise discard text
    #listText = text.split()
    print "Raw Text: " + text
    #print "count: " + str(self.count)
    try:
        self.Groups = self.RegSearch.search(text).groups() #look for a month followed by a date
        self.date = "".join(self.Groups)
        #print self.date
    except AttributeError:
        pass
    if self.rip:
        try:
            re.search(self.IncidentPattern,text).groups() #find all the Incident nums to see if more than
            self.IncidentNum += 1 #one exists for the entry
        except AttributeError:
            pass
        if text.find(",") > -1: #strip commas from text
            text = text.replace(",","")
        if text.find("\r\n") > -1: #replace DOS returns
            #white space is removed by regular expression anything from 2 to 20 spaces = 1 space
            text = text.replace("\r\n","")
            text = re.sub("\s{2,20}"," ",text)
        if text.find("\n") > -1: #remove newlines
            text = text.replace("\n","")
        #if text[-1] == " ": #takes care of the case: Hit & run --> element btw
        #    self.pieces.append(text) #doesn't append with a comma
        #    #print "symbol in between" #deprecated
        #    print "text[-1]: " + text
        #if self.count == 2:
        #    print "count2: " + text
        #    self.count += 1
        if self.count == 0: #tack date onto first cell
            print "count0: " + self.date + "," + text + ","
            self.pieces.append(self.date + "," + text + ",")
            self.count += 1
        # elif self.count == 1:
        #     #skip 2nd cell
        #     self.count += 1
        # elif self.count == 2:
        #     #skip 3rd cell
        #     self.count += 1
        # elif self.count == 3:
        #     #skip 4th cell
        #     self.count += 1
        # elif self.count == 4:
        #     #skip 5th cell
        #     self.count += 1
        # elif self.count == 7:
        #     #skip eigth cell
        #     self.count += 1
        elif self.count == 4: #4 for 2003
            #adds a newline after each complete entry
            self.pieces.append(text + "\n")
            self.count = 0
            self.centerCount = 0
            print "count4: " + text
            if self.brFlag > 0:
                #handles the case of multiple incident#'s in one cell
                #By saving the previous incident# into tempHolding whenever a <br> tag is found
                #and then grabbing the entire last line minus the incident# and date is possible to
                #tack on the needed numbers and dates with the for loop.
                #Note: the date is store in the same cell of the list as the Incident#
                #This case is seen on May 14 2003
                self.pastEntry = self.pieces[-4:-1]
                self.pastEntry.extend([self.pieces[-1]])
                #print "This is pastEntry: %s" % self.pastEntry
                for index in range(self.brFlag):
                    #print "This is brFlag: %s" % self.brFlag
                    self.pieces.extend( self.tempHolding[index] )
                    self.pieces.extend(self.pastEntry)
                self.brFlag = 0
        else:
            self.pieces.append(text + ",") #comma is for delimeter in .csv
            self.count += 1
            print "else: " + text
Where to start on this one. This is where most of the logic happens for stripping the correct data. The first block of code is a try-except block:
try:
    self.Groups = self.RegSearch.search(text).groups() #look for a month followed by a date
    self.date = "".join(self.Groups)
    #print self.date
except AttributeError:
    pass
This attempts to find a month followed by a date by using the regex compiled in the RegSearch variable. The try-except block acts as a logic statement for this case. If the search() method of RegSearch does not find a month followed by a date, it returns None instead of a match object, and calling the groups() method on None raises an AttributeError. So either it finds the month followed by the date and saves the result in the date variable, or it doesn't find the date and does nothing.
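The same logic can also be written with an explicit check for None, which some readers may find easier to follow than relying on the exception. The sketch below is written for Python 3. It uses a simplified variant of the pattern (the leading ^ anchor is dropped, which is an assumption made for the example), and the helper name find_date is hypothetical.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Simplified variant of the RegSearch pattern, without the ^ anchor.
DATE_PATTERN = re.compile(r"(" + MONTHS + r")(.*\d{1,2})")

def find_date(text):
    """Return the month and day found in text, or None if there is none."""
    match = DATE_PATTERN.search(text)
    if match is None:          # search() returns None when nothing matches
        return None
    return "".join(match.groups())

print(find_date("Reports for January 15"))
print(find_date("no incidents listed"))
```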
if self.rip:
This line filters out everything but text inside of a table cell tag when combined with the code in the start_td and end_td methods. start_td increments rip and end_td decrements rip. This makes rip true only when the Parser is inside a table cell.
try:
    re.search(self.IncidentPattern,text).groups() #find all the Incident nums to see if more than
    self.IncidentNum += 1 #one exists for the entry
except AttributeError:
    pass
This is a deprecated method for determining if more than one incident number is in the table cell. It works the same way as the first block of code by using the try-except block as a logic statement. This code can be removed.
if text.find(",") > -1: #strip commas from text
    text = text.replace(",","")
if text.find("\r\n") > -1: #replace DOS returns
    #white space is removed by regular expression anything from 2 to 20 spaces = 1 space
    text = text.replace("\r\n","")
    text = re.sub("\s{2,20}"," ",text)
if text.find("\n") > -1: #remove newlines
    text = text.replace("\n","")
The entire purpose of these if statements is to strip the text of characters that would mess up the csv file format. Going down the list: “\r\n” is the DOS hard return sequence. This has to be considered because most, and maybe all, of these pages were written on a computer running Windows. If a hard return is included in the text of the csv file it starts a new line when viewed in a spreadsheet. In order to keep data entries intact these have to be removed. For the same reason “\n” is removed as well.
A comma is stripped because that is the delimiter that is used in this csv format.
Also in the “\r\n” if statement 2 to 20 spaces are replaced by a single space because removing the “\r\n” character usually resulted in excess spaces.
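Pulled out of handle_data, the three clean-up steps look roughly like this. The sketch is written for Python 3 and mirrors the order of the if statements above. The function name clean_cell and the sample strings are hypothetical.

```python
import re

def clean_cell(text):
    """Remove characters that would break a comma-delimited row."""
    text = text.replace(",", "")                # a comma is the csv delimiter
    if "\r\n" in text:
        text = text.replace("\r\n", "")         # drop DOS hard returns
        text = re.sub(r"\s{2,20}", " ", text)   # collapse the leftover spaces
    text = text.replace("\n", "")               # drop bare newlines
    return text

print(clean_cell("Theft,\r\n  Smith Hall"))
```

Note that the hard return itself is replaced with nothing, so words are only kept apart when extra spaces followed the return, as in the sample above.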
if self.count == 0: #tack date onto first cell
    print "count0: " + self.date + "," + text + ","
    self.pieces.append(self.date + "," + text + ",")
    self.count += 1
I removed the commented-out lines to make this more succinct and not bore readers to death with code that doesn't do anything anyway. It used to, but now it does not.
Count is a very important variable because it keeps track of how text should be formatted when saved to the pieces list.
When count is 0 it means it is the first element in the data entry. In this case we want to add the date we previously obtained with the regular expression and add commas to delimit each piece of data. Then we want to increment count to get ready to put the next data element in.
elif self.count == 4: #4 for 2003
    #adds a newline after each complete entry
    self.pieces.append(text + "\n")
    self.count = 0
    self.centerCount = 0
    print "count4: " + text
    if self.brFlag > 0:
        #handles the case of multiple incident#'s in one cell
        #By saving the previous incident# into tempHolding whenever a <br> tag is found
        #and then grabbing the entire last line minus the incident# and date is possible to
        #tack on the needed numbers and dates with the for loop.
        #Note: the date is store in the same cell of the list as the Incident#
        #This case is seen on May 14 2003
        self.pastEntry = self.pieces[-4:-1]
        self.pastEntry.extend([self.pieces[-1]])
        #print "This is pastEntry: %s" % self.pastEntry
        for index in range(self.brFlag):
            #print "This is brFlag: %s" % self.brFlag
            self.pieces.extend( self.tempHolding[index] )
            self.pieces.extend(self.pastEntry)
        self.brFlag = 0
Notice that “elif” is used instead of “if” in this case. “elif” is the Python “else if” statement and it is used in this case because we don't want to execute this code if the first if statement is true. Also it simplifies the code by not having to have two duplicate else statements for the two different if statements.
This case handles the last element in the data entry. Three major events occur here: a newline is appended to the text instead of a comma, the counters used in the handle_data method are reset, and any cell holding multiple incident numbers (signaled by brFlag) is expanded into separate entries. This way the method is ready for the next line of data in the table.
if self.brFlag > 0:
    #handles the case of multiple incident#'s in one cell
    #By saving the previous incident# into tempHolding whenever a <br> tag is found
    #and then grabbing the entire last line minus the incident# and date is possible to
    #tack on the needed numbers and dates with the for loop.
    #Note: the date is stored in the same cell of the list as the Incident#
    #This case is seen on May 14 2003
    self.pastEntry = self.pieces[-4:-1]
    self.pastEntry.extend([self.pieces[-1]])
    #print "This is pastEntry: %s" % self.pastEntry
    for index in range(self.brFlag):
        #print "This is brFlag: %s" % self.brFlag
        self.pieces.extend( self.tempHolding[index] )
        self.pieces.extend(self.pastEntry)
    self.brFlag = 0
The comments in this part of the code explain what it does very well, so I won't repeat them here and will let the reader read the comments. I will add that brFlag is incremented inside the “start_br()” method and signals how many <br> tags have been encountered.
else:
    self.pieces.append(text + ",") #comma is for delimeter in .csv
    self.count += 1
    print "else: " + text
This is the last part in the “handle_data()” method and it handles every other case other than the first and last element. All it does is save the text to the pieces list and append a comma onto it.
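Taken together, the three branches on count turn a flat stream of table cells into comma-delimited rows. The sketch below, written for Python 3, isolates just that bookkeeping. It leaves out the brFlag handling, and the function name build_rows, the last_index parameter and the sample cell values are all hypothetical.

```python
def build_rows(date, cells, last_index=4):
    """Join a flat list of cell texts into csv rows, prefixing each with date."""
    pieces = []
    count = 0
    for text in cells:
        if count == 0:
            # first cell of an entry: tack the date on in front
            pieces.append(date + "," + text + ",")
            count += 1
        elif count == last_index:
            # last cell of an entry: end the row and reset the counter
            pieces.append(text + "\n")
            count = 0
        else:
            # every cell in between is followed by a comma
            pieces.append(text + ",")
            count += 1
    return "".join(pieces)

cells = ["03-00001", "Theft", "Smith Hall", "Open", "12:00"]
print(build_rows("January 1", cells), end="")
```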
def unknown_starttag(self, tag, attrs):
    #override
    pass

def unknown_endtag(self, tag):
    #override
    pass

def handle_charref(self, ref):
    #override
    pass
For all of these types of HTML elements we don't want the program to do anything. In order to do this we employ Python's “pass” statement. We don't want to do anything with these because we're not interested in saving unknown tags or character references to the pieces list.
def handle_entityref(self, ref):
    #override
    d = {"amp":"&", "nbsp":" "}
    refTest = ""
    #print self.date
    #print ref
    #print "entityref: " + self.pieces[-1]
    #print self.pieces == []
    if ref == "amp":
        try:
            #try block handles the case where pieces is empty
            #it wouldn't be empty if it was an &, which is all I care about from the entityrefs
            refTest = str(self.pieces[-1])
            #print refTest
            if refTest[-1] == ",":
                #self.refTest.replace(","," ")
                self.pieces[-1] = refTest.replace(",","")
                #print "entityref2: " + self.pieces[-1]
                self.count -= 1
        except IndexError:
            pass
        try:
            self.pieces.append(d[ref])
        except KeyError:
            pass
    else:
        #skip anything other than an ampersand
        pass
Note: The server code is currently tailored for our setup. You will have to modify it for your purposes. Please see the Personalizing the Code Section.
Note: you will have to point the code to your server. To do this see the Personalizing the Code Section.
The Python Server Backend was simply a Python file running on the server called server.py (available in the svn repository). This file instantiates a simple Python server that runs on port 8080. The idea was originally inspired by the PyAMF example on GeoIP. The server uses PyAMF to talk to the client side (written in Flex 3). PyAMF provides Action Message Format (AMF) support for Python and is compatible with the Flash Player (for a better description of PyAMF visit their website).
All of that aside, the unique and interesting feature of the server-client code was the use of objects to represent the data. Several classes were defined with relevant data fields for transporting data from the server to the client. These classes were: sensors, nodes and networks. A sensor class represents anything connected to the platform that provides data. In the case of this project this includes a passive infra-red sensor, trip wire sensors, and the scream detector. A sensor object was made for each of these. These were then wrapped in a node class. The node class represents a router microcontroller platform with all of its sensors. The node simply has a list of its sensors and knowledge of its GPS coordinates. These were all wrapped inside a Network class. A network class is a representation of the entire node system and all of its sensors. It only has a list of nodes inside of it.
These classes were defined in the Python backend and in the ActionScript front end code. By using export statements PyAMF was able to pass the class and have both sides understand what data it was receiving. This was a much more refined solution than the initial idea of using a binary socket.
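The nesting of sensors inside nodes inside a network can be sketched with plain Python 3 dataclasses. This only illustrates the shape of the data and is not the code from server.py: the field names, the placeholder coordinates and the use of dataclasses are all assumptions, and the PyAMF registration step is left out entirely.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Sensor:
    kind: str                 # hypothetical field, e.g. "passive infra-red"
    triggered: bool = False   # hypothetical field for the sensor's state

@dataclass
class Node:
    coordinates: Tuple[float, float]              # GPS position of the node
    sensors: List[Sensor] = field(default_factory=list)

@dataclass
class Network:
    nodes: List[Node] = field(default_factory=list)

# One node carrying the three sensor types named in the text.
node = Node(
    coordinates=(0.0, 0.0),   # placeholder, not a real location
    sensors=[Sensor("passive infra-red"), Sensor("trip wire"), Sensor("scream detector")],
)
network = Network(nodes=[node])
print(len(network.nodes[0].sensors))
```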