Vanderbilt Crime View

The final project for my Web 2.0 class involved creating a Mashup by aggregating multiple data sources from the internet and creating a visual representation of the data. As I scoured the internet for various data sources I came across the Vanderbilt Crime Log, which lists all of the crimes or incidences that occurred on or near the Vanderbilt University campus in Nashville, TN. I combined this data along with historical weather data provided by Weather Underground to create Vanderbilt Crime View.

DEMO


How I Did It

Part I: The Data

===================

Parsing the data

Unfortunately, as of May 2010 the data was provided in a PDF format [example PDF] and had to be converted into a spreadsheet/database before geo-coding could be done. Using Adobe Acrobat’s Export to Plain Text feature I was able to produce a file that I could parse (the HTML and DOC exports were unable to produce consistent tables due to the lack of vertical lines on the tables). Luckily the text file was formatted so that each row was on a separate line and columns (for the most part) were delimited by a tab. Using the following regular expression and some PHP I was able to convert 15 months worth of PDFs into a MySQL table.

// REGEX
^(.+?)  (.+?)(?:10|09)-.+?  (.+?)  (.+?)(?:Active|Inactive|Arrest|Unfounded).*?  (.*)

Cleaning the data

After looking through the data that was collected I noticed that the format used for displaying the location had changed sometime during the year. Locations were sometimes written BUILDING NAME – ADDRESS while other times listed as ADDRESS (BUILDING NAME). In order to geo-code the data I needed to separate the address from the building name. I decided to add a `building` field to my MySQL table which I would populate by parsing the location field for each record. I used the following PHP and regular expression to do so:

$l = 'Location From the DB'

if(strpos($l,'(') !== false){
    $loc = preg_replace('/\s*\(.+?\)\s*/im', '', $l);

    if (preg_match('/\((.+?)\)/im', $l, $regs))
        $name = $regs[1];
}
elseif (strpos($l,'-') !== false) {
    $c = explode('-',$l);
    $name = trim($c[0]);
    $loc = trim($c[1]);
}

While this does not take into account hyphenated building names and other edge cases, it provided adequate results for the project and was definitely preferred to manually parsing the data.

Geo-coding the data

The next step of the process was to geo-code (i.e. determine the latitude and longitude) for each of the crimes. I turned to Google for this task and wrote a PHP script to query the Google Geocoding API and save the resulting data to the MySQL table. By only geocoding DISTINCT locations and issuing an UPDATE on all crimes that shared a particular location I was able to drastically cut down on the number of geocoder lookups needed.

For each lookup Google provides an accuracy value that indicates how sure Google is about the result. By saving this value to the database I was able to determine which locations needed more attention. Some of the entries were misspelled or corrupt, while others didn’t include adequate location information. In particular, many locations just had a BUILDING NAME without an address. In addition to this many buildings are located in the middle of campus (not near a road). To remedy these issues I turned to the Vanderbilt Map which makes use of a GML file containing the latitude/longitude coordinates for each building on campus (I discovered the file using the Firefox Tamper Data plugin). After a few tweaks to the file, PHP was able to parse it and provided me the geocoding data for all buildings on campus. I saved this parsed data into my database and used it to update my crimes table.

<?PHP

$buildings = simplexml_load_string(file_get_contents("vu.gml"));

foreach($buildings as $building) {
  $building_name = $building->facilities->FACILITY_NAME;
  $coords = $building->facilities->msGeometry->Polygon->outerBoundaryIs->LinearRing->coordinates;
}

?>

Spreading out the data

Since there exact locations that appear more than once, there are also multiple latitude/longitude pairs that are the same. This becomes problematic when visualizing the data on a map because all of crimes that occurred at the same location are plotted on top of one another are covered by the top marker. In order to provide a more useful visualization I moved all pins that were at the same lat/lon by a random amount (between 0 and a hundred feet) in a random direction. I accomplished this using the below SQL query.

$result = mysql_query("SELECT count(*), lat, lon FROM `crimes` GROUP BY lat, lon HAVING count(*) > 1 ORDER BY count(*) desc");

while($r = mysql_fetch_assoc($result)){
    mysql_query("UPDATE crimes SET lat = (lat + (RAND() * 0.0002516197121987) - 0.00012580985609934), lon = (lon + (RAND() * 0.0002516197121987) - 0.00012580985609934) WHERE lat=".$r['lat']." AND lon=".$r['lon'];
}

3 comments to Vanderbilt Crime View

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>