Behaviour Driven Spark

Spark is a big deal these days; people are using it for all sorts of exciting data wrangling. There’s a huge trend for ease of use within the Spark community, and with tools like Apache Zeppelin coming onto the scene the barrier to entry is very low. This is all good stuff: open source projects live and die in the first half an hour of use. New users need to get something cool working quickly or they’ll get bored and wander off…

But for those of us who got past Hello World some time ago and are now using Spark as the basis of large and important projects, there’s also the chance to do things right. In fact, since Spark is based on a proper language (Scala, not R or Python, please!) it’s a great chance to bring some well-established best practices into a world where uncontrolled script hackers have held sway for too long!

Check out the source for this article on my GitHub: https://github.com/DanteLore/bdd-spark


Behaviour Driven Development, or BDD, is a bit like unit testing. Like unit testing done by an experienced master craftsman. On the surface they look the same – you write some “test” code which calls your production code with known inputs and checks the outputs are what you want them to be. It can be run from your IDE and automated in your CI build because it uses the same runner as your unit tests under the hood.

For me, TDD and BDD differ in two critical ways. First, BDD tests at the right level: because you’re writing “Specifications” in pseudo-English rather than “Tests” in code, you feel less inclined to test every function of every class. You test at the external touch-points of your app (load this data, write to this table, show this on the UI), which makes your tests less brittle and more business oriented. That leads to the second difference: BDD specs are written in Gherkin, the near-natural language used by Cucumber, which makes them accessible to less techie folks like testers, product owners and stakeholders. Because the specs express business concepts in something close to plain English, even your Sales team have a fighting chance of understanding them… well, maybe.

Project Setup

Before we can crack on and write some Cucumber, there is some setup to be done in the project. I am using IntelliJ, but these steps should work for command line SBT also.

First job, get build.sbt set up for Spark and BDD:

name := "spark-bdd-example"

version := "1.0"
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "log4j" % "log4j" % "1.2.14",
  "org.apache.spark" %% "spark-core" % "1.6.0",
  "org.apache.spark" %% "spark-sql" % "1.6.0",
  "org.apache.spark" %% "spark-mllib" % "1.6.0",
  "org.json4s" %% "json4s-jackson" % "3.2.7",
  "info.cukes" % "cucumber-core" % "1.2.4" % "test",
  "info.cukes" %% "cucumber-scala" % "1.2.4" % "test",
  "info.cukes" % "cucumber-jvm" % "1.2.4" % "test",
  "info.cukes" % "cucumber-junit" % "1.2.4" % "test",
  "junit" % "junit" % "4.12" % "test",
  "org.scalatest" %% "scalatest" % "2.2.4" % "test"
)

For this example I am wrapping Spark up in an object to make it globally available and save me mocking it out “properly”. In a production app, where you need tighter control of the options you pass to Spark, you might want to mock it out and write a “Given” to spin Spark up. Here’s my simple object in Spark.scala:

import org.apache.log4j.{Level, LogManager}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object Spark {
  val conf = new SparkConf()
    .setAppName("BDD Test")
    .setMaster("local[8]")
    .set("spark.default.parallelism", "8")
    .set("spark.sql.shuffle.partitions", "8")

  val sc = new SparkContext(conf)

  // Keep the test output readable by silencing Spark's chatty logging
  LogManager.getRootLogger.setLevel(Level.ERROR)

  val sqlContext = new SQLContext(sc)
  sqlContext.setConf("spark.sql.shuffle.partitions", "8")
}

If you’re using IntelliJ, like me, you’ll also need a test class to run your cucumber. Mine’s in RunTests.scala. Right-click on it and select “Run tests” from the context menu to run the tests.

import cucumber.api.junit.Cucumber
import org.junit.runner.RunWith

@RunWith(classOf[Cucumber])
class RunTests

If using the command line, add this line to project/plugins.sbt:

addSbtPlugin("com.waioeka.sbt" % "cucumber-plugin" % "0.0.3")

And these to build.sbt:

enablePlugins(CucumberPlugin)
CucumberPlugin.glue := ""

First Very Simple Example

Here’s the first bit of actual cucumber. We’re using it for a contrived word-counting example here. The file starts with some furniture, defining the name of the Feature and some information on its purpose, usually in the format In order to achieve some business aim, As the user or beneficiary of the feature, I want some feature.

Feature: Basic Spark

  In order to prove you can do simple BDD with spark
  As a developer
  I want some spark tests

  Scenario: Count some words with an RDD
    When I count the words in "the complete works of Shakespeare"
    Then the number of words is '5'

The rest of the file is devoted to a series of Scenarios – these are the important bits. Each scenario should test a very specific behaviour; there’s no limit to the number of scenarios you can define, so take the opportunity to keep them focussed. As well as a descriptive name, each scenario is made up of a number of steps. Steps can be Givens, Whens or Thens.

  • Given some precondition: pre-test setup. Stuff like creating a mock filesystem object, setting up a dummy web server or initialising the Spark context
  • When some action: call the function you’re testing; make the REST call, whatever
  • Then some test: check the result is what you expected

Step Definitions

Each step is bound to a method, as shown in the “Steps” class below. When the feature file is “executed”, the function bound to each step is run. You can pass parameters to steps, as shown here with the input string and the expected number of words, and you can re-use steps in as many scenarios and features as you like. Note that the binding between steps and their corresponding functions is done with regular expressions.

import cucumber.api.scala.{EN, ScalaDsl}
import org.scalatest.Matchers

class SparkSteps extends ScalaDsl with EN with Matchers {
  When("""^I count the words in "([^"]*)"$""") { (input: String) =>
    Context.result = Spark.sc.parallelize(input.split(' ')).count().toInt
  }

  Then("""^the number of words is '(\d+)'$""") { (expected: Int) =>
    Context.result shouldEqual expected
  }
}

The Context

The Context object here is used to store things… any variables needed by the steps. You could use private fields on the step classes to achieve this, but you’d quickly encounter problems when you began to define steps over multiple classes.

object Context {
  var result = 0
}

I don’t particularly like using a Context object like this, as it relies on having vars, which isn’t nice. If you know a better way, please do let me know via the comments box below!

Data Tables

So the word-counting example above shows how we can do BDD with Spark – we pass in some data and check the result. Great! But it’s not very real. The following example uses Spark DataFrames and Cucumber DataTables to do something a bit more realistic:

  Scenario: Joining data from two data frames to create a new data frame of results
    Given a table of data in a temp table called "housePrices"
      | Price:Int  | Postcode:String | HouseType:String |
      | 318000     | NN9 6LS         | D                |
      | 137000     | NN3 8HJ         | T                |
      | 180000     | NN14 6TN        | S                |
      | 249000     | NN14 6TN        | D                |
    And a table of data in a temp table called "postcodes"
      | Postcode:String | Latitude:Double | Longitude:Double |
      | NN9 6LS         | 51.1            | -1.2             |
      | NN3 8HJ         | 51.2            | -1.1             |
      | NN14 6TN        | 51.3            | -1.0             |
    When I join the data
    Then the data in temp table "results" is
      | Price:Int  | Postcode:String | HouseType:String | Latitude:Double | Longitude:Double |
      | 318000     | NN9 6LS         | D                | 51.1            | -1.2             |
      | 137000     | NN3 8HJ         | T                | 51.2            | -1.1             |
      | 180000     | NN14 6TN        | S                | 51.3            | -1.0             |
      | 249000     | NN14 6TN        | D                | 51.3            | -1.0             |

You only need to write the code to translate the data tables defined in your cucumber to data frames once. Here’s my version:

import cucumber.api.DataTable
import cucumber.api.scala.{EN, ScalaDsl}
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row}
import org.scalatest.Matchers

import scala.collection.JavaConversions._

class ComplexSparkSteps extends ScalaDsl with EN with Matchers {
  def dataTableToDataFrame(data: DataTable): DataFrame = {
    val fieldSpec = data
      .topCells()
      .map(_.split(':'))
      .map(splits => (splits(0), splits(1).toLowerCase))
      .map {
        case (name, "string") => (name, DataTypes.StringType)
        case (name, "double") => (name, DataTypes.DoubleType)
        case (name, "int") => (name, DataTypes.IntegerType)
        case (name, "integer") => (name, DataTypes.IntegerType)
        case (name, "long") => (name, DataTypes.LongType)
        case (name, "boolean") => (name, DataTypes.BooleanType)
        case (name, "bool") => (name, DataTypes.BooleanType)
        case (name, _) => (name, DataTypes.StringType)
      }

    val schema = StructType(
      fieldSpec
        .map { case (name, dataType) =>
          StructField(name, dataType, nullable = false)
        }
    )

    val rows = data
      .asMaps(classOf[String], classOf[String])
      .map { row =>
        val values = row
          .values()
          .zip(fieldSpec)
          .map { case (v, (fn, dt)) => (v, dt) }
          .map {
            case (v, DataTypes.IntegerType) => v.toInt
            case (v, DataTypes.DoubleType) => v.toDouble
            case (v, DataTypes.LongType) => v.toLong
            case (v, DataTypes.BooleanType) => v.toBoolean
            case (v, DataTypes.StringType) => v
          }
          .toSeq

        Row.fromSeq(values)
      }
      .toList

    val df = Spark.sqlContext.createDataFrame(Spark.sc.parallelize(rows), schema)
    df
  }

  Given("""^a table of data in a temp table called "([^"]*)"$""") { (tableName: String, data: DataTable) =>
    val df = dataTableToDataFrame(data)
    df.registerTempTable(tableName)

    df.printSchema()
    df.show()
  }
}

Likewise, you can define a function to compare the output data frame with the “expected” data from the cucumber table. This is a simple implementation; I have seen some much classier versions which report the row and column of the mismatch and so on.

  Then("""^the data in temp table "([^"]*)" is$"""){ (tableName: String, expectedData: DataTable) =>
    val expectedDf = dataTableToDataFrame(expectedData)
    val actualDf = Spark.sqlContext.sql(s"select * from $tableName")

    val cols = expectedDf.schema.map(_.name).sorted

    val expected = expectedDf.select(cols.head, cols.tail: _*)
    val actual = actualDf.select(cols.head, cols.tail: _*)

    println("Comparing DFs (expected, actual):")
    expected.show()
    actual.show()

    actual.count() shouldEqual expected.count()
    expected.intersect(actual).count() shouldEqual expected.count()
  }

Coverage Reporting

There’s a great coverage plugin for Scala which can be added to the project with a single extra line in plugins.sbt – the listing below shows the whole file, including the cucumber plugin from earlier:

logLevel := Level.Warn

addSbtPlugin("com.waioeka.sbt" % "cucumber-plugin" % "0.0.3")
addSbtPlugin("org.scoverage" % "sbt-scoverage" % "1.3.5")

The report is generated with the following SBT command and saved to HTML and XML formats for viewing or ingest by a tool (like SonarQube).

$ sbt clean coverage cucumber coverageReport

...

[info] Written Cobertura report [/Users/DTAYLOR/Development/bdd-spark/target/scala-2.10/coverage-report/cobertura.xml]
[info] Written XML coverage report [/Users/DTAYLOR/Development/bdd-spark/target/scala-2.10/scoverage-report/scoverage.xml]
[info] Written HTML coverage report [/Users/DTAYLOR/Development/bdd-spark/target/scala-2.10/scoverage-report/index.html]
[info] Statement coverage.: 94.69%
[info] Branch coverage....: 100.00%
[info] Coverage reports completed
[info] All done. Coverage was [94.69%]
[success] Total time: 1 s, completed 08-Aug-2016 14:27:17


Conclusion

So, hopefully this long and rambling article has made one key point: you can use BDD to develop Spark apps. Whether you should isn’t something anyone can prove – that bit you’ll just have to take on faith!

 

Predicting MOT Pass Rates with Spark MLlib

Every car in the UK, once it’s three years old, needs to have an MOT test annually to prove it’s safe to drive. The good people at the DVLA have made a large chunk of the data available as part of the government’s push to make more data “open”. You can download the MOT data here. You can also get hold of all the code for this article in my GitHub.

Visualising the data

Before I started doing any machine learning, I did some basic visualisation of the data in a series of charts, just to get an idea of the “shape” of things. I used Spark to process the data (there’s lots of it) and D3.js to create some charts. I haven’t been able to make the charts work in WordPress yet, so you can see them below as screenshots, or elsewhere as a live document.

The data arrives in CSV format, which is very easy to digest but pretty slow when you’re dealing with tens of millions of rows. So the first thing I did was to transform the data to Parquet using Spark’s built in Parquet capabilities. This improved query performance massively.
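
The conversion itself is only a few lines. Here’s a minimal sketch of the kind of one-off job I mean – it assumes the Databricks spark-csv package is on the classpath, and the file paths, options and column handling are illustrative rather than copied from the repo:

// One-off job: read the raw CSV extract and write it back out as Parquet.
// All later queries hit the Parquet copy, which is dramatically faster.
val csv = Spark.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // assumes the extract has a header row
  .option("inferSchema", "true") // let Spark work out the column types
  .load("data/mot-tests.csv")

csv.write.parquet("data/mot-tests.parquet")

// Subsequent jobs just read the Parquet version:
val motTests = Spark.sqlContext.read.parquet("data/mot-tests.parquet")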

Test counts over time

First thing to look at: how many tests are carried out on vehicles of a given age? Basically, how many 3-year-old, 4-year-old, 20-year-old… cars are on the road.  The dataset contains records for MOTs on cars well over 100 years old, but there aren’t many of them.


As you can see from the histogram, most tests are carried out on cars between 3 and 15ish years old.


The accompanying CDF shows that the 95th percentile is roughly around the 15-year mark. Let’s zoom in a bit…


The zoomed-in histogram makes the 10-15 year shelf life of most cars pretty apparent.

Pass rates by age

Are people throwing away their older cars because they’re uncool or because they are broken?


A look at the pass rate over time shows that it’s probably because they’re broken. The pass rate starts off pretty high – well over 85% – but dips to an all-time low at 14 years of age.


Once cars get past the 14 year “death zone” their prospects get better though. As cars get older and older the first-test pass rate heads back up towards 100%. At around 60 years old, cars have a better chance of passing their MOT than when they’re brand new!

I guess it’s safe to assume that cars over 30 years of age are treated with a little more respect.  They’re “classics” after all. Once a car is 80+ years old it probably lives in a museum or private collection and drives very little throughout the year. The MOT test is much “easier” for older cars too – a 100 year old car does not have to pass emissions!

Manufacturers

The pass rate changes differently as cars from different manufacturers get older. Some manufacturers make “disposable” cars, some make cars designed to be classics the day they leave the showroom (Aston Martin, Lotus, Rolls Royce). Some make cheap cars that people care less about (Vauxhall, Ford), some make posh cars people take care of (Audi, BMW). Japanese manufacturers seem to be able to build cars with very steady pass rates over time.


It might not be a shock that Bentley and Porsche are at the top here, with TVR close behind. For me the biggest surprise was that Ford takes the deepest dip at the 14 year mark. Fords are clearly not built to last… or maybe people don’t care for them. Renault and Alfa Romeo join Ford at the bottom of the table here.

Numbers of cars

It’s all very well to be mean to Ford about their poor longevity, but they do have more cars on the road than pretty much anyone else. Check out the heatmap.

While we’re counting cars, it looks like silver is the most popular colour. The MOT test data “runs out” in 2013, so I’d expect to see a lot more white cars these days.


Some code

OK, so we’ve looked at some charts, now let’s look at some code. All the charts above were generated by simple Spark dataframe apps, wrapped up in a unit test harness for ease of use. Here’s an example:

  it should "calculate pass rate by age band and make" in {
    val motTests = Spark.sqlContext.read.parquet(parquetData).toDF()
    motTests.registerTempTable("mot_tests")

    val results =
      motTests
        .filter("testClass like '4%'") // Cars, not buses, bikes etc
        .filter("testType = 'N'") // only interested in the first test
        .filter("firstUseDate <> 'NULL' and date <> 'NULL'")
        .withColumn("passCount", passCodeToInt(col("testResult")))
        .withColumn("age", testDateAndVehicleFirstRegDateToAge(col("date"), col("firstUseDate")))
        .groupBy("age", "make")
        .agg(count("*") as "cnt", sum("passCount") as "passCount")
        .selectExpr("make", "age", "cnt", "passCount * 100 / cnt as rate")
        .filter("cnt >= 1000")
        .rdd

    val resultMap =
      results
        .map({
          x => (
            x.getString(0),
            x.getInt(1),
            x.getLong(2),
            x.getDouble(3)
            )
        })

    val mappedResults =
      resultMap
        .groupBy { case (make, age, cnt, rate) => make }
        .map { case (make, stuff) =>
          AgeAndMakeResults(make,
            stuff
              .map { case (_, age, cnt, rate) => new RateByAge(age, cnt, rate) }
              .filter(x => x.age >= 3 && x.age <= 20)
              .toSeq)
        }
        .filter(_.series.length >= 18)
        .collect()

    JsonWriter.writeToFile(mappedResults, resultsPath + "passRateByAgeBandAndMake.json")
  }

Not sure what else there is to say about the code. Have a read, or hit my GitHub if you want to play around with it!

Machine Learning:  Predicting pass rate

Spark’s MLlib comes with all sorts of machine learning algorithms for predicting and classifying (mainly the latter) data. I looked at decision trees, random forests and neural networks for this. The idea was to turn some properties of a vehicle such as age, mileage, manufacturer, model, fuel type and so on into a pass/fail prediction.

It didn’t work! Yes, sorry, that’s right, it’s not really possible to predict a straight pass or fail.  Even in the worst case, the first-test pass rate for all different classes of car is over 50%. Machine learning techniques being as they are, this means that the simplest solution for any predictive model is simply to predict a pass every time.

This happened with all three techniques – neural nets, decision trees and random forests all “learned” to predict a pass every time, giving them a 50-60-ish% accuracy.  Darn it!
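
To make that concrete, here’s roughly what the failed binary attempt looked like. This is a hedged sketch rather than the exact code from the repo – it assumes labeledPoints is an RDD[LabeledPoint] where the label is 1.0 for a pass and 0.0 for a fail:

import org.apache.spark.mllib.tree.DecisionTree

val Array(training, test) = labeledPoints.randomSplit(Array(0.8, 0.2))

// Train a straight pass/fail classifier
val model = DecisionTree.trainClassifier(
  training,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "gini",
  maxDepth = 8,
  maxBins = 100)

val accuracy = test
  .map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0)
  .mean()

// The depressing comparison: how well does "always predict a pass" do?
val baseline = test.filter(_.label == 1.0).count().toDouble / test.count()
println(s"Model accuracy: $accuracy, always-pass baseline: $baseline")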

Predicting Pass Probability Classes

So, if you can’t predict two classes (“PASS” and “FAIL”), maybe it’s easier to predict pass probability classes (50-60%, 60-70%, 70-80% and so on). Well, yes, it was slightly more successful, but not exactly stunning!
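
For reference, the banding itself is just a UDF. Here’s a hedged guess at what a passRateToCategory-style function (used in the listing further down) might look like – the real one is in the repo and may differ in detail:

import org.apache.spark.sql.functions.udf

// Band each group's observed pass rate into one of ten decile classes:
// 0 for 0-10%, 1 for 10-20%, ... 9 for 90-100%.
val passRateToCategory = udf((cnt: Long, passCount: Long) => {
  val rate = passCount.toDouble / cnt.toDouble
  math.min(9, (rate * 10).toInt)
})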

The best results I got were predicting 10 pass rate classes for each decile of probability. This gave me these rather lame results:

Mean Error: 1.1532896239958372
Precision: 0.3961880457753499

So the mean error is greater than 1 – on average, a prediction lands more than one class away from its true class. The precision shows only 40% of samples being predicted correctly. Pants.

The confusion matrix tells a slightly more positive story though – here it is rendered as a colour map:


The confusion matrix shows the class predicted by the model (column) versus the actual class of the sample (row). A perfect predictor would give a diagonal green line from top left to bottom right, showing every class predicted correctly.

In this case, the random forest is attempting to predict the banded pass rate (0: 0% – 10%, 1: 10% – 20%, 2: 20% – 30%, … 9: 90% – 100%). Since virtually no classes of vehicle exist where the pass rate is less than 40%, it doesn’t do very well at those levels; from 40% to 80%, however, it does pretty well.

Some More Code

The code is complex – Spark makes it easy to run machine learning algorithms, but there are a lot of bits and bobs round the edges, like UDFs and utility functions. The following listing is the algorithm which gave me the results above (my best attempt). Hit the GitHub link at the top of this page if you want to dig further into the code.

it should "use a decision tree to classify probability classes" in {
    val motTests = Spark.sqlContext.read.parquet(parquetData).toDF()
    motTests.registerTempTable("mot_tests")

    val keyFields = Seq("make", "colour", "mileageBand", "cylinderCapacity", "age", "isPetrol", "isDiesel")

    // Get the distinct values for category fields
    val distinctCategoryValues = Seq("make", "colour")
      .map(fieldName => (fieldName, motTests.select(col(fieldName)).distinct().map(_.getString(0)).collect().toList)).toMap

    // A UDF to convert a text field into an integer index
    // Should probably do this before the Parquet file is written
    val indexInValues = udf((key : String, item : String) => distinctCategoryValues(key).indexOf(item))

    val data =
      motTests
        .filter("testClass like '4%'") // Cars, not buses, bikes etc
        .filter("firstUseDate <> 'NULL' and date <> 'NULL'") // Must be able to calculate age
        .filter("testMileage > 0") // ignore tests where no mileage reported
        .filter("testType = 'N'") // only interested in the first test
        .withColumn("testPassed", passCodeToInt(col("testResult")))
        .withColumn("age", testDateAndVehicleFirstRegDateToAge(col("date"), col("firstUseDate")))
        .withColumn("isPetrol", valueToOneOrZero(lit("P"), col("fuelType")))
        .withColumn("isDiesel", valueToOneOrZero(lit("D"), col("fuelType")))
        .withColumn("mileageBand", mileageToBand(col("testMileage")))
        .groupBy(keyFields.map(col): _*)
        .agg(count("*") as "cnt", sum("testPassed") as "passCount")
        .filter("cnt > 10")
        .withColumn("passRateCategory", passRateToCategory(col("cnt"), col("passCount")))
        .withColumn("make", indexInValues(lit("make"), col("make")))
        .withColumn("colour", indexInValues(lit("colour"), col("colour")))
        .selectExpr((keyFields :+ "passRateCategory").map(x => s"cast($x as double) $x"):_*)
        .cache()

    data.printSchema()

    val labeledPoints = toFeatures(data, "passRateCategory", keyFields)

    labeledPoints.take(10).foreach(println)

    val Array(trainingData, testData, validationData) = labeledPoints.randomSplit(Array(0.8, 0.1, 0.1))
    trainingData.cache()
    testData.cache()
    validationData.cache()

    trainingData.take(10).foreach(println)

    val categoryMap = Seq("make", "colour").map(field => {
      ( data.columns.indexOf(field), distinctCategoryValues(field).length )
    }).toMap

    val model = RandomForest.trainClassifier(trainingData, 11, categoryMap, 20, "auto", "gini", 8, 500)

    val predictionsAndLabels = validationData.map(row => (model.predict(row.features), row.label))
    predictionsAndLabels.take(10).foreach(println)
    val metrics = new MulticlassMetrics(predictionsAndLabels)

    val error = math.sqrt(predictionsAndLabels.map({ case (v, p) => math.pow(v - p, 2)}).sum() / predictionsAndLabels.count())
    println(s"Mean Error: $error")
    println(s"Precision: ${metrics.precision}")

    println("Confusion Matrix")
    println(metrics.confusionMatrix)

    CsvWriter.writeMatrixToFile(metrics.confusionMatrix, resultsPath + "decision-tree-probability-classes-confusion-matrix.csv")

    for(x <- 0 to 10) {
      println(s"Class: $x, Precision: ${metrics.precision(x)}, Recall: ${metrics.recall(x)}")
    }
  }
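
The listing above leans on a handful of helpers from the repo – passCodeToInt, mileageToBand, toFeatures and friends. To give a flavour of what they do, here’s a minimal, hypothetical sketch of a toFeatures-style function; the real implementation may differ, but the job is the same: turn each row of the prepared DataFrame into an MLlib LabeledPoint.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// One column becomes the label, the remaining key fields become the feature
// vector. Assumes every column has already been cast to double, as in the
// selectExpr above.
def toFeatures(data: DataFrame, labelColumn: String, featureColumns: Seq[String]): RDD[LabeledPoint] = {
  data.map { row =>
    val label = row.getDouble(row.fieldIndex(labelColumn))
    val features = featureColumns.map(c => row.getDouble(row.fieldIndex(c)))
    LabeledPoint(label, Vectors.dense(features.toArray))
  }
}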

Conclusions

I think I proved that MLlib and Spark are a great choice for writing machine learning algorithms very quickly and with very little knowledge.

I think I also proved that Data Scientists need to know a hell of a lot more than how to fire up a random forest. I know a little bit about data and machine learning (thus the name of this website!) but in order to make much sense of a dataset like this you need a whole arsenal of tricks up your sleeve.

As usual, D3.js and AngularJS are great.

Train Departure Board

You can find the code for this article on my github: https://github.com/DanteLore/national-rail.

Having found myself time-wealthy for a couple of weeks, I’ve been playing around with some open data sets, one of which is the National Rail SOAP API. It’s not a new dataset – I think it’s been around for a decade or so – but it seemed like a sensible thing for me to play with, as I’ll be on trains a lot more when I start selling my time to a new employer next month!

I live about a mile away from the local station (Thatcham) so it only takes a few minutes to get there. If a train is delayed or cancelled I’d like to know so I can have another coffee. So what I need is a live departures board, for my local station, in my house. Something like this:


The UI is web based – using AngularJS again. Sadly though, the cross-origin nonsense means I can’t make the SOAP calls directly from the web client, so I need a back-end to gather and store the data for use on the UI. I used Python for this because it gives me the option to (easily) run it on a Raspberry Pi, reducing power and space costs as well as noise. Python’s library support is stunning, and this is another great reason to use it for small “hacks” like this one.

SOAPing Up

SOAP is horrible. It’s old, it’s heavy, it’s complex and worst of all it’s XML based. This isn’t a huge handicap though, as we can hit a SOAP service using the requests HTTP library – simply sending a POST with some XML like so:

import requests
import xmltodict

xml_payload = """<?xml version="1.0"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="http://thalesgroup.com/RTTI/2016-02-16/ldb/" xmlns:ns2="http://thalesgroup.com/RTTI/2013-11-28/Token/types">
  <SOAP-ENV:Header>
    <ns2:AccessToken>
      <ns2:TokenValue>{KEY}</ns2:TokenValue>
    </ns2:AccessToken>
  </SOAP-ENV:Header>
  <SOAP-ENV:Body>
    <ns1:GetDepBoardWithDetailsRequest>
      <ns1:numRows>12</ns1:numRows>
      <ns1:crs>{CRS}</ns1:crs>
      <ns1:timeWindow>120</ns1:timeWindow>
    </ns1:GetDepBoardWithDetailsRequest>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
"""

# url: The URL of the service
# key: Your National Rail API key
# crs: Station code (e.g. THA or PAD)
def fetch_trains(url, key, crs):
    headers = {'content-type': 'text/xml'}
    payload = xml_payload.replace("{KEY}", key).replace("{CRS}", crs)
    response = requests.post(url, data=payload, headers=headers)

    data = xmltodict.parse(response.content)
    services = data["soap:Envelope"]["soap:Body"]["GetDepBoardWithDetailsResponse"]["GetStationBoardResult"]["lt5:trainServices"]["lt5:service"]

    for service in services:
        raw_points = service["lt5:subsequentCallingPoints"]["lt4:callingPointList"]["lt4:callingPoint"]

        calling_points = map(lambda point: {
            "crs": point["lt4:crs"],
            "name": point["lt4:locationName"],
            "st": point.get("lt4:st", "-"),
            "et": point.get("lt4:et", "-")
        }, raw_points)

        cp_string = "|".join(
                map(lambda p: "{0},{1},{2},{3}".format(p["crs"], p["name"], p["st"], p["et"]), calling_points)
        )

        yield {
            "crs": crs,
            "origin": service["lt5:origin"]["lt4:location"]["lt4:locationName"],
            "destination": service["lt5:destination"]["lt4:location"]["lt4:locationName"],
            "std": service.get("lt4:std"),
            "etd": service.get("lt4:etd"),
            "platform": service.get("lt4:platform", "-"),
            "calling_points": cp_string
        }

So, I take a pre-formatted XML request, add the key and station code then POST it to the API URL. Easy. The result comes back in XML which can be parsed very easily using the xmltodict library. I used Postman to test the calls before translating to Python (and if you’re lazy, Postman will even write the code for you!).

The Python script takes the data it’s gathered and stores it in an SQLite database. I’m not going to show the code here because it’s all in GitHub anyway.

Having a REST

So the data is all in a DB; now it needs to be made available to the JavaScript client somehow. To do this I created a simple REST service using the excellent Flask library for Python. It’s in the same genre as Sinatra, Nancy, Scalatra and all the other microservice frameworks I love to use. Here’s all the code you need to serve up data via REST:

import argparse
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)


@app.route('/departures')
def departures():
    return jsonify(fetch_departures("departures"))


@app.route('/departures/<string:crs>')
def departures_for(crs):
    return jsonify(fetch_departures("departures", crs))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='National Rail Data REST Server')
    parser.add_argument('--db', help='SQLite DB Name', default="../data/trains.db")
    args = parser.parse_args()
    db = args.db

    app.run(debug=True)

Front End

The front end is a very simple AngularJS app. There’s not much point showing the code here (see GitHub) – it’s about as simple as it gets, using a dot matrix font I found via Google and a theme from Bootswatch.

The design is based on a real-life station departures board.

All in all the project took me a little over a day. A leisurely day with many interruptions from my daughters! Feel free to pull the code down and play with it – let me know what you do.


Quick TeamCity Build Status with AngularJS

So, this isn’t supposed to be the ultimate guide to AngularJS or anything like that – I’m not even using the latest version – this is just some notes on my return to The World of the View Model after a couple of years away from WPF. Yeah, that’s right, I just said WPF while talking about Javascript development. They may be different technologies from different eras: one may be the last hurrah of bloated fat-client development and the other may be the latest and greatest addition to the achingly-cool, tie dyed hemp tool belt of the Single Page App hipster, but under the hood they’re very very similar. Put that in your e-pipe and vape it, designer-bearded UX developers!


Anyway, when I started, I knew nothing about SPA development. I’d last done JavaScript several years ago and never really used it as a real language. I still contend that JavaScript isn’t a real language (give me Scala or C# any day of the week) but you can’t ignore the fact that this is how user interfaces are developed these days… so, yeah, I started with a tutorial on YouTube.

I decided to do an Information Radiator to show build status from TeamCity on the web. Information Radiators are my passion – at least they’re one of the few passions I’m allowed to pursue at work – and we use Team City for all our continuous integration, release builds, automated tests and so on. Our old radiators are coded in WPF, which looks awesome on the big TVs dotted around the office, but doesn’t translate well for remote workers.

There is no sunshine and there are no rainbows in this article. I found JavaScript to be a hateful language, filled with boilerplate and confusion. Likewise, though TeamCity is doubtless the best enterprise CI platform on planet earth, its REST APIs are pretty painful to consume. With that in mind, let’s get into the weeds and see how this thing works…

Enable cross-site scripting (CORS) on your Team City server

You can’t hit a server from a web page unless that server is the server that served the web page you’re hitting the server with… unless of course you tell the server you want to hit that the web page you want to hit it with, served from a different server, is allowed to hit it. Got that? Thought so. This is all because of a really logical thing called “Cross Origin Resource Sharing”, which you can enable pretty easily in TeamCity as long as you have admin permissions.

Check out Administration -> Server Administration -> Diagnostics -> Internal Properties. From there you should be able to edit, or at least get the location of the internal.properties file. Weirdly, if the file doesn’t exist, there is no option to edit, so you have to go and create the file. Since my TeamCity server is running on a Windows box, I created the new file here:

C:\ProgramData\JetBrains\TeamCity\config\internal.properties

and added the following:

rest.cors.origins=*

You might want to be a little more selective on who you allow to access the server this way – I guess it depends on how secure your network is, how many clients access the dashboard and so on.

Tool Chain

This article is about AngularJS and it’s about TeamCity. It’s not about NPM or Bower or any of that nonsense. I’m not going to minify my code or use any crazy new-fangled pseudo-cosmic CSS. So setting up the build environment for me was pretty easy: create a folder, add a file called “index.html”, fire up the fantastic Fenix Web Server and configure it to serve up the folder we just created. Awesome.

If you’re already confused, or if you just want to play with the code, you can download the lot from GitHub: https://github.com/DanteLore/teamcity-status-with-angular

I promise to do my best

Hopefully you’ve watched the video I linked above, so you know the basics of an AngularJS app. If not, do so now. Then maybe Google around the subject of promises and http requests in AngularJS. Done that? OK, good.

Web requests take a while to run. In a normal app you might fetch them on another thread but not in JavaScript. JavaScript is all about callbacks. A Promise is basically a callback that promises to get called some time in the future. They are actually pretty cool, and they form the spinal column of the build status app. This is because the TeamCity API is so annoying. Let me explain why. In order to find out the status (OK or broken) and state (running, finished) of each build configuration you need to make roughly six trillion HTTP requests as follows:

  1. Fetch a list of the build configurations in the system. These are called “Build Types” in the API and have properties like “name”, “project” and “id”
  2. For each Build Type, make a REST request to get information on the latest Build with a matching type ID. This will give you the “name”, “id” and “status” of the last finished build for the given Build Type.
  3. Fetch a list of the currently running builds.
  4. Use the list of finished builds and the list of running builds to create a set of status tiles (more on this later)
  5. Add the tiles to the angular $scope and let the UI render them

Here’s how that looks in code. Hopefully not too much more complicated than above!

buildFactory.getBuilds()
	.then(function(responses) {
		$scope.buildResponses = responses
			.filter(function(r) { return (r.status == 200 && r.data.build.length > 0)})
			.map(function(r){ return r.data.build[0] })
	})
	.then(buildFactory.getRunningBuilds)
	.then(function(data) {
		$scope.runningBuilds = data.data.build.map(function(row) { return row.buildTypeId })
	})
	.then(function() {
		$scope.builds = $scope.buildResponses.map(function(b) { return buildFactory.decodeBuild(b, $scope.runningBuilds); });
	})
	.then(function() {
		$scope.tiles = buildFactory.generateTiles($scope.builds)
	})
	.then(function() {
		$scope.statusVisible = false;
	});

Most of the REST access has been squirrelled away into a factory. And yes, our build server is called “tc”, guest access is allowed to the REST APIs, and I have enabled CORS too… because sometimes productivity is more important than security!

angular.module('buildApp').factory('buildFactory', function($http) {
	var factory = {};
	  
	var getBuildTypes = function() {
		return $http.get('http://tc/guestAuth/app/rest/buildTypes?locator=start:0,count:100');
	};
	
	var getBuildStatus = function(id) {
		return $http.get('http://tc/guestAuth/app/rest/builds?locator=buildType:' + id + ',start:0,count:1&fields=build(id,status,state,buildType(name,id,projectName))');
	};
	
	factory.getRunningBuilds = function() {
		return $http.get('http://tc/guestAuth/app/rest/builds?locator=running:true');
	};

// etc

Grouping and Tiles

We have over 100 builds. Good teams have lots of builds. Not too many, just lots. Every product (basically every team) has CI builds, release/packaging builds, continuous deployment builds, continuous test builds, metrics builds… we have a lot of builds. Builds are good.

But a screen with 100+ builds on it means very little. This is an information radiator, not a formal report. So, I use a simple (but messy) algorithm to convert a big list of Builds into a smaller list of Tiles:

  1. Take the broken builds (hopefully not many) and turn each one into a Tile
  2. Take the successful builds and group them by “project” (basically the category, which is basically the team or product name)
  3. Turn each group of successful builds into a Tile, using the “project” as the tile name
  4. Mark any “running” build with a flag so we can give feedback in the UI


Displaying It

Nothing very exciting here. I used Bootstrap – well, a derivative of Bootstrap – to make the UI look nice. I bound some content to the View Model and that’s about it. Download the code and have a look if you like.

Here’s my index.html (which shows all the libraries I used):

<html ng-app="buildApp">
<head>
  <title>Build Status</title>
  
  <link href="https://bootswatch.com/cyborg/bootstrap.min.css" rel="stylesheet">
  <!--link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css" rel="stylesheet"-->
</head>

<body>
  <div ng-view>
  </div>

  <script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.5.5/angular.min.js"></script>
  <script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.5.5/angular-route.js"></script>
  <script src="https://code.jquery.com/jquery-2.2.3.min.js"></script>
  <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"></script>
  <script src="utils.js"></script>
  <script src="app.js"></script>
  <script src="build-factory.js"></script>
</body>
</html>

Here’s the “view” HTML for the list (in “templates/list.html”). I love the Angular way of specifying Views and Controllers by the way. Note the cool animated CSS for the “in progress” icon.

<div>
  <style>
	.glyphicon-refresh-animate {
		animation: spin 1s infinite linear;
		-webkit-animation: spin2 1s infinite linear;
	}

	@-webkit-keyframes spin2 {
		from { -webkit-transform: rotate(0deg);}
		to { -webkit-transform: rotate(360deg);}
	}

	@keyframes spin {
		from { transform: scale(1) rotate(0deg);}
		to { transform: scale(1) rotate(360deg);}
	}
  </style>
  
	<div class="page-header">
		<h1>Build Status <small>from TeamCity</small></h1>
	</div>
	  
    <div class="container-fluid">
		<div class="row">
    		<div class="col-md-3" ng-repeat="tile in tiles | orderBy:'status' | filter:nameFilter">
        		<div ng-class="getPanelClass(tile)">
               <h5><span ng-class="getGlyphClass(tile)" aria-hidden="true"></span>   {{ tile.name | limitTo:32 }}{{tile.name.length > 32 ? '...' : ''}}   {{ tile.buildCount > 0 ? '(' + tile.buildCount + ')' : ''}} </h5>
               <p class="panel-body">{{ tile.project }}</p>
              </div>
        	</div>
    	</div>
    </div>
	<br/><br/><br/><br/><br/><br/>
  
  <nav class="navbar navbar-default navbar-fixed-bottom">
  <div class="container-fluid">
    <p class="navbar-text navbar-left">
		<input type="text" ng-model="nameFilter"/>  <span class="glyphicon glyphicon-filter" aria-hidden="true"></span>  
		<span class="glyphicon glyphicon-refresh glyphicon-refresh-animate" ng-hide="!statusVisible"></span>
	</p>
  </div>
</nav>
</div>

That’s about it!

I think I summarized how I feel about this project in the introduction. It looks cool and the MVC MVVM ViewModel vibe is a good one. The data binding is simple and works very well. All my gripes are with JavaScript as a language really. I want Linq-style methods and I want classes and objects with sensible scope. I want less syntactic nonsense, maybe the odd => every now and again. I think some or all of that is possible with libraries and new language specs… but I want it without any effort!

One thing I will say: that whole page is less than 300 lines of code. That’s pretty darned cool.

Feel free to download and use the app however you like – just bung in a link to this page!


Mapserver and Leaflet

Leaflet

Leaflet is a very simple but incredibly powerful JavaScript mapping library that lets you add interactive maps to your website very easily. Try scrolling and zooming around this one:

For example, to add that map to this page, all I did was add the following code (after reading the Quick-Start Guide):

<div id="danMap" style="height: 200px;"></div>

<link rel="stylesheet" href="http://cdn.leafletjs.com/leaflet-0.7.3/leaflet.css" />
<script type="text/javascript" src="http://cdn.leafletjs.com/leaflet-0.7.3/leaflet.js"></script>

<script type="text/javascript">
var map = L.map('danMap').setView([51.4, -1.25], 13);	
L.tileLayer('http://{s}.tiles.mapbox.com/v3/YOUR.MAP.KEY/{z}/{x}/{y}.png', 
{
  attribution: 'Map data © <a href="http://openstreetmap.org">OpenStreetMap</a> contributors, <a href="http://creativecommons.org/licenses/by-sa/2.0/">CC-BY-SA</a>, Imagery © <a href="http://mapbox.com">Mapbox</a>',
  maxZoom: 18,
}).addTo(map);
</script>

You can add points, polygons and visualise all kinds of live data using a simple web service that returns some GeoJson data. It works like a charm on mobile devices too!

Why combine Leaflet with Mapserver?

I have a couple of use-cases that meant I needed to look at combining Leaflet with Mapserver. This turns out to be easy enough as Leaflet can hook up to any tile provider and Mapserver can be set up to serve images as a Web Map Service (WMS).

The first thing I wanted to do was serve up some map data when not connected to the internet. Imagine I am in the middle of nowhere, connected from my phone or tablet via WiFi to the Raspberry Pi in the back of the Land Rover. I have a GPS signal, but I don’t have any connection to a map imagery server as there’s no mobile coverage. I need to use Mapserver to render some local map data so I can see where I am. This use case has a boring work-related benefit too – it enables you to serve up maps in a web-based mapping application behind a strict corporate firewall.

The other use case is simple: raster data. Lots of the data we deal with where I work is served up as raster data by Mapserver. Imagine it as heat-maps of some KPI value layered on top of a street map.

Setting up Mapserver

There are a couple of things you need to do to get Mapserver to act as a WMS. The first is to add a projection and web metadata element to the root of the map file (as below). After a large amount of head-scratching, wailing and gnashing of teeth I found that the de-facto standard projection for all “internet” map data is EPSG 3857. Make sure you use that EPSG at the root of your map file.

PROJECTION
  "init=epsg:3857"
END
	
WEB
  METADATA
    "wms_title" "Dans Layers and Stuff"
    "wms_onlineresource" "http://192.168.2.164/cgi-bin/mapserv.exe?"
    "wms_enable_request" "*"
    "wms_srs" "EPSG:3857"
    "wms_feature_info_mime_type" "text/html"
    "wms_format" "image/png"
  END
END

The next thing to do is add some extra stuff to every layer in your map. You need to set the STATUS field to ‘on’, add a METADATA element and set the ‘wms_title’ to something sensible, and finally add a projection, specifying the projection the layer data is stored in. As I am using the OS VectorMap District dataset, which is in the OSGB projection, I used EPSG 27700.

LAYER
  NAME         Woodland
  DATA         Woodland
  PROJECTION
    "init=epsg:27700"
  END
  METADATA
    "wms_title" "Woodland"
  END
  STATUS       on
  TYPE         POLYGON
  CLASS	
    STYLE
      COLOR 20 40 20
    END
  END
END 

Connecting it Together

You can then add a new layer to the Leaflet map, connected to your Mapserver. Here I’m using ms4w, the Windows version of Mapserver and hooking it up to a map file in my Dropbox folder. The map file I am using is the one I created for a previous post on mapserver.

L.tileLayer.wms("http://localhost:8001/cgi-bin/mapserv.exe?map=D:\\Dropbox\\Data\\Mapfiles\\leaflet.map", {
			layers: 'Roads,MotorwayJunctions',
			format: 'image/png',
			transparent: true,
			attribution: "Dan's Amazing Roads",
			maxZoom: 18,
			minZoom: 12,
		}).addTo(map);

Sadly I don’t have a mapserver instance on the internet, so all I can show here is a couple of screenshots. You’ll just have to take my word for it – it works brilliantly!


Shape Files and SQL Server

Over the last couple of weeks I have been doing a lot of work importing polygons into an SQL Server database, using them for some data processing tasks and then exporting the results as KML for display. I thought it’d be worth a post to record how I did it.

Inserting polygons (or any other geometry type) from a shape file into the database can be done with the ogr2ogr tool, which ships with the GDAL libraries (and with Mapserver for Windows). I knocked up a little batch file to do it:

SET InputShapeFile="D:\Dropbox\Data\SingleView\Brazillian Polygons\BRA_adm3.shp"

SET SqlConnectionString="MSSQL:Server=tcp:yourserver.database.windows.net;Database=danTest;Uid=usernname@yourserver.database.windows.net;Pwd=yourpassword;"

SET TEMPFILE="D:\Dropbox\Data\Temp.shp"
SET OGR2OGR="C:\ms4w\tools\gdal-ogr\ogr2ogr.exe"
SET TABLENAME="TestPolygons"

%OGR2OGR% -overwrite -simplify 0.01 %TEMPFILE% %InputShapeFile% -progress

%OGR2OGR% -lco "SHPT=POLYGON" -f "MSSQLSpatial" %SqlConnectionString% %TEMPFILE% -nln %TABLENAME% -progress

The first ogr2ogr call is used to simplify the polygons. The value 0.01 is the minimum length of an edge (in degrees in this case) to be stored. The results of this command are pushed to a temporary shape file set. The second call to ogr2ogr pushes the polygons from the temp file up to a database in Windows Azure. The same code would work for a local SQL Server; you just need to tweak the connection string.

You can use SQL Server Management Studio to show the spatial results of your query, which is nice! Here I just did a “select * from testPolygons” to see the first 5000 polygons from my file.


SQL Server contains all sorts of interesting data processing options, which I’ll look at another time. Here I’ll just skip to the final step – exporting the polygon data from the database to a local KML file.


SET KmlFile="D:\Dropbox\Data\Brazil.kml"

SET SqlConnectionString="MSSQL:Server=tcp:yourserver.database.windows.net;Database=danTest;Uid=usernname@yourserver.database.windows.net;Pwd=yourpassword;"

SET TEMPFILE="D:\Dropbox\Data\Temp.shp"
SET OGR2OGR="C:\ms4w\tools\gdal-ogr\ogr2ogr.exe"
SET SQL="select * from TestPolygons"

%OGR2OGR% -lco "SHPT=POLYGON" -f "KML" %KmlFile% -sql %SQL% %SqlConnectionString%  -progress

Obviously you can make the SQL in that command as complex as you like.

The polygons here are from this site, which allows you to download polygon datasets for various countries.

Combining Shape Files

This is one of those things that’s easy when you know how. Just so I don’t forget, here’s how to combine shape files using ogr2ogr.

I wrote it as a batch file to combine all the OSGB grid squares from the OS VectorMap District dataset into a single large data file for use with MapServer.

echo off

set OGR2OGR="C:\ms4w\tools\gdal-ogr\ogr2ogr"
set inputdir="D:\Dropbox\Data\OS VectorMap"
set outputdir="D:\Dropbox\Data\OS VectorMap Big"

set tiles=(HP HT HU HW HX HY HZ NA NB NC ND NF NG NH NJ NK NL NM NN NO NR NS NT NU NW NX NY NZ OV SC SD SE TA SH SJ SK TF TG SM SN SO SP TL TM SR SS ST TU TQ TR SV SW SX SY SZ TV)

set layers=(Airport AdministrativeBoundary Building ElectricityTransmissionLine Foreshore GlassHouse HeritageSite Land MotorwayJunction NamedPlace PublicAmenity RailwayStation RailwayTrack Road RoadTunnel SpotHeight SurfaceWater_Area SurfaceWater_Line TidalBoundary TidalWater Woodland)

del /Q %outputdir%\*.*

FOR %%L IN %layers% DO (
%OGR2OGR% %outputdir%\%%L.shp %inputdir%\SU_%%L.shp
)

FOR %%T IN %tiles% DO FOR %%L IN %layers% DO (
%OGR2OGR% -update -append %outputdir%\%%L.shp %inputdir%\%%T_%%L.shp -nln %%L
)

After a little map file jiggery-pokery I can now render a huge map of the UK, or tiles containing smaller maps, without the many layer definitions needed to use ~20 shape file sets.
