ETL & Pipelining With Alpakka Kafka

Real-time streaming ETL/pipelining of property listing data

For usage demonstration, the application runs ETL/pipelining of data with a minified real estate property listing data model. Expanding the model, or replacing it altogether with a different one, should not affect how the core streaming ETL system operates.

Below are a couple of links related to library dependencies and configurations for the core application:

Library dependencies in build.sbt
Configurations for Akka, Kafka, PostgreSQL & Cassandra in application.conf

It’s also worth noting that the application can be scaled up with configuration changes alone. For example, if the Kafka brokers and Cassandra database span multiple hosts, Kafka’s bootstrap.servers could be set to "10.1.0.1:9092,10.1.0.2:9092,10.1.0.3:9092" and Cassandra’s contact-points might look like ["10.2.0.1:9042","10.2.0.2:9042"].

Next, let’s familiarize ourselves with the property listing data definitions in PostgreSQL and Cassandra, as well as the property listing classes that model the schemas.

schema_postgresql.script.txt
schema_cassandra.script.txt
PropertyListing.scala

A Kafka producer using Alpakka Csv

Alpakka comes with a simple API for CSV file parsing: the method lineScanner() takes parameters including the delimiter character and returns a Flow[ByteString, List[ByteString], NotUsed].

Below is the relevant code in CsvPlain.scala that highlights how the CSV file gets parsed and turned into a stream of Map[String,String] via CsvParsing and CsvToMap, and then transformed into a stream of PropertyListing objects.

    val source: Source[PropertyListing, NotUsed] =
      FileIO.fromPath(Paths.get(csvFilePath))
        .via(CsvParsing.lineScanner(CsvParsing.Tab))
        .viaMat(CsvToMap.toMapAsStrings())(Keep.right)
        .drop(offset).take(limit)
        .map(toClassPropertyListing(_))

    // ...

    source
      .map{ property =>
        val prodRec = new ProducerRecord[String, String](
          topic, property.propertyId.toString, property.toJson.compactPrint
        )
        println(s"[CSV] >>> Producer msg: $prodRec")
        prodRec
      }
      .runWith(Producer.plainSink(producerSettings))

Note that the drop(offset)/take(limit) line, which can be useful for testing, selects a segment of the stream source and can be removed if preferred.

A subsequent map wraps each of the PropertyListing objects in a ProducerRecord[K,V] with the associated topic and key/value of type String/JSON before being streamed into Kafka via Alpakka Kafka’s Producer.plainSink().
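
To make the CsvToMap step more concrete, here is a minimal sketch in plain Scala collections (no Akka involved) of what happens to each row: the header row supplies the keys, every subsequent row becomes a Map[String,String], and each map is converted into a class instance. MiniListing, the column names and the helper names are hypothetical stand-ins for illustration, not the application's actual PropertyListing or toClassPropertyListing().

```scala
import java.util.regex.Pattern

// Hypothetical, trimmed-down stand-in for PropertyListing
final case class MiniListing(propertyId: Int, city: Option[String], listPrice: Option[Double])

object CsvSketch {
  // Pair every data row with the header row, similar in spirit to CsvToMap
  def toMaps(lines: List[String], delimiter: Char = '\t'): List[Map[String, String]] =
    lines match {
      case header :: rows =>
        val sep = Pattern.quote(delimiter.toString)
        val columns = header.split(sep, -1).toList
        // limit -1 keeps trailing empty fields, so short rows still line up
        rows.map(row => columns.zip(row.split(sep, -1).toList).toMap)
      case Nil => Nil
    }

  // Empty strings are treated as missing values
  def toListing(m: Map[String, String]): MiniListing =
    MiniListing(
      propertyId = m("property_id").toInt,
      city = m.get("city").filter(_.nonEmpty),
      listPrice = m.get("list_price").filter(_.nonEmpty).map(_.toDouble)
    )
}
```

Calling CsvSketch.toMaps(lines).drop(1).take(2).map(CsvSketch.toListing) on a header line plus three data rows would skip the first data row and convert the next two, which is the same windowing that drop(offset)/take(limit) applies to the stream.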

A Kafka producer using Alpakka Slick

The PostgresPlain producer, which is nearly identical to the one described in the previous blog post, uses Alpakka Slick, which allows SQL queries against a PostgreSQL database to be coded in Slick’s functional programming style.

The partial code below shows how method Slick.source() takes a streaming query and returns a stream source of PropertyListing objects.

    val source: Source[PropertyListing, NotUsed] =
      Slick
        .source(TableQuery[PropertyListings].sortBy(_.propertyId).drop(offset).take(limit).result)

    // ...

    source
      .map{ property =>
        val prodRec = new ProducerRecord[String, String](
          topic, property.propertyId.toString, property.toJson.compactPrint
        )
        println(s"[POSTGRES] >>> Producer msg: $prodRec")
        prodRec
      }
      .runWith(Producer.plainSink(producerSettings))

The high-level code logic in PostgresPlain is similar to that of the CsvPlain producer.

A Kafka consumer using Alpakka Cassandra

We created a Kafka consumer in the previous blog post using Alpakka Kafka’s Consumer.plainSource[K,V] for consuming data from a given Kafka topic into a Cassandra database.

The following partial code from the slightly refactored version of the consumer, CassandraPlain, shows how data associated with a given Kafka topic can be consumed via Alpakka Kafka’s Consumer.plainSource().

  def runPropertyListing(consumerGroup: String,
                         topic: String)(implicit
                                        cassandraSession: CassandraSession,
                                        jsonFormat: JsonFormat[PropertyListing],
                                        system: ActorSystem,
                                        ec: ExecutionContext): Future[Done] = {

    val consumerConfig = system.settings.config.getConfig("akkafka.consumer.with-brokers")
    val consumerSettings =
      ConsumerSettings(consumerConfig, new StringDeserializer, new StringDeserializer)
        .withGroupId(consumerGroup)
        .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    val table = "propertydata.property_listing"
    val partitions = 10 // number of partitions

    val statementBinder: (ConsumerRecord[String, String], PreparedStatement) => BoundStatement = {
      case (msg, preparedStatement) =>
        val p = msg.value().parseJson.convertTo[PropertyListing]
        preparedStatement.bind(
          (p.propertyId % partitions).toString, Int.box(p.propertyId), p.dataSource.getOrElse("unknown"),
          Double.box(p.bathrooms.getOrElse(0)), Int.box(p.bedrooms.getOrElse(0)), Double.box(p.listPrice.getOrElse(0)), Int.box(p.livingArea.getOrElse(0)),
          p.propertyType.getOrElse(""), p.yearBuilt.getOrElse(""), p.lastUpdated.getOrElse(""), p.streetAddress.getOrElse(""), p.city.getOrElse(""), p.state.getOrElse(""), p.zip.getOrElse(""), p.country.getOrElse("")
        )
    }
    val cassandraFlow: Flow[ConsumerRecord[String, String], ConsumerRecord[String, String], NotUsed] =
      CassandraFlow.create(
        CassandraWriteSettings.defaults,
        s"""INSERT INTO $table (partition_key, property_id, data_source, bathrooms, bedrooms, list_price, living_area, property_type, year_built, last_updated, street_address, city, state, zip, country)
           |VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""".stripMargin,
        statementBinder
      )

    val control: DrainingControl[Done] =
      Consumer
        .plainSource(consumerSettings, Subscriptions.topics(topic))
        .via(cassandraFlow)
        .toMat(Sink.ignore)(DrainingControl.apply)
        .run()

    Thread.sleep(2000)
    control.drainAndShutdown()
  }

Alpakka’s CassandraFlow.create() is the stream processing operator responsible for funneling data into the Cassandra database. Note that it takes a CQL PreparedStatement along with a “statement binder” that binds the incoming class variables to the corresponding Cassandra table columns before executing the CQL.
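
The first value bound by the statement binder is a partition key derived from the property ID. As a small illustration in plain Scala, the sketch below reproduces that expression and shows which bucket an ID lands in. The safePartitionKey variant is an assumption added here for illustration and is not part of the application: it uses Math.floorMod so that a negative ID, should one ever appear, still maps to a bucket between 0 and 9.

```scala
object PartitionKeySketch {
  val partitions = 10 // number of partitions, as in the consumer code

  // Same expression as in the statement binder
  def partitionKey(propertyId: Int): String =
    (propertyId % partitions).toString

  // Hypothetical variant: floorMod never returns a negative remainder
  def safePartitionKey(propertyId: Int): String =
    Math.floorMod(propertyId, partitions).toString
}
```

For example, PartitionKeySketch.partitionKey(1234567) yields "7", whereas partitionKey(-13) yields "-3" and safePartitionKey(-13) yields "7".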

Enhancing the Kafka consumer for ‘at-least-once’ consumption

To enable at-least-once consumption by Cassandra, instead of Consumer.plainSource[K,V], we construct the stream graph via Alpakka Kafka’s Consumer.committableSource[K,V], which offers programmatic tracking of the commit offset positions. By keeping the commit offsets as an integral part of the streaming data, failed streams can be re-run from the last committed offset positions.

The main stream composition code of the enhanced consumer, CassandraCommittable.scala, is shown below.

  def runPropertyListing(consumerGroup: String,
                         topic: String)(implicit
                                        cassandraSession: CassandraSession,
                                        jsonFormat: JsonFormat[PropertyListing],
                                        system: ActorSystem,
                                        ec: ExecutionContext): Future[Done] = {

    val consumerConfig = system.settings.config.getConfig("akkafka.consumer.with-brokers")
    val consumerSettings =
      ConsumerSettings(consumerConfig, new StringDeserializer, new StringDeserializer)
        .withGroupId(consumerGroup)
        .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    val committerConfig = system.settings.config.getConfig("akka.kafka.committer")
    val committerSettings = CommitterSettings(committerConfig)

    val table = "propertydata.property_listing"
    val partitions = 10 // number of partitions

    val statementBinder: (CommittableMessage[String, String], PreparedStatement) => BoundStatement = {
      case (msg, preparedStatement) =>
        val p = msg.record.value().parseJson.convertTo[PropertyListing]
        preparedStatement.bind(
          (p.propertyId % partitions).toString, Int.box(p.propertyId), p.dataSource.getOrElse("unknown"),
          Double.box(p.bathrooms.getOrElse(0)), Int.box(p.bedrooms.getOrElse(0)), Double.box(p.listPrice.getOrElse(0)), Int.box(p.livingArea.getOrElse(0)),
          p.propertyType.getOrElse(""), p.yearBuilt.getOrElse(""), p.lastUpdated.getOrElse(""), p.streetAddress.getOrElse(""), p.city.getOrElse(""), p.state.getOrElse(""), p.zip.getOrElse(""), p.country.getOrElse("")
        )
    }
    val cassandraFlow: Flow[CommittableMessage[String, String], CommittableMessage[String, String], NotUsed] =
      CassandraFlow.create(
        CassandraWriteSettings.defaults,
        s"""INSERT INTO $table (partition_key, property_id, data_source, bathrooms, bedrooms, list_price, living_area, property_type, year_built, last_updated, street_address, city, state, zip, country)
           |VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""".stripMargin,
        statementBinder
      )

    val control =
      Consumer
        .committableSource(consumerSettings, Subscriptions.topics(topic))
        .via(cassandraFlow)
        .map(_.committableOffset)
        .toMat(Committer.sink(committerSettings))(DrainingControl.apply)
        .run()

    Thread.sleep(2000)
    control.drainAndShutdown()
  }

A couple of notes:

In order to be able to programmatically keep track of the commit offset positions, each of the stream elements emitted from Consumer.committableSource[K,V] is wrapped in a CommittableMessage[K,V] object, consisting of the CommittableOffset value in addition to the Kafka ConsumerRecord[K,V].
Committing the offset should be done after the stream data is processed for at-least-once consumption, whereas committing prior to processing the stream data would only achieve at-most-once delivery.
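
The difference between the two commit orderings can be illustrated with a tiny simulation in plain Scala. This is only a sketch under simplifying assumptions (a single partition, an in-memory log, and a crash that always happens between the two steps for one chosen offset); DeliverySketch and its methods are hypothetical names and do not use the Alpakka Kafka API.

```scala
object DeliverySketch {
  // Consumes `log` starting at `startOffset`. If `crashAt` is set, the run dies
  // between the two steps (commit and process) for that offset.
  // Returns the records processed in this run and the committed position.
  def run(log: Vector[String], startOffset: Int, crashAt: Option[Int],
          commitFirst: Boolean): (Vector[String], Int) = {
    var committed = startOffset
    var processed = Vector.empty[String]
    var offset = startOffset
    var crashed = false
    while (offset < log.length && !crashed) {
      val crashNow = crashAt.contains(offset)
      if (commitFirst) {
        committed = offset + 1        // commit first, then process
        if (crashNow) crashed = true else processed :+= log(offset)
      } else {
        processed :+= log(offset)     // process first, then commit
        if (crashNow) crashed = true else committed = offset + 1
      }
      offset += 1
    }
    (processed, committed)
  }

  // One crashed run followed by a restart from the committed position
  def crashThenRestart(log: Vector[String], crashAt: Int, commitFirst: Boolean): Vector[String] = {
    val (firstRun, committed) = run(log, 0, Some(crashAt), commitFirst)
    val (secondRun, _) = run(log, committed, None, commitFirst)
    firstRun ++ secondRun
  }
}
```

With the log Vector("a", "b", "c") and a crash at offset 1, committing after processing yields a, b, b, c (a duplicate but no loss, i.e. at-least-once), while committing before processing yields a, c (the record "b" is lost, i.e. at-most-once).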

Adding a property-rating pipeline to the Alpakka Kafka consumer

Next, we add a data processing pipeline to the consumer to perform a number of ratings of the individual property listings in the stream before delivering the rated property listing data to the Cassandra database, as illustrated in the following diagram.

Alpakka Kafka - Streaming ETL w/ custom pipelines

Since the CassandraFlow.create() stream operator will be executed after the rating pipeline, the corresponding “statement binder” necessary for class-variable/table-column binding now also needs to encapsulate PropertyRating along with CommittableMessage[K,V], as shown in the partial code of CassandraCommittableWithRatings.scala below.

  def runPropertyListing(consumerGroup: String,
                         topic: String)(implicit
                                         cassandraSession: CassandraSession,
                                         jsonFormat: JsonFormat[PropertyListing],
                                         system: ActorSystem,
                                         ec: ExecutionContext): Future[Done] = {

    val consumerConfig = system.settings.config.getConfig("akkafka.consumer.with-brokers")
    val consumerSettings =
      ConsumerSettings(consumerConfig, new StringDeserializer, new StringDeserializer)
        .withGroupId(consumerGroup)
        .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    val committerConfig = system.settings.config.getConfig("akka.kafka.committer")
    val committerSettings = CommitterSettings(committerConfig)

    val table = "propertydata.rated_property_listing"
    val partitions = 10 // number of partitions

    val statementBinder: ((PropertyRating, CommittableMessage[String, String]), PreparedStatement) => BoundStatement = {
      case ((rating, msg), preparedStatement) =>
        val p = msg.record.value().parseJson.convertTo[PropertyListing]
        preparedStatement.bind(
          (p.propertyId % partitions).toString, Int.box(p.propertyId), p.dataSource.getOrElse("unknown"),
          Double.box(p.bathrooms.getOrElse(0)), Int.box(p.bedrooms.getOrElse(0)), Double.box(p.listPrice.getOrElse(0)), Int.box(p.livingArea.getOrElse(0)),
          p.propertyType.getOrElse(""), p.yearBuilt.getOrElse(""), p.lastUpdated.getOrElse(""), p.streetAddress.getOrElse(""), p.city.getOrElse(""), p.state.getOrElse(""), p.zip.getOrElse(""), p.country.getOrElse(""),
          rating.affordability.getOrElse(0), rating.comfort.getOrElse(0), rating.neighborhood.getOrElse(0), rating.schools.getOrElse(0)
        )
    }
    val cassandraFlow: Flow[(PropertyRating, CommittableMessage[String, String]), (PropertyRating, CommittableMessage[String, String]), NotUsed] =
      CassandraFlow.create(
        CassandraWriteSettings.defaults,
        s"""INSERT INTO $table (partition_key, property_id, data_source, bathrooms, bedrooms, list_price, living_area, property_type, year_built, last_updated, street_address, city, state, zip, country, rating_affordability, rating_comfort, rating_neighborhood, rating_schools)
           |VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""".stripMargin,
        statementBinder
      )

    val control =
      Consumer
        .committableSource(consumerSettings, Subscriptions.topics(topic))
        .via(PropertyRating.compute())
        .via(cassandraFlow)
        .map { case (_, msg) => msg.committableOffset }
        .toMat(Committer.sink(committerSettings))(DrainingControl.apply)
        .run()

    Thread.sleep(2000)
    control.drainAndShutdown()
  }

For demonstration purposes, we create a dummy pipeline that rates individual real estate properties in areas such as affordability and neighborhood, with each rating returned as a Future of a random integer between 1 and 5 after a random time delay. The rating-related fields along with the computation logic are wrapped in class PropertyRating as shown below.

case class PropertyRating(
    propertyId: Int,
    affordability: Option[Int],
    comfort: Option[Int],
    neighborhood: Option[Int],
    schools: Option[Int]
  )

object PropertyRating {
  def rand = java.util.concurrent.ThreadLocalRandom.current

  def biasedRandNum(l: Int, u: Int, biasedNums: Set[Int], biasedFactor: Int = 1): Int = {
    Vector
      .iterate(rand.nextInt(l, u+1), biasedFactor)(_ => rand.nextInt(l, u+1))
      .dropWhile(!biasedNums.contains(_))
      .headOption match {
        case Some(n) => n
        case None => rand.nextInt(l, u+1)
      }
  }

  def fakeRating()(implicit ec: ExecutionContext): Future[Int] = Future{  // Fake rating computation
    Thread.sleep(biasedRandNum(1, 9, Set(3, 4, 5)))  // Sleep 1-9 ms (Thread.sleep takes milliseconds)
    biasedRandNum(1, 5, Set(2, 3, 4))  // Rating 1-5; mostly 2-4
  }

  def compute()(implicit ec: ExecutionContext): Flow[CommittableMessage[String, String], (PropertyRating, CommittableMessage[String, String]), NotUsed] =
    Flow[CommittableMessage[String, String]].mapAsync(1){ msg =>
      val propertyId = msg.record.key().toInt  // let it crash in case of bad PK data
      ( for {
            affordability <- PropertyRating.fakeRating()
            comfort <- PropertyRating.fakeRating()
            neighborhood <- PropertyRating.fakeRating()
            schools <- PropertyRating.fakeRating()
          }
          yield new PropertyRating(propertyId, Option(affordability), Option(comfort), Option(neighborhood), Option(schools))
        )
        .map(rating => (rating, msg)).recover{ case e => throw new Exception("ERROR in computeRatingFlow()!") }
    }
}
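
One thing worth noting about compute(): because each fakeRating() call is made inside the for-comprehension, the four futures are created one after another, so the ratings are computed sequentially. If the ratings are independent and latency matters, the futures could be started before the for-comprehension so that they run concurrently. The following is a minimal sketch of the difference using only scala.concurrent; slowRating() and Ratings are hypothetical stand-ins for fakeRating() and PropertyRating.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ParallelRatingsSketch {
  final case class Ratings(affordability: Int, comfort: Int, neighborhood: Int, schools: Int)

  // Hypothetical stand-in for fakeRating(): a fixed value after a short delay
  def slowRating(value: Int)(implicit ec: ExecutionContext): Future[Int] =
    Future { Thread.sleep(50); value }

  // Each future is created only after the previous one completes
  def sequential()(implicit ec: ExecutionContext): Future[Ratings] =
    for {
      a <- slowRating(1)
      c <- slowRating(2)
      n <- slowRating(3)
      s <- slowRating(4)
    } yield Ratings(a, c, n, s)

  // All four futures are started up front, then merely combined
  def concurrent()(implicit ec: ExecutionContext): Future[Ratings] = {
    val fa = slowRating(1)
    val fc = slowRating(2)
    val fn = slowRating(3)
    val fs = slowRating(4)
    for { a <- fa; c <- fc; n <- fn; s <- fs } yield Ratings(a, c, n, s)
  }
}
```

Both variants produce the same Ratings value; only the elapsed time differs, and how much depends on the threads available in the execution context.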

A Kafka consumer with a custom flow & stream destination

The application is also bundled with a consumer that runs the property rating pipeline followed by a custom flow, to showcase how one can compose an arbitrary side-effecting operator with a custom stream destination.

  def customBusinessLogic(key: String, value: String, rating: PropertyRating)(
    implicit ec: ExecutionContext): Future[Done] = Future {

    println(s"KEY: $key  VALUE: $value  RATING: $rating")
    // Perform custom business logic with key/value
    // and save to an external storage, etc.
    Done
  }

  def runPropertyListing(consumerGroup: String,
                         topic: String)(implicit
                                         jsonFormat: JsonFormat[PropertyListing],
                                         system: ActorSystem,
                                         ec: ExecutionContext): Future[Done] = {

    val consumerConfig = system.settings.config.getConfig("akkafka.consumer.with-brokers")
    val consumerSettings =
      ConsumerSettings(consumerConfig, new StringDeserializer, new StringDeserializer)
        .withGroupId(consumerGroup)
        .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    val committerConfig = system.settings.config.getConfig("akka.kafka.committer")
    val committerSettings = CommitterSettings(committerConfig)

    val control =
      Consumer
        .committableSource(consumerSettings, Subscriptions.topics(topic))
        .via(PropertyRating.compute())
        .mapAsync(1) { case (rating, msg) =>
          customBusinessLogic(msg.record.key, msg.record.value, rating)
            .map(_ => msg.committableOffset)
        }
        .toMat(Committer.sink(committerSettings))(DrainingControl.apply)
        .run()

    Thread.sleep(5000)
    control.drainAndShutdown()
  }

Note that mapAsync is used to allow the stream transformation by the custom business logic to be carried out asynchronously.
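
As a standalone illustration of mapAsync, the sketch below runs a handful of elements through an asynchronous step with a parallelism of 2. It assumes Akka 2.6 or later on the classpath (so that the implicit ActorSystem provides the materializer); the object name, element values and delays are made up for the example. Even though later elements finish sooner here, mapAsync emits results in the order of the upstream elements, which is what keeps offsets flowing to the committer in order.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object MapAsyncSketch {
  def run(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("map-async-sketch")
    import system.dispatcher // execution context for the Futures below
    try {
      val result: Future[Seq[String]] =
        Source(1 to 5)
          .mapAsync(parallelism = 2) { n =>
            // Earlier elements sleep longer, yet output order is preserved
            Future { Thread.sleep((6 - n) * 10L); s"listing-$n" }
          }
          .runWith(Sink.seq)
      Await.result(result, 5.seconds)
    } finally {
      system.terminate()
    }
  }
}
```

If completion order does not matter, mapAsyncUnordered is the alternative, though it would not be a safe choice ahead of an offset committer.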

Running the streaming ETL/pipelining system

To run the application that comes with sample real estate property listing data on a computer, go to the GitHub repo and follow the README instructions to launch the producers and consumers on one or more command-line terminals.

Also included in the README are instructions on how to run a couple of set queries to verify the data that gets ETL-ed to the Cassandra tables via Alpakka Cassandra’s CassandraSource, which takes a CQL query as its argument.

Further enhancements

Depending on specific business requirements, the streaming ETL system can be further enhanced in a number of areas.

This streaming ETL system offers at-least-once delivery only in stream consumption. If an end-to-end version is necessary, one could enhance the producers by using Producer.committableSink() or Producer.flexiFlow() instead of Producer.plainSink().
Exactly-once delivery is generally a much more stringent requirement. One approach to achieving it would be to atomically persist the in-flight data along with the corresponding commit offset positions using a reliable storage system.
If tracking of Kafka’s topic partition assignment is required, one can use Consumer.committablePartitionedSource[K,V] instead of Consumer.committableSource[K,V]. More details can be found in the tech doc.
To gracefully restart a stream on failure with a configurable backoff, Akka Stream provides the method RestartSource.onFailuresWithBackoff, as illustrated in an example in this tech doc.
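
To illustrate the idea behind the exactly-once approach mentioned above, here is a minimal sketch in plain Scala of a store in which the data and the last applied offset always change together, so that replayed records are recognized and skipped. It assumes a single partition and an in-memory store; Store, applyRecord and replay are hypothetical names, and a real system would need a transactional or otherwise atomic write in the external storage, with offsets tracked per partition.

```scala
object ExactlyOnceSketch {
  // Hypothetical in-memory store: rows and the last applied offset change together
  final case class Store(rows: Vector[String], lastOffset: Long)

  // Apply a record only if its offset is beyond what the store has absorbed
  def applyRecord(store: Store, offset: Long, value: String): Store =
    if (offset <= store.lastOffset) store   // replayed record: already persisted, skip
    else Store(store.rows :+ value, offset) // data and offset updated in one step

  // Feed a batch of (offset, value) records into the store
  def replay(store: Store, records: Seq[(Long, String)]): Store =
    records.foldLeft(store) { case (s, (offset, value)) => applyRecord(s, offset, value) }
}
```

Replaying offsets 0 and 1, and then, after a simulated restart, offsets 1 and 2, leaves the store with each of the three records exactly once and a last applied offset of 2.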

Streaming ETL With Alpakka Kafka

In a previous startup I cofounded, our core product was a geospatial application that provided algorithmic ratings of individual residential real estate properties for home buyers. Given that there were over 100 million residential properties nationwide, the collective data volume of all the associated attributes necessary for the data engineering work was massive.

For the initial MVP (minimum viable product) releases in which we only needed to showcase our product features in a selected metropolitan area, we used PostgreSQL as the OLTP (online transaction processing) database. Leveraging Postgres’ table partitioning feature, we had an OLTP database capable of accommodating incremental geographical expansion into multiple cities and states.

Batch ETL

The need for a big data warehouse wasn’t imminent in the beginning, though we had to make sure a data processing platform for a highly scalable data warehouse, along with efficient ETL (extract/transform/load) functions, would be ready on short notice. The main objective was to make sure the OLTP database could be kept at a minimal volume while less frequently used data got “archived” off to a big data warehouse for data analytics.

With limited engineering resources available in a small startup, I kicked off an R&D project on the side to build programmatic ETL processes to periodically funnel data from PostgreSQL to a big data warehouse in a batch manner. Cassandra was chosen to be the data warehouse and was configured on an Amazon EC2 cluster. The project was finished with a batch ETL solution that functionally worked as intended, although in the back of my mind a more “continuous” operational model would have been preferred.

Real-time Streaming ETL

Fast-forward to 2021, I recently took on a big data streaming project that involves ETL and building data pipelines on a distributed platform. Central to the project requirement is real-time (or more precisely, near real-time) processing of high-volume data. Another aspect of the requirement is that the streaming system has to accommodate custom data pipelines as composable components of the consumers, suggesting that a streaming ETL solution would be more suitable than a batch one. Lastly, stream consumption needs to guarantee at-least-once delivery.

Given all that, Apache Kafka promptly stood out as a top candidate to serve as the distributed streaming brokers. In particular, its capability of keeping durable data in a distributed fault-tolerant cluster allows it to serve different consumers at various instances of time and locales. Next, Akka Stream was added to the tech stack for its versatile stream-based application integration functionality as well as benefits of reactive streams.

Alpakka – a reactive stream API and DSL

Built on top of Akka Stream, Alpakka provides a comprehensive API and DSL (domain specific language) for reactive and stream-oriented programming to address the application integration needs for interoperating with a wide range of prominent systems across various computing domains. That, coupled with the underlying Akka Stream’s versatile streaming functions, makes Alpakka a powerful toolkit for what is needed.

In this blog post, we’ll assemble in Scala a producer and a consumer using the Alpakka API to perform streaming ETL from a PostgreSQL database through Kafka brokers into a Cassandra data warehouse. In a subsequent post, we’ll enhance and package up these snippets to address the requirement of at-least-once delivery in consumption and composability of data pipelines.

Streaming ETL with Alpakka Kafka, Slick, Cassandra, …

The following diagram shows the near real-time ETL functional flow of data streaming from various kinds of data sources (e.g. a PostgreSQL database or a CSV file) to data destinations (e.g. a Cassandra data warehouse or a custom data stream outlet).

The Apache Kafka brokers provide a distributed publish-subscribe platform for keeping in-flight data in durable immutable logs readily available for consumption. Meanwhile, the Akka Stream based Alpakka API that comes with a DSL allows programmatic integrations to compose data pipelines as sources, sinks and flows, in addition to enabling “reactivity” by equipping the streams with non-blocking backpressure.

It should be noted that the same stream can be processed using various data sources and destinations simultaneously. For instance, data with the same schema from both the CSV file and Postgres database could be published to the same topic and consumed by a consumer group designated for the Cassandra database and another consumer group for a different data storage.
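To make the consumer-group idea concrete, here is a toy in-memory model of a topic. It is not Kafka and uses no Kafka API; the class TopicLog and the group names are made up purely for illustration. It only captures the one property that matters here: records from any producer land in a single append-only log, and each consumer group advances its own read position independently.

```scala
// Toy in-memory model of a topic (NOT Kafka): one append-only log,
// plus an independent read offset per consumer group.
final class TopicLog[A] {
  private var records = Vector.empty[A]
  private var offsets = Map.empty[String, Int]

  // Append a record to the end of the log.
  def publish(record: A): Unit =
    records = records :+ record

  // Return the records this group has not seen yet and
  // advance only that group's offset.
  def poll(group: String): Vector[A] = {
    val from = offsets.getOrElse(group, 0)
    offsets = offsets.updated(group, records.size)
    records.drop(from)
  }
}

object TopicLogDemo {
  def main(args: Array[String]): Unit = {
    val topic = new TopicLog[String]
    topic.publish("csv:1001")       // e.g. published by a CSV-based producer
    topic.publish("postgres:1002")  // e.g. published by a Postgres-based producer

    println(topic.poll("cassandra-group"))      // both records
    topic.publish("postgres:1003")
    println(topic.poll("cassandra-group"))      // only the new record
    println(topic.poll("other-storage-group"))  // all three records
  }
}
```

In the real system the log is partitioned, replicated and durable, and offsets are tracked by the brokers per consumer group, but the consumption semantics follow the same pattern.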

Example: ETL of real estate property listing data

The platform will be for general-purpose ETL/pipelining. For illustration purposes in this blog post, we’re going to use it to perform streaming ETL of a simplified dataset of residential real estate property listings.

First, we create a simple class to represent a property listing.

package alpakkafka

case class PropertyListing(
    propertyId: Int,
    dataSource: Option[String],
    bathrooms: Option[Double],
    bedrooms: Option[Int],
    listPrice: Option[Double],
    livingArea: Option[Int],
    propertyType: Option[String],
    yearBuilt: Option[String],
    lastUpdated: Option[String],
    streetAddress: Option[String],
    city: Option[String],
    state: Option[String],
    zip: Option[String],
    country: Option[String]
  ) {
    def summary(): String = {
      val ba = bathrooms.getOrElse(0)
      val br = bedrooms.getOrElse(0)
      val price = listPrice.getOrElse(0)
      val area = livingArea.getOrElse(0)
      val street = streetAddress.getOrElse("")
      val cit = city.getOrElse("")
      val sta = state.getOrElse("")
      s"PropertyID: $propertyId | Price: $$$price ${br}BR/${ba}BA/${area}sqft | Address: $street, $cit, $sta"
    }
  }

Using the good old sbt as the build tool, relevant library dependencies for Akka Stream, Alpakka Kafka, Postgres/Slick and Cassandra/DataStax are included in build.sbt.

name := "alpakka-streaming-etl"

version := "0.1"

scalaVersion := "2.13.6"

scalacOptions += "-deprecation"

val akkaVersion = "2.6.16"

libraryDependencies ++= Seq(
  "com.typesafe.akka" %% "akka-actor" % akkaVersion,
  "com.typesafe.akka" %% "akka-slf4j" % akkaVersion,
  "com.typesafe.akka" %% "akka-stream" % akkaVersion,
  "com.typesafe.akka" %% "akka-stream-kafka" % "2.1.1",
  "com.lightbend.akka" %% "akka-stream-alpakka-slick" % "3.0.3",
  "org.postgresql" % "postgresql" % "42.2.24",
  "com.lightbend.akka" %% "akka-stream-alpakka-csv" % "3.0.3",
  "com.lightbend.akka" %% "akka-stream-alpakka-cassandra" % "3.0.3",
  "com.datastax.oss" % "java-driver-core" % "4.13.0",
  "org.apache.tinkerpop" % "tinkergraph-gremlin" % "3.5.1",
  "io.spray" %%  "spray-json" % "1.3.6",
  "ch.qos.logback" % "logback-classic" % "1.2.4" % Runtime
)

trapExit := false

Next, we put configurations for Akka Actor, Alpakka Kafka, Slick and Cassandra in application.conf under src/main/resources/:

akka {
  loggers = ["akka.event.slf4j.Slf4jLogger"]
  logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
}

akka.actor.allow-java-serialization = on

slick-postgres {
  profile = "slick.jdbc.PostgresProfile$"
  db {
    dataSourceClass = "slick.jdbc.DriverDataSource"
    properties = {
      driver = "org.postgresql.Driver"
      url = "jdbc:postgresql://localhost/propertydb"
      user = "pipeliner"
      password = "pa$$word"
    }
  }
}

akka.kafka.producer {
  discovery-method = akka.discovery
  service-name = ""
  resolve-timeout = 3 seconds
  parallelism = 10000
  close-timeout = 60s
  close-on-producer-stop = true
  use-dispatcher = "akka.kafka.default-dispatcher"
  eos-commit-interval = 100ms
  kafka-clients {
  }
}

akkafka.producer.with-brokers: ${akka.kafka.producer} {
  kafka-clients {
    bootstrap.servers = "127.0.0.1:9092"
  }
}

akka.kafka.consumer {
  poll-interval = 250ms  # 50ms
  poll-timeout = 50ms
  stop-timeout = 10s     # 30s
  close-timeout = 10s    # 20s
  commit-timeout = 10s   # 15s
  wakeup-timeout = 10s
  eos-draining-check-interval = 30ms
  use-dispatcher = "akka.kafka.default-dispatcher"
  kafka-clients {
    enable.auto.commit = true  # `true` for `Consumer.plainSource`
  }
}

akkafka.consumer.with-brokers: ${akka.kafka.consumer} {
  kafka-clients {
    bootstrap.servers = "127.0.0.1:9092"
  }
}

akka.kafka.committer {
  max-batch = 1000
  max-interval = 10s
  parallelism = 100
  delivery = WaitForAck
  when = OffsetFirstObserved
}

alpakka.cassandra {
  session-provider = "akka.stream.alpakka.cassandra.DefaultSessionProvider"
  service-discovery {
    name = ""
    lookup-timeout = 1 s
  }
  session-dispatcher = "akka.actor.default-dispatcher"
  datastax-java-driver-config = "datastax-java-driver"
}

datastax-java-driver {
  basic {
    contact-points = ["127.0.0.1:9042"]
    load-balancing-policy.local-datacenter = datacenter1
  }
  advanced.reconnect-on-init = true
}

Note that the sample configuration is for running the application with Kafka, PostgreSQL and Cassandra all on a single computer. If they run on separate hosts, the loopback address (i.e. 127.0.0.1) should be replaced with the corresponding host IPs/names. For example, relevant configurations for Kafka brokers and a Cassandra database spanning multiple hosts might look something like bootstrap.servers = "10.1.0.1:9092,10.1.0.2:9092,10.1.0.3:9092" and contact-points = ["10.2.0.1:9042","10.2.0.2:9042"].
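If the host lists come from deployment tooling rather than being typed by hand, the two value formats can be rendered from plain lists of hosts. The helper below is a hypothetical sketch: HostLists and its method names are not part of the project, and the default ports are simply the ones used in the sample configuration.

```scala
// Hypothetical helper (not part of the project): renders host lists into
// the string forms expected by the two settings discussed above.
object HostLists {
  // Kafka's bootstrap.servers: a single comma-separated string of host:port pairs.
  def bootstrapServers(hosts: Seq[String], port: Int = 9092): String =
    hosts.map(h => s"$h:$port").mkString(",")

  // Cassandra's contact-points: a HOCON list of quoted host:port strings.
  def contactPoints(hosts: Seq[String], port: Int = 9042): String =
    hosts.map(h => "\"" + h + ":" + port + "\"").mkString("[", ",", "]")
}

object HostListsDemo {
  def main(args: Array[String]): Unit = {
    println(HostLists.bootstrapServers(Seq("10.1.0.1", "10.1.0.2", "10.1.0.3")))
    println(HostLists.contactPoints(Seq("10.2.0.1", "10.2.0.2")))
  }
}
```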

PostgresProducerPlain – an Alpakka Kafka producer

The PostgresProducerPlain snippet below creates a Kafka producer using Alpakka Slick, which allows SQL queries to be coded in Slick’s functional programming style.

package alpakkafka

import akka.actor.ActorSystem
import akka.stream.scaladsl._
import akka.{Done, NotUsed}

import akka.stream.alpakka.slick.scaladsl._

import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

import spray.json._
import spray.json.DefaultJsonProtocol._

import scala.concurrent.{ExecutionContext, Future}
import scala.util.Try

object PostgresProducerPlain {

  def run(topic: String,
          offset: Int = 0,
          limit: Int = Int.MaxValue)(implicit
                                     slickSession: SlickSession,
                                     jsonFormat: JsonFormat[PropertyListing],
                                     system: ActorSystem,
                                     ec: ExecutionContext): Future[Done] = {

    import slickSession.profile.api._

    class PropertyListings(tag: Tag) extends Table[PropertyListing](tag, "property_listing") {
      def propertyId = column[Int]("property_id", O.PrimaryKey)
      def dataSource = column[Option[String]]("data_source")
      def bathrooms = column[Option[Double]]("bathrooms")
      def bedrooms = column[Option[Int]]("bedrooms")
      def listPrice = column[Option[Double]]("list_price")
      def livingArea = column[Option[Int]]("living_area")
      def propertyType = column[Option[String]]("property_type")
      def yearBuilt = column[Option[String]]("year_built")
      def lastUpdated = column[Option[String]]("last_updated")
      def streetAddress = column[Option[String]]("street_address")
      def city = column[Option[String]]("city")
      def state = column[Option[String]]("state")
      def zip = column[Option[String]]("zip")
      def country = column[Option[String]]("country")
      def * =
        (propertyId, dataSource, bathrooms, bedrooms, listPrice, livingArea, propertyType, yearBuilt, lastUpdated, streetAddress, city, state, zip, country) <> (PropertyListing.tupled, PropertyListing.unapply)
    }

    val source: Source[PropertyListing, NotUsed] =
      Slick
        .source(TableQuery[PropertyListings].sortBy(_.propertyId).drop(offset).take(limit).result)

    val producerConfig = system.settings.config.getConfig("akkafka.producer.with-brokers")
    val producerSettings =
      ProducerSettings(producerConfig, new StringSerializer, new StringSerializer)

    source
      .map{ property =>
        val prodRec = new ProducerRecord[String, String](
            topic, property.propertyId.toString, property.toJson.compactPrint
          )
        println(s"[POSTGRES] >>> Producer msg: $prodRec")
        prodRec
      }
      .runWith(Producer.plainSink(producerSettings))
  }

  def main(args: Array[String]): Unit = {
    implicit val system = ActorSystem()
    implicit val ec = system.dispatcher

    implicit val propertyListingJsonFormat: JsonFormat[PropertyListing] = jsonFormat14(PropertyListing)
    implicit val slickSession = SlickSession.forConfig("slick-postgres")

    val topic = "property-listing-topic"
    val offset: Int = if (args.length >= 1) Try(args(0).toInt).getOrElse(0) else 0
    val limit: Int = if (args.length >= 2) Try(args(1).toInt).getOrElse(Int.MaxValue) else Int.MaxValue

    run(topic, offset, limit) onComplete{ _ =>
      slickSession.close()
      system.terminate()
    }
  }
}

Method Slick.source[T]() takes a streaming query and returns a Source[T, NotUsed]. In this case, T is PropertyListing. Note that Slick.source() can also take a plain SQL statement wrapped within sql"..." as its argument, if wanted (in which case an implicit value of slick.jdbc.GetResult should be defined).

A subsequent map wraps each of the property listing objects in a ProducerRecord[K,V] with topic and key/value of type String/JSON, before publishing to the Kafka topic via Alpakka Kafka’s Producer.plainSink[K,V].

To run PostgresProducerPlain, simply navigate to the project root and execute the following command from within a command line terminal:

# sbt "runMain alpakkafka.PostgresProducerPlain <offset> <limit>"

# Stream the first 20 rows from property_listing in Postgres into Kafka under topic "property-listing-topic"
sbt "runMain alpakkafka.PostgresProducerPlain 0 20"

# Stream all rows from property_listing in Postgres into Kafka under topic "property-listing-topic"
sbt "runMain alpakkafka.PostgresProducerPlain"
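The optional offset and limit arguments fall back to 0 and Int.MaxValue when they are missing or not valid integers. A minimal sketch of that fallback logic, written as a standalone helper, might look like the following. ProducerArgs is a hypothetical name and is not part of the project; it follows the spirit of the parsing in main rather than reproducing it line by line.

```scala
import scala.util.Try

// Hypothetical helper (not part of the project): parses "<offset> <limit>",
// falling back to the defaults when an argument is missing or not a valid Int.
object ProducerArgs {
  private def intArg(arg: Option[String], default: Int): Int =
    arg.flatMap(a => Try(a.toInt).toOption).getOrElse(default)

  // Returns (offset, limit).
  def parse(args: Array[String]): (Int, Int) = {
    val offset = intArg(args.headOption, 0)
    val limit  = intArg(args.drop(1).headOption, Int.MaxValue)
    (offset, limit)
  }
}
```

With this helper, ProducerArgs.parse(Array("0", "20")) yields (0, 20), while an empty argument list yields (0, Int.MaxValue), i.e. all rows.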

CassandraConsumerPlain – an Alpakka Kafka consumer

Using Alpakka Kafka, CassandraConsumerPlain shows how a basic Kafka consumer can be formulated as an Akka stream that consumes data from Kafka via Consumer.plainSource, followed by a stream processing operator, Alpakka Cassandra’s CassandraFlow, to stream the data into a Cassandra database.

package alpakkafka

import akka.actor.ActorSystem
import akka.stream.scaladsl._
import akka.{Done, NotUsed}

import akka.kafka.{CommitterSettings, ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.{Committer, Consumer}
import akka.kafka.scaladsl.Consumer.DrainingControl
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer

import akka.stream.alpakka.cassandra.{CassandraSessionSettings, CassandraWriteSettings}
import akka.stream.alpakka.cassandra.scaladsl.{CassandraSession, CassandraSessionRegistry}
import akka.stream.alpakka.cassandra.scaladsl.{CassandraFlow, CassandraSource}
import com.datastax.oss.driver.api.core.cql.{BoundStatement, PreparedStatement}

import spray.json._
import spray.json.DefaultJsonProtocol._

import scala.concurrent.{ExecutionContext, Future}
import java.util.concurrent.atomic.AtomicReference
import scala.util.{Failure, Success, Try}

object CassandraConsumerPlain {

  def run(consumerGroup: String,
          topic: String)(implicit
                         cassandraSession: CassandraSession,
                         jsonFormat: JsonFormat[PropertyListing],
                         system: ActorSystem,
                         ec: ExecutionContext): Future[Done] = {

    val consumerConfig = system.settings.config.getConfig("akkafka.consumer.with-brokers")
    val consumerSettings =
      ConsumerSettings(consumerConfig, new StringDeserializer, new StringDeserializer)
        .withGroupId(consumerGroup)
        .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    val table = "propertydata.property_listing"
    val partitions = 10  // number of partitions

    val statementBinder: (ConsumerRecord[String, String], PreparedStatement) => BoundStatement = {
      case (msg, preparedStatement) =>
        val p = msg.value().parseJson.convertTo[PropertyListing]
        preparedStatement.bind(
          (p.propertyId % partitions).toString, Int.box(p.propertyId), p.dataSource.getOrElse("unknown"),
          Double.box(p.bathrooms.getOrElse(0)), Int.box(p.bedrooms.getOrElse(0)), Double.box(p.listPrice.getOrElse(0)), Int.box(p.livingArea.getOrElse(0)),
          p.propertyType.getOrElse(""), p.yearBuilt.getOrElse(""), p.lastUpdated.getOrElse(""), p.streetAddress.getOrElse(""), p.city.getOrElse(""), p.state.getOrElse(""), p.zip.getOrElse(""), p.country.getOrElse("")
        )
    }
    val cassandraFlow: Flow[ConsumerRecord[String, String], ConsumerRecord[String, String], NotUsed] =
      CassandraFlow.create(
        CassandraWriteSettings.defaults,
        s"""INSERT INTO $table (partition_key, property_id, data_source, bathrooms, bedrooms, list_price, living_area, property_type, year_built, last_updated, street_address, city, state, zip, country)
           |VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""".stripMargin,
        statementBinder
      )

    val control: DrainingControl[Done] =
      Consumer
        .plainSource(consumerSettings, Subscriptions.topics(topic))
        .via(cassandraFlow)
        .toMat(Sink.ignore)(DrainingControl.apply)
        .run()

    // Let the consumer run for 5 seconds, then drain in-flight messages and shut down
    Thread.sleep(5000)
    control.drainAndShutdown()
  }

  def query(sql: String)(implicit
                         cassandraSession: CassandraSession,
                         system: ActorSystem,
                         ec: ExecutionContext): Future[Seq[(String, PropertyListing)]] = {

    CassandraSource(sql)
      .map{ r =>
        val partitionKey = r.getString("partition_key")
        val propertyListingTuple = (
            r.getInt("property_id"),
            Option(r.getString("data_source")),
            Option(r.getDouble("bathrooms")),
            Option(r.getInt("bedrooms")),
            Option(r.getDouble("list_price")),
            Option(r.getInt("living_area")),
            Option(r.getString("property_type")),
            Option(r.getString("year_built")),
            Option(r.getString("last_updated")),
            Option(r.getString("street_address")),
            Option(r.getString("city")),
            Option(r.getString("state")),
            Option(r.getString("zip")),
            Option(r.getString("country"))
          )
        val propertyListing = PropertyListing.tupled(propertyListingTuple)
        (partitionKey, propertyListing)
      }
      .runWith(Sink.seq)
  }

  def main(args: Array[String]): Unit = {
    implicit val system = ActorSystem()
    implicit val ec = system.dispatcher

    implicit val propertyListingJsonFormat: JsonFormat[PropertyListing] = jsonFormat14(PropertyListing)
    implicit val cassandraSession: CassandraSession =
      CassandraSessionRegistry.get(system).sessionFor(CassandraSessionSettings())

    val consumerGroup = "datawarehouse-consumer-group"
    val topic = "property-listing-topic"

    run(consumerGroup, topic) onComplete(println)

    // Query for property listings within a Cassandra partition
    val sql = s"SELECT * FROM propertydata.property_listing WHERE partition_key = '0';"
    query(sql) onComplete {
      case Success(res) => res.foreach(println)
      case Failure(e) => println(s"ERROR: $e")
    }

    Thread.sleep(5000)
    system.terminate()
  }
}

A few notes:

Consumer.plainSource: As a first stab at building a consumer, we use Alpakka Kafka’s Consumer.plainSource[K,V] as the stream source. To ensure the stream is stopped in a controlled fashion, Consumer.DrainingControl is included when composing the stream graph. While it’s straightforward to use, plainSource doesn’t offer programmatic tracking of the commit offset position and thus cannot guarantee at-least-once delivery. An enhanced version of the consumer will be constructed in a subsequent blog post.
Partition key: Cassandra mandates having a partition key as part of the primary key of every table, which determines how rows are distributed across cluster nodes. In our property listing data, we use the Postgres primary key property_id modulo the number of partitions as the partition key. It could certainly be redefined to something else (e.g. locale or type of the property) in accordance with the specific business requirement.
CassandraSource: Method query() simply executes queries against a Cassandra database using CassandraSource, which takes a CQL query with syntax similar to standard SQL’s. It isn’t part of the consumer flow, but rather serves as a convenient tool for verifying the stream consumption result.
CassandraFlow: Alpakka’s CassandraFlow.create[S]() is the main processing operator responsible for streaming data into the Cassandra database. It takes write settings, a CQL statement that gets prepared into a PreparedStatement, and a “statement binder” that binds the fields of each incoming element to the corresponding Cassandra columns before executing the insert/update. In this case, S is ConsumerRecord[K,V].
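The partition key derivation mentioned in the notes can be sketched in isolation. In the snippet below, byId mirrors the modulo expression used in the statement binder, while byIdNonNegative and byState are hypothetical variants added here only to illustrate how the key could be redefined; none of the names in PartitionKeys exist in the project.

```scala
// Hypothetical helper object (not part of the project) illustrating
// a few ways a Cassandra partition key could be derived.
object PartitionKeys {
  // Same derivation as in the statement binder: property_id modulo the partition count.
  // Note that a negative id yields a negative key with the % operator.
  def byId(propertyId: Int, partitions: Int = 10): String =
    (propertyId % partitions).toString

  // Variant that always lands in 0 until partitions, even for negative ids.
  def byIdNonNegative(propertyId: Int, partitions: Int = 10): String =
    Math.floorMod(propertyId, partitions).toString

  // Assumed alternative: key by a normalized state code, with a fallback bucket.
  def byState(state: Option[String]): String =
    state.map(_.trim.toUpperCase).filter(_.nonEmpty).getOrElse("UNKNOWN")
}

object PartitionKeysDemo {
  def main(args: Array[String]): Unit = {
    // Ids 1 to 100 spread evenly over the 10 partitions "0" to "9".
    val sizes = (1 to 100).groupBy(id => PartitionKeys.byId(id)).map { case (k, ids) => k -> ids.size }
    sizes.toSeq.sortBy(_._1).foreach(println)
  }
}
```

A modulo on a sequential id gives an even spread, which is convenient for a demo. A locale-based key such as byState groups related rows together instead, at the cost of uneven partition sizes.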

To run CassandraConsumerPlain, navigate to the project root and execute the following from within a command line terminal:

# Stream any uncommitted data from Kafka topic "property-listing-topic" into Cassandra
sbt "runMain alpakkafka.CassandraConsumerPlain"

Table schema in PostgreSQL & Cassandra

Obviously, the streaming ETL application is supposed to run in the presence of one or more Kafka brokers, a PostgreSQL database and a Cassandra data warehouse. For proof of concept, getting all these systems up with basic configurations on a decent computer (Linux, macOS, etc.) is a trivial exercise. The ETL application is readily scalable in that it would require only configuration changes when, say, the need arises to scale Kafka and Cassandra out to clusters of nodes in the cloud.

Below is how the table schema of property_listing can be created in PostgreSQL via psql:

CREATE ROLE pipeliner WITH createdb login ENCRYPTED PASSWORD 'pa$$word';
CREATE DATABASE propertydb WITH OWNER 'pipeliner' ENCODING 'utf8';

CREATE TABLE property_listing (
    property_id integer PRIMARY KEY,
    bathrooms numeric,
    bedrooms integer,
    list_price double precision,
    living_area integer,
    property_type text,
    year_built text,
    data_source text,
    last_updated timestamp with time zone,
    street_address character varying(250),
    city character varying(50),
    state character varying(50),
    zip character varying(10),
    country character varying(3)  
);
ALTER TABLE property_listing OWNER TO pipeliner;

To create keyspace propertydata and the corresponding table property_listing in Cassandra, one can launch cqlsh and execute the following CQL statements:

CREATE KEYSPACE propertydata
  WITH REPLICATION = { 
   'class' : 'SimpleStrategy',  // use 'NetworkTopologyStrategy' for multi-node
   'replication_factor' : 2     // use 'datacenter1' for multi-node
  };

CREATE TABLE propertydata.property_listing (
    partition_key text,
    property_id int,
    data_source text,
    bathrooms double,
    bedrooms int,
    list_price double,
    living_area int,
    property_type text,
    year_built text,
    last_updated text,
    street_address text,
    city text,
    state text,
    zip text,
    country text,
    PRIMARY KEY ((partition_key), property_id)
);

What’s next?

So, we now have a basic streaming ETL system running Alpakka Kafka on top of a cluster of Kafka brokers to form the reactive stream “backbone” for near real-time ETL between data stores. With Alpakka Slick and Alpakka Cassandra, a relational database like PostgreSQL and a Cassandra data warehouse can be made part of the system like composable stream components.

As noted earlier, the existing Cassandra consumer does not guarantee at-least-once delivery, which is part of the requirement. In the next blog post, we’ll enhance the existing consumer to address the required delivery guarantee. We’ll also add a data processing pipeline to illustrate how to construct additional data pipelines as composable stream operators. All relevant source code along with a sample dataset will be published in a GitHub repo.

PostgreSQL Table Partitioning

With the ever-growing demand for data science work in recent years, PostgreSQL has gained considerable popularity, especially in areas where extensive geospatial/GIS (geographic information system) functionality is needed. In a previous startup venture, MySQL was initially adopted, and I went through the trouble of migrating to PostgreSQL mainly because of the sophisticated geospatial features PostGIS offers.

PostgreSQL offers a lot of goodies, although it does have a few things that I wish were done differently. Most notable to me is that while its SELECT statement supports the SQL-92 standard’s JOIN syntax, its UPDATE statement does not; joined updates have to be written in PostgreSQL’s own UPDATE ... SET ... FROM ... WHERE form instead. For instance, the following UPDATE statement would not work in PostgreSQL:

--
-- SQL-92 compliance JOIN
--
UPDATE X
INNER JOIN Y ON X.yid = Y.id
LEFT JOIN Z ON Y.zid = Z.id
SET
    X.yname = Y.name,
    X.zname = Z.name
WHERE
    X.id IN (101,102,103)
;

Partial indexing

Nevertheless, for general performance and scalability, PostgreSQL remains one of the top candidates, with a proven track record in the world of open-source RDBMS. In scaling up a PostgreSQL database, there is a wide variety of approaches. Suitable indexing is probably one of the first things to look into. Aside from planning out proper column orders in indexes that are optimal for the frequently used queries, there is another indexing feature that PostgreSQL provides for handling large datasets.

Partial indexing allows an index to be built over a subset of a table based on a conditional expression. For instance:

--
-- Partial Index
--
CREATE INDEX ix_unshipped_orders ON orders (order_num) WHERE shipped is FALSE;

In the case of a table with a large number of rows, this feature can make an otherwise gigantic index much smaller, and thus more efficient for queries against the selectively indexed data.
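To try the idea on a small scale, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. That substitution is an assumption made purely so the example runs without a database server: SQLite also supports partial indexes, but its planner and type system differ from PostgreSQL's (for instance, shipped is stored as 0/1 here). The helper names and the sample rows are made up for illustration.

```python
import sqlite3

def build_orders_db():
    """Create an in-memory orders table with a partial index on unshipped rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_num INTEGER, shipped INTEGER)")
    # 1,000 hypothetical orders; only every 100th one is still unshipped
    rows = [(n, 0 if n % 100 == 0 else 1) for n in range(1, 1001)]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    # SQLite has no boolean type, so the predicate is written as shipped = 0
    conn.execute(
        "CREATE INDEX ix_unshipped_orders ON orders (order_num) WHERE shipped = 0"
    )
    return conn

def unshipped_order(conn, order_num):
    """Look up an unshipped order; the WHERE clause repeats the index predicate."""
    cur = conn.execute(
        "SELECT order_num FROM orders WHERE shipped = 0 AND order_num = ?",
        (order_num,),
    )
    return cur.fetchall()

if __name__ == "__main__":
    conn = build_orders_db()
    print(unshipped_order(conn, 500))
    # Print the plan to check whether the partial index is picked up
    for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT order_num FROM orders WHERE shipped = 0 AND order_num = 500"
    ):
        print(row)
```

The query has to repeat the index predicate (shipped = 0 here) for the planner to consider the partial index; the same holds in PostgreSQL, where the query's WHERE clause must imply the index's condition.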

Scaling up with table partitioning

However, when a table grows to a certain volume, say, beyond a couple hundred million rows, and periodically archiving data off the table isn’t an option, performance would still be a problem even with a suitable indexing strategy. In many cases, it might be necessary to do something directly with the table structure, and table partitioning is often a good solution.

There are a few approaches to partitioning a PostgreSQL table. Among them, partitioning by means of table inheritance is perhaps the most popular approach. A master table will be created as a template that defines the table structure. This master table will be empty, whereas a number of child tables inherited from this master table will actually host the data.

The partitioning is based on a partition key, which can be a column or a combination of columns. In common use cases, the partition keys are often date-time related. For instance, a partition key could be defined in a table to partition all sales orders by month with a constraint like the following:

order_date >= '2016-12-01 00:00:00' AND order_date < '2017-01-01 00:00:00'

Other common cases include partitioning geographically, etc.
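With one partition per month, the constraint boundaries have to be produced for every month. The following is a minimal Python sketch of how such a half-open date range could be generated; the function name month_check_constraint and its column parameter are hypothetical, introduced only for this illustration.

```python
from datetime import date

def month_check_constraint(year, month, column="order_date"):
    """Return a CHECK expression covering one calendar month (half-open range)."""
    start = date(year, month, 1)
    # First day of the following month; December rolls over to next January
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    fmt = "%Y-%m-%d 00:00:00"
    return "%s >= '%s' AND %s < '%s'" % (
        column, start.strftime(fmt), column, end.strftime(fmt))

if __name__ == "__main__":
    print(month_check_constraint(2016, 12))
```

Using an inclusive lower bound and an exclusive upper bound keeps adjacent months from overlapping, so each row can satisfy at most one partition's constraint.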

A table partitioning example

When I was with a real estate startup building an application that involved over 100 million properties nationwide, each with multiple attributes of interest, table partitioning was employed to address the demanding data volume. Below is a simplified example of how the property sale transaction table was partitioned to maintain a billion rows of data.

First, create the master table which will serve as the template for the table structure.

--
-- Create master table
--
DROP TABLE IF EXISTS property_sale CASCADE;

CREATE TABLE property_sale (
    "id" bigserial primary key,
    "state" character varying(20) not null,
    "property_id" integer,
    "property_type" character varying(100),
    "sale_date" timestamp with time zone,
    "sale_price" integer
);

Next, create child tables inheriting from the master table for the individual states. For simplicity, I only set up 24 states for performance evaluation.

--
-- Create child tables (for 24 selected states)
--
CREATE TABLE property_sale_ca ( CHECK ( state = 'CA' )) INHERITS (property_sale);
CREATE TABLE property_sale_ny ( CHECK ( state = 'NY' )) INHERITS (property_sale);
CREATE TABLE property_sale_tx ( CHECK ( state = 'TX' )) INHERITS (property_sale);
CREATE TABLE property_sale_il ( CHECK ( state = 'IL' )) INHERITS (property_sale);
CREATE TABLE property_sale_wa ( CHECK ( state = 'WA' )) INHERITS (property_sale);
CREATE TABLE property_sale_fl ( CHECK ( state = 'FL' )) INHERITS (property_sale);
CREATE TABLE property_sale_va ( CHECK ( state = 'VA' )) INHERITS (property_sale);
CREATE TABLE property_sale_co ( CHECK ( state = 'CO' )) INHERITS (property_sale);
CREATE TABLE property_sale_oh ( CHECK ( state = 'OH' )) INHERITS (property_sale);
CREATE TABLE property_sale_nv ( CHECK ( state = 'NV' )) INHERITS (property_sale);
CREATE TABLE property_sale_or ( CHECK ( state = 'OR' )) INHERITS (property_sale);
CREATE TABLE property_sale_pa ( CHECK ( state = 'PA' )) INHERITS (property_sale);
CREATE TABLE property_sale_ut ( CHECK ( state = 'UT' )) INHERITS (property_sale);
CREATE TABLE property_sale_ma ( CHECK ( state = 'MA' )) INHERITS (property_sale);
CREATE TABLE property_sale_ct ( CHECK ( state = 'CT' )) INHERITS (property_sale);
CREATE TABLE property_sale_la ( CHECK ( state = 'LA' )) INHERITS (property_sale);
CREATE TABLE property_sale_wi ( CHECK ( state = 'WI' )) INHERITS (property_sale);
CREATE TABLE property_sale_wy ( CHECK ( state = 'WY' )) INHERITS (property_sale);
CREATE TABLE property_sale_nm ( CHECK ( state = 'NM' )) INHERITS (property_sale);
CREATE TABLE property_sale_nj ( CHECK ( state = 'NJ' )) INHERITS (property_sale);
CREATE TABLE property_sale_nh ( CHECK ( state = 'NH' )) INHERITS (property_sale);
CREATE TABLE property_sale_mi ( CHECK ( state = 'MI' )) INHERITS (property_sale);
CREATE TABLE property_sale_md ( CHECK ( state = 'MD' )) INHERITS (property_sale);
CREATE TABLE property_sale_dc ( CHECK ( state = 'DC' )) INHERITS (property_sale);
CREATE TABLE property_sale_err ( CHECK ( state NOT IN (
    'CA', 'NY', 'TX', 'IL', 'WA', 'FL', 'VA', 'CO', 'OH', 'NV', 'OR', 'PA', 
    'UT', 'MA', 'CT', 'LA', 'WI', 'WY', 'NM', 'NJ', 'NH', 'MI', 'MD', 'DC' ))
) INHERITS (property_sale);
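Since the child-table statements differ only in the state code, they could also be generated rather than typed out. Below is a minimal Python sketch along those lines; the function name child_table_ddl and its master parameter are hypothetical, and the output is meant to mirror the statements listed above.

```python
STATES = ['CA', 'NY', 'TX', 'IL', 'WA', 'FL', 'VA', 'CO', 'OH', 'NV', 'OR', 'PA',
          'UT', 'MA', 'CT', 'LA', 'WI', 'WY', 'NM', 'NJ', 'NH', 'MI', 'MD', 'DC']

def child_table_ddl(states, master="property_sale"):
    """Build one CREATE TABLE ... INHERITS statement per state, plus a catch-all."""
    stmts = [
        "CREATE TABLE %s_%s ( CHECK ( state = '%s' )) INHERITS (%s);"
        % (master, st.lower(), st, master)
        for st in states
    ]
    # The catch-all child accepts any state code not covered by the others
    quoted = ", ".join("'%s'" % st for st in states)
    stmts.append(
        "CREATE TABLE %s_err ( CHECK ( state NOT IN ( %s ))) INHERITS (%s);"
        % (master, quoted, master)
    )
    return stmts

if __name__ == "__main__":
    print("\n".join(child_table_ddl(STATES)))
```

Keeping the state list in one place also makes it harder for the catch-all constraint to drift out of sync with the per-state tables.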

Nothing magical so far, until a suitable trigger for propagating inserts is put in place. The trigger essentially redirects insert requests against the master table to the corresponding child tables.

--
-- Create trigger for insert
--
DROP FUNCTION IF EXISTS fn_insert_property_sale();
CREATE OR REPLACE FUNCTION fn_insert_property_sale()
RETURNS TRIGGER AS $$
BEGIN
    CASE NEW.state
        WHEN 'CA' THEN
            INSERT INTO property_sale_ca VALUES (NEW.*);
        WHEN 'NY' THEN
            INSERT INTO property_sale_ny VALUES (NEW.*);
        WHEN 'TX' THEN
            INSERT INTO property_sale_tx VALUES (NEW.*);
        WHEN 'IL' THEN
            INSERT INTO property_sale_il VALUES (NEW.*);
        WHEN 'WA' THEN
            INSERT INTO property_sale_wa VALUES (NEW.*);
        WHEN 'FL' THEN
            INSERT INTO property_sale_fl VALUES (NEW.*);
        WHEN 'VA' THEN
            INSERT INTO property_sale_va VALUES (NEW.*);
        WHEN 'CO' THEN
            INSERT INTO property_sale_co VALUES (NEW.*);
        WHEN 'OH' THEN
            INSERT INTO property_sale_oh VALUES (NEW.*);
        WHEN 'NV' THEN
            INSERT INTO property_sale_nv VALUES (NEW.*);
        WHEN 'OR' THEN
            INSERT INTO property_sale_or VALUES (NEW.*);
        WHEN 'PA' THEN
            INSERT INTO property_sale_pa VALUES (NEW.*);
        WHEN 'UT' THEN
            INSERT INTO property_sale_ut VALUES (NEW.*);
        WHEN 'MA' THEN
            INSERT INTO property_sale_ma VALUES (NEW.*);
        WHEN 'CT' THEN
            INSERT INTO property_sale_ct VALUES (NEW.*);
        WHEN 'LA' THEN
            INSERT INTO property_sale_la VALUES (NEW.*);
        WHEN 'WI' THEN
            INSERT INTO property_sale_wi VALUES (NEW.*);
        WHEN 'WY' THEN
            INSERT INTO property_sale_wy VALUES (NEW.*);
        WHEN 'NM' THEN
            INSERT INTO property_sale_nm VALUES (NEW.*);
        WHEN 'NJ' THEN
            INSERT INTO property_sale_nj VALUES (NEW.*);
        WHEN 'NH' THEN
            INSERT INTO property_sale_nh VALUES (NEW.*);
        WHEN 'MI' THEN
            INSERT INTO property_sale_mi VALUES (NEW.*);
        WHEN 'MD' THEN
            INSERT INTO property_sale_md VALUES (NEW.*);
        WHEN 'DC' THEN
            INSERT INTO property_sale_dc VALUES (NEW.*);
        ELSE
            INSERT INTO property_sale_err VALUES (NEW.*);
    END CASE;
    RETURN NULL;
END
$$
LANGUAGE plpgsql;

CREATE TRIGGER tr_insert_property_sale
    BEFORE INSERT ON property_sale
    FOR EACH ROW EXECUTE PROCEDURE fn_insert_property_sale();
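The body of the trigger function is just as repetitive, so it lends itself to generation too. Here is a minimal Python sketch that assembles the PL/pgSQL text for an arbitrary list of state codes; insert_trigger_function is a hypothetical helper name, and the function body is delimited with PostgreSQL's $$ dollar quoting.

```python
def insert_trigger_function(states, master="property_sale"):
    """Build the PL/pgSQL routing function for the given state codes."""
    lines = [
        "CREATE OR REPLACE FUNCTION fn_insert_%s()" % master,
        "RETURNS TRIGGER AS $$",
        "BEGIN",
        "    CASE NEW.state",
    ]
    for st in states:
        # One branch per state, each redirecting the row to its child table
        lines.append("        WHEN '%s' THEN" % st)
        lines.append("            INSERT INTO %s_%s VALUES (NEW.*);"
                     % (master, st.lower()))
    lines += [
        "        ELSE",
        "            INSERT INTO %s_err VALUES (NEW.*);" % master,
        "    END CASE;",
        # Returning NULL from a BEFORE trigger skips the insert into the master
        "    RETURN NULL;",
        "END",
        "$$",
        "LANGUAGE plpgsql;",
    ]
    return "\n".join(lines)

if __name__ == "__main__":
    print(insert_trigger_function(['CA', 'NY', 'TX']))
```

The returned text can be saved to a file and executed from psql in place of the hand-written function.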

Let’s test inserting data into the partitioned tables via the trigger:

--
-- Check created tables, functions, triggers
--
\dt property_sale*
\df fn_*
select tgname from pg_trigger where tgname like 'tr_%';

--
-- Test insert
--
INSERT INTO property_sale ("state", "property_id", "property_type", "sale_date", "sale_price") VALUES
('CA', 1008, 'Single Family House', '1990-05-05 09:30:00', 750000),
('NY', 2505, 'Apartment', '2002-01-30 12:00:00', 800000),
('ZZ', 9999, 'Single Family House', '2012-06-28 14:00:00', 500000),
('CA', 1008, 'Condominium', '2015-11-02 16:00:00', 1200000),
('TX', 3030, 'Condominium', '2006-04-20 11:15:00', 500000)
;

A Python program for data import

Now that the master table and its child tables are functionally in place, we’re going to populate them with large-scale data for testing. First, write a simple program using Python (or any other programming/scripting language) as follows to generate simulated data in a tab-delimited file for data import:

#!/usr/bin/python
import getopt, sys
import csv
import random
import time

def usage():
    print("Usage: %s [-h] [-r rows -f file]" % sys.argv[0])
    print(" e.g.: %s -r 20000 -f /db/data/test/pg_partitioning_infile.txt" % sys.argv[0])

def randDateTime(start, end, rand):
    format = '%Y-%m-%d %H:%M:%S'
    startTime = time.mktime(time.strptime(start, format))
    endTime = time.mktime(time.strptime(end, format))
    randTime = startTime + rand * (endTime - startTime)
    return time.strftime(format, time.localtime(randTime))

if __name__ == "__main__":

    rows = 0
    txtFile = ""

    try:
        opts, args = getopt.getopt(sys.argv[1:], "hr:f:", ["help", "rows=", "file="])
    except getopt.GetoptError as err:
        usage()
        sys.exit(2)

    for okey, oval in opts:
        if okey in ("-h", "--help"):
            usage()
            sys.exit()
        elif okey in ("-r", "--rows"):
            rows = int(oval)
        elif okey in ("-f", "--file"):
            txtFile = oval
        else:
            assert False, "unhandled option"

    print("rows = %d" % rows)

    stateList = ['CA', 'NY', 'TX', 'IL', 'WA', 'FL', 'VA', 'CO', 'OH', 'NV', 'OR', 'PA', \
                 'UT', 'MA', 'CT', 'LA', 'WI', 'WY', 'NM', 'NJ', 'NH', 'MI', 'MD', 'DC']
    propTypeList = ['Single Family', 'Condominium', 'Multi (2-4 units)', 'Duplex', 'Triplex', \
                    'Quadruplex', 'Miscellaneous', 'Mobile Home', 'Residential Vacant Land']

    with open(txtFile, 'w') as f:
        w = csv.writer(f, dialect = 'excel-tab')
        rowCount = 0

        for propId in range(1000, 200000000):
            state = random.choice(stateList)
            propType = random.choice(propTypeList)
            numSales = random.randint(1, 8)
            randPrice = random.randint(250, 2500) * 1000

            saleCount = 0
            while rowCount < rows and saleCount < numSales:
                saleDate = randDateTime("1980-01-01 00:00:00", "2017-01-01 00:00:00", random.random())
                salePrice = randPrice * (1.0 + random.randint(-20, 20) / 100.0)
                salePrice = int(salePrice / 1000.0) * 1000

                w.writerow([state, propId, propType, saleDate, salePrice])
                rowCount += 1
                saleCount += 1

            if rowCount >= rows:
                break

Run the Python program to generate up to 1 billion rows of property sale data. Given the rather huge output, make sure the generated file is on a storage device with plenty of space. Since the task will take some time to finish, it is best run in the background, perhaps along with mail notification, like the following:

#
# Run pg_partitioning_infile.py to create tab-delimited infile
#
cd /db/app/test/
nohup python pg_partitioning_infile.py -r 1000000000 -f /db/data/test/pg_partitioning_infile_1b.txt 2>&1 | mail -s "pg partitioning infile - creating 1-billion-row infile" me@mydomain.com &

Next, load data from the generated infile into the partitioned tables using psql. In case there are indexes created for the partitioned tables, it would generally be much more efficient to first drop them and recreate them after loading the data, like in the following:

#
# Drop indexes, if exist, to speed up loading
#
psql -d mydb -U dbu -h localhost
DROP INDEX IF EXISTS property_sale_ca_prop_type_id;
DROP INDEX IF EXISTS property_sale_ny_prop_type_id;
DROP INDEX IF EXISTS property_sale_tx_prop_type_id;
DROP INDEX IF EXISTS property_sale_il_prop_type_id;
DROP INDEX IF EXISTS property_sale_wa_prop_type_id;
DROP INDEX IF EXISTS property_sale_fl_prop_type_id;
DROP INDEX IF EXISTS property_sale_va_prop_type_id;
DROP INDEX IF EXISTS property_sale_co_prop_type_id;
DROP INDEX IF EXISTS property_sale_oh_prop_type_id;
DROP INDEX IF EXISTS property_sale_nv_prop_type_id;
DROP INDEX IF EXISTS property_sale_or_prop_type_id;
DROP INDEX IF EXISTS property_sale_pa_prop_type_id;
DROP INDEX IF EXISTS property_sale_ut_prop_type_id;
DROP INDEX IF EXISTS property_sale_ma_prop_type_id;
DROP INDEX IF EXISTS property_sale_ct_prop_type_id;
DROP INDEX IF EXISTS property_sale_la_prop_type_id;
DROP INDEX IF EXISTS property_sale_wi_prop_type_id;
DROP INDEX IF EXISTS property_sale_wy_prop_type_id;
DROP INDEX IF EXISTS property_sale_nm_prop_type_id;
DROP INDEX IF EXISTS property_sale_nj_prop_type_id;
DROP INDEX IF EXISTS property_sale_nh_prop_type_id;
DROP INDEX IF EXISTS property_sale_mi_prop_type_id;
DROP INDEX IF EXISTS property_sale_md_prop_type_id;
DROP INDEX IF EXISTS property_sale_dc_prop_type_id;
DROP INDEX IF EXISTS property_sale_err_prop_type_id;
\q

#
# Load data from the infile into property_sale
#
nohup psql -d mydb -U dbu -h localhost -c "\copy property_sale(state, property_id, property_type, sale_date, sale_price) from '/db/data/test/pg_partitioning_infile_1b.txt' delimiter E'\t'" 2>&1 | mail -s "pg partitioning test - loading 1 billion rows" me@mydomain.com &

psql -d mydb -U dbu -h localhost -c "select count(*) from property_sale;"
psql -d mydb -U dbu -h localhost -c "select count(*) from property_sale where state = 'NY';"

#
# Recreate indexes (run in background)
#
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_ca_prop_type_id ON property_sale_ca (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_ny_prop_type_id ON property_sale_ny (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_tx_prop_type_id ON property_sale_tx (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_il_prop_type_id ON property_sale_il (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_wa_prop_type_id ON property_sale_wa (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_fl_prop_type_id ON property_sale_fl (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_va_prop_type_id ON property_sale_va (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_co_prop_type_id ON property_sale_co (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_oh_prop_type_id ON property_sale_oh (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_nv_prop_type_id ON property_sale_nv (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_or_prop_type_id ON property_sale_or (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_pa_prop_type_id ON property_sale_pa (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_ut_prop_type_id ON property_sale_ut (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_ma_prop_type_id ON property_sale_ma (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_ct_prop_type_id ON property_sale_ct (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_la_prop_type_id ON property_sale_la (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_wi_prop_type_id ON property_sale_wi (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_wy_prop_type_id ON property_sale_wy (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_nm_prop_type_id ON property_sale_nm (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_nj_prop_type_id ON property_sale_nj (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_nh_prop_type_id ON property_sale_nh (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_mi_prop_type_id ON property_sale_mi (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_md_prop_type_id ON property_sale_md (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_dc_prop_type_id ON property_sale_dc (property_type, id);" &
nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_err_prop_type_id ON property_sale_err (property_type, id);" &

-- Drop indexes, if exist, to speed up loading

psql -d mydb -U dbu -h localhost

DROP INDEX IF EXISTS property_sale_ca_prop_type_id;

DROP INDEX IF EXISTS property_sale_ny_prop_type_id;

DROP INDEX IF EXISTS property_sale_tx_prop_type_id;

DROP INDEX IF EXISTS property_sale_il_prop_type_id;

DROP INDEX IF EXISTS property_sale_wa_prop_type_id;

DROP INDEX IF EXISTS property_sale_fl_prop_type_id;

DROP INDEX IF EXISTS property_sale_va_prop_type_id;

DROP INDEX IF EXISTS property_sale_co_prop_type_id;

DROP INDEX IF EXISTS property_sale_oh_prop_type_id;

DROP INDEX IF EXISTS property_sale_nv_prop_type_id;

DROP INDEX IF EXISTS property_sale_or_prop_type_id;

DROP INDEX IF EXISTS property_sale_pa_prop_type_id;

DROP INDEX IF EXISTS property_sale_ut_prop_type_id;

DROP INDEX IF EXISTS property_sale_ma_prop_type_id;

DROP INDEX IF EXISTS property_sale_ct_prop_type_id;

DROP INDEX IF EXISTS property_sale_la_prop_type_id;

DROP INDEX IF EXISTS property_sale_wi_prop_type_id;

DROP INDEX IF EXISTS property_sale_wy_prop_type_id;

DROP INDEX IF EXISTS property_sale_nm_prop_type_id;

DROP INDEX IF EXISTS property_sale_nj_prop_type_id;

DROP INDEX IF EXISTS property_sale_nh_prop_type_id;

DROP INDEX IF EXISTS property_sale_mi_prop_type_id;

DROP INDEX IF EXISTS property_sale_md_prop_type_id;

DROP INDEX IF EXISTS property_sale_dc_prop_type_id;

DROP INDEX IF EXISTS property_sale_err_prop_type_id;

-- Load data from the infile into property_sale

nohup psql -d mydb -U dbu -h localhost -c "\copy property_sale(state, property_id, property_type, sale_date, sale_price) from '/db/data/test/pg_partitioning_infile_1b.txt' delimiter E'\t'" 2>&1 | mail -s "pg partitioning test - loading 1 billion rows" me@mydomain.com &

psql -d mydb -U dbu -h localhost -c "select count(*) from property_sale;"

psql -d mydb -U dbu -h localhost -c "select count(*) from property_sale where state = 'NY';"

-- Recreate indexes (run in background)

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_ca_prop_type_id ON property_sale_ca (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_ny_prop_type_id ON property_sale_ny (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_tx_prop_type_id ON property_sale_tx (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_il_prop_type_id ON property_sale_il (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_wa_prop_type_id ON property_sale_wa (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_fl_prop_type_id ON property_sale_fl (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_va_prop_type_id ON property_sale_va (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_co_prop_type_id ON property_sale_co (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_oh_prop_type_id ON property_sale_oh (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_nv_prop_type_id ON property_sale_nv (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_or_prop_type_id ON property_sale_or (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_pa_prop_type_id ON property_sale_pa (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_ut_prop_type_id ON property_sale_ut (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_ma_prop_type_id ON property_sale_ma (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_ct_prop_type_id ON property_sale_ct (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_la_prop_type_id ON property_sale_la (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_wi_prop_type_id ON property_sale_wi (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_wy_prop_type_id ON property_sale_wy (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_nm_prop_type_id ON property_sale_nm (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_nj_prop_type_id ON property_sale_nj (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_nh_prop_type_id ON property_sale_nh (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_mi_prop_type_id ON property_sale_mi (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_md_prop_type_id ON property_sale_md (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_dc_prop_type_id ON property_sale_dc (property_type, id);" &

nohup psql -d mydb -U dbu -h localhost -c "CREATE INDEX property_sale_err_prop_type_id ON property_sale_err (property_type, id);" &
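Since the commands above differ only in the partition suffix, they could also be generated rather than typed out. Below is a minimal sketch in Python, assuming the same database name, user and host as above. The helper name index_commands is hypothetical and not part of the original scripts, and the suffix list shown is just a sample of those visible above.

```python
def index_commands(suffixes, db="mydb", user="dbu", host="localhost"):
    """Build one background psql command per partition suffix."""
    commands = []
    for suffix in suffixes:
        table = f"property_sale_{suffix}"
        sql = f"CREATE INDEX {table}_prop_type_id ON {table} (property_type, id);"
        commands.append(f'nohup psql -d {db} -U {user} -h {host} -c "{sql}" &')
    return commands


if __name__ == "__main__":
    # Sample suffixes only; extend the list to cover every child table.
    for command in index_commands(["ca", "ny", "tx", "err"]):
        print(command)
```

The printed lines can be reviewed first and then piped into a shell, which keeps the index names consistent across all child tables.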

Query with Constraint Exclusion

Prior to querying the tables, make sure the query optimization parameter, constraint_exclusion, is enabled.

--
-- Turn on query optimization for partitioned tables.
-- SET applies to the current session only, so run the query test below
-- in the same psql session (e.g. psql -d mydb -U dbu -h localhost).
--
SET constraint_exclusion = on;

--
-- Query test
--
SELECT count(*) FROM property_sale WHERE state = 'NY' AND property_type = 'Condominium';

With constraint exclusion enabled, the query planner examines the query constraints and skips scanning the partitioned tables that cannot match them. Unfortunately, if the constraints involve matching against non-constants such as the NOW() function, the query planner won’t have enough information at planning time to filter out unwanted partitions, and hence won’t be able to take advantage of the optimization.
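The difference between constant and non-constant constraints can be illustrated with a toy model. The Python sketch below is not how PostgreSQL implements the planner. It only mimics the decision, assuming each child table carries a CHECK constraint on a single state value. The names CHECKS and tables_to_scan are hypothetical.

```python
# Toy model: child table name -> the state value its CHECK constraint allows.
CHECKS = {
    "property_sale_ca": "CA",
    "property_sale_ny": "NY",
    "property_sale_tx": "TX",
}


def tables_to_scan(state_constant):
    """Return the child tables that cannot be ruled out at planning time.

    state_constant is a literal known while planning, or None when the value
    is only known at execution time (e.g. the result of a function call).
    """
    if state_constant is None:
        # Nothing to compare against the CHECK constraints: scan every child.
        return sorted(CHECKS)
    return sorted(t for t, state in CHECKS.items() if state == state_constant)
```

With a literal such as 'NY' only property_sale_ny remains, whereas an unknown value leaves all child tables in the plan. One common workaround is to compute the value in the client and pass it to the query as a literal.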

Final notes

With a suitable partitioning scheme applied to a big table, query performance can be improved by an order of magnitude. As illustrated in the above case, the entire partitioning scheme centers around the key column used for partitioning, so it’s critical to plan carefully which key column (or combination of columns) to partition on. The number of partitions should also be carefully thought out: too few partitions might not help, whereas too many would create too much overhead.

Genuine Blog

A Tech Blog by Leo Cheung
