i haz teh codez: User Defined Functions in a Spark DataFrame

First lets create an example DataFrame. You could use anything (Avro, JSON etc) but for this we will create it programmatically.

import org.apache.spark.sql.functions._

case class Document(id: Integer, name: String)

val documents = Seq(
  new Document(1, "The Document"),
  new Document(2, "The Other Document"),
  new Document(3, "More Documents"),
  new Document(4, "Fred The Document"),
  new Document(5, "Ye Olde Document")
)

val documentsDF = sc.parallelize(documents, 4).toDF()

We want to do some processing on each row, in this case filter out the offensive words in the name of each document. Like Fred. First we create a User Defined Function (udf) which will take a string and perform some processing on it.

val documentCensor = udf {(x: String) => x.replaceAll("Fred", "****")}

This user defined function in this case does a simple string replacement on the input string but it could do something more complex.

documentsDF.select($"id", documentCensor($"name").as("name")).show

Now we select out the columns we want (note we are using select with columns, not select with string names). For the processed rows we use the udf defined before and give it a nice name with as.

i haz teh codez

Pages

Wednesday, 10 February 2016

User Defined Functions in a Spark DataFrame

No comments:

Post a Comment