import org.apache.spark.sql.functions._
case class Document(id: Integer, name: String) val documents = Seq( new Document(1, "The Document"), new Document(2, "The Other Document"), new Document(3, "More Documents"), new Document(4, "Fred The Document"), new Document(5, "Ye Olde Document") )
val documentsDF = sc.parallelize(documents, 4).toDF()
We want to do some processing on each row, in this case filter out the offensive words in the name of each document. Like Fred. First we create a User Defined Function (udf) which will take a string and perform some processing on it.
val documentCensor = udf {(x: String) => x.replaceAll("Fred", "****")}
This user defined function in this case does a simple string replacement on the input string but it could do something more complex.
documentsDF.select($"id", documentCensor($"name").as("name")).show
Now we select out the columns we want (note we are using select with columns, not select with string names). For the processed rows we use the udf defined before and give it a nice name with as.
No comments:
Post a Comment