case class Document(colA : String,
colB : String,
colC : String,
colD : String,
colE : String,
colF0 : String,
colF1 : String,
colF2 : String )
val documents = Seq(
new Document("colA", "colB", "colC", "colD", "colE", "colF0", "colF1", "colF2"),
new Document("1", "2", "3", "4", "5", "6", "7", "8")
)
val df = sc.parallelize(documents, 2).toDF
From this we wish to return just the columns with a name starting with colF.
First grab the column names with
df.columns
, then filter down to just the column names you want.filter(_.startsWith("colF"))
. This gives you an array of Strings. But the select takes select(String, String*)
. Luckily select for columns is select(Column*)
, so finally convert the Strings into Columns with .map(df(_))
, and finally turn the Array of Columns into a var arg with : _*
.df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show
If we also want to select a number of other columns then we have two options, the first is to simply make the filter more complex.
df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show
This could easily get quite ugly, confusing and bloated if you are simply after a fixed list of other columns. A much cleaner solution would be to merge in a fixed list of columns.
df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show
This is based on my Stack Overflow answer.
No comments:
Post a Comment