Thursday, 11 February 2016

Spark DataFrame: Programmatically filtering column names

In this post we have a DataFrame from which we wish to extract a number of columns, but we do not know the names of those columns ahead of time. For example:

case class Document(colA : String,
                    colB : String,
                    colC : String,
                    colD : String,
                    colE : String,
                    colF0 : String,
                    colF1 : String,
                    colF2 : String )

val documents = Seq(
  Document("colA", "colB", "colC", "colD", "colE", "colF0", "colF1", "colF2"),
  Document("1", "2", "3", "4", "5", "6", "7", "8")
)

val df = sc.parallelize(documents, 2).toDF
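
The snippet above assumes a spark-shell session, where sc and the toDF implicits are already in scope. If you want to run it as a standalone program (a sketch only; the app name and master setting are placeholders), the Spark 1.x era setup would look roughly like this:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Standalone setup, roughly what spark-shell gives you for free.
val conf = new SparkConf().setAppName("column-filter-example").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // provides the toDF conversion used above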

From this we wish to return just the columns with a name starting with colF.

First grab the column names with df.columns, then filter down to just the names you want with .filter(_.startsWith("colF")). This gives you an Array of Strings. The catch is that the obvious overload of select takes select(String, String*), which does not accept an Array. Luckily there is also an overload that takes columns, select(Column*), so convert the Strings into Columns with .map(df(_)) and then turn the Array of Columns into varargs with : _*.
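
Spelled out step by step (the intermediate names here are just for illustration):

val names = df.columns.filter(_.startsWith("colF"))  // Array(colF0, colF1, colF2)
val cols  = names.map(df(_))                         // Array[Column]
df.select(cols : _*).show

Or, collapsed into a single line: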
df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show
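
For the sample documents above, this should print something like:

+-----+-----+-----+
|colF0|colF1|colF2|
+-----+-----+-----+
|colF0|colF1|colF2|
|    6|    7|    8|
+-----+-----+-----+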
If we also want to select a number of other columns then we have two options. The first is to simply make the filter more complex.

df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show 

This can quickly get ugly, confusing and bloated if you are simply after a fixed list of other columns. A much cleaner solution is to merge a fixed list of column names with the filtered ones before mapping them to Columns.

df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show

This is based on my Stack Overflow answer.
