Sunday 22 March 2015

Join Operations in Spark DataFrames

Below are some tests of inner joins on DataFrames.
Something seems wrong with the API: attempts 1-7 all result in a cross join (Cartesian product); only the 8th returns the expected inner join.
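For context, the post does not show how `new_df` and `old_df` were built. A minimal sketch that would make the snippets below reproducible in spark-shell (hypothetical schema and data, Spark 1.3-era API; assumes the shell's existing SparkContext `sc`):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // hypothetical schema — the original DataFrames are not shown in the post
    case class Record(phone: String, name: String)

    val new_df = sc.parallelize(Seq(
      Record("555-0100", "alice"),
      Record("555-0101", "bob"))).toDF()

    val old_df = sc.parallelize(Seq(
      Record("555-0100", "alice"),
      Record("555-0199", "mallory"))).toDF()
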

    //1. 
    val fraudsters = new_df.where(new_df("phone").in(old_df("phone")))

    //2
    val fraudsters = new_df.filter(new_df("phone").in(old_df("phone")))

    //3 
    val fraudsters = new_df.filter(new_df("phone")===old_df("phone"))

    //4 
    val fraudsters = new_df.join(old_df, new_df("phone")===old_df("phone"),"inner")

    //5
    val fraudsters = new_df.join(old_df, new_df("phone").equalTo(old_df("phone")),"inner")

    //6
    val fraudsters = new_df.join(old_df).where(new_df("phone") === old_df("phone"))

    //7
    val fraudsters = new_df.join(old_df, $"new_df.phone" === $"old_df.phones")

    //8 finally, correct!
    // (requires: import org.apache.spark.sql.functions.col)

    val df1 = new_df.as('df1)
    val df2 = old_df.as('df2)

    val fraudsters = df1.join(df2, col("df1.phone") === col("df2.phone"))
    
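Note that the aliased join in #8 still returns both sides' columns (two `phone` columns in the result). If the intent of the `.in(...)` attempts above was "keep only the rows of `new_df` whose phone also appears in `old_df`", a left semi join expresses that directly and returns only the left side's columns. A sketch, assuming the `"leftsemi"` join-type string accepted by `DataFrame.join` in Spark 1.3:

    import org.apache.spark.sql.functions.col

    val df1 = new_df.as('df1)
    val df2 = old_df.as('df2)

    // acts as an EXISTS filter: only df1's columns come back,
    // and a df1 row appears at most once even with multiple matches
    val fraudsters = df1.join(df2, col("df1.phone") === col("df2.phone"), "leftsemi")
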

References:
http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.DataFrame
https://github.com/apache/spark/pull/4847/files#r25584861
https://github.com/apache/spark/blob/v1.3.0/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala


