@darrencl wrote:
Hi,
I want to test whether my variable is independent of the target `y` with `ChisqTest` from `HypothesisTests.jl`, so I think I need to use the contingency-table version instead of goodness of fit (like `sklearn`'s).

First, I one-hot-encoded my categorical variable, then fed it to the `ChisqTest` function. I saw there is a `k` parameter which affects the degrees of freedom (it seems degrees of freedom = (k - 1)^2). I am not a statistician, so what value should I put?

Anyway, using my one-hot-encoded features, this produces `NaN` p-values for all of my features. Why is that? I am using the Titanic dataset from `RDatasets.jl`. Here’s a sample that produces `NaN` when testing one of my features against the target `y` (Survived).

    julia> titanic = dataset("datasets", "Titanic");

    julia> X = one_hot_encode(titanic[:, [:Class, :Sex, :Age]]; drop_original=true)
    32×8 DataFrame
    │ Row │ Class_1st │ Class_2nd │ Class_3rd │ Class_Crew │ Sex_Female │ Sex_Male │ Age_Adult │ Age_Child │
    │     │ Bool      │ Bool      │ Bool      │ Bool       │ Bool       │ Bool     │ Bool      │ Bool      │
    ├─────┼───────────┼───────────┼───────────┼────────────┼────────────┼──────────┼───────────┼───────────┤
    │ 1   │ 1         │ 0         │ 0         │ 0          │ 0          │ 1        │ 0         │ 1         │
    │ 2   │ 0         │ 1         │ 0         │ 0          │ 0          │ 1        │ 0         │ 1         │
    │ 3   │ 0         │ 0         │ 1         │ 0          │ 0          │ 1        │ 0         │ 1         │
    │ 4   │ 0         │ 0         │ 0         │ 1          │ 0          │ 1        │ 0         │ 1         │
    │ 5   │ 1         │ 0         │ 0         │ 0          │ 1          │ 0        │ 0         │ 1         │
    │ 6   │ 0         │ 1         │ 0         │ 0          │ 1          │ 0        │ 0         │ 1         │
    │ 7   │ 0         │ 0         │ 1         │ 0          │ 1          │ 0        │ 0         │ 1         │
    │ 8   │ 0         │ 0         │ 0         │ 1          │ 1          │ 0        │ 0         │ 1         │
    │ 9   │ 1         │ 0         │ 0         │ 0          │ 0          │ 1        │ 1         │ 0         │
    │ 10  │ 0         │ 1         │ 0         │ 0          │ 0          │ 1        │ 1         │ 0         │
    │ 11  │ 0         │ 0         │ 1         │ 0          │ 0          │ 1        │ 1         │ 0         │
    │ 12  │ 0         │ 0         │ 0         │ 1          │ 0          │ 1        │ 1         │ 0         │
    │ 13  │ 1         │ 0         │ 0         │ 0          │ 1          │ 0        │ 1         │ 0         │
    │ 14  │ 0         │ 1         │ 0         │ 0          │ 1          │ 0        │ 1         │ 0         │
    │ 15  │ 0         │ 0         │ 1         │ 0          │ 1          │ 0        │ 1         │ 0         │
    │ 16  │ 0         │ 0         │ 0         │ 1          │ 1          │ 0        │ 1         │ 0         │
    │ 17  │ 1         │ 0         │ 0         │ 0          │ 0          │ 1        │ 0         │ 1         │
    │ 18  │ 0         │ 1         │ 0         │ 0          │ 0          │ 1        │ 0         │ 1         │
    │ 19  │ 0         │ 0         │ 1         │ 0          │ 0          │ 1        │ 0         │ 1         │
    │ 20  │ 0         │ 0         │ 0         │ 1          │ 0          │ 1        │ 0         │ 1         │
    │ 21  │ 1         │ 0         │ 0         │ 0          │ 1          │ 0        │ 0         │ 1         │
    │ 22  │ 0         │ 1         │ 0         │ 0          │ 1          │ 0        │ 0         │ 1         │
    │ 23  │ 0         │ 0         │ 1         │ 0          │ 1          │ 0        │ 0         │ 1         │
    │ 24  │ 0         │ 0         │ 0         │ 1          │ 1          │ 0        │ 0         │ 1         │
    │ 25  │ 1         │ 0         │ 0         │ 0          │ 0          │ 1        │ 1         │ 0         │
    │ 26  │ 0         │ 1         │ 0         │ 0          │ 0          │ 1        │ 1         │ 0         │
    │ 27  │ 0         │ 0         │ 1         │ 0          │ 0          │ 1        │ 1         │ 0         │
    │ 28  │ 0         │ 0         │ 0         │ 1          │ 0          │ 1        │ 1         │ 0         │
    │ 29  │ 1         │ 0         │ 0         │ 0          │ 1          │ 0        │ 1         │ 0         │
    │ 30  │ 0         │ 1         │ 0         │ 0          │ 1          │ 0        │ 1         │ 0         │
    │ 31  │ 0         │ 0         │ 1         │ 0          │ 1          │ 0        │ 1         │ 0         │
    │ 32  │ 0         │ 0         │ 0         │ 1          │ 1          │ 0        │ 1         │ 0         │

    julia> y = Vector{Int64}(recode(titanic.Survived, "No" => 1, "Yes" => 2));

    julia> X_data = convert(Matrix, X);

    julia> ChisqTest(Int.(X_data[:, 1]), y, 2)
    Pearson's Chi-square Test
    -------------------------
    Population details:
        parameter of interest:   Multinomial Probabilities
        value under h_0:         [0.5, 0.0, 0.5, 0.0]
        point estimate:          [0.5, 0.0, 0.5, 0.0]
        95% confidence interval: Tuple{Float64,Float64}[(0.25, 0.8761), (0.0, 0.3761), (0.25, 0.8761), (0.0, 0.3761)]

    Test summary:
        outcome with 95% confidence: reject h_0
        one-sided p-value:           NaN

    Details:
        Sample size:        8
        statistic:          NaN
        degrees of freedom: 1
        residuals:          [0.0, NaN, 0.0, NaN]
        std. residuals:     [NaN, NaN, NaN, NaN]
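
One possible reading of this output, offered as a sketch rather than a confirmed diagnosis: the reported sample size is 8 instead of 32, and two cells have probability 0.0 under h_0. That pattern would arise if only values in `1:k` are tabulated, so the `0` entries of a one-hot column are dropped and the level `2` never occurs. The plain-Julia code below reproduces the arithmetic without calling `HypothesisTests.jl`. The helpers `contingency` and `pearson_statistic` are hypothetical names written for this illustration, the tabulation rule is an assumption about the package's behaviour, and `y` is assumed to be `1` ("No") for the first 16 rows and `2` ("Yes") for the rest.

```julia
# Hypothetical helper (not part of HypothesisTests.jl): tabulate two integer
# vectors into a k×k table. ASSUMPTION: values outside 1:k are ignored.
function contingency(x::AbstractVector{<:Integer}, y::AbstractVector{<:Integer}, k::Integer)
    table = zeros(Int, k, k)
    for (xi, yi) in zip(x, y)
        if 1 <= xi <= k && 1 <= yi <= k
            table[xi, yi] += 1
        end
    end
    return table
end

# Pearson statistic: sum over cells of (observed - expected)^2 / expected,
# where expected = (row total * column total) / grand total.
function pearson_statistic(table::AbstractMatrix{<:Integer})
    n = sum(table)
    expected = sum(table, dims=2) * sum(table, dims=1) / n
    return sum((table .- expected) .^ 2 ./ expected)
end

# Class_1st column as shown in the post: 1, 0, 0, 0 repeated eight times.
x = repeat([1, 0, 0, 0], 8)
# ASSUMPTION: Survived is "No" (1) for rows 1-16 and "Yes" (2) for rows 17-32.
y = vcat(fill(1, 16), fill(2, 16))

raw = contingency(x, y, 2)           # zeros in x are dropped; row 2 stays empty
println(raw)
println(pearson_statistic(raw))      # empty row gives 0/0 terms, hence NaN

shifted = contingency(x .+ 1, y, 2)  # recode 0/1 as 1/2 so every row is counted
println(shifted)
println(pearson_statistic(shifted))  # finite, because no expected count is zero
```

Under these assumptions the unshifted table has an all-zero row, so its expected counts are zero and the statistic contains 0/0 terms. Recoding the column to `1`/`2` keeps all 32 rows, each counted once as in the post. For a table with r rows and c columns the usual degrees of freedom are (r - 1)(c - 1), which equals (k - 1)^2 when both variables have k levels.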