Quantitative Models and Construction Methods

- Model Name: Random Forest Model
- Model Construction Idea: The core idea of the Random Forest model is to build multiple decision trees and aggregate their predictions. Randomly selecting features at each split reduces correlation among the trees, and ensemble learning improves accuracy, robustness, and generalization[7][15][30].
- Model Construction Process:
  1. Use bootstrap sampling to draw multiple subsets from the original dataset[30].
  2. Build a decision tree for each subset, randomly selecting a portion of the features as split candidates at each node[31][32].
  3. Repeat the above steps until the specified number of decision trees has been generated[33].
- Formulas:
  - Entropy of the dataset: $H(D)=-\sum_{i=1}^{m}p_{i}\log_{2}(p_{i})$
  - Conditional entropy for feature "Market Sentiment" $A$: $H(D|A)=\sum_{v\in \mathrm{Values}(A)}\frac{|D_{v}|}{|D|}H(D_{v})$
  - Information gain: $\mathrm{Gain}(D,A)=H(D)-H(D|A)$[22][24][25]
- Model Evaluation: The Random Forest model generalizes well, handles high-dimensional and missing data effectively, and provides feature-importance analysis; however, it is computationally expensive and sensitive to parameter settings[12][66][70].

Model Backtesting Results - Random Forest Model
- Annualized Return: 39.76%[11][65]
- Excess Return: 40.01%[11][65]
- Sharpe Ratio: 2.82[11][65]
- Year-to-Date Return (2025): 73.81%[11][65]
- Year-to-Date Excess Return (2025): 60.49%[11][65]
- Maximum Drawdown: -17.14%[65]
- Excess Maximum Drawdown: -6.59%[65]
- Information Ratio (IR): 4.78[65]

Quantitative Factors and Construction Methods
- Factor Selection: Factors with an absolute IC greater than 2.5% were selected for model fitting[8][44].
- Factor List:
  - Positive-IC factors: turnover rate (6.44%), P/NAV (17.21%), free-float market value (22.02%), previous week's return (15.38%), etc.
  - Negative-IC factors: previous closing price (-9.95%), opening price (-10.81%), CSI REITs valuation (-8.41%), etc.[45][46]

Factor Backtesting Results - Selected Factors
- IC Range: from -23.87% to 22.02%[45][46]

Model Construction Details - Parameter Sensitivity Analysis
- Number of trees (n_estimators): tested over the range 1 to 200; the optimal value of 100 was selected by RMSE minimization[49][51].
- Feature count (max_features): no restriction applied, given the limited number of features (27 factors)[52][53].
- Tree depth (max_depth): optimal depth of 15 determined by grid search[55][58].
- Minimum samples per leaf (min_samples_leaf): optimal value of 15 determined by grid search[55][58].

Model Performance Metrics
- In-Sample Results:
  - Mean Squared Error (MSE): 0.00044[59]
  - Root Mean Squared Error (RMSE): 0.021[60]
  - R² (Coefficient of Determination): 0.501[61]
- Out-of-Sample Results:
  - R²: 0.51983 (max_depth=15, min_samples_leaf=10)[56][58]

Model Evaluation
- Advantages:
  - Captures nonlinear relationships and complex interactions[68].
  - Robust against noise and overfitting[68].
  - Provides feature-importance evaluation for better interpretability[68][69].
- Disadvantages:
  - High computational cost and complexity[70].
  - Requires extensive hyperparameter tuning[70].
  - Limited interpretability compared to linear models[70].
  - Potential overfitting with deep trees[70].
  - Challenges in handling imbalanced datasets[70].
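The entropy, conditional entropy, and information-gain formulas above can be checked numerically. The sketch below is a minimal illustration; the "Market Sentiment" values and up/down labels are hypothetical toy data, not the report's dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_i p_i * log2(p_i) over the classes in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """H(D|A) = sum over v in Values(A) of |D_v|/|D| * H(D_v)."""
    n = len(labels)
    h = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        h += len(subset) / n * entropy(subset)
    return h

def information_gain(feature_values, labels):
    """Gain(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)

# Hypothetical toy data: a "Market Sentiment" feature vs. an up/down label.
sentiment = ["bull", "bull", "bear", "bear", "neutral", "neutral"]
direction = ["up", "up", "down", "down", "up", "down"]
print(round(information_gain(sentiment, direction), 4))  # → 0.6667
```

A pure sentiment value ("bull", "bear") has zero subset entropy, so only the mixed "neutral" bucket contributes to $H(D|A)$, giving a gain of $1 - 1/3 \approx 0.6667$.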
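The construction process and the grid-searched hyperparameters above can be sketched with scikit-learn's `RandomForestRegressor`, which implements bootstrap sampling and per-split feature subsampling internally. The report does not publish its code, so the synthetic 27-factor panel and the train/test split below are placeholders; only the hyperparameter values (n_estimators=100, max_depth=15, min_samples_leaf=15, unrestricted max_features) come from the report.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 27-factor panel: 500 weekly observations,
# with a synthetic next-period return as the target.
X = rng.normal(size=(500, 27))
y = 0.02 * X[:, 0] + rng.normal(scale=0.02, size=500)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Hyperparameters taken from the report's sensitivity analysis:
# 100 trees (RMSE-minimizing in 1..200), depth 15 and 15 samples per
# leaf from grid search, max_features=None since there are only 27 factors.
model = RandomForestRegressor(
    n_estimators=100,
    max_depth=15,
    min_samples_leaf=15,
    max_features=None,
    random_state=0,
)
model.fit(X_train, y_train)

rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"out-of-sample RMSE: {rmse:.4f}")
```

In practice the tuning itself would be run with a grid search over `max_depth` and `min_samples_leaf` on a held-out sample, mirroring the procedure the report describes.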
Application of the Random Forest Model to REITs Funds
Minsheng Securities·2025-08-06 08:45