DSE Market Prediction
Forecasting next-day directional moves on the Dhaka Stock Exchange using XGBoost on engineered features — with honest, leakage-aware backtesting.
- Python
- XGBoost
- Machine Learning
- Financial Analytics
- Data Science

01 — Problem
Why this project
Most published forecasting work on the DSE either uses leaky features (lookahead bias in technical indicators) or evaluates on a single train/test split, both of which inflate performance.
I wanted to see how much real, regime-robust predictive signal there is in publicly available daily data for the DSE Broad Index (DSEX) and a handful of large-caps — and how that signal degrades under walk-forward evaluation.
02 — Approach
How I tackled it
- 01
Pull daily OHLCV for DSEX and 10 large-caps from publicly archived end-of-day files. Align to business-day calendar.
- 02
Engineer features: lagged returns (1, 5, 10 days), realised volatility (5d, 20d), volume z-score, simple technical indicators (RSI, MACD), and sector dummies.
- 03
Frame the problem as binary classification: P(next-day return > 0). This avoids having to predict return magnitudes, and the direction is what most strategy code consumes anyway.
- 04
Walk-forward backtest with an expanding window — retrain every 60 trading days, evaluate on the next 60.
- 05
Optimise XGBoost hyperparameters via Bayesian search on the first training window only, then hold them fixed for all later windows so every regime is evaluated with the same configuration.
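The feature and target steps above can be sketched roughly as follows. This is a minimal illustration for a single ticker, not the project's actual code: the `close` and `volume` column names, the 20-day window for the volume z-score, and the reading of "lagged returns" as trailing k-day returns are all assumptions. RSI, MACD, and sector dummies are left out for brevity.

```python
import pandas as pd


def build_features(prices: pd.DataFrame) -> pd.DataFrame:
    """Build trailing-only features and a next-day target for one ticker.

    Assumes a sorted DatetimeIndex and `close` / `volume` columns
    (hypothetical names, not taken from the project).
    """
    close, volume = prices["close"], prices["volume"]
    daily_ret = close / close.shift(1) - 1.0

    out = pd.DataFrame(index=prices.index)
    # Trailing k-day returns: only closes up to and including day t.
    for k in (1, 5, 10):
        out[f"ret_{k}"] = close / close.shift(k) - 1.0
    # Realised volatility: rolling std of daily returns, trailing windows.
    for w in (5, 20):
        out[f"vol_{w}"] = daily_ret.rolling(w).std()
    # Volume z-score against a trailing 20-day window (assumed length).
    out["volume_z"] = (volume - volume.rolling(20).mean()) / volume.rolling(20).std()

    # Target: 1 if the close-to-close return from day t to t+1 is positive.
    next_ret = close.shift(-1) / close - 1.0
    out["target"] = (next_ret > 0).astype(int)
    # Drop the last row (no next day yet) and the warm-up rows.
    out = out[next_ret.notna()]
    return out.dropna()
```

The only column that looks forward is `target`. Every feature at day t is computed from data available at the close of day t, which is the property the walk-forward evaluation relies on.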
03 — Data sources
Where the data came from
| Source | Via | Rows |
|---|---|---|
| DSEX daily OHLCV | DSE archive CSV | ~3,200 |
| Large-cap OHLCV (10 tickers) | DSE archive CSV | ~32,000 |
| Sector classifications | Hand-curated mapping | — |
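Aligning the archive files to a trading calendar could look something like the sketch below. It assumes a Sunday-to-Thursday trading week for the DSE and does not model exchange holidays. The function name and the choice to leave missing sessions as NaN are illustrative, not the project's confirmed approach.

```python
import pandas as pd

# Assumption: DSE sessions run Sunday to Thursday. Exchange holidays are
# not modelled; they could be supplied via bdate_range's `holidays` argument.
DSE_WEEKMASK = "Sun Mon Tue Wed Thu"


def align_to_dse_calendar(frame: pd.DataFrame) -> pd.DataFrame:
    """Reindex one ticker's daily rows onto a Sunday-Thursday calendar."""
    frame = frame.sort_index()
    calendar = pd.bdate_range(
        frame.index.min(),
        frame.index.max(),
        freq="C",  # custom business day, required for weekmask to apply
        weekmask=DSE_WEEKMASK,
    )
    # Rows dated outside the calendar are dropped. Missing sessions stay
    # NaN rather than being forward-filled, so no prices are invented.
    return frame.reindex(calendar)
```

Leaving gaps as NaN keeps halts and holidays visible downstream. Whether to drop or fill them is a separate modelling decision.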
04 — Pipeline
End-to-end flow
- 01
Ingest
DSE EOD CSV files → pandas DataFrame
- 02
Calendar alignment
join on business-day index
- 03
Feature engineering
lags, vol, RSI, MACD, sector
- 04
Target
binary: next-day return > 0
- 05
Walk-forward split
60-day windows, expanding train
- 06
XGBoost training
Bayesian-tuned hyperparams (held fixed)
- 07
Out-of-sample metrics
directional accuracy, AUC, log-loss
- 08
Backtest
long/short on signal, transaction costs included
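The final backtest step can be illustrated with a small sketch. It assumes a fully invested +1/-1 position taken from the predicted probability and a flat one-way cost per unit of turnover. The `cost_bps` default and the 252-day annualisation factor are placeholders, not figures from this project.

```python
import numpy as np
import pandas as pd


def backtest_long_short(prob_up, next_ret, cost_bps=20.0):
    """Daily net returns for a +1/-1 strategy driven by P(up).

    cost_bps is an assumed one-way cost per unit of turnover, in basis
    points. It is a placeholder value, not a measured DSE cost.
    """
    prob_up = pd.Series(prob_up, dtype=float)
    next_ret = pd.Series(next_ret, dtype=float)
    # Long when the model favours an up day, short otherwise.
    position = pd.Series(np.where(prob_up > 0.5, 1.0, -1.0), index=prob_up.index)
    # Turnover: opening a position trades 1 unit, flipping sides trades 2.
    turnover = position.diff().abs()
    turnover.iloc[0] = abs(position.iloc[0])
    return position * next_ret - turnover * cost_bps / 1e4


def annualised_sharpe(daily_ret, periods_per_year=252):
    """Sharpe ratio with zero risk-free rate; 252 sessions is an assumption."""
    daily_ret = pd.Series(daily_ret, dtype=float)
    return np.sqrt(periods_per_year) * daily_ret.mean() / daily_ret.std()
```

Charging costs on turnover rather than per day means a signal that flips often pays for it, which is how thin edges end up disappearing in this kind of backtest.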
05 — Code
A key snippet
Walk-forward training loop (simplified)
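import xgboost as xgb
from sklearn.metrics import roc_auc_score

# df (date-indexed features plus "target") and FEATURES (list of column
# names) are built earlier in the pipeline.
WINDOW, STEP = 60, 60
results = []

for i in range(WINDOW, len(df), STEP):
    # Expanding window: train on everything before i, test on the next STEP rows.
    train = df.iloc[:i]
    test = df.iloc[i:i + STEP]
    X_train, y_train = train[FEATURES], train["target"]
    X_test, y_test = test[FEATURES], test["target"]

    # Hyperparameters are fixed across windows (tuned once on the first window).
    model = xgb.XGBClassifier(
        n_estimators=400, max_depth=4, learning_rate=0.04,
        subsample=0.8, colsample_bytree=0.7,
        eval_metric="logloss", tree_method="hist",
    )
    model.fit(X_train, y_train)

    # Probability of an up day for each out-of-sample row.
    p = model.predict_proba(X_test)[:, 1]
    results.append({
        "start": test.index[0],
        "auc": roc_auc_score(y_test, p),
        "dir_acc": ((p > 0.5) == y_test).mean(),
    })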
06 — Results
What it shipped
| Metric | Value |
|---|---|
| Out-of-sample AUC (median) | {{TODO: 0.5x}} |
| Directional accuracy | {{TODO: 5x%}} |
| Backtest Sharpe (incl. costs) | {{TODO: x.x}} |
| Walk-forward windows | {{TODO}} |
| Top feature by importance | {{TODO}} |
Caveat: Performance is regime-dependent — the model does best in trending periods and clearly worse in choppy, low-volume stretches. The directional accuracy on a single test split was misleadingly high (~62%); the walk-forward median is the honest number.
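One way to turn the per-window results into the median figures reported above is sketched here. It assumes a `results` list shaped like the one built in the walk-forward loop, with `start`, `auc`, and `dir_acc` keys. The helper name is hypothetical.

```python
import pandas as pd


def summarise_windows(results):
    """Median, min, and max of per-window metrics from a walk-forward run."""
    frame = pd.DataFrame(results).set_index("start")
    # The median is robust to a few unusually good or bad windows; min and
    # max show how wide the spread across regimes is.
    return frame[["auc", "dir_acc"]].agg(["median", "min", "max"])
```

Reporting the spread next to the median makes the regime dependence visible instead of hiding it behind a single number.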
07 — Lessons
What I learned
If you only run a single train/test split, you will fool yourself. Walk-forward is the minimum bar for time-series finance work.
Most of my 'gains' from feature engineering disappeared once I closed every avenue for lookahead bias (e.g. using same-day VWAP as a feature).
Transaction costs and slippage eat a surprising amount of edge on DSE — bid-ask spreads on smaller stocks are wider than I expected.
Tree models beat LSTMs handily here, because the dataset is small (~3,200 days) and the structure is mostly tabular.
Next iteration (v2): adding macro and sector features, regime-conditional models, and a public live dashboard.
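On the lookahead-bias lesson above: one cheap guard is to recompute each feature on data truncated at some cutoff and check that the earlier values do not change. The sketch below is a generic illustration with hypothetical helper names, not the check used in this project. It flags a centred rolling mean as leaky and passes a trailing one.

```python
import numpy as np
import pandas as pd


def has_lookahead(feature_fn, series: pd.Series, cutoff: int) -> bool:
    """Return True if feature values before `cutoff` depend on later data."""
    full = feature_fn(series).iloc[:cutoff]
    truncated = feature_fn(series.iloc[:cutoff])
    # equal_nan=True treats warm-up NaNs present in both versions as equal.
    return not np.allclose(full.to_numpy(), truncated.to_numpy(), equal_nan=True)


def trailing_mean(s: pd.Series) -> pd.Series:
    # Uses days t-4..t only, so it is safe to use at the close of day t.
    return s.rolling(5).mean()


def centred_mean(s: pd.Series) -> pd.Series:
    # Window spans t-2..t+2, so it peeks two days into the future.
    return s.rolling(5, center=True).mean()
```

Running a check like this over every feature function, at a few different cutoffs, catches most accidental peeking before it reaches the model.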
08 — Links