Active · v2 in progress · Mar 2026 – ongoing

DSE Market Prediction

Forecasting next-day directional moves on the Dhaka Stock Exchange using XGBoost on engineered features — with honest, leakage-aware backtesting.

  • Python
  • XGBoost
  • Machine Learning
  • Financial Analytics
  • Data Science

01 — Problem

Why this project

Most published forecasting work on the DSE either uses leaky features (lookahead bias in technical indicators) or evaluates on a single train/test split, both of which inflate performance.

I wanted to see how much real, regime-robust predictive signal there is in publicly available daily data for the DSE Broad Index (DSEX) and a handful of large-caps — and how that signal degrades under walk-forward evaluation.

02 — Approach

How I tackled it

  1. Pull daily OHLCV for DSEX and 10 large-caps from publicly archived end-of-day files. Align to a business-day calendar.

  2. Engineer features: lagged returns (1, 5, 10 days), realised volatility (5d, 20d), volume z-score, simple technical indicators (RSI, MACD), and sector dummies.

  3. Frame the problem as binary classification: P(next-day return > 0). This avoids having to calibrate a continuous return forecast, and direction is what most strategy code cares about anyway.

  4. Run a walk-forward backtest with an expanding window: retrain every 60 trading days, then evaluate on the next 60.

  5. Optimise XGBoost hyperparameters via Bayesian search on the first training window only, then hold them fixed for fairness across regimes.
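
The feature and target steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names `close` and `volume`, the 20-day volume window, the 14-day RSI period, and the reading of "lagged returns" as trailing k-day returns are all assumptions, and MACD and sector dummies are left out for brevity. The point it shows is that every feature uses data up to the close of day t, and the only forward shift is applied to the label.

```python
import numpy as np
import pandas as pd


def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI from simple rolling means of gains and losses (one common variant)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    return 100 - 100 / (1 + gain / loss)


def make_features(ohlcv: pd.DataFrame) -> pd.DataFrame:
    """Features known at the close of day t, plus the day t+1 direction label."""
    close, volume = ohlcv["close"], ohlcv["volume"]
    ret = close.pct_change()

    out = pd.DataFrame(index=ohlcv.index)
    for k in (1, 5, 10):
        out[f"ret_{k}d"] = close.pct_change(k)    # trailing k-day return
    for w in (5, 20):
        out[f"vol_{w}d"] = ret.rolling(w).std()   # realised volatility
    out["volume_z"] = (volume - volume.rolling(20).mean()) / volume.rolling(20).std()
    out["rsi_14"] = rsi(close)

    # shift(-1) is the only forward-looking operation, and it touches
    # the label only, never a feature.
    next_ret = ret.shift(-1)
    out["target"] = (next_ret > 0).astype(int)

    # Drop the final row (next-day return unknown) and the warm-up rows.
    return out[next_ret.notna()].dropna()
```

Dropping the last row matters: without it, the unknown next-day return would silently be labelled 0.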

03 — Data sources

Where the data came from

Source                         Via                    Rows
DSEX daily OHLCV               DSE archive CSV        ~3,200
Large-cap OHLCV (10 tickers)   DSE archive CSV        ~32,000
Sector classifications         Hand-curated mapping
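
Before any features are computed, the per-ticker files need to sit on one shared trading calendar. A rough sketch of that alignment is below. It assumes a Sunday-to-Thursday trading week and ignores public holidays, and the function name `align_to_calendar` and the dict-of-frames input are hypothetical. Missing sessions are left as NaN rather than forward-filled, so that a non-trading day is not mistaken for a zero-return day.

```python
import pandas as pd

# Assumption: the exchange trades Sunday to Thursday (Friday-Saturday weekend).
DSE_WEEKMASK = "Sun Mon Tue Wed Thu"


def align_to_calendar(frames: dict[str, pd.DataFrame],
                      start: str, end: str) -> pd.DataFrame:
    """Reindex each ticker's date-indexed frame onto one shared calendar.

    Returns a wide frame with (ticker, field) MultiIndex columns.
    """
    calendar = pd.bdate_range(start, end, freq="C", weekmask=DSE_WEEKMASK)
    aligned = {ticker: frame.reindex(calendar) for ticker, frame in frames.items()}
    return pd.concat(aligned, axis=1)
```

A real calendar would also pass exchange holidays through the `holidays` argument of `pd.bdate_range`.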

04 — Pipeline

End-to-end flow

  1. Ingest: DSE EOD CSV files → pandas DataFrame

  2. Calendar alignment: join on business-day index

  3. Feature engineering: lags, vol, RSI, MACD, sector

  4. Target: binary, next-day excess return > 0

  5. Walk-forward split: 60-day windows, expanding train

  6. XGBoost training: Bayesian-tuned hyperparams (held fixed)

  7. Out-of-sample metrics: directional accuracy, AUC, log-loss

  8. Backtest: long/short on signal, transaction costs included
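
The final backtest step can be illustrated with a small sketch. It is not the project's backtester: the 0.5 probability threshold, the proportional cost of 0.5% per unit of position change, and the 252 periods per year used for annualisation are placeholder assumptions that should be replaced with real DSE figures. It shows how costs are charged on turnover, so a flip from long to short pays twice.

```python
import numpy as np
import pandas as pd


def backtest_long_short(prob_up: pd.Series, next_ret: pd.Series,
                        cost_per_turn: float = 0.005) -> pd.Series:
    """Daily net returns of a +1/-1 strategy driven by P(up).

    prob_up[t]    : predicted probability that day t+1 closes up
    next_ret[t]   : realised return from close t to close t+1
    cost_per_turn : assumed proportional cost per unit of position change
    """
    position = pd.Series(np.where(prob_up > 0.5, 1.0, -1.0), index=prob_up.index)
    # Entering from flat costs one unit of turnover, a long/short flip costs two.
    turnover = position.diff().abs().fillna(position.abs())
    return position * next_ret - turnover * cost_per_turn


def annualised_sharpe(daily: pd.Series, periods: int = 252) -> float:
    """Annualised Sharpe ratio of daily returns (zero risk-free rate assumed)."""
    return float(np.sqrt(periods) * daily.mean() / daily.std())
```

Feeding it the concatenated out-of-sample probabilities from the walk-forward loop keeps the Sharpe figure free of in-sample predictions.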

05 — Code

A key snippet

Walk-forward training loop (simplified)

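import xgboost as xgb
from sklearn.metrics import roc_auc_score

# `df` (date-indexed features plus a binary "target" column) and FEATURES
# (list of feature column names) are assumed to be built upstream.
WINDOW, STEP = 60, 60  # initial training length and retrain interval, in trading days
results = []

for i in range(WINDOW, len(df), STEP):
    train = df.iloc[:i]           # expanding window: everything before the test block
    test = df.iloc[i:i + STEP]    # the next STEP trading days, strictly out of sample

    X_train, y_train = train[FEATURES], train["target"]
    X_test, y_test = test[FEATURES], test["target"]

    # AUC is undefined when a test block contains a single class
    # (possible for a short final block), so skip such blocks.
    if y_test.nunique() < 2:
        continue

    # Hyperparameters are held fixed across all windows.
    model = xgb.XGBClassifier(
        n_estimators=400, max_depth=4, learning_rate=0.04,
        subsample=0.8, colsample_bytree=0.7,
        eval_metric="logloss", tree_method="hist",
    )
    model.fit(X_train, y_train)

    p = model.predict_proba(X_test)[:, 1]  # P(target == 1)
    results.append({
        "start": test.index[0],
        "auc": roc_auc_score(y_test, p),
        "dir_acc": ((p > 0.5).astype(int) == y_test).mean(),
    })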
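
Once the loop has run, the per-window `results` list can be collapsed into the summary numbers the write-up reports. The helper below is a sketch with a hypothetical name (`summarise_windows`). It reports medians rather than means so that one unusually good or bad window does not dominate.

```python
import pandas as pd


def summarise_windows(results: list[dict]) -> pd.Series:
    """Collapse per-window walk-forward metrics into a few headline numbers."""
    frame = pd.DataFrame(results).set_index("start")
    return pd.Series({
        "n_windows": len(frame),
        "auc_median": frame["auc"].median(),
        "dir_acc_median": frame["dir_acc"].median(),
        # Share of windows where the model beat a coin flip on AUC.
        "share_auc_above_0.5": (frame["auc"] > 0.5).mean(),
    })
```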

06 — Results

What it shipped

Metric                          Value
Out-of-sample AUC (median)      {{TODO: 0.5x}}
Directional accuracy            {{TODO: 5x%}}
Backtest Sharpe (incl. costs)   {{TODO: x.x}}
Walk-forward windows            {{TODO}}
Top feature by importance       {{TODO}}

Caveat: Performance is regime-dependent — the model does best in trending periods and clearly worse in choppy, low-volume stretches. The directional accuracy on a single test split was misleadingly high (~62%); the walk-forward median is the honest number.

07 — Lessons

What I learned

  • If you only run a single train/test split, you will fool yourself. Walk-forward is the minimum bar for time-series finance work.

  • Most of my 'gains' from feature engineering disappeared once I closed every avenue for lookahead bias (one example was a same-day VWAP feature, which I removed).

  • Transaction costs and slippage eat a surprising amount of edge on DSE — bid-ask spreads on smaller stocks are wider than I expected.

  • Tree models beat LSTMs handily here, because the dataset is small (~3,200 days) and the structure is mostly tabular.

  • Next iteration (v2): adding macro and sector features, regime-conditional models, and a public live dashboard.
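
One way to hunt for the lookahead bias mentioned in the lessons above is a truncation check: compute features on the full history, then again on a truncated history, and confirm the overlapping rows match. If adding later rows changes an earlier feature value, that feature is reading the future. The sketch below is illustrative only. The names `has_lookahead`, `causal`, and `leaky` are hypothetical, and it assumes the feature function returns one numeric row per input row.

```python
import numpy as np
import pandas as pd


def has_lookahead(feature_fn, data: pd.DataFrame, cut: int) -> bool:
    """True if features for the first `cut` rows change when later rows exist."""
    full = feature_fn(data).iloc[:cut]
    truncated = feature_fn(data.iloc[:cut])
    same = np.allclose(full.to_numpy(dtype=float),
                       truncated.to_numpy(dtype=float), equal_nan=True)
    return not same


def causal(d: pd.DataFrame) -> pd.DataFrame:
    # Trailing window: uses rows t-4 .. t only.
    return pd.DataFrame({"ma5": d["close"].rolling(5).mean()})


def leaky(d: pd.DataFrame) -> pd.DataFrame:
    # Centred window: uses rows t-2 .. t+2, so it peeks two days ahead.
    return pd.DataFrame({"ma5": d["close"].rolling(5, center=True).mean()})
```

Run against any price frame, `has_lookahead(causal, data, cut)` should come back False and `has_lookahead(leaky, data, cut)` True. The check catches features that change when future rows are added, not every possible form of leakage.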

08 — Links

References