Asset Allocation via Machine Learning

The authors are indebted to the MOE (Ministry of Education of China) General Program of Humanities, Philosophies and Social Sciences “A Study on the Sentiment and Behavior of Financial Market Investors Based on Deep Learning Technology” (Grant No.19YJA790104) for providing financial support. Abstract In this paper, we document a novel machine learning-based numerical framework to solve static and dynamic portfolio optimization problems, with, potentially, an extremely large number of assets. The framework proposed applies to general constrained optimization problems and overcomes many major difficulties arising in current literature. We not only empirically test our methods in U.S. and China A-share equity markets, but also run a horse-race comparison of some optimization schemes documented in (Homescu, 2014). We record significant excess returns, relative to the selected benchmarks, in both U.S. and China equity markets using popular schemes solved by our framework, where the conditional expected returns are obtained via machine learning regression, inspired by (Gu, Kelly & Xiu, 2020) and (Leippold, Wang & Zhou, 2021), of future returns on pricing factors carefully chosen.

simulation and portfolio weights computation module, generates uniformly distributed portfolio weights in the constrained region and selects the globally optimal one corresponding to a certain objective utility function with nonlinear constraints. The method in this module is fast and convergent under mild assumptions. Most importantly, this module does not require the computation of joint higher order moments (Note 1) of a random vector (e.g., the rate of returns), which is computationally expensive especially in high dimensions. The last module evaluates the profit and loss of the selected portfolios and performs backtesting. We empirically test our methodology with numerical examples covering various objective functions and constraints, in U.S. and Chinese stock markets.

Literature Review
Portfolio optimization is an important topic in financial economics that attracts both academic researchers and financial practitioners. Starting from the seminal work of modern portfolio theory in (Markowitz, 1952), where it was suggested that correlations between different assets should be included as inputs to portfolio optimization practice, in addition to volatility and expected returns, the theory of portfolio selection has become rigorous science rather than art. The modern portfolio theory later inspired the famous CAPM model proposed in (Treynor, 1962) and later extended in (Merton, 1973) to a dynamic setting. Through years, we have witnessed an explosion of the number of results on portfolio optimization as the advancement of theoretical and empirical research, and the increase of computational power. Refinements on modern portfolio theory have been studied in (Brodie, Daubechies, De Mol, Giannone & Loris, 2008) in a context of sparse portfolios that stabilizing the output of mean-variance model. The carrier portfolio strategy, described in (Kusiak, 2013), is another approach which can deliver sparse and stable portfolios. It relies on a simple, mathematical linear program that directly considers and treats each return observation as individual data, with no assumptions on joint return distributions. Other examples, in a single period setting, include the Black and Litterman approach that was extended by (Black & Litterman, 1992), in which a Bayesian type analysis was used to incorporate investors' ex-ante views that correct the equilibrium CAPM expected asset returns. Because the first order moments of asset returns are notoriously hard to estimate, simple return-agnostic strategies were created to account for this phenomenon. For example, the minimum variance portfolio and equal weight (1-over-N) portfolio are studied in (Demiguel, Plyakha, Uppal & Vilkov, 2013) and (Allen, Lizieri & Satchell, 2012). In addition to the first and second moments, there have been strategies based on higher order moments, namely, skewness and kurtosis, appearing in the work of (Harvey, Liechty, Liechty & Mueller, 2010) and many others. The computation of higher order co-moments may rely on factor models, such as the single-factor method considered in (Martellini & Ziemann, 2010), or forward-looking information obtained from option prices as documented in (Christoffersen, Jacobs & Chang, 2012). But whichever approach we choose, the methods suffer greatly from the curse of dimensionality as the dimension of asset span increases. Since the proposal of (Ang, 2013) on factor investing, factor-based strategies start to emerge. References can be found in (Koedijk, Slager & Stork, 2014) and (Roncalli & Weisang, 2012), among others. Additional return agnostic, or risk-based strategies, such as maximum diversification strategy, risk-parity strategies, and their variations have been proposed. Research results can be found in (Anderson, Bianchi & Goldberg, 2014) and references therein. Another strand of literature focuses on dynamic portfolio allocation in a randomly varying market environment and considers multi-period optimization problems. The pioneering work attributes to (Merton, 1971) with numerous later refinements, see (Cvitanic & Karatzas, 1992) and (Schroder & Skiadas, 1999) for example. The characteristics of this type of problems are that they often involve dynamic programming and in a continuous time setting, a PDE, or equivalently a BSDE, system often needs to be solved.
Theoretically being sound, the Markowitz's mean-variance optimization method, still popular in financial industry, has many obvious drawbacks, which are well-documented in the literature, for example, (Homescu, 2014) and (Perrin & Roncalli, 2019). There are three major difficulties with respect to the mean-variance approach. The first is that the method output is extremely sensitive to the model input, which is often difficult to estimate accurately. The second is that the analytical solution of the quadratic programming problem involved requires the computation of a large covariance matrix and its inverse, when the number of assets is large. Third, the computational burden increases sharply if multiple linear or nonlinear constraints are added. To address the input estimation problem, many deep-learning based methods have been proposed recently, such as (Yang, Liu & Wu, 2018), (Gu, Kelly & Xiu, 2020), (Babiak & Barunik, 2020) and references therein. To address the second and third questions, (Perrin & Roncalli, 2019) reviewed machine learning optimization approaches to solve the constrained quadratic programming problem in high dimensions. In (Homescu, 2014), general formulations of static portfolio optimization are outlined, taking into consideration the reward, risk and various constraints on optimal portfolios. Compared to the reference in the literature, our method enjoys all the advantages while being theoretically simple and easy to implement in practice.

Main Contributions
The contributions of this paper are five-folds. First, it proposes a fast and convergent numerical framework, which is universal and applies to arbitrary constrained optimization problems with unique solutions without calling explicitly any optimization routines, unlike the current problem-specific deep learning-based methods in the literature. Second, our methodology involves no estimation of cross higher order moments of the asset span by its construction. This is crucial for the methodology to overcome the curse of dimensionality when higher order moments are involved, and the number of assets is large. Third, the paper proposes to use the (hierarchical) clustering method to reduce the dimension of the optimization problem, when there are many assets in the portfolio. Fourth, the paper advocates using deep learning techniques to estimate the model inputs. Fifth, we provide empirical studies on portfolio choice among a large number of stocks in China and U.S. equity markets and evaluate the performance.

Organization of This Paper
The paper is organized as following. Section 2 describes the methodology and the optimization framework. Section 3 performs numerical experiments and Section 4 concludes.

Main Methodology
In this section, we document and present first three of the aforementioned 4 modules. As the first step, we prepare the inputs for the optimization process, which are often the first and higher order moments of asset return vectors. Second, we decompose the asset span, which usually consists of thousands of assets, with the help of a set of risk factors, into small subsets in a hierarchical manner, perform optimization on each sub-level and obtain the final optimal weight on every asset based on the intermediate weights on each cluster. Third, most importantly, we perform portfolio optimization at each clustering level in the hierarchy based on Monte-Carlo simulation. The last step is to compute the results of the back testing and present the transaction cost at each point in time. Theoretical convergence results are provided in this section and we will illustrate how to perform portfolio choice numerically for static problems.

Input Preparation and Orthogonalization
The current portfolio choice literature often requires users input various moments of the asset returns. To compute the th order conditional moments , where denotes the vector of asset returns and represents the standard multi-index notation, i.e., , we first look at the general semi-martingale decomposition below, written in matrix and vector notations Here the random source term satisfies and it has unit variance-covariance matrix. We will have the higher order moments of as functions of if we assume that its distribution is elliptic.
Single or multi-variate time series models have been proposed to estimate , and even higher order joint moments of the asset return vector. From the simplest single variable linear autoregressive (AR) and moving average models (MA) of order 1 to vector auto-regressive Models (VAR) or even non-linear parametric models, we can develop complicated tools for the estimation purposes, dependent on the availability of sufficient data set. This set of parametric models often assumes a predetermined functional dependency of the moments on the past data. Another route resorts to non-parametric models, to recover the unknown functional relationship between the quantities that we want to estimate and the available data, like the kernel density estimation approach, to name a few. However, the problem with both approaches are that they are either too stringent on the model assumptions, or too sensitive to the data sets chosen for the estimation task. Both suffer severely from the curse of dimensionality problems.
. For example, and can be computed via machine learning methods (see (Gu, Kelly & Xiu, 2020)) or a Monte-Carlo simulation and clustering-based method introduced in (Zhang, 2020) More detailed analysis of the choice of factors and a review of popular regression methodologies can be found in (Gu, Kelly & Xiu, 2020) for U.S. market and (Leippold, Wang & Zhou, 2021) for China A-share market. The machine learning methods, as (Gu, Kelly & Xiu, 2020) reviewed, range from linear regression, LASSO, ridge regression, elastic-net, partial least squares (PLS), to methods introducing non-linearity, such as artificial neural networks, which constitute a big category with deep neural networks, convolution neural networks, recursive neural networks, generative adversal networks, graph-based neural networks and so on. Popular methods also include regression trees, random forests and support vector machines. Those tools listed above are quite flexible and are able to approximate any continuous functions in a compact domain of the Euclidean space. Empirically being effective, the machine learning methods also enjoy certain mathematical rigor, e.g., the universal approximation theorem that neural networks rely upon. One might argue that the machine learning methods are not only complicated to implement but also require huge amounts of data for inference purposes because they often depend on a large number of unknown parameters that one needs to recover. On the contrary, as computational power becomes cheaper and the development of programming languages, those are no longer the obstacles that prevent people from using ML tools. In Python programming language, there are multiple packages, such as keras, sklearn or even xgboost, that enable users to build machine learning structures, i.e., DNN, boosting regression trees, and perform regression or classification prediction tasks. The packages are quite efficient in terms of computational speed with moderate memory requirements for most financial applications. To implement the methods, we need to determine the most effective ML method for the data set we choose, as well as a set of factors, features or characteristics on which we can express the conditional expected moments of asset returns, as a known function. The choice of factors is very important and directly relates to the accuracy of the predictions. In what follows, we will outline the factors in the empirical studies for the completeness of the paper.
In empirical studies, we often have , meaning that we will only compute the first order moments. The estimation of cross higher order moments, for example, the second order moments, also known as the covariance matrix, is not necessary. To understand this, consider a lead-lag panel regression of .
The second order moment of the weighted assets can be represented by , where is the associated factor value of the synthetic asset . This is an interpolation problem when , on which the machine learning methods work well. The same applies to higher order cross moments.

Asset Space Decomposition
In order to apply the bottom-up approach to construct the optimal portfolios, we first use a topdown (hierarchical) clustering method (see (Raffinot, 2019) as an introduction to hierarchical clustering in portfolio choice) to decompose the asset universe into stratified sub-spaces. To understand this, suppose we are dealing with a portfolio optimization problem of 6400 assets, which is a very large number. Each assets is associated with a vector of characteristics . Our methodology aims at decomposing the asset universe into, say, 80 small clusters, according to the factor values , meaning, in each cluster, the factor values of the associated assets are sufficiently close (Note 2). On average, there are 80 assets in each of the clusters. If we determine that 80 is still a relatively large number, we can, inside each cluster, divide it into 10 sub-clusters. Now, we have 80 X 10 = 800 clusters with each cluster containing 8 assets on average. Then, we perform portfolio optimization separately in each of the sub-clusters. After portfolio optimization is done on each of the sub-clusters, we treat each optimized sub-cluster as one synthetic asset based on the optimal weights and perform portfolio optimization among the 10 synthesized assets.
In this way, starting from the sub-clusters of the lowest level, we obtain the optimal portfolio weights based on the parameters input from Section 2.1 and the methodology to be illustrated in Section 2.3. By working from the lowest to highest level, we will obtain the optimal portfolio weights corresponding to each of the intermediate sub-spaces, and finally the weights on each asset.
The above analysis uses iterated clustering approach to build the hierarchy of clusters. Of course, hierarchical clustering can be directly applied here. In addition to the clustering approach, we can create various criteria, industry classification, as an example, based on the factor values to partition the asset universe into small buckets, with the assets in each bucket representing some similar behaviors (Note 3).

The Algorithm
Assume that we are going to maximize an objective function (Note 4), where denotes the portfolio weights and is the rate of return vector of a certain class of assets. There are constraints on , namely and are, potentially, nonlinear functions of . A general formulation of the static portfolio optimization problems can be found in (Homescu, 2014). In this section, we outline a simulation and machine learning-based approach to obtain the optimized weights . Assume that Equation (4)  It would be interesting to discuss the utility loss introduced by this hierarchical construction. It can be easily understood that, at the last level of clustering, we perform optimization for each of the subsets and the final global weights are proportional to the weights in each final cluster. Of course, this methodology is sub-optimal compared to the global optimization on the whole asset universe.
However, we have gains in terms of a faster computational speed, less resources requirement, less severe propagation of estimation errors, and the elimination of the potential corner solutions or local optimum. Moreover, the portfolio optimization is done in a bottom-up way, but the clustering is top-down, therefore our methodology enjoys the benefit of both approaches. Moreover, the reason to use clustering approach in Step 4 above is to reduce the computation burden when M is very large and the evaluations of objective function are time consuming.

Theoretical Convergence
In this section, we discuss the global convergence of the proposed approach based on three critical assumptions below.

Assumption 1 (Completeness)
The random number generator satisfies the following. Suppose that the size of random numbers generated is , and the random numbers generated by form a set . Then we have the fact that is always dense in the compact set Assumption 1 ensures that, as we sample more random realizations for the portfolio weight vectors, any point in is reachable with the samples generated. Therefore, the solution obtained is truly the global optimum.

Assumption 3 (Continuity) The solution to the optimization problem is continuous with respect to , This means that
where is the original optimization problem with constraints.
This assumption above ensures that, given more simulated samples, the approximate solution will converge to the true one. The assumption is the property of the optimization problem itself and should be verified in a case-by-case manner. Because of the smoonth dependency of the solution of, for example, the mean-variance optimization problem to its input variables, it is easy to see that the above three assumptions are automatically satisfied. Combining the above three assumptions, we have the theorem below as our main theoretical result.
Theorem 1 (Global Convergence) Under Assumptions 1, 2 and 3, our algorithm output is convergent to the unique global optimal solution.

Dynamic Portfolio Optimization
In a dynamic portfolio optimization problem, we try to solve the Bellman equation where R is the immediate reward function, is a mapping from the state space S to the action space A and is called the policy function. denotes the probability transition matrix and is the value function. Last, is the discount factor process, which is often taken as a constant. The goal is to find an optimal policy function _ such that the value function is maximized. Of course, in general, is nonlinear in both time and state s (Note 5  (8), we generate M independent copies of , therefore M different functional forms of , and for each copy, compute the value function via Monte-Carlo simulation based on the data generating process for and Equation (8), and use the method proposed in Section 2.3.1 to determine the best choice of among the M independent samples. Last, the data generating process for can, alternatively, be replaced by a non-parametric inference directly using historical relationship.

Simulated Data and Static Portfolio Choice
Assume that there is an m-dimensional vector process , whose data generating process (DGP) is Where is another stochastic process (Note 6) and . The asset return vector is denoted by , which is n-dimensional. We have approximately the following regression relationship Here is considered as a small perturbation term, which mimicks the effect of missing factors or measurement

errors. Further observe that
where h is potentially a nonlinear function of and is the pricing error term. The detailed configurations are described below. The DGP for the factor process of each stock is chosen as an ARMA-GARCH model The factor f is n-dimensional (Note 7), is n-dimensional, is , is and the error term , where the correlation generator P is an lower triangular matrix with squared sum of each row being 1.
is an n-dimensional independent Gaussian process with mean 0 and variance 1. The parameter set is generated randomly accordingto uniform distributions and the values ensure that the ARMA-GARCH models are stationary. The n-dimensional return process satisfies (Note 8) and is an n-dimensional uniformly distributed random vector serving as the perturbation term, accounting for missing factors or measurement errors. The number of factors is 1, the number of assets n = 1000 and the number of time periods T = 250. The number of clusters is , i.e., the integer part of . The portfolio weights are constrained within [0, 1] and sum up to 1. The objective function is the classical mean-variance quadratic one. The equity curve of the out-of-sample optimization results is displayed in Exhibit 1 below.

Figure 1. Equity Curve for Simulation Study
It is obvious that the method performs well in an artificial simulation environment, according to the equity curve.
The x-axis is the number of periods, which is up to 250, and the y-axis is the value of the equity curve. From the plot we can see that the equity curve increases stably from 0 to approximately 1 with very limited drawdowns. This is inline with our expectations: in a simulation environment, we know and are able to recover exactly the functional relationship between factors and expected future returns. Of course, there are also small negative returns appearing along the equity curve. This is caused by the small perturbation term in Equation (10) and the fact that the realization of future returns can deviate from their expected values, which is illustrated by the pricing error term .

The Data
The data are from WRDS. Stock returns, prices, and the number of shares outstanding are downloaded from CRSP monthly files, while the accounting data are from Compustat fundamentals annual files. We exclude financial firms (SIC code between 6000 and 6999) from the sample and consider only NYSE, AMEX, and NASDAQ firms with common stocks (SHRCD = 10 or 11). We sample accounting data used for the construction of characteristics in calendar year t from the statements with the fiscal year end t-1. Based on the data that are available from CRSP, Compustat, IBES, and researchers' websites, we analyze the out-of-sample predictability of 94 firm characteristics that are collected from (Green, Hand & Zhang, 2017). The sample period ranges from 198001 to 201812.

The Methodology
To carry out the analysis, some details have to be determined. The objective function is , where ISSN 1927-5986 E-ISSN 1927 are the conditional expected return, computed via machine learning regression, and empirical volatility. is the empirical value computed for every simulated portfolio weight vector w using past 6 months' asset return data.
Portfolio weights are constrained within (0; 1) and they sum up to 1. The number of sample weights is 1; 000; 000 universally and that of the clusters is the squaredroot of the number of assets to be optimized. The conditional expected returns are estimated via a 94-factor (Note 9) regression-based asset pricing model implemented with Python function XGBRegressor provided by module xgboost. The cross-section regression is done in a rolling window manner. The prediction is based on monthly data. The clustering is done by the scores computed via equal weights on the factor values. The optimal portfolios are generated at the beginning of each period and are held until the end of the period.

The Results
The backtesting results are summarized in the equity curve plot in Figure 2, where the upper plot contains log-net equity curve of portfolio optimization strategies and equal weight portfolios and the lower plot represents the excess returns of the portfolio choice strategies relative to the -portfolio, and the performance metrics in Table 3. From Figure 2, we can see that the mean-variance strategy performs best, and is followed by minimum variance strategy.
Both curves are above the -portfolio curve. The performance of all three curves (Note 10) are significantly above the S&P500 index. Table 3 contains performance metrics. Vol means annualized volatility, IR means information ratio, SR is Sortino ratio, MDD represents maximum drawdown and CR stands for Calmar ratio.
As is known, the quality of the optimized portfolio depends crucially on the inputs. An accurate estimate of conditional expected returns and volatilities ensures good performance of portfolio optimization results. We attribute the good performance in U.S. market to a long period of bull market, which makes the prediction easier, and the asset pricing factors carefully chosen, which can be a good set of proxies for prediction purposes. Last, our numerical framework enables a fast and efficient computation of the optimal portfolios.

The Data
The data for backtesting are downloaded from UQER platform. The stocks that we study are listed on Shanghai and Shenzhen Stock Exchanges and are included in CSI800 index. We exclude the firms marked ST or have been through the IPO process for less than two months. Similar to the U.S. case, we take fundamental data used for the construction of characteristics in calendar year t from the financial statements with the fiscal year ending in year t-1. Due to the difference in data availability and market structure between China and U.S., we resort to the appendix of (Fang, 2019), instead of (Green, Hand & Zhang, 2017), for the construction of firm characteristics. The factors used by (Fang, 2019) are largely the same as those used by (Green, Hand & Zhang, 2017), but (Fang, 2019) has adapted the individual factors based on Chinese market realities. Finally, we constructed 93 firm characteristics. The sample period ranges from 200701 to 202012.

The Methodology
The objective function is , where are the conditional expected return and empirical volatility associated with the portfolio is empirical values computed for every simulated portfolio weight vector w using past 6 months' asset return data. Portfolio weights are constrained within (0; 1) and they sum up to 1. The number of sample weights is 1; 000; 000 universally and that of the clusters is the squared-root of the number of assets to be optimized. The conditional expected returns are estimated via a 94-factor (Note 11) lead-lag regression model implemented with Python function RidgeCV provided by module sklearn. The cross-section regression is done in a rolling window manner. The prediction is based on monthly data. The cross section is either the CSI300 or CSI800 index stocks. The optimal portfolios are computed at the beginning of each period and are held until the end of the period.

The Results
The backtesting results are summarized in the equity curve plots in Figures 4, 5 and 6 and the performance metrics in Table 7. Here MV means mean-variance, MinVAR represents minimum variance, RP denotes risk parity and EW is -portfolio. In China A shares market, the simple single period optimization methods reveal significant excess returns as can be observed in Figures 4 to 6. Tables 7 and 8, again, documents the performance metrics. Various risk and reward indexes point out that mean-variance method is the best strategy among the three competing methods and CSI300 index bears minimum IR, SR and CR.

Conclusion
In this paper, inspired by the methodology introduced in (Zhang, 2020b), we document a novel four-step portfolio optimization framework and test it with simulated and real financial data in U.S. equity and China A-share markets. Our results reveal superior returns over the out-of-sample testing periods  in both markets, which illustrates the usefulness of our methodology, that is not only a numerical framework, but also contributes to the literature of large scale optimizations. The empirical study of our proposed dynamic portfolio choice method via reinforcement learning is both interesting and important, yet postponed to future research. In addition, our methodology can be extended to fixed income and option portfolio selection, combining the work of (Zhang, 2020a) and (Zhang, 2020b), which we leave to the interested readers as exercises.