Possible Workflow Outline for Data Science Project
In 2020, I wrote down this outline for data science projects based on my reading of Applied Predictive Modeling.
Workflow (goals) stages - 7 stages
- Question or problem definition.
  - Provide background info.
  - Pose the question to tackle initially.
- Acquire training and testing data.
  - We may want to classify or categorize our samples.
- Wrangle, munge, prepare, and cleanse the data: NAs, missing values, structure,…
- Analyze, identify patterns, and explore the data (EDA).
  - Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a correlation between a feature and the solution goal?
- Model, predict, and solve the problem.
  - For the modeling stage, depending on the choice of algorithm, features may need to be converted to numerical equivalents.
  - Can we create new features from an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?
- Visualize, report, and present the problem-solving steps and final solution.
  - Visualization may be moved up to work alongside EDA.
  - Select the right visualization (hard?).
- Submit the results in report format.
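The "convert features to numerical equivalents" step can be illustrated with one-hot encoding. This is only a minimal sketch in plain Python (not R or caret), and one of several possible encodings. The helper name `one_hot` and the toy `embarked` column are made up for illustration.

```python
def one_hot(values):
    """Map each category to a 0/1 indicator row (columns in sorted category order)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # mark the column that matches this value
        rows.append(row)
    return categories, rows


# Hypothetical toy column, not taken from any real dataset
embarked = ["S", "C", "S", "Q"]
categories, encoded = one_hot(embarked)
```

Here `categories` holds the column order and `encoded` has one indicator row per sample. Some algorithms, such as tree-based models, may not need this conversion at all.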
Practical Stages
- Load data & libraries
- Preliminary data analysis: str(), summary(), head(), tail()
- Which columns are factors or features?
- Check for missing values and errors; is imputation needed?
- boxplot(), hist(), or a scatter plot with plot(); tables of survived or not…
- Is the data normal? Poisson? Check the distribution, skew, and outliers.
- Is the data nominal, categorical, numerical, discrete, or continuous?
- Correlation plots
- Have near-zero-variance or zero-variance predictors been removed?
- Model data using caret(?)
- For example, the Titanic data is modeled with Logit, KNN, SVM, NB, DT, RF, and ANN
- Provide hyperparameters (tuning) and confidence intervals(?)
- Cross validation
- Model evaluation; accuracy, ROC,…
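A few of the checks above (missing values, imputation, near-zero variance, cross-validation folds) can be sketched without any modeling library. This is a rough illustration in plain Python, not what caret does internally. The function names, the thresholds in `near_zero_variance`, and the toy data are all assumptions.

```python
from collections import Counter
from statistics import median


def count_missing(column):
    """Number of None entries in a column."""
    return sum(1 for v in column if v is None)


def impute_median(column):
    """Replace None with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def near_zero_variance(column, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a column that is constant or dominated by a single value.

    The thresholds are assumptions chosen for illustration.
    """
    counts = Counter(column).most_common()
    if len(counts) == 1:
        return True  # zero variance: only one distinct value
    freq_ratio = counts[0][1] / counts[1][1]  # most common vs. second most common
    percent_unique = 100.0 * len(counts) / len(column)
    return freq_ratio > freq_cut and percent_unique < unique_cut


def kfold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# Made-up toy data
age = [22, None, 30, 40, None]
flag = [0] * 99 + [1]

missing = count_missing(age)
age_filled = impute_median(age)
flag_is_nzv = near_zero_variance(flag)
folds = list(kfold_indices(5, 2))
```

In practice the rows would usually be shuffled before splitting into folds, and median imputation is only one sensible default among several.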
See also: mrisdal provided an outline in .Rmd format:
- Introduction: Load and check data
- Feature Engineering: What’s in a name? Do families sink or swim together? Treat a few more variables, PCA, LDA…
- Missingness: Sensible value imputation, Predictive imputation, Feature Engineering: Round 2
- Prediction: Split into training & test sets, Building the model, Variable importance, Prediction!
- Conclusion
- Conclusion
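As a rough illustration of the feature engineering and splitting steps in this outline, here is a small Python sketch. It assumes that "What's in a name?" refers to pulling a title out of a "Surname, Title. Given names" string, and that "Do families sink or swim together?" refers to a family-size feature. Both readings are my assumptions, and the names and helper functions below are made up.

```python
import random
import re


def extract_title(name):
    """Return the text between the comma and the first period, or None."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else None


def family_size(siblings_spouses, parents_children):
    """Passenger plus accompanying relatives (assumed definition)."""
    return siblings_spouses + parents_children + 1


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed and split off a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical names, not real passengers
titles = [extract_title(n) for n in ["Doe, Mr. John", "Roe, Miss. Jane"]]
size = family_size(1, 2)
train, test = train_test_split(list(range(10)))
```

The fixed seed keeps the split reproducible between runs. A stratified split may be preferable when the classes are imbalanced.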