Possible Workflow Outline for a Data Science Project

In 2020, I wrote down this outline for data science projects based on my reading of Applied Predictive Modeling.

Question or problem definition.
- Provide background info.
- Pose question to tackle initially.
Acquire training and testing data.
- We may want to classify or categorize our samples.
Wrangle, munge, prepare, and cleanse the data: NAs, missing values, structure,…
Analyze, identify patterns, and explore the data, EDA.
- Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a correlation between a feature and the solution goal?
Model, predict and solve the problem.
- For the modeling stage, depending on the choice of model algorithm, features may need to be converted to numerical equivalents.
- Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?
Visualize, report, and present the problem solving steps and final solution.
- Visualization may be moved up to work alongside EDA.
- Select the right visualization (hard?).
Submit the results in report format
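
To make the conversion, creation, and correlation ideas above concrete, here is a minimal base R sketch. The data frame `toy` and its columns (`sex`, `sibsp`, `parch`, `survived`) are made up for illustration, and `family_size` is just one possible derived feature, not a recommendation from the book.

```r
# Toy passenger-style data; every value here is made up for illustration.
toy <- data.frame(
  sex      = factor(c("male", "female", "female", "male", "female", "male")),
  sibsp    = c(1, 1, 0, 0, 2, 0),
  parch    = c(0, 0, 0, 2, 1, 0),
  survived = c(0, 1, 1, 0, 1, 0)
)

# Conversion: turn the factor into a 0/1 numeric column.
toy$sex_female <- as.integer(toy$sex == "female")

# Creation: derive a new feature from two existing ones.
toy$family_size <- toy$sibsp + toy$parch + 1

# Correlation: how strongly does each numeric feature track the target?
feature_cor <- cor(toy[, c("sex_female", "family_size")], toy$survived)
print(feature_cor)
```

With real data the correlations will be far less clean than in this toy set, and a correlation alone does not say whether a feature will help a given model.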

Load data & libraries
Preliminary data analysis, str(), summary(), head(), tail()
- Which columns are factors or features?
Check for missing values and errors; is imputation needed?
boxplot(), hist(), or a scatter plot with plot(); tables of survival or not…
Is the data normal or Poisson? Check the distribution, skew, and outliers.
Is the data nominal, categorical, numerical, discrete, or continuous?
Correlation plots
Have near-zero-variance and zero-variance predictors been removed?
Model data using caret(?)
- Titanic data is modeled against Logit, KNN, SVM, NB, DT, RF, ANN
- Provide hyperparameters (tuning) and confidence intervals(?)
- Cross validation
Model evaluation; accuracy, ROC,…
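
A rough sketch of several checklist items above in base R, so it runs without extra packages. The data are generated, the column names (`age`, `fare`, `constant`, `survived`) are hypothetical, median imputation is only one simple option, and the hand-rolled 5-fold loop stands in for what caret's `train()` with `trainControl()` would automate.

```r
set.seed(42)  # make the toy example reproducible

# Synthetic stand-in for a real dataset; all values are generated, not real.
n <- 200
dat <- data.frame(
  age      = rnorm(n, mean = 30, sd = 10),
  fare     = rexp(n, rate = 1 / 30),
  constant = rep(1, n)                 # a zero-variance column
)
dat$survived <- rbinom(n, 1, plogis(-1 + 0.03 * dat$fare))
dat$age[sample(n, 20)] <- NA           # inject some missing values

# Preliminary look at structure and summary statistics.
str(dat)
summary(dat)

# Count missing values per column, then impute age with its median.
na_counts <- colSums(is.na(dat))
dat$age[is.na(dat$age)] <- median(dat$age, na.rm = TRUE)

# Drop zero-variance columns (caret's nearZeroVar() is a stricter check).
variances <- sapply(dat, var)
dat <- dat[, variances > 0, drop = FALSE]

# 5-fold cross-validation of a logistic regression, scored by accuracy.
k <- 5
folds <- sample(rep(1:k, length.out = n))
acc <- numeric(k)
for (i in 1:k) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  fit   <- glm(survived ~ age + fare, data = train, family = binomial)
  prob  <- predict(fit, newdata = test, type = "response")
  acc[i] <- mean((prob > 0.5) == (test$survived == 1))
}
mean(acc)  # average held-out accuracy across the folds
```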

See also: mrisdal provided an outline in .rmd format.

Introduction; Load and check data
Feature Engineering; What’s in a name? Do families sink or swim together? Treat a few more variables, PCA, LDA…
Missingness; Sensible value imputation, Predictive imputation, Feature Engineering: Round 2
Prediction; Split into training & test sets, Building the model, Variable importance, Prediction!
Conclusion
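
The Prediction step in that outline might look something like the base R sketch below. It uses generated data and a logistic regression, which is my stand-in and not necessarily the model used in the referenced notebook. The variables `x1`, `x2`, and `noise` are hypothetical, and ranking predictors by absolute z statistic is only a crude proxy for variable importance.

```r
set.seed(1)

# Generated data for illustration only; `noise` is unrelated to the outcome.
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
dat$y <- rbinom(n, 1, plogis(2 * dat$x1 - 1 * dat$x2))

# Split into training (80%) and test (20%) sets.
train_idx <- sample(n, size = round(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Build the model on the training set only.
fit <- glm(y ~ x1 + x2 + noise, data = train, family = binomial)

# Crude variable importance: absolute z statistic of each coefficient,
# skipping the intercept in row 1.
z <- abs(coef(summary(fit))[-1, "z value"])
importance <- sort(z, decreasing = TRUE)
print(importance)

# Predict on the held-out test set and summarize the result.
pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
confusion <- table(predicted = pred, actual = test$y)
accuracy <- mean(pred == test$y)
print(confusion)
print(accuracy)
```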
