Titanic Analysis Using Python
07 Feb 2018
Since I have always felt that mini-projects teach me more than any course could on its own, I have decided to use a well-known dataset, the Titanic dataset, to perform a basic analysis and learn Python along the way. I have watched a lot of videos and even practiced writing Python, but I haven’t really performed an analysis in it the way I have with Stata and R. I plan to do some form of predictive analysis (some machine learning algorithm). Based on the material from my Johns Hopkins Coursera course (Practical Machine Learning), there are basically six things to consider:
What is the Question?
What Data (or Input)?
What are the Features?
What Algorithm should you use?
What are the Parameters?
How does the model Evaluate?
The first thing I’m going to do is get familiar with the dataset so I know what questions are reasonable to ask of it. The dataset I’m going to use is from the Kaggle site, and can be obtained here. I’ll dig into this dataset tomorrow if possible!
10 Feb 2018
The question is: will we be able to predict mortality based on passenger status, as determined by gender, location on the ship, and other factors?
Setup for the project: I open the command line and type python --version to check the Python version (3.6.0), and then I type jupyter notebook to open Jupyter Notebook on a localhost port so I can save my code for the future.
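The same version check can also be done from inside a notebook cell, which confirms that the notebook kernel is running the Python I expect. This is only a minimal sketch using the standard library; the helper name python_version_string is made up for illustration.

```python
import sys

def python_version_string(info=sys.version_info):
    """Format a version tuple as 'major.minor.micro' (hypothetical helper)."""
    return f"{info[0]}.{info[1]}.{info[2]}"

# Print the version of the interpreter that is running this cell
print(python_version_string())
```

If the number printed here differs from what python --version reports on the command line, the notebook is probably using a different Python installation.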
I need to actually start git for that folder so I can have version control and sync with GitHub for this project. To do that I simply change the directory to that folder and type git init.
I created a new repo on GitHub and copied its URL. I then entered the following commands to push my notebook into the repo so I can track versions of the code.
git add .
git commit -m "first commit"
git remote add origin https://github.com/davidcarnahan/python-titanic-analysis-project.git
git remote -v
git push -u origin master
Now I’m ready to get started. I’ll try to pick this back up later tonight.
16 Feb 2018
Ahhhhhhhh. No one in the house and a full night of coding ahead. Is there anything better? Well, there are a lot of better things but this is one of the simple pleasures I enjoy – just like going to the bookstore.
… Three hours later …
It’s going to take some practice to be able to do even simple descriptive/exploratory graphics in Python. I remember going through the same thing in R … maybe this will become second nature in time.
The following code [what little is there] is what I accomplished tonight:
#import the libraries used below
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#import training dataset
train = pd.read_csv("../titanic-project/titanic-train.csv", index_col="PassengerId")
#view first 5 rows of dataset
train.head()
#check out descriptive stats on numeric variables
train.describe()
#pare down dataframe and graph pair plot for correlations
train2 = train.loc[:, ["Survived", "Age", "SibSp", "Fare"]]
sns.pairplot(train2, diag_kind='kde', plot_kws={'alpha': 0.2})
#draw histogram for Fare (distplot is deprecated in newer seaborn releases)
sns.distplot(a=train2["Fare"], hist=True, kde=False, rug=True, bins=20)
#show the figures through matplotlib, since sns.plt is not available in newer seaborn
plt.show()