Titanic Analysis Using Python

07 Feb 2018

Since I have always felt that mini-projects teach me more than any course could on its own, I have decided to use a well-known dataset, the Titanic dataset, to perform a basic analysis and learn Python. I have watched a lot of videos and even practiced writing Python, but I haven’t really performed an analysis in it the way I have with Stata and R. I plan to do some form of predictive analysis (some machine learning algorithm). Based on the material from my Johns Hopkins Coursera course (Practical Machine Learning), there are basically six things to consider:

What is the Question?
What Data (or Input)?
What are the Features?
What Algorithm should you use?
What are the Parameters?
How do you Evaluate the model?
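Here is a minimal sketch of how those six questions might map onto code. The rows are made up for illustration (they are not the Kaggle data), and the "algorithm" is just a survival-rate lookup per group, a simple baseline rather than a real machine learning model. The names `rows`, `train_part`, `test_part`, and `threshold` are placeholders of my own.

```python
import pandas as pd

# Toy rows, made up for illustration only (NOT the real Kaggle data).
rows = pd.DataFrame({
    "Sex":      ["female", "male", "female", "male", "male", "female", "male", "female"],
    "Pclass":   [1, 3, 3, 1, 2, 2, 3, 1],
    "Survived": [1, 0, 1, 1, 0, 1, 1, 1],
})

# 1. Question: can we predict Survived from passenger attributes?
target = "Survived"

# 2. Data: split the toy rows into a training part and a held-out part.
train_part = rows.iloc[:6]
test_part = rows.iloc[6:]

# 3. Features: the column the prediction is allowed to use.
feature = "Sex"

# 4. Algorithm: look up the training survival rate for each feature value
#    (a simple baseline, not a real machine learning model).
rates = train_part.groupby(feature)[target].mean()

# 5. Parameters: the cutoff that turns a rate into a 0/1 prediction.
threshold = 0.5

# 6. Evaluation: share of held-out rows predicted correctly (accuracy).
predicted = (test_part[feature].map(rates) >= threshold).astype(int)
accuracy = (predicted == test_part[target]).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

With only a handful of invented rows the accuracy number means nothing; the point is only to show where each of the six decisions ends up in the code.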

The first thing I’m going to do is get familiar with the dataset so I know what questions are reasonable for it. The dataset I’m going to use is from the Kaggle site and can be obtained here. I’ll dig into this dataset tomorrow if possible!

10 Feb 2018

The question is: will we be able to predict mortality based on passenger status as determined by gender, location on the ship, and other factors?

Setup for the project: I open the command line and type python --version to check the Python version (3.6.0), and then I type jupyter notebook to open Jupyter Notebook on a localhost port so I can save my code for the future.
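The same version check can also be run from inside a notebook cell, which confirms that the notebook kernel uses the interpreter you expect. This is a small sketch using only the standard library; the `REQUIRED` minimum is my own assumption, chosen to match the 3.6 release mentioned above.

```python
import sys

# Minimum version assumed for this project (an assumption, matching 3.6 above).
REQUIRED = (3, 6)

# Major and minor version of the interpreter running this cell.
current = tuple(sys.version_info[:2])
print("Running Python {}.{}".format(*current))

if current < REQUIRED:
    print("Warning: this notebook assumes Python {}.{} or newer".format(*REQUIRED))
```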

jupyter screenshot

I need to actually start git for that folder so I can have version control and sync with GitHub for this project. To do that I simply change the directory to that folder and type git init.

I created a new repo on GitHub and copied the URL. I then entered the following commands to push my notebook into the repo so I can track versions of the code.

git add .
git commit -m "first commit"
git remote add origin https://github.com/davidcarnahan/python-titanic-analysis-project.git
git remote -v
git push -u origin master

Now I’m ready to get started. I’ll try to pick this back up later tonight.

16 Feb 2018

Ahhhhhhhh. No one in the house and a full night of coding ahead. Is there anything better? Well, there are a lot of better things but this is one of the simple pleasures I enjoy – just like going to the bookstore.

… Three hours later …

It’s going to take some practice to be able to do even simple descriptive/exploratory graphics in Python. I remember going through the same thing in R … maybe this will become second nature in time.

The following code (what little there is) is what I accomplished tonight:

#import libraries and training dataset
import pandas as pd
import seaborn as sns
train = pd.read_csv("../titanic-project/titanic-train.csv", index_col = "PassengerId")

#view first 5 rows of dataset
train.head()

#check out descriptive stats on numeric variables
train.describe()

#pare down dataframe and draw a pair plot to look for correlations
train2 = train.loc[ :, ["Survived", "Age", "SibSp", "Fare"]]
sns.pairplot(train2, diag_kind = 'kde', plot_kws = {'alpha': 0.2})

python pairplot
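A pair plot shows the relationships visually; the matching numbers can be read from a correlation matrix. The sketch below uses pandas' `DataFrame.corr()` on a few made-up rows (not the real Kaggle values), with the same four columns as `train2`. The name `toy` is a placeholder of my own.

```python
import pandas as pd

# Made-up values for illustration only (not the real Kaggle rows).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Age":      [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "SibSp":    [1, 1, 0, 0, 0, 0],
    "Fare":     [7.25, 71.28, 7.93, 8.05, 53.10, 51.86],
})

# Pearson correlation for every pair of columns; missing values (NaN)
# are excluded pair by pair rather than dropping whole rows.
corr = toy.corr()
print(corr.round(2))

# Absolute correlation of each variable with the outcome, strongest first.
print(corr["Survived"].drop("Survived").abs().sort_values(ascending=False))
```

On the real data the equivalent call would be `train2.corr()`. Keep in mind that Pearson correlation only captures linear relationships, so it complements the pair plot rather than replacing it.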

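#draw histogram for Fare
#(sns.plt was removed from seaborn, so call pyplot directly;
# distplot itself is deprecated from seaborn 0.11 in favor of histplot)
import matplotlib.pyplot as plt
sns.distplot(a=train2["Fare"], hist=True, kde=False, rug=True, bins=20)
plt.show()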

python histogram
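The counts behind a histogram like this one can also be tabulated without plotting. This is a rough sketch using `pd.cut` with 20 equal-width bins, which approximately mirrors `bins=20` in the plot call. The fares are made up for illustration (not the real Kaggle values), and `fare`, `binned`, and `counts` are names of my own.

```python
import pandas as pd

# Made-up fares for illustration only (not the real Kaggle values).
fare = pd.Series([7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08], name="Fare")

# Cut the fare range into 20 equal-width bins, similar to bins=20 in the plot.
binned = pd.cut(fare, bins=20)

# Count rows per bin, keeping the bins in ascending order.
counts = binned.value_counts(sort=False)

# Show only the bins that actually contain passengers.
print(counts[counts > 0])
```

On the real data the same idea would apply to `train2["Fare"]`.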
