Python Exercises
Lane: General Data Analysis
Tools: Python
GitHub: Check out my Python scripts for this project here!
Technical Analysis
To complete these exercises, I first needed to import all required Python packages and libraries. To ensure that only relevant dependencies were included, I reviewed all project tasks before writing any code and identified the specific libraries each one needed. Once identified, these packages were imported, and the LossClaims, CreditCharges, and InsuranceCharges datasets were loaded using pd.read_csv().
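The loading step can be sketched as follows. Since the actual CSV files are not included here, a small in-memory CSV stands in for them; the column names are illustrative only.

```python
import io
import pandas as pd

# Stand-in for one of the real files; the actual exercises call
# pd.read_csv() on the LossClaims, CreditCharges, and InsuranceCharges
# CSV paths. A file-like object behaves the same as a path.
csv_text = io.StringIO(
    "AccountNumber,AnnualCharges\n"
    "1001,5200.50\n"
    "1002,3100.75\n"
)

creditcharges = pd.read_csv(csv_text)
print(creditcharges.shape)  # (rows, columns)
```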
CreditCharges Dataset
To conduct an initial exploratory analysis of the CreditCharges dataset, I utilized four Pandas functions: .head(), .info(), .describe(), and .nunique(). The .head() function was used to display the first five rows of the dataset, providing an initial view of the columns and example values. This preview was useful for understanding the structure of the data prior to manipulation. The .info() function was then used to obtain structural details, revealing that the dataset contained 11 columns and 5,896 rows, with both float64 and object data types. This function also showed that two columns, Unnamed: 9 and Unnamed: 10, contained zero non-null values, indicating that they were entirely empty. As a result, these columns were removed from the dataset. The .describe() function was then applied to generate descriptive statistics for the numerical variables, and .nunique() was used to identify the number of unique values in each column.
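The four exploration calls and the removal of the empty Unnamed columns look roughly like this. The toy frame below is a hypothetical stand-in (the real dataset has 5,896 rows and 11 columns); only the two Unnamed column names come from the writeup.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the CreditCharges dataset, including the
# two fully empty columns that .info() revealed in the real data
creditcharges = pd.DataFrame({
    "Annual Income": [52000.0, 61000.0, 47000.0],
    "Annual Charges": [5200.0, 7400.0, 4100.0],
    "Gender": ["Female", "Male", "Female"],
    "Unnamed: 9": [np.nan, np.nan, np.nan],
    "Unnamed: 10": [np.nan, np.nan, np.nan],
})

print(creditcharges.head())      # first five rows
creditcharges.info()             # column dtypes and non-null counts
print(creditcharges.describe())  # summary statistics for numeric columns
print(creditcharges.nunique())   # unique values per column

# Drop the two columns that .info() showed contained zero non-null values
creditcharges = creditcharges.drop(columns=["Unnamed: 9", "Unnamed: 10"])
```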
Next, I checked the CreditCharges dataset for duplicate and missing values. A search for duplicate rows identified 896 of them. These rows were entirely empty, which is what caused them to be flagged as duplicates. Since they contained no data, every copy was removed using keep=False. After this step, a check for missing values returned zero rows with null values, confirming that all missing data had been addressed.
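A minimal sketch of the duplicate and null checks, on a hypothetical frame with one duplicated pair. With keep=False, .duplicated() flags every member of a duplicate group, so filtering on its negation drops all copies, matching the approach described above.

```python
import pandas as pd

# Hypothetical frame with one fully duplicated row (account 1002)
creditcharges = pd.DataFrame({
    "AccountNumber": [1001, 1002, 1002, 1003],
    "AnnualCharges": [5200.0, 3100.0, 3100.0, 4700.0],
})

# keep=False marks every copy of a duplicated row, not just the extras
dupes = creditcharges[creditcharges.duplicated(keep=False)]
print(len(dupes))

# Remove all rows belonging to a duplicate group
creditcharges = creditcharges[~creditcharges.duplicated(keep=False)]

# Confirm no missing values remain
print(creditcharges.isnull().sum().sum())
```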
Using the cleaned CreditCharges dataset, I constructed a multiple linear regression model incorporating all relevant variables. To simplify model construction, I first standardized the column names. I identified Annual Charges as the dependent variable and selected Annual Income, Household Size, Education, Hours Per Week Watching TV, Age, Gender, and Exceeded Credit Limit in Past 12 Months as independent variables. The Account Number variable was excluded, as it serves only as an identifier and has no explanatory value with respect to Annual Charges.
Categorical variables—Education, Gender, and Exceeded Credit Limit in Past 12 Months—were transformed into dummy variables. These included indicators such as Education_High school, Education_Doctoral, Education_Junior high, Education_Master, Education_Primary school, Education_Undergraduate, Gender_Female, Gender_Male, ExceededCreditLimitPast12Months_No, and ExceededCreditLimitPast12Months_Yes. A new dataframe, creditchargesDUMMIES, was created as a copy of the original dataset with the categorical columns replaced by their corresponding dummy variables.
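The dummy-variable step can be sketched with pd.get_dummies(). The frame below is a toy stand-in carrying only the three categorical columns plus the target; the resulting indicator column names follow the pattern described above.

```python
import pandas as pd

# Hypothetical subset of CreditCharges; the real data also includes
# income, household size, TV hours, and age
creditcharges = pd.DataFrame({
    "Education": ["High school", "Master", "Undergraduate"],
    "Gender": ["Female", "Male", "Female"],
    "ExceededCreditLimitPast12Months": ["No", "Yes", "No"],
    "AnnualCharges": [5200.0, 7400.0, 4100.0],
})

# Replace each categorical column with 0/1 indicator columns
creditchargesDUMMIES = pd.get_dummies(
    creditcharges,
    columns=["Education", "Gender", "ExceededCreditLimitPast12Months"],
)
print(creditchargesDUMMIES.columns.tolist())
```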
The data was then split into training and testing subsets using a 70/30 split. Prior to model fitting, numeric variables were scaled to ensure comparability across different measurement units. Scaling places the predictors on a common footing, so that variables measured in larger units do not dominate any magnitude-based comparison of coefficients. After preprocessing, the regression model was built and fitted using the selected variables.
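The split, scale, and fit pipeline can be sketched with scikit-learn. The features below are synthetic stand-ins (only three numeric predictors, with an arbitrary income-driven target), so the numbers are illustrative rather than a reproduction of the reported model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared CreditCharges features
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "AnnualIncome": rng.normal(55000, 12000, 200),
    "HouseholdSize": rng.integers(1, 6, 200).astype(float),
    "Age": rng.normal(45, 12, 200),
})
y = 0.05 * X["AnnualIncome"] + rng.normal(0, 800, 200)

# 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale the numeric predictors so units are comparable
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit the multiple linear regression and score it on held-out data
model = LinearRegression().fit(X_train_s, y_train)
r2 = r2_score(y_test, model.predict(X_test_s))
print(round(r2, 2))
```

Note that the scaler is fitted on the training split only and reused on the test split, which avoids leaking test-set statistics into preprocessing.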
The resulting model produced an R² value of 0.36, indicating that the independent variables explain a relatively small portion of the variance in Annual Charges. This low explanatory power suggests limited predictive accuracy, which is reflected in the observed differences between the AnnualCharges and predictedAnnualCharges values. A coefficient matrix was also generated to evaluate variable importance, and the results showed that none of the predictors exhibited strong or statistically meaningful contributions to the dependent variable.
LossClaims Dataset
As with the CreditCharges dataset, I began exploring the LossClaims dataset using .head() to review the structure and values. The dataset contained the following columns: Claim Number, Date Received, Incident Date, Airport Code, Airport Name, Airline Name, Claim Type, Claim Site, Item, Claim Amount, Status, Close Amount, and Disposition. I then used the .info() function to assess the dataset’s structure, which revealed 13 columns and 94,848 rows, a combination of object and float64 data types, and no missing values. Chaining .describe().T generated transposed descriptive statistics for the Claim Amount and Close Amount variables, and .nunique() was used to determine the number of unique values in each column. For example, Claim Amount had 27,013 unique values, while Disposition had only three.
I then checked the dataset for duplicate and missing values. When identifying duplicates using lossclaims[lossclaims.duplicated(keep=False)], multiple duplicate rows were found. However, these rows contained meaningful information, so duplicates were removed while retaining the first occurrence by using keep='first'. As previously noted, no rows contained missing values requiring removal.
The first primary task for the LossClaims dataset involved renaming columns to remove spaces, which simplifies coding and improves readability. This was accomplished by assigning a new list of names to the DataFrame’s .columns attribute. Next, I reviewed and corrected data types where necessary. The Date Received column was initially stored as an object and was converted to a datetime type. Although Incident Date was also stored as an object, its values were already formatted as datetime data and therefore did not require conversion.
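The renaming and type conversion can be sketched as follows; the two-row frame and its date values are hypothetical stand-ins for the real columns.

```python
import pandas as pd

# Hypothetical LossClaims slice with spaces in the column names
lossclaims = pd.DataFrame({
    "Claim Number": ["2004042", "2004107"],
    "Date Received": ["4/1/2004", "4/2/2004"],
    "Claim Amount": [150.0, 320.5],
})

# Rename columns by assigning a new list to the .columns attribute
lossclaims.columns = ["ClaimNumber", "DateReceived", "ClaimAmount"]

# Convert the object-typed date column to a true datetime type
lossclaims["DateReceived"] = pd.to_datetime(lossclaims["DateReceived"])
print(lossclaims.dtypes)
```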
I then created a visualization to display the proportions of claim dispositions. To accomplish this, I grouped the data by Disposition and used the count of Disposition as the values. A pie chart was generated to illustrate these proportions, as pie charts are effective for displaying relative shares. Using autopct='%1.0f%%', the chart showed that 56% of claims were denied, 24% were approved in full, and 19% were settled.
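The grouped pie chart can be sketched like this. The toy data is constructed so the counts mirror the reported shares; the disposition labels are assumptions about the real category names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripting
import pandas as pd

# Hypothetical rows sized to echo the reported shares
# (56% denied, 24% approved in full, ~19% settled)
lossclaims = pd.DataFrame({
    "Disposition": ["Deny"] * 56 + ["Approve in Full"] * 24 + ["Settle"] * 20,
})

# Group by Disposition and use the count of Disposition as the values
disposition_counts = lossclaims.groupby("Disposition")["Disposition"].count()

# Pie chart with whole-number percentage labels
disposition_counts.plot.pie(autopct="%1.0f%%", ylabel="")
print(disposition_counts)
```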
The next task involved determining the number of claims associated with each claim site. This was done by grouping the data by Claim Site and counting the number of Claim Numbers in each group. The resulting table showed four categories: Checked Baggage (80,540 claims), Checkpoint (14,010 claims), Motor Vehicle (2 claims), and Other (280 claims). Using method chaining, this table was visualized as a bar chart with .plot.bar().
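The claim-site count and chained bar chart can be sketched as follows, on a hypothetical eight-row frame (the real counts are far larger, as reported above).

```python
import matplotlib
matplotlib.use("Agg")
import pandas as pd

# Hypothetical stand-in; real counts were Checked Baggage 80,540,
# Checkpoint 14,010, Motor Vehicle 2, Other 280
lossclaims = pd.DataFrame({
    "ClaimNumber": range(8),
    "ClaimSite": ["Checked Baggage"] * 5 + ["Checkpoint"] * 2 + ["Other"],
})

# Count claims per site, then chain straight into a bar chart
claims_by_site = lossclaims.groupby("ClaimSite")["ClaimNumber"].count()
claims_by_site.plot.bar()
print(claims_by_site)
```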
The first advanced task for the LossClaims dataset was calculating the average claim amount. Using the .mean() function on the Claim Amount column returned an average value of 3,635.41, which aligned with the mean reported by the .describe() function. Next, I created a distribution plot for all claims under 5,000. This involved querying the dataset for claim amounts below 5,000, selecting only the Claim Amount column, and plotting the results using .plot.kde() to generate a density plot. The final advanced task required identifying the top five airports with the highest number of claims. This was accomplished by grouping by Airport Name, counting Claim Numbers, sorting the results in descending order, and applying .head(5).
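The three advanced steps can be sketched together; the claim amounts and airport names below are hypothetical, and the density-plot call is shown commented since it additionally requires SciPy.

```python
import pandas as pd

# Hypothetical LossClaims slice (the real mean claim was 3,635.41)
lossclaims = pd.DataFrame({
    "ClaimNumber": range(6),
    "AirportName": ["JFK", "JFK", "LAX", "ORD", "JFK", "LAX"],
    "ClaimAmount": [120.0, 4800.0, 6200.0, 300.0, 950.0, 75.0],
})

# Average claim amount
avg_claim = lossclaims["ClaimAmount"].mean()

# Claims under 5,000, selected for the density plot
under_5000 = lossclaims.query("ClaimAmount < 5000")["ClaimAmount"]
# under_5000.plot.kde()  # density plot; requires scipy

# Top airports by number of claims
top_airports = (
    lossclaims.groupby("AirportName")["ClaimNumber"]
    .count()
    .sort_values(ascending=False)
    .head(5)
)
print(avg_claim)
print(top_airports)
```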
The final task for the LossClaims dataset involved identifying patterns or trends in Close Amount over time. Time-series analysis requires consistent time intervals, but the dataset’s time data was irregular. To address this, I extracted the year, month, and day components from the Date Received column. Using the newly created Month variable, I analyzed Close Amount trends over time and identified a pattern showing higher Close Amounts between December and February. This trend likely reflects increased travel during the holiday season, with claims peaking in February due to delayed resolution of lost or damaged items from holiday travel.
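Extracting the calendar components and aggregating Close Amount by month can be sketched as follows; the four dated rows are hypothetical stand-ins for the irregular real data.

```python
import pandas as pd

# Hypothetical frame with irregularly spaced received dates
lossclaims = pd.DataFrame({
    "DateReceived": pd.to_datetime(
        ["2004-01-05", "2004-01-20", "2004-02-11", "2004-07-03"]
    ),
    "CloseAmount": [400.0, 250.0, 600.0, 90.0],
})

# Extract calendar components for grouping
lossclaims["Year"] = lossclaims["DateReceived"].dt.year
lossclaims["Month"] = lossclaims["DateReceived"].dt.month
lossclaims["Day"] = lossclaims["DateReceived"].dt.day

# Average Close Amount by month to look for seasonal patterns
monthly_close = lossclaims.groupby("Month")["CloseAmount"].mean()
print(monthly_close)
```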
InsuranceCharges Dataset
Using the .head() function on the InsuranceCharges dataset revealed eight columns: age, sex, education, children, smoker, region, bmi, and charges. The .info() function indicated that the dataset contained 1,338 rows, no missing values, and a mix of int64, float64, and object data types. Chaining .describe().T provided transposed descriptive statistics for the age, children, bmi, and charges columns, while .nunique() revealed the number of unique values in each variable. For example, age had 47 unique values, sex had 2, education had 6, children had 6, smoker had 2, region had 4, bmi had 548, and charges had 1,337. The dataset contained no duplicate rows or missing values requiring removal.
The first primary task for the InsuranceCharges dataset was to create a matrix displaying correlation coefficients between charges and the age, children, and bmi variables. The resulting coefficients were 0.299008 for age, 0.198341 for bmi, and 0.067998 for children, indicating relatively weak linear relationships with charges. Next, I identified and removed outliers from the charges column. Outliers were detected using a boxplot and removed using the interquartile range (IQR) method. This process involved calculating the first and third quartiles, determining the IQR, defining upper and lower bounds, and removing values outside those bounds. A subsequent boxplot confirmed that the outliers had been successfully removed.
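The correlation matrix and IQR-based outlier removal can be sketched on a hypothetical six-row frame with one deliberately extreme charge; the bound formulas follow the standard 1.5 × IQR rule described above.

```python
import pandas as pd

# Hypothetical InsuranceCharges slice with one extreme charge value
insurance = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 60],
    "bmi": [27.9, 22.7, 30.1, 26.4, 33.0, 25.8],
    "children": [0, 1, 2, 3, 0, 1],
    "charges": [1600.0, 4400.0, 7300.0, 9800.0, 2900.0, 60000.0],
})

# Correlation matrix between charges and the numeric predictors
print(insurance[["age", "children", "bmi", "charges"]].corr())

# IQR rule: drop charges outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = insurance["charges"].quantile(0.25)
q3 = insurance["charges"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
insurance_clean = insurance[insurance["charges"].between(lower, upper)]
print(len(insurance_clean))
```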
The final primary task involved creating a visualization of average annual charges for males and females across regions. To do this, I constructed a pivot table using region as the index, sex as the columns, and charges as the values. The pivot table was then plotted using plot.bar(), resulting in a bar chart with four regions on the x-axis and two bars per region representing average charges for each sex.
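The pivot table and grouped bar chart can be sketched like this; the four rows are hypothetical, with one female/male pair per region.

```python
import matplotlib
matplotlib.use("Agg")
import pandas as pd

# Hypothetical region/sex/charges rows
insurance = pd.DataFrame({
    "region": ["southwest", "southwest", "northeast", "northeast"],
    "sex": ["female", "male", "female", "male"],
    "charges": [8000.0, 9000.0, 12000.0, 10000.0],
})

# Average charges with region as the index and sex as the columns
pivot = insurance.pivot_table(
    index="region", columns="sex", values="charges", aggfunc="mean"
)

# Grouped bar chart: one cluster per region, one bar per sex
pivot.plot.bar()
print(pivot)
```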
As part of the advanced tasks, I calculated the percentage of individuals classified as obese. This was done using a for loop that iterated through the bmi column and incremented a counter whenever a value of 30 or higher was encountered. The final count was divided by the total number of bmi values and multiplied by 100, yielding an obesity rate of 48.4%.
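The counting loop described above can be sketched as follows on a few hypothetical BMI values (the real column yielded 48.4%).

```python
# Hypothetical bmi values standing in for the real column
bmi_values = [27.9, 33.8, 30.0, 24.5, 31.2]

# Count values in the obese range (BMI >= 30)
obese_count = 0
for bmi in bmi_values:
    if bmi >= 30:
        obese_count += 1

obesity_rate = obese_count / len(bmi_values) * 100
print(obesity_rate)  # 3 of 5 values -> 60.0
```

A vectorized equivalent on a DataFrame column would be `(insurance["bmi"] >= 30).mean() * 100`, which avoids the explicit loop.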
Lastly, for the challenging task, I applied k-means clustering to group observations within the InsuranceCharges dataset. Features were created using the age, children, bmi, and charges variables. The elbow method was then used to determine the optimal number of clusters by plotting inertia values and identifying the elbow point. This analysis indicated that four clusters were appropriate, and a new variable was created to assign each observation to one of the four clusters.
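The clustering workflow can be sketched with scikit-learn's KMeans. The feature matrix below is synthetic (four well-separated groups of four-dimensional points, standing in for age, children, bmi, and charges), so the elbow at k = 4 is built into the toy data rather than a reproduction of the real result.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic features with four well-separated groups, standing in for
# the scaled age, children, bmi, and charges variables
rng = np.random.default_rng(1)
features = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(40, 4))
    for center in (0, 5, 10, 15)
])

# Elbow method: fit k-means over a range of k and record inertia
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(features)
    inertias[k] = km.inertia_

# After reading k=4 off the elbow plot, assign a cluster label
# to each observation
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(features)
print(sorted(set(labels)))
```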