Work

GDSC Entry Form Analysis and Selection

EDA
Pandas
Seaborn

An exploratory data analysis of the club registration for GDSC at my college.

the github repository preview of the analysis



Overview

An analysis of the first registration form of the GDSC (Google Developer Student Club) at our college. It aimed to classify and select students based on their skills while giving insights into their distribution along various dimensions. It was my first ever work on real-world data, and it made me realize that the real world is messy af.

The analysis is segregated into the following parts:

Preprocessing the data

  • Column Renaming

    First we will start with renaming our columns. Some rules of thumb for column names:

    • A column name should have two to three words at most.
    • All letters must be lowercase.
    • There should be no spaces between words.
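
A minimal sketch of how these rules could be applied with pandas. The form headers below are hypothetical, and joining words with underscores is my assumption (the rules only say "no spaces"). Cutting a header down to its first three words is naive, so a hand-written mapping is probably the better choice for long questions.

```python
import re
import pandas as pd

def clean_column_name(name: str, max_words: int = 3) -> str:
    """Lowercase a header, keep at most `max_words` words, join them with underscores."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return "_".join(words[:max_words])

# Hypothetical form headers; the real form's questions are not shown here.
df = pd.DataFrame(columns=["Full Name", "Which course are you enrolled in?", "Timestamp"])

# `rename` accepts a function and applies it to every column label.
df = df.rename(columns=clean_column_name)
print(list(df.columns))
```

Two long questions that start with the same three words would collide after truncation, so it is worth checking `df.columns.is_unique` afterwards.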

  • Dropping unnecessary columns

    Here we will drop any columns that we do not need for our analysis.


  • Parsing Data
    • Parsing Date and Time
    • Removing trailing spaces and lowercasing everything
    • Filling all NaNs and funky values
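
The three parsing steps could look roughly like this. The column names, the sample rows, the timestamp format, and the list of "funky" placeholder answers are all assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; names and values are made up.
df = pd.DataFrame({
    "timestamp": ["2000/01/13 10:15:00", "2000/01/16 18:40:12", "not a date"],
    "course": ["  BTech ", "bca", None],
    "os": ["Windows ", "n/a", "-"],
})

# Parse date and time; values that do not match become NaT instead of raising.
df["timestamp"] = pd.to_datetime(
    df["timestamp"], format="%Y/%m/%d %H:%M:%S", errors="coerce"
)

# Placeholder answers that should count as missing (an assumed list).
funky = ["n/a", "na", "-", "none", ""]

for col in ["course", "os"]:
    # Strip surrounding whitespace and lowercase every entry.
    cleaned = df[col].str.strip().str.lower()
    # Turn placeholder answers into NaN, then fill every gap with one label.
    cleaned = cleaned.mask(cleaned.isin(funky))
    df[col] = cleaned.fillna("unknown")

print(df)
```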

  • Encoding values

    For our analysis, we will binary encode the columns that contain links or paragraphs.
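
A small sketch of that binary encoding, assuming two hypothetical columns (`github` and `about`) and treating blank or whitespace-only answers as "not filled":

```python
import pandas as pd

# Hypothetical columns holding a link and a free-text paragraph.
df = pd.DataFrame({
    "github": ["https://github.com/someone", None, "  "],
    "about": ["I like building apps.", "", None],
})

for col in ["github", "about"]:
    # True if the applicant wrote something non-blank, False otherwise.
    filled = df[col].fillna("").str.strip().ne("")
    # Store the flag as 1 / 0.
    df[col] = filled.astype(int)

print(df)
```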


  • Splitting strings

    Some entries, such as tools and languages, have multiple values grouped together in a single string separated by commas (,) and slashes (/). We need to split these strings apart to use the individual values efficiently.
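
One way to do the split with pandas, shown on made-up answers in a hypothetical `languages` column. This is a sketch, not necessarily how the notebook does it.

```python
import pandas as pd

# Hypothetical answers, mixing comma and slash separators.
df = pd.DataFrame({
    "languages": ["c, c++", "python/java", "c,python / sql", "html/css"],
})

# Split on commas or slashes, swallowing any whitespace around the separator.
df["languages"] = df["languages"].str.split(r"\s*[,/]\s*", regex=True)

# One row per (student, language) pair, which is convenient for counting.
long_form = df.explode("languages")
counts = long_form["languages"].value_counts()
print(counts)
```

Note that splitting on the slash also breaks `html/css` into `html` and `css`. If an answer like that should stay as one value, it has to be protected or merged back by hand.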

Exploratory Data Analysis (EDA)

Student distribution across courses

The graph shows that BTech 1st year students registered the most. BTech generally has more registrations because it has many more students than BCA.

student distribution across courses

Field selected by students

The most selected options are, unsurprisingly, programming and coding, but they are too vague to count as fields, so we do not consider them here. After removing these vague options, the most selected field is AI/ML.

field selected by students
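
Dropping the vague answers before counting could be done like this. The sample answers are invented; only the idea of excluding "programming" and "coding" comes from the analysis.

```python
import pandas as pd

# Hypothetical field answers, already lowercased.
fields = pd.Series(
    ["programming", "ai/ml", "coding", "web development", "ai/ml", "programming"],
    name="field",
)

# Answers considered too vague to count as a field.
vague = ["programming", "coding"]

# Count only the remaining, more specific fields.
counts = fields[~fields.isin(vague)].value_counts()
top_field = counts.idxmax()
print(counts)
print("most selected field:", top_field)
```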

Devices used by students

devices used by students

Tools used by students

The most used tool is, unsurprisingly, VS Code, because most of the applicants are novice programmers.

tools used by students

Languages known by students

C and C++ are the most widely known because they are part of the college curriculum. After that, the main ones are Python, HTML/CSS (even though they are not programming languages), Java, JS and SQL.

languages known by students

OS used by students

Windows wins here.

os used by students

There were many fields with personal information in the form. The information itself is useless for analysis, but a simple filled or not-filled flag is useful. The pie charts show that distribution.

documents, links and personal information

Time cast of the registration form

This graph shows the time cast of the registration form. The form was open for a week, from the 13th to the 21st. It shows steady progress in entries over the days, but an abnormal spike of entries from BTech 1st year on the 16th. It seems some announcements were made by the faculty that day.

time cast of the registration form
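
The numbers behind a plot like this can be built by counting entries per day and course. A minimal sketch, with invented timestamps (the year and month are placeholders) and hypothetical column names:

```python
import pandas as pd

# Hypothetical entries; the dates are placeholders, not the real ones.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2000-01-13 09:00", "2000-01-13 12:30", "2000-01-14 10:00",
        "2000-01-16 11:00", "2000-01-16 11:05", "2000-01-16 15:45",
    ]),
    "course": [
        "btech 1st year", "bca 1st year", "btech 1st year",
        "btech 1st year", "btech 1st year", "btech 1st year",
    ],
})

# Entries per calendar day and course: one row per day, one column per course.
daily = (
    df.groupby([df["timestamp"].dt.date, "course"])
      .size()
      .unstack(fill_value=0)
)

# Running total per course, i.e. the "progress" of the form over the days.
cumulative = daily.cumsum()

# The day with the most entries from one course is where a spike would show up.
spike_day = daily["btech 1st year"].idxmax()
print(daily)
print("busiest day for btech 1st year:", spike_day)
```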

Gender Distribution of students

The dataset has no gender column because the form didn’t have one, so I used an API that infers people's gender from their names. The registrations were heavily skewed towards males.

gender distribution of students

Heuristic selection

A weighted-sum heuristic was used to score the students, with technical skills given more weight. A threshold of 18 was used: it is the minimum number of points a student gets if the form is filled correctly and they have the bare minimum of technical skills. A total of 42 students were selected out of 154.
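
A rough sketch of such a weighted-sum score. Only the threshold of 18 comes from the analysis. The feature names, the weights, and the sample students are hypothetical, and treating the threshold as inclusive (`>=`) is my assumption.

```python
import pandas as pd

# Hypothetical weights: the write-up only says technical skills weigh more.
WEIGHTS = {
    "n_languages": 3,  # technical
    "n_tools": 3,      # technical
    "has_github": 2,
    "has_resume": 1,
    "has_about": 1,
}
THRESHOLD = 18  # the threshold stated in the analysis

# Made-up students with already encoded features.
students = pd.DataFrame({
    "name": ["a", "b", "c"],
    "n_languages": [4, 1, 2],
    "n_tools": [3, 1, 3],
    "has_github": [1, 0, 1],
    "has_resume": [1, 1, 0],
    "has_about": [1, 1, 1],
})

# Weighted sum: multiply each feature by its weight and add up per student.
weights = pd.Series(WEIGHTS)
students["score"] = students[list(WEIGHTS)].dot(weights)

# Keep everyone at or above the threshold.
selected = students[students["score"] >= THRESHOLD]
print(students[["name", "score"]])
print("selected:", selected["name"].tolist())
```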

gender distribution of selected students

This graph shows the gender distribution of selected students. It is again highly skewed towards males, with only one girl being selected.

Hence, a ranking of girls was calculated separately and the top 10 were chosen from it, bringing the total to 52 students selected into the GDSC.
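
The two-stage selection could be sketched as below. The function name, the column names, and the sample data are hypothetical. I am also assuming the extra picks were drawn from students not already above the threshold, which is what the total of 52 (42 + 10) suggests.

```python
import pandas as pd

def select_with_separate_ranking(scored, threshold, group_col, group_value, top_n):
    """Select everyone at or above `threshold`, then add the `top_n` best-scoring
    members of one group who were not already selected."""
    main = scored[scored["score"] >= threshold]
    # Candidates from the group that did not make the main cut.
    in_group = scored[group_col] == group_value
    not_selected = ~scored.index.isin(main.index)
    pool = scored[in_group & not_selected]
    # Rank the pool by score and take the best `top_n`.
    extra = pool.sort_values("score", ascending=False).head(top_n)
    return pd.concat([main, extra])

# Made-up scores; the real analysis used a threshold of 18 and a top 10.
scored = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "gender": ["male", "female", "female", "female", "female", "male"],
    "score": [25, 20, 15, 12, 9, 10],
})

final = select_with_separate_ranking(scored, 18, "gender", "female", top_n=2)
print(final)
```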

