
Data Scientist | Kaggle Master | Published Author | Topmate: https://topmate.io/raj_mehrotra
Spooky-Author-Identification
January 19, 2019 – January 19, 2019
A notebook for the well-known Kaggle competition Spooky Author Identification. The task is to identify the author of a given piece of text. I first cleaned and pre-processed the text using standard NLP techniques such as tokenization, stemming or lemmatization, and stop-word removal. I also created some hand-crafted meta features based on each author's writing patterns. I then used the traditional bag-of-words approach with the TF-IDF Vectorizer and the Count Vectorizer, and trained ML algorithms such as LogisticRegression and Naive Bayes, which are well suited to text data. So far, TF-IDF on top of the Count Vectorizer has given the best results: my submission scored a multi-class log loss of 0.46 on the Kaggle private leaderboard.
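For readers unfamiliar with the metric, here is a minimal sketch of how a multi-class log loss can be computed. The function name, the example probabilities, and the clipping constant are illustrative assumptions, not taken from the notebook, and Kaggle's own scorer may apply extra steps such as row rescaling.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip so that an overconfident 0 or 1 never produces log(0).
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Hypothetical predictions over three authors (class indices 0, 1, 2).
y_true = [0, 1, 2]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
loss = multiclass_log_loss(y_true, y_prob)
print(round(loss, 4))
```

Lower is better: a model that always predicts a uniform distribution over three authors would score ln(3), roughly 1.10.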
Topic-Modelling-using-LDA-and-LSA-in-Sklearn
January 8, 2019 – January 8, 2019
I performed topic modelling on the "A Million News Headlines" dataset on Kaggle. I first pre-processed and cleaned the data, then used the LDA and LSA implementations in the sklearn library. The distribution of words in each topic is also shown.
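A rough sketch of what such a workflow can look like in scikit-learn follows. The six headlines, the number of topics, and the `top_words` helper are hypothetical stand-ins; the notebook's actual preprocessing and settings are not described here.

```python
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical stand-in headlines, not rows from the real dataset.
headlines = [
    "government announces new budget for schools",
    "team wins final after late goal",
    "budget cuts hit rural schools",
    "coach praises team after cup final",
    "minister defends government budget",
    "late goal sends team into cup final",
]

def top_words(components, vocabulary, n=3):
    """Return the n highest-weighted words for each topic row."""
    index_to_word = {i: w for w, i in vocabulary.items()}
    return [
        [index_to_word[int(i)] for i in row.argsort()[::-1][:n]]
        for row in components
    ]

# LDA is usually fit on raw term counts.
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(headlines)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics_lda = lda.fit_transform(counts)

# LSA is a truncated SVD, commonly applied to TF-IDF weights.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(headlines)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics_lsa = lsa.fit_transform(tfidf)

lda_topics = top_words(lda.components_, count_vec.vocabulary_)
lsa_topics = top_words(lsa.components_, tfidf_vec.vocabulary_)
print(lda_topics)
print(lsa_topics)
```

Each row of `components_` holds per-word weights for one topic, which is what a "distribution of words in a topic" plot would typically draw from.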
Word-Embeddings-in-Gensim-and-Keras
December 23, 2018 – December 25, 2018
A simple implementation of word embeddings using the Gensim and Keras libraries. I trained the well-known Word2Vec model in Gensim and, as an alternative, used the Keras embedding layer to generate word embeddings.
Movie-Reviews-NLTK-Sentiment-Analysis-
December 16, 2018 – December 16, 2018
The Movie Reviews dataset, imported from the NLTK library. It has 1000 positive and 1000 negative reviews. I first loaded the dataset into a pandas data frame, which makes processing easier. The next step was to analyze the positive and negative reviews. I also preprocessed the dataset using lemmatization and other standard NLP techniques. To extract features from the text I used the TF-IDF vectorizer from scikit-learn. Lastly, I trained various scikit-learn models on this data.
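To illustrate what TF-IDF weighting does, here is a small pure-Python sketch using the smoothed IDF formula and L2 row normalisation that scikit-learn documents as its defaults. The whitespace tokenizer, the `tfidf` function name, and the three toy reviews are simplifying assumptions, so the numbers will not match the library's output on real data.

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed TF-IDF with L2-normalised rows, one dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical toy reviews, not taken from the NLTK corpus.
reviews = ["a great great film", "a dull film", "great acting"]
vectors = tfidf(reviews)
print(vectors[1])
```

In the second review, "dull" receives more weight than "film" because it appears in fewer documents, which is the property that makes TF-IDF useful for sentiment features.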
Housing-Prices-EDA-and-Regression-Models
December 9, 2018 – December 9, 2018
The well-known House Prices: Advanced Regression Techniques competition on Kaggle. The dataset consists of training and testing sets, each with about 1.46K rows and 81 features describing a house. I first performed an exhaustive EDA to identify underlying trends in the data. I also removed outliers to make the regression models more robust, and treated missing values, imputing wherever needed. Lastly, I trained various regression models from scikit-learn, such as Lasso and Ridge, and tuned their hyperparameters with GridSearchCV. The final RMSE was a little more than 0.12.
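As context for the score, this competition evaluates RMSE between the logarithm of the predicted price and the logarithm of the actual sale price. A minimal sketch of that calculation is below, assuming the reported figure follows the same metric; the function name and the prices are hypothetical.

```python
import math

def rmse_log(y_true, y_pred):
    """RMSE between the natural logs of actual and predicted prices."""
    squared = [(math.log(p) - math.log(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical sale prices, not taken from the competition data.
actual = [200000.0, 150000.0, 320000.0]
predicted = [220000.0, 140000.0, 300000.0]
score = rmse_log(actual, predicted)
print(round(score, 4))
```

Working in log space means a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one.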
Flower-Recognition-Kaggle-CNN-Keras
September 23, 2018 – October 4, 2019
The Flower Recognition dataset on Kaggle. It consists of 4232 images, each with different pixel dimensions. Each image belongs to one of 5 classes ('Daisy', 'Rose', etc.). I trained a Convolutional Neural Network written in Keras to predict the flower type on the validation set. I also used ImageDataGenerator to augment the training set and reduce overfitting.
Cats-vs-Dogs-CNN-Keras
August 24, 2018 – August 24, 2018
The famous Cats-vs-Dogs dataset. I used a self-designed ConvNet to classify each image into one of 2 classes, Dog or Cat. The images are 100×100 pixels each. They are first read with the Python ZipFile module and converted to NumPy arrays of pixel values. The images are then divided into training, cross-validation, and test sets containing 20000, 5000, and 12500 images respectively. I also used data augmentation to reduce the chance of overfitting. The model reached an accuracy of about 88% on the validation set.
IBM-HR-Analytics-Employee-Attrition-Performance
August 2, 2018 – June 14, 2024
The IBM HR Analytics Employee Attrition & Performance dataset from Kaggle. I first performed exploratory data analysis using libraries such as pandas, seaborn, and matplotlib. I then used feature selection techniques such as RFE to select the features. The data is oversampled using the SMOTE technique to deal with the imbalanced classes, and then scaled for better performance. Lastly, I trained several ML models from the scikit-learn library and compared their performance using precision, recall, and other metrics.
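The core idea behind SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified pure-Python illustration of that idea, not the implementation used in the notebook; `smote_sketch`, its parameters, and the sample points are all hypothetical.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical two-feature minority class (e.g. employees who left).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points))
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the data used for evaluation.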
Red-Wine-Quality-Accuracy-0.9175-
July 5, 2018 – July 6, 2018
The Red Wine Quality dataset from Kaggle. The data describes the chemical composition of each wine. I used pandas to manipulate the data and seaborn to visualize it. Finally, I predicted wine quality using various models from scikit-learn.
Pokemon-Data-Exploration-Visualization
June 25, 2018 – June 25, 2018
The Pokemon with stats dataset. Data analysis and exploration are performed on the dataset, with visualization done using the seaborn and matplotlib libraries. Bar plots, box plots, swarm plots, scatter plots, violin plots, heat maps, etc. were used to analyze the data.
Cultural Fit Analysis
The candidate's projects are exclusively personal and heavily focused on Kaggle datasets and competitions, which demonstrates self-motivation and a drive to learn. However, the lack of team projects, real-world business problem-solving, or diverse technology stacks beyond core data science libraries in Python suggests a potential gap in experience with collaborative, production-oriented environments. The target role 'Data Scientist' aligns well with the technical skills demonstrated, but the breadth of experience outside of academic/competition settings is limited, which might impact cultural fit in a fast-paced, cross-functional team.
Soft Skills & Operational Fit
The candidate's project descriptions indicate a methodical approach to problem-solving, starting with data exploration and preprocessing before moving to model training and evaluation. The consistent use of Jupyter Notebooks suggests a preference for iterative and exploratory development. However, there is no direct data to assess soft skills like teamwork, communication, or stress handling.