Raj Mehrotra

Data Scientist

https://talent.gravityer.com/raj-mehrotra

Data Scientist | Kaggle Master | Published Author | Topmate: https://topmate.io/raj_mehrotra

Hyderabad, Telangana, India

Key Strengths

Demonstrated proficiency in various machine learning algorithms (e.g., Logistic Regression, Naive Bayes, Lasso, Ridge, CNNs).
Extensive experience with data preprocessing techniques including EDA, feature selection (RFE), outlier removal, missing value imputation, and data scaling.
Strong understanding of natural language processing (NLP) techniques such as tokenization, stemming, lemmatization, stop-word removal, TF-IDF, Count Vectorizer, LDA, LSA, and Word Embeddings.
Proficient in using Python libraries like pandas, seaborn, matplotlib, scikit-learn, Keras, NLTK, and Gensim.
Experience with handling imbalanced datasets using techniques like SMOTE.
Familiarity with data augmentation to prevent overfitting in deep learning models.

Cultural & Operational Fit

Cultural Fit Analysis

The candidate's projects are exclusively personal and heavily focused on Kaggle datasets and competitions, which demonstrates self-motivation and a drive to learn. However, the lack of team projects, real-world business problem-solving, or diverse technology stacks beyond core data science libraries in Python suggests a potential gap in experience with collaborative, production-oriented environments. The target role 'Data Scientist' aligns well with the technical skills demonstrated, but the breadth of experience outside of academic/competition settings is limited, which might impact cultural fit in a fast-paced, cross-functional team.

Soft Skills & Operational Fit

The candidate's project descriptions indicate a methodical approach to problem-solving, starting with data exploration and preprocessing before moving to model training and evaluation. The consistent use of Jupyter Notebooks suggests a preference for iterative and exploratory development. However, there is no direct data to assess soft skills like teamwork, communication, or stress handling.

Projects

Spooky-Author-Identification

January 19, 2019 – January 19, 2019

A notebook for the well-known Kaggle competition Spooky Author Identification. The task is to identify the author of each text excerpt. I first cleaned and pre-processed the text using standard NLP techniques such as tokenization, stemming or lemmatization, and stop-word removal. I also created some hand-crafted meta features based on each author's writing pattern. I then used the traditional bag-of-words (BOW) approach with the TF-IDF Vectorizer and the Count Vectorizer, and trained ML algorithms such as LogisticRegression and Naive Bayes, which are well suited to text data. So far, TF-IDF on top of the Count Vectorizer has given me the best results; my submission scored a multi-class log loss of 0.46 on the Kaggle private leaderboard, which is quite decent.
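
To put the 0.46 figure in context, here is a minimal sketch of how a multi-class log loss can be computed. The function name, the three-class setup, and the probabilities are hypothetical illustrations, not taken from the notebook.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0), then renormalise so the row sums to 1.
        clipped = [min(max(p, eps), 1 - eps) for p in probs]
        norm = sum(clipped)
        total += -math.log(clipped[label] / norm)
    return total / len(y_true)

# Hypothetical predictions over three classes (indices 0, 1, 2).
y_true = [0, 1, 2]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]
loss = multiclass_log_loss(y_true, y_prob)

# A uniform guess over three classes scores ln(3), roughly 1.10.
baseline = multiclass_log_loss([0], [[1 / 3, 1 / 3, 1 / 3]])
```

Lower is better: confident correct predictions push the loss toward 0, while confident wrong predictions are penalised heavily.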

Topic-Modelling-using-LDA-and-LSA-in-Sklearn

January 8, 2019 – January 8, 2019

I performed topic modelling on the Kaggle dataset "A Million News Headlines". I first pre-processed and cleaned the data, then used the LDA and LSA implementations in the sklearn library. The distribution of words in each topic is also shown.

Word-Embeddings-in-Gensim-and-Keras

December 23, 2018 – December 25, 2018

A simple implementation of word embeddings using the Gensim and Keras libraries. I implemented the well-known Word2Vec model in Gensim. As an alternative, I also used the Keras embedding layer to generate word embeddings.

Movie-Reviews-NLTK-Sentiment-Analysis-

December 16, 2018 – December 16, 2018

The Movie Reviews dataset, imported from the NLTK library. It has 1000 positive and 1000 negative reviews. I first loaded the dataset into a pandas data frame, which makes processing easier. The next step was to analyze the positive (+) and negative (-) reviews. I also preprocessed the dataset using lemmatization and other standard NLP techniques. To extract features from the text, I used the TF-IDF vectorizer from scikit-learn. Lastly, I trained various modelling algorithms from scikit-learn on this data.
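
The feature-extraction step can be sketched without any library as follows. This is an illustrative approximation, assuming the smoothed idf formula and L2 normalisation that scikit-learn documents as defaults for its TF-IDF vectorizer; the `tfidf` helper and the two sample reviews are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, L2-normalised TF-IDF rows) for tokenised docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

# Hypothetical, already-tokenised reviews.
docs = [
    "a gripping and moving film".split(),
    "a dull and tedious film".split(),
]
vocab, rows = tfidf(docs)
```

Terms that appear in only one review (such as "gripping") receive a higher weight than terms shared by both (such as "film"), which is what makes the features useful for telling the classes apart.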

Housing-Prices-EDA-and-Regression-Models

December 9, 2018 – December 9, 2018

The well-known House Prices: Advanced Regression Techniques competition on Kaggle. The dataset consists of training and testing sets, each with about 1.46K rows and 81 features pertaining to a house. I first performed an exhaustive EDA to identify the underlying trends in the data. I also removed outliers to make the regression models more robust, and treated missing values, imputing wherever needed. Lastly, I trained various regression models such as Lasso and Ridge from scikit-learn and tuned their parameters with GridSearchCV. I finally achieved an RMSE of a little more than 0.12, which is pretty decent.
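
Assuming the 0.12 figure refers to the competition's leaderboard metric, which is the RMSE between the logarithms of predicted and actual sale prices, the score can be sketched as below. The `rmse_log` helper and the sample prices are hypothetical.

```python
import math

def rmse_log(y_true, y_pred):
    """RMSE between log(predicted) and log(actual) prices."""
    squared = [
        (math.log(p) - math.log(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical sale prices and model predictions.
actual = [200000.0, 150000.0, 320000.0]
predicted = [220000.0, 140000.0, 300000.0]
score = rmse_log(actual, predicted)
```

Because the error is taken on log prices, a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one.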

Flower-Recognition-Kaggle-CNN-Keras

September 23, 2018 – October 4, 2019

The dataset is Flower Recognition on Kaggle. It consists of 4232 images, each with different pixel values. Each image belongs to one of 5 types, such as 'Daisy' and 'Rose'. I trained a Convolutional Neural Network written in Keras to predict the flower type on the validation set. I also used ImageDataGenerator to augment the training set and avoid overfitting.
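
The idea behind augmentation can be illustrated with a small library-free sketch. This is not the Keras ImageDataGenerator itself; the helper names, the two transforms chosen, and the tiny 2x3 "image" are assumptions made for illustration.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2-D image (list of pixel rows)."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    """Shift the image right by `pixels` (0 <= pixels <= width), padding with `fill`."""
    return [[fill] * pixels + row[: len(row) - pixels] for row in image]

def augment(image, rng):
    """Return a randomly flipped and shifted copy of the image."""
    if rng.random() < 0.5:
        out = horizontal_flip(image)
    else:
        out = [row[:] for row in image]
    return shift_right(out, rng.randint(0, 1))

# Hypothetical 2x3 grayscale image.
image = [[1, 2, 3], [4, 5, 6]]
rng = random.Random(0)
variants = [augment(image, rng) for _ in range(4)]
```

Each epoch then sees slightly different versions of the same picture, so the network is less able to memorise individual training images.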

Cats-vs-Dogs-CNN-Keras

August 24, 2018 – August 24, 2018

The well-known Cats-vs-Dogs dataset. I used a self-designed ConvNet to classify each image as either a Dog or a Cat. The images used are 100×100 pixels each. The images are first converted to NumPy arrays of pixel values using the Python ZipFile module. They are then divided into training, cross-validation, and testing sets containing 20000, 5000, and 12500 images respectively. I also used data augmentation to reduce the chance of overfitting the model. Finally, I achieved a decent accuracy of about 88% on the validation set.

IBM-HR-Analytics-Employee-Attrition-Performance

August 2, 2018 – June 14, 2024

The IBM HR Analytics Employee Attrition & Performance dataset from Kaggle. I first performed Exploratory Data Analysis on the data using libraries such as pandas, seaborn, and matplotlib. Then I used feature selection techniques such as RFE to select the features. The data was then oversampled using the SMOTE technique to deal with the imbalanced classes, and scaled for better performance. Lastly, I trained many ML models from the scikit-learn library for predictive modelling and compared their performance using Precision, Recall, and other metrics.
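
The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration rather than the library implementation used in the notebook; `smote_sketch`, its parameters, and the sample points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class samples with two features each.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote_sketch(minority, n_new=4)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stays inside the region the minority class already occupies.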

Red-Wine-Quality-Accuracy-0.9175-

July 5, 2018 – July 6, 2018

The Red Wine Quality dataset from Kaggle. The data describes the chemical composition of each wine. I used pandas to manipulate the data and seaborn to visualize it. Finally, I predicted wine quality using various models from scikit-learn.

Pokemon-Data-Exploration-Visualization

June 25, 2018 – June 25, 2018

Pokemon with stats. Data analysis and exploration is performed on the dataset. Visualization is done using the seaborn and matplotlib libraries. Bar plots, box plots, swarm plots, scatter plots, violin plots, heat maps, and similar charts were used to analyze the data.

