I have data file, which looks like this,
["Arts & Entertainment", "Arts & Entertainment / Animation & Comics", "Arts & Entertainment / Books & Literature", "Arts & Entertainment / Celebrity/Gossip", "Arts & Entertainment / Fine Art", "Arts & Entertainment / Humor", "Arts & Entertainment / Movies", "Arts & Entertainment / Movies / Action", "Arts & Entertainment / Movies / Comedy", "Arts & Entertainment / Movies / Documentary", "Arts & Entertainment / Movies / Drama", "Arts & Entertainment / Movies / Horror", "Arts & Entertainment / Music", "Arts & Entertainment / Music / Alternative Music", "Arts & Entertainment / Music / Blues", "Arts & Entertainment / Music / Christian Music", "Arts & Entertainment / Music / Classic Rock", "Arts & Entertainment / Music / Classical Music", "Arts & Entertainment / Music / Country Music", "Arts & Entertainment / Music / Electronic Dance Music", "Arts & Entertainment / Music / Heavy Metal", "Arts & Entertainment / Music / Pop Music", "Arts & Entertainment / Music / Rap", "Arts & Entertainment / Radio Stations", "Arts & Entertainment / Television", "Arts & Entertainment / Television / Game Show", "Arts & Entertainment / Television / Kids", "Arts & Entertainment / Television / News", "Arts & Entertainment / Television / Reality", "Arts & Entertainment / Television / Science", "Arts & Entertainment / Television / Sitcom", "Arts & Entertainment / Television / Soap Opera", "Arts & Entertainment / Television / Talk Show", "Autos", "Autos / 4-Wheel Drive/SUVs", "Autos / Buying/Selling Cars", "Autos / Certified Pre-Owned", "Autos / Convertible", "Autos / Coupe", "Autos / Crossover", "Autos / Diesel", "Autos / Electric Vehicles", "Autos / Hatchback", "Autos / Hybrid", "Autos / Luxury", "Autos / Maintenance", "Autos / Maintenance / Parts", "Autos / Maintenance / Repair", "Autos / MiniVan", "Autos / Motorcycles", "Autos / Off-Road Vehicles", "Autos / Road-Side Assistance", "Autos / Sedan", "Autos / Trucks", "Autos / Trucks / Pickup", "Autos / Vintage Cars", "Autos / Wagon", "Business & Industry", "Business & Industry / Advertising", "Business & Industry / Agriculture", "Business & Industry / Biotech/Biomedical", "Business & Industry / Business Software", "Business & Industry / Construction", "Business & Industry / Construction / Composites & Plastics", "Business & Industry / Forestry", "Business & Industry / Government", "Business & Industry / Green Solutions", "Business & Industry / Human Resources", "Business & Industry / Logistics", "Business & Industry / Marketing", "Business & Industry / Metals", "Business & Industry / Non-Profit Organizations", "Business & Industry / Power Industry", "Business & Industry / Public Services", "Business & Industry / Public Services / Emergency Services", "Business & Industry / Public Services / Waste Management", "Business & Industry / Purchasing", "Business & Industry / Retail Industry", "Business & Industry / Small Business", "Business & Industry / Telecom", "Career", "Career / Career Planning", "Career / Job Search", "Career / Job Search / Resume Writing/Advice", "Career / Telecommuting", "Career / U.S. Military", "Education", "Education / Business School", "Education / College Education", "Education / College Education / Admissions", "Education / College Education / College Life", "Education / Continuing Education", "Education / Distance Learning", "Education / Financial Aid", "Education / Financial Aid / Scholarships", "Education / Graduate School", "Education / Homeschooling", "Education / Language Learning", "Education / Language Learning / English as a 2nd Language", "Education / Primary Education", "Education / Secondary Education", "Education / Special Education", "Finance & Money", "Finance & Money / Credit/Debt & Loans", "Finance & Money / Day Trading", "Finance & Money / Exchange Traded Funds", "Finance & Money / Financial News", "Finance & Money / Financial Planning", "Finance & Money / Financial Planning / Retirement Planning", "Finance & Money / Financial Planning / Tax Planning", "Finance & Money / Foreign Exchange Trading", "Finance & Money / Hedge Fund", "Finance & Money / Insurance", "Finance & Money / Investing", "Finance & Money / Mutual Funds", "Finance & Money / Options", "Finance & Money / Stocks", "Food & Drink", "Food & Drink / Barbecues & Grilling", "Food & Drink / Beverages", "Food & Drink / Beverages / Cocktails/Beer", "Food & Drink / Beverages / Coffee/Tea", "Food & Drink / Beverages / Wine", "Food & Drink / Cuisine-Specific", "Food & Drink / Cuisine-Specific / American Cusine", "Food & Drink / Cuisine-Specific / Cajun/Creole", "Food & Drink / Cuisine-Specific / Chinese Cuisine", "Food & Drink / Cuisine-Specific / French Cuisine", "Food & Drink / Cuisine-Specific / Italian Food", "Food & Drink / Cuisine-Specific / Japanese Food", "Food & Drink / Cuisine-Specific / Mexican Cuisine", "Food & Drink / Desserts & Baking", "Food & Drink / Health/LowFat Cooking", "Food & Drink / Organic Food", "Food & Drink / Vegetarian", "Health & Fitness", "Health & Fitness / A.D.D.", "Health & Fitness / AIDS/HIV", "Health & Fitness / Allergies", "Health & Fitness / Alternative Medicine", "Health & Fitness / Alzheimer\\'s Disease", "Health & Fitness / Arthritis", "Health & Fitness / Asthma", "Health & Fitness / Autism/PDD", "Health & Fitness / Bipolar Disorder", "Health & Fitness / Brain Tumor", "Health & Fitness / Cancer", "Health & Fitness / Cancer / Breast Cancer", "Health & Fitness / Cancer / Lung Cancer", "Health & Fitness / Cancer / Prostate Cancer", "Health & Fitness / Cholesterol", "Health & Fitness / Chronic Fatigue Syndrome", "Health & Fitness / Chronic Obstructive Pulmonary Disease", "Health & Fitness / Chronic Pain", "Health & Fitness / Cold & Flu", "Health & Fitness / Deafness", "Health & Fitness / Dental Care", "Health & Fitness / Depression", "Health & Fitness / Dermatology", "Health & Fitness / Diabetes", "Health & Fitness / Epilepsy", "Health & Fitness / Exercise", "Health & Fitness / GERD/Acid Reflux", "Health & Fitness / Headaches/Migraines", "Health & Fitness / Heart Disease", "Health & Fitness / Heart Disease / Women\\'s Heart Disease", "Health & Fitness / Hepatitis", "Health & Fitness / Herbs for Health", "Health & Fitness / Holistic Healing", "Health & Fitness / Hypertension", "Health & Fitness / IBS/Crohn\\'s Disease", "Health & Fitness / Incest/Abuse Support", "Health & Fitness / Incontinence", "Health & Fitness / Infertility", "Health & Fitness / Men\\'s Health", "Health & Fitness / Nursing", "Health & Fitness / Nutrition", "Health & Fitness / Orthopedics", "Health & Fitness / Orthopedics / Sports Medicine", "Health & Fitness / Panic/Anxiety Disorders", "Health & Fitness / Pediatrics", "Health & Fitness / Pharmaceutical", "Health & Fitness / Physical Therapy", "Health & Fitness / Psychology/Psychiatry", "Health & Fitness / Senior Health", "Health & Fitness / Sexuality", "Health & Fitness / Sleep Disorders", "Health & Fitness / Smoking Cessation", "Health & Fitness / Substance Abuse", "Health & Fitness / Substance Abuse / Alcoholism", "Health & Fitness / Thyroid Disease", "Health & Fitness / Weight Loss", "Health & Fitness / Women\\'s Health", "Hobbies & Games", "Hobbies & Games / Arts & Crafts", "Hobbies & Games / Arts & Crafts / Beadwork", "Hobbies & Games / Arts & Crafts / Drawing/Sketching", "Hobbies & Games / Arts & Crafts / Needlework", "Hobbies & Games / Arts & Crafts / Painting", "Hobbies & Games / Arts & Crafts / Photography", "Hobbies & Games / Arts & Crafts / Woodworking", "Hobbies & Games / Astrology", "Hobbies & Games / Birdwatching", "Hobbies & Games / BoardGames/Puzzles", "Hobbies & Games / Candle & Soap Making", "Hobbies & Games / Card Games", "Hobbies & Games / Chess", "Hobbies & Games / Cigars", "Hobbies & Games / Collecting", "Hobbies & Games / Collecting / Antiques", "Hobbies & Games / Collecting / Book Collecting", "Hobbies & Games / Collecting / Miniatures", "Hobbies & Games / Collecting / Stamps & Coins", "Hobbies & Games / Creative Writing", "Hobbies & Games / Getting Published", "Hobbies & Games / Home Recording", "Hobbies & Games / Inventors & Patents", "Hobbies & Games / Learning a Musical Instrument", "Hobbies & Games / Learning a Musical Instrument / Guitar", "Hobbies & Games / Magic & Illusion", "Hobbies & Games / Paranormal Phenomena", "Hobbies & Games / Sci-Fi & Fantasy", "Hobbies & Games / Video Games", "Hobbies & Games / Video Games / Nintendo", "Hobbies & Games / Video Games / PSP", "Hobbies & Games / Video Games / Playstation", "Hobbies & Games / Video Games / RPG", "Hobbies & Games / Video Games / Racing", "Hobbies & Games / Video Games / X-Box", "Home & Garden", "Home & Garden / Appliances", "Home & Garden / Environmental Safety", "Home & Garden / Gardening/Landscaping", "Home & Garden / Home Repair", "Home & Garden / Interior Decorating", "News & Current Affairs", "News & Current Affairs / Law & Politics", "News & Current Affairs / Law & Politics / Immigration", "News & Current Affairs / Law & Politics / Legal Issues", "News & Current Affairs / Law & Politics / U.S. Government Resources", "Parenting & Family", "Parenting & Family / Adoption", "Parenting & Family / Babies & Toddlers", "Parenting & Family / Daycare/Pre-School", "Parenting & Family / Parenting Children", "Parenting & Family / Parenting Teens", "Parenting & Family / Pregnancy", "Parenting & Family / Special Needs Kids", "Pets", "Pets / Aquariums", "Pets / Cats", "Pets / Dogs", "Pets / Veterinary Medicine", "Real Estate", "Real Estate / Apartments", "Real Estate / Architecture", "Real Estate / Buying/Selling Homes", "Religion", "Religion / Alternative Religions", "Religion / Atheism/Agnosticism", "Religion / Buddhism", "Religion / Catholicism", "Religion / Christianity", "Religion / Hinduism", "Religion / Islam", "Religion / Judaism", "Religion / Latter-Day Saints", "Religion / Pagan/Wiccan", "Science", "Science / Astronomy", "Science / Biology", "Science / Chemistry", "Science / Geology", "Science / Physics", "Sensitive Content", "Sensitive Content / Gambling", "Sensitive Content / Gambling / Sports Gambling", "Society", "Society / Dating", "Society / Divorce", "Society / Gay Life", "Society / Marriage", "Society / Senior Living", "Society / Weddings", "Sports & Recreation", "Sports & Recreation / Auto Racing", "Sports & Recreation / Auto Racing / NASCAR Racing", "Sports & Recreation / Baseball", "Sports & Recreation / Basketball", "Sports & Recreation / Bicycling", "Sports & Recreation / Bicycling / Mountain Biking", "Sports & Recreation / Bodybuilding", "Sports & Recreation / Boxing", "Sports & Recreation / Canoeing/Kayaking", "Sports & Recreation / Cheerleading", "Sports & Recreation / Climbing", "Sports & Recreation / College Sports", "Sports & Recreation / Cricket", "Sports & Recreation / Figure Skating", "Sports & Recreation / Fishing", "Sports & Recreation / Fishing / Fly Fishing", "Sports & Recreation / Fishing / Freshwater Fishing", "Sports & Recreation / Fishing / Game & Fish", "Sports & Recreation / Fishing / Saltwater Fishing", "Sports & Recreation / Football", "Sports & Recreation / Golf", "Sports & Recreation / Horses", "Sports & Recreation / Horses / Horse Racing", "Sports & Recreation / Hunting/Shooting", "Sports & Recreation / Ice Hockey", "Sports & Recreation / Inline Skating", "Sports & Recreation / Martial Arts", "Sports & Recreation / Olympics", "Sports & Recreation / Paintball", "Sports & Recreation / Rodeo", "Sports & Recreation / Rugby", "Sports & Recreation / Running/Walking", "Sports & Recreation / Sailing", "Sports & Recreation / Scuba Diving", "Sports & Recreation / Skateboarding", "Sports & Recreation / Skiing", "Sports & Recreation / Snowboarding", "Sports & Recreation / Soccer", "Sports & Recreation / Surfing/Bodyboarding", "Sports & Recreation / Swimming", "Sports & Recreation / Table Tennis/Ping-Pong", "Sports & Recreation / Tennis", "Sports & Recreation / Volleyball", "Sports & Recreation / Waterski/Wakeboard", "Sports & Recreation / Yachting", "Style & Fashion", "Style & Fashion / Body Art", "Style & Fashion / Cosmetics", "Style & Fashion / Fashion", "Style & Fashion / Jewelry", "Technology & Computing", "Technology & Computing / Cameras & Camcorders", "Technology & Computing / Cell Phones", "Technology & Computing / Computer Certification", "Technology & Computing / Computer Networking", "Technology & Computing / Computer Peripherals", "Technology & Computing / Computer Security", "Technology & Computing / Computer Security / Antivirus Software", "Technology & Computing / Computer Security / Network Security", "Technology & Computing / Databases", "Technology & Computing / Graphics", "Technology & Computing / Graphics / 3-D Graphics", "Technology & Computing / Graphics / Animation", "Technology & Computing / Graphics / Desktop Publishing", "Technology & Computing / Graphics / Desktop Video", "Technology & Computing / Graphics / Web Design/HTML", "Technology & Computing / Home Theater Systems", "Technology & Computing / Operating Systems", "Technology & Computing / Operating Systems / Linux", "Technology & Computing / Operating Systems / Mac OS", "Technology & Computing / Operating Systems / Unix", "Technology & Computing / Operating Systems / Windows", "Technology & Computing / Portable Device", "Technology & Computing / Programming", "Technology & Computing / Programming / C/C++", "Technology & Computing / Programming / Java", "Technology & Computing / Programming / JavaScript", "Technology & Computing / Programming / Visual Basic", "Travel", "Travel / Adventure Travel", "Travel / Africa", "Travel / Air Travel", "Travel / Asia", "Travel / Asia / Japan", "Travel / Australia & New Zealand", "Travel / Bed & Breakfasts", "Travel / Budget Travel", "Travel / Business Travel", "Travel / Camping", "Travel / Canada", "Travel / Caribbean", "Travel / Cruises", "Travel / Europe", "Travel / Europe / Eastern Europe", "Travel / Europe / France", "Travel / Europe / Greece", "Travel / Europe / Italy", "Travel / Europe / United Kingdom", "Travel / Honeymoons/Getaways", "Travel / Hotels", "Travel / Mexico & Central America", "Travel / National Parks", "Travel / South America", "Travel / Spas", "Travel / Theme Parks", "Travel / United States", "Travel / United States / California", "Travel / United States / Florida", "Travel / United States / Hawaii", "Travel / United States / Las Vegas, Nevada", "Travel / United States / Manhattan, New York", "Travel / United States / New England", "Travel / United States / Texas", "Travel / Weather"]
I clean up the data file and I split it, so that it looks something like this,
['Arts & Entertainment']
['Arts & Entertainment', 'Animation & Comics']
['Arts & Entertainment', 'Books & Literature']
['Arts & Entertainment', 'Celebrity Gossip']
['Arts & Entertainment', 'Fine Art']
['Arts & Entertainment', 'Humor']
['Arts & Entertainment', 'Movies']
['Arts & Entertainment', 'Movies', 'Action']
['Arts & Entertainment', 'Movies', 'Comedy']
['Arts & Entertainment', 'Movies', 'Documentary']
['Arts & Entertainment', 'Movies', 'Drama']
['Arts & Entertainment', 'Movies', 'Horror']
['Arts & Entertainment', 'Music']
['Arts & Entertainment', 'Music', 'Alternative Music']
['Arts & Entertainment', 'Music', 'Blues']
['Arts & Entertainment', 'Music', 'Christian Music']
['Arts & Entertainment', 'Music', 'Classic Rock']
['Arts & Entertainment', 'Music', 'Classical Music']
['Arts & Entertainment', 'Music', 'Country Music']
['Arts & Entertainment', 'Music', 'Electronic Dance Music']
['Arts & Entertainment', 'Music', 'Heavy Metal']
['Arts & Entertainment', 'Music', 'Pop Music']
['Arts & Entertainment', 'Music', 'Rap']
['Arts & Entertainment', 'Radio Stations']
['Arts & Entertainment', 'Television']
['Arts & Entertainment', 'Television', 'Game Show']
['Arts & Entertainment', 'Television', 'Kids']
['Arts & Entertainment', 'Television', 'News']
['Arts & Entertainment', 'Television', 'Reality']
['Arts & Entertainment', 'Television', 'Science']
['Arts & Entertainment', 'Television', 'Sitcom']
['Arts & Entertainment', 'Television', 'Soap Opera']
['Arts & Entertainment', 'Television', 'Talk Show']...
Now, I'm trying to convert the the list objects into a dictionary that looks like this,
{
"Arts & Entertainment": {
"Animation & Comics": {},
"Books & Literature": {},
"Celebrity Gossip": {},
"Fine Art": {},
"Humor": {},
"Movies": {
"Horror": {},
"Action": {},
"Comedy": {}, ...
}, ...
}
The problem is I can't figure out how to not override my subcategories, In the example above, the Movies sub key has three categories with it, however when I run my code, which is below it just has the key of "Horror" in it and that's because Horror is the last element in the last element of the last list in that category. Example of what I'm getting:
{
"Arts & Entertainment": {
"Animation & Comics": {},
"Books & Literature": {},
"Celebrity Gossip": {},
"Fine Art": {},
"Humor": {},
"Movies": {
"Horror": {} # notice there are no other categories in the movies section
}, ...
}
Code I've tried:
def cleanup_contextweb():
contextweb_file_path = directory_path + raw_file_names[1]
tree = {}
with open(contextweb_file_path, 'r') as contextweb_file:
cats = contextweb_file.read().replace('Manhattan, New York', 'Manhattan New York').replace('Las Vegas, Nevada', 'Las Vegas Nevada').replace('Celebrity/Gossip', 'Celebrity Gossip').replace('Atheism/Agnosticism', 'Atheism Agnosticism').replace('Pagan/Wiccan', 'Pagan Wiccan').split(',')
#cats = re.sub(r'"|\[|\]', '', cats)
cats = [map(str.strip, re.sub(r'"|\[|\]', '', cat).split('/')) for cat in cats]
cats = sorted(cats)
for cat in cats:
if len(cat) == 1:
tree[cat[0]] = {}
elif len(cat) == 2:
tree[cat[0]][cat[1]] = {}
elif len(cat) == 3:
tree[cat[0]][cat[1]] = {}
tree[cat[0]][cat[1]][cat[2]] = {}
elif len(cat) == 4:
tree[cat[0]][cat[1]] = {}
tree[cat[0]][cat[1]][cat[2]] = {}
tree[cat[0]][cat[1]][cat[2]][cat[3]] = {}
with open(directory_path + 'cleaned_' + raw_file_names[1], 'w') as contextweb_file_out:
json.dump(tree, contextweb_file_out, sort_keys=True, indent=4)
return json.dumps(tree, sort_keys=True, indent=4)
As you'll see I'm trying to build the dictionary I know how deep (how many keys I need) I am based on the length of the list passed in. Other things, I've tried, but erased, include, sorting the list of lists (cats
) by the length of the sub list and reversing it, so that all list with 4 elements would be iterated on first. I thought I could build the keys up that way because the key would exist for lower levels. That didn't really help.
Actually, a for-loop can produce quite a nice solution too:
>>> data
[['a', 'b', 'c', 'd'], ['a', 'b', 'c'], ['a', 's', 'd'], ['a', 'b', 'c', 'd', 'e']]
>>> tree = {}
>>> for cats in data:
... curtree = tree
... for c in cats:
... curtree = curtree.setdefault(c, {})
...
>>> tree
{'a': {'s': {'d': {}}, 'b': {'c': {'d': {'e': {}}}}}}
The .setdefault()
method assures, that sub-dictionary is added if and only if key (category) has not existed before.
The curtree
starts from base dictionary tree
and traverses / builds the tree using the categories.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With