JAV database

ding73ding

Akiba Citizen
Oct 25, 2009
2,331
2,065
Hi, just a quick update first on my end, then a general reply to your new posts since last time I posted, over a month ago.

It took just over a day to finish coding what I called my alpha. However, I hit a stupid problem, and after 2 days of trying I got pissed off and took a break... then Chinese New Year arrived, and only yesterday did I take another look at the problem. This time (as expected) it took me only a couple of hours to solve it. The problem was that I couldn't get XBMC to display Japanese (Unicode) text, which is something I did ages ago (on the old Xbox, even) and had forgotten how to do... and some of the XBMC documentation is so horrendous that such a simple task is STILL not properly documented.

Anyway, I got myself a second Android PC, installed XBMC on it, and took another crack at it. As I said, I solved the Unicode display problem (confirming that my own program was correct all along), so my little project is back on. To remind you, my project (and philosophy) is to exploit as much as possible what's already out there and well supported. So for most of the things you guys are talking about, I'm relying on XBMC to do them for me, e.g. browsing (or searching or filtering) my collection by title, genre, actress and DVD code. All of these are provided by XBMC; I only need to find the metadata and feed it to XBMC.

To recap, I now have a small program that reads one or more plain text files containing the metadata (from DMM etc.) and generates one or more .nfo files (in XML format) that can be readily imported by XBMC into its video library. That's it, nothing more than a format translator (from plain text to XML) plus a simple language translator (using a super-simple dictionary look-up) that translates the field names (発売日, 収録時間, 出演者 etc.) into English. Some variation in the input file is handled correctly, e.g. 発売日, 配信開始日 and 商品発売日 are all treated as "premiered".
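A minimal sketch of that kind of translator, in Python rather than the Matlab the original program is written in. The "label: value" input layout, the exact field-to-tag mapping and the whitespace-separated actress list are assumptions made for illustration; they are not taken from the actual program.

```python
import xml.etree.ElementTree as ET

# Hypothetical field-name dictionary: Japanese label -> NFO tag.
# Several labels may map to the same tag, as described in the post.
FIELD_NAMES = {
    "発売日": "premiered",
    "配信開始日": "premiered",
    "商品発売日": "premiered",
    "収録時間": "runtime",
    "出演者": "actor",
}


def text_to_nfo(text: str) -> str:
    """Convert 'label: value' lines into a <movie> XML string."""
    movie = ET.Element("movie")
    for line in text.splitlines():
        # Accept the full-width colon as well as the ASCII one.
        label, sep, value = line.replace("：", ":", 1).partition(":")
        tag = FIELD_NAMES.get(label.strip())
        if not sep or tag is None:
            continue  # skip lines that are not recognised fields
        if tag == "actor":
            # Assumption: several names are separated by whitespace.
            for name in value.split():
                actor = ET.SubElement(movie, "actor")
                ET.SubElement(actor, "name").text = name
        else:
            ET.SubElement(movie, tag).text = value.strip()
    return ET.tostring(movie, encoding="unicode")
```

Unknown labels are skipped silently here; a real converter would probably want to log them so that the dictionary can be extended.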

I've also created a simple dictionary which covers only the field names and genres. Translating actress names, studios etc. would be super easy technically (literally fewer than 5 lines of code) but I don't have a list of actresses in both languages. Handling actresses with more than one alias would be slightly more difficult, but again the main issue is the data, not the program.

Translating title from Japanese to English is of course technically difficult. There might be a way to interface with Google Translate and exploit that functionality. The result would be quite garbled but it beats nothing. Personally I'm not interested in machine (including Google) translation since I do manual translation (starting from Google Translate) for every movie that I download anyway (which is about 1 per week, about...)

Next I would like to ask for more metadata, esp. the 80,000-film database that CG has already collected. I would be surprised if my modest JAV collection isn't already completely covered by that database.

As for new films that are released daily or weekly, it would be easy to code a "scraper" if only there were a reliable source to scrape. DMM is very tricky for me to use; sometimes I pull a cover image (DMM has the best quality cover scans) from Yahoo's or Google's cached or translated version, but that's not 100% reliable. No, I haven't figured out how to proxy-access DMM.

Currently my biggest problem is that the actress names (in Japanese) and genres (in English) don't seem to get properly imported into XBMC library. I have to figure out if it's XBMC's config problem or my output format needs some tweaking.

As for the structure and content of the data, I strongly believe in keeping it simple, initially. I also want very high reliability and data quality, so I want to trust only data obtained directly or indirectly from DMM, similar e-tailers or the studios' official sites. So here is the nearly complete list of fields to be included:

title (English, if available), originaltitle (ie. Japanese title if English title is available), studio, series, runtime, genre, premiered (ie. release date), actors, director, plot (the textual description (hundreds of Japanese words) that accompany every product that DMM sells), id (DVD code)

For all of these, the field names that I gave are exactly the field names used by XBMC's library. For all except one, DVD code (=id in XBMC), my choice is IMHO pretty uncontroversial. It's easy to think up many more, but I want to get these down before (if ever) taking on more.

One problem is that most uncensored videos were (presumably?) not released on DVD and so lack a DVD code. But many (major) studios have their own consistent coding scheme, e.g. Carribbean.com uses a code of the form MMDDYY-NNN (e.g. 122713-508). In general, metadata for uncensored videos is a bit lower in completeness and/or quality.
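As a rough illustration, a date-based code of the MMDDYY-NNN form can be validated and split into a release date and a serial number. The function name is hypothetical, and reading the two-digit year as 20YY is an assumption.

```python
import re
from datetime import date

# MMDDYY-NNN, e.g. 122713-508
CODE_RE = re.compile(r"^(\d{2})(\d{2})(\d{2})-(\d{3})$")


def parse_date_code(code):
    """Return (release_date, serial) for an MMDDYY-NNN code, else None."""
    match = CODE_RE.match(code)
    if match is None:
        return None
    month, day, year, serial = (int(group) for group in match.groups())
    try:
        # Assumption: two-digit years are in the 2000s.
        return date(2000 + year, month, day), serial
    except ValueError:
        return None  # digits matched, but they are not a real calendar date
```

A DVD-style code such as SDMT-908 simply fails the pattern and returns None, so the same catalog could hold both kinds of id.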

When I get home and have some free time, I will take some screenshots to post here.

That's it for my update, next I will reply to some...
 

ding73ding

First, I am super-excited that this topic is so active. I hope we can each keep working on our own projects (not that, it seems, we have any real cooperation going yet) and share ideas, solutions, data and encouragement. I am generally not that smart about compliments and encouragement, so I know I'm not producing enough positive energy here. Really, I hope all of you will keep at it and do it the way you see as right.

And then... I will start saying some things that you may not like to hear... it's absolutely nothing personal, so please, please keep the positive energy (above) in mind and consider the following without taking offense.

I think many of you are going about it without a clear view of your own scope and purpose. Esp. you (variously) seem to want to build a website (cloning AniDB or IMDB?) or create your own version of a media center (cloning iTunes or XBMC) (or call it media library, media management, same story). In both cases it's ill-advised, mainly because you are taking on a huge project that will not reward you in a meaningful way.

For a website, do you have ambition to attract hundreds or even thousands of visitors each day? And once you achieve that, so what? Personal satisfaction? Financial reward (ad revenue?)? Will you be prepared to deal with the headaches then, too?

If you aim to build a simple website for browsing and searching metadata, and maybe cover images, thumbnails, screenshots and contact sheets (is that the right name for it?), then building and hosting it is very simple. Technically, I strongly suggest you use only HTML (maybe HTML5). Hosting is also nearly trivial: I could even host it on my network-attached appliance, or anyone could host it on a home PC (assuming you are willing). Take my appliance for example: it costs under US$200, runs Linux, draws less than 20W of power, is quiet and is relatively hacker-resistant. The only thing to add is a registered domain so you have a cool domain name (www.javdb.com?) instead of an IP address. (ooh javdb.com is already taken)

The cost and effort are so small compared to the backend work of maintaining the database that I think it's hardly worth mentioning here. Either you have the HTML know-how and artistic ability to make a pretty site, or you don't; either way, discussing it here is helpful neither to yourself nor to most other Akiba members.

I want to mention that the amount of data (as CG said, about 10MB, actually well under that, for textual metadata; about 10GB, certainly under 100GB, for cover images plus screenshots) is absolutely no problem for my $200 appliance. And if the daily traffic is reasonably low, it's practically free to host off any decent home broadband connection.

For the website, has there been mention of Java and Swing? IMHO, I don't see the point of going beyond HTML (maybe HTML5). Even Flash is probably inadvisable. Messing with Swing when talking about a website is just mad, with due respect.

Now for the homebrew media center: what do you really want to do that can compete with any of the big boys, both technically and for market share? There are already iTunes and Windows Media Player (since version 7) on the commercial side, and XBMC and others that are open source. There is also a ton of media management software out there that is less than a full-fledged media center but emphasizes the database management aspect, and most of it supports the aforementioned "mainstream" media centers. Most of these can be considered support software or outright add-ons (e.g. for XBMC).

With due respect, I think it's mad to try to compete with iTunes, WMP and XBMC. In case these don't suit you, you should also take a look at MediaBrowser, Boxee, Netgear NeoTV, pyTiVo and more... Each of these has many man-years of development behind it, both in coding ingenuity and in beautiful interfaces (most have multiple skins available).

So that's the path I take myself: somehow feed the metadata and cover images to XBMC and let it do the rest. And I want to put the minimal amount of effort into this. Realistically I expect only to get the satisfaction of a powerful and smooth browsing experience on my home theater, and maybe 2% satisfaction from the technical challenge (my real job gives me pretty serious technical challenges already). Perhaps my project can help others (esp. if CG and other Akiba gurus approve of my work); that would be my little contribution to Akiba. But like I said, I can't solve your problems, unless they happen to be the same as mine.

Like I said, I've managed to feed DMM-like textual data into XBMC; that's alpha, done. Beta will take a while longer, as the remaining issues are individually not difficult but there are a lot of them to sort out. And the further the alpha (or alpha-plus) version progresses, the less motivated I will be to work on the progressively-less-important issues.

With regard to previous points: what I'm focused on is the "back-end" stuff, collecting and maintaining the data, especially harmonizing the different formats or alternate strings that come from different data sources (it sucks that it's so hard to get the data off DMM from outside Japan). I think this is the most important part, the part that's missing because there is no IMDB or AniDB for JAV. All the rest, scraping, media browsing/management/playing/tagging etc., is already handled by mature and sophisticated software. And once you have the backend, building and hosting a decent website is super easy.

The biggest problem, in fact, comes if the website actually becomes popular: then you have to deal with the copyright/moral/legal issues that might arise.

I'll have more suggestions, on specific points about the database, to add later...
 

cyberzen

New Member
Apr 8, 2010
64
21
While I like the idea of XBMC scraping metadata from the web, I prefer to use mplayer to play my videos.

For me what's important is a simple UI that will show me a list of my videos in thumbnail, cover, screencap, actress view.
I want relevant tags for videos in my collection automatically.
I want an easy way to search through my collection by rating, title, actress and tags.
I want support for porn and hentai.

In my view there are several problems that an app has to overcome to deliver on these features.

1) video normalization (file hash)
2) tag normalization (synonyms)
3) tag voting (relevant tags will rise to the top)
4) either a central server that can coordinate all the metadata synced to client OR some kind of P2P system (this will be a lot more difficult)

When you have different users, each with different primary language ability, you will run into the same kind of problem that stackoverflow had. They solved it by adding tags synonyms http://blog.stackoverflow.com/2010/08/tag-folksonomy-and-tag-synonyms/.

Anyways so right now, I'm working on the server part. The client will also have an easy way to import your existing data and also to export that data to xml, json, csv.

Will you be releasing your program as open source?
 

cyberzen

No offense taken :)

Primarily the reason why I started this project is to have a website for JAV, Porn, Hentai metadata. In my day job as an android developer I have a lot of experience with developing oauth service stacks, mostly node.js server and android client. I also have a lot of linux experience, so server administration for nginx, mysql, php is no problem for me. I even used to host quite a few small hentai sites like 10 years back :) The other reason was also to learn QT, and several other technologies that interest me (redis,mongodb,zeromq,cassandra)

Anyways, I think developers like us sometimes have very strong opinions about how some things should work, which is also the reason for so many similar but different products and services :) Why do you think we have so many databases, webservers and web application stacks :)

Regarding XBMC, I have no issue with it; in fact, if there was a reliable way for XBMC to scrape this information, that would be good. But whenever I read "scrape", the next thing I think is "fragile". It is so easy for a website to change its HTML structure, and then your scraping regex / xpath will be broken. What you preferably want or need is a proper RPC / JSON API. If there is an existing site out there that provides this, I would like to know :)

Then comes the problem of keeping this metadata up to date and relevant, preferably you would want to have the community to provide this information. It will be VERY time consuming and error prone to do this yourself. After you can be sure you got relevant metadata then you can tack on some full text search.

I don't foresee any legal issues, this is just metadata.

For example this is the data exchanged https://github.com/zenwong/tagu/wiki/Sync-Logic

There will be no sharing of actual videos
 

cyberzen

Running a server on a home network is generally frowned upon by most ISPs; unless they specifically state otherwise, you can assume they prohibit any kind of server on their network. Secondly, most home internet plans use dynamic IPs, so your IP will be changing constantly, and you will need some kind of dynamic DNS to keep up with it. Thirdly, you will need to keep your computer on the whole time :) You will also need to open some ports on your computer, otherwise how are people going to connect to it?

Anyways, server hosting in general is very affordable: you can get a VERY powerful dedicated server from hetzner for about $80 US with 20TB of bandwidth a month. Starting out you probably won't need a dedicated server; any medium-sized VPS, which you can get for about $10-$20 a month, will suffice. Domain names are cheap, I already bought a cool one :) http://tagu.in it means tag in japanese according to google translate haha :)

I think when he mentioned Java Swing, he means the client, not the server. Also there's nothing wrong with using java on the server, lots of fortune 500 companies use java stack :) flash on the other hand....

The reason why I am discussing development here is to share ideas and, if possible, perhaps some of the work. And yes, my end goal would be a website and client that you can use to look up information / metadata about JAV / Porn / Hentai. Since there is no existing solution for this, I am not competing with anyone.
 

ding73ding

About tags...

I propose that we have two (or more) different fields, the one that I care deeply about is "ジャンル", katakana for "genre". It's a field that's very consistent between different data sources (DMM etc). So the others may be "tags" and/or "rating" etc.

These happen to be the same field names as in XBMC, but I won't be surprised if iTunes, MPlayer etc. all use the same field names.

Basically, genres should be pretty objective, in mainstream it might be scifi, romance, comedy etc. In JAV, genre data is practically official, as in, it's provided by the studio that releases the film. Ok... not quite 100%... many e-tailers add their own genres to each film mainly to communicate the medium or nature of business transaction, e.g. "dvd", "streaming", "second hand", "clearance item" etc. Basically the e-tailers are using genre to function as tags, so the shoppers can quickly find/browse "clearance item", for example. These are meaningless to us and will be filtered away.

Then there are tags, which are subjective and personal. I propose that tags don't overlap what's already in genres, e.g. "素人" amateur, "巨乳" busty. So what exactly would fit in tags? In theory one can think up a lot of useful labels that may be lacking in the official genres. Maybe... hmm... the negative aspects, "sagging tits", "wrinkles", "fat", or something that's neglected? "big cock", "lots of sperm", "lots of close-ups", "poor lighting"? "hard sub", "watermark" etc., some might think those useful... but ehhh... seriously, would I ever want to browse vids with subs? Yeah, certain JAVs are much more fun/corny to watch with subs (in one case, I keep a subbed (Chinese) version and another version that lacks the subtitles but gives much better video quality), but that's very rare. Likewise some users may want to tag a file as "HD" or "1080p", etc.

Don't know if that works for you... Sounds to me more like you want to substitute your personal standards or classification for the genres. Sub-divide "巨乳" into "big tits" and "ginormous tits"? Sub-divide "熟女" into "MILF" and "granny"? Well, I can certainly sympathize with you; I used to write my own app and rating system for the photos that I collected (this was way before DVD or even VCD). But no more... at least for JAV and Western porn. I toss Western porn into a few broad categories, one of which is "miscellaneous". My JAV is even less organized: I only have 3 categories (censored, uncensored and nude (idol/image/gravure)), and I plan to use actress and genre to help me browse my collection. So yeah... while I certainly agree "巨乳" as supplied by DMM etc. isn't precise enough for my taste, I'm not gonna put any time and effort into it, not even when there's an app to facilitate it.

Anyway, assuming we might end up agreeing to disagree on some things, I still want to convince you to support my view: keep the "official" genre data intact and separate from tags. That way we could help each other out at least on this aspect. That way every user of this database (yes, it's much more about the data and database than about the app) can understand "busty" as "巨乳" exactly as categorized by DMM, rather than by a subjective standard according to ding73ding or cyberzen or CG.

As for actual formats, we've seen at least 3 here: plain text as copied from a DMM product page, the XML-based NFO file used by XBMC and many other major programs, and cyberzen's format, which looks a bit like BibTeX. I think the format doesn't matter much as long as the data is complete and high quality. It's easy enough to write import/export methods for any well-defined format. For now I only have plain text to NFO, but once a standard format emerges I can whip up all the necessary converters in no time.

My preference is to adopt the IMDB/XBMC standard, but I'm quite willing to accept whatever the "community" comes up with. Some issues with the IMDB/XBMC standard are the lack of explicit dual/multi-language support and the lack of a way to deal with multiple aliases (actress). I think these are very minor issues.

The point is to provide data to the widest range of users, no matter what their preferred OS, hardware, media player/center app is.
 

ding73ding

While I like the idea of XBMC scraping metadata from the web, I prefer to use mplayer to play my videos.

I hope we can be player-agnostic, whatever we are each and together doing, there's a common core (the back-end) that's open and useful to everyone. Anyway, as I understand XBMC (at least on Android?) fires up an external player, equivalent to your own app firing up mplayer.

For me what's important is a simple UI that will show me a list of my videos in thumbnail, cover, screencap, actress view. I want an easy way to search through my collection by rating, title, actress and tags.

All of these are already well implemented in XBMC (and various competitors); you only need to collect the info and pix and feed them to XBMC. Some skins even play a slideshow and/or trailer/clip as background for each individual film. The pix and clips can be stored on the local drive, on the LAN, or in the cloud (XBMC add-ons support Google Drive and MS Cloud). For example, CG can put his 10GB of cover images in the cloud somewhere and any XBMC user can access (stream) or scrape (download) them online.

I want relevant tags for videos in my collection automatically.

That depends on your definition of relevant and automatic. What I have now is sufficiently relevant (DMM-like) and automatic (once I convince CG to release his 80,000 entries of metadata) for me.

I want support for porn and hentai.

Media Center Master already supports porn. We are just lacking JAV and hentai (anime).

In my view there are several problems that an app has to overcome to deliver on these features.
1) video normalization (file hash)


I propose that, if you want to do this, you build it as a stand-alone tool that can also be called from your main app. IMHO the average user isn't likely to use it. If I want to distinguish different files of the same JAV, then file size, extension (avi, mp4, mkv), encoding method and resolution are already way more than what I need. And I almost never download a full-length file (500+ MB) without knowing the DVD code, so I am still puzzled by your need.

2) tag normalization (synonyms)

This is technically trivial. I already translate different equivalent strings to the same translation:
生中出し cream pie
中出し cream pie
女子校生 college girls
女子大生 college girls
these are snips from my dictionary file.

3) tag voting (relevant tags will rise to the top)

I really doubt we will get enough vote count to matter... but then... if I were really smart on such things (social network, crowd source) I would have invented Facebook not that Zucker guy.

4) either a central server that can coordinate all the metadata synced to client OR some kind of P2P system (this will be a lot more difficult)


Isn't that practically trivial? Just the HTTP protocol (with old school get/put CGI) is more than enough to address this, even with your user-account and voting system. I developed a website for a Japan-related hobby way back before FB and MySpace and had thousands of visitors per day, and we had a contest and had over 100 entries (I hand-coded a simple user registration, and entry submission system using perl-cgi and I had 8 judges from around the world who log in on the website and score each entry). Eventually I got sick of footing the webhosting bill (over 10GB/month of outgoing traffic, back in the early 00s that's a lot) and didn't want to turn "evil" (Double-Click) so we closed it down.

When you have different users, each with different primary language ability, you will run into the same kind of problem that stackoverflow had. They solved it by adding tags synonyms http://blog.stackoverflow.com/2010/08/tag-folksonomy-and-tag-synonyms/.

I think this is like step 6, and we haven't even started step 1. I propose we first take care of the hardcore objective info as available from DMM. Get all the titles, actresses and DVD codes as the absolute basics; studio, series and genre are also useful. Then, I believe, if we get more than 10 users (a big IF) we can get together and discuss ideas. We are still so far off from having the kind of headaches stackoverflow had.

Will you be releasing your program as open source?

I have no problems about that, but no one has expressed interest yet. I doubt it would be useful to most people as I write in Matlab. It's easy enough to translate into any language if needed. More likely it's much more useful to release the database, we could just put it up on ryushare and various file hosts and anyone can download it and import it locally. I would be perfectly willing to run my program daily and upload the incremental output to wherever the community decides. I think it's hardly necessary to run it more than once a day. It's also easy enough to turn it into a scheduled task.
 

cyberzen

Haha, your suggestion of a genre kind of tag is what I just changed my schema to accommodate :) I use sqlite for the client and mysql for the server, and here's the current schema for tags; it uses a category (or tag group) to group tags together:

CREATE TABLE Tags (
_id integer primary key autoincrement,
name text,
cid integer default 1,        -- category this tag belongs to
synced integer default 0,
remote integer default 0,
-- cid should point at the Category table, not back at Tags
foreign key(cid) references Category(_id),
unique(name) ON CONFLICT IGNORE
);

CREATE TABLE Category (
_id integer primary key autoincrement,
name text,
unique(name) ON CONFLICT IGNORE
);

under category I can have groupings of tags like bust size, body type, hair color, ethnicity, actress.
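For illustration, here is a small sketch that exercises a trimmed version of this schema with Python's built-in sqlite3 module. It assumes the foreign key is meant to point at Category(_id), leaves out the synced/remote columns, and uses made-up category and tag names.

```python
import sqlite3

# Trimmed variant of the schema above (assumption: cid -> Category._id).
SCHEMA = """
CREATE TABLE Category (
    _id integer primary key autoincrement,
    name text,
    unique(name) ON CONFLICT IGNORE
);
CREATE TABLE Tags (
    _id integer primary key autoincrement,
    name text,
    cid integer default 1,
    foreign key(cid) references Category(_id),
    unique(name) ON CONFLICT IGNORE
);
"""


def build_demo_db():
    """Create an in-memory database with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Category(name) VALUES (?)",
                     [("genre",), ("hair color",)])
    # The third row repeats a tag name; ON CONFLICT IGNORE drops it silently.
    conn.executemany("INSERT INTO Tags(name, cid) VALUES (?, ?)",
                     [("idol", 1), ("black hair", 2), ("idol", 2)])
    return conn


def tags_by_category(conn):
    """Return (category, tag) pairs sorted by category, then tag."""
    query = ("SELECT c.name, t.name FROM Tags t "
             "JOIN Category c ON t.cid = c._id "
             "ORDER BY c.name, t.name")
    return conn.execute(query).fetchall()
```

One consequence of unique(name) is that the same tag name cannot live in two categories; whether that is wanted depends on how the groupings end up being used.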

Ahh, I did not know about Media Center Master, I will have to look into it when I'm back on my windows box. However, it seems to only support windows, and I want something that I can run on my beloved arch linux :)

1) video normalization
Right now, when I import videos into the database, I'm recording the title (filename) and hash (sha1). I will be using the hash for duplicate detection and also uploading it to the server for video normalization. There is nothing to normalize on the client side, since it's unlikely you will have 5 different versions on your computer; on the server, however, it is much more likely that 20+ hashes will refer to the same video title.
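A minimal sketch of the client-side part, hashing files with SHA-1 and grouping identical ones. The function names are hypothetical and this is not taken from the actual client.

```python
import hashlib
from collections import defaultdict


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large videos never sit fully in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Map each SHA-1 digest to the files sharing it, keeping only real duplicates."""
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[sha1_of_file(path)].append(path)
    return {digest: files for digest, files in by_hash.items() if len(files) > 1}
```

Note that a content hash only catches byte-identical copies. Two different encodes of the same title get different hashes, which is exactly why the mapping from many hashes to one title has to live on the server.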

2) tag normalization
removing pluralism can be handled by stemming. But this will not work for multi word tags like cum.in.mouth > oral.creampie. Each tag should have an official tag that other tag synonyms refer to. This will also help the users tag the official tag on the client.
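One way to picture the "official tag plus synonyms" idea is a lookup table applied after a basic clean-up step. The synonym entries below are made up for illustration, and the dotted tag style simply follows the examples above.

```python
from collections import Counter

# Hypothetical synonym table: every variant points at one official tag.
SYNONYMS = {
    "hardsub": "hard.sub",
    "hard-sub": "hard.sub",
    "subtitled": "hard.sub",
    "full.hd": "1080p",
}


def canonical_tag(raw, synonyms=SYNONYMS):
    """Lower-case, join words with dots, then map synonyms to the official tag."""
    tag = ".".join(raw.strip().lower().split())
    return synonyms.get(tag, tag)


def merge_votes(raw_tags):
    """Count votes per official tag so that synonyms pool their votes."""
    return Counter(canonical_tag(raw) for raw in raw_tags)
```

Pooling votes under the official tag is what lets the voting in point 3 work across users who spell the same idea differently.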

It's quite critical to get normalization right, since they are all dependent on one another
title <> tags, recommendation, comments
tags <> title, recommendation
recommendation <> title, tags

Even XBMC expects your files to be named in a certain pattern for its scrapers to work, as documented here:
http://wiki.xbmc.org/?title=Scrapers
http://wiki.xbmc.org/index.php?title=Video_library/Contents

3) tag voting
are you on empornium.com ? their tag voting system is what I am trying to emulate. Each video will have a bunch of tags that users can add or vote on. They typically list the top 10 tags for each video, so obviously the more relevant tags will be shown. This might not be useful for videos that you have tagged yourself, but it will be useful for searching for videos that you are interested in downloading. I use it all the time on empornium :)

4) central server
layering some kind of reliable sync logic on top of http is not trivial to me at my current skill level. The other option would be to have some kind of DHT that maps video title > tags, tags > title. I'm still trying to read up more on this area since I don't have any experience developing p2p apps.

5) hosting the database on file host
not every user will know how to use this database :) you need some kind of UI for this to be useful for the vast majority of users.

Could you provide an example of the XML / .nfo you are using in your program. Also XBMC uses sqlite, so I guess it's quite easy for me to insert data in there as well. I downloaded the XBMC source to take a look at their schema :)
 

ding73ding

1) video normalization
...
5 different versions on your computer, on the server however the chances are much more likely you will have 20+ hashes referring to the same video title.

My general point about your post is that you are adding or combining some pretty serious functionality into this project that I would not and could not support. Here is the way I break it down:

1. a (nearly) complete catalog (I used to call it a database, but changed the wording to be more specific) of JAV titles, which are identified uniquely by DVD code, e.g. SDMT-908
Calling it a catalog emphasizes that it's about the title, not the file, and that it holds only the solid data (studio, actress etc.) found in the catalogs of e-tailers.

2. a database that covers the different media files that may be of the same title

3. a multi-user community that vote, tag and recommend (?) titles, like last.fm (I think you mentioned it before)

For various reasons, I'm only interested in developing or even using item 1 and only for single users. And hopefully I will keep my mouth shut about items 2 and 3.

2) tag normalization
...
refer to. This will also help the users tag the official tag on the client.


Like I said, I'm not going to comment on what you seem to be calling non-official tag. I just want to re-emphasize that IMHO, we need to keep official "genre" and user-applied "tag" true and well separated.

even XBMC, expects your files to be named in a certain pattern for their scrapers to work,

XBMC only adopts the de facto standard that evolved from the community. In practice, every single movie and TV show that I download from filehosts and BT is recognized automatically without me lifting a finger. If I turn on the feature, XBMC, upon launch, scans the BT folder, finds any new files and scrapes the metadata, cover art, fanart etc. from IMDB. I just have to click OK to confirm the choice offered by the scraper, or, in case a film title is ambiguous, select a title other than the top choice; still just a few button presses on my remote.

3) tag voting
are you on empornium.com ? their tag voting system is what I am trying to emulate.


Sounds interesting, but I cannot and would not develop something along the same lines.

4) central server
layering some kind of reliable sync logic on top of http is not trivial to me at my current skill
5) hosting the database on file host
not every user will know how to use this database :) you need some kind of UI for this to be


I have dipped my toes into various niche hobby groups, such as Akiba. I think most JAV users (BT+filehost) are either too lazy to perform any management, or smart enough to download a file and perform a local install/import. Just like you hardly ever run into a n00b linux user.

I'm an engineer, not a pro coder (hence Matlab). What I'm driving at is the hope that our various efforts can benefit each other wherever it makes sense. I believe the scope I define for myself is the narrowest possible and also the most fundamental (back end) to everyone else's bigger and more ambitious projects.

Since you are a linux fan, surely you know the power of small, efficient, limited programs like cron, grep, awk, sed, cut and paste, and how powerful programs can be created by putting them together in shell scripts.

Also XBMC uses sqlite, so I guess it's quite easy for me to insert data in there as well. I downloaded the XBMC source to take a look at their schema

But there's no need to! It's better to just use it, rather than develop for it (or worse: fork it). It's powerful enough to do what any JAV consumer could hope for, only requiring someone to provide the metadata. E.g. it provides a simple and powerful scripting language for you to write or tweak a scraper, but there isn't even a need for that! Since we are preparing a database on our own, there's no guesswork or moving-target problem (websites changing format). If you are really geeky about it, you could write add-ons for XBMC (and, I presume, any other media center software). Personally I see no reason at all to even inspect the XBMC code, same as I would never even inspect the Linux kernel code.
 

ding73ding

Here are a couple of NFO files generated by my program. I'm also attaching my still-quickly-growing genre dictionary.

The dictionary format is super simple. I keep 3 different dictionary files for now, and each is applied in a slightly different manner (e.g. field name translation is only applied if the match is at the beginning of a line). The attached dictionary is only applied to the "genre" field. If there's a need, and if someone provides the data, it's easy to translate actress names too.

Yes, there can be quite a discussion about some of my choices (sometimes ignoring a more precise linguistic translation).

Code:
二穴同時挿入	double penetration
挿入	penetration
生中出し	cream pie
中出し	cream pie
人妻	wife
花嫁	bride
若妻	bride
熟女	mature lady
巨乳	big tits
美乳	beautiful tits
素敵なオッパイ	beautiful tits
...
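
As an illustration only, here is a minimal Python sketch of how a tab-separated dictionary like the one above could be applied to a genre string. The actual program and its matching rules are not shown in the thread, so the function names (`load_dictionary`, `translate_genre`) and the longest-key-first substring replacement are assumptions. The sample entries are taken from the list above.

```python
# Hypothetical sketch: apply a "japanese<TAB>english" dictionary to a genre.
SAMPLE_DICT = "人妻\twife\n花嫁\tbride\n若妻\tbride\n熟女\tmature lady\n"


def load_dictionary(text):
    """Parse tab-separated lines into a dict, skipping lines without a tab."""
    entries = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue
        source, target = line.split("\t", 1)
        entries[source.strip()] = target.strip()
    return entries


def translate_genre(genre, entries):
    """Replace every dictionary key found in the genre, longest keys first.

    Assumption: longer phrases must win over shorter ones they contain.
    Text with no matching key is returned unchanged.
    """
    result = genre
    for key in sorted(entries, key=len, reverse=True):
        result = result.replace(key, entries[key])
    return result


if __name__ == "__main__":
    entries = load_dictionary(SAMPLE_DICT)
    for genre in ("人妻", "熟女", "デジモ"):
        print(genre, "->", translate_genre(genre, entries))
```

Trying longer keys first keeps a short entry from overwriting part of a longer phrase, and a genre with no dictionary entry simply passes through untranslated.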

Code:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<movie>
	<title>Nana2 Second Coming of Beautiful Tits Angel</title>
	<originaltitle>Nana2 再臨の美乳天使/小倉奈々 Nana Ogura</originaltitle>
	<aired>2012/09/06</aired>
	<runtime>60分</runtime>
	<actor>
		<name>小倉奈々</name>
		<role>herself</role>
		<order>0</order>
	</actor>
	<studio>REbecca</studio>
	<genre>HDTV</genre>
	<genre>idol /entertainer</genre>
	<genre>image video</genre>
	<genre>beautiful tits</genre>
	<id>REBDB-011</id>
	<plot>AVでは邪魔なモザイクや男優を完全排除し、女の子たちの素の部分に迫るREbeccaレーベルに、説明不要のトップアイドル/小倉奈々ちゃんが登場。“ヴィーナスの化身”と表現すべき完璧過ぎる裸体は一見の価値アリ!WMV 1.16GB</plot>
</movie>

Code:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<movie>
	<title>JUFD-303 Chichirich Boing Creampie Special</title>
	<originaltitle>チチリッチ ボインなあのコにドロッと中出しSPECIAL!</originaltitle>
	<aired>2013/09/01</aired>
	<runtime>240分</runtime>
	<actor>
		<name>尾上若葉</name>
		<role>herself</role>
		<order>0</order>
	</actor>
	<actor>
		<name>舞咲みくに</name>
		<role>herself</role>
		<order>1</order>
	</actor>
	<actor>
		<name>星海レイカ</name>
		<role>herself</role>
		<order>2</order>
	</actor>
	<actor>
		<name>野宮さとみ</name>
		<role>herself</role>
		<order>3</order>
	</actor>
	<actor>
		<name>芦名ユリア</name>
		<role>herself</role>
		<order>4</order>
	</actor>
	<director>南★波王</director>
	<studio>Fitch</studio>
	<genre>school girls</genre>
	<genre>slut</genre>
	<genre>big tits</genre>
	<genre>orgy</genre>
	<genre>cream pie</genre>
	<genre>デジモ</genre>
	<id>JUFD-303</id>
	<plot>人気の巨乳&爆乳女優が一挙集結!乳を揺らして腰を振るリッチボディのチアガール女子校生達と夢のような中出し学園生活!あるきっかけで女子校に転校し、チア部のマネージャーになってしまった僕。若葉&さとみに責められるヌルヌルオイル巨乳パイズリ!ユリア&みくに&レイカのドリームフェラ&ローションパイズリ!一人一人とドロっと中出しSEX!ラストはもちろん全員との乱交まで!チンポを包み込む滑らか快感!</plot>
</movie>
 

Attachments

  • dictgenr.txt
    4.2 KB · Views: 547

cyberzen

New Member
Apr 8, 2010
64
21
From what I gather from your requirements what you want is a system like this?

1) web crawler to scrape DMM / retailer
2) output metadata to XML and upload to filehost
3) XBMC plugin that can scrape that metadata and import into XBMC
4) repeat step 1 every day for fresh metadata

Is that accurate?

My point about XBMC using sqlite was that I'm not sure what all that XML is for.
 

cyberzen

Can someone test the windows version? Please report any bugs
Tagu Windows Version

When you open the app, go to
1) menu general > options > add directories for your porn folders
2) menu database > import videos
 

ding73ding

From what I gather from your requirements what you want is a system like this?

I'm not even too keen to discuss detailed requirements before we establish some fundamentals.

1. We agree on a common database structure. It should be both transportable (a common, minimalist core structure) and extendable (allowing fancy extensions like recommendations).

2. We agree on one (or more?) file formats for data exchange/sharing.

3. We agree to actually share such data, in good faith. Note that we should be careful not to get ourselves or even bystanders into trouble. E.g. a retailer might consider a (buggy) web crawler a DoS attack.

4. To enhance sharing (with many other benefits), we should break up each of our (pet) projects into well-defined modules and share the effort, where it makes sense and in good faith.

5. If we get some traction here, we can ask Akiba's admins whether they might let us host something (metadata, code, apps, even pix) here.

To elaborate on database structure:
Each mainstream JAV title is uniquely keyed by its DVD code. I don't know if there are any mainstream titles (digital-only?) that lack a DVD code.
For uncensored titles from major studios (Caribbean, Red Hot etc), there seems to be a way to construct a code (081313-405-carib, RHJ-153). How to deal with studios that don't provide any code is TBD.

Bilingual issues: actress names are relatively straightforward, since official English names can be found and verified (manually if need be).
Title translation is extremely troublesome. I propose we provide the Japanese title as formally supported, and treat the English translation as unofficial and best-effort only.
Genre translation is, IMHO, extremely important, but not too difficult. The total number of genres is finite, and Google Translate plus manual checking is fine. I think the process (with continual updates) is robust enough to use English for genres in the database. But I can foresee some spirited discussion about some translation choices.

Database standaloneness: I strongly believe that the JAV database should be human-readable on its own. So I respectfully oppose CodeGeek's proposal to enumerate categories, directors, actresses etc. Whatever benefit enumeration provides should come from additional database(s) and apps. I know some people consider it important to deal with the alternate aliases used by some actresses. But I propose that if a DVD is released under the name "Miyabi", then that's the name to put in the database for that title. Other titles list the name "Maria Ozawa", so you might complain there's no easy way to find all the titles performed by the same person. To do that, we need a second database which links up all the aliases of the same person.

Which segues nicely into developing and maintaining an actress database. We can handle English and Japanese names and aliases for each actress, and link to a portrait either locally or in the cloud. A bit of imagination lets you make some fancy queries: find all titles of Yuma Asami as a slave when she was 20-25 years old; find all titles of oral cumshot with an actress who was born in Hokkaido and is D-cup or larger.

Finally, for those of you who want to maintain file-specific info in a database, I propose you develop a third database, this one perhaps using the file hash as its unique key.

So we may have a system like:
Code:
<javdb>
    <movie>
        <title>ギリギリモザイク 淫らな肉体</title>
        <id>ONED-409</id>
        <actor>小澤マリア</actor>
        ...
    </movie>
    <movie>
        ...
    </movie>
</javdb>

Code:
<actordb>
    <actor>
        <name>小澤マリア</name>
        <nameEng>Maria Ozawa</nameEng>
        <alias>Miyabi</alias>
        <alias>みやび</alias>
    </actor>
    ....
</actordb>

Code:
<filedb>
    <file>
        <hash>12345678ABCD</hash>
        <id>ONED-409</id>
        <codec>...</codec>
        <hardsub>none</hardsub>
        <softsub>English</softsub>
        <softsub>Japanese</softsub>
        ...
    </file>
    ....
</filedb>
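
To make the split between the title database and the actress database concrete, here is a rough Python sketch that looks up every title for one person regardless of which alias appears on the cover. It assumes the element names from the XML outlines above. The helper names (`names_for`, `titles_by_person`), the placeholder titles and the `DEMO-...` codes are made up for the example.

```python
import xml.etree.ElementTree as ET

# Tiny stand-ins for the two databases; DEMO-* entries are hypothetical.
JAVDB = """<javdb>
    <movie><title>Title A</title><id>ONED-409</id><actor>小澤マリア</actor></movie>
    <movie><title>Title B</title><id>DEMO-001</id><actor>Miyabi</actor></movie>
    <movie><title>Title C</title><id>DEMO-002</id><actor>Someone Else</actor></movie>
</javdb>"""

ACTORDB = """<actordb>
    <actor>
        <name>小澤マリア</name>
        <nameEng>Maria Ozawa</nameEng>
        <alias>Miyabi</alias>
        <alias>みやび</alias>
    </actor>
</actordb>"""

NAME_TAGS = ("name", "nameEng", "alias")


def names_for(actordb_root, any_name):
    """Return all known names of the person that any_name refers to."""
    for actor in actordb_root.findall("actor"):
        names = {child.text for child in actor if child.tag in NAME_TAGS}
        if any_name in names:
            return names
    # Unknown to the actress database: fall back to the name as given.
    return {any_name}


def titles_by_person(javdb_root, actordb_root, any_name):
    """List the ids of movies credited to any alias of the given person."""
    names = names_for(actordb_root, any_name)
    return [
        movie.findtext("id")
        for movie in javdb_root.findall("movie")
        if any(actor.text in names for actor in movie.findall("actor"))
    ]


if __name__ == "__main__":
    javdb = ET.fromstring(JAVDB)
    actordb = ET.fromstring(ACTORDB)
    print(titles_by_person(javdb, actordb, "Maria Ozawa"))
```

The title database stays readable on its own because each movie keeps the name printed on the release; the alias linking lives entirely in the second file.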

1) web crawler to scrape DMM / retailer
I just learned that DMM has become partially open to my region. After using it for a while, I found quite a few cons to go with some tremendous pros as a candidate data source.

BTW, let's distinguish scraping from crawling. Scraping is performed by each user, who pulls only the data he needs from the source. Crawling is performed by a host, e.g. Google, accessing (and presumably caching) all data indiscriminately from the source.

Anyway, the key point is that I have no idea which site could serve as a good source for scraping and/or crawling. DMM has a fatal issue: AFAIK, the product code listed on their product pages is mangled from the official product code, and I don't see a simple algorithm to extract the official product code from the mangled DMM-specific code. Their search engine (at least from my region) doesn't take product codes either. And when I access it from a foreign region, their catalog seems to be incomplete (only download and/or streaming is available; DVD product pages are not accessible).

A couple of huge factors in DMM's favor: every single page on the DMM site is available in both Japanese and English. The English pages seem to suffer quite a bit from robotic translation (probably Google Translate), so e.g. some actress names get butchered by literal (instead of phonetic) translation. Another big plus: there's a pretty good portrait gallery for most JAV actresses.

Amazon Japan is even worse: their product code has no relation to the JAV product code.

So far I scrape by hand. I start with sometimes a DVD code, sometimes a Japanese string (title) and once in a while even an actress name in English, then Google or Yahoo it, and hope to find a Japanese hit with metadata and a cover image (and screen caps if possible). Almost all of the time the textual data I find is in roughly the same format as DMM's, so I've written a tool to parse such text.

2) output metadata to XML and upload to filehost

That's only one of the absolutely doable ways to share data. It's admittedly quite clumsy. I proposed it only to say: hey, we could have done this yesterday, there's no barrier of any kind. Hopefully before long we can find a suitable free (or low-cost) host. Say... you have a site of your own, would you host at least the metadata? At least temporarily?

3) XBMC plugin that can scrape that metadata and import into XBMC
There's no need to write a plug-in or add-on for XBMC; it's just a simple import. And if scraping is really needed, I think scraping by anything other than DVD code will be problematic. I foresee it being much easier to write an app that scans my movie folder and generates a file for importing. Not as slick as an XBMC add-on, but I'm too lazy to learn a new thing that's useless for my day job.
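
A folder-scanning app of that kind could start from something like the Python sketch below. It is only a guess at the approach: it assumes file names contain a code of the form letters-hyphen-digits (like JUFD-303 or REBDB-011), and the regular expression, the extension list and the function names are all hypothetical. Constructed codes such as 081313-405-carib are not handled.

```python
import re
from pathlib import Path

# Assumed code shape: 2-5 letters, a hyphen, then 3-4 digits.
CODE_PATTERN = re.compile(r"(?<![A-Za-z])([A-Za-z]{2,5})-(\d{3,4})(?!\d)")

# Assumed set of video file extensions to look at.
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".avi", ".wmv"}


def extract_code(filename):
    """Return the upper-cased DVD code found in a file name, or None."""
    match = CODE_PATTERN.search(filename)
    if match is None:
        return None
    return f"{match.group(1).upper()}-{match.group(2)}"


def scan_folder(folder):
    """Map each video file under folder to the DVD code in its name."""
    found = {}
    for path in Path(folder).rglob("*"):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS:
            code = extract_code(path.stem)
            if code is not None:
                found[str(path)] = code
    return found


if __name__ == "__main__":
    for video, code in sorted(scan_folder(".").items()):
        print(code, video)
```

The resulting path-to-code mapping is what a later step would use to pick the matching metadata record and write one import file per video.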
 

cyberzen

There's no need to write a plug-in or add-on for XBMC

What you don't seem to understand is that for XBMC to import the metadata, it has to do it through a plugin. The IMDB scraper has to scrape IMDB. It's not just a matter of adding the JAV metadata website to the IMDB scraper; it won't know what to do with the scraped data. You will need code (an XBMC plugin) to transform the data from IMDB (or a JAV metadata website) into your local XBMC database.

I strongly believe that the JAV database should be human-readable on its own
XML is a passable way of transferring data from one machine to another, but it is a VERY inefficient database. sqlite is a much better way to store relational data; with XML you can't do any kind of normalization or any kind of meaningful cross-referencing of your data. Can XML tell you which videos feature Maria Ozawa within the date range 1992 - 2000? Also, since there is no normalization and it is plain text, your database will be enormous. My sqlite database, which holds all my 20,000+ videos with tags and actresses, takes up only 1.5 MB of space.
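
For readers unfamiliar with what normalization buys here, the following is a small Python sketch of an actress-plus-date-range query against a normalized sqlite schema. The table and column names (`video`, `actress`, `video_actress`, `released`) and all sample rows are invented for the example; the real schema of the app is not shown in the thread.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical normalized schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE video (id INTEGER PRIMARY KEY, code TEXT UNIQUE, released TEXT);
        CREATE TABLE actress (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE video_actress (
            video_id INTEGER REFERENCES video(id),
            actress_id INTEGER REFERENCES actress(id),
            PRIMARY KEY (video_id, actress_id)
        );
    """)
    conn.executemany("INSERT INTO video VALUES (?, ?, ?)", [
        (1, "DEMO-001", "2012-09-06"),
        (2, "DEMO-002", "2013-09-01"),
        (3, "DEMO-003", "2014-01-15"),
    ])
    conn.executemany("INSERT INTO actress VALUES (?, ?)", [
        (1, "Actress A"),
        (2, "Actress B"),
    ])
    conn.executemany("INSERT INTO video_actress VALUES (?, ?)", [
        (1, 1), (2, 1), (2, 2), (3, 1),
    ])
    return conn


def codes_for(conn, name, start, end):
    """Return codes of videos with the named actress released in [start, end]."""
    rows = conn.execute("""
        SELECT v.code
        FROM video v
        JOIN video_actress va ON va.video_id = v.id
        JOIN actress a ON a.id = va.actress_id
        WHERE a.name = ? AND v.released BETWEEN ? AND ?
        ORDER BY v.released
    """, (name, start, end))
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(codes_for(build_demo_db(), "Actress A", "2012-01-01", "2013-12-31"))
```

Each actress name is stored once and referenced by id, which is where the space saving comes from; ISO-formatted date strings compare correctly as text, so `BETWEEN` works without a date type.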

Another consideration you seem to gloss over is who is going to pay for the web server. Sure, it's not going to cost much at first, but if it gains any kind of real popularity, are you willing to fork over 1k a month? What plans do you have to sustain this over the long term? For my part, I plan to offer my client free of charge, and charge $1 a month if you want to access additional metadata from the server.
 

ding73ding

I saw your video demo and took a very brief look at your code. It's technically quite impressive; let me say you are definitely a more capable programmer than me.

But I must also state that it's extremely unlikely I will ever use your app, even with all that you've promised plus 10x more. I've been an XBMC fan for a long time, and when I switch horses in the future (something will surpass XBMC, with 100% probability), it will be to another app that's even more XBMC than XBMC, if you know what I mean. Well, it may turn out to be your app (many generations later) that surpasses XBMC, but I don't think that's what you wish to do.

I'm not married to XBMC, but I am married to a slick and polished home theatre experience. I can live with XBMC without metadata, but I can't live with an app that has advanced browsing, searching and tagging but looks like xwin.

So at best I can only give you some support on your project, but I ask you to consider designing or managing it in such a way that other projects could pick and choose some useful bits and modules.

Here are some screenshots of XBMC's interface after I imported some metadata and cover images. It would look even better if I imported actress portraits.

Screenshot_2014-02-21-13-44-57.png
Screenshot_2014-02-21-13-48-00.png
Screenshot_2014-02-21-13-49-41.png
Screenshot_2014-02-21-13-50-35.png
Screenshot_2014-02-21-13-51-12.png
Screenshot_2014-02-21-13-51-25.png
 

cyberzen

Like I said, I did try XBMC out before and I don't like it. Not everyone has to like the same app right? And my intention with this app is not to replace XBMC.

Still, I do realize that a lot of people like XBMC, and a way for my program to export data to XBMC could be one of the features I develop in the future. But that is not on my priority list right now. My focus is getting sync to work, and after that providing stable, bug-free apps for Windows, Linux and OSX.

All the data from my app is stored in sqlite, which is very easily viewable (and exportable) in any sqlite client. It does not get any easier than 'select * from LibraryView'; that view gives a list of all your videos with tags and actresses. And since XBMC also uses sqlite as its database, it will be very easy to import and export.

This is what the listing from 'select * from LibraryView' looks like

title   | rating | tags                | acts        | path
--------+--------+---------------------+-------------+-----------
RBD-555 | 7      | teacher,r***,crying | Kaho Kasumi | /disk2/jav
JUX-234 | 6      | wife,r***,cuckold   | Noa Imai    | /disk2/jav
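
A view with that shape could be defined roughly as in the Python sketch below. This is a guess for illustration, not the app's actual schema: only the view name `LibraryView` and its five column names come from the listing above, while the underlying tables (`video`, `tag`, `act`) and the sample row are hypothetical.

```python
import sqlite3


def build_library():
    """Create an in-memory database exposing a LibraryView-style view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE video (id INTEGER PRIMARY KEY, title TEXT, rating INTEGER, path TEXT);
        CREATE TABLE tag (video_id INTEGER REFERENCES video(id), name TEXT);
        CREATE TABLE act (video_id INTEGER REFERENCES video(id), name TEXT);

        -- One row per video; tags and actresses are folded into
        -- comma-separated strings by correlated subqueries.
        CREATE VIEW LibraryView AS
            SELECT v.title,
                   v.rating,
                   (SELECT group_concat(name) FROM tag WHERE video_id = v.id) AS tags,
                   (SELECT group_concat(name) FROM act WHERE video_id = v.id) AS acts,
                   v.path
            FROM video v;

        INSERT INTO video VALUES (1, 'DEMO-001', 7, '/disk2/jav');
        INSERT INTO tag VALUES (1, 'teacher'), (1, 'drama');
        INSERT INTO act VALUES (1, 'Actress A');
    """)
    return conn


if __name__ == "__main__":
    for row in build_library().execute("select * from LibraryView"):
        print(" | ".join(str(value) for value in row))
```

The tables stay normalized underneath, while the view gives the flat, human-readable listing; the order of items inside each `group_concat` string is not guaranteed by sqlite.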
 

iori11

Member
Nov 25, 2009
100
2
Can someone test the windows version? Please report any bugs
Tagu Windows Version

When you open the app, go to
1) menu general > options > add directories for your porn folders
2) menu database > import videos

First of all, thanks for sharing your hard work. Your app is simple and does what I want (I don't want any XBMC plugin... do you see me inviting friends into my living room to watch JAV together? Nope ;) )

About the test: I tried it on Windows 7 and 8, and both show the same artifact as in the attached capture, in all the views. Basically, when I move over a movie, a large portion of the list disappears.

Hope it helps. Keep up the good work.
 

Attachments

  • Capture.PNG
    13.8 KB · Views: 195