|
View:
New views
15 Messages
—
Rating Filter:
Alert me
|
|
|
[Building Sakai] TinyURL Service Impl IssueSummary: There is an ever increasing probability that the collision of
IDs in the TinyURL service will slow down the service considerably. In looking at the implementation of the TinyURL service, I've noticed something about the way URLs are generated. If I'm headed in the wrong direction with this, please correct me. The very general approach is: 1) generate a random ID 2) check that the ID doesn't already exist 3) if exists, go to 1. if not, record the ID and release it to the caller. The following code for generating a URL: Random generator = new Random(); int seed = Math.abs(generator.nextInt()); has an issue with collision which is addressed by checking for the existence of that ID before handing it out but the algorithm run time will continue to grow because of the growing chance of collision over time. Explanation of concern: The first generated ID will have a 0/n chance of collision where n is the total number of possible IDs. The next ID has a 1/n chance, the next 2/n, etc up to the last having a chance of collision being n-1/n. Considering n needs to be sufficiently large to have a low to zero rate of duplication over time, this check will take more time because of the number of collisions over time. Assuming that the ID column in the database is indexed, the lookup time is minimized but there's still a call to the database lookup on each generated ID. As the number of IDs generated reaches the max possible, there are possibly n-1 lookups before a valid ID is found. The current code recreates the Random each time which stops nextInt() from knowing history (ie. new random seed on each generation). Java implements a linear congruential generator to generate random numbers which requires knowledge of the previously generated ID to be effective. Math.random() creates the random seed once and reuses it for the life of the JVM process which is a better approach. I would suggest creating an LCG in the TinyURL service so that the proper range can be easily managed (better performance due to a narrower scope than Math.random()) and the seed can be restored after a restart. I have created a prototype in Python based on the rules laid out in the Wikipedia article on LCG's [1] and using a sufficiently large prime number. This still carries a concern of generating IDs across a cluster but is relatively easy to address. WDYT? [1] http://en.wikipedia.org/wiki/Linear_congruential_generator _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssueCarl,
Sounds good, is there an existing LCG Java implementation we can leverage or should we roll our own? cheers, Steve On 6 Jul 2009, at 12:34, Carl Hall wrote: > Summary: There is an ever increasing probability that the collision of > IDs in the TinyURL service will slow down the service considerably. > > In looking at the implementation of the TinyURL service, I've noticed > something about the way URLs are generated. If I'm headed in the > wrong direction with this, please correct me. > > The very general approach is: > 1) generate a random ID > 2) check that the ID doesn't already exist > 3) if exists, go to 1. if not, record the ID and release it to the > caller. > > The following code for generating a URL: > > Random generator = new Random(); > int seed = Math.abs(generator.nextInt()); > > has an issue with collision which is addressed by checking for the > existence of that ID before handing it out but the algorithm run time > will continue to grow because of the growing chance of collision over > time. > > Explanation of concern: > The first generated ID will have a 0/n chance of collision where n is > the total number of possible IDs. The next ID has a 1/n chance, the > next 2/n, etc up to the last having a chance of collision being n-1/n. > Considering n needs to be sufficiently large to have a low to zero > rate of duplication over time, this check will take more time because > of the number of collisions over time. Assuming that the ID column in > the database is indexed, the lookup time is minimized but there's > still a call to the database lookup on each generated ID. As the > number of IDs generated reaches the max possible, there are possibly > n-1 lookups before a valid ID is found. > > The current code recreates the Random each time which stops nextInt() > from knowing history (ie. new random seed on each generation). Java > implements a linear congruential generator to generate random numbers > which requires knowledge of the previously generated ID to be > effective. Math.random() creates the random seed once and reuses it > for the life of the JVM process which is a better approach. > > I would suggest creating an LCG in the TinyURL service so that the > proper range can be easily managed (better performance due to a > narrower scope than Math.random()) and the seed can be restored after > a restart. I have created a prototype in Python based on the rules > laid out in the Wikipedia article on LCG's [1] and using a > sufficiently large prime number. This still carries a concern of > generating IDs across a cluster but is relatively easy to address. > > WDYT? > > [1] http://en.wikipedia.org/wiki/Linear_congruential_generator > _______________________________________________ > sakai-dev mailing list > sakai-dev@... > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... > with a subject of "unsubscribe" _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssuePlease take a look at the comments at the head of the "EighteenIDGenerator"
file for some details of the maths involved in computing how many documents will arise before a collision: https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java I wouldn't really recommend using a key space smaller than this one if you want to avoid collisions - although of course you will reprobe to be sure. Another issue with LCGs is that they are highly predictable - this is why I use a SecureRandom here: One of the requirements of the TinyURLService that I made for the work at CARET was that the ID sequence could not be predicted - to prevent the possibility of malicious "hijacking" of upcoming URL slots (this was actually a system designed to allocate small URLs that were plaintext signup URLs routed to a PHP system, perhaps more risk than is present in a typical application!) Note that a further issue you have with an LCG is running it in a cluster - even if you can lock on the particular instance you have, you can't guarantee that some other instance in the cluster has not stepped on past the slot you want. I think that the approach in my service (a SecureRandom plus a transactional check into persistence) is most likely to work reasonably well in a cluster - although it hasn't been tested there :P Cheers, Antranig. Quoting Steve Swinsburg <s.swinsburg@...>: > Carl, > > Sounds good, is there an existing LCG Java implementation we can > leverage or should we roll our own? > > cheers, > Steve > > > On 6 Jul 2009, at 12:34, Carl Hall wrote: > > > Summary: There is an ever increasing probability that the collision of > > IDs in the TinyURL service will slow down the service considerably. > > > > In looking at the implementation of the TinyURL service, I've noticed > > something about the way URLs are generated. If I'm headed in the > > wrong direction with this, please correct me. > > > > The very general approach is: > > 1) generate a random ID > > 2) check that the ID doesn't already exist > > 3) if exists, go to 1. if not, record the ID and release it to the > > caller. > > > > The following code for generating a URL: > > > > Random generator = new Random(); > > int seed = Math.abs(generator.nextInt()); > > > > has an issue with collision which is addressed by checking for the > > existence of that ID before handing it out but the algorithm run time > > will continue to grow because of the growing chance of collision over > > time. > > > > Explanation of concern: > > The first generated ID will have a 0/n chance of collision where n is > > the total number of possible IDs. The next ID has a 1/n chance, the > > next 2/n, etc up to the last having a chance of collision being n-1/n. > > Considering n needs to be sufficiently large to have a low to zero > > rate of duplication over time, this check will take more time because > > of the number of collisions over time. Assuming that the ID column in > > the database is indexed, the lookup time is minimized but there's > > still a call to the database lookup on each generated ID. As the > > number of IDs generated reaches the max possible, there are possibly > > n-1 lookups before a valid ID is found. > > > > The current code recreates the Random each time which stops nextInt() > > from knowing history (ie. new random seed on each generation). Java > > implements a linear congruential generator to generate random numbers > > which requires knowledge of the previously generated ID to be > > effective. Math.random() creates the random seed once and reuses it > > for the life of the JVM process which is a better approach. > > > > I would suggest creating an LCG in the TinyURL service so that the > > proper range can be easily managed (better performance due to a > > narrower scope than Math.random()) and the seed can be restored after > > a restart. I have created a prototype in Python based on the rules > > laid out in the Wikipedia article on LCG's [1] and using a > > sufficiently large prime number. This still carries a concern of > > generating IDs across a cluster but is relatively easy to address. > > > > WDYT? > > > > [1] http://en.wikipedia.org/wiki/Linear_congruential_generator > > _______________________________________________ > > sakai-dev mailing list > > sakai-dev@... > > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > > > > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... > > with a subject of "unsubscribe" > > _______________________________________________ > sakai-dev mailing list > sakai-dev@... > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... > with a subject of "unsubscribe" > ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssueYou could consider using 22 character uuids as autogenerated tinies in
the url. It might not be able to be called tinyurl anymore, but it would at least be short, linkable and (much more) collision free. http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string I'm not convinced it would be necessary to keep these id's from being unpredictable. The current security model should prevent anyone from accessing content they don't have permission to even guessing id's. I think eventually when a UI was designed for this it might even want to allow the user to specify a custom ID, similar to how the Site Aliases (on new site creation) works, allowing a user to specify anything they want for their site alias. (Like tinyurl.com) So this would almost guarantee collisions. So I wouldn't think that the possibility of a collision on an low use feature for an indexed column (when collisions will probably be featured in eventually anyway) likely won't be a big deal. -Matthew On Tue, Jul 7, 2009 at 7:08 PM, <antranig@...> wrote: > Please take a look at the comments at the head of the "EighteenIDGenerator" > file for some details of the maths involved in computing how many > documents will arise before a collision: > https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java > > I wouldn't really recommend using a key space smaller than this one > if you want to avoid collisions - although of course you will reprobe > to be sure. Another issue with LCGs is that they are highly predictable - > this is why I use a SecureRandom here: One of the requirements of the > TinyURLService that I made for the work at CARET was that the ID > sequence could not be predicted - to prevent the possibility of malicious > "hijacking" of upcoming URL slots (this was actually a system designed > to allocate small URLs that were plaintext signup URLs routed to a PHP > system, perhaps more risk than is present in a typical application!) > > Note that a further issue you have with an LCG is running it in a cluster - > even if you can lock on the particular instance you have, you can't > guarantee that some other instance in the cluster has not stepped on > past the slot you want. I think that the approach in my service > (a SecureRandom plus a transactional check into persistence) is most > likely to work reasonably well in a cluster - although it hasn't been > tested there :P > > Cheers, > Antranig. > > > Quoting Steve Swinsburg <s.swinsburg@...>: > >> Carl, >> >> Sounds good, is there an existing LCG Java implementation we can >> leverage or should we roll our own? >> >> cheers, >> Steve >> >> >> On 6 Jul 2009, at 12:34, Carl Hall wrote: >> >> > Summary: There is an ever increasing probability that the collision of >> > IDs in the TinyURL service will slow down the service considerably. >> > >> > In looking at the implementation of the TinyURL service, I've noticed >> > something about the way URLs are generated. If I'm headed in the >> > wrong direction with this, please correct me. >> > >> > The very general approach is: >> > 1) generate a random ID >> > 2) check that the ID doesn't already exist >> > 3) if exists, go to 1. if not, record the ID and release it to the >> > caller. >> > >> > The following code for generating a URL: >> > >> > Random generator = new Random(); >> > int seed = Math.abs(generator.nextInt()); >> > >> > has an issue with collision which is addressed by checking for the >> > existence of that ID before handing it out but the algorithm run time >> > will continue to grow because of the growing chance of collision over >> > time. >> > >> > Explanation of concern: >> > The first generated ID will have a 0/n chance of collision where n is >> > the total number of possible IDs. The next ID has a 1/n chance, the >> > next 2/n, etc up to the last having a chance of collision being n-1/n. >> > Considering n needs to be sufficiently large to have a low to zero >> > rate of duplication over time, this check will take more time because >> > of the number of collisions over time. Assuming that the ID column in >> > the database is indexed, the lookup time is minimized but there's >> > still a call to the database lookup on each generated ID. As the >> > number of IDs generated reaches the max possible, there are possibly >> > n-1 lookups before a valid ID is found. >> > >> > The current code recreates the Random each time which stops nextInt() >> > from knowing history (ie. new random seed on each generation). Java >> > implements a linear congruential generator to generate random numbers >> > which requires knowledge of the previously generated ID to be >> > effective. Math.random() creates the random seed once and reuses it >> > for the life of the JVM process which is a better approach. >> > >> > I would suggest creating an LCG in the TinyURL service so that the >> > proper range can be easily managed (better performance due to a >> > narrower scope than Math.random()) and the seed can be restored after >> > a restart. I have created a prototype in Python based on the rules >> > laid out in the Wikipedia article on LCG's [1] and using a >> > sufficiently large prime number. This still carries a concern of >> > generating IDs across a cluster but is relatively easy to address. >> > >> > WDYT? >> > >> > [1] http://en.wikipedia.org/wiki/Linear_congruential_generator >> > _______________________________________________ >> > sakai-dev mailing list >> > sakai-dev@... >> > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> > >> > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >> > with a subject of "unsubscribe" >> >> _______________________________________________ >> sakai-dev mailing list >> sakai-dev@... >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> >> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >> with a subject of "unsubscribe" >> > > > > > ---------------------------------------------------------------- > This message was sent using IMP, the Internet Messaging Program. > > _______________________________________________ > sakai-dev mailing list > sakai-dev@... > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" > sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssueIs there a reason why the id generation could be more like tinyURL and
simply be sequential and thus very short (at least initially and probably for most institutions it would stay small). Just a thought. -AZ On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm@...> wrote: > You could consider using 22 character uuids as autogenerated tinies in > the url. It might not be able to be called tinyurl anymore, but it > would at least be short, linkable and (much more) collision free. > > http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string > > I'm not convinced it would be necessary to keep these id's from being > unpredictable. The current security model should prevent anyone from > accessing content they don't have permission to even guessing id's. I > think eventually when a UI was designed for this it might even want to > allow the user to specify a custom ID, similar to how the Site Aliases > (on new site creation) works, allowing a user to specify anything they > want for their site alias. (Like tinyurl.com) So this would almost > guarantee collisions. > > So I wouldn't think that the possibility of a collision on an low use > feature for an indexed column (when collisions will probably be > featured in eventually anyway) likely won't be a big deal. > > -Matthew > > On Tue, Jul 7, 2009 at 7:08 PM, <antranig@...> wrote: >> Please take a look at the comments at the head of the "EighteenIDGenerator" >> file for some details of the maths involved in computing how many >> documents will arise before a collision: >> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java >> >> I wouldn't really recommend using a key space smaller than this one >> if you want to avoid collisions - although of course you will reprobe >> to be sure. Another issue with LCGs is that they are highly predictable - >> this is why I use a SecureRandom here: One of the requirements of the >> TinyURLService that I made for the work at CARET was that the ID >> sequence could not be predicted - to prevent the possibility of malicious >> "hijacking" of upcoming URL slots (this was actually a system designed >> to allocate small URLs that were plaintext signup URLs routed to a PHP >> system, perhaps more risk than is present in a typical application!) >> >> Note that a further issue you have with an LCG is running it in a cluster - >> even if you can lock on the particular instance you have, you can't >> guarantee that some other instance in the cluster has not stepped on >> past the slot you want. I think that the approach in my service >> (a SecureRandom plus a transactional check into persistence) is most >> likely to work reasonably well in a cluster - although it hasn't been >> tested there :P >> >> Cheers, >> Antranig. >> >> >> Quoting Steve Swinsburg <s.swinsburg@...>: >> >>> Carl, >>> >>> Sounds good, is there an existing LCG Java implementation we can >>> leverage or should we roll our own? >>> >>> cheers, >>> Steve >>> >>> >>> On 6 Jul 2009, at 12:34, Carl Hall wrote: >>> >>> > Summary: There is an ever increasing probability that the collision of >>> > IDs in the TinyURL service will slow down the service considerably. >>> > >>> > In looking at the implementation of the TinyURL service, I've noticed >>> > something about the way URLs are generated. If I'm headed in the >>> > wrong direction with this, please correct me. >>> > >>> > The very general approach is: >>> > 1) generate a random ID >>> > 2) check that the ID doesn't already exist >>> > 3) if exists, go to 1. if not, record the ID and release it to the >>> > caller. >>> > >>> > The following code for generating a URL: >>> > >>> > Random generator = new Random(); >>> > int seed = Math.abs(generator.nextInt()); >>> > >>> > has an issue with collision which is addressed by checking for the >>> > existence of that ID before handing it out but the algorithm run time >>> > will continue to grow because of the growing chance of collision over >>> > time. >>> > >>> > Explanation of concern: >>> > The first generated ID will have a 0/n chance of collision where n is >>> > the total number of possible IDs. The next ID has a 1/n chance, the >>> > next 2/n, etc up to the last having a chance of collision being n-1/n. >>> > Considering n needs to be sufficiently large to have a low to zero >>> > rate of duplication over time, this check will take more time because >>> > of the number of collisions over time. Assuming that the ID column in >>> > the database is indexed, the lookup time is minimized but there's >>> > still a call to the database lookup on each generated ID. As the >>> > number of IDs generated reaches the max possible, there are possibly >>> > n-1 lookups before a valid ID is found. >>> > >>> > The current code recreates the Random each time which stops nextInt() >>> > from knowing history (ie. new random seed on each generation). Java >>> > implements a linear congruential generator to generate random numbers >>> > which requires knowledge of the previously generated ID to be >>> > effective. Math.random() creates the random seed once and reuses it >>> > for the life of the JVM process which is a better approach. >>> > >>> > I would suggest creating an LCG in the TinyURL service so that the >>> > proper range can be easily managed (better performance due to a >>> > narrower scope than Math.random()) and the seed can be restored after >>> > a restart. I have created a prototype in Python based on the rules >>> > laid out in the Wikipedia article on LCG's [1] and using a >>> > sufficiently large prime number. This still carries a concern of >>> > generating IDs across a cluster but is relatively easy to address. >>> > >>> > WDYT? >>> > >>> > [1] http://en.wikipedia.org/wiki/Linear_congruential_generator >>> > _______________________________________________ >>> > sakai-dev mailing list >>> > sakai-dev@... >>> > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>> > >>> > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>> > with a subject of "unsubscribe" >>> >>> _______________________________________________ >>> sakai-dev mailing list >>> sakai-dev@... >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>> >>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>> with a subject of "unsubscribe" >>> >> >> >> >> >> ---------------------------------------------------------------- >> This message was sent using IMP, the Internet Messaging Program. >> >> _______________________________________________ >> sakai-dev mailing list >> sakai-dev@... >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> >> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" >> > _______________________________________________ > sakai-dev mailing list > sakai-dev@... > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" > -- Aaron Zeckoski (azeckoski (at) vt.edu) Senior Research Engineer - CARET - University of Cambridge https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssueThe security is handled in the container anyway so you get a shortened
URL and it resolves into the full URL, redirects and Sakai takes over. Even if it was sequential it wouldn't be a security concern. With a keyspace of lowercase letters and numbers(36 chars) there are about 2 billion combinations, with uppercase as well there's 50 billion. Info here on the theoretical limits: http://bit.ly/3QG2uj Using RandomStringUtils from commons-lang, I wrote a short method to iterate over a million alphanumeric combinations using letters and numbers of length 6 chars and didn't run into a single collision. Using this method and a transactional check when adding into the database is probably the way to go. We can also filter for bad words, BBC has an interesting list (thanks Seth! :) WDYT? Steve On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote: > Is there a reason why the id generation could be more like tinyURL and > simply be sequential and thus very short (at least initially and > probably for most institutions it would stay small). > > Just a thought. > -AZ > > > On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm@...> > wrote: >> You could consider using 22 character uuids as autogenerated tinies >> in >> the url. It might not be able to be called tinyurl anymore, but it >> would at least be short, linkable and (much more) collision free. >> >> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string >> >> I'm not convinced it would be necessary to keep these id's from being >> unpredictable. The current security model should prevent anyone from >> accessing content they don't have permission to even guessing id's. I >> think eventually when a UI was designed for this it might even want >> to >> allow the user to specify a custom ID, similar to how the Site >> Aliases >> (on new site creation) works, allowing a user to specify anything >> they >> want for their site alias. (Like tinyurl.com) So this would almost >> guarantee collisions. >> >> So I wouldn't think that the possibility of a collision on an low use >> feature for an indexed column (when collisions will probably be >> featured in eventually anyway) likely won't be a big deal. >> >> -Matthew >> >> On Tue, Jul 7, 2009 at 7:08 PM, <antranig@...> wrote: >>> Please take a look at the comments at the head of the >>> "EighteenIDGenerator" >>> file for some details of the maths involved in computing how many >>> documents will arise before a collision: >>> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java >>> >>> I wouldn't really recommend using a key space smaller than this one >>> if you want to avoid collisions - although of course you will >>> reprobe >>> to be sure. Another issue with LCGs is that they are highly >>> predictable - >>> this is why I use a SecureRandom here: One of the requirements of >>> the >>> TinyURLService that I made for the work at CARET was that the ID >>> sequence could not be predicted - to prevent the possibility of >>> malicious >>> "hijacking" of upcoming URL slots (this was actually a system >>> designed >>> to allocate small URLs that were plaintext signup URLs routed to a >>> PHP >>> system, perhaps more risk than is present in a typical application!) >>> >>> Note that a further issue you have with an LCG is running it in a >>> cluster - >>> even if you can lock on the particular instance you have, you can't >>> guarantee that some other instance in the cluster has not stepped on >>> past the slot you want. I think that the approach in my service >>> (a SecureRandom plus a transactional check into persistence) is most >>> likely to work reasonably well in a cluster - although it hasn't >>> been >>> tested there :P >>> >>> Cheers, >>> Antranig. >>> >>> >>> Quoting Steve Swinsburg <s.swinsburg@...>: >>> >>>> Carl, >>>> >>>> Sounds good, is there an existing LCG Java implementation we can >>>> leverage or should we roll our own? >>>> >>>> cheers, >>>> Steve >>>> >>>> >>>> On 6 Jul 2009, at 12:34, Carl Hall wrote: >>>> >>>>> Summary: There is an ever increasing probability that the >>>>> collision of >>>>> IDs in the TinyURL service will slow down the service >>>>> considerably. >>>>> >>>>> In looking at the implementation of the TinyURL service, I've >>>>> noticed >>>>> something about the way URLs are generated. If I'm headed in the >>>>> wrong direction with this, please correct me. >>>>> >>>>> The very general approach is: >>>>> 1) generate a random ID >>>>> 2) check that the ID doesn't already exist >>>>> 3) if exists, go to 1. if not, record the ID and release it to >>>>> the >>>>> caller. >>>>> >>>>> The following code for generating a URL: >>>>> >>>>> Random generator = new Random(); >>>>> int seed = Math.abs(generator.nextInt()); >>>>> >>>>> has an issue with collision which is addressed by checking for the >>>>> existence of that ID before handing it out but the algorithm run >>>>> time >>>>> will continue to grow because of the growing chance of collision >>>>> over >>>>> time. >>>>> >>>>> Explanation of concern: >>>>> The first generated ID will have a 0/n chance of collision where >>>>> n is >>>>> the total number of possible IDs. The next ID has a 1/n chance, >>>>> the >>>>> next 2/n, etc up to the last having a chance of collision being >>>>> n-1/n. >>>>> Considering n needs to be sufficiently large to have a low to zero >>>>> rate of duplication over time, this check will take more time >>>>> because >>>>> of the number of collisions over time. Assuming that the ID >>>>> column in >>>>> the database is indexed, the lookup time is minimized but there's >>>>> still a call to the database lookup on each generated ID. As the >>>>> number of IDs generated reaches the max possible, there are >>>>> possibly >>>>> n-1 lookups before a valid ID is found. >>>>> >>>>> The current code recreates the Random each time which stops >>>>> nextInt() >>>>> from knowing history (ie. new random seed on each generation). >>>>> Java >>>>> implements a linear congruential generator to generate random >>>>> numbers >>>>> which requires knowledge of the previously generated ID to be >>>>> effective. Math.random() creates the random seed once and >>>>> reuses it >>>>> for the life of the JVM process which is a better approach. >>>>> >>>>> I would suggest creating an LCG in the TinyURL service so that the >>>>> proper range can be easily managed (better performance due to a >>>>> narrower scope than Math.random()) and the seed can be restored >>>>> after >>>>> a restart. I have created a prototype in Python based on the >>>>> rules >>>>> laid out in the Wikipedia article on LCG's [1] and using a >>>>> sufficiently large prime number. This still carries a concern of >>>>> generating IDs across a cluster but is relatively easy to address. >>>>> >>>>> WDYT? >>>>> >>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator >>>>> _______________________________________________ >>>>> sakai-dev mailing list >>>>> sakai-dev@... >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>> >>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>>>> with a subject of "unsubscribe" >>>> >>>> _______________________________________________ >>>> sakai-dev mailing list >>>> sakai-dev@... >>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>> >>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>>> with a subject of "unsubscribe" >>>> >>> >>> >>> >>> >>> ---------------------------------------------------------------- >>> This message was sent using IMP, the Internet Messaging Program. >>> >>> _______________________________________________ >>> sakai-dev mailing list >>> sakai-dev@... >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>> >>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>> with a subject of "unsubscribe" >>> >> _______________________________________________ >> sakai-dev mailing list >> sakai-dev@... >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> >> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >> with a subject of "unsubscribe" >> > > > > -- > Aaron Zeckoski (azeckoski (at) vt.edu) > Senior Research Engineer - CARET - University of Cambridge > https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski > http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile > _______________________________________________ > sakai-dev mailing list > sakai-dev@... > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... > with a subject of "unsubscribe" _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl Issuei love the bad words filter idea, once inserted a one-line filter in a
perl script (for displaying student comments on course surveys) to replace offending constructions with asterisks. easily the funniest line of code i ever constructed, especially with the regular expressions :) jim Steve Swinsburg wrote: > The security is handled in the container anyway so you get a shortened > URL and it resolves into the full URL, redirects and Sakai takes over. > Even if it was sequential it wouldn't be a security concern. > > With a keyspace of lowercase letters and numbers(36 chars) there are > about 2 billion combinations, with uppercase as well there's 50 > billion. Info here on the theoretical limits: > http://bit.ly/3QG2uj > > Using RandomStringUtils from commons-lang, I wrote a short method to > iterate over a million alphanumeric combinations using letters and > numbers of length 6 chars and didn't run into a single collision. > Using this method and a transactional check when adding into the > database is probably the way to go. > > We can also filter for bad words, BBC has an interesting list (thanks > Seth! :) > > WDYT? > > Steve > > > On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote: > > >> Is there a reason why the id generation could be more like tinyURL and >> simply be sequential and thus very short (at least initially and >> probably for most institutions it would stay small). >> >> Just a thought. >> -AZ >> >> >> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm@...> >> wrote: >> >>> You could consider using 22 character uuids as autogenerated tinies >>> in >>> the url. It might not be able to be called tinyurl anymore, but it >>> would at least be short, linkable and (much more) collision free. >>> >>> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string >>> >>> I'm not convinced it would be necessary to keep these id's from being >>> unpredictable. The current security model should prevent anyone from >>> accessing content they don't have permission to even guessing id's. I >>> think eventually when a UI was designed for this it might even want >>> to >>> allow the user to specify a custom ID, similar to how the Site >>> Aliases >>> (on new site creation) works, allowing a user to specify anything >>> they >>> want for their site alias. (Like tinyurl.com) So this would almost >>> guarantee collisions. >>> >>> So I wouldn't think that the possibility of a collision on an low use >>> feature for an indexed column (when collisions will probably be >>> featured in eventually anyway) likely won't be a big deal. >>> >>> -Matthew >>> >>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig@...> wrote: >>> >>>> Please take a look at the comments at the head of the >>>> "EighteenIDGenerator" >>>> file for some details of the maths involved in computing how many >>>> documents will arise before a collision: >>>> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java >>>> >>>> I wouldn't really recommend using a key space smaller than this one >>>> if you want to avoid collisions - although of course you will >>>> reprobe >>>> to be sure. Another issue with LCGs is that they are highly >>>> predictable - >>>> this is why I use a SecureRandom here: One of the requirements of >>>> the >>>> TinyURLService that I made for the work at CARET was that the ID >>>> sequence could not be predicted - to prevent the possibility of >>>> malicious >>>> "hijacking" of upcoming URL slots (this was actually a system >>>> designed >>>> to allocate small URLs that were plaintext signup URLs routed to a >>>> PHP >>>> system, perhaps more risk than is present in a typical application!) >>>> >>>> Note that a further issue you have with an LCG is running it in a >>>> cluster - >>>> even if you can lock on the particular instance you have, you can't >>>> guarantee that some other instance in the cluster has not stepped on >>>> past the slot you want. I think that the approach in my service >>>> (a SecureRandom plus a transactional check into persistence) is most >>>> likely to work reasonably well in a cluster - although it hasn't >>>> been >>>> tested there :P >>>> >>>> Cheers, >>>> Antranig. >>>> >>>> >>>> Quoting Steve Swinsburg <s.swinsburg@...>: >>>> >>>> >>>>> Carl, >>>>> >>>>> Sounds good, is there an existing LCG Java implementation we can >>>>> leverage or should we roll our own? >>>>> >>>>> cheers, >>>>> Steve >>>>> >>>>> >>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote: >>>>> >>>>> >>>>>> Summary: There is an ever increasing probability that the >>>>>> collision of >>>>>> IDs in the TinyURL service will slow down the service >>>>>> considerably. >>>>>> >>>>>> In looking at the implementation of the TinyURL service, I've >>>>>> noticed >>>>>> something about the way URLs are generated. If I'm headed in the >>>>>> wrong direction with this, please correct me. >>>>>> >>>>>> The very general approach is: >>>>>> 1) generate a random ID >>>>>> 2) check that the ID doesn't already exist >>>>>> 3) if exists, go to 1. if not, record the ID and release it to >>>>>> the >>>>>> caller. >>>>>> >>>>>> The following code for generating a URL: >>>>>> >>>>>> Random generator = new Random(); >>>>>> int seed = Math.abs(generator.nextInt()); >>>>>> >>>>>> has an issue with collision which is addressed by checking for the >>>>>> existence of that ID before handing it out but the algorithm run >>>>>> time >>>>>> will continue to grow because of the growing chance of collision >>>>>> over >>>>>> time. >>>>>> >>>>>> Explanation of concern: >>>>>> The first generated ID will have a 0/n chance of collision where >>>>>> n is >>>>>> the total number of possible IDs. The next ID has a 1/n chance, >>>>>> the >>>>>> next 2/n, etc up to the last having a chance of collision being >>>>>> n-1/n. >>>>>> Considering n needs to be sufficiently large to have a low to zero >>>>>> rate of duplication over time, this check will take more time >>>>>> because >>>>>> of the number of collisions over time. Assuming that the ID >>>>>> column in >>>>>> the database is indexed, the lookup time is minimized but there's >>>>>> still a call to the database lookup on each generated ID. As the >>>>>> number of IDs generated reaches the max possible, there are >>>>>> possibly >>>>>> n-1 lookups before a valid ID is found. >>>>>> >>>>>> The current code recreates the Random each time which stops >>>>>> nextInt() >>>>>> from knowing history (ie. new random seed on each generation). >>>>>> Java >>>>>> implements a linear congruential generator to generate random >>>>>> numbers >>>>>> which requires knowledge of the previously generated ID to be >>>>>> effective. Math.random() creates the random seed once and >>>>>> reuses it >>>>>> for the life of the JVM process which is a better approach. >>>>>> >>>>>> I would suggest creating an LCG in the TinyURL service so that the >>>>>> proper range can be easily managed (better performance due to a >>>>>> narrower scope than Math.random()) and the seed can be restored >>>>>> after >>>>>> a restart. I have created a prototype in Python based on the >>>>>> rules >>>>>> laid out in the Wikipedia article on LCG's [1] and using a >>>>>> sufficiently large prime number. This still carries a concern of >>>>>> generating IDs across a cluster but is relatively easy to address. >>>>>> >>>>>> WDYT? >>>>>> >>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator >>>>>> _______________________________________________ >>>>>> sakai-dev mailing list >>>>>> sakai-dev@... >>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>> >>>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>>>>> with a subject of "unsubscribe" >>>>>> >>>>> _______________________________________________ >>>>> sakai-dev mailing list >>>>> sakai-dev@... >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>> >>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>>>> with a subject of "unsubscribe" >>>>> >>>>> >>>> >>>> >>>> ---------------------------------------------------------------- >>>> This message was sent using IMP, the Internet Messaging Program. >>>> >>>> _______________________________________________ >>>> sakai-dev mailing list >>>> sakai-dev@... >>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>> >>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>>> with a subject of "unsubscribe" >>>> >>>> >>> _______________________________________________ >>> sakai-dev mailing list >>> sakai-dev@... >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>> >>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>> with a subject of "unsubscribe" >>> >>> >> >> -- >> Aaron Zeckoski (azeckoski (at) vt.edu) >> Senior Research Engineer - CARET - University of Cambridge >> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski >> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile >> _______________________________________________ >> sakai-dev mailing list >> sakai-dev@... >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> >> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >> with a subject of "unsubscribe" >> > > _______________________________________________ > sakai-dev mailing list > sakai-dev@... > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" > _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssueThe issue with collisions occurs when you are hosting URLs which are
not protected by the container, but held by some external resource - for example, a signup system "click here to validate your email to Xxxxx service". The implementation I made actually runs in two modes, you may request either "SECURE" or "SHORT" URLs by an option to the API. The short URLs start at only a couple of characters and are indeed sequential. Anyway, feel free to implement your own system, it is the Sakai way after all :) Cheers, Antranig. Quoting Steve Swinsburg <s.swinsburg@...>: > The security is handled in the container anyway so you get a shortened > URL and it resolves into the full URL, redirects and Sakai takes over. > Even if it was sequential it wouldn't be a security concern. > > With a keyspace of lowercase letters and numbers(36 chars) there are > about 2 billion combinations, with uppercase as well there's 50 > billion. Info here on the theoretical limits: > http://bit.ly/3QG2uj > > Using RandomStringUtils from commons-lang, I wrote a short method to > iterate over a million alphanumeric combinations using letters and > numbers of length 6 chars and didn't run into a single collision. > Using this method and a transactional check when adding into the > database is probably the way to go. > > We can also filter for bad words, BBC has an interesting list (thanks > Seth! :) > > WDYT? > > Steve > > > On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote: > > > Is there a reason why the id generation could be more like tinyURL and > > simply be sequential and thus very short (at least initially and > > probably for most institutions it would stay small). > > > > Just a thought. > > -AZ > > > > > > On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm@...> > > wrote: > >> You could consider using 22 character uuids as autogenerated tinies > >> in > >> the url. It might not be able to be called tinyurl anymore, but it > >> would at least be short, linkable and (much more) collision free. > >> > >> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string > >> > >> I'm not convinced it would be necessary to keep these id's from being > >> unpredictable. The current security model should prevent anyone from > >> accessing content they don't have permission to even guessing id's. I > >> think eventually when a UI was designed for this it might even want > >> to > >> allow the user to specify a custom ID, similar to how the Site > >> Aliases > >> (on new site creation) works, allowing a user to specify anything > >> they > >> want for their site alias. (Like tinyurl.com) So this would almost > >> guarantee collisions. > >> > >> So I wouldn't think that the possibility of a collision on an low use > >> feature for an indexed column (when collisions will probably be > >> featured in eventually anyway) likely won't be a big deal. > >> > >> -Matthew > >> > >> On Tue, Jul 7, 2009 at 7:08 PM, <antranig@...> wrote: > >>> Please take a look at the comments at the head of the > >>> "EighteenIDGenerator" > >>> file for some details of the maths involved in computing how many > >>> documents will arise before a collision: > >>> > > >>> > >>> I wouldn't really recommend using a key space smaller than this one > >>> if you want to avoid collisions - although of course you will > >>> reprobe > >>> to be sure. Another issue with LCGs is that they are highly > >>> predictable - > >>> this is why I use a SecureRandom here: One of the requirements of > >>> the > >>> TinyURLService that I made for the work at CARET was that the ID > >>> sequence could not be predicted - to prevent the possibility of > >>> malicious > >>> "hijacking" of upcoming URL slots (this was actually a system > >>> designed > >>> to allocate small URLs that were plaintext signup URLs routed to a > >>> PHP > >>> system, perhaps more risk than is present in a typical application!) > >>> > >>> Note that a further issue you have with an LCG is running it in a > >>> cluster - > >>> even if you can lock on the particular instance you have, you can't > >>> guarantee that some other instance in the cluster has not stepped on > >>> past the slot you want. I think that the approach in my service > >>> (a SecureRandom plus a transactional check into persistence) is most > >>> likely to work reasonably well in a cluster - although it hasn't > >>> been > >>> tested there :P > >>> > >>> Cheers, > >>> Antranig. > >>> > >>> > >>> Quoting Steve Swinsburg <s.swinsburg@...>: > >>> > >>>> Carl, > >>>> > >>>> Sounds good, is there an existing LCG Java implementation we can > >>>> leverage or should we roll our own? > >>>> > >>>> cheers, > >>>> Steve > >>>> > >>>> > >>>> On 6 Jul 2009, at 12:34, Carl Hall wrote: > >>>> > >>>>> Summary: There is an ever increasing probability that the > >>>>> collision of > >>>>> IDs in the TinyURL service will slow down the service > >>>>> considerably. > >>>>> > >>>>> In looking at the implementation of the TinyURL service, I've > >>>>> noticed > >>>>> something about the way URLs are generated. If I'm headed in the > >>>>> wrong direction with this, please correct me. > >>>>> > >>>>> The very general approach is: > >>>>> 1) generate a random ID > >>>>> 2) check that the ID doesn't already exist > >>>>> 3) if exists, go to 1. if not, record the ID and release it to > >>>>> the > >>>>> caller. > >>>>> > >>>>> The following code for generating a URL: > >>>>> > >>>>> Random generator = new Random(); > >>>>> int seed = Math.abs(generator.nextInt()); > >>>>> > >>>>> has an issue with collision which is addressed by checking for the > >>>>> existence of that ID before handing it out but the algorithm run > >>>>> time > >>>>> will continue to grow because of the growing chance of collision > >>>>> over > >>>>> time. > >>>>> > >>>>> Explanation of concern: > >>>>> The first generated ID will have a 0/n chance of collision where > >>>>> n is > >>>>> the total number of possible IDs. The next ID has a 1/n chance, > >>>>> the > >>>>> next 2/n, etc up to the last having a chance of collision being > >>>>> n-1/n. > >>>>> Considering n needs to be sufficiently large to have a low to zero > >>>>> rate of duplication over time, this check will take more time > >>>>> because > >>>>> of the number of collisions over time. Assuming that the ID > >>>>> column in > >>>>> the database is indexed, the lookup time is minimized but there's > >>>>> still a call to the database lookup on each generated ID. As the > >>>>> number of IDs generated reaches the max possible, there are > >>>>> possibly > >>>>> n-1 lookups before a valid ID is found. > >>>>> > >>>>> The current code recreates the Random each time which stops > >>>>> nextInt() > >>>>> from knowing history (ie. new random seed on each generation). > >>>>> Java > >>>>> implements a linear congruential generator to generate random > >>>>> numbers > >>>>> which requires knowledge of the previously generated ID to be > >>>>> effective. Math.random() creates the random seed once and > >>>>> reuses it > >>>>> for the life of the JVM process which is a better approach. > >>>>> > >>>>> I would suggest creating an LCG in the TinyURL service so that the > >>>>> proper range can be easily managed (better performance due to a > >>>>> narrower scope than Math.random()) and the seed can be restored > >>>>> after > >>>>> a restart. I have created a prototype in Python based on the > >>>>> rules > >>>>> laid out in the Wikipedia article on LCG's [1] and using a > >>>>> sufficiently large prime number. This still carries a concern of > >>>>> generating IDs across a cluster but is relatively easy to address. > >>>>> > >>>>> WDYT? > >>>>> > >>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator > >>>>> _______________________________________________ > >>>>> sakai-dev mailing list > >>>>> sakai-dev@... > >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > >>>>> > >>>>> TO UNSUBSCRIBE: send email to > sakai-dev-unsubscribe@... > >>>>> with a subject of "unsubscribe" > >>>> > >>>> _______________________________________________ > >>>> sakai-dev mailing list > >>>> sakai-dev@... > >>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > >>>> > >>>> TO UNSUBSCRIBE: send email to > sakai-dev-unsubscribe@... > >>>> with a subject of "unsubscribe" > >>>> > >>> > >>> > >>> > >>> > >>> ---------------------------------------------------------------- > >>> This message was sent using IMP, the Internet Messaging Program. > >>> > >>> _______________________________________________ > >>> sakai-dev mailing list > >>> sakai-dev@... > >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > >>> > >>> TO UNSUBSCRIBE: send email to > sakai-dev-unsubscribe@... > >>> with a subject of "unsubscribe" > >>> > >> _______________________________________________ > >> sakai-dev mailing list > >> sakai-dev@... > >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > >> > >> TO UNSUBSCRIBE: send email to > sakai-dev-unsubscribe@... > >> with a subject of "unsubscribe" > >> > > > > > > > > -- > > Aaron Zeckoski (azeckoski (at) vt.edu) > > Senior Research Engineer - CARET - University of Cambridge > > https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski > > http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile > > _______________________________________________ > > sakai-dev mailing list > > sakai-dev@... > > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > > > > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... > > with a subject of "unsubscribe" > > _______________________________________________ > sakai-dev mailing list > sakai-dev@... > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... > with a subject of "unsubscribe" > ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssueYou realize, of course, that said bad word filter would need to be internationalized. Could be some of the more entertaining il8n work, though.
-Mike On Wed, Jul 8, 2009 at 4:51 PM, Jim Martino <jrm@...> wrote: i love the bad words filter idea, once inserted a one-line filter in a _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssueI would say that the tinyurl service we are using at CARET is a good
starting point but that is just my opinion. It gives us 2 out of 3 things that we want and the 3rd (custom tiny url) is easy to add. -AZ On Wed, Jul 8, 2009 at 10:25 PM, <antranig@...> wrote: > The issue with collisions occurs when you are hosting URLs which are > not protected by the container, but held by some external resource - > for example, a signup system "click here to validate your email to > Xxxxx service". The implementation I made actually runs in two modes, > you may request either "SECURE" or "SHORT" URLs by an option to the > API. The short URLs start at only a couple of characters and > are indeed sequential. > Anyway, feel free to implement your own system, it is the Sakai way > after all :) > > Cheers, > Antranig. > > Quoting Steve Swinsburg <s.swinsburg@...>: > >> The security is handled in the container anyway so you get a shortened >> URL and it resolves into the full URL, redirects and Sakai takes over. >> Even if it was sequential it wouldn't be a security concern. >> >> With a keyspace of lowercase letters and numbers(36 chars) there are >> about 2 billion combinations, with uppercase as well there's 50 >> billion. Info here on the theoretical limits: >> http://bit.ly/3QG2uj >> >> Using RandomStringUtils from commons-lang, I wrote a short method to >> iterate over a million alphanumeric combinations using letters and >> numbers of length 6 chars and didn't run into a single collision. >> Using this method and a transactional check when adding into the >> database is probably the way to go. >> >> We can also filter for bad words, BBC has an interesting list (thanks >> Seth! :) >> >> WDYT? >> >> Steve >> >> >> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote: >> >> > Is there a reason why the id generation could be more like tinyURL and >> > simply be sequential and thus very short (at least initially and >> > probably for most institutions it would stay small). >> > >> > Just a thought. >> > -AZ >> > >> > >> > On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm@...> >> > wrote: >> >> You could consider using 22 character uuids as autogenerated tinies >> >> in >> >> the url. It might not be able to be called tinyurl anymore, but it >> >> would at least be short, linkable and (much more) collision free. >> >> >> >> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string >> >> >> >> I'm not convinced it would be necessary to keep these id's from being >> >> unpredictable. The current security model should prevent anyone from >> >> accessing content they don't have permission to even guessing id's. I >> >> think eventually when a UI was designed for this it might even want >> >> to >> >> allow the user to specify a custom ID, similar to how the Site >> >> Aliases >> >> (on new site creation) works, allowing a user to specify anything >> >> they >> >> want for their site alias. (Like tinyurl.com) So this would almost >> >> guarantee collisions. >> >> >> >> So I wouldn't think that the possibility of a collision on an low use >> >> feature for an indexed column (when collisions will probably be >> >> featured in eventually anyway) likely won't be a big deal. >> >> >> >> -Matthew >> >> >> >> On Tue, Jul 7, 2009 at 7:08 PM, <antranig@...> wrote: >> >>> Please take a look at the comments at the head of the >> >>> "EighteenIDGenerator" >> >>> file for some details of the maths involved in computing how many >> >>> documents will arise before a collision: >> >>> >> > https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java >> >>> >> >>> I wouldn't really recommend using a key space smaller than this one >> >>> if you want to avoid collisions - although of course you will >> >>> reprobe >> >>> to be sure. Another issue with LCGs is that they are highly >> >>> predictable - >> >>> this is why I use a SecureRandom here: One of the requirements of >> >>> the >> >>> TinyURLService that I made for the work at CARET was that the ID >> >>> sequence could not be predicted - to prevent the possibility of >> >>> malicious >> >>> "hijacking" of upcoming URL slots (this was actually a system >> >>> designed >> >>> to allocate small URLs that were plaintext signup URLs routed to a >> >>> PHP >> >>> system, perhaps more risk than is present in a typical application!) >> >>> >> >>> Note that a further issue you have with an LCG is running it in a >> >>> cluster - >> >>> even if you can lock on the particular instance you have, you can't >> >>> guarantee that some other instance in the cluster has not stepped on >> >>> past the slot you want. I think that the approach in my service >> >>> (a SecureRandom plus a transactional check into persistence) is most >> >>> likely to work reasonably well in a cluster - although it hasn't >> >>> been >> >>> tested there :P >> >>> >> >>> Cheers, >> >>> Antranig. >> >>> >> >>> >> >>> Quoting Steve Swinsburg <s.swinsburg@...>: >> >>> >> >>>> Carl, >> >>>> >> >>>> Sounds good, is there an existing LCG Java implementation we can >> >>>> leverage or should we roll our own? >> >>>> >> >>>> cheers, >> >>>> Steve >> >>>> >> >>>> >> >>>> On 6 Jul 2009, at 12:34, Carl Hall wrote: >> >>>> >> >>>>> Summary: There is an ever increasing probability that the >> >>>>> collision of >> >>>>> IDs in the TinyURL service will slow down the service >> >>>>> considerably. >> >>>>> >> >>>>> In looking at the implementation of the TinyURL service, I've >> >>>>> noticed >> >>>>> something about the way URLs are generated. If I'm headed in the >> >>>>> wrong direction with this, please correct me. >> >>>>> >> >>>>> The very general approach is: >> >>>>> 1) generate a random ID >> >>>>> 2) check that the ID doesn't already exist >> >>>>> 3) if exists, go to 1. if not, record the ID and release it to >> >>>>> the >> >>>>> caller. >> >>>>> >> >>>>> The following code for generating a URL: >> >>>>> >> >>>>> Random generator = new Random(); >> >>>>> int seed = Math.abs(generator.nextInt()); >> >>>>> >> >>>>> has an issue with collision which is addressed by checking for the >> >>>>> existence of that ID before handing it out but the algorithm run >> >>>>> time >> >>>>> will continue to grow because of the growing chance of collision >> >>>>> over >> >>>>> time. >> >>>>> >> >>>>> Explanation of concern: >> >>>>> The first generated ID will have a 0/n chance of collision where >> >>>>> n is >> >>>>> the total number of possible IDs. The next ID has a 1/n chance, >> >>>>> the >> >>>>> next 2/n, etc up to the last having a chance of collision being >> >>>>> n-1/n. >> >>>>> Considering n needs to be sufficiently large to have a low to zero >> >>>>> rate of duplication over time, this check will take more time >> >>>>> because >> >>>>> of the number of collisions over time. Assuming that the ID >> >>>>> column in >> >>>>> the database is indexed, the lookup time is minimized but there's >> >>>>> still a call to the database lookup on each generated ID. As the >> >>>>> number of IDs generated reaches the max possible, there are >> >>>>> possibly >> >>>>> n-1 lookups before a valid ID is found. >> >>>>> >> >>>>> The current code recreates the Random each time which stops >> >>>>> nextInt() >> >>>>> from knowing history (ie. new random seed on each generation). >> >>>>> Java >> >>>>> implements a linear congruential generator to generate random >> >>>>> numbers >> >>>>> which requires knowledge of the previously generated ID to be >> >>>>> effective. Math.random() creates the random seed once and >> >>>>> reuses it >> >>>>> for the life of the JVM process which is a better approach. >> >>>>> >> >>>>> I would suggest creating an LCG in the TinyURL service so that the >> >>>>> proper range can be easily managed (better performance due to a >> >>>>> narrower scope than Math.random()) and the seed can be restored >> >>>>> after >> >>>>> a restart. I have created a prototype in Python based on the >> >>>>> rules >> >>>>> laid out in the Wikipedia article on LCG's [1] and using a >> >>>>> sufficiently large prime number. This still carries a concern of >> >>>>> generating IDs across a cluster but is relatively easy to address. >> >>>>> >> >>>>> WDYT? >> >>>>> >> >>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator >> >>>>> _______________________________________________ >> >>>>> sakai-dev mailing list >> >>>>> sakai-dev@... >> >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> >>>>> >> >>>>> TO UNSUBSCRIBE: send email to >> sakai-dev-unsubscribe@... >> >>>>> with a subject of "unsubscribe" >> >>>> >> >>>> _______________________________________________ >> >>>> sakai-dev mailing list >> >>>> sakai-dev@... >> >>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> >>>> >> >>>> TO UNSUBSCRIBE: send email to >> sakai-dev-unsubscribe@... >> >>>> with a subject of "unsubscribe" >> >>>> >> >>> >> >>> >> >>> >> >>> >> >>> ---------------------------------------------------------------- >> >>> This message was sent using IMP, the Internet Messaging Program. >> >>> >> >>> _______________________________________________ >> >>> sakai-dev mailing list >> >>> sakai-dev@... >> >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> >>> >> >>> TO UNSUBSCRIBE: send email to >> sakai-dev-unsubscribe@... >> >>> with a subject of "unsubscribe" >> >>> >> >> _______________________________________________ >> >> sakai-dev mailing list >> >> sakai-dev@... >> >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> >> >> >> TO UNSUBSCRIBE: send email to >> sakai-dev-unsubscribe@... >> >> with a subject of "unsubscribe" >> >> >> > >> > >> > >> > -- >> > Aaron Zeckoski (azeckoski (at) vt.edu) >> > Senior Research Engineer - CARET - University of Cambridge >> > https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski >> > http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile >> > _______________________________________________ >> > sakai-dev mailing list >> > sakai-dev@... >> > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> > >> > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >> > with a subject of "unsubscribe" >> >> _______________________________________________ >> sakai-dev mailing list >> sakai-dev@... >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> >> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >> with a subject of "unsubscribe" >> > > > > > ---------------------------------------------------------------- > This message was sent using IMP, the Internet Messaging Program. > > -- Aaron Zeckoski (azeckoski (at) vt.edu) Senior Research Engineer - CARET - University of Cambridge https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssueOk I've added an alternative method to return a longer random id which
makes it less guessable. The number of permutations with this combo has 30 zeroes (on a-z and 0-9). Have some fun with the calculator! http://bit.ly/gJP1X cheers, Steve On 8 Jul 2009, at 17:25, antranig@... wrote: > The issue with collisions occurs when you are hosting URLs which are > not protected by the container, but held by some external resource - > for example, a signup system "click here to validate your email to > Xxxxx service". The implementation I made actually runs in two modes, > you may request either "SECURE" or "SHORT" URLs by an option to the > API. The short URLs start at only a couple of characters and > are indeed sequential. > Anyway, feel free to implement your own system, it is the Sakai way > after all :) > > Cheers, > Antranig. > > Quoting Steve Swinsburg <s.swinsburg@...>: > >> The security is handled in the container anyway so you get a >> shortened >> URL and it resolves into the full URL, redirects and Sakai takes >> over. >> Even if it was sequential it wouldn't be a security concern. >> >> With a keyspace of lowercase letters and numbers(36 chars) there are >> about 2 billion combinations, with uppercase as well there's 50 >> billion. Info here on the theoretical limits: >> http://bit.ly/3QG2uj >> >> Using RandomStringUtils from commons-lang, I wrote a short method to >> iterate over a million alphanumeric combinations using letters and >> numbers of length 6 chars and didn't run into a single collision. >> Using this method and a transactional check when adding into the >> database is probably the way to go. >> >> We can also filter for bad words, BBC has an interesting list (thanks >> Seth! :) >> >> WDYT? >> >> Steve >> >> >> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote: >> >>> Is there a reason why the id generation could be more like tinyURL >>> and >>> simply be sequential and thus very short (at least initially and >>> probably for most institutions it would stay small). >>> >>> Just a thought. >>> -AZ >>> >>> >>> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm@...> >>> wrote: >>>> You could consider using 22 character uuids as autogenerated tinies >>>> in >>>> the url. It might not be able to be called tinyurl anymore, but it >>>> would at least be short, linkable and (much more) collision free. >>>> >>>> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string >>>> >>>> I'm not convinced it would be necessary to keep these id's from >>>> being >>>> unpredictable. The current security model should prevent anyone >>>> from >>>> accessing content they don't have permission to even guessing >>>> id's. I >>>> think eventually when a UI was designed for this it might even want >>>> to >>>> allow the user to specify a custom ID, similar to how the Site >>>> Aliases >>>> (on new site creation) works, allowing a user to specify anything >>>> they >>>> want for their site alias. (Like tinyurl.com) So this would almost >>>> guarantee collisions. >>>> >>>> So I wouldn't think that the possibility of a collision on an low >>>> use >>>> feature for an indexed column (when collisions will probably be >>>> featured in eventually anyway) likely won't be a big deal. >>>> >>>> -Matthew >>>> >>>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig@...> wrote: >>>>> Please take a look at the comments at the head of the >>>>> "EighteenIDGenerator" >>>>> file for some details of the maths involved in computing how many >>>>> documents will arise before a collision: >>>>> >> > https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java >>>>> >>>>> I wouldn't really recommend using a key space smaller than this >>>>> one >>>>> if you want to avoid collisions - although of course you will >>>>> reprobe >>>>> to be sure. Another issue with LCGs is that they are highly >>>>> predictable - >>>>> this is why I use a SecureRandom here: One of the requirements of >>>>> the >>>>> TinyURLService that I made for the work at CARET was that the ID >>>>> sequence could not be predicted - to prevent the possibility of >>>>> malicious >>>>> "hijacking" of upcoming URL slots (this was actually a system >>>>> designed >>>>> to allocate small URLs that were plaintext signup URLs routed to a >>>>> PHP >>>>> system, perhaps more risk than is present in a typical >>>>> application!) >>>>> >>>>> Note that a further issue you have with an LCG is running it in a >>>>> cluster - >>>>> even if you can lock on the particular instance you have, you >>>>> can't >>>>> guarantee that some other instance in the cluster has not >>>>> stepped on >>>>> past the slot you want. I think that the approach in my service >>>>> (a SecureRandom plus a transactional check into persistence) is >>>>> most >>>>> likely to work reasonably well in a cluster - although it hasn't >>>>> been >>>>> tested there :P >>>>> >>>>> Cheers, >>>>> Antranig. >>>>> >>>>> >>>>> Quoting Steve Swinsburg <s.swinsburg@...>: >>>>> >>>>>> Carl, >>>>>> >>>>>> Sounds good, is there an existing LCG Java implementation we can >>>>>> leverage or should we roll our own? >>>>>> >>>>>> cheers, >>>>>> Steve >>>>>> >>>>>> >>>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote: >>>>>> >>>>>>> Summary: There is an ever increasing probability that the >>>>>>> collision of >>>>>>> IDs in the TinyURL service will slow down the service >>>>>>> considerably. >>>>>>> >>>>>>> In looking at the implementation of the TinyURL service, I've >>>>>>> noticed >>>>>>> something about the way URLs are generated. If I'm headed in >>>>>>> the >>>>>>> wrong direction with this, please correct me. >>>>>>> >>>>>>> The very general approach is: >>>>>>> 1) generate a random ID >>>>>>> 2) check that the ID doesn't already exist >>>>>>> 3) if exists, go to 1. if not, record the ID and release it to >>>>>>> the >>>>>>> caller. >>>>>>> >>>>>>> The following code for generating a URL: >>>>>>> >>>>>>> Random generator = new Random(); >>>>>>> int seed = Math.abs(generator.nextInt()); >>>>>>> >>>>>>> has an issue with collision which is addressed by checking for >>>>>>> the >>>>>>> existence of that ID before handing it out but the algorithm run >>>>>>> time >>>>>>> will continue to grow because of the growing chance of collision >>>>>>> over >>>>>>> time. >>>>>>> >>>>>>> Explanation of concern: >>>>>>> The first generated ID will have a 0/n chance of collision where >>>>>>> n is >>>>>>> the total number of possible IDs. The next ID has a 1/n chance, >>>>>>> the >>>>>>> next 2/n, etc up to the last having a chance of collision being >>>>>>> n-1/n. >>>>>>> Considering n needs to be sufficiently large to have a low to >>>>>>> zero >>>>>>> rate of duplication over time, this check will take more time >>>>>>> because >>>>>>> of the number of collisions over time. Assuming that the ID >>>>>>> column in >>>>>>> the database is indexed, the lookup time is minimized but >>>>>>> there's >>>>>>> still a call to the database lookup on each generated ID. As >>>>>>> the >>>>>>> number of IDs generated reaches the max possible, there are >>>>>>> possibly >>>>>>> n-1 lookups before a valid ID is found. >>>>>>> >>>>>>> The current code recreates the Random each time which stops >>>>>>> nextInt() >>>>>>> from knowing history (ie. new random seed on each generation). >>>>>>> Java >>>>>>> implements a linear congruential generator to generate random >>>>>>> numbers >>>>>>> which requires knowledge of the previously generated ID to be >>>>>>> effective. Math.random() creates the random seed once and >>>>>>> reuses it >>>>>>> for the life of the JVM process which is a better approach. >>>>>>> >>>>>>> I would suggest creating an LCG in the TinyURL service so that >>>>>>> the >>>>>>> proper range can be easily managed (better performance due to a >>>>>>> narrower scope than Math.random()) and the seed can be restored >>>>>>> after >>>>>>> a restart. I have created a prototype in Python based on the >>>>>>> rules >>>>>>> laid out in the Wikipedia article on LCG's [1] and using a >>>>>>> sufficiently large prime number. This still carries a concern >>>>>>> of >>>>>>> generating IDs across a cluster but is relatively easy to >>>>>>> address. >>>>>>> >>>>>>> WDYT? >>>>>>> >>>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator >>>>>>> _______________________________________________ >>>>>>> sakai-dev mailing list >>>>>>> sakai-dev@... >>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>> >>>>>>> TO UNSUBSCRIBE: send email to >> sakai-dev-unsubscribe@... >>>>>>> with a subject of "unsubscribe" >>>>>> >>>>>> _______________________________________________ >>>>>> sakai-dev mailing list >>>>>> sakai-dev@... >>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>> >>>>>> TO UNSUBSCRIBE: send email to >> sakai-dev-unsubscribe@... >>>>>> with a subject of "unsubscribe" >>>>>> >>>>> >>>>> >>>>> >>>>> >>>>> ---------------------------------------------------------------- >>>>> This message was sent using IMP, the Internet Messaging Program. >>>>> >>>>> _______________________________________________ >>>>> sakai-dev mailing list >>>>> sakai-dev@... >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>> >>>>> TO UNSUBSCRIBE: send email to >> sakai-dev-unsubscribe@... >>>>> with a subject of "unsubscribe" >>>>> >>>> _______________________________________________ >>>> sakai-dev mailing list >>>> sakai-dev@... >>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>> >>>> TO UNSUBSCRIBE: send email to >> sakai-dev-unsubscribe@... >>>> with a subject of "unsubscribe" >>>> >>> >>> >>> >>> -- >>> Aaron Zeckoski (azeckoski (at) vt.edu) >>> Senior Research Engineer - CARET - University of Cambridge >>> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski >>> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile >>> _______________________________________________ >>> sakai-dev mailing list >>> sakai-dev@... >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>> >>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>> with a subject of "unsubscribe" >> >> _______________________________________________ >> sakai-dev mailing list >> sakai-dev@... >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> >> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >> with a subject of "unsubscribe" >> > > > > > ---------------------------------------------------------------- > This message was sent using IMP, the Internet Messaging Program. > _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssueI have been a little curious why timeboxing is not used to decrease
collisions for this service.... we use the service in a custom guest access app we made to put registration urls etc in emails.... obviously we want to timebox access to the urls but a ttl can be xtremely effective in reducing an otherwise skewed data set to something resembling white noise... as it give a small window view of a larger distribution Maybe an api call could change the ttl and return the probability of collision during that time or some other silliness On Wed, Jul 8, 2009 at 2:52 PM, Steve Swinsburg<s.swinsburg@...> wrote: > Ok I've added an alternative method to return a longer random id which > makes it less guessable. The number of permutations with this combo > has 30 zeroes (on a-z and 0-9). > > Have some fun with the calculator! http://bit.ly/gJP1X > > cheers, > Steve > > > > On 8 Jul 2009, at 17:25, antranig@... wrote: > >> The issue with collisions occurs when you are hosting URLs which are >> not protected by the container, but held by some external resource - >> for example, a signup system "click here to validate your email to >> Xxxxx service". The implementation I made actually runs in two modes, >> you may request either "SECURE" or "SHORT" URLs by an option to the >> API. The short URLs start at only a couple of characters and >> are indeed sequential. >> Anyway, feel free to implement your own system, it is the Sakai way >> after all :) >> >> Cheers, >> Antranig. >> >> Quoting Steve Swinsburg <s.swinsburg@...>: >> >>> The security is handled in the container anyway so you get a >>> shortened >>> URL and it resolves into the full URL, redirects and Sakai takes >>> over. >>> Even if it was sequential it wouldn't be a security concern. >>> >>> With a keyspace of lowercase letters and numbers(36 chars) there are >>> about 2 billion combinations, with uppercase as well there's 50 >>> billion. Info here on the theoretical limits: >>> http://bit.ly/3QG2uj >>> >>> Using RandomStringUtils from commons-lang, I wrote a short method to >>> iterate over a million alphanumeric combinations using letters and >>> numbers of length 6 chars and didn't run into a single collision. >>> Using this method and a transactional check when adding into the >>> database is probably the way to go. >>> >>> We can also filter for bad words, BBC has an interesting list (thanks >>> Seth! :) >>> >>> WDYT? >>> >>> Steve >>> >>> >>> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote: >>> >>>> Is there a reason why the id generation could be more like tinyURL >>>> and >>>> simply be sequential and thus very short (at least initially and >>>> probably for most institutions it would stay small). >>>> >>>> Just a thought. >>>> -AZ >>>> >>>> >>>> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm@...> >>>> wrote: >>>>> You could consider using 22 character uuids as autogenerated tinies >>>>> in >>>>> the url. It might not be able to be called tinyurl anymore, but it >>>>> would at least be short, linkable and (much more) collision free. >>>>> >>>>> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string >>>>> >>>>> I'm not convinced it would be necessary to keep these id's from >>>>> being >>>>> unpredictable. The current security model should prevent anyone >>>>> from >>>>> accessing content they don't have permission to even guessing >>>>> id's. I >>>>> think eventually when a UI was designed for this it might even want >>>>> to >>>>> allow the user to specify a custom ID, similar to how the Site >>>>> Aliases >>>>> (on new site creation) works, allowing a user to specify anything >>>>> they >>>>> want for their site alias. (Like tinyurl.com) So this would almost >>>>> guarantee collisions. >>>>> >>>>> So I wouldn't think that the possibility of a collision on an low >>>>> use >>>>> feature for an indexed column (when collisions will probably be >>>>> featured in eventually anyway) likely won't be a big deal. >>>>> >>>>> -Matthew >>>>> >>>>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig@...> wrote: >>>>>> Please take a look at the comments at the head of the >>>>>> "EighteenIDGenerator" >>>>>> file for some details of the maths involved in computing how many >>>>>> documents will arise before a collision: >>>>>> >>> >> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java >>>>>> >>>>>> I wouldn't really recommend using a key space smaller than this >>>>>> one >>>>>> if you want to avoid collisions - although of course you will >>>>>> reprobe >>>>>> to be sure. Another issue with LCGs is that they are highly >>>>>> predictable - >>>>>> this is why I use a SecureRandom here: One of the requirements of >>>>>> the >>>>>> TinyURLService that I made for the work at CARET was that the ID >>>>>> sequence could not be predicted - to prevent the possibility of >>>>>> malicious >>>>>> "hijacking" of upcoming URL slots (this was actually a system >>>>>> designed >>>>>> to allocate small URLs that were plaintext signup URLs routed to a >>>>>> PHP >>>>>> system, perhaps more risk than is present in a typical >>>>>> application!) >>>>>> >>>>>> Note that a further issue you have with an LCG is running it in a >>>>>> cluster - >>>>>> even if you can lock on the particular instance you have, you >>>>>> can't >>>>>> guarantee that some other instance in the cluster has not >>>>>> stepped on >>>>>> past the slot you want. I think that the approach in my service >>>>>> (a SecureRandom plus a transactional check into persistence) is >>>>>> most >>>>>> likely to work reasonably well in a cluster - although it hasn't >>>>>> been >>>>>> tested there :P >>>>>> >>>>>> Cheers, >>>>>> Antranig. >>>>>> >>>>>> >>>>>> Quoting Steve Swinsburg <s.swinsburg@...>: >>>>>> >>>>>>> Carl, >>>>>>> >>>>>>> Sounds good, is there an existing LCG Java implementation we can >>>>>>> leverage or should we roll our own? >>>>>>> >>>>>>> cheers, >>>>>>> Steve >>>>>>> >>>>>>> >>>>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote: >>>>>>> >>>>>>>> Summary: There is an ever increasing probability that the >>>>>>>> collision of >>>>>>>> IDs in the TinyURL service will slow down the service >>>>>>>> considerably. >>>>>>>> >>>>>>>> In looking at the implementation of the TinyURL service, I've >>>>>>>> noticed >>>>>>>> something about the way URLs are generated. If I'm headed in >>>>>>>> the >>>>>>>> wrong direction with this, please correct me. >>>>>>>> >>>>>>>> The very general approach is: >>>>>>>> 1) generate a random ID >>>>>>>> 2) check that the ID doesn't already exist >>>>>>>> 3) if exists, go to 1. if not, record the ID and release it to >>>>>>>> the >>>>>>>> caller. >>>>>>>> >>>>>>>> The following code for generating a URL: >>>>>>>> >>>>>>>> Random generator = new Random(); >>>>>>>> int seed = Math.abs(generator.nextInt()); >>>>>>>> >>>>>>>> has an issue with collision which is addressed by checking for >>>>>>>> the >>>>>>>> existence of that ID before handing it out but the algorithm run >>>>>>>> time >>>>>>>> will continue to grow because of the growing chance of collision >>>>>>>> over >>>>>>>> time. >>>>>>>> >>>>>>>> Explanation of concern: >>>>>>>> The first generated ID will have a 0/n chance of collision where >>>>>>>> n is >>>>>>>> the total number of possible IDs. The next ID has a 1/n chance, >>>>>>>> the >>>>>>>> next 2/n, etc up to the last having a chance of collision being >>>>>>>> n-1/n. >>>>>>>> Considering n needs to be sufficiently large to have a low to >>>>>>>> zero >>>>>>>> rate of duplication over time, this check will take more time >>>>>>>> because >>>>>>>> of the number of collisions over time. Assuming that the ID >>>>>>>> column in >>>>>>>> the database is indexed, the lookup time is minimized but >>>>>>>> there's >>>>>>>> still a call to the database lookup on each generated ID. As >>>>>>>> the >>>>>>>> number of IDs generated reaches the max possible, there are >>>>>>>> possibly >>>>>>>> n-1 lookups before a valid ID is found. >>>>>>>> >>>>>>>> The current code recreates the Random each time which stops >>>>>>>> nextInt() >>>>>>>> from knowing history (ie. new random seed on each generation). >>>>>>>> Java >>>>>>>> implements a linear congruential generator to generate random >>>>>>>> numbers >>>>>>>> which requires knowledge of the previously generated ID to be >>>>>>>> effective. Math.random() creates the random seed once and >>>>>>>> reuses it >>>>>>>> for the life of the JVM process which is a better approach. >>>>>>>> >>>>>>>> I would suggest creating an LCG in the TinyURL service so that >>>>>>>> the >>>>>>>> proper range can be easily managed (better performance due to a >>>>>>>> narrower scope than Math.random()) and the seed can be restored >>>>>>>> after >>>>>>>> a restart. I have created a prototype in Python based on the >>>>>>>> rules >>>>>>>> laid out in the Wikipedia article on LCG's [1] and using a >>>>>>>> sufficiently large prime number. This still carries a concern >>>>>>>> of >>>>>>>> generating IDs across a cluster but is relatively easy to >>>>>>>> address. >>>>>>>> >>>>>>>> WDYT? >>>>>>>> >>>>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator >>>>>>>> _______________________________________________ >>>>>>>> sakai-dev mailing list >>>>>>>> sakai-dev@... >>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>>> >>>>>>>> TO UNSUBSCRIBE: send email to >>> sakai-dev-unsubscribe@... >>>>>>>> with a subject of "unsubscribe" >>>>>>> >>>>>>> _______________________________________________ >>>>>>> sakai-dev mailing list >>>>>>> sakai-dev@... >>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>> >>>>>>> TO UNSUBSCRIBE: send email to >>> sakai-dev-unsubscribe@... >>>>>>> with a subject of "unsubscribe" >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> ---------------------------------------------------------------- >>>>>> This message was sent using IMP, the Internet Messaging Program. >>>>>> >>>>>> _______________________________________________ >>>>>> sakai-dev mailing list >>>>>> sakai-dev@... >>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>> >>>>>> TO UNSUBSCRIBE: send email to >>> sakai-dev-unsubscribe@... >>>>>> with a subject of "unsubscribe" >>>>>> >>>>> _______________________________________________ >>>>> sakai-dev mailing list >>>>> sakai-dev@... >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>> >>>>> TO UNSUBSCRIBE: send email to >>> sakai-dev-unsubscribe@... >>>>> with a subject of "unsubscribe" >>>>> >>>> >>>> >>>> >>>> -- >>>> Aaron Zeckoski (azeckoski (at) vt.edu) >>>> Senior Research Engineer - CARET - University of Cambridge >>>> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski >>>> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile >>>> _______________________________________________ >>>> sakai-dev mailing list >>>> sakai-dev@... >>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>> >>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>>> with a subject of "unsubscribe" >>> >>> _______________________________________________ >>> sakai-dev mailing list >>> sakai-dev@... >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>> >>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>> with a subject of "unsubscribe" >>> >> >> >> >> >> ---------------------------------------------------------------- >> This message was sent using IMP, the Internet Messaging Program. >> > > _______________________________________________ > sakai-dev mailing list > sakai-dev@... > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" > -- Jon Gorrono PGP Key: 5434509D email{>+++++++++[>+++++++++++>++++++++++++>+++++++>+++++<<<<-]>+++++++.>++++.<---.>-.+++..---.-.+.>+.<++++++.<----.+.---.>+.<++++++++.>---.>>+.<<<----.-.>++.} http{ats.ucdavis.edu} _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssueGood idea, I'll add that as an additional method where you specify the
TTL before the URL is no longer resolved. cheers, Steve --- Steve Swinsburg Portal Systems Developer Centre for e-Science Lancaster University Lancaster LA1 4YT email: s.swinsburg@... phone: +44 (0) 1524 594870 On 15 Jul 2009, at 04:27, Jon Gorrono wrote: > I have been a little curious why timeboxing is not used to decrease > collisions for this service.... we use the service in a custom guest > access app we made to put registration urls etc in emails.... > obviously we want to timebox access to the urls but a ttl can be > xtremely effective in reducing an otherwise skewed data set to > something resembling white noise... as it give a small window view of > a larger distribution > > Maybe an api call could change the ttl and return the probability of > collision during that time or some other silliness > > > > On Wed, Jul 8, 2009 at 2:52 PM, Steve > Swinsburg<s.swinsburg@...> wrote: >> Ok I've added an alternative method to return a longer random id >> which >> makes it less guessable. The number of permutations with this combo >> has 30 zeroes (on a-z and 0-9). >> >> Have some fun with the calculator! http://bit.ly/gJP1X >> >> cheers, >> Steve >> >> >> >> On 8 Jul 2009, at 17:25, antranig@... wrote: >> >>> The issue with collisions occurs when you are hosting URLs which are >>> not protected by the container, but held by some external resource - >>> for example, a signup system "click here to validate your email to >>> Xxxxx service". The implementation I made actually runs in two >>> modes, >>> you may request either "SECURE" or "SHORT" URLs by an option to the >>> API. The short URLs start at only a couple of characters and >>> are indeed sequential. >>> Anyway, feel free to implement your own system, it is the Sakai way >>> after all :) >>> >>> Cheers, >>> Antranig. >>> >>> Quoting Steve Swinsburg <s.swinsburg@...>: >>> >>>> The security is handled in the container anyway so you get a >>>> shortened >>>> URL and it resolves into the full URL, redirects and Sakai takes >>>> over. >>>> Even if it was sequential it wouldn't be a security concern. >>>> >>>> With a keyspace of lowercase letters and numbers(36 chars) there >>>> are >>>> about 2 billion combinations, with uppercase as well there's 50 >>>> billion. Info here on the theoretical limits: >>>> http://bit.ly/3QG2uj >>>> >>>> Using RandomStringUtils from commons-lang, I wrote a short method >>>> to >>>> iterate over a million alphanumeric combinations using letters and >>>> numbers of length 6 chars and didn't run into a single collision. >>>> Using this method and a transactional check when adding into the >>>> database is probably the way to go. >>>> >>>> We can also filter for bad words, BBC has an interesting list >>>> (thanks >>>> Seth! :) >>>> >>>> WDYT? >>>> >>>> Steve >>>> >>>> >>>> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote: >>>> >>>>> Is there a reason why the id generation could be more like tinyURL >>>>> and >>>>> simply be sequential and thus very short (at least initially and >>>>> probably for most institutions it would stay small). >>>>> >>>>> Just a thought. >>>>> -AZ >>>>> >>>>> >>>>> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm@...> >>>>> wrote: >>>>>> You could consider using 22 character uuids as autogenerated >>>>>> tinies >>>>>> in >>>>>> the url. It might not be able to be called tinyurl anymore, but >>>>>> it >>>>>> would at least be short, linkable and (much more) collision free. >>>>>> >>>>>> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string >>>>>> >>>>>> I'm not convinced it would be necessary to keep these id's from >>>>>> being >>>>>> unpredictable. The current security model should prevent anyone >>>>>> from >>>>>> accessing content they don't have permission to even guessing >>>>>> id's. I >>>>>> think eventually when a UI was designed for this it might even >>>>>> want >>>>>> to >>>>>> allow the user to specify a custom ID, similar to how the Site >>>>>> Aliases >>>>>> (on new site creation) works, allowing a user to specify anything >>>>>> they >>>>>> want for their site alias. (Like tinyurl.com) So this would >>>>>> almost >>>>>> guarantee collisions. >>>>>> >>>>>> So I wouldn't think that the possibility of a collision on an low >>>>>> use >>>>>> feature for an indexed column (when collisions will probably be >>>>>> featured in eventually anyway) likely won't be a big deal. >>>>>> >>>>>> -Matthew >>>>>> >>>>>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig@...> wrote: >>>>>>> Please take a look at the comments at the head of the >>>>>>> "EighteenIDGenerator" >>>>>>> file for some details of the maths involved in computing how >>>>>>> many >>>>>>> documents will arise before a collision: >>>>>>> >>>> >>> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java >>>>>>> >>>>>>> I wouldn't really recommend using a key space smaller than this >>>>>>> one >>>>>>> if you want to avoid collisions - although of course you will >>>>>>> reprobe >>>>>>> to be sure. Another issue with LCGs is that they are highly >>>>>>> predictable - >>>>>>> this is why I use a SecureRandom here: One of the requirements >>>>>>> of >>>>>>> the >>>>>>> TinyURLService that I made for the work at CARET was that the ID >>>>>>> sequence could not be predicted - to prevent the possibility of >>>>>>> malicious >>>>>>> "hijacking" of upcoming URL slots (this was actually a system >>>>>>> designed >>>>>>> to allocate small URLs that were plaintext signup URLs routed >>>>>>> to a >>>>>>> PHP >>>>>>> system, perhaps more risk than is present in a typical >>>>>>> application!) >>>>>>> >>>>>>> Note that a further issue you have with an LCG is running it >>>>>>> in a >>>>>>> cluster - >>>>>>> even if you can lock on the particular instance you have, you >>>>>>> can't >>>>>>> guarantee that some other instance in the cluster has not >>>>>>> stepped on >>>>>>> past the slot you want. I think that the approach in my service >>>>>>> (a SecureRandom plus a transactional check into persistence) is >>>>>>> most >>>>>>> likely to work reasonably well in a cluster - although it hasn't >>>>>>> been >>>>>>> tested there :P >>>>>>> >>>>>>> Cheers, >>>>>>> Antranig. >>>>>>> >>>>>>> >>>>>>> Quoting Steve Swinsburg <s.swinsburg@...>: >>>>>>> >>>>>>>> Carl, >>>>>>>> >>>>>>>> Sounds good, is there an existing LCG Java implementation we >>>>>>>> can >>>>>>>> leverage or should we roll our own? >>>>>>>> >>>>>>>> cheers, >>>>>>>> Steve >>>>>>>> >>>>>>>> >>>>>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote: >>>>>>>> >>>>>>>>> Summary: There is an ever increasing probability that the >>>>>>>>> collision of >>>>>>>>> IDs in the TinyURL service will slow down the service >>>>>>>>> considerably. >>>>>>>>> >>>>>>>>> In looking at the implementation of the TinyURL service, I've >>>>>>>>> noticed >>>>>>>>> something about the way URLs are generated. If I'm headed in >>>>>>>>> the >>>>>>>>> wrong direction with this, please correct me. >>>>>>>>> >>>>>>>>> The very general approach is: >>>>>>>>> 1) generate a random ID >>>>>>>>> 2) check that the ID doesn't already exist >>>>>>>>> 3) if exists, go to 1. if not, record the ID and release it >>>>>>>>> to >>>>>>>>> the >>>>>>>>> caller. >>>>>>>>> >>>>>>>>> The following code for generating a URL: >>>>>>>>> >>>>>>>>> Random generator = new Random(); >>>>>>>>> int seed = Math.abs(generator.nextInt()); >>>>>>>>> >>>>>>>>> has an issue with collision which is addressed by checking for >>>>>>>>> the >>>>>>>>> existence of that ID before handing it out but the algorithm >>>>>>>>> run >>>>>>>>> time >>>>>>>>> will continue to grow because of the growing chance of >>>>>>>>> collision >>>>>>>>> over >>>>>>>>> time. >>>>>>>>> >>>>>>>>> Explanation of concern: >>>>>>>>> The first generated ID will have a 0/n chance of collision >>>>>>>>> where >>>>>>>>> n is >>>>>>>>> the total number of possible IDs. The next ID has a 1/n >>>>>>>>> chance, >>>>>>>>> the >>>>>>>>> next 2/n, etc up to the last having a chance of collision >>>>>>>>> being >>>>>>>>> n-1/n. >>>>>>>>> Considering n needs to be sufficiently large to have a low to >>>>>>>>> zero >>>>>>>>> rate of duplication over time, this check will take more time >>>>>>>>> because >>>>>>>>> of the number of collisions over time. Assuming that the ID >>>>>>>>> column in >>>>>>>>> the database is indexed, the lookup time is minimized but >>>>>>>>> there's >>>>>>>>> still a call to the database lookup on each generated ID. As >>>>>>>>> the >>>>>>>>> number of IDs generated reaches the max possible, there are >>>>>>>>> possibly >>>>>>>>> n-1 lookups before a valid ID is found. >>>>>>>>> >>>>>>>>> The current code recreates the Random each time which stops >>>>>>>>> nextInt() >>>>>>>>> from knowing history (ie. new random seed on each generation). >>>>>>>>> Java >>>>>>>>> implements a linear congruential generator to generate random >>>>>>>>> numbers >>>>>>>>> which requires knowledge of the previously generated ID to be >>>>>>>>> effective. Math.random() creates the random seed once and >>>>>>>>> reuses it >>>>>>>>> for the life of the JVM process which is a better approach. >>>>>>>>> >>>>>>>>> I would suggest creating an LCG in the TinyURL service so that >>>>>>>>> the >>>>>>>>> proper range can be easily managed (better performance due >>>>>>>>> to a >>>>>>>>> narrower scope than Math.random()) and the seed can be >>>>>>>>> restored >>>>>>>>> after >>>>>>>>> a restart. I have created a prototype in Python based on the >>>>>>>>> rules >>>>>>>>> laid out in the Wikipedia article on LCG's [1] and using a >>>>>>>>> sufficiently large prime number. This still carries a concern >>>>>>>>> of >>>>>>>>> generating IDs across a cluster but is relatively easy to >>>>>>>>> address. >>>>>>>>> >>>>>>>>> WDYT? >>>>>>>>> >>>>>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator >>>>>>>>> _______________________________________________ >>>>>>>>> sakai-dev mailing list >>>>>>>>> sakai-dev@... >>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>>>> >>>>>>>>> TO UNSUBSCRIBE: send email to >>>> sakai-dev-unsubscribe@... >>>>>>>>> with a subject of "unsubscribe" >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> sakai-dev mailing list >>>>>>>> sakai-dev@... >>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>>> >>>>>>>> TO UNSUBSCRIBE: send email to >>>> sakai-dev-unsubscribe@... >>>>>>>> with a subject of "unsubscribe" >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> ---------------------------------------------------------------- >>>>>>> This message was sent using IMP, the Internet Messaging Program. >>>>>>> >>>>>>> _______________________________________________ >>>>>>> sakai-dev mailing list >>>>>>> sakai-dev@... >>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>> >>>>>>> TO UNSUBSCRIBE: send email to >>>> sakai-dev-unsubscribe@... >>>>>>> with a subject of "unsubscribe" >>>>>>> >>>>>> _______________________________________________ >>>>>> sakai-dev mailing list >>>>>> sakai-dev@... >>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>> >>>>>> TO UNSUBSCRIBE: send email to >>>> sakai-dev-unsubscribe@... >>>>>> with a subject of "unsubscribe" >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Aaron Zeckoski (azeckoski (at) vt.edu) >>>>> Senior Research Engineer - CARET - University of Cambridge >>>>> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski >>>>> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile >>>>> _______________________________________________ >>>>> sakai-dev mailing list >>>>> sakai-dev@... >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>> >>>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>>>> with a subject of "unsubscribe" >>>> >>>> _______________________________________________ >>>> sakai-dev mailing list >>>> sakai-dev@... >>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>> >>>> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >>>> with a subject of "unsubscribe" >>>> >>> >>> >>> >>> >>> ---------------------------------------------------------------- >>> This message was sent using IMP, the Internet Messaging Program. >>> >> >> _______________________________________________ >> sakai-dev mailing list >> sakai-dev@... >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> >> TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... >> with a subject of "unsubscribe" >> > > > > -- > Jon Gorrono > PGP Key: 5434509D > email{>+++++++++[>+++++++++++>++++++++++++>+++++++>+++++<<<<-]>++++++ > +.>++++.<---.>-.+++..---.-.+.>+.<++++++.<----.+.---.>+.<+++++++ > +.>---.>>+.<<<----.-.>++.} > http{ats.ucdavis.edu} > _______________________________________________ > sakai-dev mailing list > sakai-dev@... > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... > with a subject of "unsubscribe" _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssueI'm sort of reeling over this.... are you sure?
unaccustomed and unprepared for this kind of response... :) On Wed, Jul 15, 2009 at 2:51 AM, Steve Swinsburg<s.swinsburg@...> wrote: > Good idea, I'll add that as an additional method where you specify the TTL > before the URL is no longer resolved. > > > cheers, > Steve > > --- > Steve Swinsburg > Portal Systems Developer > Centre for e-Science > Lancaster University > Lancaster > LA1 4YT > > email: s.swinsburg@... > phone: +44 (0) 1524 594870 > > > > > > > > On 15 Jul 2009, at 04:27, Jon Gorrono wrote: > >> I have been a little curious why timeboxing is not used to decrease >> collisions for this service.... we use the service in a custom guest >> access app we made to put registration urls etc in emails.... >> obviously we want to timebox access to the urls but a ttl can be >> xtremely effective in reducing an otherwise skewed data set to >> something resembling white noise... as it give a small window view of >> a larger distribution >> >> Maybe an api call could change the ttl and return the probability of >> collision during that time or some other silliness >> >> >> >> On Wed, Jul 8, 2009 at 2:52 PM, Steve >> Swinsburg<s.swinsburg@...> wrote: >>> >>> Ok I've added an alternative method to return a longer random id which >>> makes it less guessable. The number of permutations with this combo >>> has 30 zeroes (on a-z and 0-9). >>> >>> Have some fun with the calculator! http://bit.ly/gJP1X >>> >>> cheers, >>> Steve >>> >>> >>> >>> On 8 Jul 2009, at 17:25, antranig@... wrote: >>> >>>> The issue with collisions occurs when you are hosting URLs which are >>>> not protected by the container, but held by some external resource - >>>> for example, a signup system "click here to validate your email to >>>> Xxxxx service". The implementation I made actually runs in two modes, >>>> you may request either "SECURE" or "SHORT" URLs by an option to the >>>> API. The short URLs start at only a couple of characters and >>>> are indeed sequential. >>>> Anyway, feel free to implement your own system, it is the Sakai way >>>> after all :) >>>> >>>> Cheers, >>>> Antranig. >>>> >>>> Quoting Steve Swinsburg <s.swinsburg@...>: >>>> >>>>> The security is handled in the container anyway so you get a >>>>> shortened >>>>> URL and it resolves into the full URL, redirects and Sakai takes >>>>> over. >>>>> Even if it was sequential it wouldn't be a security concern. >>>>> >>>>> With a keyspace of lowercase letters and numbers(36 chars) there are >>>>> about 2 billion combinations, with uppercase as well there's 50 >>>>> billion. Info here on the theoretical limits: >>>>> http://bit.ly/3QG2uj >>>>> >>>>> Using RandomStringUtils from commons-lang, I wrote a short method to >>>>> iterate over a million alphanumeric combinations using letters and >>>>> numbers of length 6 chars and didn't run into a single collision. >>>>> Using this method and a transactional check when adding into the >>>>> database is probably the way to go. >>>>> >>>>> We can also filter for bad words, BBC has an interesting list (thanks >>>>> Seth! :) >>>>> >>>>> WDYT? >>>>> >>>>> Steve >>>>> >>>>> >>>>> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote: >>>>> >>>>>> Is there a reason why the id generation could be more like tinyURL >>>>>> and >>>>>> simply be sequential and thus very short (at least initially and >>>>>> probably for most institutions it would stay small). >>>>>> >>>>>> Just a thought. >>>>>> -AZ >>>>>> >>>>>> >>>>>> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm@...> >>>>>> wrote: >>>>>>> >>>>>>> You could consider using 22 character uuids as autogenerated tinies >>>>>>> in >>>>>>> the url. It might not be able to be called tinyurl anymore, but it >>>>>>> would at least be short, linkable and (much more) collision free. >>>>>>> >>>>>>> >>>>>>> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string >>>>>>> >>>>>>> I'm not convinced it would be necessary to keep these id's from >>>>>>> being >>>>>>> unpredictable. The current security model should prevent anyone >>>>>>> from >>>>>>> accessing content they don't have permission to even guessing >>>>>>> id's. I >>>>>>> think eventually when a UI was designed for this it might even want >>>>>>> to >>>>>>> allow the user to specify a custom ID, similar to how the Site >>>>>>> Aliases >>>>>>> (on new site creation) works, allowing a user to specify anything >>>>>>> they >>>>>>> want for their site alias. (Like tinyurl.com) So this would almost >>>>>>> guarantee collisions. >>>>>>> >>>>>>> So I wouldn't think that the possibility of a collision on an low >>>>>>> use >>>>>>> feature for an indexed column (when collisions will probably be >>>>>>> featured in eventually anyway) likely won't be a big deal. >>>>>>> >>>>>>> -Matthew >>>>>>> >>>>>>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig@...> wrote: >>>>>>>> >>>>>>>> Please take a look at the comments at the head of the >>>>>>>> "EighteenIDGenerator" >>>>>>>> file for some details of the maths involved in computing how many >>>>>>>> documents will arise before a collision: >>>>>>>> >>>>> >>>> >>>> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java >>>>>>>> >>>>>>>> I wouldn't really recommend using a key space smaller than this >>>>>>>> one >>>>>>>> if you want to avoid collisions - although of course you will >>>>>>>> reprobe >>>>>>>> to be sure. Another issue with LCGs is that they are highly >>>>>>>> predictable - >>>>>>>> this is why I use a SecureRandom here: One of the requirements of >>>>>>>> the >>>>>>>> TinyURLService that I made for the work at CARET was that the ID >>>>>>>> sequence could not be predicted - to prevent the possibility of >>>>>>>> malicious >>>>>>>> "hijacking" of upcoming URL slots (this was actually a system >>>>>>>> designed >>>>>>>> to allocate small URLs that were plaintext signup URLs routed to a >>>>>>>> PHP >>>>>>>> system, perhaps more risk than is present in a typical >>>>>>>> application!) >>>>>>>> >>>>>>>> Note that a further issue you have with an LCG is running it in a >>>>>>>> cluster - >>>>>>>> even if you can lock on the particular instance you have, you >>>>>>>> can't >>>>>>>> guarantee that some other instance in the cluster has not >>>>>>>> stepped on >>>>>>>> past the slot you want. I think that the approach in my service >>>>>>>> (a SecureRandom plus a transactional check into persistence) is >>>>>>>> most >>>>>>>> likely to work reasonably well in a cluster - although it hasn't >>>>>>>> been >>>>>>>> tested there :P >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Antranig. >>>>>>>> >>>>>>>> >>>>>>>> Quoting Steve Swinsburg <s.swinsburg@...>: >>>>>>>> >>>>>>>>> Carl, >>>>>>>>> >>>>>>>>> Sounds good, is there an existing LCG Java implementation we can >>>>>>>>> leverage or should we roll our own? >>>>>>>>> >>>>>>>>> cheers, >>>>>>>>> Steve >>>>>>>>> >>>>>>>>> >>>>>>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote: >>>>>>>>> >>>>>>>>>> Summary: There is an ever increasing probability that the >>>>>>>>>> collision of >>>>>>>>>> IDs in the TinyURL service will slow down the service >>>>>>>>>> considerably. >>>>>>>>>> >>>>>>>>>> In looking at the implementation of the TinyURL service, I've >>>>>>>>>> noticed >>>>>>>>>> something about the way URLs are generated. If I'm headed in >>>>>>>>>> the >>>>>>>>>> wrong direction with this, please correct me. >>>>>>>>>> >>>>>>>>>> The very general approach is: >>>>>>>>>> 1) generate a random ID >>>>>>>>>> 2) check that the ID doesn't already exist >>>>>>>>>> 3) if exists, go to 1. if not, record the ID and release it to >>>>>>>>>> the >>>>>>>>>> caller. >>>>>>>>>> >>>>>>>>>> The following code for generating a URL: >>>>>>>>>> >>>>>>>>>> Random generator = new Random(); >>>>>>>>>> int seed = Math.abs(generator.nextInt()); >>>>>>>>>> >>>>>>>>>> has an issue with collision which is addressed by checking for >>>>>>>>>> the >>>>>>>>>> existence of that ID before handing it out but the algorithm run >>>>>>>>>> time >>>>>>>>>> will continue to grow because of the growing chance of collision >>>>>>>>>> over >>>>>>>>>> time. >>>>>>>>>> >>>>>>>>>> Explanation of concern: >>>>>>>>>> The first generated ID will have a 0/n chance of collision where >>>>>>>>>> n is >>>>>>>>>> the total number of possible IDs. The next ID has a 1/n chance, >>>>>>>>>> the >>>>>>>>>> next 2/n, etc up to the last having a chance of collision being >>>>>>>>>> n-1/n. >>>>>>>>>> Considering n needs to be sufficiently large to have a low to >>>>>>>>>> zero >>>>>>>>>> rate of duplication over time, this check will take more time >>>>>>>>>> because >>>>>>>>>> of the number of collisions over time. Assuming that the ID >>>>>>>>>> column in >>>>>>>>>> the database is indexed, the lookup time is minimized but >>>>>>>>>> there's >>>>>>>>>> still a call to the database lookup on each generated ID. As >>>>>>>>>> the >>>>>>>>>> number of IDs generated reaches the max possible, there are >>>>>>>>>> possibly >>>>>>>>>> n-1 lookups before a valid ID is found. >>>>>>>>>> >>>>>>>>>> The current code recreates the Random each time which stops >>>>>>>>>> nextInt() >>>>>>>>>> from knowing history (ie. new random seed on each generation). >>>>>>>>>> Java >>>>>>>>>> implements a linear congruential generator to generate random >>>>>>>>>> numbers >>>>>>>>>> which requires knowledge of the previously generated ID to be >>>>>>>>>> effective. Math.random() creates the random seed once and >>>>>>>>>> reuses it >>>>>>>>>> for the life of the JVM process which is a better approach. >>>>>>>>>> >>>>>>>>>> I would suggest creating an LCG in the TinyURL service so that >>>>>>>>>> the >>>>>>>>>> proper range can be easily managed (better performance due to a >>>>>>>>>> narrower scope than Math.random()) and the seed can be restored >>>>>>>>>> after >>>>>>>>>> a restart. I have created a prototype in Python based on the >>>>>>>>>> rules >>>>>>>>>> laid out in the Wikipedia article on LCG's [1] and using a >>>>>>>>>> sufficiently large prime number. This still carries a concern >>>>>>>>>> of >>>>>>>>>> generating IDs across a cluster but is relatively easy to >>>>>>>>>> address. >>>>>>>>>> >>>>>>>>>> WDYT? >>>>>>>>>> >>>>>>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator >>>>>>>>>> _______________________________________________ >>>>>>>>>> sakai-dev mailing list >>>>>>>>>> sakai-dev@... >>>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>>>>> >>>>>>>>>> TO UNSUBSCRIBE: send email to >>>>> >>>>> sakai-dev-unsubscribe@... >>>>>>>>>> >>>>>>>>>> with a subject of "unsubscribe" >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> sakai-dev mailing list >>>>>>>>> sakai-dev@... >>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>>>> >>>>>>>>> TO UNSUBSCRIBE: send email to >>>>> >>>>> sakai-dev-unsubscribe@... >>>>>>>>> >>>>>>>>> with a subject of "unsubscribe" >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> ---------------------------------------------------------------- >>>>>>>> This message was sent using IMP, the Internet Messaging Program. >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> sakai-dev mailing list >>>>>>>> sakai-dev@... >>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>>> >>>>>>>> TO UNSUBSCRIBE: send email to >>>>> >>>>> sakai-dev-unsubscribe@... >>>>>>>> >>>>>>>> with a subject of "unsubscribe" >>>>>>>> >>>>>>> _______________________________________________ >>>>>>> sakai-dev mailing list >>>>>>> sakai-dev@... >>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>> >>>>>>> TO UNSUBSCRIBE: send email to >>>>> >>>>> sakai-dev-unsubscribe@... >>>>>>> >>>>>>> with a subject of "unsubscribe" >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Aaron Zeckoski (azeckoski (at) vt.edu) >>>>>> Senior Research Engineer - CARET - University of Cambridge >>>>>> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski >>>>>> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile >>>>>> _______________________________________________ >>>>>> sakai-dev mailing list >>>>>> sakai-dev@... >>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>> >>>>>> TO UNSUBSCRIBE: send email to >>>>>> sakai-dev-unsubscribe@... >>>>>> with a subject of "unsubscribe" >>>>> >>>>> _______________________________________________ >>>>> sakai-dev mailing list >>>>> sakai-dev@... >>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>> >>>>> TO UNSUBSCRIBE: send email to >>>>> sakai-dev-unsubscribe@... >>>>> with a subject of "unsubscribe" >>>>> >>>> >>>> >>>> >>>> >>>> ---------------------------------------------------------------- >>>> This message was sent using IMP, the Internet Messaging Program. >>>> >>> >>> _______________________________________________ >>> sakai-dev mailing list >>> sakai-dev@... >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>> >>> TO UNSUBSCRIBE: send email to >>> sakai-dev-unsubscribe@... with a subject of >>> "unsubscribe" >>> >> >> >> >> -- >> Jon Gorrono >> PGP Key: 5434509D >> >> email{>+++++++++[>+++++++++++>++++++++++++>+++++++>+++++<<<<-]>+++++++.>++++.<---.>-.+++..---.-.+.>+.<++++++.<----.+.---.>+.<++++++++.>---.>>+.<<<----.-.>++.} >> http{ats.ucdavis.edu} >> _______________________________________________ >> sakai-dev mailing list >> sakai-dev@... >> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >> >> TO UNSUBSCRIBE: send email to >> sakai-dev-unsubscribe@... with a subject of >> "unsubscribe" > > -- Jon Gorrono PGP Key: 5434509D email{>+++++++++[>+++++++++++>++++++++++++>+++++++>+++++<<<<-]>+++++++.>++++.<---.>-.+++..---.-.+.>+.<++++++.<----.+.---.>+.<++++++++.>---.>>+.<<<----.-.>++.} http{ats.ucdavis.edu} Sent from Davis, CA, United States _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
|
|
Re: [Building Sakai] TinyURL Service Impl IssueWhat, that you had a good idea? Or someone would implement it :) Just
got a Jira space, I'll file it. cheers, Steve On 18/07/2009, at 2:56 AM, Jon Gorrono wrote: > I'm sort of reeling over this.... are you sure? > > unaccustomed and unprepared for this kind of response... > > :) > > > > On Wed, Jul 15, 2009 at 2:51 AM, Steve > Swinsburg<s.swinsburg@...> wrote: >> Good idea, I'll add that as an additional method where you specify >> the TTL >> before the URL is no longer resolved. >> >> >> cheers, >> Steve >> >> --- >> Steve Swinsburg >> Portal Systems Developer >> Centre for e-Science >> Lancaster University >> Lancaster >> LA1 4YT >> >> email: s.swinsburg@... >> phone: +44 (0) 1524 594870 >> >> >> >> >> >> >> >> On 15 Jul 2009, at 04:27, Jon Gorrono wrote: >> >>> I have been a little curious why timeboxing is not used to decrease >>> collisions for this service.... we use the service in a custom guest >>> access app we made to put registration urls etc in emails.... >>> obviously we want to timebox access to the urls but a ttl can be >>> xtremely effective in reducing an otherwise skewed data set to >>> something resembling white noise... as it give a small window view >>> of >>> a larger distribution >>> >>> Maybe an api call could change the ttl and return the probability of >>> collision during that time or some other silliness >>> >>> >>> >>> On Wed, Jul 8, 2009 at 2:52 PM, Steve >>> Swinsburg<s.swinsburg@...> wrote: >>>> >>>> Ok I've added an alternative method to return a longer random id >>>> which >>>> makes it less guessable. The number of permutations with this combo >>>> has 30 zeroes (on a-z and 0-9). >>>> >>>> Have some fun with the calculator! http://bit.ly/gJP1X >>>> >>>> cheers, >>>> Steve >>>> >>>> >>>> >>>> On 8 Jul 2009, at 17:25, antranig@... wrote: >>>> >>>>> The issue with collisions occurs when you are hosting URLs which >>>>> are >>>>> not protected by the container, but held by some external >>>>> resource - >>>>> for example, a signup system "click here to validate your email to >>>>> Xxxxx service". The implementation I made actually runs in two >>>>> modes, >>>>> you may request either "SECURE" or "SHORT" URLs by an option to >>>>> the >>>>> API. The short URLs start at only a couple of characters and >>>>> are indeed sequential. >>>>> Anyway, feel free to implement your own system, it is the Sakai >>>>> way >>>>> after all :) >>>>> >>>>> Cheers, >>>>> Antranig. >>>>> >>>>> Quoting Steve Swinsburg <s.swinsburg@...>: >>>>> >>>>>> The security is handled in the container anyway so you get a >>>>>> shortened >>>>>> URL and it resolves into the full URL, redirects and Sakai takes >>>>>> over. >>>>>> Even if it was sequential it wouldn't be a security concern. >>>>>> >>>>>> With a keyspace of lowercase letters and numbers(36 chars) >>>>>> there are >>>>>> about 2 billion combinations, with uppercase as well there's 50 >>>>>> billion. Info here on the theoretical limits: >>>>>> http://bit.ly/3QG2uj >>>>>> >>>>>> Using RandomStringUtils from commons-lang, I wrote a short >>>>>> method to >>>>>> iterate over a million alphanumeric combinations using letters >>>>>> and >>>>>> numbers of length 6 chars and didn't run into a single collision. >>>>>> Using this method and a transactional check when adding into the >>>>>> database is probably the way to go. >>>>>> >>>>>> We can also filter for bad words, BBC has an interesting list >>>>>> (thanks >>>>>> Seth! :) >>>>>> >>>>>> WDYT? >>>>>> >>>>>> Steve >>>>>> >>>>>> >>>>>> On 8 Jul 2009, at 12:10, Aaron Zeckoski wrote: >>>>>> >>>>>>> Is there a reason why the id generation could be more like >>>>>>> tinyURL >>>>>>> and >>>>>>> simply be sequential and thus very short (at least initially and >>>>>>> probably for most institutions it would stay small). >>>>>>> >>>>>>> Just a thought. >>>>>>> -AZ >>>>>>> >>>>>>> >>>>>>> On Wed, Jul 8, 2009 at 5:00 PM, Matthew Jones<jonespm@...> >>>>>>> wrote: >>>>>>>> >>>>>>>> You could consider using 22 character uuids as autogenerated >>>>>>>> tinies >>>>>>>> in >>>>>>>> the url. It might not be able to be called tinyurl anymore, >>>>>>>> but it >>>>>>>> would at least be short, linkable and (much more) collision >>>>>>>> free. >>>>>>>> >>>>>>>> >>>>>>>> http://stackoverflow.com/questions/772802/storing-uuid-as-base64-string >>>>>>>> >>>>>>>> I'm not convinced it would be necessary to keep these id's from >>>>>>>> being >>>>>>>> unpredictable. The current security model should prevent anyone >>>>>>>> from >>>>>>>> accessing content they don't have permission to even guessing >>>>>>>> id's. I >>>>>>>> think eventually when a UI was designed for this it might >>>>>>>> even want >>>>>>>> to >>>>>>>> allow the user to specify a custom ID, similar to how the Site >>>>>>>> Aliases >>>>>>>> (on new site creation) works, allowing a user to specify >>>>>>>> anything >>>>>>>> they >>>>>>>> want for their site alias. (Like tinyurl.com) So this would >>>>>>>> almost >>>>>>>> guarantee collisions. >>>>>>>> >>>>>>>> So I wouldn't think that the possibility of a collision on an >>>>>>>> low >>>>>>>> use >>>>>>>> feature for an indexed column (when collisions will probably be >>>>>>>> featured in eventually anyway) likely won't be a big deal. >>>>>>>> >>>>>>>> -Matthew >>>>>>>> >>>>>>>> On Tue, Jul 7, 2009 at 7:08 PM, <antranig@...> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Please take a look at the comments at the head of the >>>>>>>>> "EighteenIDGenerator" >>>>>>>>> file for some details of the maths involved in computing how >>>>>>>>> many >>>>>>>>> documents will arise before a collision: >>>>>>>>> >>>>>> >>>>> >>>>> https://source.caret.cam.ac.uk/rsf/projects/PonderUtilCore/trunk/src/uk/org/ponder/hashutil/EighteenIDGenerator.java >>>>>>>>> >>>>>>>>> I wouldn't really recommend using a key space smaller than >>>>>>>>> this >>>>>>>>> one >>>>>>>>> if you want to avoid collisions - although of course you will >>>>>>>>> reprobe >>>>>>>>> to be sure. Another issue with LCGs is that they are highly >>>>>>>>> predictable - >>>>>>>>> this is why I use a SecureRandom here: One of the >>>>>>>>> requirements of >>>>>>>>> the >>>>>>>>> TinyURLService that I made for the work at CARET was that >>>>>>>>> the ID >>>>>>>>> sequence could not be predicted - to prevent the possibility >>>>>>>>> of >>>>>>>>> malicious >>>>>>>>> "hijacking" of upcoming URL slots (this was actually a system >>>>>>>>> designed >>>>>>>>> to allocate small URLs that were plaintext signup URLs >>>>>>>>> routed to a >>>>>>>>> PHP >>>>>>>>> system, perhaps more risk than is present in a typical >>>>>>>>> application!) >>>>>>>>> >>>>>>>>> Note that a further issue you have with an LCG is running it >>>>>>>>> in a >>>>>>>>> cluster - >>>>>>>>> even if you can lock on the particular instance you have, you >>>>>>>>> can't >>>>>>>>> guarantee that some other instance in the cluster has not >>>>>>>>> stepped on >>>>>>>>> past the slot you want. I think that the approach in my >>>>>>>>> service >>>>>>>>> (a SecureRandom plus a transactional check into persistence) >>>>>>>>> is >>>>>>>>> most >>>>>>>>> likely to work reasonably well in a cluster - although it >>>>>>>>> hasn't >>>>>>>>> been >>>>>>>>> tested there :P >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Antranig. >>>>>>>>> >>>>>>>>> >>>>>>>>> Quoting Steve Swinsburg <s.swinsburg@...>: >>>>>>>>> >>>>>>>>>> Carl, >>>>>>>>>> >>>>>>>>>> Sounds good, is there an existing LCG Java implementation >>>>>>>>>> we can >>>>>>>>>> leverage or should we roll our own? >>>>>>>>>> >>>>>>>>>> cheers, >>>>>>>>>> Steve >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 6 Jul 2009, at 12:34, Carl Hall wrote: >>>>>>>>>> >>>>>>>>>>> Summary: There is an ever increasing probability that the >>>>>>>>>>> collision of >>>>>>>>>>> IDs in the TinyURL service will slow down the service >>>>>>>>>>> considerably. >>>>>>>>>>> >>>>>>>>>>> In looking at the implementation of the TinyURL service, >>>>>>>>>>> I've >>>>>>>>>>> noticed >>>>>>>>>>> something about the way URLs are generated. If I'm headed >>>>>>>>>>> in >>>>>>>>>>> the >>>>>>>>>>> wrong direction with this, please correct me. >>>>>>>>>>> >>>>>>>>>>> The very general approach is: >>>>>>>>>>> 1) generate a random ID >>>>>>>>>>> 2) check that the ID doesn't already exist >>>>>>>>>>> 3) if exists, go to 1. if not, record the ID and release >>>>>>>>>>> it to >>>>>>>>>>> the >>>>>>>>>>> caller. >>>>>>>>>>> >>>>>>>>>>> The following code for generating a URL: >>>>>>>>>>> >>>>>>>>>>> Random generator = new Random(); >>>>>>>>>>> int seed = Math.abs(generator.nextInt()); >>>>>>>>>>> >>>>>>>>>>> has an issue with collision which is addressed by checking >>>>>>>>>>> for >>>>>>>>>>> the >>>>>>>>>>> existence of that ID before handing it out but the >>>>>>>>>>> algorithm run >>>>>>>>>>> time >>>>>>>>>>> will continue to grow because of the growing chance of >>>>>>>>>>> collision >>>>>>>>>>> over >>>>>>>>>>> time. >>>>>>>>>>> >>>>>>>>>>> Explanation of concern: >>>>>>>>>>> The first generated ID will have a 0/n chance of collision >>>>>>>>>>> where >>>>>>>>>>> n is >>>>>>>>>>> the total number of possible IDs. The next ID has a 1/n >>>>>>>>>>> chance, >>>>>>>>>>> the >>>>>>>>>>> next 2/n, etc up to the last having a chance of collision >>>>>>>>>>> being >>>>>>>>>>> n-1/n. >>>>>>>>>>> Considering n needs to be sufficiently large to have a low >>>>>>>>>>> to >>>>>>>>>>> zero >>>>>>>>>>> rate of duplication over time, this check will take more >>>>>>>>>>> time >>>>>>>>>>> because >>>>>>>>>>> of the number of collisions over time. Assuming that the ID >>>>>>>>>>> column in >>>>>>>>>>> the database is indexed, the lookup time is minimized but >>>>>>>>>>> there's >>>>>>>>>>> still a call to the database lookup on each generated ID. >>>>>>>>>>> As >>>>>>>>>>> the >>>>>>>>>>> number of IDs generated reaches the max possible, there are >>>>>>>>>>> possibly >>>>>>>>>>> n-1 lookups before a valid ID is found. >>>>>>>>>>> >>>>>>>>>>> The current code recreates the Random each time which stops >>>>>>>>>>> nextInt() >>>>>>>>>>> from knowing history (ie. new random seed on each >>>>>>>>>>> generation). >>>>>>>>>>> Java >>>>>>>>>>> implements a linear congruential generator to generate >>>>>>>>>>> random >>>>>>>>>>> numbers >>>>>>>>>>> which requires knowledge of the previously generated ID to >>>>>>>>>>> be >>>>>>>>>>> effective. Math.random() creates the random seed once and >>>>>>>>>>> reuses it >>>>>>>>>>> for the life of the JVM process which is a better approach. >>>>>>>>>>> >>>>>>>>>>> I would suggest creating an LCG in the TinyURL service so >>>>>>>>>>> that >>>>>>>>>>> the >>>>>>>>>>> proper range can be easily managed (better performance due >>>>>>>>>>> to a >>>>>>>>>>> narrower scope than Math.random()) and the seed can be >>>>>>>>>>> restored >>>>>>>>>>> after >>>>>>>>>>> a restart. I have created a prototype in Python based on >>>>>>>>>>> the >>>>>>>>>>> rules >>>>>>>>>>> laid out in the Wikipedia article on LCG's [1] and using a >>>>>>>>>>> sufficiently large prime number. This still carries a >>>>>>>>>>> concern >>>>>>>>>>> of >>>>>>>>>>> generating IDs across a cluster but is relatively easy to >>>>>>>>>>> address. >>>>>>>>>>> >>>>>>>>>>> WDYT? >>>>>>>>>>> >>>>>>>>>>> [1] http://en.wikipedia.org/wiki/Linear_congruential_generator >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> sakai-dev mailing list >>>>>>>>>>> sakai-dev@... >>>>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>>>>>> >>>>>>>>>>> TO UNSUBSCRIBE: send email to >>>>>> >>>>>> sakai-dev-unsubscribe@... >>>>>>>>>>> >>>>>>>>>>> with a subject of "unsubscribe" >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> sakai-dev mailing list >>>>>>>>>> sakai-dev@... >>>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>>>>> >>>>>>>>>> TO UNSUBSCRIBE: send email to >>>>>> >>>>>> sakai-dev-unsubscribe@... >>>>>>>>>> >>>>>>>>>> with a subject of "unsubscribe" >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> ---------------------------------------------------------------- >>>>>>>>> This message was sent using IMP, the Internet Messaging >>>>>>>>> Program. >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> sakai-dev mailing list >>>>>>>>> sakai-dev@... >>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>>>> >>>>>>>>> TO UNSUBSCRIBE: send email to >>>>>> >>>>>> sakai-dev-unsubscribe@... >>>>>>>>> >>>>>>>>> with a subject of "unsubscribe" >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> sakai-dev mailing list >>>>>>>> sakai-dev@... >>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>>> >>>>>>>> TO UNSUBSCRIBE: send email to >>>>>> >>>>>> sakai-dev-unsubscribe@... >>>>>>>> >>>>>>>> with a subject of "unsubscribe" >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Aaron Zeckoski (azeckoski (at) vt.edu) >>>>>>> Senior Research Engineer - CARET - University of Cambridge >>>>>>> https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski >>>>>>> http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile >>>>>>> _______________________________________________ >>>>>>> sakai-dev mailing list >>>>>>> sakai-dev@... >>>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>>> >>>>>>> TO UNSUBSCRIBE: send email to >>>>>>> sakai-dev-unsubscribe@... >>>>>>> with a subject of "unsubscribe" >>>>>> >>>>>> _______________________________________________ >>>>>> sakai-dev mailing list >>>>>> sakai-dev@... >>>>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>>>> >>>>>> TO UNSUBSCRIBE: send email to >>>>>> sakai-dev-unsubscribe@... >>>>>> with a subject of "unsubscribe" >>>>>> >>>>> >>>>> >>>>> >>>>> >>>>> ---------------------------------------------------------------- >>>>> This message was sent using IMP, the Internet Messaging Program. >>>>> >>>> >>>> _______________________________________________ >>>> sakai-dev mailing list >>>> sakai-dev@... >>>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>>> >>>> TO UNSUBSCRIBE: send email to >>>> sakai-dev-unsubscribe@... with a subject of >>>> "unsubscribe" >>>> >>> >>> >>> >>> -- >>> Jon Gorrono >>> PGP Key: 5434509D >>> >>> email{>+++++++++[>+++++++++++>++++++++++++>+++++++>+++++<<<<-]>++++ >>> +++.>++++.<---.>-.+++..---.-.+.>+.<++++++.<----.+.---.>+.<+++++++ >>> +.>---.>>+.<<<----.-.>++.} >>> http{ats.ucdavis.edu} >>> _______________________________________________ >>> sakai-dev mailing list >>> sakai-dev@... >>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev >>> >>> TO UNSUBSCRIBE: send email to >>> sakai-dev-unsubscribe@... with a subject of >>> "unsubscribe" >> >> > > > > -- > Jon Gorrono > PGP Key: 5434509D > email{>+++++++++[>+++++++++++>++++++++++++>+++++++>+++++<<<<-]>++++++ > +.>++++.<---.>-.+++..---.-.+.>+.<++++++.<----.+.---.>+.<+++++++ > +.>---.>>+.<<<----.-.>++.} > http{ats.ucdavis.edu} > > Sent from Davis, CA, United States > _______________________________________________ > sakai-dev mailing list > sakai-dev@... > http://collab.sakaiproject.org/mailman/listinfo/sakai-dev > > TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... > with a subject of "unsubscribe" _______________________________________________ sakai-dev mailing list sakai-dev@... http://collab.sakaiproject.org/mailman/listinfo/sakai-dev TO UNSUBSCRIBE: send email to sakai-dev-unsubscribe@... with a subject of "unsubscribe" |
| Free embeddable forum powered by Nabble | Forum Help |